MyArxiv
Computer Vision and Pattern Recognition 90
☆ Odd-One-Out: Anomaly Detection by Comparing with Neighbors
This paper introduces a novel anomaly detection (AD) problem that focuses on identifying `odd-looking' objects relative to the other instances within a scene. Unlike the traditional AD benchmarks, in our setting, anomalies in this context are scene-specific, defined by the regular instances that make up the majority. Since object instances are often partly visible from a single viewpoint, our setting provides multiple views of each scene as input. To provide a testbed for future research in this task, we introduce two benchmarks, ToysAD-8K and PartsAD-15K. We propose a novel method that generates 3D object-centric representations for each instance and detects the anomalous ones through a cross-examination between the instances. We rigorously analyze our method quantitatively and qualitatively in the presented benchmarks.
comment: Codes & Dataset at https://github.com/VICO-UoE/OddOneOutAD
☆ Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs
Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio in a variety of understanding and generation tasks. However, current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code. To address this problem, we propose Web2Code, a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning and an evaluation framework for the webpage understanding and HTML code translation abilities of MLLMs. For dataset construction, we leverage pretrained LLMs to enhance existing webpage-to-code datasets as well as generate a diverse pool of new webpages rendered into images. Specifically, the inputs are webpage images and instructions, while the responses are the webpage's HTML code. We further include diverse natural language QA pairs about the webpage content in the responses to enable a more comprehensive understanding of the web content. To evaluate model performance in these tasks, we develop an evaluation framework for testing MLLMs' abilities in webpage understanding and web-to-code generation. Extensive experiments show that our proposed dataset is beneficial not only to our proposed tasks but also in the general visual domain, while previous datasets result in worse performance. We hope our work will contribute to the development of general MLLMs suitable for web-based content generation and task automation. Our data and code will be available at https://github.com/MBZUAI-LLM/web2code.
comment: Website at https://mbzuai-llm.github.io/webpage2code/
☆ LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Large Language Models (LLMs) equipped with extensive world knowledge and strong reasoning skills can tackle diverse tasks across domains, often by posing them as conversation-style instruction-response pairs. In this paper, we propose LLaRA: Large Language and Robotics Assistant, a framework which formulates robot action policy as conversations, and provides improved responses when trained with auxiliary data that complements policy learning. LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity to process state information as visual-textual prompts and generate optimal policy decisions in text. To train such action policy VLMs, we first introduce an automated pipeline to generate diverse high-quality robotics instruction data from existing behavior cloning data. A VLM finetuned with the resulting collection of datasets based on a conversation-style formulation tailored for robotics tasks, can generate meaningful robot action policy decisions. Our experiments across multiple simulated and real-world environments demonstrate the state-of-the-art performance of the proposed LLaRA framework. The code, datasets, and pretrained models are available at https://github.com/LostXine/LLaRA.
☆ LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression
While significant advancements have been made in compressed representations for text embeddings in large language models (LLMs), the compression of visual tokens in large multi-modal models (LMMs) has remained a largely overlooked area. In this work, we present the study on the analysis of redundancy concerning visual tokens and efficient training within these models. Our initial experiments show that eliminating up to 70% of visual tokens at the testing stage by simply average pooling only leads to a minimal 3% reduction in visual question answering accuracy on the GQA benchmark, indicating significant redundancy in visual context. Addressing this, we introduce Visual Context Compressor, which reduces the number of visual tokens during training to enhance training efficiency without sacrificing performance. To minimize information loss caused by the compression on visual tokens while maintaining training efficiency, we develop LLaVolta as a lite training scheme. LLaVolta incorporates stage-wise visual context compression to progressively compress the visual tokens from heavily to lightly, and finally no compression at the end of training, yielding no loss of information when testing. Extensive experiments demonstrate that our approach enhances the performance of MLLMs in both image-language and video-language understanding, while also significantly cutting training costs. Code is available at https://github.com/Beckschen/LLaVolta
comment: Code is available at https://github.com/Beckschen/LLaVolta
☆ Auto Cherry-Picker: Learning from High-quality Generative Data Driven by Language
Diffusion-based models have shown great potential in generating high-quality images with various layouts, which can benefit downstream perception tasks. However, a fully automatic layout generation driven only by language and a suitable metric for measuring multiple generated instances has not been well explored. In this work, we present Auto Cherry-Picker (ACP), a novel framework that generates high-quality multi-modal training examples to augment perception and multi-modal training. Starting with a simple list of natural language concepts, we prompt large language models (LLMs) to generate a detailed description and design reasonable layouts. Next, we use an off-the-shelf text-to-image model to generate multiple images. Then, the generated data are refined using a comprehensively designed metric to ensure quality. In particular, we present a new metric, Composite Layout and Image Score (CLIS), to evaluate the generated images fairly. Our synthetic high-quality examples boost performance in various scenarios by customizing the initial concept list, especially in addressing challenges associated with long-tailed distribution and imbalanced datasets. Experiment results on downstream tasks demonstrate that Auto Cherry-Picker can significantly improve the performance of existing models. In addition, we have thoroughly investigated the correlation between CLIS and performance gains in downstream tasks, and we find that a better CLIS score results in better performance. This finding shows the potential for evaluation metrics as the role for various visual perception and MLLM tasks. Code will be available.
comment: 19 pages, 7 figures
☆ PoliFormer: Scaling On-Policy RL with Transformers Results in Masterful Navigators
We present PoliFormer (Policy Transformer), an RGB-only indoor navigation agent trained end-to-end with reinforcement learning at scale that generalizes to the real-world without adaptation despite being trained purely in simulation. PoliFormer uses a foundational vision transformer encoder with a causal transformer decoder enabling long-term memory and reasoning. It is trained for hundreds of millions of interactions across diverse environments, leveraging parallelized, multi-machine rollouts for efficient training with high throughput. PoliFormer is a masterful navigator, producing state-of-the-art results across two distinct embodiments, the LoCoBot and Stretch RE-1 robots, and four navigation benchmarks. It breaks through the plateaus of previous work, achieving an unprecedented 85.5% success rate in object goal navigation on the CHORES-S benchmark, a 28.5% absolute improvement. PoliFormer can also be trivially extended to a variety of downstream applications such as object tracking, multi-object navigation, and open-vocabulary navigation with no finetuning.
☆ Segment Anything without Supervision
The Segmentation Anything Model (SAM) requires labor-intensive data labeling. We present Unsupervised SAM (UnSAM) for promptable and automatic whole-image segmentation that does not require human annotations. UnSAM utilizes a divide-and-conquer strategy to "discover" the hierarchical structure of visual scenes. We first leverage top-down clustering methods to partition an unlabeled image into instance/semantic level segments. For all pixels within a segment, a bottom-up clustering method is employed to iteratively merge them into larger groups, thereby forming a hierarchical structure. These unsupervised multi-granular masks are then utilized to supervise model training. Evaluated across seven popular datasets, UnSAM achieves competitive results with the supervised counterpart SAM, and surpasses the previous state-of-the-art in unsupervised segmentation by 11% in terms of AR. Moreover, we show that supervised SAM can also benefit from our self-supervised labels. By integrating our unsupervised pseudo masks into SA-1B's ground-truth masks and training UnSAM with only 1% of SA-1B, a lightly semi-supervised UnSAM can often segment entities overlooked by supervised SAM, exceeding SAM's AR by over 6.7% and AP by 3.9% on SA-1B.
comment: Code: https://github.com/frank-xwang/UnSAM
☆ GM-DF: Generalized Multi-Scenario Deepfake Detection
Existing face forgery detection usually follows the paradigm of training models in a single domain, which leads to limited generalization capacity when unseen scenarios and unknown attacks occur. In this paper, we elaborately investigate the generalization capacity of deepfake detection models when jointly trained on multiple face forgery detection datasets. We first find a rapid degradation of detection accuracy when models are directly trained on combined datasets due to the discrepancy across collection scenarios and generation methods. To address the above issue, a Generalized Multi-Scenario Deepfake Detection framework (GM-DF) is proposed to serve multiple real-world scenarios by a unified model. First, we propose a hybrid expert modeling approach for domain-specific real/forgery feature extraction. Besides, as for the commonality representation, we use CLIP to extract the common features for better aligning visual and textual features across domains. Meanwhile, we introduce a masked image reconstruction mechanism to force models to capture rich forged details. Finally, we supervise the models via a domain-aware meta-learning strategy to further enhance their generalization capacities. Specifically, we design a novel domain alignment loss to strongly align the distributions of the meta-test domains and meta-train domains. Thus, the updated models are able to represent both specific and common real/forgery features across multiple datasets. In consideration of the lack of study of multi-dataset training, we establish a new benchmark leveraging multi-source data to fairly evaluate the models' generalization capacity on unseen scenarios. Both qualitative and quantitative experiments on five datasets conducted on traditional protocols as well as the proposed benchmark demonstrate the effectiveness of our approach.
☆ HouseCrafter: Lifting Floorplans to 3D Scenes with 2D Diffusion Model
We introduce HouseCrafter, a novel approach that can lift a floorplan into a complete large 3D indoor scene (e.g., a house). Our key insight is to adapt a 2D diffusion model, which is trained on web-scale images, to generate consistent multi-view color (RGB) and depth (D) images across different locations of the scene. Specifically, the RGB-D images are generated autoregressively in a batch-wise manner along sampled locations based on the floorplan, where previously generated images are used as condition to the diffusion model to produce images at nearby locations. The global floorplan and attention design in the diffusion model ensures the consistency of the generated images, from which a 3D scene can be reconstructed. Through extensive evaluation on the 3D-Front dataset, we demonstrate that HouseCraft can generate high-quality house-scale 3D scenes. Ablation studies also validate the effectiveness of different design choices. We will release our code and model weights. Project page: https://neu-vi.github.io/houseCrafter/
☆ EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model
Segment Anything Model (SAM) has attracted widespread attention for its superior interactive segmentation capabilities with visual prompts while lacking further exploration of text prompts. In this paper, we empirically investigate what text prompt encoders (e.g., CLIP or LLM) are good for adapting SAM for referring expression segmentation and introduce the Early Vision-language Fusion-based SAM (EVF-SAM). EVF-SAM is a simple yet effective referring segmentation method which exploits multimodal prompts (i.e., image and text) and comprises a pre-trained vision-language model to generate referring prompts and a SAM model for segmentation. Surprisingly, we observe that: (1) multimodal prompts and (2) vision-language models with early fusion (e.g., BEIT-3) are beneficial for prompting SAM for accurate referring segmentation. Our experiments show that the proposed EVF-SAM based on BEIT-3 can obtain state-of-the-art performance on RefCOCO/+/g for referring expression segmentation and demonstrate the superiority of prompting SAM with early vision-language fusion. In addition, the proposed EVF-SAM with 1.32B parameters achieves remarkably higher performance while reducing nearly 82% of parameters compared to previous SAM methods based on large multimodal models.
comment: Preprint
☆ ASSR-NeRF: Arbitrary-Scale Super-Resolution on Voxel Grid for High-Quality Radiance Fields Reconstruction
NeRF-based methods reconstruct 3D scenes by building a radiance field with implicit or explicit representations. While NeRF-based methods can perform novel view synthesis (NVS) at arbitrary scale, the performance in high-resolution novel view synthesis (HRNVS) with low-resolution (LR) optimization often results in oversmoothing. On the other hand, single-image super-resolution (SR) aims to enhance LR images to HR counterparts but lacks multi-view consistency. To address these challenges, we propose Arbitrary-Scale Super-Resolution NeRF (ASSR-NeRF), a novel framework for super-resolution novel view synthesis (SRNVS). We propose an attention-based VoxelGridSR model to directly perform 3D super-resolution (SR) on the optimized volume. Our model is trained on diverse scenes to ensure generalizability. For unseen scenes trained with LR views, we then can directly apply our VoxelGridSR to further refine the volume and achieve multi-view consistent SR. We demonstrate quantitative and qualitatively that the proposed method achieves significant performance in SRNVS.
☆ SpotlessSplats: Ignoring Distractors in 3D Gaussian Splatting
3D Gaussian Splatting (3DGS) is a promising technique for 3D reconstruction, offering efficient training and rendering speeds, making it suitable for real-time applications.However, current methods require highly controlled environments (no moving people or wind-blown elements, and consistent lighting) to meet the inter-view consistency assumption of 3DGS. This makes reconstruction of real-world captures problematic. We present SpotlessSplats, an approach that leverages pre-trained and general-purpose features coupled with robust optimization to effectively ignore transient distractors. Our method achieves state-of-the-art reconstruction quality both visually and quantitatively, on casual captures.
☆ HAITCH: A Framework for Distortion and Motion Correction in Fetal Multi-Shell Diffusion-Weighted MRI
Diffusion magnetic resonance imaging (dMRI) is pivotal for probing the microstructure of the rapidly-developing fetal brain. However, fetal motion during scans and its interaction with magnetic field inhomogeneities result in artifacts and data scattering across spatial and angular domains. The effects of those artifacts are more pronounced in high-angular resolution fetal dMRI, where signal-to-noise ratio is very low. Those effects lead to biased estimates and compromise the consistency and reliability of dMRI analysis. This work presents HAITCH, the first and the only publicly available tool to correct and reconstruct multi-shell high-angular resolution fetal dMRI data. HAITCH offers several technical advances that include a blip-reversed dual-echo acquisition for dynamic distortion correction, advanced motion correction for model-free and robust reconstruction, optimized multi-shell design for enhanced information capture and increased tolerance to motion, and outlier detection for improved reconstruction fidelity. The framework is open-source, flexible, and can be used to process any type of fetal dMRI data including single-echo or single-shell acquisitions, but is most effective when used with multi-shell multi-echo fetal dMRI data that cannot be processed with any of the existing tools. Validation experiments on real fetal dMRI scans demonstrate significant improvements and accurate correction across diverse fetal ages and motion levels. HAITCH successfully removes artifacts and reconstructs high-fidelity fetal dMRI data suitable for advanced diffusion modeling, including fiber orientation distribution function estimation. These advancements pave the way for more reliable analysis of the fetal brain microstructure and tractography under challenging imaging conditions.
☆ eMoE-Tracker: Environmental MoE-based Transformer for Robust Event-guided Object Tracking
The unique complementarity of frame-based and event cameras for high frame rate object tracking has recently inspired some research attempts to develop multi-modal fusion approaches. However, these methods directly fuse both modalities and thus ignore the environmental attributes, e.g., motion blur, illumination variance, occlusion, scale variation, etc. Meanwhile, no interaction between search and template features makes distinguishing target objects and backgrounds difficult. As a result, performance degradation is induced especially in challenging conditions. This paper proposes a novel and effective Transformer-based event-guided tracking framework, called eMoE-Tracker, which achieves new SOTA performance under various conditions. Our key idea is to disentangle the environment into several learnable attributes to dynamically learn the attribute-specific features for better interaction and discriminability between the target information and background. To achieve the goal, we first propose an environmental Mix-of-Experts (eMoE) module that is built upon the environmental Attributes Disentanglement to learn attribute-specific features and environmental Attributes Gating to assemble the attribute-specific features by the learnable attribute scores dynamically. The eMoE module is a subtle router that fine-tunes the transformer backbone more efficiently. We then introduce a contrastive relation modeling (CRM) module to improve interaction and discriminability between the target information and background. Extensive experiments on diverse event-based benchmark datasets showcase the superior performance of our eMoE-Tracker compared to the prior arts.
comment: RGB-event single object tracking
☆ Malaria Cell Detection Using Deep Neural Networks
Malaria remains one of the most pressing public health concerns globally, causing significant morbidity and mortality, especially in sub-Saharan Africa. Rapid and accurate diagnosis is crucial for effective treatment and disease management. Traditional diagnostic methods, such as microscopic examination of blood smears, are labor-intensive and require significant expertise, which may not be readily available in resource-limited settings. This project aims to automate the detection of malaria-infected cells using a deep learning approach. We employed a convolutional neural network (CNN) based on the ResNet50 architecture, leveraging transfer learning to enhance performance. The Malaria Cell Images Dataset from Kaggle, containing 27,558 images categorized into infected and uninfected cells, was used for training and evaluation. Our model demonstrated high accuracy, precision, and recall, indicating its potential as a reliable tool for assisting in malaria diagnosis. Additionally, a web application was developed using Streamlit to allow users to upload cell images and receive predictions about malaria infection, making the technology accessible and user-friendly. This paper provides a comprehensive overview of the methodology, experiments, and results, highlighting the effectiveness of deep learning in medical image analysis.
☆ Wavelets Are All You Need for Autoregressive Image Generation
In this paper, we take a new approach to autoregressive image generation that is based on two main ingredients. The first is wavelet image coding, which allows to tokenize the visual details of an image from coarse to fine details by ordering the information starting with the most significant bits of the most significant wavelet coefficients. The second is a variant of a language transformer whose architecture is re-designed and optimized for token sequences in this 'wavelet language'. The transformer learns the significant statistical correlations within a token sequence, which are the manifestations of well-known correlations between the wavelet subbands at various resolutions. We show experimental results with conditioning on the generation process.
comment: 16 pages, 10 figures
☆ STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical
Large Vision-Language Models (LVLMs) have shown significant potential in assisting medical diagnosis by leveraging extensive biomedical datasets. However, the advancement of medical image understanding and reasoning critically depends on building high-quality visual instruction data, which is costly and labor-intensive to obtain, particularly in the medical domain. To mitigate this data-starving issue, we introduce Self-Training Large Language and Vision Assistant for Medical (STLLaVA-Med). The proposed method is designed to train a policy model (an LVLM) capable of auto-generating medical visual instruction data to improve data efficiency, guided through Direct Preference Optimization (DPO). Specifically, a more powerful and larger LVLM (e.g., GPT-4o) is involved as a biomedical expert to oversee the DPO fine-tuning process on the auto-generated data, encouraging the policy model to align efficiently with human preferences. We validate the efficacy and data efficiency of STLLaVA-Med across three major medical Visual Question Answering (VQA) benchmarks, demonstrating competitive zero-shot performance with the utilization of only 9% of the medical data.
comment: 10 pages, 6 figures
☆ Impact of Initialization on Intra-subject Pediatric Brain MR Image Registration: A Comparative Analysis between SyN ANTs and Deep Learning-Based Approaches
This study evaluates the performance of conventional SyN ANTs and learning-based registration methods in the context of pediatric neuroimaging, specifically focusing on intrasubject deformable registration. The comparison involves three approaches: without (NR), with rigid (RR), and with rigid and affine (RAR) initializations. In addition to initialization, performances are evaluated in terms of accuracy, speed, and the impact of age intervals and sex per pair. Data consists of the publicly available MRI scans from the Calgary Preschool dataset, which includes 63 children aged 2-7 years, allowing for 431 registration pairs. We implemented the unsupervised DL framework with a U-Net architecture using DeepReg and it was 5-fold cross-validated. Evaluation includes Dice scores for tissue segmentation from 18 smaller regions obtained by SynthSeg, analysis of log Jacobian determinants, and registration pro-rated training and inference times. Learning-based approaches, with or without linear initializations, exhibit slight superiority over SyN ANTs in terms of Dice scores. Indeed, DL-based implementations with RR and RAR initializations significantly outperform SyN ANTs. Both SyN ANTs and DL-based registration involve parameter optimization, but the choice between these methods depends on the scale of registration: network-based for broader coverage or SyN ANTs for specific structures. Both methods face challenges with larger age intervals due to greater growth changes. The main takeaway is that while DL-based methods show promise with faster and more accurate registrations, SyN ANTs remains robust and generalizable without the need for extensive training, highlighting the importance of method selection based on specific registration needs in the pediatric context. Our code is available at https://github.com/neuropoly/pediatric-DL-registration
comment: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2024:013
☆ GRACE: Graph-Regularized Attentive Convolutional Entanglement with Laplacian Smoothing for Robust DeepFake Video Detection
As DeepFake video manipulation techniques escalate, posing profound threats, the urgent need to develop efficient detection strategies is underscored. However, one particular issue lies with facial images being mis-detected, often originating from degraded videos or adversarial attacks, leading to unexpected temporal artifacts that can undermine the efficacy of DeepFake video detection techniques. This paper introduces a novel method for robust DeepFake video detection, harnessing the power of the proposed Graph-Regularized Attentive Convolutional Entanglement (GRACE) based on the graph convolutional network with graph Laplacian to address the aforementioned challenges. First, conventional Convolution Neural Networks are deployed to perform spatiotemporal features for the entire video. Then, the spatial and temporal features are mutually entangled by constructing a graph with sparse constraint, enforcing essential features of valid face images in the noisy face sequences remaining, thus augmenting stability and performance for DeepFake video detection. Furthermore, the Graph Laplacian prior is proposed in the graph convolutional network to remove the noise pattern in the feature space to further improve the performance. Comprehensive experiments are conducted to illustrate that our proposed method delivers state-of-the-art performance in DeepFake video detection under noisy face sequences. The source code is available at https://github.com/ming053l/GRACE.
comment: Submitted to TPAMI 2024
☆ Parallax-tolerant Image Stitching via Segmentation-guided Multi-homography Warping
Large parallax between images is an intractable issue in image stitching. Various warping-based methods are proposed to address it, yet the results are unsatisfactory. In this paper, we propose a novel image stitching method using multi-homography warping guided by image segmentation. Specifically, we leverage the Segment Anything Model to segment the target image into numerous contents and partition the feature points into multiple subsets via the energy-based multi-homography fitting algorithm. The multiple subsets of feature points are used to calculate the corresponding multiple homographies. For each segmented content in the overlapping region, we select its best-fitting homography with the lowest photometric error. For each segmented content in the non-overlapping region, we calculate a weighted combination of the linearized homographies. Finally, the target image is warped via the best-fitting homographies to align with the reference image, and the final panorama is generated via linear blending. Comprehensive experimental results on the public datasets demonstrate that our method provides the best alignment accuracy by a large margin, compared with the state-of-the-art methods. The source code is available at https://github.com/tlliao/multi-homo-warp.
comment: 11 pages, 9 figures
☆ Solving Token Gradient Conflict in Mixture-of-Experts for Large Vision-Language Model
The Mixture-of-Experts (MoE) has gained increasing attention in the study of Large Vision-Language Models (LVLMs). It uses a sparse model to replace the dense model, achieving comparable performance while activating fewer parameters during inference, thus significantly reducing the inference cost. Existing MoE methods in LVLMs encourage different experts to handle different tokens, and thus they employ a router to predict the routing for each token. However, the predictions are based solely on sample features and do not truly reveal the optimization direction of tokens. This can lead to severe optimization conflicts between different tokens within an expert. To address this problem, this paper proposes a novel method based on token-level gradient analysis. Specifically, we first use token-level gradients to identify conflicting tokens in experts. Then, we add a specialized loss tailored to eliminate conflicts among tokens within each expert. Our method can serve as a plug-in for diverse Large Vision-Language Models, and extensive experimental results demonstrate the effectiveness of our method. The code will be publicly available at https://github.com/longrongyang/STGC.
☆ On the Value of PHH3 for Mitotic Figure Detection on H&E-stained Images
The count of mitotic figures (MFs) observed in hematoxylin and eosin (H&E)-stained slides is an important prognostic marker as it is a measure for tumor cell proliferation. However, the identification of MFs has a known low inter-rater agreement. Deep learning algorithms can standardize this task, but they require large amounts of annotated data for training and validation. Furthermore, label noise introduced during the annotation process may impede the algorithm's performance. Unlike H&E, the mitosis-specific antibody phospho-histone H3 (PHH3) specifically highlights MFs. Counting MFs on slides stained against PHH3 leads to higher agreement among raters and has therefore recently been used as a ground truth for the annotation of MFs in H&E. However, as PHH3 facilitates the recognition of cells indistinguishable from H&E stain alone, the use of this ground truth could potentially introduce noise into the H&E-related dataset, impacting model performance. This study analyzes the impact of PHH3-assisted MF annotation on inter-rater reliability and object level agreement through an extensive multi-rater experiment. We found that the annotators' object-level agreement increased when using PHH3-assisted labeling. Subsequently, MF detectors were evaluated on the resulting datasets to investigate the influence of PHH3-assisted labeling on the models' performance. Additionally, a novel dual-stain MF detector was developed to investigate the interpretation-shift of PHH3-assisted labels used in H&E, which clearly outperformed single-stain detectors. However, the PHH3-assisted labels did not have a positive effect on solely H&E-based models. The high performance of our dual-input detector reveals an information mismatch between the H&E and PHH3-stained images as the cause of this effect.
comment: 10 pages, 5 figures, 1 Table
☆ InfiniBench: A Comprehensive Benchmark for Large Multimodal Models in Very Long Video Understanding
Understanding long videos, ranging from tens of minutes to several hours, presents unique challenges in video comprehension. Despite the increasing importance of long-form video content, existing benchmarks primarily focus on shorter clips. To address this gap, we introduce InfiniBench a comprehensive benchmark for very long video understanding which presents 1)The longest video duration, averaging 76.34 minutes; 2) The largest number of question-answer pairs, 108.2K; 3) Diversity in questions that examine nine different skills and include both multiple-choice questions and open-ended questions; 4) Humancentric, as the video sources come from movies and daily TV shows, with specific human-level question designs such as Movie Spoiler Questions that require critical thinking and comprehensive understanding. Using InfiniBench, we comprehensively evaluate existing Large MultiModality Models (LMMs) on each skill, including the commercial model Gemini 1.5 Flash and the open-source models. The evaluation shows significant challenges in our benchmark.Our results show that the best AI models such Gemini struggles to perform well with 42.72% average accuracy and 2.71 out of 5 average score. We hope this benchmark will stimulate the LMMs community towards long video and human-level understanding. Our benchmark can be accessed at https://vision-cair.github.io/InfiniBench/
comment: 16 page ,17 figures
☆ FootBots: A Transformer-based Architecture for Motion Prediction in Soccer ICIP 2024
Motion prediction in soccer involves capturing complex dynamics from player and ball interactions. We present FootBots, an encoder-decoder transformer-based architecture addressing motion prediction and conditioned motion prediction through equivariance properties. FootBots captures temporal and social dynamics using set attention blocks and multi-attention block decoder. Our evaluation utilizes two datasets: a real soccer dataset and a tailored synthetic one. Insights from the synthetic dataset highlight the effectiveness of FootBots' social attention mechanism and the significance of conditioned motion prediction. Empirical results on real soccer data demonstrate that FootBots outperforms baselines in motion prediction and excels in conditioned tasks, such as predicting the players based on the ball position, predicting the offensive (defensive) team based on the ball and the defensive (offensive) team, and predicting the ball position based on all players. Our evaluation connects quantitative and qualitative findings. https://youtu.be/9kaEkfzG3L8
comment: Published as a conference paper at IEEE ICIP 2024
☆ StreamMOTP: Streaming and Unified Framework for Joint 3D Multi-Object Tracking and Trajectory Prediction
3D multi-object tracking and trajectory prediction are two crucial modules in autonomous driving systems. Generally, the two tasks are handled separately in traditional paradigms and a few methods have started to explore modeling these two tasks in a joint manner recently. However, these approaches suffer from the limitations of single-frame training and inconsistent coordinate representations between tracking and prediction tasks. In this paper, we propose a streaming and unified framework for joint 3D Multi-Object Tracking and trajectory Prediction (StreamMOTP) to address the above challenges. Firstly, we construct the model in a streaming manner and exploit a memory bank to preserve and leverage the long-term latent features for tracked objects more effectively. Secondly, a relative spatio-temporal positional encoding strategy is introduced to bridge the gap of coordinate representations between the two tasks and maintain the pose-invariance for trajectory prediction. Thirdly, we further improve the quality and consistency of predicted trajectories with a dual-stream predictor. We conduct extensive experiments on popular nuSences dataset and the experimental results demonstrate the effectiveness and superiority of StreamMOTP, which outperforms previous methods significantly on both tasks. Furthermore, we also prove that the proposed framework has great potential and advantages in actual applications of autonomous driving.
☆ LightStereo: Channel Boost Is All Your Need for Efficient 2D Cost Aggregation
We present LightStereo, a cutting-edge stereo-matching network crafted to accelerate the matching process. Departing from conventional methodologies that rely on aggregating computationally intensive 4D costs, LightStereo adopts the 3D cost volume as a lightweight alternative. While similar approaches have been explored previously, our breakthrough lies in enhancing performance through a dedicated focus on the channel dimension of the 3D cost volume, where the distribution of matching costs is encapsulated. Our exhaustive exploration has yielded plenty of strategies to amplify the capacity of the pivotal dimension, ensuring both precision and efficiency. We compare the proposed LightStereo with existing state-of-the-art methods across various benchmarks, which demonstrate its superior performance in speed, accuracy, and resource utilization. LightStereo achieves a competitive EPE metric in the SceneFlow datasets while demanding a minimum of only 22 GFLOPs, with an inference time of just 17 ms. Our comprehensive analysis reveals the effect of 2D cost aggregation for stereo matching, paving the way for real-world applications of efficient stereo systems. Code will be available at \url{https://github.com/XiandaGuo/OpenStereo}.
comment: Code will be available at \url{https://github.com/XiandaGuo/OpenStereo}
☆ Emotion Loss Attacking: Adversarial Attack Perception for Skeleton based on Multi-dimensional Features
Adversarial attack on skeletal motion is a hot topic. However, existing researches only consider part of dynamic features when measuring distance between skeleton graph sequences, which results in poor imperceptibility. To this end, we propose a novel adversarial attack method to attack action recognizers for skeletal motions. Firstly, our method systematically proposes a dynamic distance function to measure the difference between skeletal motions. Meanwhile, we innovatively introduce emotional features for complementary information. In addition, we use Alternating Direction Method of Multipliers(ADMM) to solve the constrained optimization problem, which generates adversarial samples with better imperceptibility to deceive the classifiers. Experiments show that our method is effective on multiple action classifiers and datasets. When the perturbation magnitude measured by l norms is the same, the dynamic perturbations generated by our method are much lower than that of other methods. What's more, we are the first to prove the effectiveness of emotional features, and provide a new idea for measuring the distance between skeletal motions.
☆ Extract More from Less: Efficient Fine-Grained Visual Recognition in Low-Data Regimes
The emerging task of fine-grained image classification in low-data regimes assumes the presence of low inter-class variance and large intra-class variation along with a highly limited amount of training samples per class. However, traditional ways of separately dealing with fine-grained categorisation and extremely scarce data may be inefficient under both these harsh conditions presented together. In this paper, we present a novel framework, called AD-Net, aiming to enhance deep neural network performance on this challenge by leveraging the power of Augmentation and Distillation techniques. Specifically, our approach is designed to refine learned features through self-distillation on augmented samples, mitigating harmful overfitting. We conduct comprehensive experiments on popular fine-grained image classification benchmarks where our AD-Net demonstrates consistent improvement over traditional fine-tuning and state-of-the-art low-data techniques. Remarkably, with the smallest data available, our framework shows an outstanding relative accuracy increase of up to 45 % compared to standard ResNet-50 and up to 27 % compared to the closest SOTA runner-up. We emphasise that our approach is practically architecture-independent and adds zero extra cost at inference time. Additionally, we provide an extensive study on the impact of every framework's component, highlighting the importance of each in achieving optimal performance. Source code and trained models are publicly available at github.com/demidovd98/fgic_lowd.
comment: Main paper and Appendices
☆ EgoGaussian: Dynamic Scene Understanding from Egocentric Video with 3D Gaussian Splatting
Human activities are inherently complex, and even simple household tasks involve numerous object interactions. To better understand these activities and behaviors, it is crucial to model their dynamic interactions with the environment. The recent availability of affordable head-mounted cameras and egocentric data offers a more accessible and efficient means to understand dynamic human-object interactions in 3D environments. However, most existing methods for human activity modeling either focus on reconstructing 3D models of hand-object or human-scene interactions or on mapping 3D scenes, neglecting dynamic interactions with objects. The few existing solutions often require inputs from multiple sources, including multi-camera setups, depth-sensing cameras, or kinesthetic sensors. To this end, we introduce EgoGaussian, the first method capable of simultaneously reconstructing 3D scenes and dynamically tracking 3D object motion from RGB egocentric input alone. We leverage the uniquely discrete nature of Gaussian Splatting and segment dynamic interactions from the background. Our approach employs a clip-level online learning pipeline that leverages the dynamic nature of human activities, allowing us to reconstruct the temporal evolution of the scene in chronological order and track rigid object motion. Additionally, our method automatically segments object and background Gaussians, providing 3D representations for both static scenes and dynamic objects. EgoGaussian outperforms previous NeRF and Dynamic Gaussian methods in challenging in-the-wild videos and we also qualitatively demonstrate the high quality of the reconstructed models.
☆ Comprehensive Generative Replay for Task-Incremental Segmentation with Concurrent Appearance and Semantic Forgetting MICCAI24
Generalist segmentation models are increasingly favored for diverse tasks involving various objects from different image sources. Task-Incremental Learning (TIL) offers a privacy-preserving training paradigm using tasks arriving sequentially, instead of gathering them due to strict data sharing policies. However, the task evolution can span a wide scope that involves shifts in both image appearance and segmentation semantics with intricate correlation, causing concurrent appearance and semantic forgetting. To solve this issue, we propose a Comprehensive Generative Replay (CGR) framework that restores appearance and semantic knowledge by synthesizing image-mask pairs to mimic past task data, which focuses on two aspects: modeling image-mask correspondence and promoting scalability for diverse tasks. Specifically, we introduce a novel Bayesian Joint Diffusion (BJD) model for high-quality synthesis of image-mask pairs with their correspondence explicitly preserved by conditional denoising. Furthermore, we develop a Task-Oriented Adapter (TOA) that recalibrates prompt embeddings to modulate the diffusion model, making the data synthesis compatible with different tasks. Experiments on incremental tasks (cardiac, fundus and prostate segmentation) show its clear advantage for alleviating concurrent appearance and semantic forgetting. Code is available at https://github.com/jingyzhang/CGR.
comment: Accepted by MICCAI24
☆ Structure-aware World Model for Probe Guidance via Large-scale Self-supervised Pre-train
The complex structure of the heart leads to significant challenges in echocardiography, especially in acquisition cardiac ultrasound images. Successful echocardiography requires a thorough understanding of the structures on the two-dimensional plane and the spatial relationships between planes in three-dimensional space. In this paper, we innovatively propose a large-scale self-supervised pre-training method to acquire a cardiac structure-aware world model. The core innovation lies in constructing a self-supervised task that requires structural inference by predicting masked structures on a 2D plane and imagining another plane based on pose transformation in 3D space. To support large-scale pre-training, we collected over 1.36 million echocardiograms from ten standard views, along with their 3D spatial poses. In the downstream probe guidance task, we demonstrate that our pre-trained model consistently reduces guidance errors across the ten most common standard views on the test set with 0.29 million samples from 74 routine clinical scans, indicating that structure-aware pre-training benefits the scanning.
comment: Technical report
☆ SPIRONet: Spatial-Frequency Learning and Topological Channel Interaction Network for Vessel Segmentation
Automatic vessel segmentation is paramount for developing next-generation interventional navigation systems. However, current approaches suffer from suboptimal segmentation performances due to significant challenges in intraoperative images (i.e., low signal-to-noise ratio, small or slender vessels, and strong interference). In this paper, a novel spatial-frequency learning and topological channel interaction network (SPIRONet) is proposed to address the above issues. Specifically, dual encoders are utilized to comprehensively capture local spatial and global frequency vessel features. Then, a cross-attention fusion module is introduced to effectively fuse spatial and frequency features, thereby enhancing feature discriminability. Furthermore, a topological channel interaction module is designed to filter out task-irrelevant responses based on graph neural networks. Extensive experimental results on several challenging datasets (CADSA, CAXF, DCA1, and XCAD) demonstrate state-of-the-art performances of our method. Moreover, the inference speed of SPIRONet is 21 FPS with a 512x512 input size, surpassing clinical real-time requirements (6~12FPS). These promising outcomes indicate SPIRONet's potential for integration into vascular interventional navigation systems. Code is available at https://github.com/Dxhuang-CASIA/SPIRONet.
☆ MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment
This paper introduces MM-Instruct, a large-scale dataset of diverse and high-quality visual instruction data designed to enhance the instruction-following capabilities of large multimodal models (LMMs). While existing visual instruction datasets often focus on question-answering, they struggle to generalize to broader application scenarios such as creative writing, summarization, or image analysis. To address these limitations, we propose a novel approach to constructing MM-Instruct that leverages the strong instruction-following capabilities of existing LLMs to generate novel visual instruction data from large-scale but conventional image captioning datasets. MM-Instruct first leverages ChatGPT to automatically generate diverse instructions from a small set of seed instructions through augmenting and summarization. It then matches these instructions with images and uses an open-sourced large language model (LLM) to generate coherent answers to the instruction-image pairs. The LLM is grounded by the detailed text descriptions of images in the whole answer generation process to guarantee the alignment of the instruction data. Moreover, we introduce a benchmark based on the generated instruction data to evaluate the instruction-following capabilities of existing LMMs. We demonstrate the effectiveness of MM-Instruct by training a LLaVA-1.5 model on the generated data, denoted as LLaVA-Instruct, which exhibits significant improvements in instruction-following capabilities compared to LLaVA-1.5 models. The MM-Instruct dataset, benchmark, and pre-trained models are available at https://github.com/jihaonew/MM-Instruct.
comment: Dataset and models are available at https://github.com/jihaonew/MM-Instruct
☆ EPOCH: Jointly Estimating the 3D Pose of Cameras and Humans
Monocular Human Pose Estimation (HPE) aims at determining the 3D positions of human joints from a single 2D image captured by a camera. However, a single 2D point in the image may correspond to multiple points in 3D space. Typically, the uniqueness of the 2D-3D relationship is approximated using an orthographic or weak-perspective camera model. In this study, instead of relying on approximations, we advocate for utilizing the full perspective camera model. This involves estimating camera parameters and establishing a precise, unambiguous 2D-3D relationship. To do so, we introduce the EPOCH framework, comprising two main components: the pose lifter network (LiftNet) and the pose regressor network (RegNet). LiftNet utilizes the full perspective camera model to precisely estimate the 3D pose in an unsupervised manner. It takes a 2D pose and camera parameters as inputs and produces the corresponding 3D pose estimation. These inputs are obtained from RegNet, which starts from a single image and provides estimates for the 2D pose and camera parameters. RegNet utilizes only 2D pose data as weak supervision. Internally, RegNet predicts a 3D pose, which is then projected to 2D using the estimated camera parameters. This process enables RegNet to establish the unambiguous 2D-3D relationship. Our experiments show that modeling the lifting as an unsupervised task with a camera in-the-loop results in better generalization to unseen data. We obtain state-of-the-art results for the 3D HPE on the Human3.6M and MPI-INF-3DHP datasets. Our code is available at: [Github link upon acceptance, see supplementary materials].
comment: 17 pages, 7 figures
☆ Vision Transformer with Key-select Routing Attention for Single Image Dehazing
We present Ksformer, utilizing Multi-scale Key-select Routing Attention (MKRA) for intelligent selection of key areas through multi-channel, multi-scale windows with a top-k operator, and Lightweight Frequency Processing Module (LFPM) to enhance high-frequency features, outperforming other dehazing methods in tests.
comment: 5 pages,4 figures,IEICE Trans. Information and Systems
☆ MMRo: Are Multimodal LLMs Eligible as the Brain for In-Home Robotics?
It is fundamentally challenging for robots to serve as useful assistants in human environments because this requires addressing a spectrum of sub-problems across robotics, including perception, language understanding, reasoning, and planning. The recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated their exceptional abilities in solving complex mathematical problems, mastering commonsense and abstract reasoning. This has led to the recent utilization of MLLMs as the brain in robotic systems, enabling these models to conduct high-level planning prior to triggering low-level control actions for task execution. However, it remains uncertain whether existing MLLMs are reliable in serving the brain role of robots. In this study, we introduce the first benchmark for evaluating Multimodal LLM for Robotic (MMRo) benchmark, which tests the capability of MLLMs for robot applications. Specifically, we identify four essential capabilities perception, task planning, visual reasoning, and safety measurement that MLLMs must possess to qualify as the robot's central processing unit. We have developed several scenarios for each capability, resulting in a total of 14 metrics for evaluation. We present experimental results for various MLLMs, including both commercial and open-source models, to assess the performance of existing systems. Our findings indicate that no single model excels in all areas, suggesting that current MLLMs are not yet trustworthy enough to serve as the cognitive core for robots. Our data can be found in https://mm-robobench.github.io/.
☆ Deep Fusion Model for Brain Tumor Classification Using Fine-Grained Gradient Preservation
Brain tumors are one of the most common diseases that lead to early death if not diagnosed at an early stage. Traditional diagnostic approaches are extremely time-consuming and prone to errors. In this context, computer vision-based approaches have emerged as an effective tool for accurate brain tumor classification. While some of the existing solutions demonstrate noteworthy accuracy, the models become infeasible to deploy in areas where computational resources are limited. This research addresses the need for accurate and fast classification of brain tumors with a priority of deploying the model in technologically underdeveloped regions. The research presents a novel architecture for precise brain tumor classification fusing pretrained ResNet152V2 and modified VGG16 models. The proposed architecture undergoes a diligent fine-tuning process that ensures fine gradients are preserved in deep neural networks, which are essential for effective brain tumor classification. The proposed solution incorporates various image processing techniques to improve image quality and achieves an astounding accuracy of 98.36% and 98.04% in Figshare and Kaggle datasets respectively. This architecture stands out for having a streamlined profile, with only 2.8 million trainable parameters. We have leveraged 8-bit quantization to produce a model of size 73.881 MB, significantly reducing it from the previous size of 289.45 MB, ensuring smooth deployment in edge devices even in resource-constrained areas. Additionally, the use of Grad-CAM improves the interpretability of the model, offering insightful information regarding its decision-making process. Owing to its high discriminative ability, this model can be a reliable option for accurate brain tumor classification.
☆ Enhancing Radiological Diagnosis: A Collaborative Approach Integrating AI and Human Expertise for Visual Miss Correction
Human-AI collaboration to identify and correct perceptual errors in chest radiographs has not been previously explored. This study aimed to develop a collaborative AI system, CoRaX, which integrates eye gaze data and radiology reports to enhance diagnostic accuracy in chest radiology by pinpointing perceptual errors and refining the decision-making process. Using public datasets REFLACX and EGD-CXR, the study retrospectively developed CoRaX, employing a large multimodal model to analyze image embeddings, eye gaze data, and radiology reports. The system's effectiveness was evaluated based on its referral-making process, the quality of referrals, and performance in collaborative diagnostic settings. CoRaX was tested on a simulated error dataset of 271 samples with 28% (93 of 332) missed abnormalities. The system corrected 21% (71 of 332) of these errors, leaving 7% (22 of 312) unresolved. The Referral-Usefulness score, indicating the accuracy of predicted regions for all true referrals, was 0.63 (95% CI 0.59, 0.68). The Total-Usefulness score, reflecting the diagnostic accuracy of CoRaX's interactions with radiologists, showed that 84% (237 of 280) of these interactions had a score above 0.40. In conclusion, CoRaX efficiently collaborates with radiologists to address perceptual errors across various abnormalities, with potential applications in the education and training of novice radiologists.
comment: Under Review in Journal
☆ MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance
In recent years, generative artificial intelligence has achieved significant advancements in the field of image generation, spawning a variety of applications. However, video generation still faces considerable challenges in various aspects, such as controllability, video length, and richness of details, which hinder the application and popularization of this technology. In this work, we propose a controllable video generation framework, dubbed MimicMotion, which can generate high-quality videos of arbitrary length mimicking specific motion guidance. Compared with previous methods, our approach has several highlights. Firstly, we introduce confidence-aware pose guidance that ensures high frame quality and temporal smoothness. Secondly, we introduce regional loss amplification based on pose confidence, which significantly reduces image distortion. Lastly, for generating long and smooth videos, we propose a progressive latent fusion strategy. By this means, we can produce videos of arbitrary length with acceptable resource consumption. With extensive experiments and user studies, MimicMotion demonstrates significant improvements over previous approaches in various aspects. Detailed results and comparisons are available on our project page: https://tencent.github.io/MimicMotion .
☆ Deep Learning-based Depth Estimation Methods from Monocular Image and Videos: A Comprehensive Survey
Estimating depth from single RGB images and videos is of widespread interest due to its applications in many areas, including autonomous driving, 3D reconstruction, digital entertainment, and robotics. More than 500 deep learning-based papers have been published in the past 10 years, which indicates the growing interest in the task. This paper presents a comprehensive survey of the existing deep learning-based methods, the challenges they address, and how they have evolved in their architecture and supervision methods. It provides a taxonomy for classifying the current work based on their input and output modalities, network architectures, and learning methods. It also discusses the major milestones in the history of monocular depth estimation, and different pipelines, datasets, and evaluation metrics used in existing methods.
comment: 46 pages, 10 figures, The paper has been accepted for publication in ACM Computing Surveys 2024
☆ Beyond First-Order: A Multi-Scale Approach to Finger Knuckle Print Biometrics
Recently, finger knuckle prints (FKPs) have gained attention due to their rich textural patterns, positioning them as a promising biometric for identity recognition. Prior FKP recognition methods predominantly leverage first-order feature descriptors, which capture intricate texture details but fail to account for structural information. Emerging research, however, indicates that second-order textures, which describe the curves and arcs of the textures, encompass this overlooked structural information. This paper introduces a novel FKP recognition approach, the Dual-Order Texture Competition Network (DOTCNet), designed to capture texture information in FKP images comprehensively. DOTCNet incorporates three dual-order texture competitive modules (DTCMs), each targeting textures at different scales. Each DTCM employs a learnable texture descriptor, specifically a learnable Gabor filter (LGF), to extract texture features. By leveraging LGFs, the network extracts first and second order textures to describe fine textures and structural features thoroughly. Furthermore, an attention mechanism enhances relevant features in the first-order features, thereby highlighting significant texture details. For second-order features, a competitive mechanism emphasizes structural information while reducing noise from higher-order features. Extensive experimental results reveal that DOTCNet significantly outperforms several standard algorithms on the publicly available PolyU-FKP dataset.
☆ PopAlign: Population-Level Alignment for Fair Text-to-Image Generation
Text-to-image (T2I) models achieve high-fidelity generation through extensive training on large datasets. However, these models may unintentionally pick up undesirable biases of their training data, such as over-representation of particular identities in gender or ethnicity neutral prompts. Existing alignment methods such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) fail to address this problem effectively because they operate on pairwise preferences consisting of individual samples, while the aforementioned biases can only be measured at a population level. For example, a single sample for the prompt "doctor" could be male or female, but a model generating predominantly male doctors even with repeated sampling reflects a gender bias. To address this limitation, we introduce PopAlign, a novel approach for population-level preference optimization, while standard optimization would prefer entire sets of samples over others. We further derive a stochastic lower bound that directly optimizes for individual samples from preferred populations over others for scalable training. Using human evaluation and standard image quality and bias metrics, we show that PopAlign significantly mitigates the bias of pretrained T2I models while largely preserving the generation quality. Code is available at https://github.com/jacklishufan/PopAlignSDXL.
comment: 18 pages, 10 figures
☆ CSAKD: Knowledge Distillation with Cross Self-Attention for Hyperspectral and Multispectral Image Fusion
Hyperspectral imaging, capturing detailed spectral information for each pixel, is pivotal in diverse scientific and industrial applications. Yet, the acquisition of high-resolution (HR) hyperspectral images (HSIs) often needs to be addressed due to the hardware limitations of existing imaging systems. A prevalent workaround involves capturing both a high-resolution multispectral image (HR-MSI) and a low-resolution (LR) HSI, subsequently fusing them to yield the desired HR-HSI. Although deep learning-based methods have shown promising in HR-MSI/LR-HSI fusion and LR-HSI super-resolution (SR), their substantial model complexities hinder deployment on resource-constrained imaging devices. This paper introduces a novel knowledge distillation (KD) framework for HR-MSI/LR-HSI fusion to achieve SR of LR-HSI. Our KD framework integrates the proposed Cross-Layer Residual Aggregation (CLRA) block to enhance efficiency for constructing Dual Two-Streamed (DTS) network structure, designed to extract joint and distinct features from LR-HSI and HR-MSI simultaneously. To fully exploit the spatial and spectral feature representations of LR-HSI and HR-MSI, we propose a novel Cross Self-Attention (CSA) fusion module to adaptively fuse those features to improve the spatial and spectral quality of the reconstructed HR-HSI. Finally, the proposed KD-based joint loss function is employed to co-train the teacher and student networks. Our experimental results demonstrate that the student model not only achieves comparable or superior LR-HSI SR performance but also significantly reduces the model-size and computational requirements. This marks a substantial advancement over existing state-of-the-art methods. The source code is available at https://github.com/ming053l/CSAKD.
comment: Submitted to TIP 2024
☆ PM-VIS+: High-Performance Video Instance Segmentation without Video Annotation
Video instance segmentation requires detecting, segmenting, and tracking objects in videos, typically relying on costly video annotations. This paper introduces a method that eliminates video annotations by utilizing image datasets. The PM-VIS algorithm is adapted to handle both bounding box and instance-level pixel annotations dynamically. We introduce ImageNet-bbox to supplement missing categories in video datasets and propose the PM-VIS+ algorithm to adjust supervision based on annotation types. To enhance accuracy, we use pseudo masks and semi-supervised optimization techniques on unannotated video data. This method achieves high video instance segmentation performance without manual video annotations, offering a cost-effective solution and new perspectives for video instance segmentation applications. The code will be available in https://github.com/ldknight/PM-VIS-plus
comment: MIPR 2024
☆ Basketball-SORT: An Association Method for Complex Multi-object Occlusion Problems in Basketball Multi-object Tracking
Recent deep learning-based object detection approaches have led to significant progress in multi-object tracking (MOT) algorithms. The current MOT methods mainly focus on pedestrian or vehicle scenes, but basketball sports scenes are usually accompanied by three or more object occlusion problems with similar appearances and high-intensity complex motions, which we call complex multi-object occlusion (CMOO). Here, we propose an online and robust MOT approach, named Basketball-SORT, which focuses on the CMOO problems in basketball videos. To overcome the CMOO problem, instead of using the intersection-over-union-based (IoU-based) approach, we use the trajectories of neighboring frames based on the projected positions of the players. Our method designs the basketball game restriction (BGR) and reacquiring Long-Lost IDs (RLLI) based on the characteristics of basketball scenes, and we also solve the occlusion problem based on the player trajectories and appearance features. Experimental results show that our method achieves a Higher Order Tracking Accuracy (HOTA) score of 63.48$\%$ on the basketball fixed video dataset and outperforms other recent popular approaches. Overall, our approach solved the CMOO problem more effectively than recent MOT algorithms.
☆ AstMatch: Adversarial Self-training Consistency Framework for Semi-Supervised Medical Image Segmentation
Semi-supervised learning (SSL) has shown considerable potential in medical image segmentation, primarily leveraging consistency regularization and pseudo-labeling. However, many SSL approaches only pay attention to low-level consistency and overlook the significance of pseudo-label reliability. Therefore, in this work, we propose an adversarial self-training consistency framework (AstMatch). Firstly, we design an adversarial consistency regularization (ACR) approach to enhance knowledge transfer and strengthen prediction consistency under varying perturbation intensities. Second, we apply a feature matching loss for adversarial training to incorporate high-level consistency regularization. Additionally, we present the pyramid channel attention (PCA) and efficient channel and spatial attention (ECSA) modules to improve the discriminator's performance. Finally, we propose an adaptive self-training (AST) approach to ensure the pseudo-labels' quality. The proposed AstMatch has been extensively evaluated with cutting-edge SSL methods on three public-available datasets. The experimental results under different labeled ratios indicate that AstMatch outperforms other existing methods, achieving new state-of-the-art performance. Our code will be available at https://github.com/GuanghaoZhu663/AstMatch.
☆ Efficient Event Stream Super-Resolution with Recursive Multi-Branch Fusion
Current Event Stream Super-Resolution (ESR) methods overlook the redundant and complementary information present in positive and negative events within the event stream, employing a direct mixing approach for super-resolution, which may lead to detail loss and inefficiency. To address these issues, we propose an efficient Recursive Multi-Branch Information Fusion Network (RMFNet) that separates positive and negative events for complementary information extraction, followed by mutual supplementation and refinement. Particularly, we introduce Feature Fusion Modules (FFM) and Feature Exchange Modules (FEM). FFM is designed for the fusion of contextual information within neighboring event streams, leveraging the coupling relationship between positive and negative events to alleviate the misleading of noises in the respective branches. FEM efficiently promotes the fusion and exchange of information between positive and negative branches, enabling superior local information enhancement and global information complementation. Experimental results demonstrate that our approach achieves over 17% and 31% improvement on synthetic and real datasets, accompanied by a 2.3X acceleration. Furthermore, we evaluate our method on two downstream event-driven applications, \emph{i.e.}, object recognition and video reconstruction, achieving remarkable results that outperform existing methods. Our code and Supplementary Material are available at https://github.com/Lqm26/RMFNet.
☆ Precision matters: Precision-aware ensemble for weakly supervised semantic segmentation AAAI 2024
Weakly Supervised Semantic Segmentation (WSSS) employs weak supervision, such as image-level labels, to train the segmentation model. Despite the impressive achievement in recent WSSS methods, we identify that introducing weak labels with high mean Intersection of Union (mIoU) does not guarantee high segmentation performance. Existing studies have emphasized the importance of prioritizing precision and reducing noise to improve overall performance. In the same vein, we propose ORANDNet, an advanced ensemble approach tailored for WSSS. ORANDNet combines Class Activation Maps (CAMs) from two different classifiers to increase the precision of pseudo-masks (PMs). To further mitigate small noise in the PMs, we incorporate curriculum learning. This involves training the segmentation model initially with pairs of smaller-sized images and corresponding PMs, gradually transitioning to the original-sized pairs. By combining the original CAMs of ResNet-50 and ViT, we significantly improve the segmentation performance over the single-best model and the naive ensemble model, respectively. We further extend our ensemble method to CAMs from AMN (ResNet-like) and MCTformer (ViT-like) models, achieving performance benefits in advanced WSSS models. It highlights the potential of our ORANDNet as a final add-on module for WSSS models.
comment: 5 pages, 5 figures, accepted in AAAI 2024 Edge Intelligence Workshop
☆ Model Predictive Simulation Using Structured Graphical Models and Transformers
We propose an approach to simulating trajectories of multiple interacting agents (road users) based on transformers and probabilistic graphical models (PGMs), and apply it to the Waymo SimAgents challenge. The transformer baseline is based on the MTR model, which predicts multiple future trajectories conditioned on the past trajectories and static road layout features. We then improve upon these generated trajectories using a PGM, which contains factors which encode prior knowledge, such as a preference for smooth trajectories, and avoidance of collisions with static obstacles and other moving agents. We perform (approximate) MAP inference in this PGM using the Gauss-Newton method. Finally we sample $K=32$ trajectories for each of the $N \sim 100$ agents for the next $T=8 \Delta$ time steps, where $\Delta=10$ is the sampling rate per second. Following the Model Predictive Control (MPC) paradigm, we only return the first element of our forecasted trajectories at each step, and then we replan, so that the simulation can constantly adapt to its changing environment. We therefore call our approach "Model Predictive Simulation" or MPS. We show that MPS improves upon the MTR baseline, especially in safety critical metrics such as collision rate. Furthermore, our approach is compatible with any underlying forecasting model, and does not require extra training, so we believe it is a valuable contribution to the community.
comment: Special Mention at the Waymo Sim Agents Challenge 2024
☆ PPTFormer: Pseudo Multi-Perspective Transformer for UAV Segmentation IJCAI 2024
The ascension of Unmanned Aerial Vehicles (UAVs) in various fields necessitates effective UAV image segmentation, which faces challenges due to the dynamic perspectives of UAV-captured images. Traditional segmentation algorithms falter as they cannot accurately mimic the complexity of UAV perspectives, and the cost of obtaining multi-perspective labeled datasets is prohibitive. To address these issues, we introduce the PPTFormer, a novel \textbf{P}seudo Multi-\textbf{P}erspective \textbf{T}rans\textbf{former} network that revolutionizes UAV image segmentation. Our approach circumvents the need for actual multi-perspective data by creating pseudo perspectives for enhanced multi-perspective learning. The PPTFormer network boasts Perspective Decomposition, novel Perspective Prototypes, and a specialized encoder and decoder that together achieve superior segmentation results through Pseudo Multi-Perspective Attention (PMP Attention) and fusion. Our experiments demonstrate that PPTFormer achieves state-of-the-art performance across five UAV segmentation datasets, confirming its capability to effectively simulate UAV flight perspectives and significantly advance segmentation precision. This work presents a pioneering leap in UAV scene understanding and sets a new benchmark for future developments in semantic segmentation.
comment: IJCAI 2024
☆ Optimal Video Compression using Pixel Shift Tracking
The Video comprises approximately ~85\% of all internet traffic, but video encoding/compression is being historically done with hard coded rules, which has worked well but only to a certain limit. We have seen a surge in video compression algorithms using ML-based models in the last few years and many of them have outperformed several legacy codecs. The models range from encoding video end to end using an ML approach or replacing some intermediate steps in legacy codecs using ML models to increase the efficiency of those steps. Optimizing video storage is an essential aspect of video processing, so we are proposing one of the possible approaches to achieve it is by avoiding redundant data at each frame. In this paper, we want to introduce the approach of redundancies removal in subsequent frames for a given video as a main approach for video compression. We call this method Redundancy Removal using Shift (R\textsuperscript2S). This method can be utilized across various Machine Learning model algorithms, and make the compression more accessible and adaptable. In this study, we have utilized a computer vision-based pixel point tracking method to identify redundant pixels to encode video for optimal storage.
☆ A Survey on Deep Clustering: From the Prior Perspective
Facilitated by the powerful feature extraction ability of neural networks, deep clustering has achieved great success in analyzing high-dimensional and complex real-world data. The performance of deep clustering methods is affected by various factors such as network structures and learning objectives. However, as pointed out in this survey, the essence of deep clustering lies in the incorporation and utilization of prior knowledge, which is largely ignored by existing works. From pioneering deep clustering methods based on data structure assumptions to recent contrastive clustering methods based on data augmentation invariances, the development of deep clustering intrinsically corresponds to the evolution of prior knowledge. In this survey, we provide a comprehensive review of deep clustering methods by categorizing them into six types of prior knowledge. We find that in general the prior innovation follows two trends, namely, i) from mining to constructing, and ii) from internal to external. Besides, we provide a benchmark on five widely-used datasets and analyze the performance of methods with diverse priors. By providing a novel prior knowledge perspective, we hope this survey could provide some novel insights and inspire future research in the deep clustering community.
☆ SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs
Synthetic data generation has gained significant attention recently for its utility in training large vision and language models. However, the application of synthetic data to the training of multimodal context-augmented generation systems has been relatively unexplored. This gap in existing work is important because existing vision and language models (VLMs) are not trained specifically for context-augmented generation. Resources for adapting such models are therefore crucial for enabling their use in retrieval-augmented generation (RAG) settings, where a retriever is used to gather relevant information that is then subsequently provided to a generative model via context augmentation. To address this challenging problem, we generate SK-VQA: a large synthetic multimodal dataset containing over 2 million question-answer pairs which require external knowledge to determine the final answer. Our dataset is both larger and significantly more diverse than existing resources of its kind, possessing over 11x more unique questions and containing images from a greater variety of sources than previously-proposed datasets. Through extensive experiments, we demonstrate that our synthetic dataset can not only serve as a challenging benchmark, but is also highly effective for adapting existing generative multimodal models for context-augmented generation.
♻ ☆ Exploiting Diffusion Prior for Real-World Image Super-Resolution
We present a novel approach to leverage prior knowledge encapsulated in pre-trained text-to-image diffusion models for blind super-resolution (SR). Specifically, by employing our time-aware encoder, we can achieve promising restoration results without altering the pre-trained synthesis model, thereby preserving the generative prior and minimizing training cost. To remedy the loss of fidelity caused by the inherent stochasticity of diffusion models, we employ a controllable feature wrapping module that allows users to balance quality and fidelity by simply adjusting a scalar value during the inference process. Moreover, we develop a progressive aggregation sampling strategy to overcome the fixed-size constraints of pre-trained diffusion models, enabling adaptation to resolutions of any size. A comprehensive evaluation of our method using both synthetic and real-world benchmarks demonstrates its superiority over current state-of-the-art approaches. Code and models are available at https://github.com/IceClear/StableSR.
comment: Accepted by IJCV'2024. Some Figs are compressed due to size limits. Uncompressed ver.: https://github.com/IceClear/StableSR/releases/download/UncompressedPDF/StableSR_IJCV_Uncompressed.pdf. Project page: https://iceclear.github.io/projects/stablesr/
♻ ☆ EnSolver: Uncertainty-Aware Ensemble CAPTCHA Solvers with Theoretical Guarantees UAI 2023
The popularity of text-based CAPTCHA as a security mechanism to protect websites from automated bots has prompted researches in CAPTCHA solvers, with the aim of understanding its failure cases and subsequently making CAPTCHAs more secure. Recently proposed solvers, built on advances in deep learning, are able to crack even the very challenging CAPTCHAs with high accuracy. However, these solvers often perform poorly on out-of-distribution samples that contain visual features different from those in the training set. Furthermore, they lack the ability to detect and avoid such samples, making them susceptible to being locked out by defense systems after a certain number of failed attempts. In this paper, we propose EnSolver, a family of CAPTCHA solvers that use deep ensemble uncertainty to detect and skip out-of-distribution CAPTCHAs, making it harder to be detected. We prove novel theoretical bounds on the effectiveness of our solvers and demonstrate their use with state-of-the-art CAPTCHA solvers. Our experiments show that the proposed approaches perform well when cracking CAPTCHA datasets that contain both in-distribution and out-of-distribution samples.
comment: A previous version of this paper was presented at the Epistemic Uncertainty - E-pi UAI 2023 Workshop
♻ ☆ Robustness Assessment of a Runway Object Classifier for Safe Aircraft Taxiing SC
As deep neural networks (DNNs) are becoming the prominent solution for many computational problems, the aviation industry seeks to explore their potential in alleviating pilot workload and in improving operational safety. However, the use of DNNs in this type of safety-critical applications requires a thorough certification process. This need can be addressed through formal verification, which provides rigorous assurances -- e.g.,~by proving the absence of certain mispredictions. In this case-study paper, we demonstrate this process using an image-classifier DNN currently under development at Airbus and intended for use during the aircraft taxiing phase. We use formal methods to assess this DNN's robustness to three common image perturbation types: noise, brightness and contrast, and some of their combinations. This process entails multiple invocations of the underlying verifier, which might be computationally expensive; and we therefore propose a method that leverages the monotonicity of these robustness properties, as well as the results of past verification queries, in order to reduce the overall number of verification queries required by nearly 60%. Our results provide an indication of the level of robustness achieved by the DNN classifier under study, and indicate that it is considerably more vulnerable to noise than to brightness or contrast perturbations.
comment: This is a preprint version of the paper in the proceedings of 43rd Digital Avionics Systems Conference (DASC)
♻ ☆ Learning to utilize image second-order derivative information for crisp edge detection
Edge detection is a fundamental task in computer vision. It has made great progress under the development of deep convolutional neural networks (DCNNs), some of which have achieved a beyond human-level performance. However, recent top-performing edge detection methods tend to generate thick and noisy edge lines. In this work, we solve this problem from two aspects: (1) the lack of prior knowledge regarding image edges, and (2) the issue of imbalanced pixel distribution. We propose a second-order derivative-based multi-scale contextual enhancement module (SDMCM) to help the model locate true edge pixels accurately by introducing the edge prior knowledge. We also construct a hybrid focal loss function (HFL) to alleviate the imbalanced distribution issue. In addition, we employ the conditionally parameterized convolution (CondConv) to develop a novel boundary refinement module (BRM), which can further refine the final output edge maps. In the end, we propose a U-shape network named LUS-Net which is based on the SDMCM and BRM for crisp edge detection. We perform extensive experiments on three standard benchmarks, and the experiment results illustrate that our method can predict crisp and clean edge maps and achieves state-of-the-art performance on the BSDS500 dataset (ODS=0.829), NYUD-V2 dataset (ODS=0.768), and BIPED dataset (ODS=0.903).
♻ ☆ DWARF: Disease-weighted network for attention map refinement
The interpretability of deep learning is crucial for evaluating the reliability of medical imaging models and reducing the risks of inaccurate patient recommendations. This study addresses the "human out of the loop" and "trustworthiness" issues in medical image analysis by integrating medical professionals into the interpretability process. We propose a disease-weighted attention map refinement network (DWARF) that leverages expert feedback to enhance model relevance and accuracy. Our method employs cyclic training to iteratively improve diagnostic performance, generating precise and interpretable feature maps. Experimental results demonstrate significant improvements in interpretability and diagnostic accuracy across multiple medical imaging datasets. This approach fosters effective collaboration between AI systems and healthcare professionals, ultimately aiming to improve patient outcomes
♻ ☆ Modeling State Shifting via Local-Global Distillation for Event-Frame Gaze Tracking
This paper tackles the problem of passive gaze estimation using both event and frame data. Considering the inherently different physiological structures, it is intractable to accurately estimate gaze purely based on a given state. Thus, we reformulate gaze estimation as the quantification of the state shifting from the current state to several prior registered anchor states. Specifically, we propose a two-stage learning-based gaze estimation framework that divides the whole gaze estimation process into a coarse-to-fine approach involving anchor state selection and final gaze location. Moreover, to improve the generalization ability, instead of learning a large gaze estimation network directly, we align a group of local experts with a student network, where a novel denoising distillation algorithm is introduced to utilize denoising diffusion techniques to iteratively remove inherent noise in event data. Extensive experiments demonstrate the effectiveness of the proposed method, which surpasses state-of-the-art methods by a large margin of 15$\%$. The code will be publicly available at https://github.com/jdjdli/Denoise_distill_EF_gazetracker.
♻ ☆ Tracking Object Positions in Reinforcement Learning: A Metric for Keypoint Detection (extended version)
Reinforcement learning (RL) for robot control typically requires a detailed representation of the environment state, including information about task-relevant objects not directly measurable. Keypoint detectors, such as spatial autoencoders (SAEs), are a common approach to extracting a low-dimensional representation from high-dimensional image data. SAEs aim at spatial features such as object positions, which are often useful representations in robotic RL. However, whether an SAE is actually able to track objects in the scene and thus yields a spatial state representation well suited for RL tasks has rarely been examined due to a lack of established metrics. In this paper, we propose to assess the performance of an SAE instance by measuring how well keypoints track ground truth objects in images. We present a computationally lightweight metric and use it to evaluate common baseline SAE architectures on image data from a simulated robot task. We find that common SAEs differ substantially in their spatial extraction capability. Furthermore, we validate that SAEs that perform well in our metric achieve superior performance when used in downstream RL. Thus, our metric is an effective and lightweight indicator of RL performance before executing expensive RL training. Building on these insights, we identify three key modifications of SAE architectures to improve tracking performance. We make our code available at anonymous.4open.science/r/sae-rl.
comment: 19 pages, 12 figures
♻ ☆ LatentExplainer: Explaining Latent Representations in Deep Generative Models with Multi-modal Foundation Models
Deep generative models like VAEs and diffusion models have advanced various generation tasks by leveraging latent variables to learn data distributions and generate high-quality samples. Despite the field of explainable AI making strides in interpreting machine learning models, understanding latent variables in generative models remains challenging. This paper introduces LatentExplainer, a framework for automatically generating semantically meaningful explanations of latent variables in deep generative models. LatentExplainer tackles three main challenges: inferring the meaning of latent variables, aligning explanations with inductive biases, and handling varying degrees of explainability. By perturbing latent variables and interpreting changes in generated data, the framework provides a systematic approach to understanding and controlling the data generation process, enhancing the transparency and interpretability of deep generative models. We evaluate our proposed method on several real-world and synthetic datasets, and the results demonstrate superior performance in generating high-quality explanations of latent variables.
♻ ☆ Mining Open Semantics from CLIP: A Relation Transition Perspective for Few-Shot Learning
Contrastive Vision-Language Pre-training(CLIP) demonstrates impressive zero-shot capability. The key to improve the adaptation of CLIP to downstream task with few exemplars lies in how to effectively model and transfer the useful knowledge embedded in CLIP. Previous work mines the knowledge typically based on the limited visual samples and close-set semantics (i.e., within target category set of downstream task). However, the aligned CLIP image/text encoders contain abundant relationships between visual features and almost infinite open semantics, which may benefit the few-shot learning but remains unexplored. In this paper, we propose to mine open semantics as anchors to perform a relation transition from image-anchor relationship to image-target relationship to make predictions. Specifically, we adopt a transformer module which takes the visual feature as "Query", the text features of the anchors as "Key" and the similarity matrix between the text features of anchor and target classes as "Value". In this way, the output of such a transformer module represents the relationship between the image and target categories, i.e., the classification predictions. To avoid manually selecting the open semantics, we make the [CLASS] token of input text embedding learnable. We conduct extensive experiments on eleven representative classification datasets. The results show that our method performs favorably against previous state-of-the-arts considering few-shot classification settings.
♻ ☆ Kandinsky 3.0 Technical Report
We present Kandinsky 3.0, a large-scale text-to-image generation model based on latent diffusion, continuing the series of text-to-image Kandinsky models and reflecting our progress to achieve higher quality and realism of image generation. In this report we describe the architecture of the model, the data collection procedure, the training technique, and the production system for user interaction. We focus on the key components that, as we have identified as a result of a large number of experiments, had the most significant impact on improving the quality of our model compared to the others. We also describe extensions and applications of our model, including super resolution, inpainting, image editing, image-to-video generation, and a distilled version of Kandinsky 3.0 - Kandinsky 3.1, which does inference in 4 steps of the reverse process and 20 times faster without visual quality decrease. By side-by-side human preferences comparison, Kandinsky becomes better in text understanding and works better on specific domains. The code is available at https://github.com/ai-forever/Kandinsky-3
comment: Project page: https://ai-forever.github.io/Kandinsky-3
♻ ☆ Deformable MRI Sequence Registration for AI-based Prostate Cancer Diagnosis
The PI-CAI (Prostate Imaging: Cancer AI) challenge led to expert-level diagnostic algorithms for clinically significant prostate cancer detection. The algorithms receive biparametric MRI scans as input, which consist of T2-weighted and diffusion-weighted scans. These scans can be misaligned due to multiple factors in the scanning process. Image registration can alleviate this issue by predicting the deformation between the sequences. We investigate the effect of image registration on the diagnostic performance of AI-based prostate cancer diagnosis. First, the image registration algorithm, developed in MeVisLab, is analyzed using a dataset with paired lesion annotations. Second, the effect on diagnosis is evaluated by comparing case-level cancer diagnosis performance between using the original dataset, rigidly aligned diffusion-weighted scans, or deformably aligned diffusion-weighted scans. Rigid registration showed no improvement. Deformable registration demonstrated a substantial improvement in lesion overlap (+10% median Dice score) and a positive yet non-significant improvement in diagnostic performance (+0.3% AUROC, p=0.18). Our investigation shows that a substantial improvement in lesion alignment does not directly lead to a significant improvement in diagnostic performance. Qualitative analysis indicated that jointly developing image registration methods and diagnostic AI algorithms could enhance diagnostic accuracy and patient outcomes.
♻ ☆ Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach
Self-supervised features are the cornerstone of modern machine learning systems. They are typically pre-trained on data collections whose construction and curation typically require extensive human effort. This manual process has some limitations similar to those encountered in supervised learning, e.g., the crowd-sourced selection of data is costly and time-consuming, preventing scaling the dataset size. In this work, we consider the problem of automatic curation of high-quality datasets for self-supervised pre-training. We posit that such datasets should be large, diverse and balanced, and propose a clustering-based approach for building ones satisfying all these criteria. Our method involves successive and hierarchical applications of $k$-means on a large and diverse data repository to obtain clusters that distribute uniformly among data concepts, followed by a hierarchical, balanced sampling step from these clusters. Extensive experiments on three different data domains including web-based images, satellite images and text show that features trained on our automatically curated datasets outperform those trained on uncurated data while being on par or better than ones trained on manually curated data. Code is available at https://github.com/facebookresearch/ssl-data-curation.
♻ ☆ LiverUSRecon: Automatic 3D Reconstruction and Volumetry of the Liver with a Few Partial Ultrasound Scans MICCAI 2024
3D reconstruction of the liver for volumetry is important for qualitative analysis and disease diagnosis. Liver volumetry using ultrasound (US) scans, although advantageous due to less acquisition time and safety, is challenging due to the inherent noisiness in US scans, blurry boundaries, and partial liver visibility. We address these challenges by using the segmentation masks of a few incomplete sagittal-plane US scans of the liver in conjunction with a statistical shape model (SSM) built using a set of CT scans of the liver. We compute the shape parameters needed to warp this canonical SSM to fit the US scans through a parametric regression network. The resulting 3D liver reconstruction is accurate and leads to automatic liver volume calculation. We evaluate the accuracy of the estimated liver volumes with respect to CT segmentation volumes using RMSE. Our volume computation is statistically much closer to the volume estimated using CT scans than the volume computed using Childs' method by radiologists: p-value of 0.094 (>0.05) says that there is no significant difference between CT segmentation volumes and ours in contrast to Childs' method. We validate our method using investigations (ablation studies) on the US image resolution, the number of CT scans used for SSM, the number of principal components, and the number of input US scans. To the best of our knowledge, this is the first automatic liver volumetry system using a few incomplete US scans given a set of CT scans of livers for SSM.
comment: 10 pages, Accepted to MICCAI 2024
♻ ☆ Cross-domain Denoising for Low-dose Multi-frame Spiral Computed Tomography
Computed tomography (CT) has been used worldwide as a non-invasive test to assist in diagnosis. However, the ionizing nature of X-ray exposure raises concerns about potential health risks such as cancer. The desire for lower radiation doses has driven researchers to improve reconstruction quality. Although previous studies on low-dose computed tomography (LDCT) denoising have demonstrated the effectiveness of learning-based methods, most were developed on the simulated data. However, the real-world scenario differs significantly from the simulation domain, especially when using the multi-slice spiral scanner geometry. This paper proposes a two-stage method for the commercially available multi-slice spiral CT scanners that better exploits the complete reconstruction pipeline for LDCT denoising across different domains. Our approach makes good use of the high redundancy of multi-slice projections and the volumetric reconstructions while leveraging the over-smoothing problem in conventional cascaded frameworks caused by aggressive denoising. The dedicated design also provides a more explicit interpretation of the data flow. Extensive experiments on various datasets showed that the proposed method could remove up to 70\% of noise without compromised spatial resolution, and subjective evaluations by two experienced radiologists further supported its superior performance against state-of-the-art methods in clinical practice.
♻ ☆ Defect Detection in Synthetic Fibre Ropes using Detectron2 Framework
Fibre ropes with the latest technology have emerged as an appealing alternative to steel ropes for offshore industries due to their lightweight and high tensile strength. At the same time, frequent inspection of these ropes is essential to ensure the proper functioning and safety of the entire system. The development of deep learning (DL) models in condition monitoring (CM) applications offers a simpler and more effective approach for defect detection in synthetic fibre ropes (SFRs). The present paper investigates the performance of Detectron2, a state-of-the-art library for defect detection and instance segmentation. Detectron2 with Mask R-CNN architecture is used for segmenting defects in SFRs. Mask R-CNN with various backbone configurations has been trained and tested on an experimentally obtained dataset comprising 1,803 high-dimensional images containing seven damage classes (placking high, placking medium, placking low, compression, core out, chafing, and normal respectively) for SFRs. By leveraging the capabilities of Detectron2, this study aims to develop an automated and efficient method for detecting defects in SFRs, enhancing the inspection process, and ensuring the safety of the fibre ropes.
comment: 12 pages, 8 figures, 4 tables
♻ ☆ Assessment of Sentinel-2 spatial and temporal coverage based on the scene classification layer
Since the launch of the Sentinel-2 (S2) satellites, many ML models have used the data for diverse applications. The scene classification layer (SCL) inside the S2 product provides rich information for training, such as filtering images with high cloud coverage. However, there is more potential in this. We propose a technique to assess the clean optical coverage of a region, expressed by a SITS and calculated with the S2-based SCL data. With a manual threshold and specific labels in the SCL, the proposed technique assigns a percentage of spatial and temporal coverage across the time series and a high/low assessment. By evaluating the AI4EO challenge for Enhanced Agriculture, we show that the assessment is correlated to the predictive results of ML models. The classification results in a region with low spatial and temporal coverage is worse than in a region with high coverage. Finally, we applied the technique across all continents of the global dataset LandCoverNet.
comment: Accepted at IEEE International Geoscience and Remote Sensing Symposium 2024
♻ ☆ Logical Closed Loop: Uncovering Object Hallucinations in Large Vision-Language Models ACL 2024
Object hallucination has been an Achilles' heel which hinders the broader applications of large vision-language models (LVLMs). Object hallucination refers to the phenomenon that the LVLMs claim non-existent objects in the image. To mitigate the object hallucinations, instruction tuning and external model-based detection methods have been proposed, which either require large-scare computational resources or depend on the detection result of external models. However, there remains an under-explored field to utilize the LVLM itself to alleviate object hallucinations. In this work, we adopt the intuition that the LVLM tends to respond logically consistently for existent objects but inconsistently for hallucinated objects. Therefore, we propose a Logical Closed Loop-based framework for Object Hallucination Detection and Mitigation, namely LogicCheckGPT. In specific, we devise logical consistency probing to raise questions with logical correlations, inquiring about attributes from objects and vice versa. Whether their responses can form a logical closed loop serves as an indicator of object hallucination. As a plug-and-play method, it can be seamlessly applied to all existing LVLMs. Comprehensive experiments conducted on three benchmarks across four LVLMs have demonstrated significant improvements brought by our method, indicating its effectiveness and generality.
comment: Accept to ACL 2024; 19 Pages, 15 Figures, 6 Tables
♻ ☆ Viewport Prediction for Volumetric Video Streaming by Exploring Video Saliency and Trajectory Information
Volumetric video, also known as hologram video, is a novel medium that portrays natural content in Virtual Reality (VR), Augmented Reality (AR), and Mixed Reality (MR). It is expected to be the next-gen video technology and a prevalent use case for 5G and beyond wireless communication. Considering that each user typically only watches a section of the volumetric video, known as the viewport, it is essential to have precise viewport prediction for optimal performance. However, research on this topic is still in its infancy. In the end, this paper presents and proposes a novel approach, named Saliency and Trajectory Viewport Prediction (STVP), which aims to improve the precision of viewport prediction in volumetric video streaming. The STVP extensively utilizes video saliency information and viewport trajectory. To our knowledge, this is the first comprehensive study of viewport prediction in volumetric video streaming. In particular, we introduce a novel sampling method, Uniform Random Sampling (URS), to reduce computational complexity while still preserving video features in an efficient manner. Then we present a saliency detection technique that incorporates both spatial and temporal information for detecting static, dynamic geometric, and color salient regions. Finally, we intelligently fuse saliency and trajectory information to achieve more accurate viewport prediction. We conduct extensive simulations to evaluate the effectiveness of our proposed viewport prediction methods using state-of-the-art volumetric video sequences. The experimental results show the superiority of the proposed method over existing schemes. The dataset and source code will be publicly accessible after acceptance.
♻ ☆ FAGhead: Fully Animate Gaussian Head from Monocular Videos
High-fidelity reconstruction of 3D human avatars has a wild application in visual reality. In this paper, we introduce FAGhead, a method that enables fully controllable human portraits from monocular videos. We explicit the traditional 3D morphable meshes (3DMM) and optimize the neutral 3D Gaussians to reconstruct with complex expressions. Furthermore, we employ a novel Point-based Learnable Representation Field (PLRF) with learnable Gaussian point positions to enhance reconstruction performance. Meanwhile, to effectively manage the edges of avatars, we introduced the alpha rendering to supervise the alpha value of each pixel. Extensive experimental results on the open-source datasets and our capturing datasets demonstrate that our approach is able to generate high-fidelity 3D head avatars and fully control the expression and pose of the virtual avatars, which is outperforming than existing works.
♻ ☆ A Refer-and-Ground Multimodal Large Language Model for Biomedicine MICCAI2024
With the rapid development of multimodal large language models (MLLMs), especially their capabilities in visual chat through refer and ground functionalities, their significance is increasingly recognized. However, the biomedical field currently exhibits a substantial gap in this area, primarily due to the absence of a dedicated refer and ground dataset for biomedical images. To address this challenge, we devised the Med-GRIT-270k dataset. It comprises 270k question-and-answer pairs and spans eight distinct medical imaging modalities. Most importantly, it is the first dedicated to the biomedical domain and integrating refer and ground conversations. The key idea is to sample large-scale biomedical image-mask pairs from medical segmentation datasets and generate instruction datasets from text using chatGPT. Additionally, we introduce a Refer-and-Ground Multimodal Large Language Model for Biomedicine (BiRD) by using this dataset and multi-task instruction learning. Extensive experiments have corroborated the efficacy of the Med-GRIT-270k dataset and the multi-modal, fine-grained interactive capabilities of the BiRD model. This holds significant reference value for the exploration and development of intelligent biomedical assistants.
comment: Accepted by MICCAI2024
♻ ☆ ProbRadarM3F: mmWave Radar based Human Skeletal Pose Estimation with Probability Map Guided Multi-Format Feature Fusion
Millimeter wave (mmWave) radar is a non-intrusive privacy and relatively convenient and inexpensive device, which has been demonstrated to be applicable in place of RGB cameras in human indoor pose estimation tasks. However, mmWave radar relies on the collection of reflected signals from the target, and the radar signals containing information is difficult to be fully applied. This has been a long-standing hindrance to the improvement of pose estimation accuracy. To address this major challenge, this paper introduces a probability map guided multi-format feature fusion model, ProbRadarM3F. This is a novel radar feature extraction framework using a traditional FFT method in parallel with a probability map based positional encoding method. ProbRadarM3F fuses the traditional heatmap features and the positional features, then effectively achieves the estimation of 14 keypoints of the human body. Experimental evaluation on the HuPR dataset proves the effectiveness of the model proposed in this paper, outperforming other methods experimented on this dataset with an AP of 69.9 %. The emphasis of our study is focusing on the position information that is not exploited before in radar singal. This provides direction to investigate other potential non-redundant information from mmWave rader.
♻ ☆ SimTxtSeg: Weakly-Supervised Medical Image Segmentation with Simple Text Cues MICCAI 2024
Weakly-supervised medical image segmentation is a challenging task that aims to reduce the annotation cost while keep the segmentation performance. In this paper, we present a novel framework, SimTxtSeg, that leverages simple text cues to generate high-quality pseudo-labels and study the cross-modal fusion in training segmentation models, simultaneously. Our contribution consists of two key components: an effective Textual-to-Visual Cue Converter that produces visual prompts from text prompts on medical images, and a text-guided segmentation model with Text-Vision Hybrid Attention that fuses text and image features. We evaluate our framework on two medical image segmentation tasks: colonic polyp segmentation and MRI brain tumor segmentation, and achieve consistent state-of-the-art performance.
comment: accepted by MICCAI 2024
♻ ☆ Leveraging Knowledge Distillation for Lightweight Skin Cancer Classification: Balancing Accuracy and Computational Efficiency
Skin cancer is a major concern to public health, accounting for one-third of the reported cancers. If not detected early, the cancer has the potential for severe consequences. Recognizing the critical need for effective skin cancer classification, we address the limitations of existing models, which are often too large to deploy in areas with limited computational resources. In response, we present a knowledge distillation based approach for creating a lightweight yet high-performing classifier. The proposed solution involves fusing three models, namely ResNet152V2, ConvNeXtBase, and ViT Base, to create an effective teacher model. The teacher model is then employed to guide a lightweight student model of size 2.03 MB. This student model is further compressed to 469.77 KB using 16-bit quantization, enabling smooth incorporation into edge devices. With six-stage image preprocessing, data augmentation, and a rigorous ablation study, the model achieves an impressive accuracy of 98.75% on the HAM10000 dataset and 98.94% on the Kaggle dataset in classifying benign and malignant skin cancers. With its high accuracy and compact size, our model appears to be a potential choice for accurate skin cancer classification, particularly in resource-constrained settings.
♻ ☆ FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts ACL 2024
Existing benchmarks for visual question answering lack in visual grounding and complexity, particularly in evaluating spatial reasoning skills. We introduce FlowVQA, a novel benchmark aimed at assessing the capabilities of visual question-answering multimodal language models in reasoning with flowcharts as visual contexts. FlowVQA comprises 2,272 carefully generated and human-verified flowchart images from three distinct content sources, along with 22,413 diverse question-answer pairs, to test a spectrum of reasoning tasks, including information localization, decision-making, and logical progression. We conduct a thorough baseline evaluation on a suite of both open-source and proprietary multimodal language models using various strategies, followed by an analysis of directional bias. The results underscore the benchmark's potential as a vital tool for advancing the field of multimodal modeling, providing a focused and challenging environment for enhancing model performance in visual and logical reasoning tasks.
comment: Accepted in ACL 2024 (Findings), 21 pages, 7 figures, 9 Tables
♻ ☆ CSI4Free: GAN-Augmented mmWave CSI for Improved Pose Classification
In recent years, Joint Communication and Sensing (JC&S), has demonstrated significant success, particularly in utilizing sub-6 GHz frequencies with commercial-off-the-shelf (COTS) Wi-Fi devices for applications such as localization, gesture recognition, and pose classification. Deep learning and the existence of large public datasets has been pivotal in achieving such results. However, at mmWave frequencies (30-300 GHz), which has shown potential for more accurate sensing performance, there is a noticeable lack of research in the domain of COTS Wi-Fi sensing. Challenges such as limited research hardware, the absence of large datasets, limited functionality in COTS hardware, and the complexities of data collection present obstacles to a comprehensive exploration of this field. In this work, we aim to address these challenges by developing a method that can generate synthetic mmWave channel state information (CSI) samples. In particular, we use a generative adversarial network (GAN) on an existing dataset, to generate 30,000 additional CSI samples. The augmented samples exhibit a remarkable degree of consistency with the original data, as indicated by the notably high GAN-train and GAN-test scores. Furthermore, we integrate the augmented samples in training a pose classification model. We observe that the augmented samples complement the real data and improve the generalization of the classification model.
♻ ☆ All-In-One Medical Image Restoration via Task-Adaptive Routing MICCAI 2024
Although single-task medical image restoration (MedIR) has witnessed remarkable success, the limited generalizability of these methods poses a substantial obstacle to wider application. In this paper, we focus on the task of all-in-one medical image restoration, aiming to address multiple distinct MedIR tasks with a single universal model. Nonetheless, due to significant differences between different MedIR tasks, training a universal model often encounters task interference issues, where different tasks with shared parameters may conflict with each other in the gradient update direction. This task interference leads to deviation of the model update direction from the optimal path, thereby affecting the model's performance. To tackle this issue, we propose a task-adaptive routing strategy, allowing conflicting tasks to select different network paths in spatial and channel dimensions, thereby mitigating task interference. Experimental results demonstrate that our proposed \textbf{A}ll-in-one \textbf{M}edical \textbf{I}mage \textbf{R}estoration (\textbf{AMIR}) network achieves state-of-the-art performance in three MedIR tasks: MRI super-resolution, CT denoising, and PET synthesis, both in single-task and all-in-one settings. The code and data will be available at \href{https://github.com/Yaziwel/All-In-One-Medical-Image-Restoration-via-Task-Adaptive-Routing.git}{https://github.com/Yaziwel/AMIR}.
comment: This article has been early accepted by MICCAI 2024
♻ ☆ Revisiting Backdoor Attacks against Large Vision-Language Models
Instruction tuning enhances large vision-language models (LVLMs) but raises security risks through potential backdoor attacks due to their openness. Previous backdoor studies focus on enclosed scenarios with consistent training and testing instructions, neglecting the practical domain gaps that could affect attack effectiveness. This paper empirically examines the generalizability of backdoor attacks during the instruction tuning of LVLMs for the first time, revealing certain limitations of most backdoor strategies in practical scenarios. We quantitatively evaluate the generalizability of six typical backdoor attacks on image caption benchmarks across multiple LVLMs, considering both visual and textual domain offsets. Our findings indicate that attack generalizability is positively correlated with the backdoor trigger's irrelevance to specific images/models and the preferential correlation of the trigger pattern. Additionally, we modify existing backdoor attacks based on the above key observations, demonstrating significant improvements in cross-domain scenario generalizability (+86% attack success rate). Notably, even without access to the instruction datasets, a multimodal instruction set can be successfully poisoned with a very low poisoning rate (0.2%), achieving an attack success rate of over 97%. This paper underscores that even simple traditional backdoor strategies pose a serious threat to LVLMs, necessitating more attention and in-depth research.
comment: 23 pages, 8 figures
♻ ☆ Generative Autoencoding of Dropout Patterns
We propose a generative model termed Deciphering Autoencoders. In this model, we assign a unique random dropout pattern to each data point in the training dataset and then train an autoencoder to reconstruct the corresponding data point using this pattern as information to be encoded. Even if a completely random dropout pattern is assigned to each data point regardless of their similarities, a sufficiently large encoder can smoothly map them to a low-dimensional latent space to reconstruct individual training data points. During inference, using a dropout pattern different from those used during training allows the model to function as a generator. Since the training of Deciphering Autoencoders relies solely on reconstruction error, it offers more stable training compared to other generative models. Despite their simplicity, Deciphering Autoencoders show sampling quality comparable to DCGAN on the CIFAR-10 dataset.
♻ ☆ EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation CVPR 2024
In this report, we present our solutions to the EgoVis Challenges in CVPR 2024, including five tracks in the Ego4D challenge and three tracks in the EPIC-Kitchens challenge. Building upon the video-language two-tower model and leveraging our meticulously organized egocentric video data, we introduce a novel foundation model called EgoVideo. This model is specifically designed to cater to the unique characteristics of egocentric videos and provides strong support for our competition submissions. In the Ego4D challenges, we tackle various tasks including Natural Language Queries, Step Grounding, Moment Queries, Short-term Object Interaction Anticipation, and Long-term Action Anticipation. In addition, we also participate in the EPIC-Kitchens challenge, where we engage in the Action Recognition, Multiple Instance Retrieval, and Domain Adaptation for Action Recognition tracks. By adapting EgoVideo to these diverse tasks, we showcase its versatility and effectiveness in different egocentric video analysis scenarios, demonstrating the powerful representation ability of EgoVideo as an egocentric foundation model. Our codebase and pretrained models are publicly available at https://github.com/OpenGVLab/EgoVideo.
comment: Champion solutions in the EgoVis CVPR 2024 workshop
♻ ☆ AlignIT: Enhancing Prompt Alignment in Customization of Text-to-Image Models
We consider the problem of customizing text-to-image diffusion models with user-supplied reference images. Given new prompts, the existing methods can capture the key concept from the reference images but fail to align the generated image with the prompt. In this work, we seek to address this key issue by proposing new methods that can easily be used in conjunction with existing customization methods that optimize the embeddings/weights at various intermediate stages of the text encoding process. The first contribution of this paper is a dissection of the various stages of the text encoding process leading up to the conditioning vector for text-to-image models. We take a holistic view of existing customization methods and notice that key and value outputs from this process differs substantially from their corresponding baseline (non-customized) models (e.g., baseline stable diffusion). While this difference does not impact the concept being customized, it leads to other parts of the generated image not being aligned with the prompt. Further, we also observe that these keys and values allow independent control various aspects of the final generation, enabling semantic manipulation of the output. Taken together, the features spanning these keys and values, serve as the basis for our next contribution where we fix the aforementioned issues with existing methods. We propose a new post-processing algorithm, AlignIT, that infuses the keys and values for the concept of interest while ensuring the keys and values for all other tokens in the input prompt are unchanged. Our proposed method can be plugged in directly to existing customization methods, leading to a substantial performance improvement in the alignment of the final result with the input prompt while retaining the customization quality.
comment: 10 pages, 9 figures
♻ ☆ Character-Adapter: Prompt-Guided Region Control for High-Fidelity Character Customization
Customized image generation, which seeks to synthesize images with consistent characters, holds significant relevance for applications such as storytelling, portrait generation, and character design. However, previous approaches have encountered challenges in preserving characters with high-fidelity consistency due to inadequate feature extraction and concept confusion of reference characters. Therefore, we propose Character-Adapter, a plug-and-play framework designed to generate images that preserve the details of reference characters, ensuring high-fidelity consistency. Character-Adapter employs prompt-guided segmentation to ensure fine-grained regional features of reference characters and dynamic region-level adapters to mitigate concept confusion. Extensive experiments are conducted to validate the effectiveness of Character-Adapter. Both quantitative and qualitative results demonstrate that Character-Adapter achieves the state-of-the-art performance of consistent character generation, with an improvement of 24.8% compared with other methods. Our code will be released at https://github.com/Character-Adapter/Character-Adapte
♻ ☆ MolX: Enhancing Large Language Models for Molecular Learning with A Multi-Modal Extension
Recently, Large Language Models (LLMs) with their strong task-handling capabilities have shown remarkable advancements across a spectrum of fields, moving beyond natural language understanding. However, their proficiency within the chemistry domain remains restricted, especially in solving professional molecule-related tasks. This challenge is attributed to their inherent limitations in comprehending molecules using only common textual representations, i.e., SMILES strings. In this study, we seek to enhance the ability of LLMs to comprehend molecules by designing and equipping them with a multi-modal external module, namely MolX. In particular, instead of directly using a SMILES string to represent a molecule, we utilize specific encoders to extract fine-grained features from both SMILES string and 2D molecular graph representations for feeding into an LLM. Moreover, a human-defined molecular fingerprint is incorporated to leverage its embedded domain knowledge. Then, to establish an alignment between MolX and the LLM's textual input space, the whole model in which the LLM is frozen, is pre-trained with a versatile strategy including a diverse set of tasks. Extensive experimental evaluations demonstrate that our proposed method only introduces a small number of trainable parameters while outperforming baselines on various downstream molecule-related tasks ranging from molecule-to-text translation to retrosynthesis, with and without fine-tuning the LLM.
♻ ☆ AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation
The field of text-to-image (T2I) generation has made significant progress in recent years, largely driven by advancements in diffusion models. Linguistic control enables effective content creation, but struggles with fine-grained control over image generation. This challenge has been explored, to a great extent, by incorporating additional user-supplied spatial conditions, such as depth maps and edge maps, into pre-trained T2I models through extra encoding. However, multi-control image synthesis still faces several challenges. Specifically, current approaches are limited in handling free combinations of diverse input control signals, overlook the complex relationships among multiple spatial conditions, and often fail to maintain semantic alignment with provided textual prompts. This can lead to suboptimal user experiences. To address these challenges, we propose AnyControl, a multi-control image synthesis framework that supports arbitrary combinations of diverse control signals. AnyControl develops a novel Multi-Control Encoder that extracts a unified multi-modal embedding to guide the generation process. This approach enables a holistic understanding of user inputs, and produces high-quality, faithful results under versatile control signals, as demonstrated by extensive quantitative and qualitative evaluations. Our project page is available in https://any-control.github.io.
♻ ☆ Manipulate-Anything: Automating Real-World Robots using Vision-Language Models
Large-scale endeavors like RT-1 and widespread community efforts such as Open-X-Embodiment have contributed to growing the scale of robot demonstration data. However, there is still an opportunity to improve the quality, quantity, and diversity of robot demonstration data. Although vision-language models have been shown to automatically generate demonstration data, their utility has been limited to environments with privileged state information, they require hand-designed skills, and are limited to interactions with few object instances. We propose Manipulate-Anything, a scalable automated generation method for real-world robotic manipulation. Unlike prior work, our method can operate in real-world environments without any privileged state information, hand-designed skills, and can manipulate any static object. We evaluate our method using two setups. First, Manipulate-Anything successfully generates trajectories for all 5 real-world and 12 simulation tasks, significantly outperforming existing methods like VoxPoser. Second, Manipulate-Anything's demonstrations can train more robust behavior cloning policies than training with human demonstrations, or from data generated by VoxPoser and Code-As-Policies. We believe Manipulate-Anything can be the scalable method for both generating data for robotics and solving novel tasks in a zero-shot setting.
comment: Project page: https://robot-ma.github.io/
♻ ☆ Epicardium Prompt-guided Real-time Cardiac Ultrasound Frame-to-volume Registration MICCAI 2024
A comprehensive guidance view for cardiac interventional surgery can be provided by the real-time fusion of the intraoperative 2D images and preoperative 3D volume based on the ultrasound frame-to-volume registration. However, cardiac ultrasound images are characterized by a low signal-to-noise ratio and small differences between adjacent frames, coupled with significant dimension variations between 2D frames and 3D volumes to be registered, resulting in real-time and accurate cardiac ultrasound frame-to-volume registration being a very challenging task. This paper introduces a lightweight end-to-end Cardiac Ultrasound frame-to-volume Registration network, termed CU-Reg. Specifically, the proposed model leverages epicardium prompt-guided anatomical clues to reinforce the interaction of 2D sparse and 3D dense features, followed by a voxel-wise local-global aggregation of enhanced features, thereby boosting the cross-dimensional matching effectiveness of low-quality ultrasound modalities. We further embed an inter-frame discriminative regularization term within the hybrid supervised learning to increase the distinction between adjacent slices in the same ultrasound volume to ensure registration stability. Experimental results on the reprocessed CAMUS dataset demonstrate that our CU-Reg surpasses existing methods in terms of registration accuracy and efficiency, meeting the guidance requirements of clinical cardiac interventional surgery.
comment: This paper has been accepted by MICCAI 2024
♻ ☆ Solving the Inverse Problem of Electrocardiography for Cardiac Digital Twins: A Survey
Cardiac digital twins are personalized virtual representations used to understand complex heart mechanisms. Solving the ECG inverse problem is crucial for accurate virtual heart modelling, enabling the derivation of internal electrical activity information from recorded surface potentials. Despite challenges from cardiac complexity, noisy ECG data, and computational efficiency, recent advancements hold significant promise for enhancing virtual heart modelling, ultimately advancing precision medicine in cardiology. This paper aims to provide a comprehensive review of the methods of solving ECG inverse problem, the validation strategies, the clinical applications, and future perspectives. For the computing methodologies, we broadly classify state-of-the-art approaches into two categories: deterministic and probabilistic methods, including conventional and deep learning-based techniques. Integrating physics laws with deep learning models holds promise, but challenges such as capturing dynamic electrophysiology accurately, accessing accurate domain knowledge, and quantifying prediction uncertainty persist. Integrating models into clinical workflows while ensuring interpretability and usability for healthcare professionals is essential. Overcoming these challenges will drive further research in cardiac digital twins.
♻ ☆ Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID CVPR 2024
Text-to-image person re-identification (ReID) retrieves pedestrian images according to textual descriptions. Manually annotating textual descriptions is time-consuming, restricting the scale of existing datasets and therefore the generalization ability of ReID models. As a result, we study the transferable text-to-image ReID problem, where we train a model on our proposed large-scale database and directly deploy it to various datasets for evaluation. We obtain substantial training data via Multi-modal Large Language Models (MLLMs). Moreover, we identify and address two key challenges in utilizing the obtained textual descriptions. First, an MLLM tends to generate descriptions with similar structures, causing the model to overfit specific sentence patterns. Thus, we propose a novel method that uses MLLMs to caption images according to various templates. These templates are obtained using a multi-turn dialogue with a Large Language Model (LLM). Therefore, we can build a large-scale dataset with diverse textual descriptions. Second, an MLLM may produce incorrect descriptions. Hence, we introduce a novel method that automatically identifies words in a description that do not correspond with the image. This method is based on the similarity between one text and all patch token embeddings in the image. Then, we mask these words with a larger probability in the subsequent training epoch, alleviating the impact of noisy textual descriptions. The experimental results demonstrate that our methods significantly boost the direct transfer text-to-image ReID performance. Benefiting from the pre-trained model weights, we also achieve state-of-the-art performance in the traditional evaluation settings.
comment: CVPR 2024
Computation and Language 88
☆ Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs
Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio in a variety of understanding and generation tasks. However, current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code. To address this problem, we propose Web2Code, a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning and an evaluation framework for the webpage understanding and HTML code translation abilities of MLLMs. For dataset construction, we leverage pretrained LLMs to enhance existing webpage-to-code datasets as well as generate a diverse pool of new webpages rendered into images. Specifically, the inputs are webpage images and instructions, while the responses are the webpage's HTML code. We further include diverse natural language QA pairs about the webpage content in the responses to enable a more comprehensive understanding of the web content. To evaluate model performance in these tasks, we develop an evaluation framework for testing MLLMs' abilities in webpage understanding and web-to-code generation. Extensive experiments show that our proposed dataset is beneficial not only to our proposed tasks but also in the general visual domain, while previous datasets result in worse performance. We hope our work will contribute to the development of general MLLMs suitable for web-based content generation and task automation. Our data and code will be available at https://github.com/MBZUAI-LLM/web2code.
comment: Website at https://mbzuai-llm.github.io/webpage2code/
☆ LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Large Language Models (LLMs) equipped with extensive world knowledge and strong reasoning skills can tackle diverse tasks across domains, often by posing them as conversation-style instruction-response pairs. In this paper, we propose LLaRA: Large Language and Robotics Assistant, a framework which formulates robot action policy as conversations, and provides improved responses when trained with auxiliary data that complements policy learning. LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity to process state information as visual-textual prompts and generate optimal policy decisions in text. To train such action policy VLMs, we first introduce an automated pipeline to generate diverse high-quality robotics instruction data from existing behavior cloning data. A VLM finetuned with the resulting collection of datasets based on a conversation-style formulation tailored for robotics tasks, can generate meaningful robot action policy decisions. Our experiments across multiple simulated and real-world environments demonstrate the state-of-the-art performance of the proposed LLaRA framework. The code, datasets, and pretrained models are available at https://github.com/LostXine/LLaRA.
☆ Scaling Synthetic Data Creation with 1,000,000,000 Personas
We propose a novel persona-driven data synthesis methodology that leverages various perspectives within a large language model (LLM) to create diverse synthetic data. To fully exploit this methodology at scale, we introduce Persona Hub -- a collection of 1 billion diverse personas automatically curated from web data. These 1 billion personas (~13% of the world's total population), acting as distributed carriers of world knowledge, can tap into almost every perspective encapsulated within the LLM, thereby facilitating the creation of diverse synthetic data at scale for various scenarios. By showcasing Persona Hub's use cases in synthesizing high-quality mathematical and logical reasoning problems, instructions (i.e., user prompts), knowledge-rich texts, game NPCs and tools (functions) at scale, we demonstrate persona-driven data synthesis is versatile, scalable, flexible, and easy to use, potentially driving a paradigm shift in synthetic data creation and applications in practice, which may have a profound impact on LLM research and development.
comment: Work in progress
☆ ProgressGym: Alignment with a Millennium of Moral Progress
Frontier AI systems, including large language models (LLMs), hold increasing influence over the epistemology of human users. Such influence can reinforce prevailing societal values, potentially contributing to the lock-in of misguided moral beliefs and, consequently, the perpetuation of problematic moral practices on a broad scale. We introduce progress alignment as a technical solution to mitigate this imminent risk. Progress alignment algorithms learn to emulate the mechanics of human moral progress, thereby addressing the susceptibility of existing alignment methods to contemporary moral blindspots. To empower research in progress alignment, we introduce ProgressGym, an experimental framework allowing the learning of moral progress mechanics from history, in order to facilitate future progress in real-world moral decisions. Leveraging 9 centuries of historical text and 18 historical LLMs, ProgressGym enables codification of real-world progress alignment challenges into concrete benchmarks. Specifically, we introduce three core challenges: tracking evolving values (PG-Follow), preemptively anticipating moral progress (PG-Predict), and regulating the feedback loop between human and AI value shifts (PG-Coevolve). Alignment methods without a temporal dimension are inapplicable to these tasks. In response, we present lifelong and extrapolative algorithms as baseline methods of progress alignment, and build an open leaderboard soliciting novel algorithms and challenges. The framework and the leaderboard are available at https://github.com/PKU-Alignment/ProgressGym and https://huggingface.co/spaces/PKU-Alignment/ProgressGym-LeaderBoard respectively.
☆ Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs
LLMs process text as sequences of tokens that roughly correspond to words, where less common words are represented by multiple tokens. However, individual tokens are often semantically unrelated to the meanings of the words/concepts they comprise. For example, Llama-2-7b's tokenizer splits the word "northeastern" into the tokens ['_n', 'ort', 'he', 'astern'], none of which correspond to semantically meaningful units like "north" or "east." Similarly, the overall meanings of named entities like "Neil Young" and multi-word expressions like "break a leg" cannot be directly inferred from their constituent tokens. Mechanistically, how do LLMs convert such arbitrary groups of tokens into useful higher-level representations? In this work, we find that last token representations of named entities and multi-token words exhibit a pronounced "erasure" effect, where information about previous and current tokens is rapidly forgotten in early layers. Using this observation, we propose a method to "read out" the implicit vocabulary of an autoregressive LLM by examining differences in token representations across layers, and present results of this method for Llama-2-7b and Llama-3-8B. To our knowledge, this is the first attempt to probe the implicit vocabulary of an LLM.
comment: 13 pages, 14 figures. Code and data at https://footprints.baulab.info/
☆ Molecular Facts: Desiderata for Decontextualization in LLM Fact Verification
Automatic factuality verification of large language model (LLM) generations is becoming more and more widely used to combat hallucinations. A major point of tension in the literature is the granularity of this fact-checking: larger chunks of text are hard to fact-check, but more atomic facts like propositions may lack context to interpret correctly. In this work, we assess the role of context in these atomic facts. We argue that fully atomic facts are not the right representation, and define two criteria for molecular facts: decontextuality, or how well they can stand alone, and minimality, or how little extra information is added to achieve decontexuality. We quantify the impact of decontextualization on minimality, then present a baseline methodology for generating molecular facts automatically, aiming to add the right amount of information. We compare against various methods of decontextualization and find that molecular facts balance minimality with fact verification accuracy in ambiguous settings.
☆ Applying RLAIF for Code Generation with API-usage in Lightweight LLMs
Reinforcement Learning from AI Feedback (RLAIF) has demonstrated significant potential across various domains, including mitigating harm in LLM outputs, enhancing text summarization, and mathematical reasoning. This paper introduces an RLAIF framework for improving the code generation abilities of lightweight (<1B parameters) LLMs. We specifically focus on code generation tasks that require writing appropriate API calls, which is challenging due to the well-known issue of hallucination in LLMs. Our framework extracts AI feedback from a larger LLM (e.g., GPT-3.5) through a specialized prompting strategy and uses this data to train a reward model towards better alignment from smaller LLMs. We run our experiments on the Gorilla dataset and meticulously assess the quality of the model-generated code across various metrics, including AST, ROUGE, and Code-BLEU, and develop a pipeline to compute its executability rate accurately. Our approach significantly enhances the fine-tuned LLM baseline's performance, achieving a 4.5% improvement in executability rate. Notably, a smaller LLM model (780M parameters) trained with RLAIF surpasses a much larger fine-tuned baseline with 7B parameters, achieving a 1.0% higher code executability rate.
☆ To Word Senses and Beyond: Inducing Concepts with Contextualized Language Models
Polysemy and synonymy are two crucial interrelated facets of lexical ambiguity. While both phenomena have been studied extensively in NLP, leading to dedicated systems, they are often been considered independently. While many tasks dealing with polysemy (e.g. Word Sense Disambiguiation or Induction) highlight the role of a word's senses, the study of synonymy is rooted in the study of concepts, i.e. meaning shared across the lexicon. In this paper, we introduce Concept Induction, the unsupervised task of learning a soft clustering among words that defines a set of concepts directly from data. This task generalizes that of Word Sense Induction. We propose a bi-level approach to Concept Induction that leverages both a local lemma-centric view and a global cross-lexicon perspective to induce concepts. We evaluate the obtained clustering on SemCor's annotated data and obtain good performances (BCubed F1 above 0.60). We find that the local and the global levels are mutually beneficial to induce concepts and also senses in our setting. Finally, we create static embeddings representing our induced concepts and use them on the Word-in-Context task, obtaining competitive performances with the State-of-the-Art.
☆ Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. However, such access may also let malicious actors undermine model safety. To demonstrate the challenge of defending finetuning interfaces, we introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection. Our method constructs a malicious dataset where every individual datapoint appears innocuous, but finetuning on the dataset teaches the model to respond to encoded harmful requests with encoded harmful responses. Applied to GPT-4, our method produces a finetuned model that acts on harmful instructions 99% of the time and avoids detection by defense mechanisms such as dataset inspection, safety evaluations, and input/output classifiers. Our findings question whether black-box finetuning access can be secured against sophisticated adversaries.
comment: 22 pages
☆ Understanding and Mitigating Language Confusion in LLMs
We investigate a surprising limitation of LLMs: their inability to consistently generate text in a user's desired language. We create the Language Confusion Benchmark (LCB) to evaluate such failures, covering 15 typologically diverse languages with existing and newly-created English and multilingual prompts. We evaluate a range of LLMs on monolingual and cross-lingual generation reflecting practical use cases, finding that Llama Instruct and Mistral models exhibit high degrees of language confusion and even the strongest models fail to consistently respond in the correct language. We observe that base and English-centric instruct models are more prone to language confusion, which is aggravated by complex prompts and high sampling temperatures. We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning. We release our language confusion benchmark, which serves as a first layer of efficient, scalable multilingual evaluation at https://github.com/for-ai/language-confusion.
☆ BioMNER: A Dataset for Biomedical Method Entity Recognition
Named entity recognition (NER) stands as a fundamental and pivotal task within the realm of Natural Language Processing. Particularly within the domain of Biomedical Method NER, this task presents notable challenges, stemming from the continual influx of domain-specific terminologies in scholarly literature. Current research in Biomedical Method (BioMethod) NER suffers from a scarcity of resources, primarily attributed to the intricate nature of methodological concepts, which necessitate a profound understanding for precise delineation. In this study, we propose a novel dataset for biomedical method entity recognition, employing an automated BioMethod entity recognition and information retrieval system to assist human annotation. Furthermore, we comprehensively explore a range of conventional and contemporary open-domain NER methodologies, including the utilization of cutting-edge large-scale language models (LLMs) customised to our dataset. Our empirical findings reveal that the large parameter counts of language models surprisingly inhibit the effective assimilation of entity extraction patterns pertaining to biomedical methods. Remarkably, the approach, leveraging the modestly sized ALBERT model (only 11MB), in conjunction with conditional random fields (CRF), achieves state-of-the-art (SOTA) performance.
☆ LEMoE: Advanced Mixture of Experts Adaptor for Lifelong Model Editing of Large Language Models
Large language models (LLMs) require continual knowledge updates to stay abreast of the ever-changing world facts, prompting the formulation of lifelong model editing task. While recent years have witnessed the development of various techniques for single and batch editing, these methods either fail to apply or perform sub-optimally when faced with lifelong editing. In this paper, we introduce LEMoE, an advanced Mixture of Experts (MoE) adaptor for lifelong model editing. We first analyze the factors influencing the effectiveness of conventional MoE adaptor in lifelong editing, including catastrophic forgetting, inconsistent routing and order sensitivity. Based on these insights, we propose a tailored module insertion method to achieve lifelong editing, incorporating a novel KV anchor routing to enhance routing consistency between training and inference stage, along with a concise yet effective clustering-based editing order planning. Experimental results demonstrate the effectiveness of our method in lifelong editing, surpassing previous model editing techniques while maintaining outstanding performance in batch editing task. Our code will be available.
☆ ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models
Tool-augmented large language models (LLMs) are rapidly being integrated into real-world applications. Due to the lack of benchmarks, the community still needs to fully understand the hallucination issues within these models. To address this challenge, we introduce a comprehensive diagnostic benchmark, ToolBH. Specifically, we assess the LLM's hallucinations through two perspectives: depth and breadth. In terms of depth, we propose a multi-level diagnostic process, including (1) solvability detection, (2) solution planning, and (3) missing-tool analysis. For breadth, we consider three scenarios based on the characteristics of the toolset: missing necessary tools, potential tools, and limited functionality tools. Furthermore, we developed seven tasks and collected 700 evaluation samples through multiple rounds of manual annotation. The results show the significant challenges presented by the ToolBH benchmark. The current advanced models Gemini-1.5-Pro and GPT-4o only achieve a total score of 45.3 and 37.0, respectively, on a scale of 100. In this benchmark, larger model parameters do not guarantee better performance; the training data and response strategies also play a crucial role in tool-enhanced LLM scenarios. Our diagnostic analysis indicates that the primary reason for model errors lies in assessing task solvability. Additionally, open-weight models suffer from performance drops with verbose replies, whereas proprietary models excel with longer reasoning.
☆ The SIFo Benchmark: Investigating the Sequential Instruction Following Ability of Large Language Models
Following multiple instructions is a crucial ability for large language models (LLMs). Evaluating this ability comes with significant challenges: (i) limited coherence between multiple instructions, (ii) positional bias where the order of instructions affects model performance, and (iii) a lack of objectively verifiable tasks. To address these issues, we introduce a benchmark designed to evaluate models' abilities to follow multiple instructions through sequential instruction following (SIFo) tasks. In SIFo, the successful completion of multiple instructions is verifiable by examining only the final instruction. Our benchmark evaluates instruction following using four tasks (text modification, question answering, mathematics, and security rule following), each assessing different aspects of sequential instruction following. Our evaluation of popular LLMs, both closed-source and open-source, shows that more recent and larger models significantly outperform their older and smaller counterparts on the SIFo tasks, validating the benchmark's effectiveness. All models struggle with following sequences of instructions, hinting at an important lack of robustness of today's language models.
☆ Single Parent Family: A Spectrum of Family Members from a Single Pre-Trained Foundation Model
This paper introduces a novel method of Progressive Low Rank Decomposition (PLRD) tailored for the compression of large language models. Our approach leverages a pre-trained model, which is then incrementally decompressed to smaller sizes using progressively lower ranks. This method allows for significant reductions in computational overhead and energy consumption, as subsequent models are derived from the original without the need for retraining from scratch. We detail the implementation of PLRD, which strategically decreases the tensor ranks, thus optimizing the trade-off between model performance and resource usage. The efficacy of PLRD is demonstrated through extensive experiments showing that models trained with PLRD method on only 1B tokens maintain comparable performance with traditionally trained models while using 0.1% of the tokens. The versatility of PLRD is highlighted by its ability to generate multiple model sizes from a single foundational model, adapting fluidly to varying computational and memory budgets. Our findings suggest that PLRD could set a new standard for the efficient scaling of LLMs, making advanced AI more feasible on diverse platforms.
☆ Into the Unknown: Generating Geospatial Descriptions for New Environments
Similar to vision-and-language navigation (VLN) tasks that focus on bridging the gap between vision and language for embodied navigation, the new Rendezvous (RVS) task requires reasoning over allocentric spatial relationships (independent of the observer's viewpoint) using non-sequential navigation instructions and maps. However, performance substantially drops in new environments with no training data. Using opensource descriptions paired with coordinates (e.g., Wikipedia) provides training data but suffers from limited spatially-oriented text resulting in low geolocation resolution. We propose a large-scale augmentation method for generating high-quality synthetic data for new environments using readily available geospatial data. Our method constructs a grounded knowledge-graph, capturing entity relationships. Sampled entities and relations (`shop north of school') generate navigation instructions via (i) generating numerous templates using context-free grammar (CFG) to embed specific entities and relations; (ii) feeding the entities and relation into a large language model (LLM) for instruction generation. A comprehensive evaluation on RVS, showed that our approach improves the 100-meter accuracy by 45.83% on unseen environments. Furthermore, we demonstrate that models trained with CFG-based augmentation achieve superior performance compared with those trained with LLM-based augmentation, both in unseen and seen environments. These findings suggest that the potential advantages of explicitly structuring spatial information for text-based geospatial reasoning in previously unknown, can unlock data-scarce scenarios.
☆ Simulating Financial Market via Large Language Model based Agents
Most economic theories typically assume that financial market participants are fully rational individuals and use mathematical models to simulate human behavior in financial markets. However, human behavior is often not entirely rational and is challenging to predict accurately with mathematical models. In this paper, we propose \textbf{A}gent-based \textbf{S}imulated \textbf{F}inancial \textbf{M}arket (ASFM), which first constructs a simulated stock market with a real order matching system. Then, we propose a large language model based agent as the stock trader, which contains the profile, observation, and tool-learning based action module. The trading agent can comprehensively understand current market dynamics and financial policy information, and make decisions that align with their trading strategy. In the experiments, we first verify that the reactions of our ASFM are consistent with the real stock market in two controllable scenarios. In addition, we also conduct experiments in two popular economics research directions, and we find that conclusions drawn in our \model align with the preliminary findings in economics research. Based on these observations, we believe our proposed ASFM provides a new paradigm for economic research.
☆ BESTOW: Efficient and Streamable Speech Language Model with the Best of Two Worlds in GPT and T5
Incorporating speech understanding capabilities into pretrained large-language models has become a vital research direction (SpeechLLM). The previous architectures can be categorized as: i) GPT-style, prepend speech prompts to the text prompts as a sequence of LLM inputs like a decoder-only model; ii) T5-style, introduce speech cross-attention to each layer of the pretrained LLMs. We propose BESTOW architecture to bring the BESt features from TwO Worlds into a single model that is highly efficient and has strong multitask capabilities. Moreover, there is no clear streaming solution for either style, especially considering the solution should generalize to speech multitask. We reformulate streamable SpeechLLM as a read-write policy problem and unifies the offline and streaming research with BESTOW architecture. Hence we demonstrate the first open-source SpeechLLM solution that enables Streaming and Multitask at scale (beyond ASR) at the same time. This streamable solution achieves very strong performance on a wide range of speech tasks (ASR, AST, SQA, unseen DynamicSuperb). It is end-to-end optimizable, with lower training/inference cost, and demonstrates LLM knowledge transferability to speech.
☆ Mining Reasons For And Against Vaccination From Unstructured Data Using Nichesourcing and AI Data Augmentation
We present Reasons For and Against Vaccination (RFAV), a dataset for predicting reasons for and against vaccination, and scientific authorities used to justify them, annotated through nichesourcing and augmented using GPT4 and GPT3.5-Turbo. We show how it is possible to mine these reasons in non-structured text, under different task definitions, despite the high level of subjectivity involved and explore the impact of artificially augmented data using in-context learning with GPT4 and GPT3.5-Turbo. We publish the dataset and the trained models along with the annotation manual used to train annotators and define the task.
comment: 8 pages + references and appendix
☆ Calibrating LLMs with Preference Optimization on Thought Trees for Generating Rationale in Science Question Scoring
Generating rationales that justify scoring decisions has been a promising way to facilitate explainability in automated scoring systems. However, existing methods do not match the accuracy of classifier-based methods. Plus, the generated rationales often contain hallucinated information. To address these issues, we propose a novel framework capable of generating more faithful rationales and, more importantly, matching performance with classifier-based black-box scoring systems. We first mimic the human assessment process by querying Large Language Models (LLMs) to generate a thought tree. We then summarise intermediate assessment decisions from each thought tree path for creating synthetic rationale data and rationale preference data. Finally, we utilise the generated synthetic data to calibrate LLMs through a two-step training process: supervised fine-tuning and preference optimization. Extensive experimental results demonstrate that our framework achieves a 38% assessment performance improvement in the QWK score compared to prior work while producing higher-quality rationales, as recognised by human evaluators and LLMs. Our work sheds light on the effectiveness of performing preference optimization using synthetic preference data obtained from thought tree paths.
☆ From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis
We explore multi-step reasoning in vision-language models (VLMs). The problem is challenging, as reasoning data consisting of multiple steps of visual and language processing are barely available. To overcome the challenge, we first introduce a least-to-most visual reasoning paradigm, which interleaves steps of decomposing a question into sub-questions and invoking external tools for resolving sub-questions. Based on the paradigm, we further propose a novel data synthesis approach that can automatically create questions and multi-step reasoning paths for an image in a bottom-up manner. Our approach divides the complex synthesis task into a few simple sub-tasks, and (almost entirely) relies on open-sourced models to accomplish the sub-tasks. Therefore, the entire synthesis process is reproducible and cost-efficient, and the synthesized data is quality guaranteed. With the approach, we construct $50$k visual reasoning examples. Then, we develop a visual reasoner through supervised fine-tuning, which is capable of generally enhancing the reasoning abilities of a wide range of existing VLMs in a plug-and-play fashion. Extensive experiments indicate that the visual reasoner can consistently and significantly improve four VLMs on four VQA benchmarks. Our code and dataset are available at https://github.com/steven-ccq/VisualReasoner.
☆ Interactive Topic Models with Optimal Transport
Topic models are widely used to analyze document collections. While they are valuable for discovering latent topics in a corpus when analysts are unfamiliar with the corpus, analysts also commonly start with an understanding of the content present in a corpus. This may be through categories obtained from an initial pass over the corpus or a desire to analyze the corpus through a predefined set of categories derived from a high level theoretical framework (e.g. political ideology). In these scenarios analysts desire a topic modeling approach which incorporates their understanding of the corpus while supporting various forms of interaction with the model. In this work, we present EdTM, as an approach for label name supervised topic modeling. EdTM models topic modeling as an assignment problem while leveraging LM/LLM based document-topic affinities and using optimal transport for making globally coherent topic-assignments. In experiments, we show the efficacy of our framework compared to few-shot LLM classifiers, and topic models based on clustering and LDA. Further, we show EdTM's ability to incorporate various forms of analyst feedback and while remaining robust to noisy analyst inputs.
comment: Pre-print; Work in progress
☆ Paraphrase Types Elicit Prompt Engineering Capabilities
Much of the success of modern language models depends on finding a suitable prompt to instruct the model. Until now, it has been largely unknown how variations in the linguistic expression of prompts affect these models. This study systematically and empirically evaluates which linguistic features influence models through paraphrase types, i.e., different linguistic changes at particular positions. We measure behavioral changes for five models across 120 tasks and six families of paraphrases (i.e., morphology, syntax, lexicon, lexico-syntax, discourse, and others). We also control for other prompt engineering factors (e.g., prompt length, lexical diversity, and proximity to training data). Our results show a potential for language models to improve tasks when their prompts are adapted in specific paraphrase types (e.g., 6.7% median gain in Mixtral 8x7B; 5.5% in LLaMA 3 8B). In particular, changes in morphology and lexicon, i.e., the vocabulary used, showed promise in improving prompts. These findings contribute to developing more robust language models capable of handling variability in linguistic expression.
☆ Untangling the Unrestricted Web: Automatic Identification of Multilingual Registers
This article explores deep learning models for the automatic identification of registers - text varieties such as news reports and discussion forums - in web-based datasets across 16 languages. Web register (or genre) identification would provide a robust solution for understanding the content of web-scale datasets, which have become crucial in computational linguistics. Despite recent advances, the potential of register classifiers on the noisy web remains largely unexplored, particularly in multilingual settings and when targeting the entire unrestricted web. We experiment with a range of deep learning models using the new Multilingual CORE corpora, which includes 16 languages annotated using a detailed, hierarchical taxonomy of 25 registers designed to cover the entire unrestricted web. Our models achieve state-of-the-art results, showing that a detailed taxonomy in a hierarchical multi-label setting can yield competitive classification performance. However, all models hit a glass ceiling at approximately 80% F1 score, which we attribute to the non-discrete nature of web registers and the inherent uncertainty in labeling some documents. By pruning ambiguous examples, we improve model performance to over 90%. Finally, multilingual models outperform monolingual ones, particularly benefiting languages with fewer training examples and smaller registers. Although a zero-shot setting decreases performance by an average of 7%, these drops are not linked to specific registers or languages. Instead, registers show surprising similarity across languages.
☆ Investigating the Timescales of Language Processing with EEG and Language Models
This study explores the temporal dynamics of language processing by examining the alignment between word representations from a pre-trained transformer-based language model, and EEG data. Using a Temporal Response Function (TRF) model, we investigate how neural activity corresponds to model representations across different layers, revealing insights into the interaction between artificial language models and brain responses during language comprehension. Our analysis reveals patterns in TRFs from distinct layers, highlighting varying contributions to lexical and compositional processing. Additionally, we used linear discriminant analysis (LDA) to isolate part-of-speech (POS) representations, offering insights into their influence on neural responses and the underlying mechanisms of syntactic processing. These findings underscore EEG's utility for probing language processing dynamics with high temporal resolution. By bridging artificial language models and neural activity, this study advances our understanding of their interaction at fine timescales.
comment: Accepted at the 2024 Conference on Cognitive Computational Neuroscience (CCN 2024)
☆ Detecting Subtle Differences between Human and Model Languages Using Spectrum of Relative Likelihood
Human and model-generated texts can be distinguished by examining the magnitude of likelihood in language. However, it is becoming increasingly difficult as language model's capabilities of generating human-like texts keep evolving. This study provides a new perspective by using the relative likelihood values instead of absolute ones, and extracting useful features from the spectrum-view of likelihood for the human-model text detection task. We propose a detection procedure with two classification methods, supervised and heuristic-based, respectively, which results in competitive performances with previous zero-shot detection methods and a new state-of-the-art on short-text detection. Our method can also reveal subtle differences between human and model languages, which find theoretical roots in psycholinguistics studies. Our code is available at https://github.com/CLCS-SUSTech/FourierGPT
comment: 13 pages, 12 figures
☆ YuLan: An Open-source Large Language Model
Large language models (LLMs) have become the foundation of many applications, leveraging their extensive capabilities in processing and understanding natural language. While many open-source LLMs have been released with technical reports, the lack of training details hinders further research and development. This paper presents the development of YuLan, a series of open-source LLMs with $12$ billion parameters. The base model of YuLan is pre-trained on approximately $1.7$T tokens derived from a diverse corpus, including massive English, Chinese, and multilingual texts. We design a three-stage pre-training method to enhance YuLan's overall capabilities. Subsequent phases of training incorporate instruction-tuning and human alignment, employing a substantial volume of high-quality synthesized data. To facilitate the learning of complex and long-tail knowledge, we devise a curriculum-learning framework throughout across these stages, which helps LLMs learn knowledge in an easy-to-hard manner. YuLan's training is finished on Jan, 2024 and has achieved performance on par with state-of-the-art LLMs across various English and Chinese benchmarks. This paper outlines a comprehensive technical roadmap for developing LLMs from scratch. Our model and codes are available at https://github.com/RUC-GSAI/YuLan-Chat.
☆ AnomaLLMy -- Detecting anomalous tokens in black-box LLMs through low-confidence single-token predictions
This paper introduces AnomaLLMy, a novel technique for the automatic detection of anomalous tokens in black-box Large Language Models (LLMs) with API-only access. Utilizing low-confidence single-token predictions as a cost-effective indicator, AnomaLLMy identifies irregularities in model behavior, addressing the issue of anomalous tokens degrading the quality and reliability of models. Validated on the cl100k_base dataset, the token set of GPT-4, AnomaLLMy detected 413 major and 65 minor anomalies, demonstrating the method's efficiency with just \$24.39 spent in API credits. The insights from this research are expected to be beneficial for enhancing the robustness of and accuracy of LLMs, particularly in the development and assessment of tokenizers.
comment: 6 pages
☆ BeamAggR: Beam Aggregation Reasoning over Multi-source Knowledge for Multi-hop Question Answering ACL 2024
Large language models (LLMs) have demonstrated strong reasoning capabilities. Nevertheless, they still suffer from factual errors when tackling knowledge-intensive tasks. Retrieval-augmented reasoning represents a promising approach. However, significant challenges still persist, including inaccurate and insufficient retrieval for complex questions, as well as difficulty in integrating multi-source knowledge. To address this, we propose Beam Aggregation Reasoning, BeamAggR, a reasoning framework for knowledge-intensive multi-hop QA. BeamAggR explores and prioritizes promising answers at each hop of question. Concretely, we parse the complex questions into trees, which include atom and composite questions, followed by bottom-up reasoning. For atomic questions, the LLM conducts reasoning on multi-source knowledge to get answer candidates. For composite questions, the LLM combines beam candidates, explores multiple reasoning paths through probabilistic aggregation, and prioritizes the most promising trajectory. Extensive experiments on four open-domain multi-hop reasoning datasets show that our method significantly outperforms SOTA methods by 8.5%. Furthermore, our analysis reveals that BeamAggR elicits better knowledge collaboration and answer aggregation.
comment: Accepted to ACL 2024
☆ Scalable and Domain-General Abstractive Proposition Segmentation
Segmenting text into fine-grained units of meaning is important to a wide range of NLP applications. The default approach of segmenting text into sentences is often insufficient, especially since sentences are usually complex enough to include multiple units of meaning that merit separate treatment in the downstream task. We focus on the task of abstractive proposition segmentation: transforming text into simple, self-contained, well-formed sentences. Several recent works have demonstrated the utility of proposition segmentation with few-shot prompted LLMs for downstream tasks such as retrieval-augmented grounding and fact verification. However, this approach does not scale to large amounts of text and may not always extract all the facts from the input text. In this paper, we first introduce evaluation metrics for the task to measure several dimensions of quality. We then propose a scalable, yet accurate, proposition segmentation model. We model proposition segmentation as a supervised task by training LLMs on existing annotated datasets and show that training yields significantly improved results. We further show that by using the fine-tuned LLMs as teachers for annotating large amounts of multi-domain synthetic distillation data, we can train smaller student models with results similar to the teacher LLMs. We then demonstrate that our technique leads to effective domain generalization, by annotating data in two domains outside the original training data and evaluating on them. Finally, as a key contribution of the paper, we share an easy-to-use API for NLP practitioners to use.
☆ NLPerturbator: Studying the Robustness of Code LLMs to Natural Language Variations
Large language models (LLMs) achieve promising results in code generation based on a given natural language description. They have been integrated into open-source projects and commercial products to facilitate daily coding activities. The natural language description in the prompt is crucial for LLMs to comprehend users' requirements. Prior studies uncover that LLMs are sensitive to the changes in the prompts, including slight changes that look inconspicuous. However, the natural language descriptions often vary in real-world scenarios (e.g., different formats, grammar, and wording). Prior studies on the robustness of LLMs are often based on random perturbations and such perturbations may not actually happen. In this paper, we conduct a comprehensive study to investigate how are code LLMs robust to variations of natural language description in real-world scenarios. We summarize 18 categories of perturbations of natural language and 3 combinations of co-occurred categories based on our literature review and an online survey with practitioners. We propose an automated framework, NLPerturbator, which can perform perturbations of each category given a set of prompts. Through a series of experiments on code generation using six code LLMs, we find that the perturbed prompts can decrease the performance of code generation by a considerable margin (e.g., up to 21.2%, and 4.8% to 6.1% on average). Our study highlights the importance of enhancing the robustness of LLMs to real-world variations in the prompts, as well as the essentiality of attentively constructing the prompts.
☆ Direct Preference Knowledge Distillation for Large Language Models
In the field of large language models (LLMs), Knowledge Distillation (KD) is a critical technique for transferring capabilities from teacher models to student models. However, existing KD methods face limitations and challenges in distillation of LLMs, including efficiency and insufficient measurement capabilities of traditional KL divergence. It is shown that LLMs can serve as an implicit reward function, which we define as a supplement to KL divergence. In this work, we propose Direct Preference Knowledge Distillation (DPKD) for LLMs. DPKD utilizes distribution divergence to represent the preference loss and implicit reward function. We re-formulate KD of LLMs into two stages: first optimizing and objective consisting of implicit reward and reverse KL divergence and then improving the preference probability of teacher outputs over student outputs. We conducted experiments and analysis on various datasets with LLM parameters ranging from 120M to 13B and demonstrate the broad applicability and effectiveness of our DPKD approach. Meanwhile, we prove the value and effectiveness of the introduced implicit reward and output preference in KD through experiments and theoretical analysis. The DPKD method outperforms the baseline method in both output response precision and exact match percentage. Code and data are available at https://aka.ms/dpkd.
☆ Belief Revision: The Adaptability of Large Language Models Reasoning
The capability to reason from text is crucial for real-world NLP applications. Real-world scenarios often involve incomplete or evolving data. In response, individuals update their beliefs and understandings accordingly. However, most existing evaluations assume that language models (LMs) operate with consistent information. We introduce Belief-R, a new dataset designed to test LMs' belief revision ability when presented with new evidence. Inspired by how humans suppress prior inferences, this task assesses LMs within the newly proposed delta reasoning ($\Delta R$) framework. Belief-R features sequences of premises designed to simulate scenarios where additional information could necessitate prior conclusions drawn by LMs. We evaluate $\sim$30 LMs across diverse prompting strategies and found that LMs generally struggle to appropriately revise their beliefs in response to new information. Further, models adept at updating often underperformed in scenarios without necessary updates, highlighting a critical trade-off. These insights underscore the importance of improving LMs' adaptiveness to changing information, a step toward more reliable AI systems.
☆ Learning Interpretable Legal Case Retrieval via Knowledge-Guided Case Reformulation
Legal case retrieval for sourcing similar cases is critical in upholding judicial fairness. Different from general web search, legal case retrieval involves processing lengthy, complex, and highly specialized legal documents. Existing methods in this domain often overlook the incorporation of legal expert knowledge, which is crucial for accurately understanding and modeling legal cases, leading to unsatisfactory retrieval performance. This paper introduces KELLER, a legal knowledge-guided case reformulation approach based on large language models (LLMs) for effective and interpretable legal case retrieval. By incorporating professional legal knowledge about crimes and law articles, we enable large language models to accurately reformulate the original legal case into concise sub-facts of crimes, which contain the essential information of the case. Extensive experiments on two legal case retrieval benchmarks demonstrate superior retrieval performance and robustness on complex legal case queries of KELLER over existing methods.
☆ Breaking the Script Barrier in Multilingual Pre-Trained Language Models with Transliteration-Based Post-Training Alignment
Multilingual pre-trained models (mPLMs) have shown impressive performance on cross-lingual transfer tasks. However, the transfer performance is often hindered when a low-resource target language is written in a different script than the high-resource source language, even though the two languages may be related or share parts of their vocabularies. Inspired by recent work that uses transliteration to address this problem, our paper proposes a transliteration-based post-pretraining alignment (PPA) method aiming to improve the cross-lingual alignment between languages using diverse scripts. We select two areal language groups, $\textbf{Mediterranean-Amharic-Farsi}$ and $\textbf{South+East Asian Languages}$, wherein the languages are mutually influenced but use different scripts. We apply our method to these language groups and conduct extensive experiments on a spectrum of downstream tasks. The results show that after PPA, models consistently outperform the original model (up to 50% for some tasks) in English-centric transfer. In addition, when we use languages other than English as sources in transfer, our method obtains even larger improvements. We will make our code and models publicly available at \url{https://github.com/cisnlp/Transliteration-PPA}.
comment: preprint
☆ MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment
This paper introduces MM-Instruct, a large-scale dataset of diverse and high-quality visual instruction data designed to enhance the instruction-following capabilities of large multimodal models (LMMs). While existing visual instruction datasets often focus on question-answering, they struggle to generalize to broader application scenarios such as creative writing, summarization, or image analysis. To address these limitations, we propose a novel approach to constructing MM-Instruct that leverages the strong instruction-following capabilities of existing LLMs to generate novel visual instruction data from large-scale but conventional image captioning datasets. MM-Instruct first leverages ChatGPT to automatically generate diverse instructions from a small set of seed instructions through augmenting and summarization. It then matches these instructions with images and uses an open-sourced large language model (LLM) to generate coherent answers to the instruction-image pairs. The LLM is grounded by the detailed text descriptions of images in the whole answer generation process to guarantee the alignment of the instruction data. Moreover, we introduce a benchmark based on the generated instruction data to evaluate the instruction-following capabilities of existing LMMs. We demonstrate the effectiveness of MM-Instruct by training a LLaVA-1.5 model on the generated data, denoted as LLaVA-Instruct, which exhibits significant improvements in instruction-following capabilities compared to LLaVA-1.5 models. The MM-Instruct dataset, benchmark, and pre-trained models are available at https://github.com/jihaonew/MM-Instruct.
comment: Dataset and models are available at https://github.com/jihaonew/MM-Instruct
☆ Message du troisi{è}me type : irruption d'un tiers dans un dialogue en ligne
Our study focuses on Wikipedia talk pages, from a global perspective analyzing contributors' behaviors in online interactions. Using a corpus comprising all Wikipedia talk pages in French, totaling more than 300,000 discussion threads, we examine how discussions with more than two participants (multiparty conversation) unfold and we specifically investigate the role of a third participant's intervention when two Wikipedians have already initiated an exchange. In this regard, we concentrate on the sequential structure of these interactions in terms of articulation among different participants and aim to specify this third message by exploring its lexical particularities, while also proposing an initial typology of the third participant's message role and how it aligns with preceding messages.
comment: in French language. JADT 2024 - 17es Journ{\'e}es internationales d'Analyse statistique des Donn{\'e}es Textuelles, SeSLa (S{\'e}minaire des Sciences du Langage de l'UCLouvain -- Site Saint-Louis); LASLA (Laboratoire d'Analyse statistique des Langues anciennes de l'Universit{\'e} de Li{\`e}ge), 2024, Bruxelles, Belgique
☆ Le sens de la famille : analyse du vocabulaire de la parent{é} par les plongements de mots
In this study, we propose a corpus analysis of an area of the French lexicon that is both dense and highly structured: the vocabulary of family relationships. Starting with a lexicon of 25 nouns designating the main relationships (son, cousin, mother, grandfather, sister-in-law etc.), we examine how these terms are positioned in relation to each other through distributional analyses based on the use of these terms in corpora. We show that distributional information can capture certain features that organize this vocabulary (descent, alliance, siblings, genre), in ways that vary according to the different corpora compared.
comment: in French language. JADT 2024 - 17es Journ{\'e}es internationales d'Analyse statistique des Donn{\'e}es Textuelles, SeSLa (S{\'e}minaire des Sciences du Langage de l'UCLouvain -- Site Saint-Louis), 2024, Bruxelles, Belgique
☆ Uncertainty Quantification in Large Language Models Through Convex Hull Analysis
Uncertainty quantification approaches have been more critical in large language models (LLMs), particularly high-risk applications requiring reliable outputs. However, traditional methods for uncertainty quantification, such as probabilistic models and ensemble techniques, face challenges when applied to the complex and high-dimensional nature of LLM-generated outputs. This study proposes a novel geometric approach to uncertainty quantification using convex hull analysis. The proposed method leverages the spatial properties of response embeddings to measure the dispersion and variability of model outputs. The prompts are categorized into three types, i.e., `easy', `moderate', and `confusing', to generate multiple responses using different LLMs at varying temperature settings. The responses are transformed into high-dimensional embeddings via a BERT model and subsequently projected into a two-dimensional space using Principal Component Analysis (PCA). The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm is utilized to cluster the embeddings and compute the convex hull for each selected cluster. The experimental results indicate that the uncertainty of the model for LLMs depends on the prompt complexity, the model, and the temperature setting.
comment: 17 pages
☆ Less is More: Accurate Speech Recognition & Translation without Web-Scale Data
Recent advances in speech recognition and translation rely on hundreds of thousands of hours of Internet speech data. We argue that state-of-the art accuracy can be reached without relying on web-scale data. Canary - multilingual ASR and speech translation model, outperforms current state-of-the-art models - Whisper, OWSM, and Seamless-M4T on English, French, Spanish, and German languages, while being trained on an order of magnitude less data than these models. Three key factors enables such data-efficient model: (1) a FastConformer-based attention encoder-decoder architecture (2) training on synthetic data generated with machine translation and (3) advanced training techniques: data-balancing, dynamic data blending, dynamic bucketing and noise-robust fine-tuning. The model, weights, and training code will be open-sourced.
comment: Accepted at Interspeech-2024
☆ DECOR: Improving Coherence in L2 English Writing with a Novel Benchmark for Incoherence Detection, Reasoning, and Rewriting
Coherence in writing, an aspect that second-language (L2) English learners often struggle with, is crucial in assessing L2 English writing. Existing automated writing evaluation systems primarily use basic surface linguistic features to detect coherence in writing. However, little effort has been made to correct the detected incoherence, which could significantly benefit L2 language learners seeking to improve their writing. To bridge this gap, we introduce DECOR, a novel benchmark that includes expert annotations for detecting incoherence in L2 English writing, identifying the underlying reasons, and rewriting the incoherent sentences. To our knowledge, DECOR is the first coherence assessment dataset specifically designed for improving L2 English writing, featuring pairs of original incoherent sentences alongside their expert-rewritten counterparts. Additionally, we fine-tuned models to automatically detect and rewrite incoherence in student essays. We find that incorporating specific reasons for incoherence during fine-tuning consistently improves the quality of the rewrites, achieving a result that is favored in both automatic and human evaluations.
comment: 21 pages, 5 figures, 20 tables
☆ Designing and Evaluating Multi-Chatbot Interface for Human-AI Communication: Preliminary Findings from a Persuasion Task
The dynamics of human-AI communication have been reshaped by language models such as ChatGPT. However, extant research has primarily focused on dyadic communication, leaving much to be explored regarding the dynamics of human-AI communication in group settings. The availability of multiple language model chatbots presents a unique opportunity for scholars to better understand the interaction between humans and multiple chatbots. This study examines the impact of multi-chatbot communication in a specific persuasion setting: promoting charitable donations. We developed an online environment that enables multi-chatbot communication and conducted a pilot experiment utilizing two GPT-based chatbots, Save the Children and UNICEF chatbots, to promote charitable donations. In this study, we present our development process of the multi-chatbot interface and present preliminary findings from a pilot experiment. Analysis of qualitative and quantitative feedback are presented, and limitations are addressed.
☆ Unlocking Varied Perspectives: A Persona-Based Multi-Agent Framework with Debate-Driven Text Planning for Argument Generation
Writing persuasive arguments is a challenging task for both humans and machines. It entails incorporating high-level beliefs from various perspectives on the topic, along with deliberate reasoning and planning to construct a coherent narrative. Current language models often generate surface tokens autoregressively, lacking explicit integration of these underlying controls, resulting in limited output diversity and coherence. In this work, we propose a persona-based multi-agent framework for argument writing. Inspired by the human debate, we first assign each agent a persona representing its high-level beliefs from a unique perspective, and then design an agent interaction process so that the agents can collaboratively debate and discuss the idea to form an overall plan for argument writing. Such debate process enables fluid and nonlinear development of ideas. We evaluate our framework on argumentative essay writing. The results show that our framework can generate more diverse and persuasive arguments through both automatic and human evaluations.
☆ IDT: Dual-Task Adversarial Attacks for Privacy Protection
Natural language processing (NLP) models may leak private information in different ways, including membership inference, reconstruction or attribute inference attacks. Sensitive information may not be explicit in the text, but hidden in underlying writing characteristics. Methods to protect privacy can involve using representations inside models that are demonstrated not to detect sensitive attributes or -- for instance, in cases where users might not trust a model, the sort of scenario of interest here -- changing the raw text before models can have access to it. The goal is to rewrite text to prevent someone from inferring a sensitive attribute (e.g. the gender of the author, or their location by the writing style) whilst keeping the text useful for its original intention (e.g. the sentiment of a product review). The few works tackling this have focused on generative techniques. However, these often create extensively different texts from the original ones or face problems such as mode collapse. This paper explores a novel adaptation of adversarial attack techniques to manipulate a text to deceive a classifier w.r.t one task (privacy) whilst keeping the predictions of another classifier trained for another task (utility) unchanged. We propose IDT, a method that analyses predictions made by auxiliary and interpretable models to identify which tokens are important to change for the privacy task, and which ones should be kept for the utility task. We evaluate different datasets for NLP suitable for different tasks. Automatic and human evaluations show that IDT retains the utility of text, while also outperforming existing methods when deceiving a classifier w.r.t privacy task.
comment: 28 pages, 1 figure
☆ Mixture of In-Context Experts Enhance LLMs' Long Context Awareness
Many studies have revealed that large language models (LLMs) exhibit uneven awareness of different contextual positions.Their limited context awareness can lead to overlooking critical information and subsequent task failures. While several approaches have been proposed to enhance LLMs' context awareness, achieving both effectiveness and efficiency remains challenging.In this paper, for LLMs utilizing RoPE as position embeddings, we introduce a novel method called ``Mixture of In-Context Experts'' (MoICE) to address this challenge. MoICE comprises two key components: a router integrated into each attention head within LLMs and a lightweight router-only training optimization strategy: (1) MoICE views each RoPE angle as an `in-context' expert, demonstrated to be capable of directing the attention of a head to specific contextual positions. Consequently, each attention head flexibly processes tokens using multiple RoPE angles dynamically selected by the router to attend to the needed positions. This approach mitigates the risk of overlooking essential contextual information. (2) The router-only training strategy entails freezing LLM parameters and exclusively updating routers for only a few steps. When applied to open-source LLMs including Llama and Mistral, MoICE surpasses prior methods across multiple tasks on long context understanding and generation, all while maintaining commendable inference efficiency.
comment: 14 pages, 5 figures
☆ SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs
Synthetic data generation has gained significant attention recently for its utility in training large vision and language models. However, the application of synthetic data to the training of multimodal context-augmented generation systems has been relatively unexplored. This gap in existing work is important because existing vision and language models (VLMs) are not trained specifically for context-augmented generation. Resources for adapting such models are therefore crucial for enabling their use in retrieval-augmented generation (RAG) settings, where a retriever is used to gather relevant information that is then subsequently provided to a generative model via context augmentation. To address this challenging problem, we generate SK-VQA: a large synthetic multimodal dataset containing over 2 million question-answer pairs which require external knowledge to determine the final answer. Our dataset is both larger and significantly more diverse than existing resources of its kind, possessing over 11x more unique questions and containing images from a greater variety of sources than previously-proposed datasets. Through extensive experiments, we demonstrate that our synthetic dataset can not only serve as a challenging benchmark, but is also highly effective for adapting existing generative multimodal models for context-augmented generation.
♻ ☆ AutoMix: Automatically Mixing Language Models
Large language models (LLMs) are now available from cloud API providers in various sizes and configurations. While this diversity offers a broad spectrum of choices, effectively leveraging the options to optimize computational cost and performance remains challenging. In this work, we present Automix, an approach that strategically routes queries to larger LMs, based on the approximate correctness of outputs from a smaller LM. Central to Automix are two key technical contributions. First, it has a few-shot self-verification mechanism, which estimates the reliability of its own outputs without requiring extensive training. Second, given that self-verification can be noisy, it employs a POMDP based router that can effectively select an appropriately sized model, based on answer confidence. Experiments across five language models and five challenging datasets show that Automix consistently surpasses strong baselines, reducing computational cost by over 50% for comparable performance.
comment: The first two authors contributed equally. Work started and partly done during Aman's internship at Google. This version adds results on additional models and datasets
♻ ☆ MBIAS: Mitigating Bias in Large Language Models While Retaining Context
The deployment of Large Language Models (LLMs) in diverse applications necessitates an assurance of safety without compromising the contextual integrity of the generated content. Traditional approaches, including safety-specific fine-tuning or adversarial testing, often yield safe outputs at the expense of contextual meaning. This can result in a diminished capacity to handle nuanced aspects of bias and toxicity, such as underrepresentation or negative portrayals across various demographics. To address these challenges, we introduce MBIAS, an LLM framework carefully instruction fine-tuned on a custom dataset designed specifically for safety interventions. MBIAS is designed to significantly reduce biases and toxic elements in LLM outputs while preserving the main information. This work also details our further use of LLMs: as annotator under human supervision and as evaluator of generated content. Empirical analysis reveals that MBIAS achieves a reduction in bias and toxicity by over 30\% in standard evaluations, and by more than 90\% in diverse demographic tests, highlighting the robustness of our approach. We make the dataset and the fine-tuned model available to the research community for further investigation and ensure reproducibility. The code for this project can be accessed here https://github.com/shainarazavi/MBIAS/tree/main. Warning: This paper contains examples that may be offensive or upsetting.
♻ ☆ MKRAG: Medical Knowledge Retrieval Augmented Generation for Medical Question Answering
Large Language Models (LLMs), although powerful in general domains, often perform poorly on domain-specific tasks like medical question answering (QA). Moreover, they tend to function as "black-boxes," making it challenging to modify their behavior. To address the problem, our study delves into retrieval augmented generation (RAG), aiming to improve LLM responses without the need for fine-tuning or retraining. Specifically, we propose a comprehensive retrieval strategy to extract medical facts from an external knowledge base, and then inject them into the query prompt for LLMs. Focusing on medical QA using the MedQA-SMILE dataset, we evaluate the impact of different retrieval models and the number of facts provided to the LLM. Notably, our retrieval-augmented Vicuna-7B model exhibited an accuracy improvement from 44.46% to 48.54%. This work underscores the potential of RAG to enhance LLM performance, offering a practical approach to mitigate the challenges of black-box LLMs.
comment: Accepted by AMIA 2024 Annual Symposium
♻ ☆ LLMs and Memorization: On Quality and Specificity of Copyright Compliance
Memorization in large language models (LLMs) is a growing concern. LLMs have been shown to easily reproduce parts of their training data, including copyrighted work. This is an important problem to solve, as it may violate existing copyright laws as well as the European AI Act. In this work, we propose a systematic analysis to quantify the extent of potential copyright infringements in LLMs using European law as an example. Unlike previous work, we evaluate instruction-finetuned models in a realistic end-user scenario. Our analysis builds on a proposed threshold of 160 characters, which we borrow from the German Copyright Service Provider Act and a fuzzy text matching algorithm to identify potentially copyright-infringing textual reproductions. The specificity of countermeasures against copyright infringement is analyzed by comparing model behavior on copyrighted and public domain data. We investigate what behaviors models show instead of producing protected text (such as refusal or hallucination) and provide a first legal assessment of these behaviors. We find that there are huge differences in copyright compliance, specificity, and appropriate refusal among popular LLMs. Alpaca, GPT 4, GPT 3.5, and Luminous perform best in our comparison, with OpenGPT-X, Alpaca, and Luminous producing a particularly low absolute number of potential copyright violations. Code will be published soon.
comment: 10 pages, 3 figures
♻ ☆ A Small and Fast BERT for Chinese Medical Punctuation Restoration INTERSPEECH 2024
In clinical dictation, utterances after automatic speech recognition (ASR) without explicit punctuation marks may lead to the misunderstanding of dictated reports. To give a precise and understandable clinical report with ASR, automatic punctuation restoration is required. Considering a practical scenario, we propose a fast and light pre-trained model for Chinese medical punctuation restoration based on 'pretraining and fine-tuning' paradigm. In this work, we distill pre-trained models by incorporating supervised contrastive learning and a novel auxiliary pre-training task (Punctuation Mark Prediction) to make it well-suited for punctuation restoration. Our experiments on various distilled models reveal that our model can achieve 95% performance while 10% model size relative to state-of-the-art Chinese RoBERTa.
comment: 5 pages, 2 figures, Accepted by INTERSPEECH 2024
♻ ☆ Distributed Speculative Inference of Large Language Models
Accelerating the inference of large language models (LLMs) is an important challenge in artificial intelligence. This paper introduces distributed speculative inference (DSI), a novel distributed inference algorithm that is provably faster than speculative inference (SI) [leviathan2023fast, chen2023accelerating, miao2023specinfer] and traditional autoregressive inference (non-SI). Like other SI algorithms, DSI works on frozen LLMs, requiring no training or architectural modifications, and it preserves the target distribution. Prior studies on SI have demonstrated empirical speedups (compared to non-SI) but require a fast and accurate drafter LLM. In practice, off-the-shelf LLMs often do not have matching drafters that are sufficiently fast and accurate. We show a gap: SI gets slower than non-SI when using slower or less accurate drafters. We close this gap by proving that DSI is faster than both SI and non-SI given any drafters. By orchestrating multiple instances of the target and drafters, DSI is not only faster than SI but also supports LLMs that cannot be accelerated with SI. Our simulations show speedups of off-the-shelf LLMs in realistic settings: DSI is 1.29-1.92x faster than SI.
♻ ☆ How well ChatGPT understand Malaysian English? An Evaluation on Named Entity Recognition and Relation Extraction EMNLP 2023
Recently, ChatGPT has attracted a lot of interest from both researchers and the general public. While the performance of ChatGPT in named entity recognition and relation extraction from Standard English texts is satisfactory, it remains to be seen if it can perform similarly for Malaysian English. Malaysian English is unique as it exhibits morphosyntactic and semantical adaptation from local contexts. In this study, we assess ChatGPT's capability in extracting entities and relations from the Malaysian English News (MEN) dataset. We propose a three-step methodology referred to as \textbf{\textit{educate-predict-evaluate}}. The performance of ChatGPT is assessed using F1-Score across 18 unique prompt settings, which were carefully engineered for a comprehensive review. From our evaluation, we found that ChatGPT does not perform well in extracting entities from Malaysian English news articles, with the highest F1-Score of 0.497. Further analysis shows that the morphosyntactic adaptation in Malaysian English caused the limitation. However, interestingly, this morphosyntactic adaptation does not impact the performance of ChatGPT for relation extraction.
comment: Accepted in Generation, Evaluation & Metrics (GEM) Workshop at EMNLP 2023
♻ ☆ Are LLM-based Evaluators Confusing NLG Quality Criteria? ACL 2024
Some prior work has shown that LLMs perform well in NLG evaluation for different tasks. However, we discover that LLMs seem to confuse different evaluation criteria, which reduces their reliability. For further verification, we first consider avoiding issues of inconsistent conceptualization and vague expression in existing NLG quality criteria themselves. So we summarize a clear hierarchical classification system for 11 common aspects with corresponding different criteria from previous studies involved. Inspired by behavioral testing, we elaborately design 18 types of aspect-targeted perturbation attacks for fine-grained analysis of the evaluation behaviors of different LLMs. We also conduct human annotations beyond the guidance of the classification system to validate the impact of the perturbations. Our experimental results reveal confusion issues inherent in LLMs, as well as other noteworthy phenomena, and necessitate further research and improvements for LLM-based evaluation.
comment: Accepted by ACL 2024
♻ ☆ NoteChat: A Dataset of Synthetic Doctor-Patient Conversations Conditioned on Clinical Notes
We introduce NoteChat, a novel cooperative multi-agent framework leveraging Large Language Models (LLMs) to generate patient-physician dialogues. NoteChat embodies the principle that an ensemble of role-specific LLMs, through structured role-play and strategic prompting, can perform their assigned roles more effectively. The synergy among these role-playing LLMs results in a cohesive and efficient dialogue generation. Evaluation on MTS-dialogue, a benchmark dataset for patient-physician dialogues-note pairs, shows that models trained with the augmented synthetic patient-physician dialogues by NoteChat outperforms other state-of-the-art models for generating clinical notes. Our comprehensive automatic and human evaluation demonstrates that NoteChat substantially surpasses state-of-the-art models like ChatGPT and GPT-4 up to 22.78% by domain experts in generating superior synthetic patient-physician dialogues based on clinical notes. NoteChat has the potential to engage patients directly and help clinical documentation, a leading cause of physician burnout.
♻ ☆ JMLR: Joint Medical LLM and Retrieval Training for Enhancing Reasoning and Professional Question Answering Capability
Large Language Models (LLMs) have demonstrated a remarkable potential in medical knowledge acquisition and question-answering. However, LLMs can potentially hallucinate and yield factually incorrect outcomes, even with domain-specific pretraining. Previously, retrieval augmented generation (RAG) has limited success in addressing hallucinations. Unlike previous methods in RAG where the retrieval model was trained separately from the LLM, we introduce JMLR (for Jointly trains LLM and information Retrieval) during the fine-tuning phase. The synchronized training mechanism enhances JMLR's ability to retrieve clinical guidelines and leverage medical knowledge to reason and answer questions and reduces the demand for computational resources. We evaluated JMLR on the important medical question-answering application. Our experimental results demonstrate that JMLR-13B (70.5%) outperforms a previous state-of-the-art open-source model using conventional pre-training and fine-tuning Meditron-70B (68.9%) and Llama2-13B with RAG (67.7%) on a medical question-answering dataset. Comprehensive evaluations reveal JMLR-13B enhances reasoning quality and reduces hallucinations better than Claude3-Opus. Additionally, JMLR-13B (148 GPU hours) also trains much faster than Meditron-70B (42630 GPU hours). Through this work, we provide a new and efficient knowledge enhancement method for healthcare, demonstrating the potential of integrating retrieval and LLM training for medical question-answering systems.
♻ ☆ LatentExplainer: Explaining Latent Representations in Deep Generative Models with Multi-modal Foundation Models
Deep generative models like VAEs and diffusion models have advanced various generation tasks by leveraging latent variables to learn data distributions and generate high-quality samples. Despite the field of explainable AI making strides in interpreting machine learning models, understanding latent variables in generative models remains challenging. This paper introduces LatentExplainer, a framework for automatically generating semantically meaningful explanations of latent variables in deep generative models. LatentExplainer tackles three main challenges: inferring the meaning of latent variables, aligning explanations with inductive biases, and handling varying degrees of explainability. By perturbing latent variables and interpreting changes in generated data, the framework provides a systematic approach to understanding and controlling the data generation process, enhancing the transparency and interpretability of deep generative models. We evaluate our proposed method on several real-world and synthetic datasets, and the results demonstrate superior performance in generating high-quality explanations of latent variables.
♻ ☆ A Unified Data Augmentation Framework for Low-Resource Multi-Domain Dialogue Generation ECML-PKDD
Current state-of-the-art dialogue systems heavily rely on extensive training datasets. However, challenges arise in domains where domain-specific training datasets are insufficient or entirely absent. To tackle this challenge, we propose a novel data \textbf{A}ugmentation framework for \textbf{M}ulti-\textbf{D}omain \textbf{D}ialogue \textbf{G}eneration, referred to as \textbf{AMD$^2$G}. The AMD$^2$G framework consists of a data augmentation process and a two-stage training approach: domain-agnostic training and domain adaptation training. We posit that domain corpora are a blend of domain-agnostic and domain-specific features, with certain representation patterns shared among diverse domains. Domain-agnostic training aims to enable models to learn these common expressive patterns. To construct domain-agnostic dialogue corpora, we employ a \textit{\textbf{de-domaining}} data processing technique used to remove domain-specific features. By mitigating the effects of domain-specific features, the model trained on the de-domained corpora can effectively learn common expression patterns in different domains. Subsequently, we adapt the learned domain-agnostic features to the target domain through domain adaptation training. We conduct experiments on Chinese dialogue datasets from five different domains and show that AMD$^2$G achieves superior performance compared to both direct training on the target domain corpus and collective training on all five domain corpora. Our work underscores AMD$^2$G as a viable alternative solution for low-resource multi-domain dialogue generation. Code and data associated with our work are available on GitHub repository$^{\text 1}$.
comment: 17pages,ECML-PKDD
♻ ☆ Do prompt positions really matter?
Prompt-based models have gathered a lot of attention from researchers due to their remarkable advancements in the fields of zero-shot and few-shot learning. Developing an effective prompt template plays a critical role. However, prior studies have mainly focused on prompt vocabulary searching or embedding initialization within a predefined template with the prompt position fixed. In this empirical study, we conduct the most comprehensive analysis to date of prompt position for diverse Natural Language Processing (NLP) tasks. Our findings quantify the substantial impact prompt position has on model performance. We observe that the prompt positions used in prior studies are often sub-optimal, and this observation is consistent even in widely used instruction-tuned models. These findings suggest prompt position optimisation as a valuable research direction to augment prompt engineering methodologies and prompt position-aware instruction tuning as a potential way to build more robust models in the future.
comment: 8 pages, 2 figures
♻ ☆ Advancing Airport Tower Command Recognition: Integrating Squeeze-and-Excitation and Broadcasted Residual Learning
Accurate recognition of aviation commands is vital for flight safety and efficiency, as pilots must follow air traffic control instructions precisely. This paper addresses challenges in speech command recognition, such as noisy environments and limited computational resources, by advancing keyword spotting technology. We create a dataset of standardized airport tower commands, including routine and emergency instructions. We enhance broadcasted residual learning with squeeze-and-excitation and time-frame frequency-wise squeeze-and-excitation techniques, resulting in our BC-SENet model. This model focuses on crucial information with fewer parameters. Our tests on five keyword spotting models, including BC-SENet, demonstrate superior accuracy and efficiency. These findings highlight the effectiveness of our model advancements in improving speech command recognition for aviation safety and efficiency in noisy, high-stakes environments. Additionally, BC-SENet shows comparable performance on the common Google Speech Command dataset.
comment: Accepted by IALP 2024
♻ ☆ RuBLiMP: Russian Benchmark of Linguistic Minimal Pairs
Minimal pairs are a well-established approach to evaluating the grammatical knowledge of language models. However, existing resources for minimal pairs address a limited number of languages and lack diversity of language-specific grammatical phenomena. This paper introduces the Russian Benchmark of Linguistic Minimal Pairs (RuBLiMP), which includes 45k pairs of sentences that differ in grammaticality and isolate a morphological, syntactic, or semantic phenomenon. In contrast to existing benchmarks of linguistic minimal pairs, RuBLiMP is created by applying linguistic perturbations to automatically annotated sentences from open text corpora and carefully curating test data. We describe the data collection protocol and present the results of evaluating 25 language models in various scenarios. We find that the widely used language models for Russian are sensitive to morphological and agreement-oriented contrasts but fall behind humans on phenomena requiring understanding of structural relations, negation, transitivity, and tense. RuBLiMP, the codebase, and other materials are publicly available.
♻ ☆ TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models ACL 2024
Grasping the concept of time is a fundamental facet of human cognition, indispensable for truly comprehending the intricacies of the world. Previous studies typically focus on specific aspects of time, lacking a comprehensive temporal reasoning benchmark. To address this, we propose TimeBench, a comprehensive hierarchical temporal reasoning benchmark that covers a broad spectrum of temporal reasoning phenomena. TimeBench provides a thorough evaluation for investigating the temporal reasoning capabilities of large language models. We conduct extensive experiments on GPT-4, LLaMA2, and other popular LLMs under various settings. Our experimental results indicate a significant performance gap between the state-of-the-art LLMs and humans, highlighting that there is still a considerable distance to cover in temporal reasoning. Besides, LLMs exhibit capability discrepancies across different reasoning categories. Furthermore, we thoroughly analyze the impact of multiple aspects on temporal reasoning and emphasize the associated challenges. We aspire for TimeBench to serve as a comprehensive benchmark, fostering research in temporal reasoning. Resources are available at: https://github.com/zchuz/TimeBench
comment: Accepted to ACL 2024
♻ ☆ A synthetic data approach for domain generalization of NLI models
Natural Language Inference (NLI) remains an important benchmark task for LLMs. NLI datasets are a springboard for transfer learning to other semantic tasks, and NLI models are standard tools for identifying the faithfulness of model-generated text. There are several large scale NLI datasets today, and models have improved greatly by hill-climbing on these collections. Yet their realistic performance on out-of-distribution/domain data is less well-understood. We explore the opportunity for synthetic high-quality datasets to adapt NLI models for zero-shot use in downstream applications across new and unseen text domains. We demonstrate a new approach for generating NLI data in diverse domains and lengths, so far not covered by existing training sets. The resulting examples have meaningful premises, the hypotheses are formed in creative ways rather than simple edits to a few premise tokens, and the labels have high accuracy. We show that models trained on this data ($685$K synthetic examples) have the best generalization to completely new downstream test settings. On the TRUE benchmark, a T5-small model trained with our data improves around $7\%$ on average compared to training on the best alternative dataset. The improvements are more pronounced for smaller models, while still meaningful on a T5 XXL model. We also demonstrate gains on test sets when in-domain training data is augmented with our domain-general synthetic data.
♻ ☆ Chitchat as Interference: Adding User Backstories to Task-Oriented Dialogues LREC
During task-oriented dialogues (TODs), human users naturally introduce chitchat that is beyond the immediate scope of the task, interfering with the flow of the conversation. To address this issue without the need for expensive manual data creation, we use few-shot prompting with Llama-2-70B to enhance the MultiWOZ dataset with user backstories, a typical example of chitchat interference in TODs. We assess the impact of this addition by testing two models: one trained solely on TODs and another trained on TODs with a preliminary chitchat interaction. Our analysis demonstrates that our enhanced dataset poses a challenge for these systems. Moreover, we demonstrate that our dataset can be effectively used for training purposes, enabling a system to consistently acknowledge the user's backstory while also successfully moving the task forward in the same turn, as confirmed by human evaluation. These findings highlight the benefits of generating novel chitchat-TOD scenarios to test TOD systems more thoroughly and improve their resilience to natural user interferences
comment: Accepted @ LREC-COLING 2024
♻ ☆ MathChat: Converse to Tackle Challenging Math Problems with LLM Agents
Employing Large Language Models (LLMs) to address mathematical problems is an intriguing research endeavor, considering the abundance of math problems expressed in natural language across numerous science and engineering fields. LLMs, with their generalized ability, are used as a foundation model to build AI agents for different tasks. In this paper, we study the effectiveness of utilizing LLM agents to solve math problems through conversations. We propose MathChat, a conversational problem-solving framework designed for math problems. MathChat consists of an LLM agent and a user proxy agent which is responsible for tool execution and additional guidance. This synergy facilitates a collaborative problem-solving process, where the agents engage in a dialogue to solve the problems. We perform evaluation on difficult high school competition problems from the MATH dataset. Utilizing Python, we show that MathChat can further improve previous tool-using prompting methods by 6%.
comment: Update version
♻ ☆ A Unified Approach to Emotion Detection and Task-Oriented Dialogue Modeling
In current text-based task-oriented dialogue (TOD) systems, user emotion detection (ED) is often overlooked or is typically treated as a separate and independent task, requiring additional training. In contrast, our work demonstrates that seamlessly unifying ED and TOD modeling brings about mutual benefits, and is therefore an alternative to be considered. Our method consists in augmenting SimpleToD, an end-to-end TOD system, by extending belief state tracking to include ED, relying on a single language model. We evaluate our approach using GPT-2 and Llama-2 on the EmoWOZ benchmark, a version of MultiWOZ annotated with emotions. Our results reveal a general increase in performance for ED and task results. Our findings also indicate that user emotions provide useful contextual conditioning for system responses, and can be leveraged to further refine responses in terms of empathy.
comment: Accepted @ IWSDS 2024
♻ ☆ M2Lingual: Enhancing Multilingual, Multi-Turn Instruction Alignment in Large Language Models
Instruction finetuning (IFT) is critical for aligning Large Language Models (LLMs) to follow instructions. While many effective IFT datasets have been introduced recently, they predominantly focus on high-resource languages like English. To better align LLMs across a broad spectrum of languages and tasks, we propose a fully synthetic, novel taxonomy (Evol) guided Multilingual, Multi-turn instruction finetuning dataset, called M2Lingual. It is constructed by first selecting a diverse set of seed examples and then utilizing the proposed Evol taxonomy to convert these seeds into complex and challenging multi-turn instructions. We demonstrate the effectiveness of M2Lingual by training LLMs of varying sizes and showcasing the enhanced performance across a diverse set of languages. We contribute the 2 step Evol taxonomy with the guided generation code: https://github.com/ServiceNow/M2Lingual, as well as the first fully synthetic, general and task-oriented, multi-turn, multilingual dataset built with Evol - M2Lingual: https://huggingface.co/datasets/ServiceNow-AI/ M2Lingual - containing 182K total IFT pairs, covering 70 languages and 17+ NLP tasks.
comment: 39 pages
♻ ☆ BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
In this paper, we present a new embedding model, called M3-Embedding, which is distinguished for its versatility in Multi-Linguality, Multi-Functionality, and Multi-Granularity. It can support more than 100 working languages, leading to new state-of-the-art performances on multi-lingual and cross-lingual retrieval tasks. It can simultaneously perform the three common retrieval functionalities of embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval, which provides a unified model foundation for real-world IR applications. It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens. The effective training of M3-Embedding involves the following technical contributions. We propose a novel self-knowledge distillation approach, where the relevance scores from different retrieval functionalities can be integrated as the teacher signal to enhance the training quality. We also optimize the batching strategy, enabling a large batch size and high training throughput to ensure the discriminativeness of embeddings. To the best of our knowledge, M3-Embedding is the first embedding model which realizes such a strong versatility. The model and code will be publicly available at https://github.com/FlagOpen/FlagEmbedding.
♻ ☆ Large Language Model Enhanced Clustering for News Event Detection
The news landscape is continuously evolving, with an ever-increasing volume of information from around the world. Automated event detection within this vast data repository is essential for monitoring, identifying, and categorizing significant news occurrences across diverse platforms. This paper presents an event detection framework that leverages Large Language Models (LLMs) combined with clustering analysis to detect news events from the Global Database of Events, Language, and Tone (GDELT). The framework enhances event clustering through both pre-event detection tasks (keyword extraction and text embedding) and post-event detection tasks (event summarization and topic labelling). We also evaluate the impact of various textual embeddings on the quality of clustering outcomes, ensuring robust news categorization. Additionally, we introduce a novel Cluster Stability Assessment Index (CSAI) to assess the validity and robustness of clustering results. CSAI utilizes multiple feature vectors to provide a new way of measuring clustering quality. Our experiments indicate that the use of LLM embedding in the event detection framework has significantly improved the results, demonstrating greater robustness in terms of CSAI scores. Moreover, post-event detection tasks generate meaningful insights, facilitating effective interpretation of event clustering results. Overall, our experimental results indicate that the proposed framework offers valuable insights and could enhance the accuracy in news analysis and reporting.
♻ ☆ SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention
Large language models (LLMs) now support extremely long context windows, but the quadratic complexity of vanilla attention results in significantly long Time-to-First-Token (TTFT) latency. Existing approaches to address this complexity require additional pretraining or finetuning, and often sacrifice model accuracy. In this paper, we first provide both theoretical and empirical foundations for near-lossless sparse attention. We find dynamically capturing head-specific sparse patterns at runtime with low overhead is crucial. To address this, we propose SampleAttention, an adaptive structured and near-lossless sparse attention. Leveraging observed significant sparse patterns, SampleAttention attends to a fixed percentage of adjacent tokens to capture local window patterns, and employs a two-stage query-guided key-value filtering approach, which adaptively select a minimum set of key-values with low overhead, to capture column stripe patterns. Comprehensive evaluations show that SampleAttention can seamlessly replace vanilla attention in off-the-shelf LLMs with nearly no accuracy loss, and reduces TTFT by up to $2.42\times$ compared with FlashAttention.
♻ ☆ SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models ICML 2024
Most of the existing Large Language Model (LLM) benchmarks on scientific problem reasoning focus on problems grounded in high-school subjects and are confined to elementary algebraic operations. To systematically examine the reasoning capabilities required for solving complex scientific problems, we introduce an expansive benchmark suite SciBench for LLMs. SciBench contains a carefully curated dataset featuring a range of collegiate-level scientific problems from mathematics, chemistry, and physics domains. Based on the dataset, we conduct an in-depth benchmarking study of representative open-source and proprietary LLMs with various prompting strategies. The results reveal that the current LLMs fall short of delivering satisfactory performance, with the best overall score of merely 43.22%. Furthermore, through a detailed user study, we categorize the errors made by LLMs into ten problem-solving abilities. Our analysis indicates that no single prompting strategy significantly outperforms the others and some strategies that demonstrate improvements in certain problem-solving skills could result in declines in other skills. We envision that SciBench will catalyze further developments in the reasoning abilities of LLMs, thereby ultimately contributing to scientific research and discovery.
comment: To appear at ICML 2024
♻ ☆ Active Preference Learning for Large Language Models
As large language models (LLMs) become more capable, fine-tuning techniques for aligning with human intent are increasingly important. A key consideration for aligning these models is how to most effectively use human resources, or model resources in the case where LLMs themselves are used as oracles. Reinforcement learning from Human or AI preferences (RLHF/RLAIF) is the most prominent example of such a technique, but is complex and often unstable. Direct Preference Optimization (DPO) has recently been proposed as a simpler and more stable alternative. In this work, we develop an active learning strategy for DPO to make better use of preference labels. We propose a practical acquisition function for prompt/completion pairs based on the predictive entropy of the language model and a measure of certainty of the implicit preference model optimized by DPO. We demonstrate how our approach improves both the rate of learning and final performance of fine-tuning on pairwise preference data.
comment: 13 pages, 5 figures, 6 tables
♻ ☆ UniGen: A Unified Framework for Textual Dataset Generation Using Large Language Models
Large Language Models (LLMs) such as GPT-4 and Llama3 have significantly impacted various fields by enabling high-quality synthetic data generation and reducing dependence on expensive human-generated datasets. Despite this, challenges remain in the areas of generalization, controllability, diversity, and truthfulness within the existing generative frameworks. To address these challenges, this paper presents UniGen, a comprehensive LLM-powered framework designed to produce diverse, accurate, and highly controllable datasets. UniGen is adaptable, supporting all types of text datasets and enhancing the generative process through innovative mechanisms. To augment data diversity, UniGen incorporates an attribute-guided generation module and a group checking feature. For accuracy, it employs a code-based mathematical assessment for label verification alongside a retrieval-augmented generation technique for factual validation. The framework also allows for user-specified constraints, enabling customization of the data generation process to suit particular requirements. Extensive experiments demonstrate the superior quality of data generated by UniGen, and each module within UniGen plays a critical role in this enhancement. Additionally, UniGen is applied in two practical scenarios: benchmarking LLMs and data augmentation. The results indicate that UniGen effectively supports dynamic and evolving benchmarking, and that data augmentation improves LLM capabilities in various domains, including agent-oriented abilities and reasoning skills.
♻ ☆ Concept-aware Data Construction Improves In-context Learning of Language Models ACL 2024
Many recent language models (LMs) are capable of in-context learning (ICL), manifested in the LMs' ability to perform a new task solely from natural-language instruction. Previous work curating in-context learners assumes that ICL emerges from a vast over-parametrization or the scale of multi-task training. However, recent theoretical work attributes the ICL ability to concept-dependent training data and creates functional in-context learners even in small-scale, synthetic settings. In this work, we practically explore this newly identified axis of ICL quality. We propose Concept-aware Training (CoAT), a framework for constructing training scenarios that make it beneficial for the LM to learn to utilize the analogical reasoning concepts from demonstrations. We find that by using CoAT, pre-trained transformers can learn to better utilise new latent concepts from demonstrations and that such ability makes ICL more robust to the functional deficiencies of the previous models. Finally, we show that concept-aware in-context learning is more effective for a majority of new tasks when compared to traditional instruction tuning, resulting in a performance comparable to the previous in-context learners using magnitudes of more training data.
comment: Long paper to appear in Findings of ACL 2024
♻ ☆ Latent Logic Tree Extraction for Event Sequence Explanation from LLMs
Modern high-stakes systems, such as healthcare or robotics, often generate vast streaming event sequences. Our goal is to design an efficient, plug-and-play tool to elicit logic tree-based explanations from Large Language Models (LLMs) to provide customized insights into each observed event sequence. Built on the temporal point process model for events, our method employs the likelihood function as a score to evaluate generated logic trees. We propose an amortized Expectation-Maximization (EM) learning framework and treat the logic tree as latent variables. In the E-step, we evaluate the posterior distribution over the latent logic trees using an LLM prior and the likelihood of the observed event sequences. LLM provides a high-quality prior for the latent logic trees, however, since the posterior is built over a discrete combinatorial space, we cannot get the closed-form solution. We propose to generate logic tree samples from the posterior using a learnable GFlowNet, which is a diversity-seeking generator for structured discrete variables. The M-step employs the generated logic rules to approximate marginalization over the posterior, facilitating the learning of model parameters and refining the tunable LLM prior parameters. In the online setting, our locally built, lightweight model will iteratively extract the most relevant rules from LLMs for each sequence using only a few iterations. Empirical demonstrations showcase the promising performance and adaptability of our framework.
♻ ☆ Logical Closed Loop: Uncovering Object Hallucinations in Large Vision-Language Models ACL 2024
Object hallucination has been an Achilles' heel which hinders the broader applications of large vision-language models (LVLMs). Object hallucination refers to the phenomenon that the LVLMs claim non-existent objects in the image. To mitigate the object hallucinations, instruction tuning and external model-based detection methods have been proposed, which either require large-scare computational resources or depend on the detection result of external models. However, there remains an under-explored field to utilize the LVLM itself to alleviate object hallucinations. In this work, we adopt the intuition that the LVLM tends to respond logically consistently for existent objects but inconsistently for hallucinated objects. Therefore, we propose a Logical Closed Loop-based framework for Object Hallucination Detection and Mitigation, namely LogicCheckGPT. In specific, we devise logical consistency probing to raise questions with logical correlations, inquiring about attributes from objects and vice versa. Whether their responses can form a logical closed loop serves as an indicator of object hallucination. As a plug-and-play method, it can be seamlessly applied to all existing LVLMs. Comprehensive experiments conducted on three benchmarks across four LVLMs have demonstrated significant improvements brought by our method, indicating its effectiveness and generality.
comment: Accept to ACL 2024; 19 Pages, 15 Figures, 6 Tables
♻ ☆ ANLS* -- A Universal Document Processing Metric for Generative Large Language Models
Traditionally, discriminative models have been the predominant choice for tasks like document classification and information extraction. These models make predictions that fall into a limited number of predefined classes, facilitating a binary true or false evaluation and enabling the direct calculation of metrics such as the F1 score. However, recent advancements in generative large language models (GLLMs) have prompted a shift in the field due to their enhanced zero-shot capabilities, which eliminate the need for a downstream dataset and computationally expensive fine-tuning. However, evaluating GLLMs presents a challenge as the binary true or false evaluation used for discriminative models is not applicable to the predictions made by GLLMs. This paper introduces a new metric for generative models called ANLS* for evaluating a wide variety of tasks, including information extraction and classification tasks. The ANLS* metric extends existing ANLS metrics as a drop-in-replacement and is still compatible with previously reported ANLS scores. An evaluation of 7 different datasets, and more than 10 different GLLMs together with 3 different prompting methods using the ANLS* metric is also provided, demonstrating the importance of the proposed metric. We also benchmark a novel approach to generate prompts for documents, called SFT, against other prompting techniques such as LATIN. In 6 out of 7 cases, SFT outperforms other techniques and improves the state-of-the-art, sometimes by as much as $10$ percentage points. Sources are available at https://github.com/deepopinion/anls_star_metric
♻ ☆ Does Geo-co-location Matter? A Case Study of Public Health Conversations during COVID-19
Social media platforms like Twitter (now X) have been pivotal in information dissemination and public engagement, especially during COVID-19. A key goal for public health experts was to encourage prosocial behavior that could impact local outcomes such as masking and social distancing. Given the importance of local news and guidance during COVID-19, the objective of our research is to analyze the effect of localized engagement, on social media conversations. This study examines the impact of geographic co-location, as a proxy for localized engagement between public health experts (PHEs) and the public, on social media. We analyze a Twitter conversation dataset from January 2020 to November 2021, comprising over 19 K tweets from nearly five hundred PHEs, along with approximately 800 K replies from 350 K participants. Our findings reveal that geo-co-location is associated with higher engagement rates, especially in conversations on topics including masking, lockdowns, and education, and in conversations with academic and medical professionals. Lexical features associated with emotion and personal experiences were more common in geo-co-located contexts. This research provides insights into how geographic co-location influences social media engagement and can inform strategies to improve public health messaging.
♻ ☆ Apollo: A Lightweight Multilingual Medical LLM towards Democratizing Medical AI to 6B People
Despite the vast repository of global medical knowledge predominantly being in English, local languages are crucial for delivering tailored healthcare services, particularly in areas with limited medical resources. To extend the reach of medical AI advancements to a broader population, we aim to develop medical LLMs across the six most widely spoken languages, encompassing a global population of 6.1 billion. This effort culminates in the creation of the ApolloCorpora multilingual medical dataset and the XMedBench benchmark. In the multilingual medical benchmark, the released Apollo models, at various relatively-small sizes (i.e., 0.5B, 1.8B, 2B, 6B, and 7B), achieve the best performance among models of equivalent size. Especially, Apollo-7B is the state-of-the-art multilingual medical LLMs up to 70B. Additionally, these lite models could be used to improve the multi-lingual medical capabilities of larger models without fine-tuning in a proxy-tuning fashion. We will open-source training corpora, code, model weights and evaluation benchmark.
comment: Preprint
♻ ☆ SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance
As the development of large language models (LLMs) rapidly advances, securing these models effectively without compromising their utility has become a pivotal area of research. However, current defense strategies against jailbreak attacks (i.e., efforts to bypass security protocols) often suffer from limited adaptability, restricted general capability, and high cost. To address these challenges, we introduce SafeAligner, a methodology implemented at the decoding stage to fortify defenses against jailbreak attacks. We begin by developing two specialized models: the Sentinel Model, which is trained to foster safety, and the Intruder Model, designed to generate riskier responses. SafeAligner leverages the disparity in security levels between the responses from these models to differentiate between harmful and beneficial tokens, effectively guiding the safety alignment by altering the output token distribution of the target model. Extensive experiments show that SafeAligner can increase the likelihood of beneficial tokens, while reducing the occurrence of harmful ones, thereby ensuring secure alignment with minimal loss to generality.
♻ ☆ FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts ACL 2024
Existing benchmarks for visual question answering lack in visual grounding and complexity, particularly in evaluating spatial reasoning skills. We introduce FlowVQA, a novel benchmark aimed at assessing the capabilities of visual question-answering multimodal language models in reasoning with flowcharts as visual contexts. FlowVQA comprises 2,272 carefully generated and human-verified flowchart images from three distinct content sources, along with 22,413 diverse question-answer pairs, to test a spectrum of reasoning tasks, including information localization, decision-making, and logical progression. We conduct a thorough baseline evaluation on a suite of both open-source and proprietary multimodal language models using various strategies, followed by an analysis of directional bias. The results underscore the benchmark's potential as a vital tool for advancing the field of multimodal modeling, providing a focused and challenging environment for enhancing model performance in visual and logical reasoning tasks.
comment: Accepted in ACL 2024 (Findings), 21 pages, 7 figures, 9 Tables
♻ ☆ WellDunn: On the Robustness and Explainability of Language Models and Large Language Models in Identifying Wellness Dimensions
Language Models (LMs) are being proposed for mental health applications where the heightened risk of adverse outcomes means predictive performance may not be a sufficient litmus test of a model's utility in clinical practice. A model that can be trusted for practice should have a correspondence between explanation and clinical determination, yet no prior research has examined the attention fidelity of these models and their effect on ground truth explanations. We introduce an evaluation design that focuses on the robustness and explainability of LMs in identifying Wellness Dimensions (WD). We focus on two mental health and well-being datasets: (a) Multi-label Classification-based MultiWD, and (b) WellXplain for evaluating attention mechanism veracity against expert-labeled explanations. The labels are based on Halbert Dunn's theory of wellness, which gives grounding to our evaluation. We reveal four surprising results about LMs/LLMs: (1) Despite their human-like capabilities, GPT-3.5/4 lag behind RoBERTa, and MedAlpaca, a fine-tuned LLM fails to deliver any remarkable improvements in performance or explanations. (2) Re-examining LMs' predictions based on a confidence-oriented loss function reveals a significant performance drop. (3) Across all LMs/LLMs, the alignment between attention and explanations remains low, with LLMs scoring a dismal 0.0. (4) Most mental health-specific LMs/LLMs overlook domain-specific knowledge and undervalue explanations, causing these discrepancies. This study highlights the need for further research into their consistency and explanations in mental health and well-being.
comment: 26 pages, including reference and appendix sections, 8 figures, and 16 tables
♻ ☆ AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator
Artificial intelligence has significantly advanced healthcare, particularly through large language models (LLMs) that excel in medical question answering benchmarks. However, their real-world clinical application remains limited due to the complexities of doctor-patient interactions. To address this, we introduce \textbf{AI Hospital}, a multi-agent framework simulating dynamic medical interactions between \emph{Doctor} as player and NPCs including \emph{Patient}, \emph{Examiner}, \emph{Chief Physician}. This setup allows for realistic assessments of LLMs in clinical scenarios. We develop the Multi-View Medical Evaluation (MVME) benchmark, utilizing high-quality Chinese medical records and NPCs to evaluate LLMs' performance in symptom collection, examination recommendations, and diagnoses. Additionally, a dispute resolution collaborative mechanism is proposed to enhance diagnostic accuracy through iterative discussions. Despite improvements, current LLMs exhibit significant performance gaps in multi-turn interactions compared to one-step approaches. Our findings highlight the need for further research to bridge these gaps and improve LLMs' clinical diagnostic capabilities. Our data, code, and experimental results are all open-sourced at \url{https://github.com/LibertFan/AI_Hospital}.
comment: https://github.com/LibertFan/AI_Hospital
♻ ☆ Navigating LLM Ethics: Advancements, Challenges, and Future Directions
This study addresses ethical issues surrounding Large Language Models (LLMs) within the field of artificial intelligence. It explores the common ethical challenges posed by both LLMs and other AI systems, such as privacy and fairness, as well as ethical challenges uniquely arising from LLMs. It highlights challenges such as hallucination, verifiable accountability, and decoding censorship complexity, which are unique to LLMs and distinct from those encountered in traditional AI systems. The study underscores the need to tackle these complexities to ensure accountability, reduce biases, and enhance transparency in the influential role that LLMs play in shaping information dissemination. It proposes mitigation strategies and future directions for LLM ethics, advocating for interdisciplinary collaboration. It recommends ethical frameworks tailored to specific domains and dynamic auditing systems adapted to diverse contexts. This roadmap aims to guide responsible development and integration of LLMs, envisioning a future where ethical considerations govern AI advancements in society.
♻ ☆ The global landscape of academic guidelines for generative AI and Large Language Models
The integration of Generative Artificial Intelligence (GAI) and Large Language Models (LLMs) in academia has spurred a global discourse on their potential pedagogical benefits and ethical considerations. Positive reactions highlight some potential, such as collaborative creativity, increased access to education, and empowerment of trainers and trainees. However, negative reactions raise concerns about ethical complexities, balancing innovation and academic integrity, unequal access, and misinformation risks. Through a systematic survey and text-mining-based analysis of global and national directives, insights from independent research, and eighty university-level guidelines, this study provides a nuanced understanding of the opportunities and challenges posed by GAI and LLMs in education. It emphasizes the importance of balanced approaches that harness the benefits of these technologies while addressing ethical considerations and ensuring equitable access and educational outcomes. The paper concludes with recommendations for fostering responsible innovation and ethical practices to guide the integration of GAI and LLMs in academia.
♻ ☆ Data Augmentation using LLMs: Data Perspectives, Learning Paradigms and Challenges
In the rapidly evolving field of large language models (LLMs), data augmentation (DA) has emerged as a pivotal technique for enhancing model performance by diversifying training examples without the need for additional data collection. This survey explores the transformative impact of LLMs on DA, particularly addressing the unique challenges and opportunities they present in the context of natural language processing (NLP) and beyond. From both data and learning perspectives, we examine various strategies that utilize LLMs for data augmentation, including a novel exploration of learning paradigms where LLM-generated data is used for diverse forms of further training. Additionally, this paper highlights the primary open challenges faced in this domain, ranging from controllable data augmentation to multi-modal data augmentation. This survey highlights a paradigm shift introduced by LLMs in DA, and aims to serve as a comprehensive guide for researchers and practitioners.
♻ ☆ MIntRec2.0: A Large-scale Benchmark Dataset for Multimodal Intent Recognition and Out-of-scope Detection in Conversations ICLR 2024
Multimodal intent recognition poses significant challenges, requiring the incorporation of non-verbal modalities from real-world contexts to enhance the comprehension of human intentions. Existing benchmark datasets are limited in scale and suffer from difficulties in handling out-of-scope samples that arise in multi-turn conversational interactions. We introduce MIntRec2.0, a large-scale benchmark dataset for multimodal intent recognition in multi-party conversations. It contains 1,245 dialogues with 15,040 samples, each annotated within a new intent taxonomy of 30 fine-grained classes. Besides 9,304 in-scope samples, it also includes 5,736 out-of-scope samples appearing in multi-turn contexts, which naturally occur in real-world scenarios. Furthermore, we provide comprehensive information on the speakers in each utterance, enriching its utility for multi-party conversational research. We establish a general framework supporting the organization of single-turn and multi-turn dialogue data, modality feature extraction, multimodal fusion, as well as in-scope classification and out-of-scope detection. Evaluation benchmarks are built using classic multimodal fusion methods, ChatGPT, and human evaluators. While existing methods incorporating nonverbal information yield improvements, effectively leveraging context information and detecting out-of-scope samples remains a substantial challenge. Notably, large language models exhibit a significant performance gap compared to humans, highlighting the limitations of machine learning methods in the cognitive intent understanding task. We believe that MIntRec2.0 will serve as a valuable resource, providing a pioneering foundation for research in human-machine conversational interactions, and significantly facilitating related applications. The full dataset and codes are available at https://github.com/thuiar/MIntRec2.0.
comment: Accepted by ICLR 2024, Long Paper; The abstract is slightly modified due to the length limitation
♻ ☆ Prompting Explicit and Implicit Knowledge for Multi-hop Question Answering Based on Human Reading Process COLING 2024
Pre-trained language models (PLMs) leverage chains-of-thought (CoT) to simulate human reasoning and inference processes, achieving proficient performance in multi-hop QA. However, a gap persists between PLMs' reasoning abilities and those of humans when tackling complex problems. Psychological studies suggest a vital connection between explicit information in passages and human prior knowledge during reading. Nevertheless, current research has given insufficient attention to linking input passages and PLMs' pre-training-based knowledge from the perspective of human cognition studies. In this study, we introduce a Prompting Explicit and Implicit knowledge (PEI) framework, which uses prompts to connect explicit and implicit knowledge, aligning with human reading process for multi-hop QA. We consider the input passages as explicit knowledge, employing them to elicit implicit knowledge through unified prompt reasoning. Furthermore, our model incorporates type-specific reasoning via prompts, a form of implicit knowledge. Experimental results show that PEI performs comparably to the state-of-the-art on HotpotQA. Ablation studies confirm the efficacy of our model in bridging and integrating explicit and implicit knowledge.
comment: This paper has been accepted at COLING 2024
Machine Learning 132
☆ LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Large Language Models (LLMs) equipped with extensive world knowledge and strong reasoning skills can tackle diverse tasks across domains, often by posing them as conversation-style instruction-response pairs. In this paper, we propose LLaRA: Large Language and Robotics Assistant, a framework which formulates robot action policy as conversations, and provides improved responses when trained with auxiliary data that complements policy learning. LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity to process state information as visual-textual prompts and generate optimal policy decisions in text. To train such action policy VLMs, we first introduce an automated pipeline to generate diverse high-quality robotics instruction data from existing behavior cloning data. A VLM finetuned with the resulting collection of datasets based on a conversation-style formulation tailored for robotics tasks, can generate meaningful robot action policy decisions. Our experiments across multiple simulated and real-world environments demonstrate the state-of-the-art performance of the proposed LLaRA framework. The code, datasets, and pretrained models are available at https://github.com/LostXine/LLaRA.
☆ Scaling Synthetic Data Creation with 1,000,000,000 Personas
We propose a novel persona-driven data synthesis methodology that leverages various perspectives within a large language model (LLM) to create diverse synthetic data. To fully exploit this methodology at scale, we introduce Persona Hub -- a collection of 1 billion diverse personas automatically curated from web data. These 1 billion personas (~13% of the world's total population), acting as distributed carriers of world knowledge, can tap into almost every perspective encapsulated within the LLM, thereby facilitating the creation of diverse synthetic data at scale for various scenarios. By showcasing Persona Hub's use cases in synthesizing high-quality mathematical and logical reasoning problems, instructions (i.e., user prompts), knowledge-rich texts, game NPCs and tools (functions) at scale, we demonstrate persona-driven data synthesis is versatile, scalable, flexible, and easy to use, potentially driving a paradigm shift in synthetic data creation and applications in practice, which may have a profound impact on LLM research and development.
comment: Work in progress
☆ ProgressGym: Alignment with a Millennium of Moral Progress
Frontier AI systems, including large language models (LLMs), hold increasing influence over the epistemology of human users. Such influence can reinforce prevailing societal values, potentially contributing to the lock-in of misguided moral beliefs and, consequently, the perpetuation of problematic moral practices on a broad scale. We introduce progress alignment as a technical solution to mitigate this imminent risk. Progress alignment algorithms learn to emulate the mechanics of human moral progress, thereby addressing the susceptibility of existing alignment methods to contemporary moral blindspots. To empower research in progress alignment, we introduce ProgressGym, an experimental framework allowing the learning of moral progress mechanics from history, in order to facilitate future progress in real-world moral decisions. Leveraging 9 centuries of historical text and 18 historical LLMs, ProgressGym enables codification of real-world progress alignment challenges into concrete benchmarks. Specifically, we introduce three core challenges: tracking evolving values (PG-Follow), preemptively anticipating moral progress (PG-Predict), and regulating the feedback loop between human and AI value shifts (PG-Coevolve). Alignment methods without a temporal dimension are inapplicable to these tasks. In response, we present lifelong and extrapolative algorithms as baseline methods of progress alignment, and build an open leaderboard soliciting novel algorithms and challenges. The framework and the leaderboard are available at https://github.com/PKU-Alignment/ProgressGym and https://huggingface.co/spaces/PKU-Alignment/ProgressGym-LeaderBoard respectively.
☆ Token Erasure as a Footprint of Implicit Vocabulary Items in LLMs
LLMs process text as sequences of tokens that roughly correspond to words, where less common words are represented by multiple tokens. However, individual tokens are often semantically unrelated to the meanings of the words/concepts they comprise. For example, Llama-2-7b's tokenizer splits the word "northeastern" into the tokens ['_n', 'ort', 'he', 'astern'], none of which correspond to semantically meaningful units like "north" or "east." Similarly, the overall meanings of named entities like "Neil Young" and multi-word expressions like "break a leg" cannot be directly inferred from their constituent tokens. Mechanistically, how do LLMs convert such arbitrary groups of tokens into useful higher-level representations? In this work, we find that last token representations of named entities and multi-token words exhibit a pronounced "erasure" effect, where information about previous and current tokens is rapidly forgotten in early layers. Using this observation, we propose a method to "read out" the implicit vocabulary of an autoregressive LLM by examining differences in token representations across layers, and present results of this method for Llama-2-7b and Llama-3-8B. To our knowledge, this is the first attempt to probe the implicit vocabulary of an LLM.
comment: 13 pages, 14 figures. Code and data at https://footprints.baulab.info/
☆ Segment Anything without Supervision
The Segmentation Anything Model (SAM) requires labor-intensive data labeling. We present Unsupervised SAM (UnSAM) for promptable and automatic whole-image segmentation that does not require human annotations. UnSAM utilizes a divide-and-conquer strategy to "discover" the hierarchical structure of visual scenes. We first leverage top-down clustering methods to partition an unlabeled image into instance/semantic level segments. For all pixels within a segment, a bottom-up clustering method is employed to iteratively merge them into larger groups, thereby forming a hierarchical structure. These unsupervised multi-granular masks are then utilized to supervise model training. Evaluated across seven popular datasets, UnSAM achieves competitive results with the supervised counterpart SAM, and surpasses the previous state-of-the-art in unsupervised segmentation by 11% in terms of AR. Moreover, we show that supervised SAM can also benefit from our self-supervised labels. By integrating our unsupervised pseudo masks into SA-1B's ground-truth masks and training UnSAM with only 1% of SA-1B, a lightly semi-supervised UnSAM can often segment entities overlooked by supervised SAM, exceeding SAM's AR by over 6.7% and AP by 3.9% on SA-1B.
comment: Code: https://github.com/frank-xwang/UnSAM
☆ Cost-aware Bayesian optimization via the Pandora's Box Gittins index
Bayesian optimization is a technique for efficiently optimizing unknown functions in a black-box manner. To handle practical settings where gathering data requires use of finite resources, it is desirable to explicitly incorporate function evaluation costs into Bayesian optimization policies. To understand how to do so, we develop a previously-unexplored connection between cost-aware Bayesian optimization and the Pandora's Box problem, a decision problem from economics. The Pandora's Box problem admits a Bayesian-optimal solution based on an expression called the Gittins index, which can be reinterpreted as an acquisition function. We study the use of this acquisition function for cost-aware Bayesian optimization, and demonstrate empirically that it performs well, particularly in medium-high dimensions. We further show that this performance carries over to classical Bayesian optimization without explicit evaluation costs. Our work constitutes a first step towards integrating techniques from Gittins index theory into Bayesian optimization.
☆ SpotlessSplats: Ignoring Distractors in 3D Gaussian Splatting
3D Gaussian Splatting (3DGS) is a promising technique for 3D reconstruction, offering efficient training and rendering speeds, making it suitable for real-time applications.However, current methods require highly controlled environments (no moving people or wind-blown elements, and consistent lighting) to meet the inter-view consistency assumption of 3DGS. This makes reconstruction of real-world captures problematic. We present SpotlessSplats, an approach that leverages pre-trained and general-purpose features coupled with robust optimization to effectively ignore transient distractors. Our method achieves state-of-the-art reconstruction quality both visually and quantitatively, on casual captures.
☆ Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. However, such access may also let malicious actors undermine model safety. To demonstrate the challenge of defending finetuning interfaces, we introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection. Our method constructs a malicious dataset where every individual datapoint appears innocuous, but finetuning on the dataset teaches the model to respond to encoded harmful requests with encoded harmful responses. Applied to GPT-4, our method produces a finetuned model that acts on harmful instructions 99% of the time and avoids detection by defense mechanisms such as dataset inspection, safety evaluations, and input/output classifiers. Our findings question whether black-box finetuning access can be secured against sophisticated adversaries.
comment: 22 pages
☆ Evaluation of autonomous systems under data distribution shifts
We posit that data can only be safe to use up to a certain threshold of the data distribution shift, after which control must be relinquished by the autonomous system and operation halted or handed to a human operator. With the use of a computer vision toy example we demonstrate that network predictive accuracy is impacted by data distribution shifts and propose distance metrics between training and testing data to define safe operation limits within said shifts. We conclude that beyond an empirically obtained threshold of the data distribution shift, it is unreasonable to expect network predictive accuracy not to degrade
comment: 13 pages, 10 figures, 4 tables
☆ Explore as a Storm, Exploit as a Raindrop: On the Benefit of Fine-Tuning Kernel Schedulers with Coordinate Descent
Machine-learning models consist of kernels, which are algorithms applying operations on tensors -- data indexed by a linear combination of natural numbers. Examples of kernels include convolutions, transpositions, and vectorial products. There are many ways to implement a kernel. These implementations form the kernel's optimization space. Kernel scheduling is the problem of finding the best implementation, given an objective function -- typically execution speed. Kernel optimizers such as Ansor, Halide, and AutoTVM solve this problem via search heuristics, which combine two phases: exploration and exploitation. The first step evaluates many different kernel optimization spaces. The latter tries to improve the best implementations by investigating a kernel within the same space. For example, Ansor combines kernel generation through sketches for exploration and leverages an evolutionary algorithm to exploit the best sketches. In this work, we demonstrate the potential to reduce Ansor's search time while enhancing kernel quality by incorporating Droplet Search, an AutoTVM algorithm, into Ansor's exploration phase. The approach involves limiting the number of samples explored by Ansor, selecting the best, and exploiting it with a coordinate descent algorithm. By applying this approach to the first 300 kernels that Ansor generates, we usually obtain better kernels in less time than if we let Ansor analyze 10,000 kernels. This result has been replicated in 20 well-known deep-learning models (AlexNet, ResNet, VGG, DenseNet, etc.) running on four architectures: an AMD Ryzen 7 (x86), an NVIDIA A100 tensor core, an NVIDIA RTX 3080 GPU, and an ARM A64FX. A patch with this combined approach was approved in Ansor in February 2024. As evidence of the generality of this search methodology, a similar patch, achieving equally good results, was submitted to TVM's MetaSchedule in June 2024.
comment: 22 pages, 19 figures, original work
☆ Pairwise Difference Learning for Classification
Pairwise difference learning (PDL) has recently been introduced as a new meta-learning technique for regression. Instead of learning a mapping from instances to outcomes in the standard way, the key idea is to learn a function that takes two instances as input and predicts the difference between the respective outcomes. Given a function of this kind, predictions for a query instance are derived from every training example and then averaged. This paper extends PDL toward the task of classification and proposes a meta-learning technique for inducing a PDL classifier by solving a suitably defined (binary) classification problem on a paired version of the original training data. We analyze the performance of the PDL classifier in a large-scale empirical study and find that it outperforms state-of-the-art methods in terms of prediction performance. Last but not least, we provide an easy-to-use and publicly available implementation of PDL in a Python package.
☆ On the Trade-off between Flatness and Optimization in Distributed Learning
This paper proposes a theoretical framework to evaluate and compare the performance of gradient-descent algorithms for distributed learning in relation to their behavior around local minima in nonconvex environments. Previous works have noticed that convergence toward flat local minima tend to enhance the generalization ability of learning algorithms. This work discovers two interesting results. First, it shows that decentralized learning strategies are able to escape faster away from local minimizers and favor convergence toward flatter minima relative to the centralized solution in the large-batch training regime. Second, and importantly, the ultimate classification accuracy is not solely dependent on the flatness of the local minimizer but also on how well a learning algorithm can approach that minimum. In other words, the classification accuracy is a function of both flatness and optimization performance. The paper examines the interplay between the two measures of flatness and optimization error closely. One important conclusion is that decentralized strategies of the diffusion type deliver enhanced classification accuracy because it strikes a more favorable balance between flatness and optimization performance.
☆ Wavelets Are All You Need for Autoregressive Image Generation
In this paper, we take a new approach to autoregressive image generation that is based on two main ingredients. The first is wavelet image coding, which allows to tokenize the visual details of an image from coarse to fine details by ordering the information starting with the most significant bits of the most significant wavelet coefficients. The second is a variant of a language transformer whose architecture is re-designed and optimized for token sequences in this 'wavelet language'. The transformer learns the significant statistical correlations within a token sequence, which are the manifestations of well-known correlations between the wavelet subbands at various resolutions. We show experimental results with conditioning on the generation process.
comment: 16 pages, 10 figures
☆ Single Parent Family: A Spectrum of Family Members from a Single Pre-Trained Foundation Model
This paper introduces a novel method of Progressive Low Rank Decomposition (PLRD) tailored for the compression of large language models. Our approach leverages a pre-trained model, which is then incrementally decompressed to smaller sizes using progressively lower ranks. This method allows for significant reductions in computational overhead and energy consumption, as subsequent models are derived from the original without the need for retraining from scratch. We detail the implementation of PLRD, which strategically decreases the tensor ranks, thus optimizing the trade-off between model performance and resource usage. The efficacy of PLRD is demonstrated through extensive experiments showing that models trained with PLRD method on only 1B tokens maintain comparable performance with traditionally trained models while using 0.1% of the tokens. The versatility of PLRD is highlighted by its ability to generate multiple model sizes from a single foundational model, adapting fluidly to varying computational and memory budgets. Our findings suggest that PLRD could set a new standard for the efficient scaling of LLMs, making advanced AI more feasible on diverse platforms.
☆ Machine Learning Predictors for Min-Entropy Estimation
This study investigates the application of machine learning predictors for min-entropy estimation in Random Number Generators (RNGs), a key component in cryptographic applications where accurate entropy assessment is essential for cybersecurity. Our research indicates that these predictors, and indeed any predictor that leverages sequence correlations, primarily estimate average min-entropy, a metric not extensively studied in this context. We explore the relationship between average min-entropy and the traditional min-entropy, focusing on their dependence on the number of target bits being predicted. Utilizing data from Generalized Binary Autoregressive Models, a subset of Markov processes, we demonstrate that machine learning models (including a hybrid of convolutional and recurrent Long Short-Term Memory layers and the transformer-based GPT-2 model) outperform traditional NIST SP 800-90B predictors in certain scenarios. Our findings underscore the importance of considering the number of target bits in min-entropy assessment for RNGs and highlight the potential of machine learning approaches in enhancing entropy estimation techniques for improved cryptographic security.
☆ Comparative Analysis of LSTM Neural Networks and Traditional Machine Learning Models for Predicting Diabetes Patient Readmission
Diabetes mellitus is a chronic metabolic disorder that has emerged as one of the major health problems worldwide due to its high prevalence and serious complications, which are pricey to manage. Effective management requires good glycemic control and regular follow-up in the clinic; however, non-adherence to scheduled follow-ups is very common. This study uses the Diabetes 130-US Hospitals dataset for analysis and prediction of readmission patients by various traditional machine learning models, such as XGBoost, LightGBM, CatBoost, Decision Tree, and Random Forest, and also uses an in-house LSTM neural network for comparison. The quality of the data was assured by preprocessing it, and the performance evaluation for all these models was based on accuracy, precision, recall, and F1-score. LightGBM turned out to be the best traditional model, while XGBoost was the runner-up. The LSTM model suffered from overfitting despite high training accuracy. A major strength of LSTM is capturing temporal dependencies among the patient data. Further, SHAP values were used, which improved model interpretability, whereby key factors among them number of lab procedures and discharge disposition were identified as critical in the prediction of readmissions. This study demonstrates that model selection, validation, and interpretability are key steps in predictive healthcare modeling. This will help health providers design interventions for improved follow-up adherence and better management of diabetes.
☆ ScaleBiO: Scalable Bilevel Optimization for LLM Data Reweighting
Bilevel optimization has shown its utility across various machine learning settings, yet most algorithms in practice require second-order information, making it challenging to scale them up. Only recently, a paradigm of first-order algorithms emerged, capable of effectively addressing bilevel optimization problems. Nevertheless, the practical efficiency of this paradigm remains unverified, particularly in the context of large language models (LLMs). This paper introduces the first scalable instantiation of this paradigm called ScaleBiO, focusing on bilevel optimization for large-scale LLM data reweighting. By combining with a recently proposed memory-efficient training technique called LISA, our novel algorithm allows the paradigm to scale to 34-billion-parameter LLMs on eight A40 GPUs, marking the first successful application of bilevel optimization under practical scenarios for large-sized LLMs. Empirically, extensive experiments on data reweighting verify the effectiveness of ScaleBiO for different-scaled models, including GPT-2, LLaMA-3-8B, GPT-NeoX-20B, and Yi-34B, where bilevel optimization succeeds in filtering irrelevant data samples and selecting informative samples. Theoretically, ScaleBiO ensures the optimality of the learned data weights, along with a convergence guarantee matching the conventional first-order bilevel optimization paradigm on smooth and strongly convex objectives.
☆ STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical
Large Vision-Language Models (LVLMs) have shown significant potential in assisting medical diagnosis by leveraging extensive biomedical datasets. However, the advancement of medical image understanding and reasoning critically depends on building high-quality visual instruction data, which is costly and labor-intensive to obtain, particularly in the medical domain. To mitigate this data-starving issue, we introduce Self-Training Large Language and Vision Assistant for Medical (STLLaVA-Med). The proposed method is designed to train a policy model (an LVLM) capable of auto-generating medical visual instruction data to improve data efficiency, guided through Direct Preference Optimization (DPO). Specifically, a more powerful and larger LVLM (e.g., GPT-4o) is involved as a biomedical expert to oversee the DPO fine-tuning process on the auto-generated data, encouraging the policy model to align efficiently with human preferences. We validate the efficacy and data efficiency of STLLaVA-Med across three major medical Visual Question Answering (VQA) benchmarks, demonstrating competitive zero-shot performance with the utilization of only 9% of the medical data.
comment: 10 pages, 6 figures
☆ Text2Robot: Evolutionary Robot Design from Text Descriptions
Robot design has traditionally been costly and labor-intensive. Despite advancements in automated processes, it remains challenging to navigate a vast design space while producing physically manufacturable robots. We introduce Text2Robot, a framework that converts user text specifications and performance preferences into physical quadrupedal robots. Within minutes, Text2Robot can use text-to-3D models to provide strong initializations of diverse morphologies. Within a day, our geometric processing algorithms and body-control co-optimization produce a walking robot by explicitly considering real-world electronics and manufacturability. Text2Robot enables rapid prototyping and opens new opportunities for robot design with generative models.
comment: Our project website is at: https://generalroboticslab.com/Text2Robot
☆ The Computational Curse of Big Data for Bayesian Additive Regression Trees: A Hitting Time Analysis
Bayesian Additive Regression Trees (BART) is a popular Bayesian non-parametric regression model that is commonly used in causal inference and beyond. Its strong predictive performance is supported by theoretical guarantees that its posterior distribution concentrates around the true regression function at optimal rates under various data generative settings and for appropriate prior choices. In this paper, we show that the BART sampler often converges slowly, confirming empirical observations by other researchers. Assuming discrete covariates, we show that, while the BART posterior concentrates on a set comprising all optimal tree structures (smallest bias and complexity), the Markov chain's hitting time for this set increases with $n$ (training sample size), under several common data generative settings. As $n$ increases, the approximate BART posterior thus becomes increasingly different from the exact posterior (for the same number of MCMC samples), contrasting with earlier concentration results on the exact posterior. This contrast is highlighted by our simulations showing worsening frequentist undercoverage for approximate posterior intervals and a growing ratio between the MSE of the approximate posterior and that obtainable by artificially improving convergence via averaging multiple sampler chains. Finally, based on our theoretical insights, possibilities are discussed to improve the BART sampler convergence performance.
☆ Kolmogorov-Smirnov GAN
We propose a novel deep generative model, the Kolmogorov-Smirnov Generative Adversarial Network (KSGAN). Unlike existing approaches, KSGAN formulates the learning process as a minimization of the Kolmogorov-Smirnov (KS) distance, generalized to handle multivariate distributions. This distance is calculated using the quantile function, which acts as the critic in the adversarial training process. We formally demonstrate that minimizing the KS distance leads to the trained approximate distribution aligning with the target distribution. We propose an efficient implementation and evaluate its effectiveness through experiments. The results show that KSGAN performs on par with existing adversarial methods, exhibiting stability during training, resistance to mode dropping and collapse, and tolerance to variations in hyperparameter settings. Additionally, we review the literature on the Generalized KS test and discuss the connections between KSGAN and existing adversarial generative models.
comment: Code available at https://github.com/DMML-Geneva/ksgan
☆ Decoupling General and Personalized Knowledge in Federated Learning via Additive and Low-Rank Decomposition
To address data heterogeneity, the key strategy of Personalized Federated Learning (PFL) is to decouple general knowledge (shared among clients) and client-specific knowledge, as the latter can have a negative impact on collaboration if not removed. Existing PFL methods primarily adopt a parameter partitioning approach, where the parameters of a model are designated as one of two types: parameters shared with other clients to extract general knowledge and parameters retained locally to learn client-specific knowledge. However, as these two types of parameters are put together like a jigsaw puzzle into a single model during the training process, each parameter may simultaneously absorb both general and client-specific knowledge, thus struggling to separate the two types of knowledge effectively. In this paper, we introduce FedDecomp, a simple but effective PFL paradigm that employs parameter additive decomposition to address this issue. Instead of assigning each parameter of a model as either a shared or personalized one, FedDecomp decomposes each parameter into the sum of two parameters: a shared one and a personalized one, thus achieving a more thorough decoupling of shared and personalized knowledge compared to the parameter partitioning method. In addition, as we find that retaining local knowledge of specific clients requires much lower model capacity compared with general knowledge across all clients, we let the matrix containing personalized parameters be low rank during the training process. Moreover, a new alternating training strategy is proposed to further improve the performance. Experimental results across multiple datasets and varying degrees of data heterogeneity demonstrate that FedDecomp outperforms state-of-the-art methods up to 4.9\%.
comment: 12 pages, 8 figures
☆ `Just One More Sensor is Enough' -- Iterative Water Leak Localization with Physical Simulation and a Small Number of Pressure Sensors
In this article, we propose an approach to leak localisation in a complex water delivery grid with the use of data from physical simulation (e.g. EPANET software). This task is usually achieved by a network of multiple water pressure sensors and analysis of the so-called sensitivity matrix of pressure differences between the network's simulated data and actual data of the network affected by the leak. However, most algorithms using this approach require a significant number of pressure sensors -- a condition that is not easy to fulfil in the case of many less equipped networks. Therefore, we answer the question of whether leak localisation is possible by utilising very few sensors but having the ability to relocate one of them. Our algorithm is based on physical simulations (EPANET software) and an iterative scheme for mobile sensor relocation. The experiments show that the proposed system can equalise the low number of sensors with adjustments made for their positioning, giving a very good approximation of leak's position both in simulated cases and real-life example taken from BattLeDIM competition L-Town data.
☆ FI-CBL: A Probabilistic Method for Concept-Based Learning with Expert Rules
A method for solving concept-based learning (CBL) problem is proposed. The main idea behind the method is to divide each concept-annotated image into patches, to transform the patches into embeddings by using an autoencoder, and to cluster the embeddings assuming that each cluster will mainly contain embeddings of patches with certain concepts. To find concepts of a new image, the method implements the frequentist inference by computing prior and posterior probabilities of concepts based on rates of patches from images with certain values of the concepts. Therefore, the proposed method is called the Frequentist Inference CBL (FI-CBL). FI-CBL allows us to incorporate the expert rules in the form of logic functions into the inference procedure. An idea behind the incorporation is to update prior and conditional probabilities of concepts to satisfy the rules. The method is transparent because it has an explicit sequence of probabilistic calculations and a clear frequency interpretation. Numerical experiments show that FI-CBL outperforms the concept bottleneck model in cases when the number of training data is small. The code of proposed algorithms is publicly available.
☆ Attention Meets UAVs: A Comprehensive Evaluation of DDoS Detection in Low-Cost UAVs
This paper explores the critical issue of enhancing cybersecurity measures for low-cost, Wi-Fi-based Unmanned Aerial Vehicles (UAVs) against Distributed Denial of Service (DDoS) attacks. In the current work, we have explored three variants of DDoS attacks, namely Transmission Control Protocol (TCP), Internet Control Message Protocol (ICMP), and TCP + ICMP flooding attacks, and developed a detection mechanism that runs on the companion computer of the UAV system. As a part of the detection mechanism, we have evaluated various machine learning, and deep learning algorithms, such as XGBoost, Isolation Forest, Long Short-Term Memory (LSTM), Bidirectional-LSTM (Bi-LSTM), LSTM with attention, Bi-LSTM with attention, and Time Series Transformer (TST) in terms of various classification metrics. Our evaluation reveals that algorithms with attention mechanisms outperform their counterparts in general, and TST stands out as the most efficient model with a run time of 0.1 seconds. TST has demonstrated an F1 score of 0.999, 0.997, and 0.943 for TCP, ICMP, and TCP + ICMP flooding attacks respectively. In this work, we present the necessary steps required to build an on-board DDoS detection mechanism. Further, we also present the ablation study to identify the best TST hyperparameters for DDoS detection, and we have also underscored the advantage of adapting learnable positional embeddings in TST for DDoS detection with an improvement in F1 score from 0.94 to 0.99.
☆ Koopman based trajectory model and computation offloading for high mobility paradigm in ISAC enabled IoT system
User experience on mobile devices is constrained by limited battery capacity and processing power, but 6G technology advancements are diving rapidly into mobile technical evolution. Mobile edge computing (MEC) offers a solution, offloading computationally intensive tasks to edge cloud servers, reducing battery drain compared to local processing. The upcoming integrated sensing and communication in mobile communication may improve the trajectory prediction and processing delays. This study proposes a greedy resource allocation optimization strategy for multi-user networks to minimize aggregate energy usage. Numerical results show potential improvement at 33\% for every 1000 iteration. Addressing prediction model division and velocity accuracy issues is crucial for better results. A plan for further improvement and achieving objectives is outlined for the upcoming work phase.
☆ Operator World Models for Reinforcement Learning
Policy Mirror Descent (PMD) is a powerful and theoretically sound methodology for sequential decision-making. However, it is not directly applicable to Reinforcement Learning (RL) due to the inaccessibility of explicit action-value functions. We address this challenge by introducing a novel approach based on learning a world model of the environment using conditional mean embeddings. We then leverage the operatorial formulation of RL to express the action-value function in terms of this quantity in closed form via matrix operations. Combining these estimators with PMD leads to POWR, a new RL algorithm for which we prove convergence rates to the global optimum. Preliminary experiments in finite and infinite state settings support the effectiveness of our method.
☆ MuGSI: Distilling GNNs with Multi-Granularity Structural Information for Graph Classification
Recent works have introduced GNN-to-MLP knowledge distillation (KD) frameworks to combine both GNN's superior performance and MLP's fast inference speed. However, existing KD frameworks are primarily designed for node classification within single graphs, leaving their applicability to graph classification largely unexplored. Two main challenges arise when extending KD for node classification to graph classification: (1) The inherent sparsity of learning signals due to soft labels being generated at the graph level; (2) The limited expressiveness of student MLPs, especially in datasets with limited input feature spaces. To overcome these challenges, we introduce MuGSI, a novel KD framework that employs Multi-granularity Structural Information for graph classification. Specifically, we propose multi-granularity distillation loss in MuGSI to tackle the first challenge. This loss function is composed of three distinct components: graph-level distillation, subgraph-level distillation, and node-level distillation. Each component targets a specific granularity of the graph structure, ensuring a comprehensive transfer of structural knowledge from the teacher model to the student model. To tackle the second challenge, MuGSI proposes to incorporate a node feature augmentation component, thereby enhancing the expressiveness of the student MLPs and making them more capable learners. We perform extensive experiments across a variety of datasets and different teacher/student model architectures. The experiment results demonstrate the effectiveness, efficiency, and robustness of MuGSI. Codes are publicly available at: \textbf{\url{https://github.com/tianyao-aka/MuGSI}.}
comment: 12 pages, 4 figures. Accepted by TheWebConf2024
☆ Towards Stable and Storage-efficient Dataset Distillation: Matching Convexified Trajectory
The rapid evolution of deep learning and large language models has led to an exponential growth in the demand for training data, prompting the development of Dataset Distillation methods to address the challenges of managing large datasets. Among these, Matching Training Trajectories (MTT) has been a prominent approach, which replicates the training trajectory of an expert network on real data with a synthetic dataset. However, our investigation found that this method suffers from three significant limitations: 1. Instability of expert trajectory generated by Stochastic Gradient Descent (SGD); 2. Low convergence speed of the distillation process; 3. High storage consumption of the expert trajectory. To address these issues, we offer a new perspective on understanding the essence of Dataset Distillation and MTT through a simple transformation of the objective function, and introduce a novel method called Matching Convexified Trajectory (MCT), which aims to provide better guidance for the student trajectory. MCT leverages insights from the linearized dynamics of Neural Tangent Kernel methods to create a convex combination of expert trajectories, guiding the student network to converge rapidly and stably. This trajectory is not only easier to store, but also enables a continuous sampling strategy during distillation, ensuring thorough learning and fitting of the entire expert trajectory. Comprehensive experiments across three public datasets validate the superiority of MCT over traditional MTT methods.
comment: 11 pages
☆ Reinforcement Learning for Efficient Design and Control Co-optimisation of Energy Systems
The ongoing energy transition drives the development of decentralised renewable energy sources, which are heterogeneous and weather-dependent, complicating their integration into energy systems. This study tackles this issue by introducing a novel reinforcement learning (RL) framework tailored for the co-optimisation of design and control in energy systems. Traditionally, the integration of renewable sources in the energy sector has relied on complex mathematical modelling and sequential processes. By leveraging RL's model-free capabilities, the framework eliminates the need for explicit system modelling. By optimising both control and design policies jointly, the framework enhances the integration of renewable sources and improves system efficiency. This contribution paves the way for advanced RL applications in energy management, leading to more efficient and effective use of renewable energy sources.
☆ Deceptive Diffusion: Generating Synthetic Adversarial Examples
We introduce the concept of deceptive diffusion -- training a generative AI model to produce adversarial images. Whereas a traditional adversarial attack algorithm aims to perturb an existing image to induce a misclassificaton, the deceptive diffusion model can create an arbitrary number of new, misclassified images that are not directly associated with training or test images. Deceptive diffusion offers the possibility of strengthening defence algorithms by providing adversarial training data at scale, including types of misclassification that are otherwise difficult to find. In our experiments, we also investigate the effect of training on a partially attacked data set. This highlights a new type of vulnerability for generative diffusion models: if an attacker is able to stealthily poison a portion of the training data, then the resulting diffusion model will generate a similar proportion of misleading outputs.
☆ MulTi-Wise Sampling: Trading Uniform T-Wise Feature Interaction Coverage for Smaller Samples
Ensuring the functional safety of highly configurable systems often requires testing representative subsets of all possible configurations to reduce testing effort and save resources. The ratio of covered t-wise feature interactions (i.e., T-Wise Feature Interaction Coverage) is a common criterion for determining whether a subset of configurations is representative and capable of finding faults. Existing t-wise sampling algorithms uniformly cover t-wise feature interactions for all features, resulting in lengthy execution times and large sample sizes, particularly when large t-wise feature interactions are considered (i.e., high values of t). In this paper, we introduce a novel approach to t-wise feature interaction sampling, questioning the necessity of uniform coverage across all t-wise feature interactions, called \emph{\mulTiWise{}}. Our approach prioritizes between subsets of critical and non-critical features, considering higher t-values for subsets of critical features when generating a t-wise feature interaction sample. We evaluate our approach using subject systems from real-world applications, including \busybox{}, \soletta{}, \fiasco{}, and \uclibc{}. Our results show that sacrificing uniform t-wise feature interaction coverage between all features reduces the time needed to generate a sample and the resulting sample size. Hence, \mulTiWise{} Sampling offers an alternative to existing approaches if knowledge about feature criticality is available.
☆ Modeling the Real World with High-Density Visual Particle Dynamics
We present High-Density Visual Particle Dynamics (HD-VPD), a learned world model that can emulate the physical dynamics of real scenes by processing massive latent point clouds containing 100K+ particles. To enable efficiency at this scale, we introduce a novel family of Point Cloud Transformers (PCTs) called Interlacers leveraging intertwined linear-attention Performer layers and graph-based neighbour attention layers. We demonstrate the capabilities of HD-VPD by modeling the dynamics of high degree-of-freedom bi-manual robots with two RGB-D cameras. Compared to the previous graph neural network approach, our Interlacer dynamics is twice as fast with the same prediction quality, and can achieve higher quality using 4x as many particles. We illustrate how HD-VPD can evaluate motion plan quality with robotic box pushing and can grasping tasks. See videos and particle dynamics rendered by HD-VPD at https://sites.google.com/view/hd-vpd.
☆ Improving Performance Prediction of Electrolyte Formulations with Transformer-based Molecular Representation Model ICML 2024
Development of efficient and high-performing electrolytes is crucial for advancing energy storage technologies, particularly in batteries. Predicting the performance of battery electrolytes rely on complex interactions between the individual constituents. Consequently, a strategy that adeptly captures these relationships and forms a robust representation of the formulation is essential for integrating with machine learning models to predict properties accurately. In this paper, we introduce a novel approach leveraging a transformer-based molecular representation model to effectively and efficiently capture the representation of electrolyte formulations. The performance of the proposed approach is evaluated on two battery property prediction tasks and the results show superior performance compared to the state-of-the-art methods.
comment: Accepted in ML4LMS Workshop at ICML 2024
Self-Supervised Spatial-Temporal Normality Learning for Time Series Anomaly Detection ECML
Time Series Anomaly Detection (TSAD) finds widespread applications across various domains such as financial markets, industrial production, and healthcare. Its primary objective is to learn the normal patterns of time series data, thereby identifying deviations in test samples. Most existing TSAD methods focus on modeling data from the temporal dimension, while ignoring the semantic information in the spatial dimension. To address this issue, we introduce a novel approach, called Spatial-Temporal Normality learning (STEN). STEN is composed of a sequence Order prediction-based Temporal Normality learning (OTN) module that captures the temporal correlations within sequences, and a Distance prediction-based Spatial Normality learning (DSN) module that learns the relative spatial relations between sequences in a feature space. By synthesizing these two modules, STEN learns expressive spatial-temporal representations for the normal patterns hidden in the time series data. Extensive experiments on five popular TSAD benchmarks show that STEN substantially outperforms state-of-the-art competing methods. Our code is available at https://github.com/mala-lab/STEN.
comment: 18 pages, 4 figures, accepted in ECML PKDD2024
☆ Contextualized Hybrid Ensemble Q-learning: Learning Fast with Control Priors
Combining Reinforcement Learning (RL) with a prior controller can yield the best out of two worlds: RL can solve complex nonlinear problems, while the control prior ensures safer exploration and speeds up training. Prior work largely blends both components with a fixed weight, neglecting that the RL agent's performance varies with the training progress and across regions in the state space. Therefore, we advocate for an adaptive strategy that dynamically adjusts the weighting based on the RL agent's current capabilities. We propose a new adaptive hybrid RL algorithm, Contextualized Hybrid Ensemble Q-learning (CHEQ). CHEQ combines three key ingredients: (i) a time-invariant formulation of the adaptive hybrid RL problem treating the adaptive weight as a context variable, (ii) a weight adaption mechanism based on the parametric uncertainty of a critic ensemble, and (iii) ensemble-based acceleration for data-efficient RL. Evaluating CHEQ on a car racing task reveals substantially stronger data efficiency, exploration safety, and transferability to unknown scenarios than state-of-the-art adaptive hybrid RL methods.
comment: 20 pages, 12 figures
☆ Systematic Literature Review on Application of Learning-based Approaches in Continuous Integration
Context: Machine learning (ML) and deep learning (DL) analyze raw data to extract valuable insights in specific phases. The rise of continuous practices in software projects emphasizes automating Continuous Integration (CI) with these learning-based methods, while the growing adoption of such approaches underscores the need for systematizing knowledge. Objective: Our objective is to comprehensively review and analyze existing literature concerning learning-based methods within the CI domain. We endeavour to identify and analyse various techniques documented in the literature, emphasizing the fundamental attributes of training phases within learning-based solutions in the context of CI. Method: We conducted a Systematic Literature Review (SLR) involving 52 primary studies. Through statistical and thematic analyses, we explored the correlations between CI tasks and the training phases of learning-based methodologies across the selected studies, encompassing a spectrum from data engineering techniques to evaluation metrics. Results: This paper presents an analysis of the automation of CI tasks utilizing learning-based methods. We identify and analyze nine types of data sources, four steps in data preparation, four feature types, nine subsets of data features, five approaches for hyperparameter selection and tuning, and fifteen evaluation metrics. Furthermore, we discuss the latest techniques employed, existing gaps in CI task automation, and the characteristics of the utilized learning-based techniques. Conclusion: This study provides a comprehensive overview of learning-based methods in CI, offering valuable insights for researchers and practitioners developing CI task automation. It also highlights the need for further research to advance these methods in CI.
comment: This paper has been accepted to be published in IEEE Access
☆ Backdoor Attack in Prompt-Based Continual Learning
Prompt-based approaches offer a cutting-edge solution to data privacy issues in continual learning, particularly in scenarios involving multiple data suppliers where long-term storage of private user data is prohibited. Despite delivering state-of-the-art performance, its impressive remembering capability can become a double-edged sword, raising security concerns as it might inadvertently retain poisoned knowledge injected during learning from private user data. Following this insight, in this paper, we expose continual learning to a potential threat: backdoor attack, which drives the model to follow a desired adversarial target whenever a specific trigger is present while still performing normally on clean samples. We highlight three critical challenges in executing backdoor attacks on incremental learners and propose corresponding solutions: (1) \emph{Transferability}: We employ a surrogate dataset and manipulate prompt selection to transfer backdoor knowledge to data from other suppliers; (2) \emph{Resiliency}: We simulate static and dynamic states of the victim to ensure the backdoor trigger remains robust during intense incremental learning processes; and (3) \emph{Authenticity}: We apply binary cross-entropy loss as an anti-cheating factor to prevent the backdoor trigger from devolving into adversarial noise. Extensive experiments across various benchmark datasets and continual learners validate our continual backdoor framework, achieving up to $100\%$ attack success rate, with further ablation studies confirming our contributions' effectiveness.
☆ Classical Bandit Algorithms for Entanglement Detection in Parameterized Qubit States
Entanglement is a key resource for a wide range of tasks in quantum information and computing. Thus, verifying availability of this quantum resource is essential. Extensive research on entanglement detection has led to no-go theorems (Lu et al. [Phys. Rev. Lett., 116, 230501 (2016)]) that highlight the need for full state tomography (FST) in the absence of adaptive or joint measurements. Recent advancements, as proposed by Zhu, Teo, and Englert [Phys. Rev. A, 81, 052339, 2010], introduce a single-parameter family of entanglement witness measurements which are capable of conclusively detecting certain entangled states and only resort to FST when all witness measurements are inconclusive. We find a variety of realistic noisy two-qubit quantum states $\mathcal{F}$ that yield conclusive results under this witness family. We solve the problem of detecting entanglement among $K$ quantum states in $\mathcal{F}$, of which $m$ states are entangled, with $m$ potentially unknown. We recognize a structural connection of this problem to the Bad Arm Identification problem in stochastic Multi-Armed Bandits (MAB). In contrast to existing quantum bandit frameworks, we establish a new correspondence tailored for entanglement detection and term it the $(m,K)$-quantum Multi-Armed Bandit. We implement two well-known MAB policies for arbitrary states derived from $\mathcal{F}$, present theoretical guarantees on the measurement/sample complexity and demonstrate the practicality of the policies through numerical simulations. More broadly, this paper highlights the potential for employing classical machine learning techniques for quantum entanglement detection.
comment: 20 pages, 5 figures
☆ MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment
This paper introduces MM-Instruct, a large-scale dataset of diverse and high-quality visual instruction data designed to enhance the instruction-following capabilities of large multimodal models (LMMs). While existing visual instruction datasets often focus on question-answering, they struggle to generalize to broader application scenarios such as creative writing, summarization, or image analysis. To address these limitations, we propose a novel approach to constructing MM-Instruct that leverages the strong instruction-following capabilities of existing LLMs to generate novel visual instruction data from large-scale but conventional image captioning datasets. MM-Instruct first leverages ChatGPT to automatically generate diverse instructions from a small set of seed instructions through augmenting and summarization. It then matches these instructions with images and uses an open-sourced large language model (LLM) to generate coherent answers to the instruction-image pairs. The LLM is grounded by the detailed text descriptions of images in the whole answer generation process to guarantee the alignment of the instruction data. Moreover, we introduce a benchmark based on the generated instruction data to evaluate the instruction-following capabilities of existing LMMs. We demonstrate the effectiveness of MM-Instruct by training a LLaVA-1.5 model on the generated data, denoted as LLaVA-Instruct, which exhibits significant improvements in instruction-following capabilities compared to LLaVA-1.5 models. The MM-Instruct dataset, benchmark, and pre-trained models are available at https://github.com/jihaonew/MM-Instruct.
comment: Dataset and models are available at https://github.com/jihaonew/MM-Instruct
☆ EPOCH: Jointly Estimating the 3D Pose of Cameras and Humans
Monocular Human Pose Estimation (HPE) aims at determining the 3D positions of human joints from a single 2D image captured by a camera. However, a single 2D point in the image may correspond to multiple points in 3D space. Typically, the uniqueness of the 2D-3D relationship is approximated using an orthographic or weak-perspective camera model. In this study, instead of relying on approximations, we advocate for utilizing the full perspective camera model. This involves estimating camera parameters and establishing a precise, unambiguous 2D-3D relationship. To do so, we introduce the EPOCH framework, comprising two main components: the pose lifter network (LiftNet) and the pose regressor network (RegNet). LiftNet utilizes the full perspective camera model to precisely estimate the 3D pose in an unsupervised manner. It takes a 2D pose and camera parameters as inputs and produces the corresponding 3D pose estimation. These inputs are obtained from RegNet, which starts from a single image and provides estimates for the 2D pose and camera parameters. RegNet utilizes only 2D pose data as weak supervision. Internally, RegNet predicts a 3D pose, which is then projected to 2D using the estimated camera parameters. This process enables RegNet to establish the unambiguous 2D-3D relationship. Our experiments show that modeling the lifting as an unsupervised task with a camera in-the-loop results in better generalization to unseen data. We obtain state-of-the-art results for the 3D HPE on the Human3.6M and MPI-INF-3DHP datasets. Our code is available at: [Github link upon acceptance, see supplementary materials].
comment: 17 pages, 7 figures
☆ State Matching and Multiple References in Adaptive Active Automata Learning
Active automata learning (AAL) is a method to infer state machines by interacting with black-box systems. Adaptive AAL aims to reduce the sample complexity of AAL by incorporating domain specific knowledge in the form of (similar) reference models. Such reference models appear naturally when learning multiple versions or variants of a software system. In this paper, we present state matching, which allows flexible use of the structure of these reference models by the learner. State matching is the main ingredient of adaptive L#, a novel framework for adaptive learning, built on top of L#. Our empirical evaluation shows that adaptive L# improves the state of the art by up to two orders of magnitude.
comment: Extended paper for FM 2024
☆ CHASE: A Causal Heterogeneous Graph based Framework for Root Cause Analysis in Multimodal Microservice Systems
In recent years, the widespread adoption of distributed microservice architectures within the industry has significantly increased the demand for enhanced system availability and robustness. Due to the complex service invocation paths and dependencies at enterprise-level microservice systems, it is challenging to locate the anomalies promptly during service invocations, thus causing intractable issues for normal system operations and maintenance. In this paper, we propose a Causal Heterogeneous grAph baSed framEwork for root cause analysis, namely CHASE, for microservice systems with multimodal data, including traces, logs, and system monitoring metrics. Specifically, related information is encoded into representative embeddings and further modeled by a multimodal invocation graph. Following that, anomaly detection is performed on each instance node with attentive heterogeneous message passing from its adjacent metric and log nodes. Finally, CHASE learns from the constructed hypergraph with hyperedges representing the flow of causality and performs root cause localization. We evaluate the proposed framework on two public microservice datasets with distinct attributes and compare with the state-of-the-art methods. The results show that CHASE achieves the average performance gain up to 36.2%(A@1) and 29.4%(Percentage@1), respectively to its best counterpart.
☆ InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management OSDI 2024
Transformer-based large language models (LLMs) demonstrate impressive performance across various natural language processing tasks. Serving LLM inference for generating long contents, however, poses a challenge due to the enormous memory footprint of the transient state, known as the key-value (KV) cache, which scales with the sequence length and batch size. In this paper, we present InfiniGen, a novel KV cache management framework tailored for long-text generation, which synergistically works with modern offloading-based inference systems. InfiniGen leverages the key insight that a few important tokens that are essential for computing the subsequent attention layer in the Transformer can be speculated by performing a minimal rehearsal with the inputs of the current layer and part of the query weight and key cache of the subsequent layer. This allows us to prefetch only the essential KV cache entries (without fetching them all), thereby mitigating the fetch overhead from the host memory in offloading-based LLM serving systems. Our evaluation on several representative LLMs shows that InfiniGen improves the overall performance of a modern offloading-based system by up to 3.00x compared to prior KV cache management methods while offering substantially better model accuracy.
comment: OSDI 2024
☆ Less is More: Accurate Speech Recognition & Translation without Web-Scale Data
Recent advances in speech recognition and translation rely on hundreds of thousands of hours of Internet speech data. We argue that state-of-the art accuracy can be reached without relying on web-scale data. Canary - multilingual ASR and speech translation model, outperforms current state-of-the-art models - Whisper, OWSM, and Seamless-M4T on English, French, Spanish, and German languages, while being trained on an order of magnitude less data than these models. Three key factors enables such data-efficient model: (1) a FastConformer-based attention encoder-decoder architecture (2) training on synthetic data generated with machine translation and (3) advanced training techniques: data-balancing, dynamic data blending, dynamic bucketing and noise-robust fine-tuning. The model, weights, and training code will be open-sourced.
comment: Accepted at Interspeech-2024
☆ Function+Data Flow: A Framework to Specify Machine Learning Pipelines for Digital Twinning
The development of digital twins (DTs) for physical systems increasingly leverages artificial intelligence (AI), particularly for combining data from different sources or for creating computationally efficient, reduced-dimension models. Indeed, even in very different application domains, twinning employs common techniques such as model order reduction and modelization with hybrid data (that is, data sourced from both physics-based models and sensors). Despite this apparent generality, current development practices are ad-hoc, making the design of AI pipelines for digital twinning complex and time-consuming. Here we propose Function+Data Flow (FDF), a domain-specific language (DSL) to describe AI pipelines within DTs. FDF aims to facilitate the design and validation of digital twins. Specifically, FDF treats functions as first-class citizens, enabling effective manipulation of models learned with AI. We illustrate the benefits of FDF on two concrete use cases from different domains: predicting the plastic strain of a structure and modeling the electromagnetic behavior of a bearing.
comment: 10 pages, 5 figures, to be published in AIware'24
☆ Finite basis Kolmogorov-Arnold networks: domain decomposition for data-driven and physics-informed problems
Kolmogorov-Arnold networks (KANs) have attracted attention recently as an alternative to multilayer perceptrons (MLPs) for scientific machine learning. However, KANs can be expensive to train, even for relatively small networks. Inspired by finite basis physics-informed neural networks (FBPINNs), in this work, we develop a domain decomposition method for KANs that allows for several small KANs to be trained in parallel to give accurate solutions for multiscale problems. We show that finite basis KANs (FBKANs) can provide accurate results with noisy data and for physics-informed training.
☆ LLMEasyQuant -- An Easy to Use Toolkit for LLM Quantization
Currently, there are many quantization methods appeared for LLM quantization, yet few are user-friendly and easy to be deployed locally. Packages like TensorRT and Quantohave many underlying structures and self-invoking internal functions, which are not conducive to developers' personalized development and learning for deployment. Therefore, we develop LLMEasyQuant, it is a package aiming to for easy quantization deployment which is user-friendly and suitable for beginners' learning.
☆ ACES: Automatic Cohort Extraction System for Event-Stream Datasets
Reproducibility remains a significant challenge in machine learning (ML) for healthcare. In this field, datasets, model pipelines, and even task/cohort definitions are often private, leading to a significant barrier in sharing, iterating, and understanding ML results on electronic health record (EHR) datasets. In this paper, we address a significant part of this problem by introducing the Automatic Cohort Extraction System for Event-Stream Datasets (ACES). This tool is designed to simultaneously simplify the development of task/cohorts for ML in healthcare and enable the reproduction of these cohorts, both at an exact level for single datasets and at a conceptual level across datasets. To accomplish this, ACES provides (1) a highly intuitive and expressive configuration language for defining both dataset-specific concepts and dataset-agnostic inclusion/exclusion criteria, and (2) a pipeline to automatically extract patient records that meet these defined criteria from real-world data. ACES can be automatically applied to any dataset in either the Medical Event Data Standard (MEDS) or EventStreamGPT (ESGPT) formats, or to *any* dataset for which the necessary task-specific predicates can be extracted in an event-stream form. ACES has the potential to significantly lower the barrier to entry for defining ML tasks, redefine the way researchers interact with EHR datasets, and significantly improve the state of reproducibility for ML studies in this modality. ACES is available at https://github.com/justin13601/aces.
comment: For ACES Online Documentation, see https://eventstreamaces.readthedocs.io/en/latest/
☆ IDT: Dual-Task Adversarial Attacks for Privacy Protection
Natural language processing (NLP) models may leak private information in different ways, including membership inference, reconstruction or attribute inference attacks. Sensitive information may not be explicit in the text, but hidden in underlying writing characteristics. Methods to protect privacy can involve using representations inside models that are demonstrated not to detect sensitive attributes or -- for instance, in cases where users might not trust a model, the sort of scenario of interest here -- changing the raw text before models can have access to it. The goal is to rewrite text to prevent someone from inferring a sensitive attribute (e.g. the gender of the author, or their location by the writing style) whilst keeping the text useful for its original intention (e.g. the sentiment of a product review). The few works tackling this have focused on generative techniques. However, these often create extensively different texts from the original ones or face problems such as mode collapse. This paper explores a novel adaptation of adversarial attack techniques to manipulate a text to deceive a classifier w.r.t one task (privacy) whilst keeping the predictions of another classifier trained for another task (utility) unchanged. We propose IDT, a method that analyses predictions made by auxiliary and interpretable models to identify which tokens are important to change for the privacy task, and which ones should be kept for the utility task. We evaluate different datasets for NLP suitable for different tasks. Automatic and human evaluations show that IDT retains the utility of text, while also outperforming existing methods when deceiving a classifier w.r.t privacy task.
comment: 28 pages, 1 figure
☆ Enforcing Equity in Neural Climate Emulators
Neural network emulators have become an invaluable tool for a wide variety of climate and weather prediction tasks. While showing incredibly promising results, these networks do not have an inherent ability to produce equitable predictions. That is, they are not guaranteed to provide a uniform quality of prediction along any particular class or group of people. This potential for inequitable predictions motivates the need for explicit representations of fairness in these neural networks. To that end, we draw on methods for enforcing analytical physical constraints in neural networks to bias networks towards more equitable predictions. We demonstrate the promise of this methodology using the task of climate model emulation. Specifically, we propose a custom loss function which punishes emulators with unequal quality of predictions across any prespecified regions or category, here defined using human development index (HDI). This loss function weighs a standard loss metric such as mean squared error against another metric which captures inequity along the equity category (HDI), allowing us to adjust the priority of each term before training. Importantly, the loss function does not specify a particular definition of equity to bias the neural network towards, opening the door for custom fairness metrics. Our results show that neural climate emulators trained with our loss function provide more equitable predictions and that the equity metric improves with greater weighting in the loss function. We empirically demonstrate that while there is a tradeoff between accuracy and equity when prioritizing the latter during training, an appropriate selection of the equity priority hyperparameter can minimize loss of performance.
comment: 10 pages, 9 figures
☆ Model Predictive Simulation Using Structured Graphical Models and Transformers
We propose an approach to simulating trajectories of multiple interacting agents (road users) based on transformers and probabilistic graphical models (PGMs), and apply it to the Waymo SimAgents challenge. The transformer baseline is based on the MTR model, which predicts multiple future trajectories conditioned on the past trajectories and static road layout features. We then improve upon these generated trajectories using a PGM, which contains factors which encode prior knowledge, such as a preference for smooth trajectories, and avoidance of collisions with static obstacles and other moving agents. We perform (approximate) MAP inference in this PGM using the Gauss-Newton method. Finally we sample $K=32$ trajectories for each of the $N \sim 100$ agents for the next $T=8 \Delta$ time steps, where $\Delta=10$ is the sampling rate per second. Following the Model Predictive Control (MPC) paradigm, we only return the first element of our forecasted trajectories at each step, and then we replan, so that the simulation can constantly adapt to its changing environment. We therefore call our approach "Model Predictive Simulation" or MPS. We show that MPS improves upon the MTR baseline, especially in safety critical metrics such as collision rate. Furthermore, our approach is compatible with any underlying forecasting model, and does not require extra training, so we believe it is a valuable contribution to the community.
comment: Special Mention at the Waymo Sim Agents Challenge 2024
☆ Personalized Interpretation on Federated Learning: A Virtual Concepts approach
Tackling non-IID data is an open challenge in federated learning research. Existing FL methods, including robust FL and personalized FL, are designed to improve model performance without consideration of interpreting non-IID across clients. This paper aims to design a novel FL method to robust and interpret the non-IID data across clients. Specifically, we interpret each client's dataset as a mixture of conceptual vectors that each one represents an interpretable concept to end-users. These conceptual vectors could be pre-defined or refined in a human-in-the-loop process or be learnt via the optimization procedure of the federated learning system. In addition to the interpretability, the clarity of client-specific personalization could also be applied to enhance the robustness of the training process on FL system. The effectiveness of the proposed method have been validated on benchmark datasets.
☆ Data-Driven Lipschitz Continuity: A Cost-Effective Approach to Improve Adversarial Robustness
The security and robustness of deep neural networks (DNNs) have become increasingly concerning. This paper aims to provide both a theoretical foundation and a practical solution to ensure the reliability of DNNs. We explore the concept of Lipschitz continuity to certify the robustness of DNNs against adversarial attacks, which aim to mislead the network with adding imperceptible perturbations into inputs. We propose a novel algorithm that remaps the input domain into a constrained range, reducing the Lipschitz constant and potentially enhancing robustness. Unlike existing adversarially trained models, where robustness is enhanced by introducing additional examples from other datasets or generative models, our method is almost cost-free as it can be integrated with existing models without requiring re-training. Experimental results demonstrate the generalizability of our method, as it can be combined with various models and achieve enhancements in robustness. Furthermore, our method achieves the best robust accuracy for CIFAR10, CIFAR100, and ImageNet datasets on the RobustBench leaderboard.
☆ Machine-Learning-Driven Runtime Optimization of BLAS Level 3 on Modern Multi-Core Systems
BLAS Level 3 operations are essential for scientific computing, but finding the optimal number of threads for multi-threaded implementations on modern multi-core systems is challenging. We present an extension to the Architecture and Data-Structure Aware Linear Algebra (ADSALA) library that uses machine learning to optimize the runtime of all BLAS Level 3 operations. Our method predicts the best number of threads for each operation based on the matrix dimensions and the system architecture. We test our method on two HPC platforms with Intel and AMD processors, using MKL and BLIS as baseline BLAS implementations. We achieve speedups of 1.5 to 3.0 for all operations, compared to using the maximum number of threads. We also analyze the runtime patterns of different BLAS operations and explain the sources of speedup. Our work shows the effectiveness and generality of the ADSALA approach for optimizing BLAS routines on modern multi-core systems.
comment: Multi-Thread, Matrix Multiplication, Optimization, BLAS, Machine Learning
☆ ScoreFusion: fusing score-based generative models via Kullback-Leibler barycenters
We study the problem of fusing pre-trained (auxiliary) generative models to enhance the training of a target generative model. We propose using KL-divergence weighted barycenters as an optimal fusion mechanism, in which the barycenter weights are optimally trained to minimize a suitable loss for the target population. While computing the optimal KL-barycenter weights can be challenging, we demonstrate that this process can be efficiently executed using diffusion score training when the auxiliary generative models are also trained based on diffusion score methods. Moreover, we show that our fusion method has a dimension-free sample complexity in total variation distance provided that the auxiliary models are well fitted for their own task and the auxiliary tasks combined capture the target well. The main takeaway of our method is that if the auxiliary models are well-trained and can borrow features from each other that are present in the target, our fusion method significantly improves the training of generative models. We provide a concise computational implementation of the fusion algorithm, and validate its efficiency in the low-data regime with numerical experiments involving mixtures models and image datasets.
comment: 40 pages, 6 figures
☆ Stochastic Zeroth-Order Optimization under Strongly Convexity and Lipschitz Hessian: Minimax Sample Complexity
Optimization of convex functions under stochastic zeroth-order feedback has been a major and challenging question in online learning. In this work, we consider the problem of optimizing second-order smooth and strongly convex functions where the algorithm is only accessible to noisy evaluations of the objective function it queries. We provide the first tight characterization for the rate of the minimax simple regret by developing matching upper and lower bounds. We propose an algorithm that features a combination of a bootstrapping stage and a mirror-descent stage. Our main technical innovation consists of a sharp characterization for the spherical-sampling gradient estimator under higher-order smoothness conditions, which allows the algorithm to optimally balance the bias-variance tradeoff, and a new iterative method for the bootstrapping stage, which maintains the performance for unbounded Hessian.
☆ VarteX: Enhancing Weather Forecast through Distributed Variable Representation ICML 2024
Weather forecasting is essential for various human activities. Recent data-driven models have outperformed numerical weather prediction by utilizing deep learning in forecasting performance. However, challenges remain in efficiently handling multiple meteorological variables. This study proposes a new variable aggregation scheme and an efficient learning framework for that challenge. Experiments show that VarteX outperforms the conventional model in forecast performance, requiring significantly fewer parameters and resources. The effectiveness of learning through multiple aggregations and regional split training is demonstrated, enabling more efficient and accurate deep learning-based weather forecasting.
comment: ICML 2024, Workshop on Machine Learning for Earth System Modeling
☆ A Survey on Data Quality Dimensions and Tools for Machine Learning
Machine learning (ML) technologies have become substantial in practically all aspects of our society, and data quality (DQ) is critical for the performance, fairness, robustness, safety, and scalability of ML models. With the large and complex data in data-centric AI, traditional methods like exploratory data analysis (EDA) and cross-validation (CV) face challenges, highlighting the importance of mastering DQ tools. In this survey, we review 17 DQ evaluation and improvement tools in the last 5 years. By introducing the DQ dimensions, metrics, and main functions embedded in these tools, we compare their strengths and limitations and propose a roadmap for developing open-source DQ tools for ML. Based on the discussions on the challenges and emerging trends, we further highlight the potential applications of large language models (LLMs) and generative AI in DQ evaluation and improvement for ML. We believe this comprehensive survey can enhance understanding of DQ in ML and could drive progress in data-centric AI. A complete list of the literature investigated in this survey is available on GitHub at: https://github.com/haihua0913/awesome-dq4ml.
comment: This paper has been accepted by The 6th IEEE International Conference on Artificial Intelligence Testing (IEEE AITest 2024) as an invited paper
☆ A Survey on Deep Clustering: From the Prior Perspective
Facilitated by the powerful feature extraction ability of neural networks, deep clustering has achieved great success in analyzing high-dimensional and complex real-world data. The performance of deep clustering methods is affected by various factors such as network structures and learning objectives. However, as pointed out in this survey, the essence of deep clustering lies in the incorporation and utilization of prior knowledge, which is largely ignored by existing works. From pioneering deep clustering methods based on data structure assumptions to recent contrastive clustering methods based on data augmentation invariances, the development of deep clustering intrinsically corresponds to the evolution of prior knowledge. In this survey, we provide a comprehensive review of deep clustering methods by categorizing them into six types of prior knowledge. We find that in general the prior innovation follows two trends, namely, i) from mining to constructing, and ii) from internal to external. Besides, we provide a benchmark on five widely-used datasets and analyze the performance of methods with diverse priors. By providing a novel prior knowledge perspective, we hope this survey could provide some novel insights and inspire future research in the deep clustering community.
☆ Optimizing Cyber Defense in Dynamic Active Directories through Reinforcement Learning ESORICS
This paper addresses a significant gap in Autonomous Cyber Operations (ACO) literature: the absence of effective edge-blocking ACO strategies in dynamic, real-world networks. It specifically targets the cybersecurity vulnerabilities of organizational Active Directory (AD) systems. Unlike the existing literature on edge-blocking defenses which considers AD systems as static entities, our study counters this by recognizing their dynamic nature and developing advanced edge-blocking defenses through a Stackelberg game model between attacker and defender. We devise a Reinforcement Learning (RL)-based attack strategy and an RL-assisted Evolutionary Diversity Optimization-based defense strategy, where the attacker and defender improve each other strategy via parallel gameplay. To address the computational challenges of training attacker-defender strategies on numerous dynamic AD graphs, we propose an RL Training Facilitator that prunes environments and neural networks to eliminate irrelevant elements, enabling efficient and scalable training for large graphs. We extensively train the attacker strategy, as a sophisticated attacker model is essential for a robust defense. Our empirical results successfully demonstrate that our proposed approach enhances defender's proficiency in hardening dynamic AD graphs while ensuring scalability for large-scale AD.
comment: The manuscript has been accepted as full paper at European Symposium on Research in Computer Security (ESORICS) 2024
☆ Network Bending of Diffusion Models for Audio-Visual Generation
In this paper we present the first steps towards the creation of a tool which enables artists to create music visualizations using pre-trained, generative, machine learning models. First, we investigate the application of network bending, the process of applying transforms within the layers of a generative network, to image generation diffusion models by utilizing a range of point-wise, tensor-wise, and morphological operators. We identify a number of visual effects that result from various operators, including some that are not easily recreated with standard image editing tools. We find that this process allows for continuous, fine-grain control of image generation which can be helpful for creative applications. Next, we generate music-reactive videos using Stable Diffusion by passing audio features as parameters to network bending operators. Finally, we comment on certain transforms which radically shift the image and the possibilities of learning more about the latent space of Stable Diffusion based on these transforms.
comment: 8 pages, 5 figures, to be published in the proceedings of the 27th International Conference on Digital Audio Effects (DAFx24), for additional image and video examples see https://dzluke.github.io/DAFX2024/
☆ HarmonICA: Neural non-stationarity correction and source separation for motor neuron interfaces
A major outstanding problem when interfacing with spinal motor neurons is how to accurately compensate for non-stationary effects in the signal during source separation routines, particularly when they cannot be estimated in advance. This forces current systems to instead use undifferentiated bulk signal, which limits the potential degrees of freedom for control. In this study we propose a potential solution, using an unsupervised learning algorithm to blindly correct for the effects of latent processes which drive the signal non-stationarities. We implement this methodology within the theoretical framework of a quasilinear version of independent component analysis (ICA). The proposed design, HarmonICA, sidesteps the identifiability problems of nonlinear ICA, allowing for equivalent predictability to linear ICA whilst retaining the ability to learn complex nonlinear relationships between non-stationary latents and their effects on the signal. We test HarmonICA on both invasive and non-invasive recordings both simulated and real, demonstrating an ability to blindly compensate for the non-stationary effects specific to each, and thus to significantly enhance the quality of a source separation routine.
☆ FRED: Flexible REduction-Distribution Interconnect and Communication Implementation for Wafer-Scale Distributed Training of DNN Models
Distributed Deep Neural Network (DNN) training is a technique to reduce the training overhead by distributing the training tasks into multiple accelerators, according to a parallelization strategy. However, high-performance compute and interconnects are needed for maximum speed-up and linear scaling of the system. Wafer-scale systems are a promising technology that allows for tightly integrating high-end accelerators with high-speed wafer-scale interconnects, making it an attractive platform for distributed training. However, the wafer-scale interconnect should offer high performance and flexibility for various parallelization strategies to enable maximum optimizations for compute and memory usage. In this paper, we propose FRED, a wafer-scale interconnect that is tailored for the high-BW requirements of wafer-scale networks and can efficiently execute communication patterns of different parallelization strategies. Furthermore, FRED supports in-switch collective communication execution that reduces the network traffic by approximately 2X. Our results show that FRED can improve the average end-to-end training time of ResNet-152, Transformer-17B, GPT-3, and Transformer-1T by 1.76X, 1.87X, 1.34X, and 1.4X, respectively when compared to a baseline waferscale 2D-Mesh fabric.
♻ ☆ GEO: Generative Engine Optimization KDD 2024
The advent of large language models (LLMs) has ushered in a new paradigm of search engines that use generative models to gather and summarize information to answer user queries. This emerging technology, which we formalize under the unified framework of generative engines (GEs), can generate accurate and personalized responses, rapidly replacing traditional search engines like Google and Bing. Generative Engines typically satisfy queries by synthesizing information from multiple sources and summarizing them using LLMs. While this shift significantly improves $\textit{user}$ utility and $\textit{generative search engine}$ traffic, it poses a huge challenge for the third stakeholder -- website and content creators. Given the black-box and fast-moving nature of generative engines, content creators have little to no control over $\textit{when}$ and $\textit{how}$ their content is displayed. With generative engines here to stay, we must ensure the creator economy is not disadvantaged. To address this, we introduce Generative Engine Optimization (GEO), the first novel paradigm to aid content creators in improving their content visibility in generative engine responses through a flexible black-box optimization framework for optimizing and defining visibility metrics. We facilitate systematic evaluation by introducing GEO-bench, a large-scale benchmark of diverse user queries across multiple domains, along with relevant web sources to answer these queries. Through rigorous evaluation, we demonstrate that GEO can boost visibility by up to $40\%$ in generative engine responses. Moreover, we show the efficacy of these strategies varies across domains, underscoring the need for domain-specific optimization methods. Our work opens a new frontier in information discovery systems, with profound implications for both developers of generative engines and content creators.
comment: Accepted to KDD 2024
♻ ☆ Fishnets: Information-Optimal, Scalable Aggregation for Sets and Graphs
Set-based learning is an essential component of modern deep learning and network science. Graph Neural Networks (GNNs) and their edge-free counterparts Deepsets have proven remarkably useful on ragged and topologically challenging datasets. The key to learning informative embeddings for set members is a specified aggregation function, usually a sum, max, or mean. We propose Fishnets, an aggregation strategy for learning information-optimal embeddings for sets of data for both Bayesian inference and graph aggregation. We demonstrate that i) Fishnets neural summaries can be scaled optimally to an arbitrary number of data objects, ii) Fishnets aggregations are robust to changes in data distribution, unlike standard deepsets, iii) Fishnets saturate Bayesian information content and extend to regimes where MCMC techniques fail and iv) Fishnets can be used as a drop-in aggregation scheme within GNNs. We show that by adopting a Fishnets aggregation scheme for message passing, GNNs can achieve state-of-the-art performance versus architecture size on ogbn-protein data over existing benchmarks with a fraction of learnable parameters and faster training time.
comment: 15 pages, 6 figures, 2 tables. Submitted to JMLR
♻ ☆ Scalable Training of Graph Foundation Models for Atomistic Materials Modeling: A Case Study with HydraGNN
We present our work on developing and training scalable graph foundation models (GFM) using HydraGNN, a multi-headed graph convolutional neural network architecture. HydraGNN expands the boundaries of graph neural network (GNN) in both training scale and data diversity. It abstracts over message passing algorithms, allowing both reproduction of and comparison across algorithmic innovations that define convolution in GNNs. This work discusses a series of optimizations that have allowed scaling up the GFM training to tens of thousands of GPUs on datasets that consist of hundreds of millions of graphs. Our GFMs use multi-task learning (MTL) to simultaneously learn graph-level and node-level properties of atomistic structures, such as the total energy and atomic forces. Using over 150 million atomistic structures for training, we illustrate the performance of our approach along with the lessons learned on two United States Department of Energy (US-DOE) supercomputers, namely the Perlmutter petascale system at the National Energy Research Scientific Computing Center and the Frontier exascale system at Oak Ridge National Laboratory. The HydraGNN architecture enables the GFM to achieve near-linear strong scaling performance using more than 2,000 GPUs on Perlmutter and 16,000 GPUs on Frontier. Hyperparameter optimization (HPO) was performed on over 64,000 GPUs on Frontier to select GFM architectures with high accuracy. Early stopping was applied on each GFM architecture for energy awareness in performing such an extreme-scale task. The training of an ensemble of highest-ranked GFM architectures continued until convergence to establish uncertainty quantification (UQ) capabilities with ensemble learning. Our contribution opens the door for rapidly developing, training, and deploying GFMs using large-scale computational resources to enable AI-accelerated materials discovery and design.
comment: 16 pages, 13 figures
♻ ☆ ChIRAAG: ChatGPT Informed Rapid and Automated Assertion Generation
System Verilog Assertion (SVA) formulation -- a critical yet complex task is a prerequisite in the Assertion Based Verification (ABV) process. Traditionally, SVA formulation involves expert-driven interpretation of specifications, which is time-consuming and prone to human error. Recently, LLM-informed automatic assertion generation is gaining interest. We designed a novel framework called ChIRAAG, based on OpenAI GPT4, to generate SVA from natural language specifications of a design. ChIRAAG constitutes the systematic breakdown of design specifications into a standardized format, further generating assertions from formatted specifications using LLM. Furthermore, we used few test cases to validate the LLM-generated assertions. Automatic feedback of log messages from the simulation tool to the LLM ensures that the framework can generate correct SVAs. In our experiments, only 27% of LLM-generated raw assertions had errors, which was rectified in few iterations based on the simulation log. Our results on OpenTitan designs show that LLMs can streamline and assist engineers in the assertion generation process, reshaping verification workflows.
comment: 4 pages, 2 figures and 2 tables
♻ ☆ Solving Differential Equations using Physics-Informed Deep Equilibrium Models
This paper introduces Physics-Informed Deep Equilibrium Models (PIDEQs) for solving initial value problems (IVPs) of ordinary differential equations (ODEs). Leveraging recent advancements in deep equilibrium models (DEQs) and physics-informed neural networks (PINNs), PIDEQs combine the implicit output representation of DEQs with physics-informed training techniques. We validate PIDEQs using the Van der Pol oscillator as a benchmark problem, demonstrating their efficiency and effectiveness in solving IVPs. Our analysis includes key hyperparameter considerations for optimizing PIDEQ performance. By bridging deep learning and physics-based modeling, this work advances computational techniques for solving IVPs, with implications for scientific computing and engineering applications.
comment: Accepted at CASE 2024; Extended Sec. III.B
♻ ☆ The Impact of Feature Representation on the Accuracy of Photonic Neural Networks
Photonic Neural Networks (PNNs) are gaining significant interest in the research community due to their potential for high parallelization, low latency, and energy efficiency. PNNs compute using light, which leads to several differences in implementation when compared to electronics, such as the need to represent input features in the photonic domain before feeding them into the network. In this encoding process, it is common to combine multiple features into a single input to reduce the number of inputs and associated devices, leading to smaller and more energy-efficient PNNs. Although this alters the network's handling of input data, its impact on PNNs remains understudied. This paper addresses this open question, investigating the effect of commonly used encoding strategies that combine features on the performance and learning capabilities of PNNs. Here, using the concept of feature importance, we develop a mathematical methodology for analyzing feature combination. Through this methodology, we demonstrate that encoding multiple features together in a single input determines their relative importance, thus limiting the network's ability to learn from the data. Given some prior knowledge of the data, however, this can also be leveraged for higher accuracy. By selecting an optimal encoding method, we achieve up to a 12.3% improvement in accuracy of PNNs trained on the Iris dataset compared to other encoding techniques, surpassing the performance of networks where features are not combined. These findings highlight the importance of carefully choosing the encoding to the accuracy and decision-making strategies of PNNs, particularly in size or power constrained applications.
♻ ☆ Importance Weighted Expectation-Maximization for Protein Sequence Design
Designing protein sequences with desired biological function is crucial in biology and chemistry. Recent machine learning methods use a surrogate sequence-function model to replace the expensive wet-lab validation. How can we efficiently generate diverse and novel protein sequences with high fitness? In this paper, we propose IsEM-Pro, an approach to generate protein sequences towards a given fitness criterion. At its core, IsEM-Pro is a latent generative model, augmented by combinatorial structure features from a separately learned Markov random fields (MRFs). We develop an Monte Carlo Expectation-Maximization method (MCEM) to learn the model. During inference, sampling from its latent space enhances diversity while its MRFs features guide the exploration in high fitness regions. Experiments on eight protein sequence design tasks show that our IsEM-Pro outperforms the previous best methods by at least 55% on average fitness score and generates more diverse and novel protein sequences.
♻ ☆ A Simple Mixture Policy Parameterization for Improving Sample Efficiency of CVaR Optimization
Reinforcement learning algorithms utilizing policy gradients (PG) to optimize Conditional Value at Risk (CVaR) face significant challenges with sample inefficiency, hindering their practical applications. This inefficiency stems from two main facts: a focus on tail-end performance that overlooks many sampled trajectories, and the potential of gradient vanishing when the lower tail of the return distribution is overly flat. To address these challenges, we propose a simple mixture policy parameterization. This method integrates a risk-neutral policy with an adjustable policy to form a risk-averse policy. By employing this strategy, all collected trajectories can be utilized for policy updating, and the issue of vanishing gradients is counteracted by stimulating higher returns through the risk-neutral component, thus lifting the tail and preventing flatness. Our empirical study reveals that this mixture parameterization is uniquely effective across a variety of benchmark domains. Specifically, it excels in identifying risk-averse CVaR policies in some Mujoco environments where the traditional CVaR-PG fails to learn a reasonable policy.
comment: RLC 2024
♻ ☆ Robustness Assessment of a Runway Object Classifier for Safe Aircraft Taxiing SC
As deep neural networks (DNNs) are becoming the prominent solution for many computational problems, the aviation industry seeks to explore their potential in alleviating pilot workload and in improving operational safety. However, the use of DNNs in this type of safety-critical applications requires a thorough certification process. This need can be addressed through formal verification, which provides rigorous assurances -- e.g.,~by proving the absence of certain mispredictions. In this case-study paper, we demonstrate this process using an image-classifier DNN currently under development at Airbus and intended for use during the aircraft taxiing phase. We use formal methods to assess this DNN's robustness to three common image perturbation types: noise, brightness and contrast, and some of their combinations. This process entails multiple invocations of the underlying verifier, which might be computationally expensive; and we therefore propose a method that leverages the monotonicity of these robustness properties, as well as the results of past verification queries, in order to reduce the overall number of verification queries required by nearly 60%. Our results provide an indication of the level of robustness achieved by the DNN classifier under study, and indicate that it is considerably more vulnerable to noise than to brightness or contrast perturbations.
comment: This is a preprint version of the paper in the proceedings of 43rd Digital Avionics Systems Conference (DASC)
♻ ☆ Scaling laws for learning with real and surrogate data
Collecting large quantities of high-quality data can be prohibitively expensive or impractical, and a bottleneck in machine learning. One may instead augment a small set of $n$ data points from the target distribution with data from more accessible sources, e.g. data collected under different circumstances or synthesized by generative models. We refer to such data as `surrogate data.' We introduce a weighted empirical risk minimization (ERM) approach for integrating surrogate data into training. We analyze mathematically this method under several classical statistical models, and validate our findings empirically on datasets from different domains. Our main findings are: $(i)$ Integrating surrogate data can significantly reduce the test error on the original distribution. Surprisingly, this can happen even when the surrogate data is unrelated to the original ones. We trace back this behavior to the classical Stein's paradox. $(ii)$ In order to reap the benefit of surrogate data, it is crucial to use optimally weighted ERM. $(iii)$ The test error of models trained on mixtures of real and surrogate data is approximately described by a scaling law. This scaling law can be used to predict the optimal weighting scheme, and to choose the amount of surrogate data to add.
comment: Added new experiments
♻ ☆ Distributed Speculative Inference of Large Language Models
Accelerating the inference of large language models (LLMs) is an important challenge in artificial intelligence. This paper introduces distributed speculative inference (DSI), a novel distributed inference algorithm that is provably faster than speculative inference (SI) [leviathan2023fast, chen2023accelerating, miao2023specinfer] and traditional autoregressive inference (non-SI). Like other SI algorithms, DSI works on frozen LLMs, requiring no training or architectural modifications, and it preserves the target distribution. Prior studies on SI have demonstrated empirical speedups (compared to non-SI) but require a fast and accurate drafter LLM. In practice, off-the-shelf LLMs often do not have matching drafters that are sufficiently fast and accurate. We show a gap: SI gets slower than non-SI when using slower or less accurate drafters. We close this gap by proving that DSI is faster than both SI and non-SI given any drafters. By orchestrating multiple instances of the target and drafters, DSI is not only faster than SI but also supports LLMs that cannot be accelerated with SI. Our simulations show speedups of off-the-shelf LLMs in realistic settings: DSI is 1.29-1.92x faster than SI.
♻ ☆ Scalable Bayesian uncertainty quantification with data-driven priors for radio interferometric imaging
Next-generation radio interferometers like the Square Kilometer Array have the potential to unlock scientific discoveries thanks to their unprecedented angular resolution and sensitivity. One key to unlocking their potential resides in handling the deluge and complexity of incoming data. This challenge requires building radio interferometric imaging methods that can cope with the massive data sizes and provide high-quality image reconstructions with uncertainty quantification (UQ). This work proposes a method coined QuantifAI to address UQ in radio-interferometric imaging with data-driven (learned) priors for high-dimensional settings. Our model, rooted in the Bayesian framework, uses a physically motivated model for the likelihood. The model exploits a data-driven convex prior, which can encode complex information learned implicitly from simulations and guarantee the log-concavity of the posterior. We leverage probability concentration phenomena of high-dimensional log-concave posteriors that let us obtain information about the posterior, avoiding MCMC sampling techniques. We rely on convex optimisation methods to compute the MAP estimation, which is known to be faster and better scale with dimension than MCMC sampling strategies. Our method allows us to compute local credible intervals, i.e., Bayesian error bars, and perform hypothesis testing of structure on the reconstructed image. In addition, we propose a novel blazing-fast method to compute pixel-wise uncertainties at different scales. We demonstrate our method by reconstructing radio-interferometric images in a simulated setting and carrying out fast and scalable UQ, which we validate with MCMC sampling. Our method shows an improved image quality and more meaningful uncertainties than the benchmark method based on a sparsity-promoting prior. QuantifAI's source code: https://github.com/astro-informatics/QuantifAI.
comment: 30 pages, 14 figures, 10 tables, code available at https://github.com/astro-informatics/QuantifAI
♻ ☆ Dynamic planning in hierarchical active inference
By dynamic planning, we refer to the ability of the human brain to infer and impose motor trajectories related to cognitive decisions. A recent paradigm, active inference, brings fundamental insights into the adaptation of biological organisms, constantly striving to minimize prediction errors to restrict themselves to life-compatible states. Over the past years, many studies have shown how human and animal behavior could be explained in terms of an active inferential process - either as discrete decision-making or continuous motor control - inspiring innovative solutions in robotics and artificial intelligence. Still, the literature lacks a comprehensive outlook on how to effectively plan actions in changing environments. Setting ourselves the goal of modeling tool use, we delve into the topic of dynamic planning in active inference, keeping in mind two crucial aspects of biological goal-directed behavior: the capacity to understand and exploit affordances for object manipulation, and to learn the hierarchical interactions between the self and the environment, including other agents. We start from a simple unit and gradually describe more advanced structures, comparing recently proposed design choices and providing basic examples for each section. This study distances itself from traditional views centered on neural networks and reinforcement learning, and points toward a yet unexplored direction in active inference: hybrid representations in hierarchical models.
♻ ☆ Digital Twin Calibration for Biological System-of-Systems: Cell Culture Manufacturing Process
Biomanufacturing innovation relies on an efficient Design of Experiments (DoEs) to optimize processes and product quality. Traditional DoE methods, ignoring the underlying bioprocessing mechanisms, often suffer from a lack of interpretability and sample efficiency. This limitation motivates us to create a new optimal learning approach for digital twin model calibration. In this study, we consider the cell culture process multi-scale mechanistic model, also known as Biological System-of-Systems (Bio-SoS). This model with a modular design, composed of sub-models, allows us to integrate data across various production processes. To calibrate the Bio-SoS digital twin, we evaluate the mean squared error of model prediction and develop a computational approach to quantify the impact of parameter estimation error of individual sub-models on the prediction accuracy of digital twin, which can guide sample-efficient and interpretable DoEs.
comment: 11 pages, 5 figures
♻ ☆ Nearest Neighbor Sampling for Covariate Shift Adaptation
Many existing covariate shift adaptation methods estimate sample weights given to loss values to mitigate the gap between the source and the target distribution. However, estimating the optimal weights typically involves computationally expensive matrix inversion and hyper-parameter tuning. In this paper, we propose a new covariate shift adaptation method which avoids estimating the weights. The basic idea is to directly work on unlabeled target data, labeled according to the $k$-nearest neighbors in the source dataset. Our analysis reveals that setting $k = 1$ is an optimal choice. This property removes the necessity of tuning the only hyper-parameter $k$ and leads to a running time quasi-linear in the sample size. Our results include sharp rates of convergence for our estimator, with a tight control of the mean square error and explicit constants. In particular, the variance of our estimators has the same rate of convergence as for standard parametric estimation despite their non-parametric nature. The proposed estimator shares similarities with some matching-based treatment effect estimators used, e.g., in biostatistics, econometrics, and epidemiology. Our experiments show that it achieves drastic reduction in the running time with remarkable accuracy.
♻ ☆ Latent variable model for high-dimensional point process with structured missingness
Longitudinal data are important in numerous fields, such as healthcare, sociology and seismology, but real-world datasets present notable challenges for practitioners because they can be high-dimensional, contain structured missingness patterns, and measurement time points can be governed by an unknown stochastic process. While various solutions have been suggested, the majority of them have been designed to account for only one of these challenges. In this work, we propose a flexible and efficient latent-variable model that is capable of addressing all these limitations. Our approach utilizes Gaussian processes to capture temporal correlations between samples and their associated missingness masks as well as to model the underlying point process. We construct our model as a variational autoencoder together with deep neural network parameterised encoder and decoder models, and develop a scalable amortised variational inference approach for efficient model training. We demonstrate competitive performance using both simulated and real datasets.
♻ ☆ Catastrophic-risk-aware reinforcement learning with extreme-value-theory-based policy gradients
This paper tackles the problem of mitigating catastrophic risk (which is risk with very low frequency but very high severity) in the context of a sequential decision making process. This problem is particularly challenging due to the scarcity of observations in the far tail of the distribution of cumulative costs (negative rewards). A policy gradient algorithm is developed, that we call POTPG. It is based on approximations of the tail risk derived from extreme value theory. Numerical experiments highlight the out-performance of our method over common benchmarks, relying on the empirical distribution. An application to financial risk management, more precisely to the dynamic hedging of a financial option, is presented.
comment: The Python code to replicate the various numerical experiments of this paper is available at https://github.com/parisadavar/EVT-policy-gradient-RL
♻ ☆ Tracking Object Positions in Reinforcement Learning: A Metric for Keypoint Detection (extended version)
Reinforcement learning (RL) for robot control typically requires a detailed representation of the environment state, including information about task-relevant objects not directly measurable. Keypoint detectors, such as spatial autoencoders (SAEs), are a common approach to extracting a low-dimensional representation from high-dimensional image data. SAEs aim at spatial features such as object positions, which are often useful representations in robotic RL. However, whether an SAE is actually able to track objects in the scene and thus yields a spatial state representation well suited for RL tasks has rarely been examined due to a lack of established metrics. In this paper, we propose to assess the performance of an SAE instance by measuring how well keypoints track ground truth objects in images. We present a computationally lightweight metric and use it to evaluate common baseline SAE architectures on image data from a simulated robot task. We find that common SAEs differ substantially in their spatial extraction capability. Furthermore, we validate that SAEs that perform well in our metric achieve superior performance when used in downstream RL. Thus, our metric is an effective and lightweight indicator of RL performance before executing expensive RL training. Building on these insights, we identify three key modifications of SAE architectures to improve tracking performance. We make our code available at anonymous.4open.science/r/sae-rl.
comment: 19 pages, 12 figures
♻ ☆ The G-invariant graph Laplacian
Graph Laplacian based algorithms for data lying on a manifold have been proven effective for tasks such as dimensionality reduction, clustering, and denoising. In this work, we consider data sets whose data points lie on a manifold that is closed under the action of a known unitary matrix Lie group G. We propose to construct the graph Laplacian by incorporating the distances between all the pairs of points generated by the action of G on the data set. We deem the latter construction the ``G-invariant Graph Laplacian'' (G-GL). We show that the G-GL converges to the Laplace-Beltrami operator on the data manifold, while enjoying a significantly improved convergence rate compared to the standard graph Laplacian which only utilizes the distances between the points in the given data set. Furthermore, we show that the G-GL admits a set of eigenfunctions that have the form of certain products between the group elements and eigenvectors of certain matrices, which can be estimated from the data efficiently using FFT-type algorithms. We demonstrate our construction and its advantages on the problem of filtering data on a noisy manifold closed under the action of the special unitary group SU(2).
♻ ☆ Learning Decision Policies with Instrumental Variables through Double Machine Learning ICML 2024
A common issue in learning decision-making policies in data-rich settings is spurious correlations in the offline dataset, which can be caused by hidden confounders. Instrumental variable (IV) regression, which utilises a key unconfounded variable known as the instrument, is a standard technique for learning causal relationships between confounded action, outcome, and context variables. Most recent IV regression algorithms use a two-stage approach, where a deep neural network (DNN) estimator learnt in the first stage is directly plugged into the second stage, in which another DNN is used to estimate the causal effect. Naively plugging the estimator can cause heavy bias in the second stage, especially when regularisation bias is present in the first stage estimator. We propose DML-IV, a non-linear IV regression method that reduces the bias in two-stage IV regressions and effectively learns high-performing policies. We derive a novel learning objective to reduce bias and design the DML-IV algorithm following the double/debiased machine learning (DML) framework. The learnt DML-IV estimator has strong convergence rate and $O(N^{-1/2})$ suboptimality guarantees that match those when the dataset is unconfounded. DML-IV outperforms state-of-the-art IV regression methods on IV regression benchmarks and learns high-performing policies in the presence of instruments.
comment: Accepted at ICML 2024
♻ ☆ MatText: Do Language Models Need More than Text & Scale for Materials Modeling?
Effectively representing materials as text has the potential to leverage the vast advancements of large language models (LLMs) for discovering new materials. While LLMs have shown remarkable success in various domains, their application to materials science remains underexplored. A fundamental challenge is the lack of understanding of how to best utilize text-based representations for materials modeling. This challenge is further compounded by the absence of a comprehensive benchmark to rigorously evaluate the capabilities and limitations of these text representations in capturing the complexity of material systems. To address this gap, we propose MatText, a suite of benchmarking tools and datasets designed to systematically evaluate the performance of language models in modeling materials. MatText encompasses nine distinct text-based representations for material systems, including several novel representations. Each representation incorporates unique inductive biases that capture relevant information and integrate prior physical knowledge about materials. Additionally, MatText provides essential tools for training and benchmarking the performance of language models in the context of materials science. These tools include standardized dataset splits for each representation, probes for evaluating sensitivity to geometric factors, and tools for seamlessly converting crystal structures into text. Using MatText, we conduct an extensive analysis of the capabilities of language models in modeling materials. Our findings reveal that current language models consistently struggle to capture the geometric information crucial for materials modeling across all representations. Instead, these models tend to leverage local information, which is emphasized in some of our novel representations. Our analysis underscores MatText's ability to reveal shortcomings of text-based methods for materials design.
♻ ☆ The Intelligible and Effective Graph Neural Additive Networks
Graph Neural Networks (GNNs) have emerged as the predominant approach for learning over graph-structured data. However, most GNNs operate as black-box models and require post-hoc explanations, which may not suffice in high-stakes scenarios where transparency is crucial. In this paper, we present a GNN that is interpretable by design. Our model, Graph Neural Additive Network (GNAN), is a novel extension of the interpretable class of Generalized Additive Models, and can be visualized and fully understood by humans. GNAN is designed to be fully interpretable, allowing both global and local explanations at the feature and graph levels through direct visualization of the model. These visualizations describe the exact way the model uses the relationships between the target variable, the features, and the graph. We demonstrate the intelligibility of GNANs in a series of examples on different tasks and datasets. In addition, we show that the accuracy of GNAN is on par with black-box GNNs, making it suitable for critical applications where transparency is essential, alongside high accuracy.
♻ ☆ LatentExplainer: Explaining Latent Representations in Deep Generative Models with Multi-modal Foundation Models
Deep generative models like VAEs and diffusion models have advanced various generation tasks by leveraging latent variables to learn data distributions and generate high-quality samples. Despite the field of explainable AI making strides in interpreting machine learning models, understanding latent variables in generative models remains challenging. This paper introduces LatentExplainer, a framework for automatically generating semantically meaningful explanations of latent variables in deep generative models. LatentExplainer tackles three main challenges: inferring the meaning of latent variables, aligning explanations with inductive biases, and handling varying degrees of explainability. By perturbing latent variables and interpreting changes in generated data, the framework provides a systematic approach to understanding and controlling the data generation process, enhancing the transparency and interpretability of deep generative models. We evaluate our proposed method on several real-world and synthetic datasets, and the results demonstrate superior performance in generating high-quality explanations of latent variables.
♻ ☆ Towards Learning Stochastic Population Models by Gradient Descent
Increasing effort is put into the development of methods for learning mechanistic models from data. This task entails not only the accurate estimation of parameters but also a suitable model structure. Recent work on the discovery of dynamical systems formulates this problem as a linear equation system. Here, we explore several simulation-based optimization approaches, which allow much greater freedom in the objective formulation and weaker conditions on the available data. We show that even for relatively small stochastic population models, simultaneous estimation of parameters and structure poses major challenges for optimization procedures. Particularly, we investigate the application of the local stochastic gradient descent method, commonly used for training machine learning models. We demonstrate accurate estimation of models but find that enforcing the inference of parsimonious, interpretable models drastically increases the difficulty. We give an outlook on how this challenge can be overcome.
comment: 5 pages, 2 figures
♻ ☆ Deep Maxout Network-based Feature Fusion and Political Tangent Search Optimizer enabled Transfer Learning for Thalassemia Detection
Thalassemia is a heritable blood disorder which is the outcome of a genetic defect causing lack of production of hemoglobin polypeptide chains. However, there is less understanding of the precise frequency as well as sharing in these areas. Knowing about the frequency of thalassemia occurrence and dependable mutations is thus a significant step in preventing, controlling, and treatment planning. Here, Political Tangent Search Optimizer based Transfer Learning (PTSO_TL) is introduced for thalassemia detection. Initially, input data obtained from a particular dataset is normalized in the data normalization stage. Quantile normalization is utilized in the data normalization stage, and the data are then passed to the feature fusion phase, in which Weighted Euclidean Distance with Deep Maxout Network (DMN) is utilized. Thereafter, data augmentation is performed using the oversampling method to increase data dimensionality. Lastly, thalassemia detection is carried out by TL, wherein a convolutional neural network (CNN) is utilized with hyperparameters from a trained model such as Xception. TL is tuned by PTSO, and the training algorithm PTSO is presented by merging of Political Optimizer (PO) and Tangent Search Algorithm (TSA). Furthermore, PTSO_TL obtained maximal precision, recall, and f-measure values of about 94.3%, 96.1%, and 95.2%, respectively.
♻ ☆ MALIBO: Meta-learning for Likelihood-free Bayesian Optimization
Bayesian optimization (BO) is a popular method to optimize costly black-box functions. While traditional BO optimizes each new target task from scratch, meta-learning has emerged as a way to leverage knowledge from related tasks to optimize new tasks faster. However, existing meta-learning BO methods rely on surrogate models that suffer from scalability issues and are sensitive to observations with different scales and noise types across tasks. Moreover, they often overlook the uncertainty associated with task similarity. This leads to unreliable task adaptation when only limited observations are obtained or when the new tasks differ significantly from the related tasks. To address these limitations, we propose a novel meta-learning BO approach that bypasses the surrogate model and directly learns the utility of queries across tasks. Our method explicitly models task uncertainty and includes an auxiliary model to enable robust adaptation to new tasks. Extensive experiments show that our method demonstrates strong anytime performance and outperforms state-of-the-art meta-learning BO methods in various benchmarks.
♻ ☆ PDFA Distillation via String Probability Queries
Probabilistic deterministic finite automata (PDFA) are discrete event systems modeling conditional probabilities over languages: Given an already seen sequence of tokens they return the probability of tokens of interest to appear next. These types of models have gained interest in the domain of explainable machine learning, where they are used as surrogate models for neural networks trained as language models. In this work we present an algorithm to distill PDFA from neural networks. Our algorithm is a derivative of the L# algorithm and capable of learning PDFA from a new type of query, in which the algorithm infers conditional probabilities from the probability of the queried string to occur. We show its effectiveness on a recent public dataset by distilling PDFA from a set of trained neural networks.
comment: LearnAUT 2024
♻ ☆ Generative AI-Driven Human Digital Twin in IoT-Healthcare: A Comprehensive Survey
The Internet of things (IoT) can significantly enhance the quality of human life, specifically in healthcare, attracting extensive attentions to IoT-healthcare services. Meanwhile, the human digital twin (HDT) is proposed as an innovative paradigm that can comprehensively characterize the replication of the individual human body in the digital world and reflect its physical status in real time. Naturally, HDT is envisioned to empower IoT-healthcare beyond the application of healthcare monitoring by acting as a versatile and vivid human digital testbed, simulating the outcomes and guiding the practical treatments. However, successfully establishing HDT requires high-fidelity virtual modeling and strong information interactions but possibly with scarce, biased and noisy data. Fortunately, a recent popular technology called generative artificial intelligence (GAI) may be a promising solution because it can leverage advanced AI algorithms to automatically create, manipulate, and modify valuable while diverse data. This survey particularly focuses on the implementation of GAI-driven HDT in IoT-healthcare. We start by introducing the background of IoT-healthcare and the potential of GAI-driven HDT. Then, we delve into the fundamental techniques and present the overall framework of GAI-driven HDT. After that, we explore the realization of GAI-driven HDT in detail, including GAI-enabled data acquisition, communication, data management, digital modeling, and data analysis. Besides, we discuss typical IoT-healthcare applications that can be revolutionized by GAI-driven HDT, namely personalized health monitoring and diagnosis, personalized prescription, and personalized rehabilitation. Finally, we conclude this survey by highlighting some future research directions.
♻ ☆ Networked Communication for Decentralised Agents in Mean-Field Games
We introduce networked communication to the mean-field game framework, in particular to oracle-free settings where $N$ decentralised agents learn along a single, non-episodic run of the empirical system. We prove that our architecture, with only a few reasonable assumptions about network structure, has sample guarantees bounded between those of the centralised- and independent-learning cases. We discuss how the sample guarantees of the three theoretical algorithms do not actually result in practical convergence. We therefore show that in practical settings where the theoretical parameters are not observed (leading to poor estimation of the Q-function), our communication scheme significantly accelerates convergence over the independent case (and often even the centralised case), without relying on the assumption of a centralised learner. We contribute further practical enhancements to all three theoretical algorithms, allowing us to present their first empirical demonstrations. Our experiments confirm that we can remove several of the theoretical assumptions of the algorithms, and display the empirical convergence benefits brought by our new networked communication. We additionally show that the networked approach has significant advantages, over both the centralised and independent alternatives, in terms of robustness to unexpected learning failures and to changes in population size.
♻ ☆ Kandinsky 3.0 Technical Report
We present Kandinsky 3.0, a large-scale text-to-image generation model based on latent diffusion, continuing the series of text-to-image Kandinsky models and reflecting our progress to achieve higher quality and realism of image generation. In this report we describe the architecture of the model, the data collection procedure, the training technique, and the production system for user interaction. We focus on the key components that, as we have identified as a result of a large number of experiments, had the most significant impact on improving the quality of our model compared to the others. We also describe extensions and applications of our model, including super resolution, inpainting, image editing, image-to-video generation, and a distilled version of Kandinsky 3.0 - Kandinsky 3.1, which does inference in 4 steps of the reverse process and 20 times faster without visual quality decrease. By side-by-side human preferences comparison, Kandinsky becomes better in text understanding and works better on specific domains. The code is available at https://github.com/ai-forever/Kandinsky-3
comment: Project page: https://ai-forever.github.io/Kandinsky-3
♻ ☆ Straggler-Resilient Differentially-Private Decentralized Learning
We consider the straggler problem in decentralized learning over a logical ring while preserving user data privacy. Especially, we extend the recently proposed framework of differential privacy (DP) amplification by decentralization by Cyffers and Bellet to include overall training latency--comprising both computation and communication latency. Analytical results on both the convergence speed and the DP level are derived for both a skipping scheme (which ignores the stragglers after a timeout) and a baseline scheme that waits for each node to finish before the training continues. A trade-off between overall training latency, accuracy, and privacy, parameterized by the timeout of the skipping scheme, is identified and empirically validated for logistic regression on a real-world dataset and for image classification using the MNIST and CIFAR-10 datasets.
comment: To appear in the IEEE Journal on Selected Areas in Information Theory (special issue on Information-Theoretic Methods for Trustworthy and Reliable Machine Learning)
♻ ☆ M2Lingual: Enhancing Multilingual, Multi-Turn Instruction Alignment in Large Language Models
Instruction finetuning (IFT) is critical for aligning Large Language Models (LLMs) to follow instructions. While many effective IFT datasets have been introduced recently, they predominantly focus on high-resource languages like English. To better align LLMs across a broad spectrum of languages and tasks, we propose a fully synthetic, novel taxonomy (Evol) guided Multilingual, Multi-turn instruction finetuning dataset, called M2Lingual. It is constructed by first selecting a diverse set of seed examples and then utilizing the proposed Evol taxonomy to convert these seeds into complex and challenging multi-turn instructions. We demonstrate the effectiveness of M2Lingual by training LLMs of varying sizes and showcasing the enhanced performance across a diverse set of languages. We contribute the 2 step Evol taxonomy with the guided generation code: https://github.com/ServiceNow/M2Lingual, as well as the first fully synthetic, general and task-oriented, multi-turn, multilingual dataset built with Evol - M2Lingual: https://huggingface.co/datasets/ServiceNow-AI/ M2Lingual - containing 182K total IFT pairs, covering 70 languages and 17+ NLP tasks.
comment: 39 pages
♻ ☆ Optimal Rate of Kernel Regression in Large Dimensions
We perform a study on kernel regression for large-dimensional data (where the sample size $n$ is polynomially depending on the dimension $d$ of the samples, i.e., $n\asymp d^{\gamma}$ for some $\gamma >0$ ). We first build a general tool to characterize the upper bound and the minimax lower bound of kernel regression for large dimensional data through the Mendelson complexity $\varepsilon_{n}^{2}$ and the metric entropy $\bar{\varepsilon}_{n}^{2}$ respectively. When the target function falls into the RKHS associated with a (general) inner product model defined on $\mathbb{S}^{d}$, we utilize the new tool to show that the minimax rate of the excess risk of kernel regression is $n^{-1/2}$ when $n\asymp d^{\gamma}$ for $\gamma =2, 4, 6, 8, \cdots$. We then further determine the optimal rate of the excess risk of kernel regression for all the $\gamma>0$ and find that the curve of optimal rate varying along $\gamma$ exhibits several new phenomena including the multiple descent behavior and the periodic plateau behavior. As an application, For the neural tangent kernel (NTK), we also provide a similar explicit description of the curve of optimal rate. As a direct corollary, we know these claims hold for wide neural networks as well.
♻ ☆ BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
In this paper, we present a new embedding model, called M3-Embedding, which is distinguished for its versatility in Multi-Linguality, Multi-Functionality, and Multi-Granularity. It can support more than 100 working languages, leading to new state-of-the-art performances on multi-lingual and cross-lingual retrieval tasks. It can simultaneously perform the three common retrieval functionalities of embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval, which provides a unified model foundation for real-world IR applications. It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens. The effective training of M3-Embedding involves the following technical contributions. We propose a novel self-knowledge distillation approach, where the relevance scores from different retrieval functionalities can be integrated as the teacher signal to enhance the training quality. We also optimize the batching strategy, enabling a large batch size and high training throughput to ensure the discriminativeness of embeddings. To the best of our knowledge, M3-Embedding is the first embedding model which realizes such a strong versatility. The model and code will be publicly available at https://github.com/FlagOpen/FlagEmbedding.
♻ ☆ Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach
Self-supervised features are the cornerstone of modern machine learning systems. They are typically pre-trained on data collections whose construction and curation typically require extensive human effort. This manual process has some limitations similar to those encountered in supervised learning, e.g., the crowd-sourced selection of data is costly and time-consuming, preventing scaling the dataset size. In this work, we consider the problem of automatic curation of high-quality datasets for self-supervised pre-training. We posit that such datasets should be large, diverse and balanced, and propose a clustering-based approach for building ones satisfying all these criteria. Our method involves successive and hierarchical applications of $k$-means on a large and diverse data repository to obtain clusters that distribute uniformly among data concepts, followed by a hierarchical, balanced sampling step from these clusters. Extensive experiments on three different data domains including web-based images, satellite images and text show that features trained on our automatically curated datasets outperform those trained on uncurated data while being on par or better than ones trained on manually curated data. Code is available at https://github.com/facebookresearch/ssl-data-curation.
♻ ☆ Large Language Model Enhanced Clustering for News Event Detection
The news landscape is continuously evolving, with an ever-increasing volume of information from around the world. Automated event detection within this vast data repository is essential for monitoring, identifying, and categorizing significant news occurrences across diverse platforms. This paper presents an event detection framework that leverages Large Language Models (LLMs) combined with clustering analysis to detect news events from the Global Database of Events, Language, and Tone (GDELT). The framework enhances event clustering through both pre-event detection tasks (keyword extraction and text embedding) and post-event detection tasks (event summarization and topic labelling). We also evaluate the impact of various textual embeddings on the quality of clustering outcomes, ensuring robust news categorization. Additionally, we introduce a novel Cluster Stability Assessment Index (CSAI) to assess the validity and robustness of clustering results. CSAI utilizes multiple feature vectors to provide a new way of measuring clustering quality. Our experiments indicate that the use of LLM embedding in the event detection framework has significantly improved the results, demonstrating greater robustness in terms of CSAI scores. Moreover, post-event detection tasks generate meaningful insights, facilitating effective interpretation of event clustering results. Overall, our experimental results indicate that the proposed framework offers valuable insights and could enhance the accuracy in news analysis and reporting.
♻ ☆ Effort and Size Estimation in Software Projects with Large Language Model-based Intelligent Interfaces
The advancement of Large Language Models (LLM) has also resulted in an equivalent proliferation in its applications. Software design, being one, has gained tremendous benefits in using LLMs as an interface component that extends fixed user stories. However, inclusion of LLM-based AI agents in software design often poses unexpected challenges, especially in the estimation of development efforts. Through the example of UI-based user stories, we provide a comparison against traditional methods and propose a new way to enhance specifications of natural language-based questions that allows for the estimation of development effort by taking into account data sources, interfaces and algorithms.
♻ ☆ SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention
Large language models (LLMs) now support extremely long context windows, but the quadratic complexity of vanilla attention results in significantly long Time-to-First-Token (TTFT) latency. Existing approaches to address this complexity require additional pretraining or finetuning, and often sacrifice model accuracy. In this paper, we first provide both theoretical and empirical foundations for near-lossless sparse attention. We find dynamically capturing head-specific sparse patterns at runtime with low overhead is crucial. To address this, we propose SampleAttention, an adaptive structured and near-lossless sparse attention. Leveraging observed significant sparse patterns, SampleAttention attends to a fixed percentage of adjacent tokens to capture local window patterns, and employs a two-stage query-guided key-value filtering approach, which adaptively select a minimum set of key-values with low overhead, to capture column stripe patterns. Comprehensive evaluations show that SampleAttention can seamlessly replace vanilla attention in off-the-shelf LLMs with nearly no accuracy loss, and reduces TTFT by up to $2.42\times$ compared with FlashAttention.
♻ ☆ ULLER: A Unified Language for Learning and Reasoning
The field of neuro-symbolic artificial intelligence (NeSy), which combines learning and reasoning, has recently experienced significant growth. There now are a wide variety of NeSy frameworks, each with its own specific language for expressing background knowledge and how to relate it to neural networks. This heterogeneity hinders accessibility for newcomers and makes comparing different NeSy frameworks challenging. We propose a language for NeSy, which we call ULLER, a Unfied Language for LEarning and Reasoning. ULLER encompasses a wide variety of settings, while ensuring that knowledge described in it can be used in existing NeSy systems. ULLER has a first-order logic syntax specialised for NeSy for which we provide example semantics including classical FOL, fuzzy logic, and probabilistic logic. We believe ULLER is a first step towards making NeSy research more accessible and comparable, paving the way for libraries that streamline training and evaluation across a multitude of semantics, knowledge bases, and NeSy systems.
comment: Accepted at NeSy 2024
♻ ☆ Position: Explain to Question not to Justify
Explainable Artificial Intelligence (XAI) is a young but very promising field of research. Unfortunately, the progress in this field is currently slowed down by divergent and incompatible goals. We separate various threads tangled within the area of XAI into two complementary cultures of human/value-oriented explanations (BLUE XAI) and model/validation-oriented explanations (RED XAI). This position paper argues that the area of RED XAI is currently under-explored, i.e., more methods for explainability are desperately needed to question models (e.g., extract knowledge from well-performing models as well as spotting and fixing bugs in faulty models), and the area of RED XAI hides great opportunities and potential for important research necessary to ensure the safety of AI systems. We conclude this paper by presenting promising challenges in this area.
♻ ☆ Automatic Regularization for Linear MMSE Filters
In this work, we consider the problem of regularization in the design of minimum mean square error (MMSE) linear filters. Using the relationship with statistical machine learning methods, using a Bayesian approach, the regularization parameter is found from the observed signals in a simple and automatic manner. The proposed approach is illustrated in system identification and beamforming examples, where the automatic regularization is shown to yield near-optimal results.
♻ ☆ SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models ICML 2024
Most of the existing Large Language Model (LLM) benchmarks on scientific problem reasoning focus on problems grounded in high-school subjects and are confined to elementary algebraic operations. To systematically examine the reasoning capabilities required for solving complex scientific problems, we introduce an expansive benchmark suite SciBench for LLMs. SciBench contains a carefully curated dataset featuring a range of collegiate-level scientific problems from mathematics, chemistry, and physics domains. Based on the dataset, we conduct an in-depth benchmarking study of representative open-source and proprietary LLMs with various prompting strategies. The results reveal that the current LLMs fall short of delivering satisfactory performance, with the best overall score of merely 43.22%. Furthermore, through a detailed user study, we categorize the errors made by LLMs into ten problem-solving abilities. Our analysis indicates that no single prompting strategy significantly outperforms the others and some strategies that demonstrate improvements in certain problem-solving skills could result in declines in other skills. We envision that SciBench will catalyze further developments in the reasoning abilities of LLMs, thereby ultimately contributing to scientific research and discovery.
comment: To appear at ICML 2024
♻ ☆ Active Preference Learning for Large Language Models
As large language models (LLMs) become more capable, fine-tuning techniques for aligning with human intent are increasingly important. A key consideration for aligning these models is how to most effectively use human resources, or model resources in the case where LLMs themselves are used as oracles. Reinforcement learning from Human or AI preferences (RLHF/RLAIF) is the most prominent example of such a technique, but is complex and often unstable. Direct Preference Optimization (DPO) has recently been proposed as a simpler and more stable alternative. In this work, we develop an active learning strategy for DPO to make better use of preference labels. We propose a practical acquisition function for prompt/completion pairs based on the predictive entropy of the language model and a measure of certainty of the implicit preference model optimized by DPO. We demonstrate how our approach improves both the rate of learning and final performance of fine-tuning on pairwise preference data.
comment: 13 pages, 5 figures, 6 tables
♻ ☆ Latent Logic Tree Extraction for Event Sequence Explanation from LLMs
Modern high-stakes systems, such as healthcare or robotics, often generate vast streaming event sequences. Our goal is to design an efficient, plug-and-play tool to elicit logic tree-based explanations from Large Language Models (LLMs) to provide customized insights into each observed event sequence. Built on the temporal point process model for events, our method employs the likelihood function as a score to evaluate generated logic trees. We propose an amortized Expectation-Maximization (EM) learning framework and treat the logic tree as latent variables. In the E-step, we evaluate the posterior distribution over the latent logic trees using an LLM prior and the likelihood of the observed event sequences. LLM provides a high-quality prior for the latent logic trees, however, since the posterior is built over a discrete combinatorial space, we cannot get the closed-form solution. We propose to generate logic tree samples from the posterior using a learnable GFlowNet, which is a diversity-seeking generator for structured discrete variables. The M-step employs the generated logic rules to approximate marginalization over the posterior, facilitating the learning of model parameters and refining the tunable LLM prior parameters. In the online setting, our locally built, lightweight model will iteratively extract the most relevant rules from LLMs for each sequence using only a few iterations. Empirical demonstrations showcase the promising performance and adaptability of our framework.
♻ ☆ QQQ: Quality Quattuor-Bit Quantization for Large Language Models
Quantization is a proven effective method for compressing large language models. Although popular techniques like W8A8 and W4A16 effectively maintain model performance, they often fail to concurrently speed up the prefill and decoding stages of inference. W4A8 is a promising strategy to accelerate both of them while usually leads to a significant performance degradation. To address these issues, we present QQQ, a Quality Quattuor-bit Quantization method with 4-bit weights and 8-bit activations. QQQ employs adaptive smoothing and Hessian-based compensation, significantly enhancing the performance of quantized models without extensive training. Furthermore, we meticulously engineer W4A8 GEMM kernels to increase inference speed. Our specialized per-channel W4A8 GEMM and per-group W4A8 GEMM achieve impressive speed increases of 3.67$\times$ and 3.29 $\times$ over FP16 GEMM. Our extensive experiments show that QQQ achieves performance on par with existing state-of-the-art LLM quantization methods while significantly accelerating inference, achieving speed boosts up to 2.24 $\times$, 2.10$\times$, and 1.25$\times$ compared to FP16, W8A8, and W4A16, respectively.
♻ ☆ MOYU: A Theoretical Study on Massive Over-activation Yielded Uplifts in LLMs
Massive Over-activation Yielded Uplifts(MOYU) is an inherent property of large language models, and dynamic activation(DA) based on the MOYU property is a clever yet under-explored strategy designed to accelerate inference in these models. Existing methods that utilize MOYU often face a significant 'Impossible Trinity': struggling to simultaneously maintain model performance, enhance inference speed, and extend applicability across various architectures. Due to the theoretical ambiguities surrounding MOYU, this paper elucidates the root cause of the MOYU property and outlines the mechanisms behind two primary limitations encountered by current DA methods: 1) history-related activation uncertainty, and 2) semantic-irrelevant activation inertia. Our analysis not only underscores the limitations of current dynamic activation strategies within large-scale LLaMA models but also proposes opportunities for refining the design of future sparsity schemes.
♻ ☆ Logical Closed Loop: Uncovering Object Hallucinations in Large Vision-Language Models ACL 2024
Object hallucination has been an Achilles' heel which hinders the broader applications of large vision-language models (LVLMs). Object hallucination refers to the phenomenon that the LVLMs claim non-existent objects in the image. To mitigate the object hallucinations, instruction tuning and external model-based detection methods have been proposed, which either require large-scare computational resources or depend on the detection result of external models. However, there remains an under-explored field to utilize the LVLM itself to alleviate object hallucinations. In this work, we adopt the intuition that the LVLM tends to respond logically consistently for existent objects but inconsistently for hallucinated objects. Therefore, we propose a Logical Closed Loop-based framework for Object Hallucination Detection and Mitigation, namely LogicCheckGPT. In specific, we devise logical consistency probing to raise questions with logical correlations, inquiring about attributes from objects and vice versa. Whether their responses can form a logical closed loop serves as an indicator of object hallucination. As a plug-and-play method, it can be seamlessly applied to all existing LVLMs. Comprehensive experiments conducted on three benchmarks across four LVLMs have demonstrated significant improvements brought by our method, indicating its effectiveness and generality.
comment: Accept to ACL 2024; 19 Pages, 15 Figures, 6 Tables
♻ ☆ Sequential Model for Predicting Patient Adherence in Subcutaneous Immunotherapy for Allergic Rhinitis
Objective: Subcutaneous Immunotherapy (SCIT) is the long-lasting causal treatment of allergic rhinitis (AR). How to enhance the adherence of patients to maximize the benefit of allergen immunotherapy (AIT) plays a crucial role in the management of AIT. This study aims to leverage novel machine learning models to precisely predict the risk of non-adherence of AR patients and related local symptom scores in three years SCIT. Methods: The research develops and analyzes two models, sequential latent-variable model (SLVM) of Sequential Latent Actor-Critic (SLAC) and Long Short-Term Memory (LSTM) evaluating them based on scoring and adherence prediction capabilities. Results: Excluding the biased samples at the first time step, the predictive adherence accuracy of the SLAC models is from 60\% to 72\%, and for LSTM models, it is 66\% to 84\%, varying according to the time steps. The range of Root Mean Square Error (RMSE) for SLAC models is between 0.93 and 2.22, while for LSTM models it is between 1.09 and 1.77. Notably, these RMSEs are significantly lower than the random prediction error of 4.55. Conclusion: We creatively apply sequential models in the long-term management of SCIT with promising accuracy in the prediction of SCIT nonadherence in AR patients. While LSTM outperforms SLAC in adherence prediction, SLAC excels in score prediction for patients undergoing SCIT for AR. The state-action-based SLAC adds flexibility, presenting a novel and effective approach for managing long-term AIT.
comment: Frontiers in Pharmacology, research topic: Methods and Metrics to Measure Medication Adherence
♻ ☆ Empowering Interdisciplinary Insights with Dynamic Graph Embedding Trajectories
We developed DyGETViz, a novel framework for effectively visualizing dynamic graphs (DGs) that are ubiquitous across diverse real-world systems. This framework leverages recent advancements in discrete-time dynamic graph (DTDG) models to adeptly handle the temporal dynamics inherent in dynamic graphs. DyGETViz effectively captures both micro- and macro-level structural shifts within these graphs, offering a robust method for representing complex and massive dynamic graphs. The application of DyGETViz extends to a diverse array of domains, including ethology, epidemiology, finance, genetics, linguistics, communication studies, social studies, and international relations. Through its implementation, DyGETViz has revealed or confirmed various critical insights. These include the diversity of content sharing patterns and the degree of specialization within online communities, the chronological evolution of lexicons across decades, and the distinct trajectories exhibited by aging-related and non-related genes. Importantly, DyGETViz enhances the accessibility of scientific findings to non-domain experts by simplifying the complexities of dynamic graphs. Our framework is released as an open-source Python package for use across diverse disciplines. Our work not only addresses the ongoing challenges in visualizing and analyzing DTDG models but also establishes a foundational framework for future investigations into dynamic graph representation and analysis across various disciplines.
comment: 27 pages, 11 figures
♻ ☆ Leveraging Knowledge Distillation for Lightweight Skin Cancer Classification: Balancing Accuracy and Computational Efficiency
Skin cancer is a major concern to public health, accounting for one-third of the reported cancers. If not detected early, the cancer has the potential for severe consequences. Recognizing the critical need for effective skin cancer classification, we address the limitations of existing models, which are often too large to deploy in areas with limited computational resources. In response, we present a knowledge distillation based approach for creating a lightweight yet high-performing classifier. The proposed solution involves fusing three models, namely ResNet152V2, ConvNeXtBase, and ViT Base, to create an effective teacher model. The teacher model is then employed to guide a lightweight student model of size 2.03 MB. This student model is further compressed to 469.77 KB using 16-bit quantization, enabling smooth incorporation into edge devices. With six-stage image preprocessing, data augmentation, and a rigorous ablation study, the model achieves an impressive accuracy of 98.75% on the HAM10000 dataset and 98.94% on the Kaggle dataset in classifying benign and malignant skin cancers. With its high accuracy and compact size, our model appears to be a potential choice for accurate skin cancer classification, particularly in resource-constrained settings.
♻ ☆ FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts ACL 2024
Existing benchmarks for visual question answering lack in visual grounding and complexity, particularly in evaluating spatial reasoning skills. We introduce FlowVQA, a novel benchmark aimed at assessing the capabilities of visual question-answering multimodal language models in reasoning with flowcharts as visual contexts. FlowVQA comprises 2,272 carefully generated and human-verified flowchart images from three distinct content sources, along with 22,413 diverse question-answer pairs, to test a spectrum of reasoning tasks, including information localization, decision-making, and logical progression. We conduct a thorough baseline evaluation on a suite of both open-source and proprietary multimodal language models using various strategies, followed by an analysis of directional bias. The results underscore the benchmark's potential as a vital tool for advancing the field of multimodal modeling, providing a focused and challenging environment for enhancing model performance in visual and logical reasoning tasks.
comment: Accepted in ACL 2024 (Findings), 21 pages, 7 figures, 9 Tables
♻ ☆ Transcendence: Generative Models Can Outperform The Experts That Train Them
Generative models are trained with the simple objective of imitating the conditional probability distribution induced by the data they are trained on. Therefore, when trained on data generated by humans, we may not expect the artificial model to outperform the humans on their original objectives. In this work, we study the phenomenon of transcendence: when a generative model achieves capabilities that surpass the abilities of the experts generating its data. We demonstrate transcendence by training an autoregressive transformer to play chess from game transcripts, and show that the trained model can sometimes achieve better performance than all players in the dataset. We theoretically prove that transcendence can be enabled by low-temperature sampling, and rigorously assess this claim experimentally. Finally, we discuss other sources of transcendence, laying the groundwork for future investigation of this phenomenon in a broader setting.
comment: Code, models, and data at https://transcendence.eddie.win
♻ ☆ Behavior Generation with Latent Actions
Generative modeling of complex behaviors from labeled datasets has been a longstanding problem in decision making. Unlike language or image generation, decision making requires modeling actions - continuous-valued vectors that are multimodal in their distribution, potentially drawn from uncurated sources, where generation errors can compound in sequential prediction. A recent class of models called Behavior Transformers (BeT) addresses this by discretizing actions using k-means clustering to capture different modes. However, k-means struggles to scale for high-dimensional action spaces or long sequences, and lacks gradient information, and thus BeT suffers in modeling long-range actions. In this work, we present Vector-Quantized Behavior Transformer (VQ-BeT), a versatile model for behavior generation that handles multimodal action prediction, conditional generation, and partial observations. VQ-BeT augments BeT by tokenizing continuous actions with a hierarchical vector quantization module. Across seven environments including simulated manipulation, autonomous driving, and robotics, VQ-BeT improves on state-of-the-art models such as BeT and Diffusion Policies. Importantly, we demonstrate VQ-BeT's improved ability to capture behavior modes while accelerating inference speed 5x over Diffusion Policies. Videos and code can be found https://sjlee.cc/vq-bet
comment: Github repo: https://github.com/jayLEE0301/vq_bet_official
♻ ☆ Kernel vs. Kernel: Exploring How the Data Structure Affects Neural Collapse
Recently, a vast amount of literature has focused on the "Neural Collapse" (NC) phenomenon, which emerges when training neural network (NN) classifiers beyond the zero training error point. The core component of NC is the decrease in the within class variability of the network's deepest features, dubbed as NC1. The theoretical works that study NC are typically based on simplified unconstrained features models (UFMs) that mask any effect of the data on the extent of collapse. In this paper, we provide a kernel-based analysis that does not suffer from this limitation. First, given a kernel function, we establish expressions for the traces of the within- and between-class covariance matrices of the samples' features (and consequently an NC1 metric). Then, we turn to focus on kernels associated with shallow NNs. First, we consider the NN Gaussian Process kernel (NNGP), associated with the network at initialization, and the complement Neural Tangent Kernel (NTK), associated with its training in the "lazy regime". Interestingly, we show that the NTK does not represent more collapsed features than the NNGP for prototypical data models. As NC emerges from training, we then consider an alternative to NTK: the recently proposed adaptive kernel, which generalizes NNGP to model the feature mapping learned from the training data. Contrasting our NC1 analysis for these two kernels enables gaining insights into the effect of data distribution on the extent of collapse, which are empirically aligned with the behavior observed with practical training of NNs.
comment: 34 pages, 14 figures
♻ ☆ AIGB: Generative Auto-bidding via Diffusion Modeling KDD 2024
Auto-bidding plays a crucial role in facilitating online advertising by automatically providing bids for advertisers. Reinforcement learning (RL) has gained popularity for auto-bidding. However, most current RL auto-bidding methods are modeled through the Markovian Decision Process (MDP), which assumes the Markovian state transition. This assumption restricts the ability to perform in long horizon scenarios and makes the model unstable when dealing with highly random online advertising environments. To tackle this issue, this paper introduces AI-Generated Bidding (AIGB), a novel paradigm for auto-bidding through generative modeling. In this paradigm, we propose DiffBid, a conditional diffusion modeling approach for bid generation. DiffBid directly models the correlation between the return and the entire trajectory, effectively avoiding error propagation across time steps in long horizons. Additionally, DiffBid offers a versatile approach for generating trajectories that maximize given targets while adhering to specific constraints. Extensive experiments conducted on the real-world dataset and online A/B test on Alibaba advertising platform demonstrate the effectiveness of DiffBid, achieving 2.81% increase in GMV and 3.36% increase in ROI.
comment: Accepted by KDD 2024
♻ ☆ Active Sequential Two-Sample Testing
A two-sample hypothesis test is a statistical procedure used to determine whether the distributions generating two samples are identical. We consider the two-sample testing problem in a new scenario where the sample measurements (or sample features) are inexpensive to access, but their group memberships (or labels) are costly. To address the problem, we devise the first \emph{active sequential two-sample testing framework} that not only sequentially but also \emph{actively queries}. Our test statistic is a likelihood ratio where one likelihood is found by maximization over all class priors, and the other is provided by a probabilistic classification model. The classification model is adaptively updated and used to predict where the (unlabelled) features have a high dependency on labels; labeling the ``high-dependency'' features leads to the increased power of the proposed testing framework. In theory, we provide the proof that our framework produces an \emph{anytime-valid} $p$-value. In addition, we characterize the proposed framework's gain in testing power by analyzing the mutual information between the feature and label variables in asymptotic and finite-sample scenarios. In practice, we introduce an instantiation of our framework and evaluate it using several experiments; the experiments on the synthetic, MNIST, and application-specific datasets demonstrate that the testing power of the instantiated active sequential test significantly increases while the Type I error is under control.
♻ ☆ FAdam: Adam is a natural gradient optimizer using diagonal empirical Fisher information
This paper establishes a mathematical foundation for the Adam optimizer, elucidating its connection to natural gradient descent through Riemannian and information geometry. We rigorously analyze the diagonal empirical Fisher information matrix (FIM) in Adam, clarifying all detailed approximations and advocating for the use of log probability functions as loss, which should be based on discrete distributions, due to the limitations of empirical FIM. Our analysis uncovers flaws in the original Adam algorithm, leading to proposed corrections such as enhanced momentum calculations, adjusted bias corrections, adaptive epsilon, and gradient clipping. We refine the weight decay term based on our theoretical framework. Our modified algorithm, Fisher Adam (FAdam), demonstrates superior performance across diverse domains including LLM, ASR, and VQ-VAE, achieving state-of-the-art results in ASR.
comment: 21 pages, 4 figures, 6 tables
♻ ☆ Generative Autoencoding of Dropout Patterns
We propose a generative model termed Deciphering Autoencoders. In this model, we assign a unique random dropout pattern to each data point in the training dataset and then train an autoencoder to reconstruct the corresponding data point using this pattern as information to be encoded. Even if a completely random dropout pattern is assigned to each data point regardless of their similarities, a sufficiently large encoder can smoothly map them to a low-dimensional latent space to reconstruct individual training data points. During inference, using a dropout pattern different from those used during training allows the model to function as a generator. Since the training of Deciphering Autoencoders relies solely on reconstruction error, it offers more stable training compared to other generative models. Despite their simplicity, Deciphering Autoencoders show sampling quality comparable to DCGAN on the CIFAR-10 dataset.
♻ ☆ MeGA: Merging Multiple Independently Trained Neural Networks Based on Genetic Algorithm
In this paper, we introduce a novel method for merging the weights of multiple pre-trained neural networks using a genetic algorithm called MeGA. Traditional techniques, such as weight averaging and ensemble methods, often fail to fully harness the capabilities of pre-trained networks. Our approach leverages a genetic algorithm with tournament selection, crossover, and mutation to optimize weight combinations, creating a more effective fusion. This technique allows the merged model to inherit advantageous features from both parent models, resulting in enhanced accuracy and robustness. Through experiments on the CIFAR-10 dataset, we demonstrate that our genetic algorithm-based weight merging method improves test accuracy compared to individual models and conventional methods. This approach provides a scalable solution for integrating multiple pre-trained networks across various deep learning applications. Github is available at: https://github.com/YUNBLAK/MeGA-Merging-Multiple-Independently-Trained-Neural-Networks-Based-on-Genetic-Algorithm
♻ ☆ Submodular Information Selection for Hypothesis Testing with Misclassification Penalties
We consider the problem of selecting an optimal subset of information sources for a hypothesis testing/classification task where the goal is to identify the true state of the world from a finite set of hypotheses, based on finite observation samples from the sources. In order to characterize the learning performance, we propose a misclassification penalty framework, which enables nonuniform treatment of different misclassification errors. In a centralized Bayesian learning setting, we study two variants of the subset selection problem: (i) selecting a minimum cost information set to ensure that the maximum penalty of misclassifying the true hypothesis is below a desired bound and (ii) selecting an optimal information set under a limited budget to minimize the maximum penalty of misclassifying the true hypothesis. Under certain assumptions, we prove that the objective (or constraints) of these combinatorial optimization problems are weak (or approximate) submodular, and establish high-probability performance guarantees for greedy algorithms. Further, we propose an alternate metric for information set selection which is based on the total penalty of misclassification. We prove that this metric is submodular and establish near-optimal guarantees for the greedy algorithms for both the information set selection problems. Finally, we present numerical simulations to validate our theoretical results over several randomly generated instances.
comment: 21 pages, 4 figures
♻ ☆ Robust Model-Based Optimization for Challenging Fitness Landscapes
Protein design, a grand challenge of the day, involves optimization on a fitness landscape, and leading methods adopt a model-based approach where a model is trained on a training set (protein sequences and fitness) and proposes candidates to explore next. These methods are challenged by sparsity of high-fitness samples in the training set, a problem that has been in the literature. A less recognized but equally important problem stems from the distribution of training samples in the design space: leading methods are not designed for scenarios where the desired optimum is in a region that is not only poorly represented in training data, but also relatively far from the highly represented low-fitness regions. We show that this problem of "separation" in the design space is a significant bottleneck in existing model-based optimization tools and propose a new approach that uses a novel VAE as its search model to overcome the problem. We demonstrate its advantage over prior methods in robustly finding improved samples, regardless of the imbalance and separation between low- and high-fitness samples. Our comprehensive benchmark on real and semi-synthetic protein datasets as well as solution design for physics-informed neural networks, showcases the generality of our approach in discrete and continuous design spaces. Our implementation is available at https://github.com/sabagh1994/PGVAE.
♻ ☆ Forecasting Electricity Market Signals via Generative AI
This paper presents a generative artificial intelligence approach to probabilistic forecasting of electricity market signals, such as real-time locational marginal prices and area control error signals. Inspired by the Wiener-Kallianpur innovation representation of nonparametric time series, we propose a weak innovation autoencoder architecture and a novel deep learning algorithm that extracts the canonical independent and identically distributed innovation sequence of the time series, from which samples of future time series are generated. The validity of the proposed approach is established by proving that, under ideal training conditions, the generated samples have the same conditional probability distribution as that of the ground truth. Three applications involving highly dynamic and volatile time series in real-time market operations are considered: (i) locational marginal price forecasting for self-scheduled resources such as battery storage participants, (ii) interregional price spread forecasting for virtual bidders in interchange markets, and (iii) area control error forecasting for frequency regulations. Numerical studies based on market data from multiple independent system operators demonstrate the superior performance of the proposed generative forecaster over leading classical and modern machine learning techniques under both probabilistic and point forecasting metrics.
♻ ☆ 3D-Mol: A Novel Contrastive Learning Framework for Molecular Property Prediction with 3D Information
Molecular property prediction, crucial for early drug candidate screening and optimization, has seen advancements with deep learning-based methods. While deep learning-based methods have advanced considerably, they often fall short in fully leveraging 3D spatial information. Specifically, current molecular encoding techniques tend to inadequately extract spatial information, leading to ambiguous representations where a single one might represent multiple distinct molecules. Moreover, existing molecular modeling methods focus predominantly on the most stable 3D conformations, neglecting other viable conformations present in reality. To address these issues, we propose 3D-Mol, a novel approach designed for more accurate spatial structure representation. It deconstructs molecules into three hierarchical graphs to better extract geometric information. Additionally, 3D-Mol leverages contrastive learning for pretraining on 20 million unlabeled data, treating their conformations with identical topological structures as weighted positive pairs and contrasting ones as negatives, based on the similarity of their 3D conformation descriptors and fingerprints. We compare 3D-Mol with various state-of-the-art baselines on 7 benchmarks and demonstrate our outstanding performance.
♻ ☆ Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training
Existing checkpointing approaches seem ill-suited for distributed training even though hardware limitations make model parallelism, i.e., sharding model state across multiple accelerators, a requirement for model scaling. Consolidating distributed model state into a single checkpoint unacceptably slows down training, and is impractical at extreme scales. Distributed checkpoints, in contrast, are tightly coupled to the model parallelism and hardware configurations of the training run, and thus unusable on different configurations. To address this problem, we propose Universal Checkpointing, a technique that enables efficient checkpoint creation while providing the flexibility of resuming on arbitrary parallelism strategy and hardware configurations. Universal Checkpointing unlocks unprecedented capabilities for large-scale training such as improved resilience to hardware failures through continued training on remaining healthy hardware, and reduced training time through opportunistic exploitation of elastic capacity. The key insight of Universal Checkpointing is the selection of the optimal representation in each phase of the checkpointing life cycle: distributed representation for saving, and consolidated representation for loading. This is achieved using two key mechanisms. First, the universal checkpoint format, which consists of a consolidated representation of each model parameter and metadata for mapping parameter fragments into training ranks of arbitrary model-parallelism configuration. Second, the universal checkpoint language, a simple but powerful specification language for converting distributed checkpoints into the universal checkpoint format. Our evaluation demonstrates the effectiveness and generality of Universal Checkpointing on state-of-the-art model architectures and a wide range of parallelism techniques.
♻ ☆ Impact of Domain Knowledge and Multi-Modality on Intelligent Molecular Property Prediction: A Systematic Survey
The precise prediction of molecular properties is essential for advancements in drug development, particularly in virtual screening and compound optimization. The recent introduction of numerous deep learning-based methods has shown remarkable potential in enhancing molecular property prediction (MPP), especially improving accuracy and insights into molecular structures. Yet, two critical questions arise: does the integration of domain knowledge augment the accuracy of molecular property prediction and does employing multi-modal data fusion yield more precise results than unique data source methods? To explore these matters, we comprehensively review and quantitatively analyze recent deep learning methods based on various benchmarks. We discover that integrating molecular information significantly improves molecular property prediction (MPP) for both regression and classification tasks. Specifically, regression improvements, measured by reductions in root mean square error (RMSE), are up to 4.0%, while classification enhancements, measured by the area under the receiver operating characteristic curve (ROC-AUC), are up to 1.7%. We also discover that enriching 2D graphs with 1D SMILES boosts multi-modal learning performance for regression tasks by up to 9.1%, and augmenting 2D graphs with 3D information increases performance for classification tasks by up to 13.2%, with both enhancements measured using ROC-AUC. The two consolidated insights offer crucial guidance for future advancements in drug discovery.
♻ ☆ Last Iterate Convergence of Incremental Methods and Applications in Continual Learning
Incremental gradient and incremental proximal methods are a fundamental class of optimization algorithms used for solving finite sum problems, broadly studied in the literature. Yet, without strong convexity, their convergence guarantees have primarily been established for the ergodic (average) iterate. Motivated by applications in continual learning, we obtain the first convergence guarantees for the last iterate of both incremental gradient and incremental proximal methods, in general convex smooth (for both) and convex Lipschitz (for the proximal variants) settings. Our oracle complexity bounds for the last iterate nearly match (i.e., match up to a square-root-log or a log factor) the best known oracle complexity bounds for the average iterate, for both classes of methods. We further obtain generalizations of our results to weighted averaging of the iterates with increasing weights and for randomly permuted ordering of updates. We study incremental proximal methods as a model of continual learning with generalization and argue that large amount of regularization is crucial to preventing catastrophic forgetting. Our results generalize last iterate guarantees for incremental methods compared to state of the art, as such results were previously known only for overparameterized linear models, which correspond to convex quadratic problems with infinitely many solutions.
♻ ☆ Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity
Large language models (LLMs) are increasingly integrated into many online services, yet they remain cost-prohibitive to deploy due to the requirement of expensive GPU instances. Prior work has addressed the high cost of LLM serving by improving the inference engine, but less attention has been given to selecting the most cost-efficient GPU type(s) for a specific LLM service. There is a large and growing landscape of GPU types and, within these options, higher cost does not always lead to increased performance. Instead, through a comprehensive investigation, we find that three key LLM service characteristics (request size, request rate, SLO) strongly influence GPU cost efficiency, and differing GPU types are most cost efficient for differing LLM service settings. As a result, the most cost-efficient allocation for a given service is typically a mix of heterogeneous GPU types. Based on this analysis, we introduce M\'elange, a GPU allocation framework that navigates these diverse LLM service characteristics and heterogeneous GPU option space to automatically and efficiently derive the minimal-cost GPU allocation for a given LLM service. We formulate the GPU allocation task as a cost-aware bin packing problem where GPUs are bins and items are slices of the service workload. Our formulation's constraints account for a service's unique characteristics, allowing M\'elange to be flexible to support diverse service settings and heterogeneity-aware to adapt the GPU allocation to a specific service. Compared to using only a single GPU type, M\'elange reduces deployment costs by up to 77% in conversational settings, 33% in document-based settings, and 51% in a mixed setting.
♻ ☆ FloorSet -- a VLSI Floorplanning Dataset with Design Constraints of Real-World SoCs
Floorplanning for systems-on-a-chip (SoCs) and its sub-systems is a crucial and non-trivial step of the physical design flow. It represents a difficult combinatorial optimization problem. A typical large scale SoC with 120 partitions generates a search-space of nearly 10E250. As novel machine learning (ML) approaches emerge to tackle such problems, there is a growing need for a modern benchmark that comprises a large training dataset and performance metrics that better reflect real-world constraints and objectives compared to existing benchmarks. To address this need, we present FloorSet -- two comprehensive datasets of synthetic fixed-outline floorplan layouts that reflect the distribution of real SoCs. Each dataset has 1M training samples and 100 test samples where each sample is a synthetic floor-plan. FloorSet-Prime comprises fully-abutted rectilinear partitions and near-optimal wire-length. A simplified dataset that reflects early design phases, FloorSet-Lite comprises rectangular partitions, with under 5 percent white-space and near-optimal wire-length. Both datasets define hard constraints seen in modern design flows such as shape constraints, edge-affinity, grouping constraints, and pre-placement constraints. FloorSet is intended to spur fundamental research on large-scale constrained optimization problems. Crucially, FloorSet alleviates the core issue of reproducibility in modern ML driven solutions to such problems. FloorSet is available as an open-source repository for the research community.
comment: 10 pages, 11 figures
Systems and Control 12
☆ `Just One More Sensor is Enough' -- Iterative Water Leak Localization with Physical Simulation and a Small Number of Pressure Sensors
In this article, we propose an approach to leak localisation in a complex water delivery grid with the use of data from physical simulation (e.g. EPANET software). This task is usually achieved by a network of multiple water pressure sensors and analysis of the so-called sensitivity matrix of pressure differences between the network's simulated data and actual data of the network affected by the leak. However, most algorithms using this approach require a significant number of pressure sensors -- a condition that is not easy to fulfil in the case of many less equipped networks. Therefore, we answer the question of whether leak localisation is possible by utilising very few sensors but having the ability to relocate one of them. Our algorithm is based on physical simulations (EPANET software) and an iterative scheme for mobile sensor relocation. The experiments show that the proposed system can equalise the low number of sensors with adjustments made for their positioning, giving a very good approximation of leak's position both in simulated cases and real-life example taken from BattLeDIM competition L-Town data.
☆ Koopman based trajectory model and computation offloading for high mobility paradigm in ISAC enabled IoT system
User experience on mobile devices is constrained by limited battery capacity and processing power, but 6G technology advancements are diving rapidly into mobile technical evolution. Mobile edge computing (MEC) offers a solution, offloading computationally intensive tasks to edge cloud servers, reducing battery drain compared to local processing. The upcoming integrated sensing and communication in mobile communication may improve the trajectory prediction and processing delays. This study proposes a greedy resource allocation optimization strategy for multi-user networks to minimize aggregate energy usage. Numerical results show potential improvement at 33\% for every 1000 iteration. Addressing prediction model division and velocity accuracy issues is crucial for better results. A plan for further improvement and achieving objectives is outlined for the upcoming work phase.
☆ Composite Adaptive Disturbance Rejection in Robotics via Instrumental Variables based DREM
In this paper we consider trajectory tracking problem for robotic systems affected by unknown external perturbations. Considering possible solutions, we restrict our attention to composite adaptation, which, particularly, ensures parametric error convergence being desirable to enhance overall stability and robustness of a closed-loop system. At the same time, existing composite approaches cannot simultaneously relax stringent persistence of excitation requirement and guarantee convergence of parametric error to zero for a perturbed scenario. So, a new composite adaptation scheme is proposed, which successfully overcomes mentioned problems of known counterparts and has several salient features. First, it includes a novel adaptive disturbance rejection control law for a general n-DoF dynamical model in the Euler-Lagrange form, which, without achievement of the parameter estimation goal, ensures global stability via application of a high-gain external torque observer augmented with some adaptation law. Secondly, such law is extended with a composite summand derived via the recently proposed Instrumental Variables based Dynamic Regressor Extension and Mixing procedure, which relaxes excitation conditions and ensures asymptotic parameter estimation and reference tracking in the presence of external torque under some non-restrictive assumptions. An illustrative example shows the effectiveness and superiority of the proposed approach in comparison with existing solutions.
comment: 10 pages, 4 figures
☆ Identifying Ordinary Differential Equations for Data-efficient Model-based Reinforcement Learning
The identification of a mathematical dynamics model is a crucial step in the designing process of a controller. However, it is often very difficult to identify the system's governing equations, especially in complex environments that combine physical laws of different disciplines. In this paper, we present a new approach that allows identifying an ordinary differential equation by means of a physics-informed machine learning algorithm. Our method introduces a special neural network that allows exploiting prior human knowledge to a certain degree and extends it autonomously, so that the resulting differential equations describe the system as accurately as possible. We validate the method on a Duffing oscillator with simulation data and, additionally, on a cascaded tank example with real-world data. Subsequently, we use the developed algorithm in a model-based reinforcement learning framework by alternately identifying and controlling a system to a target state. We test the performance by swinging-up an inverted pendulum on a cart.
comment: 10 pages, 6 figures, accepted at the IEEE World Congress on Computational Intelligence 2024
☆ Funplex: A Modified Simplex Algorithm to Efficiently Explore Near-Optimal Spaces
Modeling to generate alternatives (MGA) is an increasingly popular method in energy system optimization. MGA explores the near-optimal space, namely, system alternatives whose costs are within a certain fraction of the globally optimal cost. Real-world stakeholders may prefer these alternatives due to intangible factors. Nonetheless, widespread MGA adoption is hampered by its additional computational burden. Current MGA methods identify boundary points of the near-optimal space through repeated, independent optimization problems. Hundreds of model runs are usually required, and such individual runs are often inefficient because they repeat calculations or retrace previous trajectories. In this study, we transcend such limitations by introducing a novel algorithm called Funplex, which uses methods from multi-objective Simplex to optimize many MGA objectives with minimal computational redundancy. For a simple linear-programming energy hub case study, we show that Funplex is five times faster than existing methods and yields higher-quality near-optimal spaces. Furthermore, sensitivity analyses suggest that Funplex scales well with the number of investment variables, making it promising for capacity planning models. The current proof-of-concept implementation based on a full multi-objective tableau may face memory and stability limitations for large models. Nonetheless, future developments based on more advanced versions of Simplex may overcome such barriers, thereby making MGA more accessible and standard among modeling teams.
☆ Condition monitoring of wind turbine blades via learning-based methods
This paper addresses the topic of condition monitoring of wind turbine blades and presents a learning-based approach to fault detection. The proposed scheme utilises Principal Components Analysis and Autoencoders to derive data-driven models from root-bending moment and other measurements. The models are trained with real data obtained from a fault-free wind turbine, and then validated on data corresponding to unknown health condition. Online test statistics, employing static thresholds and Generalized Likelihood Ratio tests, are used on residual signals generated by discrepancies between the actual and reconstructed measurements to detect deviations from nominal operation. The efficacy and effectiveness of the proposed methods are demonstrated using real-life data collected from wind turbines experiencing blade faults.
☆ Global Regulation of Feedforward Nonlinear Systems: A Logic-Based Switching Gain Approach
In this article, we investigate the global regulation problem for a class of feedforward nonlinear systems. Notably, the systems under consideration allow unknown input-output-dependent nonlinear growth rates, which has not been considered in existing works. A novel logic-based switching (LBS) gain approach is proposed to counteract system uncertainties and nonlinearities. Furthermore, a tanh-type speed-regulation function is embedded into the switching mechanism for the first time to improve the convergence speed and transient performance. Then, a switching adaptive output feedback (SAOF) controller is proposed based on the developed switching mechanism, which is of a concise form and low-complexity characteristic. It is shown that the objective of global regulation is achieved with faster convergence speed and better transient performance under the proposed controller. Moreover, by strengthening the switching mechanism, the improved control approach can deal with feedforward nonlinear systems with external disturbances. Finally, representative examples are presented to demonstrate the effectiveness and advantages of our approach in comparison with the existing approaches.
☆ Practical Power System Inertia Monitoring Based on Pumped Storage Hydropower Operation Signature
This paper proposes a practical method to monitor power system inertia using Pumped Storage Hydropower (PSH) switching-off events. This approach offers real-time system-level inertia estimation with minimal expenses, no disruption, and the inclusion of behind-the-meter inertia. First, accurate inertia estimation is achieved through improved RoCoF calculation that accounts for pre-event RoCoF, reducing common random frequency fluctuations in practice. Second, PSH field data is analyzed, highlighting the benefits of using switching-off events for grid inertia estimation. Third, an event detection trigger is designed to capture pump switching-off events based on local and system features. Fourth, the method is validated on the U.S. Eastern Interconnection model with over 60,000 buses, demonstrating very high accuracy (3%-5% error rate). Finally, it is applied to the U.S. Western Interconnection, with field validation showing a 9.9% average absolute error rate. Despite challenges in practical power system inertia estimation, this method enhances decision-making for power grid reliability and efficiency, addressing challenges posed by renewable energy integration.
comment: 9 pages, 16 figures
☆ Multi-service collaboration and composition of cloud manufacturing customized production based on problem decomposition
Cloud manufacturing system is a service-oriented and knowledge-based one, which can provide solutions for the large-scale customized production. The service resource allocation is the primary factor that restricts the production time and cost in the cloud manufacturing customized production (CMCP). In order to improve the efficiency and reduce the cost in CMCP, we propose a new framework which considers the collaboration among services with the same functionality. A mathematical evaluation formulation for the service composition and service usage scheme is constructed with the following critical indexes: completion time, cost, and number of selected services. Subsequently, a problem decomposition based genetic algorithm is designed to obtain the optimal service compositions with service usage schemes. A smart clothing customization case is illustrated so as to show the effectiveness and efficiency of the method proposed in this paper. Finally, the results of simulation experiments and comparisons show that these solutions obtained by our method are with the minimum time, a lower cost, and the fewer selected services.
comment: 12 pages, 8 figures
♻ ☆ Solving Differential Equations using Physics-Informed Deep Equilibrium Models
This paper introduces Physics-Informed Deep Equilibrium Models (PIDEQs) for solving initial value problems (IVPs) of ordinary differential equations (ODEs). Leveraging recent advancements in deep equilibrium models (DEQs) and physics-informed neural networks (PINNs), PIDEQs combine the implicit output representation of DEQs with physics-informed training techniques. We validate PIDEQs using the Van der Pol oscillator as a benchmark problem, demonstrating their efficiency and effectiveness in solving IVPs. Our analysis includes key hyperparameter considerations for optimizing PIDEQ performance. By bridging deep learning and physics-based modeling, this work advances computational techniques for solving IVPs, with implications for scientific computing and engineering applications.
comment: Accepted at CASE 2024; Extended Sec. III.B
♻ ☆ Networked Communication for Decentralised Agents in Mean-Field Games
We introduce networked communication to the mean-field game framework, in particular to oracle-free settings where $N$ decentralised agents learn along a single, non-episodic run of the empirical system. We prove that our architecture, with only a few reasonable assumptions about network structure, has sample guarantees bounded between those of the centralised- and independent-learning cases. We discuss how the sample guarantees of the three theoretical algorithms do not actually result in practical convergence. We therefore show that in practical settings where the theoretical parameters are not observed (leading to poor estimation of the Q-function), our communication scheme significantly accelerates convergence over the independent case (and often even the centralised case), without relying on the assumption of a centralised learner. We contribute further practical enhancements to all three theoretical algorithms, allowing us to present their first empirical demonstrations. Our experiments confirm that we can remove several of the theoretical assumptions of the algorithms, and display the empirical convergence benefits brought by our new networked communication. We additionally show that the networked approach has significant advantages, over both the centralised and independent alternatives, in terms of robustness to unexpected learning failures and to changes in population size.
♻ ☆ Convex Optimization of Initial Perturbations toward Quantitative Weather Control
This study proposes introducing convex optimization to find initial perturbations of atmospheric models for realizing specified changes in subsequent forecasts. In the proposed method, we formulate and solve an inverse problem to find effective perturbations in atmospheric variables so that controlled variables satisfy specified changes at a specified time. The proposed method first constructs a sensitivity matrix of controlled variables, such as accumulated precipitation, to the initial atmospheric variables, such as temperature and humidity, through sensitivity analysis using numerical weather prediction (NWP) models. The sensitivity matrix is used to solve the inverse problem as convex optimization, in which a global optimal solution can be found computationally efficiently. The proposed method was validated through a benchmark warm bubble experiment using an NWP model. The experiments showed that identified perturbation successfully realized specified spatial distributions of accumulated precipitation. These results demonstrated the possibility of controlling the real atmosphere by solving inverse problems and adding small perturbations to atmospheric states.
comment: license changed to CC BY 4.0; revised to improve readability; some figures in Appendix omitted to improve conciseness
Multiagent Systems 5
☆ BMW Agents -- A Framework For Task Automation Through Multi-agent Collaboration
Autonomous agents driven by Large Language Models (LLMs) offer enormous potential for automation. Early proof of this technology can be found in various demonstrations of agents solving complex tasks, interacting with external systems to augment their knowledge, and triggering actions. In particular, workflows involving multiple agents solving complex tasks in a collaborative fashion exemplify their capacity to operate in less strict and less well-defined environments. Thus, a multi-agent approach has great potential for serving as a backbone in many industrial applications, ranging from complex knowledge retrieval systems to next generation robotic process automation. Given the reasoning abilities within the current generation of LLMs, complex processes require a multi-step approach that includes a plan of well-defined and modular tasks. Depending on the level of complexity, these tasks can be executed either by a single agent or a group of agents. In this work, we focus on designing a flexible agent engineering framework with careful attention to planning and execution, capable of handling complex use case applications across various domains. The proposed framework provides reliability in industrial applications and presents techniques to ensure a scalable, flexible, and collaborative workflow for multiple autonomous agents working together towards solving tasks.
comment: 24 pages. 21 PDF images
☆ Exploring 6G Potential for Industrial Digital Twinning and Swarm Intelligence in Obstacle-Rich
With the advent of 6G technology, the demand for efficient and intelligent systems in industrial applications has surged, driving the need for advanced solutions in target localization. Utilizing swarm robots to locate unknown targets involves navigating increasingly complex environments. Digital Twinning (DT) offers a robust solution by creating a virtual replica of the physical world, which enhances the swarm's navigation capabilities. Our framework leverages DT and integrates Swarm Intelligence to store physical map information in the cloud, enabling robots to efficiently locate unknown targets. The simulation results demonstrate that the DT framework, augmented by Swarm Intelligence, significantly improves target location efficiency in obstacle-rich environments compared to traditional methods. This research underscores the potential of combining DT and Swarm Intelligence to advance the field of robotic navigation and target localization in complex industrial settings.
comment: Submitted to IEEE VTM
☆ FootBots: A Transformer-based Architecture for Motion Prediction in Soccer ICIP 2024
Motion prediction in soccer involves capturing complex dynamics from player and ball interactions. We present FootBots, an encoder-decoder transformer-based architecture addressing motion prediction and conditioned motion prediction through equivariance properties. FootBots captures temporal and social dynamics using set attention blocks and multi-attention block decoder. Our evaluation utilizes two datasets: a real soccer dataset and a tailored synthetic one. Insights from the synthetic dataset highlight the effectiveness of FootBots' social attention mechanism and the significance of conditioned motion prediction. Empirical results on real soccer data demonstrate that FootBots outperforms baselines in motion prediction and excels in conditioned tasks, such as predicting the players based on the ball position, predicting the offensive (defensive) team based on the ball and the defensive (offensive) team, and predicting the ball position based on all players. Our evaluation connects quantitative and qualitative findings. https://youtu.be/9kaEkfzG3L8
comment: Published as a conference paper at IEEE ICIP 2024
☆ Multi-UAVs end-to-end Distributed Trajectory Generation over Point Cloud Data
This paper introduces an end-to-end trajectory planning algorithm tailored for multi-UAV systems that generates collision-free trajectories in environments populated with both static and dynamic obstacles, leveraging point cloud data. Our approach consists of a 2-fork neural network fed with sensing and localization data, able to communicate intermediate learned features among the agents. One network branch crafts an initial collision-free trajectory estimate, while the other devises a neural collision constraint for subsequent optimization, ensuring trajectory continuity and adherence to physicalactuation limits. Extensive simulations in challenging cluttered environments, involving up to 25 robots and 25% obstacle density, show a collision avoidance success rate in the range of 100 -- 85%. Finally, we introduce a saliency map computation method acting on the point cloud data, offering qualitative insights into our methodology.
♻ ☆ Networked Communication for Decentralised Agents in Mean-Field Games
We introduce networked communication to the mean-field game framework, in particular to oracle-free settings where $N$ decentralised agents learn along a single, non-episodic run of the empirical system. We prove that our architecture, with only a few reasonable assumptions about network structure, has sample guarantees bounded between those of the centralised- and independent-learning cases. We discuss how the sample guarantees of the three theoretical algorithms do not actually result in practical convergence. We therefore show that in practical settings where the theoretical parameters are not observed (leading to poor estimation of the Q-function), our communication scheme significantly accelerates convergence over the independent case (and often even the centralised case), without relying on the assumption of a centralised learner. We contribute further practical enhancements to all three theoretical algorithms, allowing us to present their first empirical demonstrations. Our experiments confirm that we can remove several of the theoretical assumptions of the algorithms, and display the empirical convergence benefits brought by our new networked communication. We additionally show that the networked approach has significant advantages, over both the centralised and independent alternatives, in terms of robustness to unexpected learning failures and to changes in population size.
Artificial Intelligence 103
☆ Web2Code: A Large-scale Webpage-to-Code Dataset and Evaluation Framework for Multimodal LLMs
Multimodal large language models (MLLMs) have shown impressive success across modalities such as image, video, and audio in a variety of understanding and generation tasks. However, current MLLMs are surprisingly poor at understanding webpage screenshots and generating their corresponding HTML code. To address this problem, we propose Web2Code, a benchmark consisting of a new large-scale webpage-to-code dataset for instruction tuning and an evaluation framework for the webpage understanding and HTML code translation abilities of MLLMs. For dataset construction, we leverage pretrained LLMs to enhance existing webpage-to-code datasets as well as generate a diverse pool of new webpages rendered into images. Specifically, the inputs are webpage images and instructions, while the responses are the webpage's HTML code. We further include diverse natural language QA pairs about the webpage content in the responses to enable a more comprehensive understanding of the web content. To evaluate model performance in these tasks, we develop an evaluation framework for testing MLLMs' abilities in webpage understanding and web-to-code generation. Extensive experiments show that our proposed dataset is beneficial not only to our proposed tasks but also in the general visual domain, while previous datasets result in worse performance. We hope our work will contribute to the development of general MLLMs suitable for web-based content generation and task automation. Our data and code will be available at https://github.com/MBZUAI-LLM/web2code.
comment: Website at https://mbzuai-llm.github.io/webpage2code/
☆ LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
Large Language Models (LLMs) equipped with extensive world knowledge and strong reasoning skills can tackle diverse tasks across domains, often by posing them as conversation-style instruction-response pairs. In this paper, we propose LLaRA: Large Language and Robotics Assistant, a framework which formulates robot action policy as conversations, and provides improved responses when trained with auxiliary data that complements policy learning. LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity to process state information as visual-textual prompts and generate optimal policy decisions in text. To train such action policy VLMs, we first introduce an automated pipeline to generate diverse high-quality robotics instruction data from existing behavior cloning data. A VLM finetuned with the resulting collection of datasets based on a conversation-style formulation tailored for robotics tasks, can generate meaningful robot action policy decisions. Our experiments across multiple simulated and real-world environments demonstrate the state-of-the-art performance of the proposed LLaRA framework. The code, datasets, and pretrained models are available at https://github.com/LostXine/LLaRA.
☆ ProgressGym: Alignment with a Millennium of Moral Progress
Frontier AI systems, including large language models (LLMs), hold increasing influence over the epistemology of human users. Such influence can reinforce prevailing societal values, potentially contributing to the lock-in of misguided moral beliefs and, consequently, the perpetuation of problematic moral practices on a broad scale. We introduce progress alignment as a technical solution to mitigate this imminent risk. Progress alignment algorithms learn to emulate the mechanics of human moral progress, thereby addressing the susceptibility of existing alignment methods to contemporary moral blindspots. To empower research in progress alignment, we introduce ProgressGym, an experimental framework allowing the learning of moral progress mechanics from history, in order to facilitate future progress in real-world moral decisions. Leveraging 9 centuries of historical text and 18 historical LLMs, ProgressGym enables codification of real-world progress alignment challenges into concrete benchmarks. Specifically, we introduce three core challenges: tracking evolving values (PG-Follow), preemptively anticipating moral progress (PG-Predict), and regulating the feedback loop between human and AI value shifts (PG-Coevolve). Alignment methods without a temporal dimension are inapplicable to these tasks. In response, we present lifelong and extrapolative algorithms as baseline methods of progress alignment, and build an open leaderboard soliciting novel algorithms and challenges. The framework and the leaderboard are available at https://github.com/PKU-Alignment/ProgressGym and https://huggingface.co/spaces/PKU-Alignment/ProgressGym-LeaderBoard respectively.
☆ AI for Extreme Event Modeling and Understanding: Methodologies and Challenges
In recent years, artificial intelligence (AI) has deeply impacted various fields, including Earth system sciences. Here, AI improved weather forecasting, model emulation, parameter estimation, and the prediction of extreme events. However, the latter comes with specific challenges, such as developing accurate predictors from noisy, heterogeneous and limited annotated data. This paper reviews how AI is being used to analyze extreme events (like floods, droughts, wildfires and heatwaves), highlighting the importance of creating accurate, transparent, and reliable AI models. We discuss the hurdles of dealing with limited data, integrating information in real-time, deploying models, and making them understandable, all crucial for gaining the trust of stakeholders and meeting regulatory needs. We provide an overview of how AI can help identify and explain extreme events more effectively, improving disaster response and communication. We emphasize the need for collaboration across different fields to create AI solutions that are practical, understandable, and trustworthy for analyzing and predicting extreme events. Such collaborative efforts aim to enhance disaster readiness and disaster risk reduction.
☆ Molecular Facts: Desiderata for Decontextualization in LLM Fact Verification
Automatic factuality verification of large language model (LLM) generations is becoming more and more widely used to combat hallucinations. A major point of tension in the literature is the granularity of this fact-checking: larger chunks of text are hard to fact-check, but more atomic facts like propositions may lack context to interpret correctly. In this work, we assess the role of context in these atomic facts. We argue that fully atomic facts are not the right representation, and define two criteria for molecular facts: decontextuality, or how well they can stand alone, and minimality, or how little extra information is added to achieve decontexuality. We quantify the impact of decontextualization on minimality, then present a baseline methodology for generating molecular facts automatically, aiming to add the right amount of information. We compare against various methods of decontextualization and find that molecular facts balance minimality with fact verification accuracy in ambiguous settings.
☆ Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. However, such access may also let malicious actors undermine model safety. To demonstrate the challenge of defending finetuning interfaces, we introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection. Our method constructs a malicious dataset where every individual datapoint appears innocuous, but finetuning on the dataset teaches the model to respond to encoded harmful requests with encoded harmful responses. Applied to GPT-4, our method produces a finetuned model that acts on harmful instructions 99% of the time and avoids detection by defense mechanisms such as dataset inspection, safety evaluations, and input/output classifiers. Our findings question whether black-box finetuning access can be secured against sophisticated adversaries.
comment: 22 pages
☆ Electrostatics-based particle sampling and approximate inference
A new particle-based sampling and approximate inference method, based on electrostatics and Newton mechanics principles, is introduced with theoretical ground, algorithm design and experimental validation. This method simulates an interacting particle system (IPS) where particles, i.e. the freely-moving negative charges and spatially-fixed positive charges with magnitudes proportional to the target distribution, interact with each other via attraction and repulsion induced by the resulting electric fields described by Poisson's equation. The IPS evolves towards a steady-state where the distribution of negative charges conforms to the target distribution. This physics-inspired method offers deterministic, gradient-free sampling and inference, achieving comparable performance as other particle-based and MCMC methods in benchmark tasks of inferring complex densities, Bayesian logistic regression and dynamical system identification. A discrete-time, discrete-space algorithmic design, readily extendable to continuous time and space, is provided for usage in more general inference problems occurring in probabilistic machine learning scenarios such as Bayesian inference, generative modelling, and beyond.
☆ BMW Agents -- A Framework For Task Automation Through Multi-agent Collaboration
Autonomous agents driven by Large Language Models (LLMs) offer enormous potential for automation. Early proof of this technology can be found in various demonstrations of agents solving complex tasks, interacting with external systems to augment their knowledge, and triggering actions. In particular, workflows involving multiple agents solving complex tasks in a collaborative fashion exemplify their capacity to operate in less strict and less well-defined environments. Thus, a multi-agent approach has great potential for serving as a backbone in many industrial applications, ranging from complex knowledge retrieval systems to next generation robotic process automation. Given the reasoning abilities within the current generation of LLMs, complex processes require a multi-step approach that includes a plan of well-defined and modular tasks. Depending on the level of complexity, these tasks can be executed either by a single agent or a group of agents. In this work, we focus on designing a flexible agent engineering framework with careful attention to planning and execution, capable of handling complex use case applications across various domains. The proposed framework provides reliability in industrial applications and presents techniques to ensure a scalable, flexible, and collaborative workflow for multiple autonomous agents working together towards solving tasks.
comment: 24 pages. 21 PDF images
☆ Pairwise Difference Learning for Classification
Pairwise difference learning (PDL) has recently been introduced as a new meta-learning technique for regression. Instead of learning a mapping from instances to outcomes in the standard way, the key idea is to learn a function that takes two instances as input and predicts the difference between the respective outcomes. Given a function of this kind, predictions for a query instance are derived from every training example and then averaged. This paper extends PDL toward the task of classification and proposes a meta-learning technique for inducing a PDL classifier by solving a suitably defined (binary) classification problem on a paired version of the original training data. We analyze the performance of the PDL classifier in a large-scale empirical study and find that it outperforms state-of-the-art methods in terms of prediction performance. Last but not least, we provide an easy-to-use and publicly available implementation of PDL in a Python package.
☆ ToolBeHonest: A Multi-level Hallucination Diagnostic Benchmark for Tool-Augmented Large Language Models
Tool-augmented large language models (LLMs) are rapidly being integrated into real-world applications. Due to the lack of benchmarks, the community still needs to fully understand the hallucination issues within these models. To address this challenge, we introduce a comprehensive diagnostic benchmark, ToolBH. Specifically, we assess the LLM's hallucinations through two perspectives: depth and breadth. In terms of depth, we propose a multi-level diagnostic process, including (1) solvability detection, (2) solution planning, and (3) missing-tool analysis. For breadth, we consider three scenarios based on the characteristics of the toolset: missing necessary tools, potential tools, and limited functionality tools. Furthermore, we developed seven tasks and collected 700 evaluation samples through multiple rounds of manual annotation. The results show the significant challenges presented by the ToolBH benchmark. The current advanced models Gemini-1.5-Pro and GPT-4o only achieve a total score of 45.3 and 37.0, respectively, on a scale of 100. In this benchmark, larger model parameters do not guarantee better performance; the training data and response strategies also play a crucial role in tool-enhanced LLM scenarios. Our diagnostic analysis indicates that the primary reason for model errors lies in assessing task solvability. Additionally, open-weight models suffer from performance drops with verbose replies, whereas proprietary models excel with longer reasoning.
☆ Wavelets Are All You Need for Autoregressive Image Generation
In this paper, we take a new approach to autoregressive image generation that is based on two main ingredients. The first is wavelet image coding, which allows to tokenize the visual details of an image from coarse to fine details by ordering the information starting with the most significant bits of the most significant wavelet coefficients. The second is a variant of a language transformer whose architecture is re-designed and optimized for token sequences in this 'wavelet language'. The transformer learns the significant statistical correlations within a token sequence, which are the manifestations of well-known correlations between the wavelet subbands at various resolutions. We show experimental results with conditioning on the generation process.
comment: 16 pages, 10 figures
☆ Single Parent Family: A Spectrum of Family Members from a Single Pre-Trained Foundation Model
This paper introduces a novel method of Progressive Low Rank Decomposition (PLRD) tailored for the compression of large language models. Our approach leverages a pre-trained model, which is then incrementally decompressed to smaller sizes using progressively lower ranks. This method allows for significant reductions in computational overhead and energy consumption, as subsequent models are derived from the original without the need for retraining from scratch. We detail the implementation of PLRD, which strategically decreases the tensor ranks, thus optimizing the trade-off between model performance and resource usage. The efficacy of PLRD is demonstrated through extensive experiments showing that models trained with PLRD method on only 1B tokens maintain comparable performance with traditionally trained models while using 0.1% of the tokens. The versatility of PLRD is highlighted by its ability to generate multiple model sizes from a single foundational model, adapting fluidly to varying computational and memory budgets. Our findings suggest that PLRD could set a new standard for the efficient scaling of LLMs, making advanced AI more feasible on diverse platforms.
☆ Into the Unknown: Generating Geospatial Descriptions for New Environments
Similar to vision-and-language navigation (VLN) tasks that focus on bridging the gap between vision and language for embodied navigation, the new Rendezvous (RVS) task requires reasoning over allocentric spatial relationships (independent of the observer's viewpoint) using non-sequential navigation instructions and maps. However, performance substantially drops in new environments with no training data. Using opensource descriptions paired with coordinates (e.g., Wikipedia) provides training data but suffers from limited spatially-oriented text resulting in low geolocation resolution. We propose a large-scale augmentation method for generating high-quality synthetic data for new environments using readily available geospatial data. Our method constructs a grounded knowledge-graph, capturing entity relationships. Sampled entities and relations (`shop north of school') generate navigation instructions via (i) generating numerous templates using context-free grammar (CFG) to embed specific entities and relations; (ii) feeding the entities and relation into a large language model (LLM) for instruction generation. A comprehensive evaluation on RVS, showed that our approach improves the 100-meter accuracy by 45.83% on unseen environments. Furthermore, we demonstrate that models trained with CFG-based augmentation achieve superior performance compared with those trained with LLM-based augmentation, both in unseen and seen environments. These findings suggest that the potential advantages of explicitly structuring spatial information for text-based geospatial reasoning in previously unknown, can unlock data-scarce scenarios.
☆ Text2Robot: Evolutionary Robot Design from Text Descriptions
Robot design has traditionally been costly and labor-intensive. Despite advancements in automated processes, it remains challenging to navigate a vast design space while producing physically manufacturable robots. We introduce Text2Robot, a framework that converts user text specifications and performance preferences into physical quadrupedal robots. Within minutes, Text2Robot can use text-to-3D models to provide strong initializations of diverse morphologies. Within a day, our geometric processing algorithms and body-control co-optimization produce a walking robot by explicitly considering real-world electronics and manufacturability. Text2Robot enables rapid prototyping and opens new opportunities for robot design with generative models.
comment: Our project website is at: https://generalroboticslab.com/Text2Robot
☆ Uncovering the hidden core-periphery structure in hyperbolic networks
The hyperbolic network models exhibit very fundamental and essential features, like small-worldness, scale-freeness, high-clustering coefficient, and community structure. In this paper, we comprehensively explore the presence of an important feature, the core-periphery structure, in the hyperbolic network models, which is often exhibited by real-world networks. We focused on well-known hyperbolic models such as popularity-similarity optimization model (PSO) and S1/H2 models and studied core-periphery structures using a well-established method that is based on standard random walk Markov chain model. The observed core-periphery centralization values indicate that the core-periphery structure can be very pronounced under certain conditions. We also validate our findings by statistically testing for the significance of the observed core-periphery structure in the network geometry. This study extends network science and reveals core-periphery insights applicable to various domains, enhancing network performance and resiliency in transportation and information systems.
☆ From the Least to the Most: Building a Plug-and-Play Visual Reasoner via Data Synthesis
We explore multi-step reasoning in vision-language models (VLMs). The problem is challenging, as reasoning data consisting of multiple steps of visual and language processing are barely available. To overcome the challenge, we first introduce a least-to-most visual reasoning paradigm, which interleaves steps of decomposing a question into sub-questions and invoking external tools for resolving sub-questions. Based on the paradigm, we further propose a novel data synthesis approach that can automatically create questions and multi-step reasoning paths for an image in a bottom-up manner. Our approach divides the complex synthesis task into a few simple sub-tasks, and (almost entirely) relies on open-sourced models to accomplish the sub-tasks. Therefore, the entire synthesis process is reproducible and cost-efficient, and the synthesized data is quality guaranteed. With the approach, we construct $50$k visual reasoning examples. Then, we develop a visual reasoner through supervised fine-tuning, which is capable of generally enhancing the reasoning abilities of a wide range of existing VLMs in a plug-and-play fashion. Extensive experiments indicate that the visual reasoner can consistently and significantly improve four VLMs on four VQA benchmarks. Our code and dataset are available at https://github.com/steven-ccq/VisualReasoner.
☆ Decoupling General and Personalized Knowledge in Federated Learning via Additive and Low-Rank Decomposition
To address data heterogeneity, the key strategy of Personalized Federated Learning (PFL) is to decouple general knowledge (shared among clients) and client-specific knowledge, as the latter can have a negative impact on collaboration if not removed. Existing PFL methods primarily adopt a parameter partitioning approach, where the parameters of a model are designated as one of two types: parameters shared with other clients to extract general knowledge and parameters retained locally to learn client-specific knowledge. However, as these two types of parameters are put together like a jigsaw puzzle into a single model during the training process, each parameter may simultaneously absorb both general and client-specific knowledge, thus struggling to separate the two types of knowledge effectively. In this paper, we introduce FedDecomp, a simple but effective PFL paradigm that employs parameter additive decomposition to address this issue. Instead of assigning each parameter of a model as either a shared or personalized one, FedDecomp decomposes each parameter into the sum of two parameters: a shared one and a personalized one, thus achieving a more thorough decoupling of shared and personalized knowledge compared to the parameter partitioning method. In addition, as we find that retaining local knowledge of specific clients requires much lower model capacity compared with general knowledge across all clients, we let the matrix containing personalized parameters be low rank during the training process. Moreover, a new alternating training strategy is proposed to further improve the performance. Experimental results across multiple datasets and varying degrees of data heterogeneity demonstrate that FedDecomp outperforms state-of-the-art methods up to 4.9\%.
comment: 12 pages, 8 figures
☆ AuthAttLyzer-V2: Unveiling Code Authorship Attribution using Enhanced Ensemble Learning Models & Generating Benchmark Dataset
Source Code Authorship Attribution (SCAA) is crucial for software classification because it provides insights into the origin and behavior of software. By accurately identifying the author or group behind a piece of code, experts can better understand the motivations and techniques of developers. In the cybersecurity era, this attribution helps trace the source of malicious software, identify patterns in the code that may indicate specific threat actors or groups, and ultimately enhance threat intelligence and mitigation strategies. This paper presents AuthAttLyzer-V2, a new source code feature extractor for SCAA, focusing on lexical, semantic, syntactic, and N-gram features. Our research explores author identification in C++ by examining 24,000 source code samples from 3,000 authors. Our methodology integrates Random Forest, Gradient Boosting, and XGBoost models, enhanced with SHAP for interpretability. The study demonstrates how ensemble models can effectively discern individual coding styles, offering insights into the unique attributes of code authorship. This approach is pivotal in understanding and interpreting complex patterns in authorship attribution, especially for malware classification.
☆ Fine-tuning of Geospatial Foundation Models for Aboveground Biomass Estimation
Global vegetation structure mapping is critical for understanding the global carbon cycle and maximizing the efficacy of nature-based carbon sequestration initiatives. Moreover, vegetation structure mapping can help reduce the impacts of climate change by, for example, guiding actions to improve water security, increase biodiversity and reduce flood risk. Global satellite measurements provide an important set of observations for monitoring and managing deforestation and degradation of existing forests, natural forest regeneration, reforestation, biodiversity restoration, and the implementation of sustainable agricultural practices. In this paper, we explore the effectiveness of fine-tuning of a geospatial foundation model to estimate above-ground biomass (AGB) using space-borne data collected across different eco-regions in Brazil. The fine-tuned model architecture consisted of a Swin-B transformer as the encoder (i.e., backbone) and a single convolutional layer for the decoder head. All results were compared to a U-Net which was trained as the baseline model Experimental results of this sparse-label prediction task demonstrate that the fine-tuned geospatial foundation model with a frozen encoder has comparable performance to a U-Net trained from scratch. This is despite the fine-tuned model having 13 times less parameters requiring optimization, which saves both time and compute resources. Further, we explore the transfer-learning capabilities of the geospatial foundation models by fine-tuning on satellite imagery with sparse labels from different eco-regions in Brazil.
☆ Detecting Subtle Differences between Human and Model Languages Using Spectrum of Relative Likelihood
Human and model-generated texts can be distinguished by examining the magnitude of likelihood in language. However, it is becoming increasingly difficult as language model's capabilities of generating human-like texts keep evolving. This study provides a new perspective by using the relative likelihood values instead of absolute ones, and extracting useful features from the spectrum-view of likelihood for the human-model text detection task. We propose a detection procedure with two classification methods, supervised and heuristic-based, respectively, which results in competitive performances with previous zero-shot detection methods and a new state-of-the-art on short-text detection. Our method can also reveal subtle differences between human and model languages, which find theoretical roots in psycholinguistics studies. Our code is available at https://github.com/CLCS-SUSTech/FourierGPT
comment: 13 pages, 12 figures
☆ MetaDesigner: Advancing Artistic Typography through AI-Driven, User-Centric, and Multilingual WordArt Synthesis
MetaDesigner revolutionizes artistic typography synthesis by leveraging the strengths of Large Language Models (LLMs) to drive a design paradigm centered around user engagement. At the core of this framework lies a multi-agent system comprising the Pipeline, Glyph, and Texture agents, which collectively enable the creation of customized WordArt, ranging from semantic enhancements to the imposition of complex textures. MetaDesigner incorporates a comprehensive feedback mechanism that harnesses insights from multimodal models and user evaluations to refine and enhance the design process iteratively. Through this feedback loop, the system adeptly tunes hyperparameters to align with user-defined stylistic and thematic preferences, generating WordArt that not only meets but exceeds user expectations of visual appeal and contextual relevance. Empirical validations highlight MetaDesigner's capability to effectively serve diverse WordArt applications, consistently producing aesthetically appealing and context-sensitive results.
comment: 18 pages, 16 figures, Project: https://modelscope.cn/studios/WordArt/WordArt
☆ YuLan: An Open-source Large Language Model
Large language models (LLMs) have become the foundation of many applications, leveraging their extensive capabilities in processing and understanding natural language. While many open-source LLMs have been released with technical reports, the lack of training details hinders further research and development. This paper presents the development of YuLan, a series of open-source LLMs with $12$ billion parameters. The base model of YuLan is pre-trained on approximately $1.7$T tokens derived from a diverse corpus, including massive English, Chinese, and multilingual texts. We design a three-stage pre-training method to enhance YuLan's overall capabilities. Subsequent phases of training incorporate instruction-tuning and human alignment, employing a substantial volume of high-quality synthesized data. To facilitate the learning of complex and long-tail knowledge, we devise a curriculum-learning framework throughout across these stages, which helps LLMs learn knowledge in an easy-to-hard manner. YuLan's training is finished on Jan, 2024 and has achieved performance on par with state-of-the-art LLMs across various English and Chinese benchmarks. This paper outlines a comprehensive technical roadmap for developing LLMs from scratch. Our model and codes are available at https://github.com/RUC-GSAI/YuLan-Chat.
☆ AnomaLLMy -- Detecting anomalous tokens in black-box LLMs through low-confidence single-token predictions
This paper introduces AnomaLLMy, a novel technique for the automatic detection of anomalous tokens in black-box Large Language Models (LLMs) with API-only access. Utilizing low-confidence single-token predictions as a cost-effective indicator, AnomaLLMy identifies irregularities in model behavior, addressing the issue of anomalous tokens degrading the quality and reliability of models. Validated on the cl100k_base dataset, the token set of GPT-4, AnomaLLMy detected 413 major and 65 minor anomalies, demonstrating the method's efficiency with just \$24.39 spent in API credits. The insights from this research are expected to be beneficial for enhancing the robustness of and accuracy of LLMs, particularly in the development and assessment of tokenizers.
comment: 6 pages
☆ BeamAggR: Beam Aggregation Reasoning over Multi-source Knowledge for Multi-hop Question Answering ACL 2024
Large language models (LLMs) have demonstrated strong reasoning capabilities. Nevertheless, they still suffer from factual errors when tackling knowledge-intensive tasks. Retrieval-augmented reasoning represents a promising approach. However, significant challenges still persist, including inaccurate and insufficient retrieval for complex questions, as well as difficulty in integrating multi-source knowledge. To address this, we propose Beam Aggregation Reasoning, BeamAggR, a reasoning framework for knowledge-intensive multi-hop QA. BeamAggR explores and prioritizes promising answers at each hop of question. Concretely, we parse the complex questions into trees, which include atom and composite questions, followed by bottom-up reasoning. For atomic questions, the LLM conducts reasoning on multi-source knowledge to get answer candidates. For composite questions, the LLM combines beam candidates, explores multiple reasoning paths through probabilistic aggregation, and prioritizes the most promising trajectory. Extensive experiments on four open-domain multi-hop reasoning datasets show that our method significantly outperforms SOTA methods by 8.5%. Furthermore, our analysis reveals that BeamAggR elicits better knowledge collaboration and answer aggregation.
comment: Accepted to ACL 2024
☆ Emotion Loss Attacking: Adversarial Attack Perception for Skeleton based on Multi-dimensional Features
Adversarial attack on skeletal motion is a hot topic. However, existing researches only consider part of dynamic features when measuring distance between skeleton graph sequences, which results in poor imperceptibility. To this end, we propose a novel adversarial attack method to attack action recognizers for skeletal motions. Firstly, our method systematically proposes a dynamic distance function to measure the difference between skeletal motions. Meanwhile, we innovatively introduce emotional features for complementary information. In addition, we use Alternating Direction Method of Multipliers(ADMM) to solve the constrained optimization problem, which generates adversarial samples with better imperceptibility to deceive the classifiers. Experiments show that our method is effective on multiple action classifiers and datasets. When the perturbation magnitude measured by l norms is the same, the dynamic perturbations generated by our method are much lower than that of other methods. What's more, we are the first to prove the effectiveness of emotional features, and provide a new idea for measuring the distance between skeletal motions.
☆ Fuzzy Logic Guided Reward Function Variation: An Oracle for Testing Reinforcement Learning Programs
Reinforcement Learning (RL) has gained significant attention across various domains. However, the increasing complexity of RL programs presents testing challenges, particularly the oracle problem: defining the correctness of the RL program. Conventional human oracles struggle to cope with the complexity, leading to inefficiencies and potential unreliability in RL testing. To alleviate this problem, we propose an automated oracle approach that leverages RL properties using fuzzy logic. Our oracle quantifies an agent's behavioral compliance with reward policies and analyzes its trend over training episodes. It labels an RL program as "Buggy" if the compliance trend violates expectations derived from RL characteristics. We evaluate our oracle on RL programs with varying complexities and compare it with human oracles. Results show that while human oracles perform well in simpler testing scenarios, our fuzzy oracle demonstrates superior performance in complex environments. The proposed approach shows promise in addressing the oracle problem for RL testing, particularly in complex cases where manual testing falls short. It offers a potential solution to improve the efficiency, reliability, and scalability of RL program testing. This research takes a step towards automated testing of RL programs and highlights the potential of fuzzy logic-based oracles in tackling the oracle problem.
comment: 10 pages, 5 figures
☆ Deceptive Diffusion: Generating Synthetic Adversarial Examples
We introduce the concept of deceptive diffusion -- training a generative AI model to produce adversarial images. Whereas a traditional adversarial attack algorithm aims to perturb an existing image to induce a misclassificaton, the deceptive diffusion model can create an arbitrary number of new, misclassified images that are not directly associated with training or test images. Deceptive diffusion offers the possibility of strengthening defence algorithms by providing adversarial training data at scale, including types of misclassification that are otherwise difficult to find. In our experiments, we also investigate the effect of training on a partially attacked data set. This highlights a new type of vulnerability for generative diffusion models: if an attacker is able to stealthily poison a portion of the training data, then the resulting diffusion model will generate a similar proportion of misleading outputs.
Self-Supervised Spatial-Temporal Normality Learning for Time Series Anomaly Detection ECML
Time Series Anomaly Detection (TSAD) finds widespread applications across various domains such as financial markets, industrial production, and healthcare. Its primary objective is to learn the normal patterns of time series data, thereby identifying deviations in test samples. Most existing TSAD methods focus on modeling data from the temporal dimension, while ignoring the semantic information in the spatial dimension. To address this issue, we introduce a novel approach, called Spatial-Temporal Normality learning (STEN). STEN is composed of a sequence Order prediction-based Temporal Normality learning (OTN) module that captures the temporal correlations within sequences, and a Distance prediction-based Spatial Normality learning (DSN) module that learns the relative spatial relations between sequences in a feature space. By synthesizing these two modules, STEN learns expressive spatial-temporal representations for the normal patterns hidden in the time series data. Extensive experiments on five popular TSAD benchmarks show that STEN substantially outperforms state-of-the-art competing methods. Our code is available at https://github.com/mala-lab/STEN.
comment: 18 pages, 4 figures, accepted in ECML PKDD2024
☆ xSemAD: Explainable Semantic Anomaly Detection in Event Logs Using Sequence-to-Sequence Models
The identification of undesirable behavior in event logs is an important aspect of process mining that is often addressed by anomaly detection methods. Traditional anomaly detection methods tend to focus on statistically rare behavior and neglect the subtle difference between rarity and undesirability. The introduction of semantic anomaly detection has opened a promising avenue by identifying semantically deviant behavior. This work addresses a gap in semantic anomaly detection, which typically indicates the occurrence of an anomaly without explaining the nature of the anomaly. We propose xSemAD, an approach that uses a sequence-to-sequence model to go beyond pure identification and provides extended explanations. In essence, our approach learns constraints from a given process model repository and then checks whether these constraints hold in the considered event log. This approach not only helps understand the specifics of the undesired behavior, but also facilitates targeted corrective actions. Our experiments demonstrate that our approach outperforms existing state-of-the-art semantic anomaly detection methods.
comment: Accepted at BPM 2024
☆ Structure-aware World Model for Probe Guidance via Large-scale Self-supervised Pre-train
The complex structure of the heart leads to significant challenges in echocardiography, especially in acquisition cardiac ultrasound images. Successful echocardiography requires a thorough understanding of the structures on the two-dimensional plane and the spatial relationships between planes in three-dimensional space. In this paper, we innovatively propose a large-scale self-supervised pre-training method to acquire a cardiac structure-aware world model. The core innovation lies in constructing a self-supervised task that requires structural inference by predicting masked structures on a 2D plane and imagining another plane based on pose transformation in 3D space. To support large-scale pre-training, we collected over 1.36 million echocardiograms from ten standard views, along with their 3D spatial poses. In the downstream probe guidance task, we demonstrate that our pre-trained model consistently reduces guidance errors across the ten most common standard views on the test set with 0.29 million samples from 74 routine clinical scans, indicating that structure-aware pre-training benefits the scanning.
comment: Technical report
☆ Protein Representation Learning with Sequence Information Embedding: Does it Always Lead to a Better Performance?
Deep learning has become a crucial tool in studying proteins. While the significance of modeling protein structure has been discussed extensively in the literature, amino acid types are typically included in the input as a default operation for many inference tasks. This study demonstrates with structure alignment task that embedding amino acid types in some cases may not help a deep learning model learn better representation. To this end, we propose ProtLOCA, a local geometry alignment method based solely on amino acid structure representation. The effectiveness of ProtLOCA is examined by a global structure-matching task on protein pairs with an independent test dataset based on CATH labels. Our method outperforms existing sequence- and structure-based representation learning methods by more quickly and accurately matching structurally consistent protein domains. Furthermore, in local structure pairing tasks, ProtLOCA for the first time provides a valid solution to highlight common local structures among proteins with different overall structures but the same function. This suggests a new possibility for using deep learning methods to analyze protein structure to infer function.
comment: 8 pages, 4 figures
☆ ROS-LLM: A ROS framework for embodied AI with task feedback and structured reasoning
We present a framework for intuitive robot programming by non-experts, leveraging natural language prompts and contextual information from the Robot Operating System (ROS). Our system integrates large language models (LLMs), enabling non-experts to articulate task requirements to the system through a chat interface. Key features of the framework include: integration of ROS with an AI agent connected to a plethora of open-source and commercial LLMs, automatic extraction of a behavior from the LLM output and execution of ROS actions/services, support for three behavior modes (sequence, behavior tree, state machine), imitation learning for adding new robot actions to the library of possible actions, and LLM reflection via human and environment feedback. Extensive experiments validate the framework, showcasing robustness, scalability, and versatility in diverse scenarios, including long-horizon tasks, tabletop rearrangements, and remote supervisory control. To facilitate the adoption of our framework and support the reproduction of our results, we have made our code open-source. You can access it at: https://github.com/huawei-noah/HEBO/tree/master/ROSLLM.
comment: This document contains 26 pages and 13 figures
☆ Classical Bandit Algorithms for Entanglement Detection in Parameterized Qubit States
Entanglement is a key resource for a wide range of tasks in quantum information and computing. Thus, verifying availability of this quantum resource is essential. Extensive research on entanglement detection has led to no-go theorems (Lu et al. [Phys. Rev. Lett., 116, 230501 (2016)]) that highlight the need for full state tomography (FST) in the absence of adaptive or joint measurements. Recent advancements, as proposed by Zhu, Teo, and Englert [Phys. Rev. A, 81, 052339, 2010], introduce a single-parameter family of entanglement witness measurements which are capable of conclusively detecting certain entangled states and only resort to FST when all witness measurements are inconclusive. We find a variety of realistic noisy two-qubit quantum states $\mathcal{F}$ that yield conclusive results under this witness family. We solve the problem of detecting entanglement among $K$ quantum states in $\mathcal{F}$, of which $m$ states are entangled, with $m$ potentially unknown. We recognize a structural connection of this problem to the Bad Arm Identification problem in stochastic Multi-Armed Bandits (MAB). In contrast to existing quantum bandit frameworks, we establish a new correspondence tailored for entanglement detection and term it the $(m,K)$-quantum Multi-Armed Bandit. We implement two well-known MAB policies for arbitrary states derived from $\mathcal{F}$, present theoretical guarantees on the measurement/sample complexity and demonstrate the practicality of the policies through numerical simulations. More broadly, this paper highlights the potential for employing classical machine learning techniques for quantum entanglement detection.
comment: 20 pages, 5 figures
☆ MM-Instruct: Generated Visual Instructions for Large Multimodal Model Alignment
This paper introduces MM-Instruct, a large-scale dataset of diverse and high-quality visual instruction data designed to enhance the instruction-following capabilities of large multimodal models (LMMs). While existing visual instruction datasets often focus on question-answering, they struggle to generalize to broader application scenarios such as creative writing, summarization, or image analysis. To address these limitations, we propose a novel approach to constructing MM-Instruct that leverages the strong instruction-following capabilities of existing LLMs to generate novel visual instruction data from large-scale but conventional image captioning datasets. MM-Instruct first leverages ChatGPT to automatically generate diverse instructions from a small set of seed instructions through augmenting and summarization. It then matches these instructions with images and uses an open-sourced large language model (LLM) to generate coherent answers to the instruction-image pairs. The LLM is grounded by the detailed text descriptions of images in the whole answer generation process to guarantee the alignment of the instruction data. Moreover, we introduce a benchmark based on the generated instruction data to evaluate the instruction-following capabilities of existing LMMs. We demonstrate the effectiveness of MM-Instruct by training a LLaVA-1.5 model on the generated data, denoted as LLaVA-Instruct, which exhibits significant improvements in instruction-following capabilities compared to LLaVA-1.5 models. The MM-Instruct dataset, benchmark, and pre-trained models are available at https://github.com/jihaonew/MM-Instruct.
comment: Dataset and models are available at https://github.com/jihaonew/MM-Instruct
☆ CUPID: Improving Battle Fairness and Position Satisfaction in Online MOBA Games with a Re-matchmaking System SC
The multiplayer online battle arena (MOBA) genre has gained significant popularity and economic success, attracting considerable research interest within the Human-Computer Interaction community. Enhancing the gaming experience requires a deep understanding of player behavior, and a crucial aspect of MOBA games is matchmaking, which aims to assemble teams of comparable skill levels. However, existing matchmaking systems often neglect important factors such as players' position preferences and team assignment, resulting in imbalanced matches and reduced player satisfaction. To address these limitations, this paper proposes a novel framework called CUPID, which introduces a novel process called ``re-matchmaking'' to optimize team and position assignments to improve both fairness and player satisfaction. CUPID incorporates a pre-filtering step to ensure a minimum level of matchmaking quality, followed by a pre-match win-rate prediction model that evaluates the fairness of potential assignments. By simultaneously considering players' position satisfaction and game fairness, CUPID aims to provide an enhanced matchmaking experience. Extensive experiments were conducted on two large-scale, real-world MOBA datasets to validate the effectiveness of CUPID. The results surpass all existing state-of-the-art baselines, with an average relative improvement of 7.18% in terms of win prediction accuracy. Furthermore, CUPID has been successfully deployed in a popular online mobile MOBA game. The deployment resulted in significant improvements in match fairness and player satisfaction, as evidenced by critical Human-Computer Interaction (HCI) metrics covering usability, accessibility, and engagement, observed through A/B testing. To the best of our knowledge, CUPID is the first re-matchmaking system designed specifically for large-scale MOBA games.
comment: 38 pages, accepted by CSCW 24
☆ Uncertainty Quantification in Large Language Models Through Convex Hull Analysis
Uncertainty quantification approaches have been more critical in large language models (LLMs), particularly high-risk applications requiring reliable outputs. However, traditional methods for uncertainty quantification, such as probabilistic models and ensemble techniques, face challenges when applied to the complex and high-dimensional nature of LLM-generated outputs. This study proposes a novel geometric approach to uncertainty quantification using convex hull analysis. The proposed method leverages the spatial properties of response embeddings to measure the dispersion and variability of model outputs. The prompts are categorized into three types, i.e., `easy', `moderate', and `confusing', to generate multiple responses using different LLMs at varying temperature settings. The responses are transformed into high-dimensional embeddings via a BERT model and subsequently projected into a two-dimensional space using Principal Component Analysis (PCA). The Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm is utilized to cluster the embeddings and compute the convex hull for each selected cluster. The experimental results indicate that the uncertainty of the model for LLMs depends on the prompt complexity, the model, and the temperature setting.
comment: 17 pages
☆ A Differentiable Approach to Multi-scale Brain Modeling ICML 2024
We present a multi-scale differentiable brain modeling workflow utilizing BrainPy, a unique differentiable brain simulator that combines accurate brain simulation with powerful gradient-based optimization. We leverage this capability of BrainPy across different brain scales. At the single-neuron level, we implement differentiable neuron models and employ gradient methods to optimize their fit to electrophysiological data. On the network level, we incorporate connectomic data to construct biologically constrained network models. Finally, to replicate animal behavior, we train these models on cognitive tasks using gradient-based learning rules. Experiments demonstrate that our approach achieves superior performance and speed in fitting generalized leaky integrate-and-fire and Hodgkin-Huxley single neuron models. Additionally, training a biologically-informed network of excitatory and inhibitory spiking neurons on working memory tasks successfully replicates observed neural activity and synaptic weight distributions. Overall, our differentiable multi-scale simulation approach offers a promising tool to bridge neuroscience data across electrophysiological, anatomical, and behavioral scales.
comment: 2nd Differentiable Almost Everything Workshop at ICML 2024
☆ DISCO: Efficient Diffusion Solver for Large-Scale Combinatorial Optimization Problems
Combinatorial Optimization (CO) problems are fundamentally crucial in numerous practical applications across diverse industries, characterized by entailing enormous solution space and demanding time-sensitive response. Despite significant advancements made by recent neural solvers, their limited expressiveness does not conform well to the multi-modal nature of CO landscapes. While some research has pivoted towards diffusion models, they require simulating a Markov chain with many steps to produce a sample, which is time-consuming and does not meet the efficiency requirement of real applications, especially at scale. We propose DISCO, an efficient DIffusion Solver for Combinatorial Optimization problems that excels in both solution quality and inference speed. DISCO's efficacy is two-pronged: Firstly, it achieves rapid denoising of solutions through an analytically solvable form, allowing for direct sampling from the solution space with very few reverse-time steps, thereby drastically reducing inference time. Secondly, DISCO enhances solution quality by restricting the sampling space to a more constrained, meaningful domain guided by solution residues, while still preserving the inherent multi-modality of the output probabilistic distributions. DISCO achieves state-of-the-art results on very large Traveling Salesman Problems with 10000 nodes and challenging Maximal Independent Set benchmarks, with its per-instance denoising time up to 44.8 times faster. Through further combining a divide-and-conquer strategy, DISCO can be generalized to solve arbitrary-scale problem instances off the shelf, even outperforming models trained specifically on corresponding scales.
☆ Deep Fusion Model for Brain Tumor Classification Using Fine-Grained Gradient Preservation
Brain tumors are one of the most common diseases that lead to early death if not diagnosed at an early stage. Traditional diagnostic approaches are extremely time-consuming and prone to errors. In this context, computer vision-based approaches have emerged as an effective tool for accurate brain tumor classification. While some of the existing solutions demonstrate noteworthy accuracy, the models become infeasible to deploy in areas where computational resources are limited. This research addresses the need for accurate and fast classification of brain tumors with a priority of deploying the model in technologically underdeveloped regions. The research presents a novel architecture for precise brain tumor classification fusing pretrained ResNet152V2 and modified VGG16 models. The proposed architecture undergoes a diligent fine-tuning process that ensures fine gradients are preserved in deep neural networks, which are essential for effective brain tumor classification. The proposed solution incorporates various image processing techniques to improve image quality and achieves an astounding accuracy of 98.36% and 98.04% in Figshare and Kaggle datasets respectively. This architecture stands out for having a streamlined profile, with only 2.8 million trainable parameters. We have leveraged 8-bit quantization to produce a model of size 73.881 MB, significantly reducing it from the previous size of 289.45 MB, ensuring smooth deployment in edge devices even in resource-constrained areas. Additionally, the use of Grad-CAM improves the interpretability of the model, offering insightful information regarding its decision-making process. Owing to its high discriminative ability, this model can be a reliable option for accurate brain tumor classification.
☆ Enhancing Radiological Diagnosis: A Collaborative Approach Integrating AI and Human Expertise for Visual Miss Correction
Human-AI collaboration to identify and correct perceptual errors in chest radiographs has not been previously explored. This study aimed to develop a collaborative AI system, CoRaX, which integrates eye gaze data and radiology reports to enhance diagnostic accuracy in chest radiology by pinpointing perceptual errors and refining the decision-making process. Using public datasets REFLACX and EGD-CXR, the study retrospectively developed CoRaX, employing a large multimodal model to analyze image embeddings, eye gaze data, and radiology reports. The system's effectiveness was evaluated based on its referral-making process, the quality of referrals, and performance in collaborative diagnostic settings. CoRaX was tested on a simulated error dataset of 271 samples with 28% (93 of 332) missed abnormalities. The system corrected 21% (71 of 332) of these errors, leaving 7% (22 of 312) unresolved. The Referral-Usefulness score, indicating the accuracy of predicted regions for all true referrals, was 0.63 (95% CI 0.59, 0.68). The Total-Usefulness score, reflecting the diagnostic accuracy of CoRaX's interactions with radiologists, showed that 84% (237 of 280) of these interactions had a score above 0.40. In conclusion, CoRaX efficiently collaborates with radiologists to address perceptual errors across various abnormalities, with potential applications in the education and training of novice radiologists.
comment: Under Review in Journal
☆ MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance
In recent years, generative artificial intelligence has achieved significant advancements in the field of image generation, spawning a variety of applications. However, video generation still faces considerable challenges in various aspects, such as controllability, video length, and richness of details, which hinder the application and popularization of this technology. In this work, we propose a controllable video generation framework, dubbed MimicMotion, which can generate high-quality videos of arbitrary length mimicking specific motion guidance. Compared with previous methods, our approach has several highlights. Firstly, we introduce confidence-aware pose guidance that ensures high frame quality and temporal smoothness. Secondly, we introduce regional loss amplification based on pose confidence, which significantly reduces image distortion. Lastly, for generating long and smooth videos, we propose a progressive latent fusion strategy. By this means, we can produce videos of arbitrary length with acceptable resource consumption. With extensive experiments and user studies, MimicMotion demonstrates significant improvements over previous approaches in various aspects. Detailed results and comparisons are available on our project page: https://tencent.github.io/MimicMotion .
☆ Function+Data Flow: A Framework to Specify Machine Learning Pipelines for Digital Twinning
The development of digital twins (DTs) for physical systems increasingly leverages artificial intelligence (AI), particularly for combining data from different sources or for creating computationally efficient, reduced-dimension models. Indeed, even in very different application domains, twinning employs common techniques such as model order reduction and modelization with hybrid data (that is, data sourced from both physics-based models and sensors). Despite this apparent generality, current development practices are ad-hoc, making the design of AI pipelines for digital twinning complex and time-consuming. Here we propose Function+Data Flow (FDF), a domain-specific language (DSL) to describe AI pipelines within DTs. FDF aims to facilitate the design and validation of digital twins. Specifically, FDF treats functions as first-class citizens, enabling effective manipulation of models learned with AI. We illustrate the benefits of FDF on two concrete use cases from different domains: predicting the plastic strain of a structure and modeling the electromagnetic behavior of a bearing.
comment: 10 pages, 5 figures, to be published in AIware'24
☆ ACES: Automatic Cohort Extraction System for Event-Stream Datasets
Reproducibility remains a significant challenge in machine learning (ML) for healthcare. In this field, datasets, model pipelines, and even task/cohort definitions are often private, leading to a significant barrier in sharing, iterating, and understanding ML results on electronic health record (EHR) datasets. In this paper, we address a significant part of this problem by introducing the Automatic Cohort Extraction System for Event-Stream Datasets (ACES). This tool is designed to simultaneously simplify the development of task/cohorts for ML in healthcare and enable the reproduction of these cohorts, both at an exact level for single datasets and at a conceptual level across datasets. To accomplish this, ACES provides (1) a highly intuitive and expressive configuration language for defining both dataset-specific concepts and dataset-agnostic inclusion/exclusion criteria, and (2) a pipeline to automatically extract patient records that meet these defined criteria from real-world data. ACES can be automatically applied to any dataset in either the Medical Event Data Standard (MEDS) or EventStreamGPT (ESGPT) formats, or to *any* dataset for which the necessary task-specific predicates can be extracted in an event-stream form. ACES has the potential to significantly lower the barrier to entry for defining ML tasks, redefine the way researchers interact with EHR datasets, and significantly improve the state of reproducibility for ML studies in this modality. ACES is available at https://github.com/justin13601/aces.
comment: For ACES Online Documentation, see https://eventstreamaces.readthedocs.io/en/latest/
☆ CANDY: A Benchmark for Continuous Approximate Nearest Neighbor Search with Dynamic Data Ingestion
Approximate K Nearest Neighbor (AKNN) algorithms play a pivotal role in various AI applications, including information retrieval, computer vision, and natural language processing. Although numerous AKNN algorithms and benchmarks have been developed recently to evaluate their effectiveness, the dynamic nature of real-world data presents significant challenges that existing benchmarks fail to address. Traditional benchmarks primarily assess retrieval effectiveness in static contexts and often overlook update efficiency, which is crucial for handling continuous data ingestion. This limitation results in an incomplete assessment of an AKNN algorithms ability to adapt to changing data patterns, thereby restricting insights into their performance in dynamic environments. To address these gaps, we introduce CANDY, a benchmark tailored for Continuous Approximate Nearest Neighbor Search with Dynamic Data Ingestion. CANDY comprehensively assesses a wide range of AKNN algorithms, integrating advanced optimizations such as machine learning-driven inference to supplant traditional heuristic scans, and improved distance computation methods to reduce computational overhead. Our extensive evaluations across diverse datasets demonstrate that simpler AKNN baselines often surpass more complex alternatives in terms of recall and latency. These findings challenge established beliefs about the necessity of algorithmic complexity for high performance. Furthermore, our results underscore existing challenges and illuminate future research opportunities. We have made the datasets and implementation methods available at: https://github.com/intellistream/candy.
☆ Designing and Evaluating Multi-Chatbot Interface for Human-AI Communication: Preliminary Findings from a Persuasion Task
The dynamics of human-AI communication have been reshaped by language models such as ChatGPT. However, extant research has primarily focused on dyadic communication, leaving much to be explored regarding the dynamics of human-AI communication in group settings. The availability of multiple language model chatbots presents a unique opportunity for scholars to better understand the interaction between humans and multiple chatbots. This study examines the impact of multi-chatbot communication in a specific persuasion setting: promoting charitable donations. We developed an online environment that enables multi-chatbot communication and conducted a pilot experiment utilizing two GPT-based chatbots, Save the Children and UNICEF chatbots, to promote charitable donations. In this study, we present our development process of the multi-chatbot interface and present preliminary findings from a pilot experiment. Analysis of qualitative and quantitative feedback are presented, and limitations are addressed.
☆ Beyond Human Preferences: Exploring Reinforcement Learning Trajectory Evaluation and Improvement through LLMs
Reinforcement learning (RL) faces challenges in evaluating policy trajectories within intricate game tasks due to the difficulty in designing comprehensive and precise reward functions. This inherent difficulty curtails the broader application of RL within game environments characterized by diverse constraints. Preference-based reinforcement learning (PbRL) presents a pioneering framework that capitalizes on human preferences as pivotal reward signals, thereby circumventing the need for meticulous reward engineering. However, obtaining preference data from human experts is costly and inefficient, especially under conditions marked by complex constraints. To tackle this challenge, we propose a LLM-enabled automatic preference generation framework named LLM4PG , which harnesses the capabilities of large language models (LLMs) to abstract trajectories, rank preferences, and reconstruct reward functions to optimize conditioned policies. Experiments on tasks with complex language constraints demonstrated the effectiveness of our LLM-enabled reward functions, accelerating RL convergence and overcoming stagnation caused by slow or absent progress under original reward structures. This approach mitigates the reliance on specialized human knowledge and demonstrates the potential of LLMs to enhance RL's effectiveness in complex environments in the wild.
☆ Unlocking Varied Perspectives: A Persona-Based Multi-Agent Framework with Debate-Driven Text Planning for Argument Generation
Writing persuasive arguments is a challenging task for both humans and machines. It entails incorporating high-level beliefs from various perspectives on the topic, along with deliberate reasoning and planning to construct a coherent narrative. Current language models often generate surface tokens autoregressively, lacking explicit integration of these underlying controls, resulting in limited output diversity and coherence. In this work, we propose a persona-based multi-agent framework for argument writing. Inspired by the human debate, we first assign each agent a persona representing its high-level beliefs from a unique perspective, and then design an agent interaction process so that the agents can collaboratively debate and discuss the idea to form an overall plan for argument writing. Such debate process enables fluid and nonlinear development of ideas. We evaluate our framework on argumentative essay writing. The results show that our framework can generate more diverse and persuasive arguments through both automatic and human evaluations.
☆ Precision matters: Precision-aware ensemble for weakly supervised semantic segmentation AAAI 2024
Weakly Supervised Semantic Segmentation (WSSS) employs weak supervision, such as image-level labels, to train the segmentation model. Despite the impressive achievement in recent WSSS methods, we identify that introducing weak labels with high mean Intersection of Union (mIoU) does not guarantee high segmentation performance. Existing studies have emphasized the importance of prioritizing precision and reducing noise to improve overall performance. In the same vein, we propose ORANDNet, an advanced ensemble approach tailored for WSSS. ORANDNet combines Class Activation Maps (CAMs) from two different classifiers to increase the precision of pseudo-masks (PMs). To further mitigate small noise in the PMs, we incorporate curriculum learning. This involves training the segmentation model initially with pairs of smaller-sized images and corresponding PMs, gradually transitioning to the original-sized pairs. By combining the original CAMs of ResNet-50 and ViT, we significantly improve the segmentation performance over the single-best model and the naive ensemble model, respectively. We further extend our ensemble method to CAMs from AMN (ResNet-like) and MCTformer (ViT-like) models, achieving performance benefits in advanced WSSS models. It highlights the potential of our ORANDNet as a final add-on module for WSSS models.
comment: 5 pages, 5 figures, accepted in AAAI 2024 Edge Intelligence Workshop
☆ Optimal Video Compression using Pixel Shift Tracking
The Video comprises approximately ~85\% of all internet traffic, but video encoding/compression is being historically done with hard coded rules, which has worked well but only to a certain limit. We have seen a surge in video compression algorithms using ML-based models in the last few years and many of them have outperformed several legacy codecs. The models range from encoding video end to end using an ML approach or replacing some intermediate steps in legacy codecs using ML models to increase the efficiency of those steps. Optimizing video storage is an essential aspect of video processing, so we are proposing one of the possible approaches to achieve it is by avoiding redundant data at each frame. In this paper, we want to introduce the approach of redundancies removal in subsequent frames for a given video as a main approach for video compression. We call this method Redundancy Removal using Shift (R\textsuperscript2S). This method can be utilized across various Machine Learning model algorithms, and make the compression more accessible and adaptable. In this study, we have utilized a computer vision-based pixel point tracking method to identify redundant pixels to encode video for optimal storage.
☆ Safety through feedback in Constrained RL
In safety-critical RL settings, the inclusion of an additional cost function is often favoured over the arduous task of modifying the reward function to ensure the agent's safe behaviour. However, designing or evaluating such a cost function can be prohibitively expensive. For instance, in the domain of self-driving, designing a cost function that encompasses all unsafe behaviours (e.g. aggressive lane changes) is inherently complex. In such scenarios, the cost function can be learned from feedback collected offline in between training rounds. This feedback can be system generated or elicited from a human observing the training process. Previous approaches have not been able to scale to complex environments and are constrained to receiving feedback at the state level which can be expensive to collect. To this end, we introduce an approach that scales to more complex domains and extends to beyond state-level feedback, thus, reducing the burden on the evaluator. Inferring the cost function in such settings poses challenges, particularly in assigning credit to individual states based on trajectory-level feedback. To address this, we propose a surrogate objective that transforms the problem into a state-level supervised classification task with noisy labels, which can be solved efficiently. Additionally, it is often infeasible to collect feedback on every trajectory generated by the agent, hence, two fundamental questions arise: (1) Which trajectories should be presented to the human? and (2) How many trajectories are necessary for effective learning? To address these questions, we introduce \textit{novelty-based sampling} that selectively involves the evaluator only when the the agent encounters a \textit{novel} trajectory. We showcase the efficiency of our method through experimentation on several benchmark Safety Gymnasium environments and realistic self-driving scenarios.
☆ Data-Driven Lipschitz Continuity: A Cost-Effective Approach to Improve Adversarial Robustness
The security and robustness of deep neural networks (DNNs) have become increasingly concerning. This paper aims to provide both a theoretical foundation and a practical solution to ensure the reliability of DNNs. We explore the concept of Lipschitz continuity to certify the robustness of DNNs against adversarial attacks, which aim to mislead the network with adding imperceptible perturbations into inputs. We propose a novel algorithm that remaps the input domain into a constrained range, reducing the Lipschitz constant and potentially enhancing robustness. Unlike existing adversarially trained models, where robustness is enhanced by introducing additional examples from other datasets or generative models, our method is almost cost-free as it can be integrated with existing models without requiring re-training. Experimental results demonstrate the generalizability of our method, as it can be combined with various models and achieve enhancements in robustness. Furthermore, our method achieves the best robust accuracy for CIFAR10, CIFAR100, and ImageNet datasets on the RobustBench leaderboard.
☆ A Survey on Data Quality Dimensions and Tools for Machine Learning
Machine learning (ML) technologies have become substantial in practically all aspects of our society, and data quality (DQ) is critical for the performance, fairness, robustness, safety, and scalability of ML models. With the large and complex data in data-centric AI, traditional methods like exploratory data analysis (EDA) and cross-validation (CV) face challenges, highlighting the importance of mastering DQ tools. In this survey, we review 17 DQ evaluation and improvement tools in the last 5 years. By introducing the DQ dimensions, metrics, and main functions embedded in these tools, we compare their strengths and limitations and propose a roadmap for developing open-source DQ tools for ML. Based on the discussions on the challenges and emerging trends, we further highlight the potential applications of large language models (LLMs) and generative AI in DQ evaluation and improvement for ML. We believe this comprehensive survey can enhance understanding of DQ in ML and could drive progress in data-centric AI. A complete list of the literature investigated in this survey is available on GitHub at: https://github.com/haihua0913/awesome-dq4ml.
comment: This paper has been accepted by The 6th IEEE International Conference on Artificial Intelligence Testing (IEEE AITest 2024) as an invited paper
☆ Multimodal Data Integration for Precision Oncology: Challenges and Future Directions
The essence of precision oncology lies in its commitment to tailor targeted treatments and care measures to each patient based on the individual characteristics of the tumor. The inherent heterogeneity of tumors necessitates gathering information from diverse data sources to provide valuable insights from various perspectives, fostering a holistic comprehension of the tumor. Over the past decade, multimodal data integration technology for precision oncology has made significant strides, showcasing remarkable progress in understanding the intricate details within heterogeneous data modalities. These strides have exhibited tremendous potential for improving clinical decision-making and model interpretation, contributing to the advancement of cancer care and treatment. Given the rapid progress that has been achieved, we provide a comprehensive overview of about 300 papers detailing cutting-edge multimodal data integration techniques in precision oncology. In addition, we conclude the primary clinical applications that have reaped significant benefits, including early assessment, diagnosis, prognosis, and biomarker discovery. Finally, derived from the findings of this survey, we present an in-depth analysis that explores the pivotal challenges and reveals essential pathways for future research in the field of multimodal data integration for precision oncology.
comment: 15 pages, 4 figures
☆ Optimizing Cyber Defense in Dynamic Active Directories through Reinforcement Learning ESORICS
This paper addresses a significant gap in Autonomous Cyber Operations (ACO) literature: the absence of effective edge-blocking ACO strategies in dynamic, real-world networks. It specifically targets the cybersecurity vulnerabilities of organizational Active Directory (AD) systems. Unlike the existing literature on edge-blocking defenses which considers AD systems as static entities, our study counters this by recognizing their dynamic nature and developing advanced edge-blocking defenses through a Stackelberg game model between attacker and defender. We devise a Reinforcement Learning (RL)-based attack strategy and an RL-assisted Evolutionary Diversity Optimization-based defense strategy, where the attacker and defender improve each other strategy via parallel gameplay. To address the computational challenges of training attacker-defender strategies on numerous dynamic AD graphs, we propose an RL Training Facilitator that prunes environments and neural networks to eliminate irrelevant elements, enabling efficient and scalable training for large graphs. We extensively train the attacker strategy, as a sophisticated attacker model is essential for a robust defense. Our empirical results successfully demonstrate that our proposed approach enhances defender's proficiency in hardening dynamic AD graphs while ensuring scalability for large-scale AD.
comment: The manuscript has been accepted as full paper at European Symposium on Research in Computer Security (ESORICS) 2024
♻ ☆ AutoMix: Automatically Mixing Language Models
Large language models (LLMs) are now available from cloud API providers in various sizes and configurations. While this diversity offers a broad spectrum of choices, effectively leveraging the options to optimize computational cost and performance remains challenging. In this work, we present Automix, an approach that strategically routes queries to larger LMs, based on the approximate correctness of outputs from a smaller LM. Central to Automix are two key technical contributions. First, it has a few-shot self-verification mechanism, which estimates the reliability of its own outputs without requiring extensive training. Second, given that self-verification can be noisy, it employs a POMDP based router that can effectively select an appropriately sized model, based on answer confidence. Experiments across five language models and five challenging datasets show that Automix consistently surpasses strong baselines, reducing computational cost by over 50% for comparable performance.
comment: The first two authors contributed equally. Work started and partly done during Aman's internship at Google. This version adds results on additional models and datasets
♻ ☆ The Impact of Feature Representation on the Accuracy of Photonic Neural Networks
Photonic Neural Networks (PNNs) are gaining significant interest in the research community due to their potential for high parallelization, low latency, and energy efficiency. PNNs compute using light, which leads to several differences in implementation when compared to electronics, such as the need to represent input features in the photonic domain before feeding them into the network. In this encoding process, it is common to combine multiple features into a single input to reduce the number of inputs and associated devices, leading to smaller and more energy-efficient PNNs. Although this alters the network's handling of input data, its impact on PNNs remains understudied. This paper addresses this open question, investigating the effect of commonly used encoding strategies that combine features on the performance and learning capabilities of PNNs. Here, using the concept of feature importance, we develop a mathematical methodology for analyzing feature combination. Through this methodology, we demonstrate that encoding multiple features together in a single input determines their relative importance, thus limiting the network's ability to learn from the data. Given some prior knowledge of the data, however, this can also be leveraged for higher accuracy. By selecting an optimal encoding method, we achieve up to a 12.3% improvement in accuracy of PNNs trained on the Iris dataset compared to other encoding techniques, surpassing the performance of networks where features are not combined. These findings highlight the importance of carefully choosing the encoding to the accuracy and decision-making strategies of PNNs, particularly in size or power constrained applications.
♻ ☆ PruningBench: A Comprehensive Benchmark of Structural Pruning NeurIPS 2024
Structural pruning has emerged as a promising approach for producing more efficient models. Nevertheless, the community suffers from a lack of standardized benchmarks and metrics, leaving the progress in this area not fully comprehended. To fill this gap, we present the first comprehensive benchmark, termed \textit{PruningBench}, for structural pruning. PruningBench showcases the following three characteristics: 1) PruningBench employs a unified and consistent framework for evaluating the effectiveness of diverse structural pruning techniques; 2) PruningBench systematically evaluates 16 existing pruning methods, encompassing a wide array of models (e.g., CNNs and ViTs) and tasks (e.g., classification and detection); 3) PruningBench provides easily implementable interfaces to facilitate the implementation of future pruning methods, and enables the subsequent researchers to incorporate their work into our leaderboards. We provide an online pruning platform http://pruning.vipazoo.cn for customizing pruning tasks and reproducing all results in this paper. Codes will be made publicly on https://github.com/HollyLee2000/PruningBench.
comment: Submitted to NeurIPS 2024 Datasets and Benchmarks Track
♻ ☆ MKRAG: Medical Knowledge Retrieval Augmented Generation for Medical Question Answering
Large Language Models (LLMs), although powerful in general domains, often perform poorly on domain-specific tasks like medical question answering (QA). Moreover, they tend to function as "black-boxes," making it challenging to modify their behavior. To address the problem, our study delves into retrieval augmented generation (RAG), aiming to improve LLM responses without the need for fine-tuning or retraining. Specifically, we propose a comprehensive retrieval strategy to extract medical facts from an external knowledge base, and then inject them into the query prompt for LLMs. Focusing on medical QA using the MedQA-SMILE dataset, we evaluate the impact of different retrieval models and the number of facts provided to the LLM. Notably, our retrieval-augmented Vicuna-7B model exhibited an accuracy improvement from 44.46% to 48.54%. This work underscores the potential of RAG to enhance LLM performance, offering a practical approach to mitigate the challenges of black-box LLMs.
comment: Accepted by AMIA 2024 Annual Symposium
♻ ☆ LLMs and Memorization: On Quality and Specificity of Copyright Compliance
Memorization in large language models (LLMs) is a growing concern. LLMs have been shown to easily reproduce parts of their training data, including copyrighted work. This is an important problem to solve, as it may violate existing copyright laws as well as the European AI Act. In this work, we propose a systematic analysis to quantify the extent of potential copyright infringements in LLMs using European law as an example. Unlike previous work, we evaluate instruction-finetuned models in a realistic end-user scenario. Our analysis builds on a proposed threshold of 160 characters, which we borrow from the German Copyright Service Provider Act and a fuzzy text matching algorithm to identify potentially copyright-infringing textual reproductions. The specificity of countermeasures against copyright infringement is analyzed by comparing model behavior on copyrighted and public domain data. We investigate what behaviors models show instead of producing protected text (such as refusal or hallucination) and provide a first legal assessment of these behaviors. We find that there are huge differences in copyright compliance, specificity, and appropriate refusal among popular LLMs. Alpaca, GPT 4, GPT 3.5, and Luminous perform best in our comparison, with OpenGPT-X, Alpaca, and Luminous producing a particularly low absolute number of potential copyright violations. Code will be published soon.
comment: 10 pages, 3 figures
♻ ☆ Scaling laws for learning with real and surrogate data
Collecting large quantities of high-quality data can be prohibitively expensive or impractical, and a bottleneck in machine learning. One may instead augment a small set of $n$ data points from the target distribution with data from more accessible sources, e.g. data collected under different circumstances or synthesized by generative models. We refer to such data as `surrogate data.' We introduce a weighted empirical risk minimization (ERM) approach for integrating surrogate data into training. We analyze mathematically this method under several classical statistical models, and validate our findings empirically on datasets from different domains. Our main findings are: $(i)$ Integrating surrogate data can significantly reduce the test error on the original distribution. Surprisingly, this can happen even when the surrogate data is unrelated to the original ones. We trace back this behavior to the classical Stein's paradox. $(ii)$ In order to reap the benefit of surrogate data, it is crucial to use optimally weighted ERM. $(iii)$ The test error of models trained on mixtures of real and surrogate data is approximately described by a scaling law. This scaling law can be used to predict the optimal weighting scheme, and to choose the amount of surrogate data to add.
comment: Added new experiments
♻ ☆ Distributed Speculative Inference of Large Language Models
Accelerating the inference of large language models (LLMs) is an important challenge in artificial intelligence. This paper introduces distributed speculative inference (DSI), a novel distributed inference algorithm that is provably faster than speculative inference (SI) [leviathan2023fast, chen2023accelerating, miao2023specinfer] and traditional autoregressive inference (non-SI). Like other SI algorithms, DSI works on frozen LLMs, requiring no training or architectural modifications, and it preserves the target distribution. Prior studies on SI have demonstrated empirical speedups (compared to non-SI) but require a fast and accurate drafter LLM. In practice, off-the-shelf LLMs often do not have matching drafters that are sufficiently fast and accurate. We show a gap: SI gets slower than non-SI when using slower or less accurate drafters. We close this gap by proving that DSI is faster than both SI and non-SI given any drafters. By orchestrating multiple instances of the target and drafters, DSI is not only faster than SI but also supports LLMs that cannot be accelerated with SI. Our simulations show speedups of off-the-shelf LLMs in realistic settings: DSI is 1.29-1.92x faster than SI.
♻ ☆ Dynamic planning in hierarchical active inference
By dynamic planning, we refer to the ability of the human brain to infer and impose motor trajectories related to cognitive decisions. A recent paradigm, active inference, brings fundamental insights into the adaptation of biological organisms, constantly striving to minimize prediction errors to restrict themselves to life-compatible states. Over the past years, many studies have shown how human and animal behavior could be explained in terms of an active inferential process - either as discrete decision-making or continuous motor control - inspiring innovative solutions in robotics and artificial intelligence. Still, the literature lacks a comprehensive outlook on how to effectively plan actions in changing environments. Setting ourselves the goal of modeling tool use, we delve into the topic of dynamic planning in active inference, keeping in mind two crucial aspects of biological goal-directed behavior: the capacity to understand and exploit affordances for object manipulation, and to learn the hierarchical interactions between the self and the environment, including other agents. We start from a simple unit and gradually describe more advanced structures, comparing recently proposed design choices and providing basic examples for each section. This study distances itself from traditional views centered on neural networks and reinforcement learning, and points toward a yet unexplored direction in active inference: hybrid representations in hierarchical models.
♻ ☆ Improved Monte Carlo tree search (MCTS) formulation with multiple root nodes for discrete sizing optimization of truss structures
This paper proposes a new method for discrete optimum design of truss structures utilizing Monte Carlo tree search (MCTS) with update process, the best reward, accelerating technique, and terminal condition. An improved MCTS formulation with multiple root nodes is developed in this study. Update process means that once a final solution is found, it is used as the initial solution for next search tree. The best reward is used in the backpropagation step. Accelerating technique is introduced by decreasing the width of search tree and reducing maximum number of iterations. The agent is trained to minimize the total structural weight under various constraints until the terminal condition is satisfied. Then, optimal solution is the minimum value of all solutions found by search trees. These numerical examples show that the agent can find optimal solution with low computational cost, stably produces an optimal design, and is suitable for practical engineering problems.
comment: 28 pages, 20 figures, 14 tables
♻ ☆ The Intelligible and Effective Graph Neural Additive Networks
Graph Neural Networks (GNNs) have emerged as the predominant approach for learning over graph-structured data. However, most GNNs operate as black-box models and require post-hoc explanations, which may not suffice in high-stakes scenarios where transparency is crucial. In this paper, we present a GNN that is interpretable by design. Our model, Graph Neural Additive Network (GNAN), is a novel extension of the interpretable class of Generalized Additive Models, and can be visualized and fully understood by humans. GNAN is designed to be fully interpretable, allowing both global and local explanations at the feature and graph levels through direct visualization of the model. These visualizations describe the exact way the model uses the relationships between the target variable, the features, and the graph. We demonstrate the intelligibility of GNANs in a series of examples on different tasks and datasets. In addition, we show that the accuracy of GNAN is on par with black-box GNNs, making it suitable for critical applications where transparency is essential, alongside high accuracy.
♻ ☆ Generative AI-Driven Human Digital Twin in IoT-Healthcare: A Comprehensive Survey
The Internet of things (IoT) can significantly enhance the quality of human life, specifically in healthcare, attracting extensive attentions to IoT-healthcare services. Meanwhile, the human digital twin (HDT) is proposed as an innovative paradigm that can comprehensively characterize the replication of the individual human body in the digital world and reflect its physical status in real time. Naturally, HDT is envisioned to empower IoT-healthcare beyond the application of healthcare monitoring by acting as a versatile and vivid human digital testbed, simulating the outcomes and guiding the practical treatments. However, successfully establishing HDT requires high-fidelity virtual modeling and strong information interactions but possibly with scarce, biased and noisy data. Fortunately, a recent popular technology called generative artificial intelligence (GAI) may be a promising solution because it can leverage advanced AI algorithms to automatically create, manipulate, and modify valuable while diverse data. This survey particularly focuses on the implementation of GAI-driven HDT in IoT-healthcare. We start by introducing the background of IoT-healthcare and the potential of GAI-driven HDT. Then, we delve into the fundamental techniques and present the overall framework of GAI-driven HDT. After that, we explore the realization of GAI-driven HDT in detail, including GAI-enabled data acquisition, communication, data management, digital modeling, and data analysis. Besides, we discuss typical IoT-healthcare applications that can be revolutionized by GAI-driven HDT, namely personalized health monitoring and diagnosis, personalized prescription, and personalized rehabilitation. Finally, we conclude this survey by highlighting some future research directions.
♻ ☆ Solving a Real-World Package Delivery Routing Problem Using Quantum Annealers
Research focused on the conjunction between quantum computing and routing problems has been very prolific in recent years. Most of the works revolve around classical problems such as the Traveling Salesman Problem or the Vehicle Routing Problem. The real-world applicability of these problems is dependent on the objectives and constraints considered. Anyway, it is undeniable that it is often difficult to translate complex requirements into these classical formulations.The main objective of this research is to present a solving scheme for dealing with realistic instances while maintaining all the characteristics and restrictions of the original real-world problem. Thus, a quantum-classical strategy has been developed, coined Q4RPD, that considers a set of real constraints such as a heterogeneous fleet of vehicles, priority deliveries, and capacities characterized by two values: weight and dimensions of the packages. Q4RPD resorts to the Leap Constrained Quadratic Model Hybrid Solver of D-Wave. To demonstrate the application of Q4RPD, an experimentation composed of six different instances has been conducted, aiming to serve as illustrative examples.
comment: 16 pages, 11 figures and 4 tables. Paper submitted for review in Scientific Reports
♻ ☆ Networked Communication for Decentralised Agents in Mean-Field Games
We introduce networked communication to the mean-field game framework, in particular to oracle-free settings where $N$ decentralised agents learn along a single, non-episodic run of the empirical system. We prove that our architecture, with only a few reasonable assumptions about network structure, has sample guarantees bounded between those of the centralised- and independent-learning cases. We discuss how the sample guarantees of the three theoretical algorithms do not actually result in practical convergence. We therefore show that in practical settings where the theoretical parameters are not observed (leading to poor estimation of the Q-function), our communication scheme significantly accelerates convergence over the independent case (and often even the centralised case), without relying on the assumption of a centralised learner. We contribute further practical enhancements to all three theoretical algorithms, allowing us to present their first empirical demonstrations. Our experiments confirm that we can remove several of the theoretical assumptions of the algorithms, and display the empirical convergence benefits brought by our new networked communication. We additionally show that the networked approach has significant advantages, over both the centralised and independent alternatives, in terms of robustness to unexpected learning failures and to changes in population size.
♻ ☆ Spike Accumulation Forwarding for Effective Training of Spiking Neural Networks
In this article, we propose a new paradigm for training spiking neural networks (SNNs), spike accumulation forwarding (SAF). It is known that SNNs are energy-efficient but difficult to train. Consequently, many researchers have proposed various methods to solve this problem, among which online training through time (OTTT) is a method that allows inferring at each time step while suppressing the memory cost. However, to compute efficiently on GPUs, OTTT requires operations with spike trains and weighted summation of spike trains during forwarding. In addition, OTTT has shown a relationship with the Spike Representation, an alternative training method, though theoretical agreement with Spike Representation has yet to be proven. Our proposed method can solve these problems; namely, SAF can halve the number of operations during the forward process, and it can be theoretically proven that SAF is consistent with the Spike Representation and OTTT, respectively. Furthermore, we confirmed the above contents through experiments and showed that it is possible to reduce memory and training time while maintaining accuracy.
comment: 14 pages, 5 figures, Appendix:10 pages, 2 figures, v6:Published in Transactions on Machine Learning Research
♻ ☆ UINav: A Practical Approach to Train On-Device Automation Agents
Automation systems that can autonomously drive application user interfaces to complete user tasks are of great benefit, especially when users are situationally or permanently impaired. Prior automation systems do not produce generalizable models while AI-based automation agents work reliably only in simple, hand-crafted applications or incur high computation costs. We propose UINav, a demonstration-based approach to train automation agents that fit mobile devices, yet achieving high success rates with modest numbers of demonstrations. To reduce the demonstration overhead, UINav uses a referee model that provides users with immediate feedback on tasks where the agent fails, and automatically augments human demonstrations to increase diversity in training data. Our evaluation shows that with only 10 demonstrations UINav can achieve 70% accuracy, and that with enough demonstrations it can surpass 90% accuracy.
♻ ☆ Zero-Shot Reasoning: Personalized Content Generation Without the Cold Start Problem
Procedural content generation uses algorithmic techniques to create large amounts of new content for games at much lower production costs. In newer approaches, procedural content generation utilizes machine learning. However, these methods usually require expensive collection of large amounts of data, as well as the development and training of fairly complex learning models, which can be both extremely time-consuming and expensive. The core of our research is to explore whether we can lower the barrier to the use of personalized procedural content generation through a more practical and generalizable approach with large language models. Matching game content with player preferences benefits both players, who enjoy the game more, and developers, who increasingly depend on players enjoying the game before being able to monetize it. Therefore, this paper presents a novel approach to achieving personalization by using large language models to propose levels based on the gameplay data continuously collected from individual players. We compared the levels generated using our approach with levels generated with more traditional procedural generation techniques. Our easily reproducible method has proven viable in a production setting and outperformed levels generated by traditional methods in the probability that a player will not quit the game mid-level.
comment: 9 pages, 6 figures. Paper accepted to IEEE Transactions on Games
♻ ☆ TimeBench: A Comprehensive Evaluation of Temporal Reasoning Abilities in Large Language Models ACL 2024
Grasping the concept of time is a fundamental facet of human cognition, indispensable for truly comprehending the intricacies of the world. Previous studies typically focus on specific aspects of time, lacking a comprehensive temporal reasoning benchmark. To address this, we propose TimeBench, a comprehensive hierarchical temporal reasoning benchmark that covers a broad spectrum of temporal reasoning phenomena. TimeBench provides a thorough evaluation for investigating the temporal reasoning capabilities of large language models. We conduct extensive experiments on GPT-4, LLaMA2, and other popular LLMs under various settings. Our experimental results indicate a significant performance gap between the state-of-the-art LLMs and humans, highlighting that there is still a considerable distance to cover in temporal reasoning. Besides, LLMs exhibit capability discrepancies across different reasoning categories. Furthermore, we thoroughly analyze the impact of multiple aspects on temporal reasoning and emphasize the associated challenges. We aspire for TimeBench to serve as a comprehensive benchmark, fostering research in temporal reasoning. Resources are available at: https://github.com/zchuz/TimeBench
comment: Accepted to ACL 2024
♻ ☆ A Multimodal Foundation Agent for Financial Trading: Tool-Augmented, Diversified, and Generalist
Financial trading is a crucial component of the markets, informed by a multimodal information landscape encompassing news, prices, and Kline charts, and encompasses diverse tasks such as quantitative trading and high-frequency trading with various assets. While advanced AI techniques like deep learning and reinforcement learning are extensively utilized in finance, their application in financial trading tasks often faces challenges due to inadequate handling of multimodal data and limited generalizability across various tasks. To address these challenges, we present FinAgent, a multimodal foundational agent with tool augmentation for financial trading. FinAgent's market intelligence module processes a diverse range of data-numerical, textual, and visual-to accurately analyze the financial market. Its unique dual-level reflection module not only enables rapid adaptation to market dynamics but also incorporates a diversified memory retrieval system, enhancing the agent's ability to learn from historical data and improve decision-making processes. The agent's emphasis on reasoning for actions fosters trust in its financial decisions. Moreover, FinAgent integrates established trading strategies and expert insights, ensuring that its trading approaches are both data-driven and rooted in sound financial principles. With comprehensive experiments on 6 financial datasets, including stocks and Crypto, FinAgent significantly outperforms 9 state-of-the-art baselines in terms of 6 financial metrics with over 36% average improvement on profit. Specifically, a 92.27% return (a 84.39% relative improvement) is achieved on one dataset. Notably, FinAgent is the first advanced multimodal foundation agent designed for financial trading tasks.
♻ ☆ M2Lingual: Enhancing Multilingual, Multi-Turn Instruction Alignment in Large Language Models
Instruction finetuning (IFT) is critical for aligning Large Language Models (LLMs) to follow instructions. While many effective IFT datasets have been introduced recently, they predominantly focus on high-resource languages like English. To better align LLMs across a broad spectrum of languages and tasks, we propose a fully synthetic, novel taxonomy (Evol) guided Multilingual, Multi-turn instruction finetuning dataset, called M2Lingual. It is constructed by first selecting a diverse set of seed examples and then utilizing the proposed Evol taxonomy to convert these seeds into complex and challenging multi-turn instructions. We demonstrate the effectiveness of M2Lingual by training LLMs of varying sizes and showcasing the enhanced performance across a diverse set of languages. We contribute the 2 step Evol taxonomy with the guided generation code: https://github.com/ServiceNow/M2Lingual, as well as the first fully synthetic, general and task-oriented, multi-turn, multilingual dataset built with Evol - M2Lingual: https://huggingface.co/datasets/ServiceNow-AI/ M2Lingual - containing 182K total IFT pairs, covering 70 languages and 17+ NLP tasks.
comment: 39 pages
♻ ☆ BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation
In this paper, we present a new embedding model, called M3-Embedding, which is distinguished for its versatility in Multi-Linguality, Multi-Functionality, and Multi-Granularity. It can support more than 100 working languages, leading to new state-of-the-art performances on multi-lingual and cross-lingual retrieval tasks. It can simultaneously perform the three common retrieval functionalities of embedding model: dense retrieval, multi-vector retrieval, and sparse retrieval, which provides a unified model foundation for real-world IR applications. It is able to process inputs of different granularities, spanning from short sentences to long documents of up to 8192 tokens. The effective training of M3-Embedding involves the following technical contributions. We propose a novel self-knowledge distillation approach, where the relevance scores from different retrieval functionalities can be integrated as the teacher signal to enhance the training quality. We also optimize the batching strategy, enabling a large batch size and high training throughput to ensure the discriminativeness of embeddings. To the best of our knowledge, M3-Embedding is the first embedding model which realizes such a strong versatility. The model and code will be publicly available at https://github.com/FlagOpen/FlagEmbedding.
♻ ☆ Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach
Self-supervised features are the cornerstone of modern machine learning systems. They are typically pre-trained on data collections whose construction and curation typically require extensive human effort. This manual process has some limitations similar to those encountered in supervised learning, e.g., the crowd-sourced selection of data is costly and time-consuming, preventing scaling the dataset size. In this work, we consider the problem of automatic curation of high-quality datasets for self-supervised pre-training. We posit that such datasets should be large, diverse and balanced, and propose a clustering-based approach for building ones satisfying all these criteria. Our method involves successive and hierarchical applications of $k$-means on a large and diverse data repository to obtain clusters that distribute uniformly among data concepts, followed by a hierarchical, balanced sampling step from these clusters. Extensive experiments on three different data domains including web-based images, satellite images and text show that features trained on our automatically curated datasets outperform those trained on uncurated data while being on par or better than ones trained on manually curated data. Code is available at https://github.com/facebookresearch/ssl-data-curation.
♻ ☆ DLRover-RM: Resource Optimization for Deep Recommendation Models Training in the Cloud VLDB'24
Deep learning recommendation models (DLRM) rely on large embedding tables to manage categorical sparse features. Expanding such embedding tables can significantly enhance model performance, but at the cost of increased GPU/CPU/memory usage. Meanwhile, tech companies have built extensive cloud-based services to accelerate training DLRM models at scale. In this paper, we conduct a deep investigation of the DLRM training platforms at AntGroup and reveal two critical challenges: low resource utilization due to suboptimal configurations by users and the tendency to encounter abnormalities due to an unstable cloud environment. To overcome them, we introduce DLRover-RM, an elastic training framework for DLRMs designed to increase resource utilization and handle the instability of a cloud environment. DLRover-RM develops a resource-performance model by considering the unique characteristics of DLRMs and a three-stage heuristic strategy to automatically allocate and dynamically adjust resources for DLRM training jobs for higher resource utilization. Further, DLRover-RM develops multiple mechanisms to ensure efficient and reliable execution of DLRM training jobs. Our extensive evaluation shows that DLRover-RM reduces job completion times by 31%, increases the job completion rate by 6%, enhances CPU usage by 15%, and improves memory utilization by 20%, compared to state-of-the-art resource scheduling frameworks. DLRover-RM has been widely deployed at AntGroup and processes thousands of DLRM training jobs on a daily basis. DLRover-RM is open-sourced and has been adopted by 10+ companies.
comment: Accepted in VLDB'24
♻ ☆ Large Language Model Enhanced Clustering for News Event Detection
The news landscape is continuously evolving, with an ever-increasing volume of information from around the world. Automated event detection within this vast data repository is essential for monitoring, identifying, and categorizing significant news occurrences across diverse platforms. This paper presents an event detection framework that leverages Large Language Models (LLMs) combined with clustering analysis to detect news events from the Global Database of Events, Language, and Tone (GDELT). The framework enhances event clustering through both pre-event detection tasks (keyword extraction and text embedding) and post-event detection tasks (event summarization and topic labelling). We also evaluate the impact of various textual embeddings on the quality of clustering outcomes, ensuring robust news categorization. Additionally, we introduce a novel Cluster Stability Assessment Index (CSAI) to assess the validity and robustness of clustering results. CSAI utilizes multiple feature vectors to provide a new way of measuring clustering quality. Our experiments indicate that the use of LLM embedding in the event detection framework has significantly improved the results, demonstrating greater robustness in terms of CSAI scores. Moreover, post-event detection tasks generate meaningful insights, facilitating effective interpretation of event clustering results. Overall, our experimental results indicate that the proposed framework offers valuable insights and could enhance the accuracy in news analysis and reporting.
♻ ☆ SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention
Large language models (LLMs) now support extremely long context windows, but the quadratic complexity of vanilla attention results in significantly long Time-to-First-Token (TTFT) latency. Existing approaches to address this complexity require additional pretraining or finetuning, and often sacrifice model accuracy. In this paper, we first provide both theoretical and empirical foundations for near-lossless sparse attention. We find dynamically capturing head-specific sparse patterns at runtime with low overhead is crucial. To address this, we propose SampleAttention, an adaptive structured and near-lossless sparse attention. Leveraging observed significant sparse patterns, SampleAttention attends to a fixed percentage of adjacent tokens to capture local window patterns, and employs a two-stage query-guided key-value filtering approach, which adaptively select a minimum set of key-values with low overhead, to capture column stripe patterns. Comprehensive evaluations show that SampleAttention can seamlessly replace vanilla attention in off-the-shelf LLMs with nearly no accuracy loss, and reduces TTFT by up to $2.42\times$ compared with FlashAttention.
♻ ☆ ULLER: A Unified Language for Learning and Reasoning
The field of neuro-symbolic artificial intelligence (NeSy), which combines learning and reasoning, has recently experienced significant growth. There now are a wide variety of NeSy frameworks, each with its own specific language for expressing background knowledge and how to relate it to neural networks. This heterogeneity hinders accessibility for newcomers and makes comparing different NeSy frameworks challenging. We propose a language for NeSy, which we call ULLER, a Unfied Language for LEarning and Reasoning. ULLER encompasses a wide variety of settings, while ensuring that knowledge described in it can be used in existing NeSy systems. ULLER has a first-order logic syntax specialised for NeSy for which we provide example semantics including classical FOL, fuzzy logic, and probabilistic logic. We believe ULLER is a first step towards making NeSy research more accessible and comparable, paving the way for libraries that streamline training and evaluation across a multitude of semantics, knowledge bases, and NeSy systems.
comment: Accepted at NeSy 2024
♻ ☆ Position: Explain to Question not to Justify
Explainable Artificial Intelligence (XAI) is a young but very promising field of research. Unfortunately, the progress in this field is currently slowed down by divergent and incompatible goals. We separate various threads tangled within the area of XAI into two complementary cultures of human/value-oriented explanations (BLUE XAI) and model/validation-oriented explanations (RED XAI). This position paper argues that the area of RED XAI is currently under-explored, i.e., more methods for explainability are desperately needed to question models (e.g., extract knowledge from well-performing models as well as spotting and fixing bugs in faulty models), and the area of RED XAI hides great opportunities and potential for important research necessary to ensure the safety of AI systems. We conclude this paper by presenting promising challenges in this area.
♻ ☆ SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models ICML 2024
Most of the existing Large Language Model (LLM) benchmarks on scientific problem reasoning focus on problems grounded in high-school subjects and are confined to elementary algebraic operations. To systematically examine the reasoning capabilities required for solving complex scientific problems, we introduce an expansive benchmark suite SciBench for LLMs. SciBench contains a carefully curated dataset featuring a range of collegiate-level scientific problems from mathematics, chemistry, and physics domains. Based on the dataset, we conduct an in-depth benchmarking study of representative open-source and proprietary LLMs with various prompting strategies. The results reveal that the current LLMs fall short of delivering satisfactory performance, with the best overall score of merely 43.22%. Furthermore, through a detailed user study, we categorize the errors made by LLMs into ten problem-solving abilities. Our analysis indicates that no single prompting strategy significantly outperforms the others and some strategies that demonstrate improvements in certain problem-solving skills could result in declines in other skills. We envision that SciBench will catalyze further developments in the reasoning abilities of LLMs, thereby ultimately contributing to scientific research and discovery.
comment: To appear at ICML 2024
♻ ☆ Active Preference Learning for Large Language Models
As large language models (LLMs) become more capable, fine-tuning techniques for aligning with human intent are increasingly important. A key consideration for aligning these models is how to most effectively use human resources, or model resources in the case where LLMs themselves are used as oracles. Reinforcement learning from Human or AI preferences (RLHF/RLAIF) is the most prominent example of such a technique, but is complex and often unstable. Direct Preference Optimization (DPO) has recently been proposed as a simpler and more stable alternative. In this work, we develop an active learning strategy for DPO to make better use of preference labels. We propose a practical acquisition function for prompt/completion pairs based on the predictive entropy of the language model and a measure of certainty of the implicit preference model optimized by DPO. We demonstrate how our approach improves both the rate of learning and final performance of fine-tuning on pairwise preference data.
comment: 13 pages, 5 figures, 6 tables
♻ ☆ Concept-aware Data Construction Improves In-context Learning of Language Models ACL 2024
Many recent language models (LMs) are capable of in-context learning (ICL), manifested in the LMs' ability to perform a new task solely from natural-language instruction. Previous work curating in-context learners assumes that ICL emerges from a vast over-parametrization or the scale of multi-task training. However, recent theoretical work attributes the ICL ability to concept-dependent training data and creates functional in-context learners even in small-scale, synthetic settings. In this work, we practically explore this newly identified axis of ICL quality. We propose Concept-aware Training (CoAT), a framework for constructing training scenarios that make it beneficial for the LM to learn to utilize the analogical reasoning concepts from demonstrations. We find that by using CoAT, pre-trained transformers can learn to better utilise new latent concepts from demonstrations and that such ability makes ICL more robust to the functional deficiencies of the previous models. Finally, we show that concept-aware in-context learning is more effective for a majority of new tasks when compared to traditional instruction tuning, resulting in a performance comparable to the previous in-context learners using magnitudes of more training data.
comment: Long paper to appear in Findings of ACL 2024
♻ ☆ Assessment of Sentinel-2 spatial and temporal coverage based on the scene classification layer
Since the launch of the Sentinel-2 (S2) satellites, many ML models have used the data for diverse applications. The scene classification layer (SCL) inside the S2 product provides rich information for training, such as filtering images with high cloud coverage. However, there is more potential in this. We propose a technique to assess the clean optical coverage of a region, expressed by a SITS and calculated with the S2-based SCL data. With a manual threshold and specific labels in the SCL, the proposed technique assigns a percentage of spatial and temporal coverage across the time series and a high/low assessment. By evaluating the AI4EO challenge for Enhanced Agriculture, we show that the assessment is correlated to the predictive results of ML models. The classification results in a region with low spatial and temporal coverage is worse than in a region with high coverage. Finally, we applied the technique across all continents of the global dataset LandCoverNet.
comment: Accepted at IEEE International Geoscience and Remote Sensing Symposium 2024
♻ ☆ Logical Closed Loop: Uncovering Object Hallucinations in Large Vision-Language Models ACL 2024
Object hallucination has been an Achilles' heel which hinders the broader applications of large vision-language models (LVLMs). Object hallucination refers to the phenomenon that the LVLMs claim non-existent objects in the image. To mitigate the object hallucinations, instruction tuning and external model-based detection methods have been proposed, which either require large-scare computational resources or depend on the detection result of external models. However, there remains an under-explored field to utilize the LVLM itself to alleviate object hallucinations. In this work, we adopt the intuition that the LVLM tends to respond logically consistently for existent objects but inconsistently for hallucinated objects. Therefore, we propose a Logical Closed Loop-based framework for Object Hallucination Detection and Mitigation, namely LogicCheckGPT. In specific, we devise logical consistency probing to raise questions with logical correlations, inquiring about attributes from objects and vice versa. Whether their responses can form a logical closed loop serves as an indicator of object hallucination. As a plug-and-play method, it can be seamlessly applied to all existing LVLMs. Comprehensive experiments conducted on three benchmarks across four LVLMs have demonstrated significant improvements brought by our method, indicating its effectiveness and generality.
comment: Accept to ACL 2024; 19 Pages, 15 Figures, 6 Tables
♻ ☆ ANLS* -- A Universal Document Processing Metric for Generative Large Language Models
Traditionally, discriminative models have been the predominant choice for tasks like document classification and information extraction. These models make predictions that fall into a limited number of predefined classes, facilitating a binary true or false evaluation and enabling the direct calculation of metrics such as the F1 score. However, recent advancements in generative large language models (GLLMs) have prompted a shift in the field due to their enhanced zero-shot capabilities, which eliminate the need for a downstream dataset and computationally expensive fine-tuning. However, evaluating GLLMs presents a challenge as the binary true or false evaluation used for discriminative models is not applicable to the predictions made by GLLMs. This paper introduces a new metric for generative models called ANLS* for evaluating a wide variety of tasks, including information extraction and classification tasks. The ANLS* metric extends existing ANLS metrics as a drop-in-replacement and is still compatible with previously reported ANLS scores. An evaluation of 7 different datasets, and more than 10 different GLLMs together with 3 different prompting methods using the ANLS* metric is also provided, demonstrating the importance of the proposed metric. We also benchmark a novel approach to generate prompts for documents, called SFT, against other prompting techniques such as LATIN. In 6 out of 7 cases, SFT outperforms other techniques and improves the state-of-the-art, sometimes by as much as $10$ percentage points. Sources are available at https://github.com/deepopinion/anls_star_metric
♻ ☆ Apollo: A Lightweight Multilingual Medical LLM towards Democratizing Medical AI to 6B People
Despite the vast repository of global medical knowledge predominantly being in English, local languages are crucial for delivering tailored healthcare services, particularly in areas with limited medical resources. To extend the reach of medical AI advancements to a broader population, we aim to develop medical LLMs across the six most widely spoken languages, encompassing a global population of 6.1 billion. This effort culminates in the creation of the ApolloCorpora multilingual medical dataset and the XMedBench benchmark. In the multilingual medical benchmark, the released Apollo models, at various relatively-small sizes (i.e., 0.5B, 1.8B, 2B, 6B, and 7B), achieve the best performance among models of equivalent size. Especially, Apollo-7B is the state-of-the-art multilingual medical LLMs up to 70B. Additionally, these lite models could be used to improve the multi-lingual medical capabilities of larger models without fine-tuning in a proxy-tuning fashion. We will open-source training corpora, code, model weights and evaluation benchmark.
comment: Preprint
♻ ☆ Transcendence: Generative Models Can Outperform The Experts That Train Them
Generative models are trained with the simple objective of imitating the conditional probability distribution induced by the data they are trained on. Therefore, when trained on data generated by humans, we may not expect the artificial model to outperform the humans on their original objectives. In this work, we study the phenomenon of transcendence: when a generative model achieves capabilities that surpass the abilities of the experts generating its data. We demonstrate transcendence by training an autoregressive transformer to play chess from game transcripts, and show that the trained model can sometimes achieve better performance than all players in the dataset. We theoretically prove that transcendence can be enabled by low-temperature sampling, and rigorously assess this claim experimentally. Finally, we discuss other sources of transcendence, laying the groundwork for future investigation of this phenomenon in a broader setting.
comment: Code, models, and data at https://transcendence.eddie.win
♻ ☆ Behavior Generation with Latent Actions
Generative modeling of complex behaviors from labeled datasets has been a longstanding problem in decision making. Unlike language or image generation, decision making requires modeling actions - continuous-valued vectors that are multimodal in their distribution, potentially drawn from uncurated sources, where generation errors can compound in sequential prediction. A recent class of models called Behavior Transformers (BeT) addresses this by discretizing actions using k-means clustering to capture different modes. However, k-means struggles to scale for high-dimensional action spaces or long sequences, and lacks gradient information, and thus BeT suffers in modeling long-range actions. In this work, we present Vector-Quantized Behavior Transformer (VQ-BeT), a versatile model for behavior generation that handles multimodal action prediction, conditional generation, and partial observations. VQ-BeT augments BeT by tokenizing continuous actions with a hierarchical vector quantization module. Across seven environments including simulated manipulation, autonomous driving, and robotics, VQ-BeT improves on state-of-the-art models such as BeT and Diffusion Policies. Importantly, we demonstrate VQ-BeT's improved ability to capture behavior modes while accelerating inference speed 5x over Diffusion Policies. Videos and code can be found https://sjlee.cc/vq-bet
comment: Github repo: https://github.com/jayLEE0301/vq_bet_official
♻ ☆ WellDunn: On the Robustness and Explainability of Language Models and Large Language Models in Identifying Wellness Dimensions
Language Models (LMs) are being proposed for mental health applications where the heightened risk of adverse outcomes means predictive performance may not be a sufficient litmus test of a model's utility in clinical practice. A model that can be trusted for practice should have a correspondence between explanation and clinical determination, yet no prior research has examined the attention fidelity of these models and their effect on ground truth explanations. We introduce an evaluation design that focuses on the robustness and explainability of LMs in identifying Wellness Dimensions (WD). We focus on two mental health and well-being datasets: (a) Multi-label Classification-based MultiWD, and (b) WellXplain for evaluating attention mechanism veracity against expert-labeled explanations. The labels are based on Halbert Dunn's theory of wellness, which gives grounding to our evaluation. We reveal four surprising results about LMs/LLMs: (1) Despite their human-like capabilities, GPT-3.5/4 lag behind RoBERTa, and MedAlpaca, a fine-tuned LLM fails to deliver any remarkable improvements in performance or explanations. (2) Re-examining LMs' predictions based on a confidence-oriented loss function reveals a significant performance drop. (3) Across all LMs/LLMs, the alignment between attention and explanations remains low, with LLMs scoring a dismal 0.0. (4) Most mental health-specific LMs/LLMs overlook domain-specific knowledge and undervalue explanations, causing these discrepancies. This study highlights the need for further research into their consistency and explanations in mental health and well-being.
comment: 26 pages, including reference and appendix sections, 8 figures, and 16 tables
♻ ☆ Kernel vs. Kernel: Exploring How the Data Structure Affects Neural Collapse
Recently, a vast amount of literature has focused on the "Neural Collapse" (NC) phenomenon, which emerges when training neural network (NN) classifiers beyond the zero training error point. The core component of NC is the decrease in the within class variability of the network's deepest features, dubbed as NC1. The theoretical works that study NC are typically based on simplified unconstrained features models (UFMs) that mask any effect of the data on the extent of collapse. In this paper, we provide a kernel-based analysis that does not suffer from this limitation. First, given a kernel function, we establish expressions for the traces of the within- and between-class covariance matrices of the samples' features (and consequently an NC1 metric). Then, we turn to focus on kernels associated with shallow NNs. First, we consider the NN Gaussian Process kernel (NNGP), associated with the network at initialization, and the complement Neural Tangent Kernel (NTK), associated with its training in the "lazy regime". Interestingly, we show that the NTK does not represent more collapsed features than the NNGP for prototypical data models. As NC emerges from training, we then consider an alternative to NTK: the recently proposed adaptive kernel, which generalizes NNGP to model the feature mapping learned from the training data. Contrasting our NC1 analysis for these two kernels enables gaining insights into the effect of data distribution on the extent of collapse, which are empirically aligned with the behavior observed with practical training of NNs.
comment: 34 pages, 14 figures
♻ ☆ AIGB: Generative Auto-bidding via Diffusion Modeling KDD 2024
Auto-bidding plays a crucial role in facilitating online advertising by automatically providing bids for advertisers. Reinforcement learning (RL) has gained popularity for auto-bidding. However, most current RL auto-bidding methods are modeled through the Markovian Decision Process (MDP), which assumes the Markovian state transition. This assumption restricts the ability to perform in long horizon scenarios and makes the model unstable when dealing with highly random online advertising environments. To tackle this issue, this paper introduces AI-Generated Bidding (AIGB), a novel paradigm for auto-bidding through generative modeling. In this paradigm, we propose DiffBid, a conditional diffusion modeling approach for bid generation. DiffBid directly models the correlation between the return and the entire trajectory, effectively avoiding error propagation across time steps in long horizons. Additionally, DiffBid offers a versatile approach for generating trajectories that maximize given targets while adhering to specific constraints. Extensive experiments conducted on the real-world dataset and online A/B test on Alibaba advertising platform demonstrate the effectiveness of DiffBid, achieving 2.81% increase in GMV and 3.36% increase in ROI.
comment: Accepted by KDD 2024
♻ ☆ FAdam: Adam is a natural gradient optimizer using diagonal empirical Fisher information
This paper establishes a mathematical foundation for the Adam optimizer, elucidating its connection to natural gradient descent through Riemannian and information geometry. We rigorously analyze the diagonal empirical Fisher information matrix (FIM) in Adam, clarifying all detailed approximations and advocating for the use of log probability functions as loss, which should be based on discrete distributions, due to the limitations of empirical FIM. Our analysis uncovers flaws in the original Adam algorithm, leading to proposed corrections such as enhanced momentum calculations, adjusted bias corrections, adaptive epsilon, and gradient clipping. We refine the weight decay term based on our theoretical framework. Our modified algorithm, Fisher Adam (FAdam), demonstrates superior performance across diverse domains including LLM, ASR, and VQ-VAE, achieving state-of-the-art results in ASR.
comment: 21 pages, 4 figures, 6 tables
♻ ☆ MeGA: Merging Multiple Independently Trained Neural Networks Based on Genetic Algorithm
In this paper, we introduce a novel method for merging the weights of multiple pre-trained neural networks using a genetic algorithm called MeGA. Traditional techniques, such as weight averaging and ensemble methods, often fail to fully harness the capabilities of pre-trained networks. Our approach leverages a genetic algorithm with tournament selection, crossover, and mutation to optimize weight combinations, creating a more effective fusion. This technique allows the merged model to inherit advantageous features from both parent models, resulting in enhanced accuracy and robustness. Through experiments on the CIFAR-10 dataset, we demonstrate that our genetic algorithm-based weight merging method improves test accuracy compared to individual models and conventional methods. This approach provides a scalable solution for integrating multiple pre-trained networks across various deep learning applications. Github is available at: https://github.com/YUNBLAK/MeGA-Merging-Multiple-Independently-Trained-Neural-Networks-Based-on-Genetic-Algorithm
♻ ☆ Robust Model-Based Optimization for Challenging Fitness Landscapes
Protein design, a grand challenge of the day, involves optimization on a fitness landscape, and leading methods adopt a model-based approach where a model is trained on a training set (protein sequences and fitness) and proposes candidates to explore next. These methods are challenged by sparsity of high-fitness samples in the training set, a problem that has been in the literature. A less recognized but equally important problem stems from the distribution of training samples in the design space: leading methods are not designed for scenarios where the desired optimum is in a region that is not only poorly represented in training data, but also relatively far from the highly represented low-fitness regions. We show that this problem of "separation" in the design space is a significant bottleneck in existing model-based optimization tools and propose a new approach that uses a novel VAE as its search model to overcome the problem. We demonstrate its advantage over prior methods in robustly finding improved samples, regardless of the imbalance and separation between low- and high-fitness samples. Our comprehensive benchmark on real and semi-synthetic protein datasets as well as solution design for physics-informed neural networks, showcases the generality of our approach in discrete and continuous design spaces. Our implementation is available at https://github.com/sabagh1994/PGVAE.
♻ ☆ Character-Adapter: Prompt-Guided Region Control for High-Fidelity Character Customization
Customized image generation, which seeks to synthesize images with consistent characters, holds significant relevance for applications such as storytelling, portrait generation, and character design. However, previous approaches have encountered challenges in preserving characters with high-fidelity consistency due to inadequate feature extraction and concept confusion of reference characters. Therefore, we propose Character-Adapter, a plug-and-play framework designed to generate images that preserve the details of reference characters, ensuring high-fidelity consistency. Character-Adapter employs prompt-guided segmentation to ensure fine-grained regional features of reference characters and dynamic region-level adapters to mitigate concept confusion. Extensive experiments are conducted to validate the effectiveness of Character-Adapter. Both quantitative and qualitative results demonstrate that Character-Adapter achieves the state-of-the-art performance of consistent character generation, with an improvement of 24.8% compared with other methods. Our code will be released at https://github.com/Character-Adapter/Character-Adapte
♻ ☆ Context Graph
Knowledge Graphs (KGs) are foundational structures in many AI applications, representing entities and their interrelations through triples. However, triple-based KGs lack the contextual information of relational knowledge, like temporal dynamics and provenance details, which are crucial for comprehensive knowledge representation and effective reasoning. Instead, \textbf{Context Graphs} (CGs) expand upon the conventional structure by incorporating additional information such as time validity, geographic location, and source provenance. This integration provides a more nuanced and accurate understanding of knowledge, enabling KGs to offer richer insights and support more sophisticated reasoning processes. In this work, we first discuss the inherent limitations of triple-based KGs and introduce the concept of CGs, highlighting their advantages in knowledge representation and reasoning. We then present a context graph reasoning \textbf{CGR$^3$} paradigm that leverages large language models (LLMs) to retrieve candidate entities and related contexts, rank them based on the retrieved information, and reason whether sufficient information has been obtained to answer a query. Our experimental results demonstrate that CGR$^3$ significantly improves performance on KG completion (KGC) and KG question answering (KGQA) tasks, validating the effectiveness of incorporating contextual information on KG representation and reasoning.
♻ ☆ MolX: Enhancing Large Language Models for Molecular Learning with A Multi-Modal Extension
Recently, Large Language Models (LLMs) with their strong task-handling capabilities have shown remarkable advancements across a spectrum of fields, moving beyond natural language understanding. However, their proficiency within the chemistry domain remains restricted, especially in solving professional molecule-related tasks. This challenge is attributed to their inherent limitations in comprehending molecules using only common textual representations, i.e., SMILES strings. In this study, we seek to enhance the ability of LLMs to comprehend molecules by designing and equipping them with a multi-modal external module, namely MolX. In particular, instead of directly using a SMILES string to represent a molecule, we utilize specific encoders to extract fine-grained features from both SMILES string and 2D molecular graph representations for feeding into an LLM. Moreover, a human-defined molecular fingerprint is incorporated to leverage its embedded domain knowledge. Then, to establish an alignment between MolX and the LLM's textual input space, the whole model in which the LLM is frozen, is pre-trained with a versatile strategy including a diverse set of tasks. Extensive experimental evaluations demonstrate that our proposed method only introduces a small number of trainable parameters while outperforming baselines on various downstream molecule-related tasks ranging from molecule-to-text translation to retrosynthesis, with and without fine-tuning the LLM.
♻ ☆ Navigating LLM Ethics: Advancements, Challenges, and Future Directions
This study addresses ethical issues surrounding Large Language Models (LLMs) within the field of artificial intelligence. It explores the common ethical challenges posed by both LLMs and other AI systems, such as privacy and fairness, as well as ethical challenges uniquely arising from LLMs. It highlights challenges such as hallucination, verifiable accountability, and decoding censorship complexity, which are unique to LLMs and distinct from those encountered in traditional AI systems. The study underscores the need to tackle these complexities to ensure accountability, reduce biases, and enhance transparency in the influential role that LLMs play in shaping information dissemination. It proposes mitigation strategies and future directions for LLM ethics, advocating for interdisciplinary collaboration. It recommends ethical frameworks tailored to specific domains and dynamic auditing systems adapted to diverse contexts. This roadmap aims to guide responsible development and integration of LLMs, envisioning a future where ethical considerations govern AI advancements in society.
♻ ☆ The global landscape of academic guidelines for generative AI and Large Language Models
The integration of Generative Artificial Intelligence (GAI) and Large Language Models (LLMs) in academia has spurred a global discourse on their potential pedagogical benefits and ethical considerations. Positive reactions highlight some potential, such as collaborative creativity, increased access to education, and empowerment of trainers and trainees. However, negative reactions raise concerns about ethical complexities, balancing innovation and academic integrity, unequal access, and misinformation risks. Through a systematic survey and text-mining-based analysis of global and national directives, insights from independent research, and eighty university-level guidelines, this study provides a nuanced understanding of the opportunities and challenges posed by GAI and LLMs in education. It emphasizes the importance of balanced approaches that harness the benefits of these technologies while addressing ethical considerations and ensuring equitable access and educational outcomes. The paper concludes with recommendations for fostering responsible innovation and ethical practices to guide the integration of GAI and LLMs in academia.
♻ ☆ Data Augmentation using LLMs: Data Perspectives, Learning Paradigms and Challenges
In the rapidly evolving field of large language models (LLMs), data augmentation (DA) has emerged as a pivotal technique for enhancing model performance by diversifying training examples without the need for additional data collection. This survey explores the transformative impact of LLMs on DA, particularly addressing the unique challenges and opportunities they present in the context of natural language processing (NLP) and beyond. From both data and learning perspectives, we examine various strategies that utilize LLMs for data augmentation, including a novel exploration of learning paradigms where LLM-generated data is used for diverse forms of further training. Additionally, this paper highlights the primary open challenges faced in this domain, ranging from controllable data augmentation to multi-modal data augmentation. This survey highlights a paradigm shift introduced by LLMs in DA, and aims to serve as a comprehensive guide for researchers and practitioners.
♻ ☆ Exploring the Dynamics between Cobot's Production Rhythm, Locus of Control and Emotional State in a Collaborative Assembly Scenario
In industrial scenarios, there is widespread use of collaborative robots (cobots), and growing interest is directed at evaluating and measuring the impact of some characteristics of the cobot on the human factor. In the present pilot study, the effect that the production rhythm (C1 - Slow, C2 - Fast, C3 - Adapted to the participant's pace) of a cobot has on the Experiential Locus of Control (ELoC) and the emotional state of 31 participants has been examined. The operators' performance, the degree of basic internal Locus of Control, and the attitude towards the robots were also considered. No difference was found regarding the emotional state and the ELoC in the three conditions, but considering the other psychological variables, a more complex situation emerges. Overall, results seem to indicate a need to consider the person's psychological characteristics to offer a differentiated and optimal interaction experience.
comment: Accepted to 4th IEEE International Conference on Human-Machine Systems
♻ ☆ FloorSet -- a VLSI Floorplanning Dataset with Design Constraints of Real-World SoCs
Floorplanning for systems-on-a-chip (SoCs) and its sub-systems is a crucial and non-trivial step of the physical design flow. It represents a difficult combinatorial optimization problem. A typical large scale SoC with 120 partitions generates a search-space of nearly 10E250. As novel machine learning (ML) approaches emerge to tackle such problems, there is a growing need for a modern benchmark that comprises a large training dataset and performance metrics that better reflect real-world constraints and objectives compared to existing benchmarks. To address this need, we present FloorSet -- two comprehensive datasets of synthetic fixed-outline floorplan layouts that reflect the distribution of real SoCs. Each dataset has 1M training samples and 100 test samples where each sample is a synthetic floor-plan. FloorSet-Prime comprises fully-abutted rectilinear partitions and near-optimal wire-length. A simplified dataset that reflects early design phases, FloorSet-Lite comprises rectangular partitions, with under 5 percent white-space and near-optimal wire-length. Both datasets define hard constraints seen in modern design flows such as shape constraints, edge-affinity, grouping constraints, and pre-placement constraints. FloorSet is intended to spur fundamental research on large-scale constrained optimization problems. Crucially, FloorSet alleviates the core issue of reproducibility in modern ML driven solutions to such problems. FloorSet is available as an open-source repository for the research community.
comment: 10 pages, 11 figures
Information Retrieval 9
☆ Interactive Topic Models with Optimal Transport
Topic models are widely used to analyze document collections. While they are valuable for discovering latent topics in a corpus when analysts are unfamiliar with the corpus, analysts also commonly start with an understanding of the content present in a corpus. This may be through categories obtained from an initial pass over the corpus or a desire to analyze the corpus through a predefined set of categories derived from a high level theoretical framework (e.g. political ideology). In these scenarios analysts desire a topic modeling approach which incorporates their understanding of the corpus while supporting various forms of interaction with the model. In this work, we present EdTM, as an approach for label name supervised topic modeling. EdTM models topic modeling as an assignment problem while leveraging LM/LLM based document-topic affinities and using optimal transport for making globally coherent topic-assignments. In experiments, we show the efficacy of our framework compared to few-shot LLM classifiers, and topic models based on clustering and LDA. Further, we show EdTM's ability to incorporate various forms of analyst feedback and while remaining robust to noisy analyst inputs.
comment: Pre-print; Work in progress
☆ Rateless Stochastic Coding for Delay-constrained Semantic Communication
We consider the problem of joint source-channel coding with distortion and perception constraints from a rateless perspective, the purpose of which is to settle the balance between reliability (distortion/perception) and effectiveness (rate) of transmission over uncertain channels. We find a new finite-blocklength bound for the achievable joint source-channel code rate with the above two constraints. To achieve a superior rateless characteristic of JSCC coding, we perform multi-level optimization on various finite-blocklength codes. Based on these two, we then propose a new JSCC coding scheme called rateless stochastic coding (RSC). We experimentally demonstrate that the proposed RSC can achieve variable rates of transmission maintaining an excellent trade-off between distortion and perception.
☆ Learning Interpretable Legal Case Retrieval via Knowledge-Guided Case Reformulation
Legal case retrieval for sourcing similar cases is critical in upholding judicial fairness. Different from general web search, legal case retrieval involves processing lengthy, complex, and highly specialized legal documents. Existing methods in this domain often overlook the incorporation of legal expert knowledge, which is crucial for accurately understanding and modeling legal cases, leading to unsatisfactory retrieval performance. This paper introduces KELLER, a legal knowledge-guided case reformulation approach based on large language models (LLMs) for effective and interpretable legal case retrieval. By incorporating professional legal knowledge about crimes and law articles, we enable large language models to accurately reformulate the original legal case into concise sub-facts of crimes, which contain the essential information of the case. Extensive experiments on two legal case retrieval benchmarks demonstrate superior retrieval performance and robustness on complex legal case queries of KELLER over existing methods.
☆ Doc2Token: Bridging Vocabulary Gap by Predicting Missing Tokens for E-commerce Search SIGIR 2024
Addressing the "vocabulary mismatch" issue in information retrieval is a central challenge for e-commerce search engines, because product pages often miss important keywords that customers search for. Doc2Query[1] is a popular document-expansion technique that predicts search queries for a document and includes the predicted queries with the document for retrieval. However, this approach can be inefficient for e-commerce search, because the predicted query tokens are often already present in the document. In this paper, we propose Doc2Token, a technique that predicts relevant tokens (instead of queries) that are missing from the document and includes these tokens in the document for retrieval. For the task of predicting missing tokens, we introduce a new metric, "novel ROUGE score". Doc2Token is demonstrated to be superior to Doc2Query in terms of novel ROUGE score and diversity of predictions. Doc2Token also exhibits efficiency gains by reducing both training and inference times. We deployed the feature to production and observed significant revenue gain in an online A/B test, and launched the feature to full traffic on Walmart.com. [1] R. Nogueira, W. Yang, J. Lin, K. Cho, Document expansion by query prediction, arXiv preprint arXiv:1904.08375 (2019)
comment: 9 pages, 1 figure, SIGIR 2024 Workshop on eCommerce
♻ ☆ GEO: Generative Engine Optimization KDD 2024
The advent of large language models (LLMs) has ushered in a new paradigm of search engines that use generative models to gather and summarize information to answer user queries. This emerging technology, which we formalize under the unified framework of generative engines (GEs), can generate accurate and personalized responses, rapidly replacing traditional search engines like Google and Bing. Generative Engines typically satisfy queries by synthesizing information from multiple sources and summarizing them using LLMs. While this shift significantly improves $\textit{user}$ utility and $\textit{generative search engine}$ traffic, it poses a huge challenge for the third stakeholder -- website and content creators. Given the black-box and fast-moving nature of generative engines, content creators have little to no control over $\textit{when}$ and $\textit{how}$ their content is displayed. With generative engines here to stay, we must ensure the creator economy is not disadvantaged. To address this, we introduce Generative Engine Optimization (GEO), the first novel paradigm to aid content creators in improving their content visibility in generative engine responses through a flexible black-box optimization framework for optimizing and defining visibility metrics. We facilitate systematic evaluation by introducing GEO-bench, a large-scale benchmark of diverse user queries across multiple domains, along with relevant web sources to answer these queries. Through rigorous evaluation, we demonstrate that GEO can boost visibility by up to $40\%$ in generative engine responses. Moreover, we show the efficacy of these strategies varies across domains, underscoring the need for domain-specific optimization methods. Our work opens a new frontier in information discovery systems, with profound implications for both developers of generative engines and content creators.
comment: Accepted to KDD 2024
♻ ☆ JMLR: Joint Medical LLM and Retrieval Training for Enhancing Reasoning and Professional Question Answering Capability
Large Language Models (LLMs) have demonstrated a remarkable potential in medical knowledge acquisition and question-answering. However, LLMs can potentially hallucinate and yield factually incorrect outcomes, even with domain-specific pretraining. Previously, retrieval augmented generation (RAG) has limited success in addressing hallucinations. Unlike previous methods in RAG where the retrieval model was trained separately from the LLM, we introduce JMLR (for Jointly trains LLM and information Retrieval) during the fine-tuning phase. The synchronized training mechanism enhances JMLR's ability to retrieve clinical guidelines and leverage medical knowledge to reason and answer questions and reduces the demand for computational resources. We evaluated JMLR on the important medical question-answering application. Our experimental results demonstrate that JMLR-13B (70.5%) outperforms a previous state-of-the-art open-source model using conventional pre-training and fine-tuning Meditron-70B (68.9%) and Llama2-13B with RAG (67.7%) on a medical question-answering dataset. Comprehensive evaluations reveal JMLR-13B enhances reasoning quality and reduces hallucinations better than Claude3-Opus. Additionally, JMLR-13B (148 GPU hours) also trains much faster than Meditron-70B (42630 GPU hours). Through this work, we provide a new and efficient knowledge enhancement method for healthcare, demonstrating the potential of integrating retrieval and LLM training for medical question-answering systems.
♻ ☆ Transparency, Privacy, and Fairness in Recommender Systems
Recommender systems have become a pervasive part of our daily online experience, and are one of the most widely used applications of artificial intelligence and machine learning. Therefore, regulations and requirements for trustworthy artificial intelligence, for example, the European AI Act, which includes notions such as transparency, privacy, and fairness are also highly relevant for the design of recommender systems in practice. This habilitation elaborates on aspects related to these three notions in the light of recommender systems, namely: (i) transparency and cognitive models, (ii) privacy and limited preference information, and (iii) fairness and popularity bias in recommender systems. Specifically, with respect to aspect (i), we highlight the usefulness of incorporating psychological theories for a transparent design process of recommender systems. We term this type of systems psychology-informed recommender systems. In aspect (ii), we study and address the trade-off between accuracy and privacy in differentially-private recommendations. We design a novel recommendation approach for collaborative filtering based on an efficient neighborhood reuse concept, which reduces the number of users that need to be protected with differential privacy. Furthermore, we address the related issue of limited availability of user preference information, e.g., click data, in the settings of session-based and cold-start recommendations. With respect to aspect (iii), we analyze popularity bias in recommender systems. We find that the recommendation frequency of an item is positively correlated with this item's popularity. This also leads to the unfair treatment of users with little interest in popular content. Finally, we study long-term fairness dynamics in algorithmic decision support in the labor market using agent-based modeling techniques.
comment: Habilitation (post-doctoral thesis) at Graz University of Technology for the scientific subject "Applied Computer Science" (accepted in June 2024)
♻ ☆ FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts ACL 2024
Existing benchmarks for visual question answering lack in visual grounding and complexity, particularly in evaluating spatial reasoning skills. We introduce FlowVQA, a novel benchmark aimed at assessing the capabilities of visual question-answering multimodal language models in reasoning with flowcharts as visual contexts. FlowVQA comprises 2,272 carefully generated and human-verified flowchart images from three distinct content sources, along with 22,413 diverse question-answer pairs, to test a spectrum of reasoning tasks, including information localization, decision-making, and logical progression. We conduct a thorough baseline evaluation on a suite of both open-source and proprietary multimodal language models using various strategies, followed by an analysis of directional bias. The results underscore the benchmark's potential as a vital tool for advancing the field of multimodal modeling, providing a focused and challenging environment for enhancing model performance in visual and logical reasoning tasks.
comment: Accepted in ACL 2024 (Findings), 21 pages, 7 figures, 9 Tables
♻ ☆ GRILLBot In Practice: Lessons and Tradeoffs Deploying Large Language Models for Adaptable Conversational Task Assistants KDD
We tackle the challenge of building real-world multimodal assistants for complex real-world tasks. We describe the practicalities and challenges of developing and deploying GRILLBot, a leading (first and second prize winning in 2022 and 2023) system deployed in the Alexa Prize TaskBot Challenge. Building on our Open Assistant Toolkit (OAT) framework, we propose a hybrid architecture that leverages Large Language Models (LLMs) and specialised models tuned for specific subtasks requiring very low latency. OAT allows us to define when, how and which LLMs should be used in a structured and deployable manner. For knowledge-grounded question answering and live task adaptations, we show that LLM reasoning abilities over task context and world knowledge outweigh latency concerns. For dialogue state management, we implement a code generation approach and show that specialised smaller models have 84% effectiveness with 100x lower latency. Overall, we provide insights and discuss tradeoffs for deploying both traditional models and LLMs to users in complex real-world multimodal environments in the Alexa TaskBot challenge. These experiences will continue to evolve as LLMs become more capable and efficient -- fundamentally reshaping OAT and future assistant architectures.
comment: 11 pages, KDD Preprint
Multimedia 7
☆ MetaDesigner: Advancing Artistic Typography through AI-Driven, User-Centric, and Multilingual WordArt Synthesis
MetaDesigner revolutionizes artistic typography synthesis by leveraging the strengths of Large Language Models (LLMs) to drive a design paradigm centered around user engagement. At the core of this framework lies a multi-agent system comprising the Pipeline, Glyph, and Texture agents, which collectively enable the creation of customized WordArt, ranging from semantic enhancements to the imposition of complex textures. MetaDesigner incorporates a comprehensive feedback mechanism that harnesses insights from multimodal models and user evaluations to refine and enhance the design process iteratively. Through this feedback loop, the system adeptly tunes hyperparameters to align with user-defined stylistic and thematic preferences, generating WordArt that not only meets but exceeds user expectations of visual appeal and contextual relevance. Empirical validations highlight MetaDesigner's capability to effectively serve diverse WordArt applications, consistently producing aesthetically appealing and context-sensitive results.
comment: 18 pages, 16 figures, Project: https://modelscope.cn/studios/WordArt/WordArt
☆ MDF: A Dynamic Fusion Model for Multi-modal Fake News Detection
Fake news detection has received increasing attention from researchers in recent years, especially multi-modal fake news detection containing both text and images.However, many previous works have fed two modal features, text and image, into a binary classifier after a simple concatenation or attention mechanism, in which the features contain a large amount of noise inherent in the data,which in turn leads to intra- and inter-modal uncertainty.In addition, although many methods based on simply splicing two modalities have achieved more prominent results, these methods ignore the drawback of holding fixed weights across modalities, which would lead to some features with higher impact factors being ignored.To alleviate the above problems, we propose a new dynamic fusion framework dubbed MDF for fake news detection.As far as we know, it is the first attempt of dynamic fusion framework in the field of fake news detection.Specifically, our model consists of two main components:(1) UEM as an uncertainty modeling module employing a multi-head attention mechanism to model intra-modal uncertainty; and (2) DFN is a dynamic fusion module based on D-S evidence theory for dynamically fusing the weights of two modalities, text and image.In order to present better results for the dynamic fusion framework, we use GAT for inter-modal uncertainty and weight modeling before DFN.Extensive experiments on two benchmark datasets demonstrate the effectiveness and superior performance of the MDF framework.We also conducted a systematic ablation study to gain insight into our motivation and architectural design.We make our model publicly available to:https://github.com/CoisiniStar/MDF
☆ MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance
In recent years, generative artificial intelligence has achieved significant advancements in the field of image generation, spawning a variety of applications. However, video generation still faces considerable challenges in various aspects, such as controllability, video length, and richness of details, which hinder the application and popularization of this technology. In this work, we propose a controllable video generation framework, dubbed MimicMotion, which can generate high-quality videos of arbitrary length mimicking specific motion guidance. Compared with previous methods, our approach has several highlights. Firstly, we introduce confidence-aware pose guidance that ensures high frame quality and temporal smoothness. Secondly, we introduce regional loss amplification based on pose confidence, which significantly reduces image distortion. Lastly, for generating long and smooth videos, we propose a progressive latent fusion strategy. By this means, we can produce videos of arbitrary length with acceptable resource consumption. With extensive experiments and user studies, MimicMotion demonstrates significant improvements over previous approaches in various aspects. Detailed results and comparisons are available on our project page: https://tencent.github.io/MimicMotion .
☆ Network Bending of Diffusion Models for Audio-Visual Generation
In this paper we present the first steps towards the creation of a tool which enables artists to create music visualizations using pre-trained, generative, machine learning models. First, we investigate the application of network bending, the process of applying transforms within the layers of a generative network, to image generation diffusion models by utilizing a range of point-wise, tensor-wise, and morphological operators. We identify a number of visual effects that result from various operators, including some that are not easily recreated with standard image editing tools. We find that this process allows for continuous, fine-grain control of image generation which can be helpful for creative applications. Next, we generate music-reactive videos using Stable Diffusion by passing audio features as parameters to network bending operators. Finally, we comment on certain transforms which radically shift the image and the possibilities of learning more about the latent space of Stable Diffusion based on these transforms.
comment: 8 pages, 5 figures, to be published in the proceedings of the 27th International Conference on Digital Audio Effects (DAFx24), for additional image and video examples see https://dzluke.github.io/DAFX2024/
♻ ☆ Kandinsky 3.0 Technical Report
We present Kandinsky 3.0, a large-scale text-to-image generation model based on latent diffusion, continuing the series of text-to-image Kandinsky models and reflecting our progress to achieve higher quality and realism of image generation. In this report we describe the architecture of the model, the data collection procedure, the training technique, and the production system for user interaction. We focus on the key components that, as we have identified as a result of a large number of experiments, had the most significant impact on improving the quality of our model compared to the others. We also describe extensions and applications of our model, including super resolution, inpainting, image editing, image-to-video generation, and a distilled version of Kandinsky 3.0 - Kandinsky 3.1, which does inference in 4 steps of the reverse process and 20 times faster without visual quality decrease. By side-by-side human preferences comparison, Kandinsky becomes better in text understanding and works better on specific domains. The code is available at https://github.com/ai-forever/Kandinsky-3
comment: Project page: https://ai-forever.github.io/Kandinsky-3
♻ ☆ Viewport Prediction for Volumetric Video Streaming by Exploring Video Saliency and Trajectory Information
Volumetric video, also known as hologram video, is a novel medium that portrays natural content in Virtual Reality (VR), Augmented Reality (AR), and Mixed Reality (MR). It is expected to be the next-gen video technology and a prevalent use case for 5G and beyond wireless communication. Considering that each user typically only watches a section of the volumetric video, known as the viewport, it is essential to have precise viewport prediction for optimal performance. However, research on this topic is still in its infancy. In the end, this paper presents and proposes a novel approach, named Saliency and Trajectory Viewport Prediction (STVP), which aims to improve the precision of viewport prediction in volumetric video streaming. The STVP extensively utilizes video saliency information and viewport trajectory. To our knowledge, this is the first comprehensive study of viewport prediction in volumetric video streaming. In particular, we introduce a novel sampling method, Uniform Random Sampling (URS), to reduce computational complexity while still preserving video features in an efficient manner. Then we present a saliency detection technique that incorporates both spatial and temporal information for detecting static, dynamic geometric, and color salient regions. Finally, we intelligently fuse saliency and trajectory information to achieve more accurate viewport prediction. We conduct extensive simulations to evaluate the effectiveness of our proposed viewport prediction methods using state-of-the-art volumetric video sequences. The experimental results show the superiority of the proposed method over existing schemes. The dataset and source code will be publicly accessible after acceptance.
♻ ☆ MIntRec2.0: A Large-scale Benchmark Dataset for Multimodal Intent Recognition and Out-of-scope Detection in Conversations ICLR 2024
Multimodal intent recognition poses significant challenges, requiring the incorporation of non-verbal modalities from real-world contexts to enhance the comprehension of human intentions. Existing benchmark datasets are limited in scale and suffer from difficulties in handling out-of-scope samples that arise in multi-turn conversational interactions. We introduce MIntRec2.0, a large-scale benchmark dataset for multimodal intent recognition in multi-party conversations. It contains 1,245 dialogues with 15,040 samples, each annotated within a new intent taxonomy of 30 fine-grained classes. Besides 9,304 in-scope samples, it also includes 5,736 out-of-scope samples appearing in multi-turn contexts, which naturally occur in real-world scenarios. Furthermore, we provide comprehensive information on the speakers in each utterance, enriching its utility for multi-party conversational research. We establish a general framework supporting the organization of single-turn and multi-turn dialogue data, modality feature extraction, multimodal fusion, as well as in-scope classification and out-of-scope detection. Evaluation benchmarks are built using classic multimodal fusion methods, ChatGPT, and human evaluators. While existing methods incorporating nonverbal information yield improvements, effectively leveraging context information and detecting out-of-scope samples remains a substantial challenge. Notably, large language models exhibit a significant performance gap compared to humans, highlighting the limitations of machine learning methods in the cognitive intent understanding task. We believe that MIntRec2.0 will serve as a valuable resource, providing a pioneering foundation for research in human-machine conversational interactions, and significantly facilitating related applications. The full dataset and codes are available at https://github.com/thuiar/MIntRec2.0.
comment: Accepted by ICLR 2024, Long Paper; The abstract is slightly modified due to the length limitation
Computer Vision and Pattern Recognition 134
Dataset Size Recovery from LoRA Weights
Model inversion and membership inference attacks aim to reconstruct and verify the data which a model was trained on. However, they are not guaranteed to find all training samples as they do not know the size of the training set. In this paper, we introduce a new task: dataset size recovery, that aims to determine the number of samples used to train a model, directly from its weights. We then propose DSiRe, a method for recovering the number of images used to fine-tune a model, in the common case where fine-tuning uses LoRA. We discover that both the norm and the spectrum of the LoRA matrices are closely linked to the fine-tuning dataset size; we leverage this finding to propose a simple yet effective prediction algorithm. To evaluate dataset size recovery of LoRA weights, we develop and release a new benchmark, LoRA-WiSE, consisting of over 25000 weight snapshots from more than 2000 diverse LoRA fine-tuned models. Our best classifier can predict the number of fine-tuning images with a mean absolute error of 0.36 images, establishing the feasibility of this attack.
☆ HUWSOD: Holistic Self-training for Unified Weakly Supervised Object Detection
Most WSOD methods rely on traditional object proposals to generate candidate regions and are confronted with unstable training, which easily gets stuck in a poor local optimum. In this paper, we introduce a unified, high-capacity weakly supervised object detection (WSOD) network called HUWSOD, which utilizes a comprehensive self-training framework without needing external modules or additional supervision. HUWSOD innovatively incorporates a self-supervised proposal generator and an autoencoder proposal generator with a multi-rate resampling pyramid to replace traditional object proposals, enabling end-to-end WSOD training and inference. Additionally, we implement a holistic self-training scheme that refines detection scores and coordinates through step-wise entropy minimization and consistency-constraint regularization, ensuring consistent predictions across stochastic augmentations of the same image. Extensive experiments on PASCAL VOC and MS COCO demonstrate that HUWSOD competes with state-of-the-art WSOD methods, eliminating the need for offline proposals and additional data. The peak performance of HUWSOD approaches that of fully-supervised Faster R-CNN. Our findings also indicate that randomly initialized boxes, although significantly different from well-designed offline object proposals, are effective for WSOD training.
☆ Looking 3D: Anomaly Detection with 2D-3D Alignment CVPR'24
Automatic anomaly detection based on visual cues holds practical significance in various domains, such as manufacturing and product quality assessment. This paper introduces a new conditional anomaly detection problem, which involves identifying anomalies in a query image by comparing it to a reference shape. To address this challenge, we have created a large dataset, BrokenChairs-180K, consisting of around 180K images, with diverse anomalies, geometries, and textures paired with 8,143 reference 3D shapes. To tackle this task, we have proposed a novel transformer-based approach that explicitly learns the correspondence between the query image and reference 3D shape via feature alignment and leverages a customized attention mechanism for anomaly detection. Our approach has been rigorously evaluated through comprehensive experiments, serving as a benchmark for future research in this domain.
comment: Accepted at CVPR'24. Codes & dataset available at https://github.com/VICO-UoE/Looking3D
☆ ReXTime: A Benchmark Suite for Reasoning-Across-Time in Videos
We introduce ReXTime, a benchmark designed to rigorously test AI models' ability to perform temporal reasoning within video events. Specifically, ReXTime focuses on reasoning across time, i.e. human-like understanding when the question and its corresponding answer occur in different video segments. This form of reasoning, requiring advanced understanding of cause-and-effect relationships across video segments, poses significant challenges to even the frontier multimodal large language models. To facilitate this evaluation, we develop an automated pipeline for generating temporal reasoning question-answer pairs, significantly reducing the need for labor-intensive manual annotations. Our benchmark includes 921 carefully vetted validation samples and 2,143 test samples, each manually curated for accuracy and relevance. Evaluation results show that while frontier large language models outperform academic models, they still lag behind human performance by a significant 14.3% accuracy gap. Additionally, our pipeline creates a training dataset of 9,695 machine generated samples without manual effort, which empirical studies suggest can enhance the across-time reasoning via fine-tuning.
☆ Fibottention: Inceptive Visual Representation Learning with Diverse Attention Across Heads
Visual perception tasks are predominantly solved by Vision Transformer (ViT) architectures, which, despite their effectiveness, encounter a computational bottleneck due to the quadratic complexity of computing self-attention. This inefficiency is largely due to the self-attention heads capturing redundant token interactions, reflecting inherent redundancy within visual data. Many works have aimed to reduce the computational complexity of self-attention in ViTs, leading to the development of efficient and sparse transformer architectures. In this paper, viewing through the efficiency lens, we realized that introducing any sparse self-attention strategy in ViTs can keep the computational overhead low. However, these strategies are sub-optimal as they often fail to capture fine-grained visual details. This observation leads us to propose a general, efficient, sparse architecture, named Fibottention, for approximating self-attention with superlinear complexity that is built upon Fibonacci sequences. The key strategies in Fibottention include: it excludes proximate tokens to reduce redundancy, employs structured sparsity by design to decrease computational demands, and incorporates inception-like diversity across attention heads. This diversity ensures the capture of complementary information through non-overlapping token interactions, optimizing both performance and resource utilization in ViTs for visual representation learning. We embed our Fibottention mechanism into multiple state-of-the-art transformer architectures dedicated to visual tasks. Leveraging only 2-6% of the elements in the self-attention heads, Fibottention in conjunction with ViT and its variants, consistently achieves significant performance boosts compared to standard ViTs in nine datasets across three domains $\unicode{x2013}$ image classification, video understanding, and robot learning tasks.
comment: The code is publicly available at https://github.com/Charlotte-CharMLab/Fibottention
☆ SALVe: Semantic Alignment Verification for Floorplan Reconstruction from Sparse Panoramas ECCV 2022
We propose a new system for automatic 2D floorplan reconstruction that is enabled by SALVe, our novel pairwise learned alignment verifier. The inputs to our system are sparsely located 360$^\circ$ panoramas, whose semantic features (windows, doors, and openings) are inferred and used to hypothesize pairwise room adjacency or overlap. SALVe initializes a pose graph, which is subsequently optimized using GTSAM. Once the room poses are computed, room layouts are inferred using HorizonNet, and the floorplan is constructed by stitching the most confident layout boundaries. We validate our system qualitatively and quantitatively as well as through ablation studies, showing that it outperforms state-of-the-art SfM systems in completeness by over 200%, without sacrificing accuracy. Our results point to the significance of our work: poses of 81% of panoramas are localized in the first 2 connected components (CCs), and 89% in the first 3 CCs. Code and models are publicly available at https://github.com/zillow/salve.
comment: Accepted at ECCV 2022
☆ OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
Current universal segmentation methods demonstrate strong capabilities in pixel-level image and video understanding. However, they lack reasoning abilities and cannot be controlled via text instructions. In contrast, large vision-language multimodal models exhibit powerful vision-based conversation and reasoning capabilities but lack pixel-level understanding and have difficulty accepting visual prompts for flexible user interaction. This paper proposes OMG-LLaVA, a new and elegant framework combining powerful pixel-level vision understanding with reasoning abilities. It can accept various visual and text prompts for flexible user interaction. Specifically, we use a universal segmentation method as the visual encoder, integrating image information, perception priors, and visual prompts into visual tokens provided to the LLM. The LLM is responsible for understanding the user's text instructions and providing text responses and pixel-level segmentation results based on the visual information. We propose perception prior embedding to better integrate perception priors with image features. OMG-LLaVA achieves image-level, object-level, and pixel-level reasoning and understanding in a single model, matching or surpassing the performance of specialized methods on multiple benchmarks. Rather than using LLM to connect each specialist, our work aims at end-to-end training on one encoder, one decoder, and one LLM. The code and model have been released for further research.
☆ Taming Data and Transformers for Audio Generation
Generating ambient sounds and effects is a challenging problem due to data scarcity and often insufficient caption quality, making it difficult to employ large-scale generative models for the task. In this work, we tackle the problem by introducing two new models. First, we propose AutoCap, a high-quality and efficient automatic audio captioning model. We show that by leveraging metadata available with the audio modality, we can substantially improve the quality of captions. AutoCap reaches CIDEr score of 83.2, marking a 3.2% improvement from the best available captioning model at four times faster inference speed. We then use AutoCap to caption clips from existing datasets, obtaining 761,000 audio clips with high-quality captions, forming the largest available audio-text dataset. Second, we propose GenAu, a scalable transformer-based audio generation architecture that we scale up to 1.25B parameters and train with our new dataset. When compared to state-of-the-art audio generators, GenAu obtains significant improvements of 15.7% in FAD score, 22.7% in IS, and 13.5% in CLAP score, indicating significantly improved quality of generated audio compared to previous works. This shows that the quality of data is often as important as its quantity. Besides, since AutoCap is fully automatic, new audio samples can be added to the training dataset, unlocking the training of even larger generative models for audio synthesis.
comment: Project Webpage: https://snap-research.github.io/GenAU/
☆ Mamba or RWKV: Exploring High-Quality and High-Efficiency Segment Anything Model
Transformer-based segmentation methods face the challenge of efficient inference when dealing with high-resolution images. Recently, several linear attention architectures, such as Mamba and RWKV, have attracted much attention as they can process long sequences efficiently. In this work, we focus on designing an efficient segment-anything model by exploring these different architectures. Specifically, we design a mixed backbone that contains convolution and RWKV operation, which achieves the best for both accuracy and efficiency. In addition, we design an efficient decoder to utilize the multiscale tokens to obtain high-quality masks. We denote our method as RWKV-SAM, a simple, effective, fast baseline for SAM-like models. Moreover, we build a benchmark containing various high-quality segmentation datasets and jointly train one efficient yet high-quality segmentation model using this benchmark. Based on the benchmark results, our RWKV-SAM achieves outstanding performance in efficiency and segmentation quality compared to transformers and other linear attention models. For example, compared with the same-scale transformer model, RWKV-SAM achieves more than 2x speedup and can achieve better segmentation performance on various datasets. In addition, RWKV-SAM outperforms recent vision Mamba models with better classification and semantic segmentation results. Code and models will be publicly available.
comment: 16 pages; 8 figures
☆ SimTxtSeg: Weakly-Supervised Medical Image Segmentation with Simple Text Cues
Weakly-supervised medical image segmentation is a challenging task that aims to reduce the annotation cost while keep the segmentation performance. In this paper, we present a novel framework, SimTxtSeg, that leverages simple text cues to generate high-quality pseudo-labels and study the cross-modal fusion in training segmentation models, simultaneously. Our contribution consists of two key components: an effective Textual-to-Visual Cue Converter that produces visual prompts from text prompts on medical images, and a text-guided segmentation model with Text-Vision Hybrid Attention that fuses text and image features. We evaluate our framework on two medical image segmentation tasks: colonic polyp segmentation and MRI brain tumor segmentation, and achieve consistent state-of-the-art performance.
☆ STAL3D: Unsupervised Domain Adaptation for 3D Object Detection via Collaborating Self-Training and Adversarial Learning
Existing 3D object detection suffers from expensive annotation costs and poor transferability to unknown data due to the domain gap, Unsupervised Domain Adaptation (UDA) aims to generalize detection models trained in labeled source domains to perform robustly on unexplored target domains, providing a promising solution for cross-domain 3D object detection. Although Self-Training (ST) based cross-domain 3D detection methods with the assistance of pseudo-labeling techniques have achieved remarkable progress, they still face the issue of low-quality pseudo-labels when there are significant domain disparities due to the absence of a process for feature distribution alignment. While Adversarial Learning (AL) based methods can effectively align the feature distributions of the source and target domains, the inability to obtain labels in the target domain forces the adoption of asymmetric optimization losses, resulting in a challenging issue of source domain bias. To overcome these limitations, we propose a novel unsupervised domain adaptation framework for 3D object detection via collaborating ST and AL, dubbed as STAL3D, unleashing the complementary advantages of pseudo labels and feature distribution alignment. Additionally, a Background Suppression Adversarial Learning (BS-AL) module and a Scale Filtering Module (SFM) are designed tailored for 3D cross-domain scenes, effectively alleviating the issues of the large proportion of background interference and source domain size bias. Our STAL3D achieves state-of-the-art performance on multiple cross-domain tasks and even surpasses the Oracle results on Waymo $\rightarrow$ KITTI and Waymo $\rightarrow$ KITTI-rain.
comment: Accepted by IEEE-TIV
☆ CORE4D: A 4D Human-Object-Human Interaction Dataset for Collaborative Object REarrangement
Understanding how humans cooperatively rearrange household objects is critical for VR/AR and human-robot interaction. However, in-depth studies on modeling these behaviors are under-researched due to the lack of relevant datasets. We fill this gap by presenting CORE4D, a novel large-scale 4D human-object-human interaction dataset focusing on collaborative object rearrangement, which encompasses diverse compositions of various object geometries, collaboration modes, and 3D scenes. With 1K human-object-human motion sequences captured in the real world, we enrich CORE4D by contributing an iterative collaboration retargeting strategy to augment motions to a variety of novel objects. Leveraging this approach, CORE4D comprises a total of 11K collaboration sequences spanning 3K real and virtual object shapes. Benefiting from extensive motion patterns provided by CORE4D, we benchmark two tasks aiming at generating human-object interaction: human-object motion forecasting and interaction synthesis. Extensive experiments demonstrate the effectiveness of our collaboration retargeting strategy and indicate that CORE4D has posed new challenges to existing human-object interaction generation methodologies. Our dataset and code are available at https://github.com/leolyliu/CORE4D-Instructions.
☆ Learning Visual Conditioning Tokens to Correct Domain Shift for Fully Test-time Adaptation
Fully test-time adaptation aims to adapt the network model based on sequential analysis of input samples during the inference stage to address the cross-domain performance degradation problem of deep neural networks. This work is based on the following interesting finding: in transformer-based image classification, the class token at the first transformer encoder layer can be learned to capture the domain-specific characteristics of target samples during test-time adaptation. This learned token, when combined with input image patch embeddings, is able to gradually remove the domain-specific information from the feature representations of input samples during the transformer encoding process, thereby significantly improving the test-time adaptation performance of the source model across different domains. We refer to this class token as visual conditioning token (VCT). To successfully learn the VCT, we propose a bi-level learning approach to capture the long-term variations of domain-specific characteristics while accommodating local variations of instance-specific characteristics. Experimental results on the benchmark datasets demonstrate that our proposed bi-level visual conditioning token learning method is able to achieve significantly improved test-time adaptation performance by up to 1.9%.
comment: accepted by TMM
☆ LiverUSRecon: Automatic 3D Reconstruction and Volumetry of the Liver with a Few Partial Ultrasound Scans MICCAI 2024
3D reconstruction of the liver for volumetry is important for qualitative analysis and disease diagnosis. Liver volumetry using ultrasound (US) scans, although advantageous due to less acquisition time and safety, is challenging due to the inherent noisiness in US scans, blurry boundaries, and partial liver visibility. We address these challenges by using the segmentation masks of a few incomplete sagittal-plane US scans of the liver in conjunction with a statistical shape model (SSM) built using a set of CT scans of the liver. We compute the shape parameters needed to warp this canonical SSM to fit the US scans through a parametric regression network. The resulting 3D liver reconstruction is accurate and leads to automatic liver volume calculation. We evaluate the accuracy of the estimated liver volumes with respect to CT segmentation volumes using RMSE. Our volume computation is statistically much closer to the volume estimated using CT scans than the volume computed using Childs' method by radiologists: p-value of 0.094 (>0.05) says that there is no significant difference between CT segmentation volumes and ours in contrast to Childs' method. We validate our method using investigations (ablation studies) on the US image resolution, the number of CT scans used for SSM, the number of principal components, and the number of input US scans. To the best of our knowledge, this is the first automatic liver volumetry system using a few incomplete US scans given a set of CT scans of livers for SSM.
comment: 10 pages, Accepted to MICCAI 2024
☆ Efficient World Models with Context-Aware Tokenization ICML 2024
Scaling up deep Reinforcement Learning (RL) methods presents a significant challenge. Following developments in generative modelling, model-based RL positions itself as a strong contender. Recent advances in sequence modelling have led to effective transformer-based world models, albeit at the price of heavy computations due to the long sequences of tokens required to accurately simulate environments. In this work, we propose $\Delta$-IRIS, a new agent with a world model architecture composed of a discrete autoencoder that encodes stochastic deltas between time steps and an autoregressive transformer that predicts future deltas by summarizing the current state of the world with continuous tokens. In the Crafter benchmark, $\Delta$-IRIS sets a new state of the art at multiple frame budgets, while being an order of magnitude faster to train than previous attention-based approaches. We release our code and models at https://github.com/vmicheli/delta-iris.
comment: ICML 2024
☆ Enhanced Data Transfer Cooperating with Artificial Triplets for Scene Graph Generation
This work focuses on training dataset enhancement of informative relational triplets for Scene Graph Generation (SGG). Due to the lack of effective supervision, the current SGG model predictions perform poorly for informative relational triplets with inadequate training samples. Therefore, we propose two novel training dataset enhancement modules: Feature Space Triplet Augmentation (FSTA) and Soft Transfer. FSTA leverages a feature generator trained to generate representations of an object in relational triplets. The biased prediction based sampling in FSTA efficiently augments artificial triplets focusing on the challenging ones. In addition, we introduce Soft Transfer, which assigns soft predicate labels to general relational triplets to make more supervisions for informative predicate classes effectively. Experimental results show that integrating FSTA and Soft Transfer achieve high levels of both Recall and mean Recall in Visual Genome dataset. The mean of Recall and mean Recall is the highest among all the existing model-agnostic methods.
comment: Accepted to IEICE Transactions on Information and Systems in April 2024
☆ Mapping Land Naturalness from Sentinel-2 using Deep Contextual and Geographical Priors ICLR 2024
In recent decades, the causes and consequences of climate change have accelerated, affecting our planet on an unprecedented scale. This change is closely tied to the ways in which humans alter their surroundings. As our actions continue to impact natural areas, using satellite images to observe and measure these effects has become crucial for understanding and combating climate change. Aiming to map land naturalness on the continuum of modern human pressure, we have developed a multi-modal supervised deep learning framework that addresses the unique challenges of satellite data and the task at hand. We incorporate contextual and geographical priors, represented by corresponding coordinate information and broader contextual information, including and surrounding the immediate patch to be predicted. Our framework improves the model's predictive performance in mapping land naturalness from Sentinel-2 data, a type of multi-spectral optical satellite imagery. Recognizing that our protective measures are only as effective as our understanding of the ecosystem, quantifying naturalness serves as a crucial step toward enhancing our environmental stewardship.
comment: 6 pages, 3 figures, ICLR 2024 Tackling Climate Change with Machine Learning Workshop
☆ PNeRV: A Polynomial Neural Representation for Videos
Extracting Implicit Neural Representations (INRs) on video data poses unique challenges due to the additional temporal dimension. In the context of videos, INRs have predominantly relied on a frame-only parameterization, which sacrifices the spatiotemporal continuity observed in pixel-level (spatial) representations. To mitigate this, we introduce Polynomial Neural Representation for Videos (PNeRV), a parameter-wise efficient, patch-wise INR for videos that preserves spatiotemporal continuity. PNeRV leverages the modeling capabilities of Polynomial Neural Networks to perform the modulation of a continuous spatial (patch) signal with a continuous time (frame) signal. We further propose a custom Hierarchical Patch-wise Spatial Sampling Scheme that ensures spatial continuity while retaining parameter efficiency. We also employ a carefully designed Positional Embedding methodology to further enhance PNeRV's performance. Our extensive experimentation demonstrates that PNeRV outperforms the baselines in conventional Implicit Neural Representation tasks like compression along with downstream applications that require spatiotemporal continuity in the underlying representation. PNeRV not only addresses the challenges posed by video data in the realm of INRs but also opens new avenues for advanced video processing and analysis.
comment: 25 pages, 17 figures, published at TMLR, Feb 2024
☆ Compositional Image Decomposition with Diffusion Models ICML 2024
Given an image of a natural scene, we are able to quickly decompose it into a set of components such as objects, lighting, shadows, and foreground. We can then envision a scene where we combine certain components with those from other images, for instance a set of objects from our bedroom and animals from a zoo under the lighting conditions of a forest, even if we have never encountered such a scene before. In this paper, we present a method to decompose an image into such compositional components. Our approach, Decomp Diffusion, is an unsupervised method which, when given a single image, infers a set of different components in the image, each represented by a diffusion model. We demonstrate how components can capture different factors of the scene, ranging from global scene descriptors like shadows or facial expression to local scene descriptors like constituent objects. We further illustrate how inferred factors can be flexibly composed, even with factors inferred from other models, to generate a variety of scenes sharply different than those seen in training time. Website and code at https://energy-based-model.github.io/decomp-diffusion.
comment: ICML 2024, Webpage: https://energy-based-model.github.io/decomp-diffusion
☆ Enhancing Continual Learning in Visual Question Answering with Modality-Aware Feature Distillation
Continual learning focuses on incrementally training a model on a sequence of tasks with the aim of learning new tasks while minimizing performance drop on previous tasks. Existing approaches at the intersection of Continual Learning and Visual Question Answering (VQA) do not study how the multimodal nature of the input affects the learning dynamics of a model. In this paper, we demonstrate that each modality evolves at different rates across a continuum of tasks and that this behavior occurs in established encoder-only models as well as modern recipes for developing Vision & Language (VL) models. Motivated by this observation, we propose a modality-aware feature distillation (MAFED) approach which outperforms existing baselines across models of varying scale in three multimodal continual learning settings. Furthermore, we provide ablations showcasing that modality-aware distillation complements experience replay. Overall, our results emphasize the importance of addressing modality-specific dynamics to prevent forgetting in multimodal continual learning.
☆ Human Modelling and Pose Estimation Overview
Human modelling and pose estimation stands at the crossroads of Computer Vision, Computer Graphics, and Machine Learning. This paper presents a thorough investigation of this interdisciplinary field, examining various algorithms, methodologies, and practical applications. It explores the diverse range of sensor technologies relevant to this domain and delves into a wide array of application areas. Additionally, we discuss the challenges and advancements in 2D and 3D human modelling methodologies, along with popular datasets, metrics, and future research directions. The main contribution of this paper lies in its up-to-date comparison of state-of-the-art (SOTA) human pose estimation algorithms in both 2D and 3D domains. By providing this comprehensive overview, the paper aims to enhance understanding of 3D human modelling and pose estimation, offering insights into current SOTA achievements, challenges, and future prospects within the field.
☆ HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale
The rapid development of multimodal large language models (MLLMs), such as GPT-4V, has led to significant advancements. However, these models still face challenges in medical multimodal capabilities due to limitations in the quantity and quality of medical vision-text data, stemming from data privacy concerns and high annotation costs. While pioneering approaches utilize PubMed's large-scale, de-identified medical image-text pairs to address these limitations, they still fall short due to inherent data noise. To tackle this, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an 'unblinded' capacity to denoise and reformat the data, resulting in the creation of the PubMedVision dataset with 1.3 million medical VQA samples. Our validation demonstrates that: (1) PubMedVision can significantly enhance the medical multimodal capabilities of current MLLMs, showing significant improvement in benchmarks including the MMMU Health & Medicine track; (2) manual checks by medical experts and empirical results validate the superior data quality of our dataset compared to other data construction methods. Using PubMedVision, we train a 34B medical MLLM HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios among open-source MLLMs.
☆ Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding
Graphical User Interfaces (GUIs) are central to our interaction with digital devices. Recently, growing efforts have been made to build models for various GUI understanding tasks. However, these efforts largely overlook an important GUI-referring task: screen reading based on user-indicated points, which we name the Screen Point-and-Read (SPR) task. This task is predominantly handled by rigid accessible screen reading tools, in great need of new models driven by advancements in Multimodal Large Language Models (MLLMs). In this paper, we propose a Tree-of-Lens (ToL) agent, utilizing a novel ToL grounding mechanism, to address the SPR task. Based on the input point coordinate and the corresponding GUI screenshot, our ToL agent constructs a Hierarchical Layout Tree. Based on the tree, our ToL agent not only comprehends the content of the indicated area but also articulates the layout and spatial relationships between elements. Such layout information is crucial for accurately interpreting information on the screen, distinguishing our ToL agent from other screen reading tools. We also thoroughly evaluate the ToL agent against other baselines on a newly proposed SPR benchmark, which includes GUIs from mobile, web, and operating systems. Last but not least, we test the ToL agent on mobile GUI navigation tasks, demonstrating its utility in identifying incorrect actions along the path of agent execution trajectories. Code and data: screen-point-and-read.github.io
☆ Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment
While pre-training large-scale video-language models (VLMs) has shown remarkable potential for various downstream video-language tasks, existing VLMs can still suffer from certain commonly seen limitations, e.g., coarse-grained cross-modal aligning , under-modeling of temporal dynamics, detached video-language view. In this work, we target enhancing VLMs with a fine-grained structural spatio-temporal alignment learning method (namely Finsta). First of all, we represent the input texts and videos with fine-grained scene graph (SG) structures, both of which are further unified into a holistic SG (HSG) for bridging two modalities. Then, an SG-based framework is built, where the textual SG (TSG) is encoded with a graph Transformer, while the video dynamic SG (DSG) and the HSG are modeled with a novel recurrent graph Transformer for spatial and temporal feature propagation. A spatial-temporal Gaussian differential graph Transformer is further devised to strengthen the sense of the changes in objects across spatial and temporal dimensions. Next, based on the fine-grained structural features of TSG and DSG, we perform object-centered spatial alignment and predicate-centered temporal alignment respectively, enhancing the video-language grounding in both the spatiality and temporality. We design our method as a plug&play system, which can be integrated into existing well-trained VLMs for further representation augmentation, without training from scratch or relying on SG annotations in downstream applications. On 6 representative VL modeling tasks over 12 datasets in both standard and long-form video scenarios, Finsta consistently improves the existing 13 strong-performing VLMs persistently, and refreshes the current state-of-the-art end task performance significantly in both the fine-tuning and zero-shot settings.
comment: Accepted by IEEE TPAMI 2024
☆ Local Manifold Learning for No-Reference Image Quality Assessment
Contrastive learning has considerably advanced the field of Image Quality Assessment (IQA), emerging as a widely adopted technique. The core mechanism of contrastive learning involves minimizing the distance between quality-similar (positive) examples while maximizing the distance between quality-dissimilar (negative) examples. Despite its successes, current contrastive learning methods often neglect the importance of preserving the local manifold structure. This oversight can result in a high degree of similarity among hard examples within the feature space, thereby impeding effective differentiation and assessment. To address this issue, we propose an innovative framework that integrates local manifold learning with contrastive learning for No-Reference Image Quality Assessment (NR-IQA). Our method begins by sampling multiple crops from a given image, identifying the most visually salient crop. This crop is then used to cluster other crops from the same image as the positive class, while crops from different images are treated as negative classes to increase inter-class distance. Uniquely, our approach also considers non-saliency crops from the same image as intra-class negative classes to preserve their distinctiveness. Additionally, we employ a mutual learning framework, which further enhances the model's ability to adaptively learn and identify visual saliency regions. Our approach demonstrates a better performance compared to state-of-the-art methods in 7 standard datasets, achieving PLCC values of 0.942 (compared to 0.908 in TID2013) and 0.914 (compared to 0.894 in LIVEC).
☆ ALMA: a mathematics-driven approach for determining tuning parameters in generalized LASSO problems, with applications to MRI
Magnetic Resonance Imaging (MRI) is a powerful technique employed for non-invasive in vivo visualization of internal structures. Sparsity is often deployed to accelerate the signal acquisition or overcome the presence of motion artifacts, improving the quality of image reconstruction. Image reconstruction algorithms use TV-regularized LASSO (Total Variation-regularized LASSO) to retrieve the missing information of undersampled signals, by cleaning the data of noise and while optimizing sparsity. A tuning parameter moderates the balance between these two aspects; its choice affecting the quality of the reconstructions. Currently, there is a lack of general deterministic techniques to choose these parameters, which are oftentimes manually selected and thus hinder the reliability of the reconstructions. Here, we present ALMA (Algorithm for Lagrange Multipliers Approximation), an iterative mathematics-inspired technique that computes tuning parameters for generalized LASSO problems during MRI reconstruction. We analyze quantitatively the performance of these parameters for imaging reconstructions via TV-LASSO in an MRI context on phantoms. Although our study concentrates on TV-LASSO, the techniques developed here hold significant promise for a wide array of applications. ALMA is not only adaptable to more generalized LASSO problems but is also robust to accommodate other forms of regularization beyond total variation. Moreover, it extends effectively to handle non-Cartesian sampling trajectories, broadening its utility in complex data reconstruction scenarios. More generally, ALMA provides a powerful tool for numerically solving constrained optimization problems across various disciplines, offering a versatile and impactful solution for advanced computational challenges.
☆ FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts
Existing benchmarks for visual question answering lack in visual grounding and complexity, particularly in evaluating spatial reasoning skills. We introduce FlowVQA, a novel benchmark aimed at assessing the capabilities of visual question-answering multimodal language models in reasoning with flowcharts as visual contexts. FlowVQA comprises 2,272 carefully generated and human-verified flowchart images from three distinct content sources, along with 22,413 diverse question-answer pairs, to test a spectrum of reasoning tasks, including information localization, decision-making, and logical progression. We conduct a thorough baseline evaluation on a suite of both open-source and proprietary multimodal language models using various strategies, followed by an analysis of directional bias. The results underscore the benchmark's potential as a vital tool for advancing the field of multimodal modeling, providing a focused and challenging environment for enhancing model performance in visual and logical reasoning tasks.
☆ Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions
Vision-and-Language Navigation (VLN) aims to develop embodied agents that navigate based on human instructions. However, current VLN frameworks often rely on static environments and optimal expert supervision, limiting their real-world applicability. To address this, we introduce Human-Aware Vision-and-Language Navigation (HA-VLN), extending traditional VLN by incorporating dynamic human activities and relaxing key assumptions. We propose the Human-Aware 3D (HA3D) simulator, which combines dynamic human activities with the Matterport3D dataset, and the Human-Aware Room-to-Room (HA-R2R) dataset, extending R2R with human activity descriptions. To tackle HA-VLN challenges, we present the Expert-Supervised Cross-Modal (VLN-CM) and Non-Expert-Supervised Decision Transformer (VLN-DT) agents, utilizing cross-modal fusion and diverse training strategies for effective navigation in dynamic human environments. A comprehensive evaluation, including metrics considering human activities, and systematic analysis of HA-VLN's unique challenges, underscores the need for further research to enhance HA-VLN agents' real-world robustness and adaptability. Ultimately, this work provides benchmarks and insights for future research on embodied AI and Sim2Real transfer, paving the way for more realistic and applicable VLN systems in human-populated environments.
comment: 30 pages, 18 figures, Project Page: https://lpercc.github.io/HA3D_simulator/
☆ ProtoGMM: Multi-prototype Gaussian-Mixture-based Domain Adaptation Model for Semantic Segmentation
Domain adaptive semantic segmentation aims to generate accurate and dense predictions for an unlabeled target domain by leveraging a supervised model trained on a labeled source domain. The prevalent self-training approach involves retraining the dense discriminative classifier of $p(class|pixel feature)$ using the pseudo-labels from the target domain. While many methods focus on mitigating the issue of noisy pseudo-labels, they often overlook the underlying data distribution p(pixel feature|class) in both the source and target domains. To address this limitation, we propose the multi-prototype Gaussian-Mixture-based (ProtoGMM) model, which incorporates the GMM into contrastive losses to perform guided contrastive learning. Contrastive losses are commonly executed in the literature using memory banks, which can lead to class biases due to underrepresented classes. Furthermore, memory banks often have fixed capacities, potentially restricting the model's ability to capture diverse representations of the target/source domains. An alternative approach is to use global class prototypes (i.e. averaged features per category). However, the global prototypes are based on the unimodal distribution assumption per class, disregarding within-class variation. To address these challenges, we propose the ProtoGMM model. This novel approach involves estimating the underlying multi-prototype source distribution by utilizing the GMM on the feature space of the source samples. The components of the GMM model act as representative prototypes. To achieve increased intra-class semantic similarity, decreased inter-class similarity, and domain alignment between the source and target domains, we employ multi-prototype contrastive learning between source distribution and target samples. The experiments show the effectiveness of our method on UDA benchmarks.
☆ Think Step by Step: Chain-of-Gesture Prompting for Error Detection in Robotic Surgical Videos
Despite significant advancements in robotic systems and surgical data science, ensuring safe and optimal execution in robot-assisted minimally invasive surgery (RMIS) remains a complex challenge. Current surgical error detection methods involve two parts: identifying surgical gestures and then detecting errors within each gesture clip. These methods seldom consider the rich contextual and semantic information inherent in surgical videos, limiting their performance due to reliance on accurate gesture identification. Motivated by the chain-of-thought prompting in natural language processing, this letter presents a novel and real-time end-to-end error detection framework, Chain-of-Thought (COG) prompting, leveraging contextual information from surgical videos. This encompasses two reasoning modules designed to mimic the decision-making processes of expert surgeons. Concretely, we first design a Gestural-Visual Reasoning module, which utilizes transformer and attention architectures for gesture prompting, while the second, a Multi-Scale Temporal Reasoning module, employs a multi-stage temporal convolutional network with both slow and fast paths for temporal information extraction. We extensively validate our method on the public benchmark RMIS dataset JIGSAWS. Our method encapsulates the reasoning processes inherent to surgical activities enabling it to outperform the state-of-the-art by 4.6% in F1 score, 4.6% in Accuracy, and 5.9% in Jaccard index while processing each frame in 6.69 milliseconds on average, demonstrating the great potential of our approach in enhancing the safety and efficacy of RMIS procedures and surgical education. The code will be available.
comment: 8 pages, 4 figures
☆ Towards Reducing Data Acquisition and Labeling for Defect Detection using Simulated Data
In many manufacturing settings, annotating data for machine learning and computer vision is costly, but synthetic data can be generated at significantly lower cost. Substituting the real-world data with synthetic data is therefore appealing for many machine learning applications that require large amounts of training data. However, relying solely on synthetic data is frequently inadequate for effectively training models that perform well on real-world data, primarily due to domain shifts between the synthetic and real-world data. We discuss approaches for dealing with such a domain shift when detecting defects in X-ray scans of aluminium wheels. Using both simulated and real-world X-ray images, we train an object detection model with different strategies to identify the training approach that generates the best detection results while minimising the demand for annotated real-world training samples. Our preliminary findings suggest that the sim-2-real domain adaptation approach is more cost-efficient than a fully supervised oracle - if the total number of available annotated samples is fixed. Given a certain number of labeled real-world samples, training on a mix of synthetic and unlabeled real-world data achieved comparable or even better detection results at significantly lower cost. We argue that future research into the cost-efficiency of different training strategies is important for a better understanding of how to allocate budget in applied machine learning projects.
☆ Single Image Estimation of Cell Migration Direction by Deep Circular Regression
In this paper we study the problem of estimating the migration direction of cells based on a single image. To the best of our knowledge, there is only one related work that uses a classification CNN for four classes (quadrants). This approach does not allow detailed directional resolution. We solve the single image estimation problem using deep circular regression with special attention to cycle-sensitive methods. On two databases we achieve an average accuracy of $\sim$17 degrees, which is a significant improvement over the previous work.
☆ RAVEN: Multitask Retrieval Augmented Vision-Language Learning
The scaling of large language models to encode all the world's knowledge in model parameters is unsustainable and has exacerbated resource barriers. Retrieval-Augmented Generation (RAG) presents a potential solution, yet its application to vision-language models (VLMs) is under explored. Existing methods focus on models designed for single tasks. Furthermore, they're limited by the need for resource intensive pre training, additional parameter requirements, unaddressed modality prioritization and lack of clear benefit over non-retrieval baselines. This paper introduces RAVEN, a multitask retrieval augmented VLM framework that enhances base VLMs through efficient, task specific fine-tuning. By integrating retrieval augmented samples without the need for additional retrieval-specific parameters, we show that the model acquires retrieval properties that are effective across multiple tasks. Our results and extensive ablations across retrieved modalities for the image captioning and VQA tasks indicate significant performance improvements compared to non retrieved baselines +1 CIDEr on MSCOCO, +4 CIDEr on NoCaps and nearly a +3\% accuracy on specific VQA question types. This underscores the efficacy of applying RAG approaches to VLMs, marking a stride toward more efficient and accessible multimodal learning.
☆ BackMix: Mitigating Shortcut Learning in Echocardiography with Minimal Supervision MICCAI 2024
Neural networks can learn spurious correlations that lead to the correct prediction in a validation set, but generalise poorly because the predictions are right for the wrong reason. This undesired learning of naive shortcuts (Clever Hans effect) can happen for example in echocardiogram view classification when background cues (e.g. metadata) are biased towards a class and the model learns to focus on those background features instead of on the image content. We propose a simple, yet effective random background augmentation method called BackMix, which samples random backgrounds from other examples in the training set. By enforcing the background to be uncorrelated with the outcome, the model learns to focus on the data within the ultrasound sector and becomes invariant to the regions outside this. We extend our method in a semi-supervised setting, finding that the positive effects of BackMix are maintained with as few as 5% of segmentation labels. A loss weighting mechanism, wBackMix, is also proposed to increase the contribution of the augmented examples. We validate our method on both in-distribution and out-of-distribution datasets, demonstrating significant improvements in classification accuracy, region focus and generalisability. Our source code is available at: https://github.com/kitbransby/BackMix
comment: Accepted at MICCAI 2024 (Pre-print)
☆ CELLO: Causal Evaluation of Large Vision-Language Models
Causal reasoning is fundamental to human intelligence and crucial for effective decision-making in real-world environments. Despite recent advancements in large vision-language models (LVLMs), their ability to comprehend causality remains unclear. Previous work typically focuses on commonsense causality between events and/or actions, which is insufficient for applications like embodied agents and lacks the explicitly defined causal graphs required for formal causal reasoning. To overcome these limitations, we introduce a fine-grained and unified definition of causality involving interactions between humans and/or objects. Building on the definition, we construct a novel dataset, CELLO, consisting of 14,094 causal questions across all four levels of causality: discovery, association, intervention, and counterfactual. This dataset surpasses traditional commonsense causality by including explicit causal graphs that detail the interactions between humans and objects. Extensive experiments on CELLO reveal that current LVLMs still struggle with causal reasoning tasks, but they can benefit significantly from our proposed CELLO-CoT, a causally inspired chain-of-thought prompting strategy. Both quantitative and qualitative analyses from this study provide valuable insights for future research. Our project page is at https://github.com/OpenCausaLab/CELLO.
☆ Evidential Concept Embedding Models: Towards Reliable Concept Explanations for Skin Disease Diagnosis MICCAI 2024
Due to the high stakes in medical decision-making, there is a compelling demand for interpretable deep learning methods in medical image analysis. Concept Bottleneck Models (CBM) have emerged as an active interpretable framework incorporating human-interpretable concepts into decision-making. However, their concept predictions may lack reliability when applied to clinical diagnosis, impeding concept explanations' quality. To address this, we propose an evidential Concept Embedding Model (evi-CEM), which employs evidential learning to model the concept uncertainty. Additionally, we offer to leverage the concept uncertainty to rectify concept misalignments that arise when training CBMs using vision-language models without complete concept supervision. With the proposed methods, we can enhance concept explanations' reliability for both supervised and label-efficient settings. Furthermore, we introduce concept uncertainty for effective test-time intervention. Our evaluation demonstrates that evi-CEM achieves superior performance in terms of concept prediction, and the proposed concept rectification effectively mitigates concept misalignments for label-efficient training. Our code is available at https://github.com/obiyoag/evi-CEM.
comment: accepted by MICCAI 2024
☆ FDLite: A Single Stage Lightweight Face Detector Network
Face detection is frequently attempted by using heavy pre-trained backbone networks like ResNet-50/101/152 and VGG16/19. Few recent works have also proposed lightweight detectors with customized backbones, novel loss functions and efficient training strategies. The novelty of this work lies in the design of a lightweight detector while training with only the commonly used loss functions and learning strategies. The proposed face detector grossly follows the established RetinaFace architecture. The first contribution of this work is the design of a customized lightweight backbone network (BLite) having 0.167M parameters with 0.52 GFLOPs. The second contribution is the use of two independent multi-task losses. The proposed lightweight face detector (FDLite) has 0.26M parameters with 0.94 GFLOPs. The network is trained on the WIDER FACE dataset. FDLite is observed to achieve 92.3\%, 89.8\%, and 82.2\% Average Precision (AP) on the easy, medium, and hard subsets of the WIDER FACE validation dataset, respectively.
comment: 10 pages, 14 figures
☆ DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming
Current multimodal large language models (MLLMs) face significant challenges in visual document understanding (VDU) tasks due to the high resolution, dense text, and complex layouts typical of document images. These characteristics demand a high level of detail perception ability from MLLMs. While increasing input resolution improves detail perception, it also leads to longer sequences of visual tokens, increasing computational costs and straining the models' ability to handle long contexts. To address these challenges, we introduce DocKylin, a document-centric MLLM that performs visual content slimming at both the pixel and token levels, thereby reducing token sequence length in VDU scenarios. DocKylin utilizes an Adaptive Pixel Slimming (APS) preprocessing module to perform pixel-level slimming, increasing the proportion of informative pixels. Moreover, DocKylin incorporates a novel Dynamic Token Slimming (DTS) module to conduct token-level slimming, filtering essential tokens and removing others to create a compressed, adaptive visual sequence. Experiments demonstrate DocKylin's promising performance across various VDU benchmarks. Notably, both the proposed APS and DTS are parameter-free, facilitating easy integration into existing MLLMs, and our experiments indicate their potential for broader applications.
☆ Dimensions underlying the representational alignment of deep neural networks with humans
Determining the similarities and differences between humans and artificial intelligence is an important goal both in machine learning and cognitive neuroscience. However, similarities in representations only inform us about the degree of alignment, not the factors that determine it. Drawing upon recent developments in cognitive science, we propose a generic framework for yielding comparable representations in humans and deep neural networks (DNN). Applying this framework to humans and a DNN model of natural images revealed a low-dimensional DNN embedding of both visual and semantic dimensions. In contrast to humans, DNNs exhibited a clear dominance of visual over semantic features, indicating divergent strategies for representing images. While in-silico experiments showed seemingly-consistent interpretability of DNN dimensions, a direct comparison between human and DNN representations revealed substantial differences in how they process images. By making representations directly comparable, our results reveal important challenges for representational alignment, offering a means for improving their comparability.
☆ Unsupervised Latent Stain Adaption for Digital Pathology MICCAI2024
In digital pathology, deep learning (DL) models for tasks such as segmentation or tissue classification are known to suffer from domain shifts due to different staining techniques. Stain adaptation aims to reduce the generalization error between different stains by training a model on source stains that generalizes to target stains. Despite the abundance of target stain data, a key challenge is the lack of annotations. To address this, we propose a joint training between artificially labeled and unlabeled data including all available stained images called Unsupervised Latent Stain Adaption (ULSA). Our method uses stain translation to enrich labeled source images with synthetic target images in order to increase supervised signals. Moreover, we leverage unlabeled target stain images using stain-invariant feature consistency learning. With ULSA we present a semi-supervised strategy for efficient stain adaption without access to annotated target stain data. Remarkably, ULSA is task agnostic in patch-level analysis for whole slide images (WSIs). Through extensive evaluation on external datasets, we demonstrate that ULSA achieves state-of-the-art (SOTA) performance in kidney tissue segmentation and breast cancer classification across a spectrum of staining variations. Our findings suggest that ULSA is an important framework towards stain adaption in digital pathology.
comment: Accepted in MICCAI2024
☆ FAGhead: Fully Animate Gaussian Head from Monocular Videos
High-fidelity reconstruction of 3D human avatars has a wild application in visual reality. In this paper, we introduce FAGhead, a method that enables fully controllable human portraits from monocular videos. We explicit the traditional 3D morphable meshes (3DMM) and optimize the neutral 3D Gaussians to reconstruct with complex expressions. Furthermore, we employ a novel Point-based Learnable Representation Field (PLRF) with learnable Gaussian point positions to enhance reconstruction performance. Meanwhile, to effectively manage the edges of avatars, we introduced the alpha rendering to supervise the alpha value of each pixel. Extensive experimental results on the open-source datasets and our capturing datasets demonstrate that our approach is able to generate high-fidelity 3D head avatars and fully control the expression and pose of the virtual avatars, which is outperforming than existing works.
☆ Segment Anything Model for automated image data annotation: empirical studies using text prompts from Grounding DINO
Grounding DINO and the Segment Anything Model (SAM) have achieved impressive performance in zero-shot object detection and image segmentation, respectively. Together, they have a great potential in revolutionizing zero-shot semantic segmentation or data annotation. Yet, in specialized domains like medical image segmentation, objects of interest (e.g., organs, tissues, and tumors) may not fall in existing class names. To address this problem, the referring expression comprehension (REC) ability of Grounding DINO is leveraged to detect arbitrary targets by their language descriptions. However, recent studies have highlighted severe limitation of the REC framework in this application setting owing to its tendency to make false positive predictions when the target is absent in the given image. And, while this bottleneck is central to the prospect of open-set semantic segmentation, it is still largely unknown how much improvement can be achieved by studying the prediction errors. To this end, we perform empirical studies on eight publicly available datasets and reveal that these errors consistently follow a predictable pattern and can, thus, be mitigated by a simple strategy. Specifically, we show that these false positive detections with appreciable confidence scores generally occupy large image areas and can usually be filtered by their relative sizes. More importantly, we expect these observations to inspire future research in improving REC-based detection and automated segmentation. Using this technique, we evaluate the performance of SAM on multiple datasets from various specialized domains and report significant improvement in segmentation performance and annotation time savings over manual approaches.
☆ SimpleFusion: A Simple Fusion Framework for Infrared and Visible Images
Integrating visible and infrared images into one high-quality image, also known as visible and infrared image fusion, is a challenging yet critical task for many downstream vision tasks. Most existing works utilize pretrained deep neural networks or design sophisticated frameworks with strong priors for this task, which may be unsuitable or lack flexibility. This paper presents SimpleFusion, a simple yet effective framework for visible and infrared image fusion. Our framework follows the decompose-and-fusion paradigm, where the visible and the infrared images are decomposed into reflectance and illumination components via Retinex theory and followed by the fusion of these corresponding elements. The whole framework is designed with two plain convolutional neural networks without downsampling, which can perform image decomposition and fusion efficiently. Moreover, we introduce decomposition loss and a detail-to-semantic loss to preserve the complementary information between the two modalities for fusion. We conduct extensive experiments on the challenging benchmarks, verifying the superiority of our method over previous state-of-the-arts. Code is available at \href{https://github.com/hxwxss/SimpleFusion-A-Simple-Fusion-Framework-for-Infrared-and-Visible-Images}{https://github.com/hxwxss/SimpleFusion-A-Simple-Fusion-Framework-for-Infrared-and-Visible-Images}
comment: code:https://github.com/hxwxss/SimpleFusion-A-Simple-Fusion-Framework-for-Infrared-and-Visible-Images
☆ BiCo-Fusion: Bidirectional Complementary LiDAR-Camera Fusion for Semantic- and Spatial-Aware 3D Object Detection
3D object detection is an important task that has been widely applied in autonomous driving. Recently, fusing multi-modal inputs, i.e., LiDAR and camera data, to perform this task has become a new trend. Existing methods, however, either ignore the sparsity of Lidar features or fail to preserve the original spatial structure of LiDAR and the semantic density of camera features simultaneously due to the modality gap. To address issues, this letter proposes a novel bidirectional complementary Lidar-camera fusion framework, called BiCo-Fusion that can achieve robust semantic- and spatial-aware 3D object detection. The key insight is to mutually fuse the multi-modal features to enhance the semantics of LiDAR features and the spatial awareness of the camera features and adaptatively select features from both modalities to build a unified 3D representation. Specifically, we introduce Pre-Fusion consisting of a Voxel Enhancement Module (VEM) to enhance the semantics of voxel features from 2D camera features and Image Enhancement Module (IEM) to enhance the spatial characteristics of camera features from 3D voxel features. Both VEM and IEM are bidirectionally updated to effectively reduce the modality gap. We then introduce Unified Fusion to adaptively weight to select features from the enchanted Lidar and camera features to build a unified 3D representation. Extensive experiments demonstrate the superiority of our BiCo-Fusion against the prior arts. Project page: https://t-ys.github.io/BiCo-Fusion/.
comment: 8 pages, 5 figures
☆ CMRxRecon2024: A Multi-Modality, Multi-View K-Space Dataset Boosting Universal Machine Learning for Accelerated Cardiac MRI
Cardiac magnetic resonance imaging (MRI) has emerged as a clinically gold-standard technique for diagnosing cardiac diseases, thanks to its ability to provide diverse information with multiple modalities and anatomical views. Accelerated cardiac MRI is highly expected to achieve time-efficient and patient-friendly imaging, and then advanced image reconstruction approaches are required to recover high-quality, clinically interpretable images from undersampled measurements. However, the lack of publicly available cardiac MRI k-space dataset in terms of both quantity and diversity has severely hindered substantial technological progress, particularly for data-driven artificial intelligence. Here, we provide a standardized, diverse, and high-quality CMRxRecon2024 dataset to facilitate the technical development, fair evaluation, and clinical transfer of cardiac MRI reconstruction approaches, towards promoting the universal frameworks that enable fast and robust reconstructions across different cardiac MRI protocols in clinical practice. To the best of our knowledge, the CMRxRecon2024 dataset is the largest and most diverse publicly available cardiac k-space dataset. It is acquired from 330 healthy volunteers, covering commonly used modalities, anatomical views, and acquisition trajectories in clinical cardiac MRI workflows. Besides, an open platform with tutorials, benchmarks, and data processing tools is provided to facilitate data usage, advanced method development, and fair performance evaluation.
comment: 19 pages, 3 figures, 2 tables
☆ Using diffusion model as constraint: Empower Image Restoration Network Training with Diffusion Model
Image restoration has made marvelous progress with the advent of deep learning. Previous methods usually rely on designing powerful network architecture to elevate performance, however, the natural visual effect of the restored results is limited by color and texture distortions. Besides the visual perceptual quality, the semantic perception recovery is an important but often overlooked perspective of restored image, which is crucial for the deployment in high-level tasks. In this paper, we propose a new perspective to resort these issues by introducing a naturalness-oriented and semantic-aware optimization mechanism, dubbed DiffLoss. Specifically, inspired by the powerful distribution coverage capability of the diffusion model for natural image generation, we exploit the Markov chain sampling property of diffusion model and project the restored results of existing networks into the sampling space. Besides, we reveal that the bottleneck feature of diffusion models, also dubbed h-space feature, is a natural high-level semantic space. We delve into this property and propose a semantic-aware loss to further unlock its potential of semantic perception recovery, which paves the way to connect image restoration task and downstream high-level recognition task. With these two strategies, the DiffLoss can endow existing restoration methods with both more natural and semantic-aware results. We verify the effectiveness of our method on substantial common image restoration tasks and benchmarks. Code will be available at https://github.com/JosephTiTan/DiffLoss.
☆ VideoMambaPro: A Leap Forward for Mamba in Video Understanding
Video understanding requires the extraction of rich spatio-temporal representations, which transformer models achieve through self-attention. Unfortunately, self-attention poses a computational burden. In NLP, Mamba has surfaced as an efficient alternative for transformers. However, Mamba's successes do not trivially extend to computer vision tasks, including those in video analysis. In this paper, we theoretically analyze the differences between self-attention and Mamba. We identify two limitations in Mamba's token processing: historical decay and element contradiction. We propose VideoMambaPro (VMP) that solves the identified limitations by adding masked backward computation and elemental residual connections to a VideoMamba backbone. VideoMambaPro shows state-of-the-art video action recognition performance compared to transformer models, and surpasses VideoMamba by clear margins: 7.9% and 8.1% top-1 on Kinetics-400 and Something-Something V2, respectively. Our VideoMambaPro-M model achieves 91.9% top-1 on Kinetics-400, only 0.2% below InternVideo2-6B but with only 1.2% of its parameters. The combination of high performance and efficiency makes VideoMambaPro an interesting alternative for transformer models.
☆ Improving Taxonomic Image-based Out-of-distribution Detection With DNA Barcodes
Image-based species identification could help scaling biodiversity monitoring to a global scale. Many challenges still need to be solved in order to implement these systems in real-world applications. A reliable image-based monitoring system must detect out-of-distribution (OOD) classes it has not been presented before. This is challenging especially with fine-grained classes. Emerging environmental monitoring techniques, DNA metabarcoding and eDNA, can help by providing information on OOD classes that are present in a sample. In this paper, we study if DNA barcodes can also support in finding the outlier images based on the outlier DNA sequence's similarity to the seen classes. We propose a re-ordering approach that can be easily applied on any pre-trained models and existing OOD detection methods. We experimentally show that the proposed approach improves taxonomic OOD detection compared to all common baselines. We also show that the method works thanks to a correlation between visual similarity and DNA barcode proximity. The code and data are available at https://github.com/mikkoim/dnaimg-ood.
comment: Accepted to EUSIPCO 2024
☆ Zero-shot domain adaptation based on dual-level mix and contrast
Zero-shot domain adaptation (ZSDA) is a domain adaptation problem in the situation that labeled samples for a target task (task of interest) are only available from the source domain at training time, but for a task different from the task of interest (irrelevant task), labeled samples are available from both source and target domains. In this situation, classical domain adaptation techniques can only learn domain-invariant features in the irrelevant task. However, due to the difference in sample distribution between the two tasks, domain-invariant features learned in the irrelevant task are biased and not necessarily domain-invariant in the task of interest. To solve this problem, this paper proposes a new ZSDA method to learn domain-invariant features with low task bias. To this end, we propose (1) data augmentation with dual-level mixups in both task and domain to fill the absence of target task-of-interest data, (2) an extension of domain adversarial learning to learn domain-invariant features with less task bias, and (3) a new dual-level contrastive learning method that enhances domain-invariance and less task biasedness of features. Experimental results show that our proposal achieves good performance on several benchmarks.
comment: Accepted by IEEE conference on Artificial intelligence 2024
☆ Semi-supervised Concept Bottleneck Models
Concept Bottleneck Models (CBMs) have garnered increasing attention due to their ability to provide concept-based explanations for black-box deep learning models while achieving high final prediction accuracy using human-like concepts. However, the training of current CBMs heavily relies on the accuracy and richness of annotated concepts in the dataset. These concept labels are typically provided by experts, which can be costly and require significant resources and effort. Additionally, concept saliency maps frequently misalign with input saliency maps, causing concept predictions to correspond to irrelevant input features - an issue related to annotation alignment. To address these limitations, we propose a new framework called SSCBM (Semi-supervised Concept Bottleneck Model). Our SSCBM is suitable for practical situations where annotated data is scarce. By leveraging joint training on both labeled and unlabeled data and aligning the unlabeled data at the concept level, we effectively solve these issues. We proposed a strategy to generate pseudo labels and an alignment loss. Experiments demonstrate that our SSCBM is both effective and efficient. With only 20% labeled data, we achieved 93.19% (96.39% in a fully supervised setting) concept accuracy and 75.51% (79.82% in a fully supervised setting) prediction accuracy.
comment: 17 pages
☆ RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulaiton
Utilizing Vision-Language Models (VLMs) for robotic manipulation represents a novel paradigm, aiming to enhance the model's ability to generalize to new objects and instructions. However, due to variations in camera specifications and mounting positions, existing methods exhibit significant performance disparities across different robotic platforms. To address this challenge, we propose RoboUniView in this paper, an innovative approach that decouples visual feature extraction from action learning. We first learn a unified view representation from multi-perspective views by pre-training on readily accessible data, and then derive actions from this unified view representation to control robotic manipulation. This unified view representation more accurately mirrors the physical world and is not constrained by the robotic platform's camera parameters. Thanks to this methodology, we achieve state-of-the-art performance on the demanding CALVIN benchmark, enhancing the success rate in the $D \to D$ setting from 88.7% to 96.2%, and in the $ABC \to D$ setting from 82.4% to 94.2%. Moreover, our model exhibits outstanding adaptability and flexibility: it maintains high performance under unseen camera parameters, can utilize multiple datasets with varying camera parameters, and is capable of joint cross-task learning across datasets. Code is provided for re-implementation. https://github.com/liufanfanlff/RoboUniview
☆ Structural Attention: Rethinking Transformer for Unpaired Medical Image Synthesis MICCAI2024
Unpaired medical image synthesis aims to provide complementary information for an accurate clinical diagnostics, and address challenges in obtaining aligned multi-modal medical scans. Transformer-based models excel in imaging translation tasks thanks to their ability to capture long-range dependencies. Although effective in supervised training settings, their performance falters in unpaired image synthesis, particularly in synthesizing structural details. This paper empirically demonstrates that, lacking strong inductive biases, Transformer can converge to non-optimal solutions in the absence of paired data. To address this, we introduce UNet Structured Transformer (UNest), a novel architecture incorporating structural inductive biases for unpaired medical image synthesis. We leverage the foundational Segment-Anything Model to precisely extract the foreground structure and perform structural attention within the main anatomy. This guides the model to learn key anatomical regions, thus improving structural synthesis under the lack of supervision in unpaired training. Evaluated on two public datasets, spanning three modalities, i.e., MR, CT, and PET, UNest improves recent methods by up to 19.30% across six medical image synthesis tasks. Our code is released at https://github.com/HieuPhan33/MICCAI2024-UNest.
comment: MICCAI2024 - Early Accept Top 11%
☆ AnyControl: Create Your Artwork with Versatile Control on Text-to-Image Generation
The field of text-to-image (T2I) generation has made significant progress in recent years, largely driven by advancements in diffusion models. Linguistic control enables effective content creation, but struggles with fine-grained control over image generation. This challenge has been explored, to a great extent, by incorporating additional user-supplied spatial conditions, such as depth maps and edge maps, into pre-trained T2I models through extra encoding. However, multi-control image synthesis still faces several challenges. Specifically, current approaches are limited in handling free combinations of diverse input control signals, overlook the complex relationships among multiple spatial conditions, and often fail to maintain semantic alignment with provided textual prompts. This can lead to suboptimal user experiences. To address these challenges, we propose AnyControl, a multi-control image synthesis framework that supports arbitrary combinations of diverse control signals. AnyControl develops a novel Multi-Control Encoder that extracts a unified multi-modal embedding to guide the generation process. This approach enables a holistic understanding of user inputs, and produces high-quality, faithful results under versatile control signals, as demonstrated by extensive quantitative and qualitative evaluations. Our project page is available in \url{https://any-control.github.io}.
☆ MMR-Mamba: Multi-Contrast MRI Reconstruction with Mamba and Spatial-Frequency Information Fusion
Multi-contrast MRI acceleration has become prevalent in MR imaging, enabling the reconstruction of high-quality MR images from under-sampled k-space data of the target modality, using guidance from a fully-sampled auxiliary modality. The main crux lies in efficiently and comprehensively integrating complementary information from the auxiliary modality. Existing methods either suffer from quadratic computational complexity or fail to capture long-range correlated features comprehensively. In this work, we propose MMR-Mamba, a novel framework that achieves comprehensive integration of multi-contrast features through Mamba and spatial-frequency information fusion. Firstly, we design the \textit{Target modality-guided Cross Mamba} (TCM) module in the spatial domain, which maximally restores the target modality information by selectively absorbing useful information from the auxiliary modality. Secondly, leveraging global properties of the Fourier domain, we introduce the \textit{Selective Frequency Fusion} (SFF) module to efficiently integrate global information in the frequency domain and recover high-frequency signals for the reconstruction of structure details. Additionally, we present the \textit{Adaptive Spatial-Frequency Fusion} (ASFF) module, which enhances fused features by supplementing less informative features from one domain with corresponding features from the other domain. These innovative strategies ensure efficient feature fusion across spatial and frequency domains, avoiding the introduction of redundant information and facilitating the reconstruction of high-quality target images. Extensive experiments on the BraTS and fastMRI knee datasets demonstrate the superiority of the proposed MMR-Mamba over state-of-the-art MRI reconstruction methods.
comment: 10 pages, 5 figure
☆ Investigating and Defending Shortcut Learning in Personalized Diffusion Models
Personalized diffusion models have gained popularity for adapting pre-trained text-to-image models to generate images of specific topics with only a few images. However, recent studies find that these models are vulnerable to minor adversarial perturbation, and the fine-tuning performance is largely degraded on corrupted datasets. Such characteristics are further exploited to craft protective perturbation on sensitive images like portraits that prevent unauthorized generation. In response, diffusion-based purification methods have been proposed to remove these perturbations and retain generation performance. However, existing works lack detailed analysis of the fundamental shortcut learning vulnerability of personalized diffusion models and also turn to over-purifying the images cause information loss. In this paper, we take a closer look at the fine-tuning process of personalized diffusion models through the lens of shortcut learning and propose a hypothesis that could explain the underlying manipulation mechanisms of existing perturbation methods. Specifically, we find that the perturbed images are greatly shifted from their original paired prompt in the CLIP-based latent space. As a result, training with this mismatched image-prompt pair creates a construction that causes the models to dump their out-of-distribution noisy patterns to the identifier, thus causing serious performance degradation. Based on this observation, we propose a systematic approach to retain the training performance with purification that realigns the latent image and its semantic meaning and also introduces contrastive learning with a negative token to decouple the learning of wanted clean identity and the unwanted noisy pattern, that shows strong potential capacity against further adaptive perturbation.
comment: Preprint
☆ CLIP3D-AD: Extending CLIP for 3D Few-Shot Anomaly Detection with Multi-View Images Generation
Few-shot anomaly detection methods can effectively address data collecting difficulty in industrial scenarios. Compared to 2D few-shot anomaly detection (2D-FSAD), 3D few-shot anomaly detection (3D-FSAD) is still an unexplored but essential task. In this paper, we propose CLIP3D-AD, an efficient 3D-FSAD method extended on CLIP. We successfully transfer strong generalization ability of CLIP into 3D-FSAD. Specifically, we synthesize anomalous images on given normal images as sample pairs to adapt CLIP for 3D anomaly classification and segmentation. For classification, we introduce an image adapter and a text adapter to fine-tune global visual features and text features. Meanwhile, we propose a coarse-to-fine decoder to fuse and facilitate intermediate multi-layer visual representations of CLIP. To benefit from geometry information of point cloud and eliminate modality and data discrepancy when processed by CLIP, we project and render point cloud to multi-view normal and anomalous images. Then we design multi-view fusion module to fuse features of multi-view images extracted by CLIP which are used to facilitate visual representations for further enhancing vision-language correlation. Extensive experiments demonstrate that our method has a competitive performance of 3D few-shot anomaly classification and segmentation on MVTec-3D AD dataset.
comment: 10 pages, 7 figures
☆ RoFIR: Robust Fisheye Image Rectification Framework Impervious to Optical Center Deviation
Fisheye images are categorized fisheye into central and deviated based on the optical center position. Existing rectification methods are limited to central fisheye images, while this paper proposes a novel method that extends to deviated fisheye image rectification. The challenge lies in the variant global distortion distribution pattern caused by the random optical center position. To address this challenge, we propose a distortion vector map (DVM) that measures the degree and direction of local distortion. By learning the DVM, the model can independently identify local distortions at each pixel without relying on global distortion patterns. The model adopts a pre-training and fine-tuning training paradigm. In the pre-training stage, it predicts the distortion vector map and perceives the local distortion features of each pixel. In the fine-tuning stage, it predicts a pixel-wise flow map for deviated fisheye image rectification. We also propose a data augmentation method mixing central, deviated, and distorted-free images. Such data augmentation promotes the model performance in rectifying both central and deviated fisheye images, compared with models trained on single-type fisheye images. Extensive experiments demonstrate the effectiveness and superiority of the proposed method.
☆ Selective Vision is the Challenge for Visual Reasoning: A Benchmark for Visual Argument Understanding
Visual arguments, often used in advertising or social causes, rely on images to persuade viewers to do or believe something. Understanding these arguments requires selective vision: only specific visual stimuli within an image are relevant to the argument, and relevance can only be understood within the context of a broader argumentative structure. While visual arguments are readily appreciated by human audiences, we ask: are today's AI capable of similar understanding? We collect and release VisArgs, an annotated corpus designed to make explicit the (usually implicit) structures underlying visual arguments. VisArgs includes 1,611 images accompanied by three types of textual annotations: 5,112 visual premises (with region annotations), 5,574 commonsense premises, and reasoning trees connecting them to a broader argument. We propose three tasks over VisArgs to probe machine capacity for visual argument understanding: localization of premises, identification of premises, and deduction of conclusions. Experiments demonstrate that 1) machines cannot fully identify the relevant visual cues. The top-performing model, GPT-4-O, achieved an accuracy of only 78.5%, whereas humans reached 98.0%. All models showed a performance drop, with an average decrease in accuracy of 19.5%, when the comparison set was changed from objects outside the image to irrelevant objects within the image. Furthermore, 2) this limitation is the greatest factor impacting their performance in understanding visual arguments. Most models improved the most when given relevant visual premises as additional inputs, compared to other inputs, for deducing the conclusion of the visual argument.
comment: 12 pages, 5 figures
☆ Classification of Carotid Plaque with Jellyfish Sign Through Convolutional and Recurrent Neural Networks Utilizing Plaque Surface Edges
In carotid arteries, plaque can develop as localized elevated lesions. The Jellyfish sign, marked by fluctuating plaque surfaces with blood flow pulsation, is a dynamic characteristic of these plaques that has recently attracted attention. Detecting this sign is vital, as it is often associated with cerebral infarction. This paper proposes an ultrasound video-based classification method for the Jellyfish sign, using deep neural networks. The proposed method first preprocesses carotid ultrasound videos to separate the movement of the vascular wall from plaque movements. These preprocessed videos are then combined with plaque surface information and fed into a deep learning model comprising convolutional and recurrent neural networks, enabling the efficient classification of the Jellyfish sign. The proposed method was verified using ultrasound video images from 200 patients. Ablation studies demonstrated the effectiveness of each component of the proposed method.
comment: 4 pages, 3 figures, accepted at IEEE EMBC 2024
☆ Manipulate-Anything: Automating Real-World Robots using Vision-Language Models
Large-scale endeavors like RT-1 and widespread community efforts such as Open-X-Embodiment have contributed to growing the scale of robot demonstration data. However, there is still an opportunity to improve the quality, quantity, and diversity of robot demonstration data. Although vision-language models have been shown to automatically generate demonstration data, their utility has been limited to environments with privileged state information, they require hand-designed skills, and are limited to interactions with few object instances. We propose Manipulate-Anything, a scalable automated generation method for real-world robotic manipulation. Unlike prior work, our method can operate in real-world environments without any privileged state information, hand-designed skills, and can manipulate any static object. We evaluate our method using two setups. First, Manipulate-Anything successfully generates trajectories for all 5 real-world and 12 simulation tasks, significantly outperforming existing methods like VoxPoser. Second, Manipulate-Anything's demonstrations can train more robust behavior cloning policies than training with human demonstrations, or from data generated by VoxPoser and Code-As-Policies. We believe \methodLong\ can be the scalable method for both generating data for robotics and solving novel tasks in a zero-shot setting.
comment: Project page: https://robo-point.github.io/
☆ A Universal Railway Obstacle Detection System based on Semi-supervised Segmentation And Optical Flow
Detecting obstacles in railway scenarios is both crucial and challenging due to the wide range of obstacle categories and varying ambient conditions such as weather and light. Given the impossibility of encompassing all obstacle categories during the training stage, we address this out-of-distribution (OOD) issue with a semi-supervised segmentation approach guided by optical flow clues. We reformulate the task as a binary segmentation problem instead of the traditional object detection approach. To mitigate data shortages, we generate highly realistic synthetic images using Segment Anything (SAM) and YOLO, eliminating the need for manual annotation to produce abundant pixel-level annotations. Additionally, we leverage optical flow as prior knowledge to train the model effectively. Several experiments are conducted, demonstrating the feasibility and effectiveness of our approach.
☆ Autoencoder based approach for the mitigation of spurious correlations
Deep neural networks (DNNs) have exhibited remarkable performance across various tasks, yet their susceptibility to spurious correlations poses a significant challenge for out-of-distribution (OOD) generalization. Spurious correlations refer to erroneous associations in data that do not reflect true underlying relationships but are instead artifacts of dataset characteristics or biases. These correlations can lead DNNs to learn patterns that are not robust across diverse datasets or real-world scenarios, hampering their ability to generalize beyond training data. In this paper, we propose an autoencoder-based approach to analyze the nature of spurious correlations that exist in the Global Wheat Head Detection (GWHD) 2021 dataset. We then use inpainting followed by Weighted Boxes Fusion (WBF) to achieve a 2% increase in the Average Domain Accuracy (ADA) over the YOLOv5 baseline and consistently show that our approach has the ability to suppress some of the spurious correlations in the GWHD 2021 dataset. The key advantage of our approach is that it is more suitable in scenarios where there is limited scope to adapt or fine-tune the trained model in unseen test environments.
☆ 360 in the Wild: Dataset for Depth Prediction and View Synthesis
The large abundance of perspective camera datasets facilitated the emergence of novel learning-based strategies for various tasks, such as camera localization, single image depth estimation, or view synthesis. However, panoramic or omnidirectional image datasets, including essential information, such as pose and depth, are mostly made with synthetic scenes. In this work, we introduce a large scale 360$^{\circ}$ videos dataset in the wild. This dataset has been carefully scraped from the Internet and has been captured from various locations worldwide. Hence, this dataset exhibits very diversified environments (e.g., indoor and outdoor) and contexts (e.g., with and without moving objects). Each of the 25K images constituting our dataset is provided with its respective camera's pose and depth map. We illustrate the relevance of our dataset for two main tasks, namely, single image depth estimation and view synthesis.
☆ AlignIT: Enhancing Prompt Alignment in Customization of Text-to-Image Models
We consider the problem of customizing text-to-image diffusion models with user-supplied reference images. Given new prompts, the existing methods can capture the key concept from the reference images but fail to align the generated image with the prompt. In this work, we seek to address this key issue by proposing new methods that can easily be used in conjunction with existing customization methods that optimize the embeddings/weights at various intermediate stages of the text encoding process. The first contribution of this paper is a dissection of the various stages of the text encoding process leading up to the conditioning vector for text-to-image models. We take a holistic view of existing customization methods and notice that key and value outputs from this process differs substantially from their corresponding baseline (non-customized) models (e.g., baseline stable diffusion). While this difference does not impact the concept being customized, it leads to other parts of the generated image not being aligned with the prompt (see first row in Fig 1). Further, we also observe that these keys and values allow independent control various aspects of the final generation, enabling semantic manipulation of the output. Taken together, the features spanning these keys and values, serve as the basis for our next contribution where we fix the aforementioned issues with existing methods. We propose a new post-processing algorithm, \textbf{AlignIT}, that infuses the keys and values for the concept of interest while ensuring the keys and values for all other tokens in the input prompt are unchanged. Our proposed method can be plugged in directly to existing customization methods, leading to a substantial performance improvement in the alignment of the final result with the input prompt while retaining the customization quality.
comment: 10 pages, 9 figures
☆ Advancing Cross-domain Discriminability in Continual Learning of Vison-Language Models
Continual learning (CL) with Vision-Language Models (VLMs) has overcome the constraints of traditional CL, which only focuses on previously encountered classes. During the CL of VLMs, we need not only to prevent the catastrophic forgetting on incrementally learned knowledge but also to preserve the zero-shot ability of VLMs. However, existing methods require additional reference datasets to maintain such zero-shot ability and rely on domain-identity hints to classify images across different domains. In this study, we propose Regression-based Analytic Incremental Learning (RAIL), which utilizes a recursive ridge regression-based adapter to learn from a sequence of domains in a non-forgetting manner and decouple the cross-domain correlations by projecting features to a higher-dimensional space. Cooperating with a training-free fusion module, RAIL absolutely preserves the VLM's zero-shot ability on unseen domains without any reference data. Additionally, we introduce Cross-domain Task-Agnostic Incremental Learning (X-TAIL) setting. In this setting, a CL learner is required to incrementally learn from multiple domains and classify test images from both seen and unseen domains without any domain-identity hint. We theoretically prove RAIL's absolute memorization on incrementally learned domains. Experiment results affirm RAIL's state-of-the-art performance in both X-TAIL and existing Multi-domain Task-Incremental Learning settings. The code will be released upon acceptance.
☆ Learning Modality Knowledge Alignment for Cross-Modality Transfer ICML 2024
Cross-modality transfer aims to leverage large pretrained models to complete tasks that may not belong to the modality of pretraining data. Existing works achieve certain success in extending classical finetuning to cross-modal scenarios, yet we still lack understanding about the influence of modality gap on the transfer. In this work, a series of experiments focusing on the source representation quality during transfer are conducted, revealing the connection between larger modality gap and lesser knowledge reuse which means ineffective transfer. We then formalize the gap as the knowledge misalignment between modalities using conditional distribution P(Y|X). Towards this problem, we present Modality kNowledge Alignment (MoNA), a meta-learning approach that learns target data transformation to reduce the modality knowledge discrepancy ahead of the transfer. Experiments show that out method enables better reuse of source modality knowledge in cross-modality transfer, which leads to improvements upon existing finetuning methods.
comment: ICML 2024
☆ Dysca: A Dynamic and Scalable Benchmark for Evaluating Perception Ability of LVLMs
Currently many benchmarks have been proposed to evaluate the perception ability of the Large Vision-Language Models (LVLMs). However, most benchmarks conduct questions by selecting images from existing datasets, resulting in the potential data leakage. Besides, these benchmarks merely focus on evaluating LVLMs on the realistic style images and clean scenarios, leaving the multi-stylized images and noisy scenarios unexplored. In response to these challenges, we propose a dynamic and scalable benchmark named Dysca for evaluating LVLMs by leveraging synthesis images. Specifically, we leverage Stable Diffusion and design a rule-based method to dynamically generate novel images, questions and the corresponding answers. We consider 51 kinds of image styles and evaluate the perception capability in 20 subtasks. Moreover, we conduct evaluations under 4 scenarios (i.e., Clean, Corruption, Print Attacking and Adversarial Attacking) and 3 question types (i.e., Multi-choices, True-or-false and Free-form). Thanks to the generative paradigm, Dysca serves as a scalable benchmark for easily adding new subtasks and scenarios. A total of 8 advanced open-source LVLMs with 10 checkpoints are evaluated on Dysca, revealing the drawbacks of current LVLMs. The benchmark is released in \url{https://github.com/Benchmark-Dysca/Dysca}.
☆ Retain, Blend, and Exchange: A Quality-aware Spatial-Stereo Fusion Approach for Event Stream Recognition
Existing event stream-based pattern recognition models usually represent the event stream as the point cloud, voxel, image, etc., and design various deep neural networks to learn their features. Although considerable results can be achieved in simple cases, however, the model performance may be limited by monotonous modality expressions, sub-optimal fusion, and readout mechanisms. In this paper, we propose a novel dual-stream framework for event stream-based pattern recognition via differentiated fusion, termed EFV++. It models two common event representations simultaneously, i.e., event images and event voxels. The spatial and three-dimensional stereo information can be learned separately by utilizing Transformer and Graph Neural Network (GNN). We believe the features of each representation still contain both efficient and redundant features and a sub-optimal solution may be obtained if we directly fuse them without differentiation. Thus, we divide each feature into three levels and retain high-quality features, blend medium-quality features, and exchange low-quality features. The enhanced dual features will be fed into the fusion Transformer together with bottleneck features. In addition, we introduce a novel hybrid interaction readout mechanism to enhance the diversity of features as final representations. Extensive experiments demonstrate that our proposed framework achieves state-of-the-art performance on multiple widely used event stream-based classification datasets. Specifically, we achieve new state-of-the-art performance on the Bullying10k dataset, i.e., $90.51\%$, which exceeds the second place by $+2.21\%$. The source code of this paper has been released on \url{https://github.com/Event-AHU/EFV_event_classification/tree/EFVpp}.
comment: In Peer Review, Journal Extension of PRCV 2023
☆ Revisiting Backdoor Attacks against Large Vision-Language Models
Instruction tuning enhances large vision-language models (LVLMs) but raises security risks through potential backdoor attacks due to their openness. Previous backdoor studies focus on enclosed scenarios with consistent training and testing instructions, neglecting the practical domain gaps that could affect attack effectiveness. This paper empirically examines the generalizability of backdoor attacks during the instruction tuning of LVLMs for the first time, revealing certain limitations of most backdoor strategies in practical scenarios. We quantitatively evaluate the generalizability of six typical backdoor attacks on image caption benchmarks across multiple LVLMs, considering both visual and textual domain offsets. Our findings indicate that attack generalizability is positively correlated with the backdoor trigger's irrelevance to specific images/models and the preferential correlation of the trigger pattern. Additionally, we modify existing backdoor attacks based on the above key observations, demonstrating significant improvements in cross-domain scenario generalizability (+86% attack success rate). Notably, even without access to the instruction datasets, a multimodal instruction set can be successfully poisoned with a very low poisoning rate (0.2%), achieving an attack success rate of over 97%. This paper underscores that even simple traditional backdoor strategies pose a serious threat to LVLMs, necessitating more attention and in-depth research.
comment: 23 pages, 8 figures
☆ Dense Monocular Motion Segmentation Using Optical Flow and Pseudo Depth Map: A Zero-Shot Approach
Motion segmentation from a single moving camera presents a significant challenge in the field of computer vision. This challenge is compounded by the unknown camera movements and the lack of depth information of the scene. While deep learning has shown impressive capabilities in addressing these issues, supervised models require extensive training on massive annotated datasets, and unsupervised models also require training on large volumes of unannotated data, presenting significant barriers for both. In contrast, traditional methods based on optical flow do not require training data, however, they often fail to capture object-level information, leading to over-segmentation or under-segmentation. In addition, they also struggle in complex scenes with substantial depth variations and non-rigid motion, due to the overreliance of optical flow. To overcome these challenges, we propose an innovative hybrid approach that leverages the advantages of both deep learning methods and traditional optical flow based methods to perform dense motion segmentation without requiring any training. Our method initiates by automatically generating object proposals for each frame using foundation models. These proposals are then clustered into distinct motion groups using both optical flow and relative depth maps as motion cues. The integration of depth maps derived from state-of-the-art monocular depth estimation models significantly enhances the motion cues provided by optical flow, particularly in handling motion parallax issues. Our method is evaluated on the DAVIS-Moving and YTVOS-Moving datasets, and the results demonstrate that our method outperforms the best unsupervised method and closely matches with the state-of-theart supervised methods.
comment: For the offical publication, see https://crv.pubpub.org/pub/iunjzl55
☆ Zero-shot Composed Image Retrieval Considering Query-target Relationship Leveraging Masked Image-text Pairs ICIP 2024
This paper proposes a novel zero-shot composed image retrieval (CIR) method considering the query-target relationship by masked image-text pairs. The objective of CIR is to retrieve the target image using a query image and a query text. Existing methods use a textual inversion network to convert the query image into a pseudo word to compose the image and text and use a pre-trained visual-language model to realize the retrieval. However, they do not consider the query-target relationship to train the textual inversion network to acquire information for retrieval. In this paper, we propose a novel zero-shot CIR method that is trained end-to-end using masked image-text pairs. By exploiting the abundant image-text pairs that are convenient to obtain with a masking strategy for learning the query-target relationship, it is expected that accurate zero-shot CIR using a retrieval-focused textual inversion network can be realized. Experimental results show the effectiveness of the proposed method.
comment: Accepted as a conference paper in IEEE ICIP 2024
☆ Correspondence-Free Non-Rigid Point Set Registration Using Unsupervised Clustering Analysis CVPR 2024
This paper presents a novel non-rigid point set registration method that is inspired by unsupervised clustering analysis. Unlike previous approaches that treat the source and target point sets as separate entities, we develop a holistic framework where they are formulated as clustering centroids and clustering members, separately. We then adopt Tikhonov regularization with an $\ell_1$-induced Laplacian kernel instead of the commonly used Gaussian kernel to ensure smooth and more robust displacement fields. Our formulation delivers closed-form solutions, theoretical guarantees, independence from dimensions, and the ability to handle large deformations. Subsequently, we introduce a clustering-improved Nystr\"om method to effectively reduce the computational complexity and storage of the Gram matrix to linear, while providing a rigorous bound for the low-rank approximation. Our method achieves high accuracy results across various scenarios and surpasses competitors by a significant margin, particularly on shapes with substantial deformations. Additionally, we demonstrate the versatility of our method in challenging tasks such as shape transfer and medical registration.
comment: [CVPR 2024 Highlight] Project and code at: https://github.com/zikai1/CVPR24_PointSetReg
☆ Divide, Ensemble and Conquer: The Last Mile on Unsupervised Domain Adaptation for On-Board Semantic Segmentation
The last mile of unsupervised domain adaptation (UDA) for semantic segmentation is the challenge of solving the syn-to-real domain gap. Recent UDA methods have progressed significantly, yet they often rely on strategies customized for synthetic single-source datasets (e.g., GTA5), which limits their generalisation to multi-source datasets. Conversely, synthetic multi-source datasets hold promise for advancing the last mile of UDA but remain underutilized in current research. Thus, we propose DEC, a flexible UDA framework for multi-source datasets. Following a divide-and-conquer strategy, DEC simplifies the task by categorizing semantic classes, training models for each category, and fusing their outputs by an ensemble model trained exclusively on synthetic datasets to obtain the final segmentation mask. DEC can integrate with existing UDA methods, achieving state-of-the-art performance on Cityscapes, BDD100K, and Mapillary Vistas, significantly narrowing the syn-to-real domain gap.
☆ PathAlign: A vision-language model for whole slide images in histopathology
Microscopic interpretation of histopathology images underlies many important diagnostic and treatment decisions. While advances in vision-language modeling raise new opportunities for analysis of such images, the gigapixel-scale size of whole slide images (WSIs) introduces unique challenges. Additionally, pathology reports simultaneously highlight key findings from small regions while also aggregating interpretation across multiple slides, often making it difficult to create robust image-text pairs. As such, pathology reports remain a largely untapped source of supervision in computational pathology, with most efforts relying on region-of-interest annotations or self-supervision at the patch-level. In this work, we develop a vision-language model based on the BLIP-2 framework using WSIs paired with curated text from pathology reports. This enables applications utilizing a shared image-text embedding space, such as text or image retrieval for finding cases of interest, as well as integration of the WSI encoder with a frozen large language model (LLM) for WSI-based generative text capabilities such as report generation or AI-in-the-loop interactions. We utilize a de-identified dataset of over 350,000 WSIs and diagnostic text pairs, spanning a wide range of diagnoses, procedure types, and tissue types. We present pathologist evaluation of text generation and text retrieval using WSI embeddings, as well as results for WSI classification and workflow prioritization (slide-level triaging). Model-generated text for WSIs was rated by pathologists as accurate, without clinically significant error or omission, for 78% of WSIs on average. This work demonstrates exciting potential capabilities for language-aligned WSI embeddings.
comment: 9 main pages and 19 pages of supplemental material; 3 main tables, 3 main figures and 11 supplemental tables, 7 supplemental figures
☆ What Matters in Detecting AI-Generated Videos like Sora?
Recent advancements in diffusion-based video generation have showcased remarkable results, yet the gap between synthetic and real-world videos remains under-explored. In this study, we examine this gap from three fundamental perspectives: appearance, motion, and geometry, comparing real-world videos with those generated by a state-of-the-art AI model, Stable Video Diffusion. To achieve this, we train three classifiers using 3D convolutional networks, each targeting distinct aspects: vision foundation model features for appearance, optical flow for motion, and monocular depth for geometry. Each classifier exhibits strong performance in fake video detection, both qualitatively and quantitatively. This indicates that AI-generated videos are still easily detectable, and a significant gap between real and fake videos persists. Furthermore, utilizing the Grad-CAM, we pinpoint systematic failures of AI-generated videos in appearance, motion, and geometry. Finally, we propose an Ensemble-of-Experts model that integrates appearance, optical flow, and depth information for fake video detection, resulting in enhanced robustness and generalization ability. Our model is capable of detecting videos generated by Sora with high accuracy, even without exposure to any Sora videos during training. This suggests that the gap between real and fake videos can be generalized across various video generative models. Project page: https://justin-crchang.github.io/3DCNNDetection.github.io/
☆ Cost-efficient Active Illumination Camera For Hyper-spectral Reconstruction
Hyper-spectral imaging has recently gained increasing attention for use in different applications, including agricultural investigation, ground tracking, remote sensing and many other. However, the high cost, large physical size and complicated operation process stop hyperspectral cameras from being employed for various applications and research fields. In this paper, we introduce a cost-efficient, compact and easy to use active illumination camera that may benefit many applications. We developed a fully functional prototype of such camera. With the hope of helping with agricultural research, we tested our camera for plant root imaging. In addition, a U-Net model for spectral reconstruction was trained by using a reference hyperspectral camera's data as ground truth and our camera's data as input. We demonstrated our camera's ability to obtain additional information over a typical RGB camera. In addition, the ability to reconstruct hyperspectral data from multi-spectral input makes our device compatible to models and algorithms developed for hyperspectral applications with no modifications required.
☆ Robustness Testing of Black-Box Models Against CT Degradation Through Test-Time Augmentation
Deep learning models for medical image segmentation and object detection are becoming increasingly available as clinical products. However, as details are rarely provided about the training data, models may unexpectedly fail when cases differ from those in the training distribution. An approach allowing potential users to independently test the robustness of a model, treating it as a black box and using only a few cases from their own site, is key for adoption. To address this, a method to test the robustness of these models against CT image quality variation is presented. In this work we present this framework by demonstrating that given the same training data, the model architecture and data pre processing greatly affect the robustness of several frequently used segmentation and object detection methods to simulated CT imaging artifacts and degradation. Our framework also addresses the concern about the sustainability of deep learning models in clinical use, by considering future shifts in image quality due to scanner deterioration or imaging protocol changes which are not reflected in a limited local test dataset.
☆ BOrg: A Brain Organoid-Based Mitosis Dataset for Automatic Analysis of Brain Diseases
Recent advances have enabled the study of human brain development using brain organoids derived from stem cells. Quantifying cellular processes like mitosis in these organoids offers insights into neurodevelopmental disorders, but the manual analysis is time-consuming, and existing datasets lack specific details for brain organoid studies. We introduce BOrg, a dataset designed to study mitotic events in the embryonic development of the brain using confocal microscopy images of brain organoids. BOrg utilizes an efficient annotation pipeline with sparse point annotations and techniques that minimize expert effort, overcoming limitations of standard deep learning approaches on sparse data. We adapt and benchmark state-of-the-art object detection and cell counting models on BOrg for detecting and analyzing mitotic cells across prophase, metaphase, anaphase, and telophase stages. Our results demonstrate these adapted models significantly improve mitosis analysis efficiency and accuracy for brain organoid research compared to existing methods. BOrg facilitates the development of automated tools to quantify statistics like mitosis rates, aiding mechanistic studies of neurodevelopmental processes and disorders. Data and code are available at https://github.com/awaisrauf/borg.
☆ Weighted Circle Fusion: Ensembling Circle Representation from Different Object Detection Results
Recently, the use of circle representation has emerged as a method to improve the identification of spherical objects (such as glomeruli, cells, and nuclei) in medical imaging studies. In traditional bounding box-based object detection, combining results from multiple models improves accuracy, especially when real-time processing isn't crucial. Unfortunately, this widely adopted strategy is not readily available for combining circle representations. In this paper, we propose Weighted Circle Fusion (WCF), a simple approach for merging predictions from various circle detection models. Our method leverages confidence scores associated with each proposed bounding circle to generate averaged circles. Our method undergoes thorough evaluation on a proprietary dataset for glomerular detection in object detection within whole slide imaging (WSI). The findings reveal a performance gain of 5 %, respectively, compared to existing ensemble methods. Furthermore, the Weighted Circle Fusion technique not only improves the precision of object detection in medical images but also notably decreases false detections, presenting a promising direction for future research and application in pathological image analysis.
☆ Comparative Analysis Of Color Models For Human Perception And Visual Color Difference
Color is integral to human experience, influencing emotions, decisions, and perceptions. This paper presents a comparative analysis of various color models' alignment with human visual perception. The study evaluates color models such as RGB, HSV, HSL, XYZ, CIELAB, and CIELUV to assess their effectiveness in accurately representing how humans perceive color. We evaluate each model based on its ability to accurately reflect visual color differences and dominant palette extraction compatible with the human eye. In image processing, accurate assessment of color difference is essential for applications ranging from digital design to quality control. Current color difference metrics do not always match how people see colors, causing issues in accurately judging subtle differences. Understanding how different color models align with human visual perception is crucial for various applications in image processing, digital media, and design.
comment: The paper has been submitted to EJMCA journal for consideration. Current version is a preprint
☆ Stereo Vision Based Robot for Remote Monitoring with VR Support
The machine vision systems have been playing a significant role in visual monitoring systems. With the help of stereovision and machine learning, it will be able to mimic human-like visual system and behaviour towards the environment. In this paper, we present a stereo vision based 3-DOF robot which will be used to monitor places from remote using cloud server and internet devices. The 3-DOF robot will transmit human-like head movements, i.e., yaw, pitch, roll and produce 3D stereoscopic video and stream it in Real-time. This video stream is sent to the user through any generic internet devices with VR box support, i.e., smartphones giving the user a First-person real-time 3D experience and transfers the head motion of the user to the robot also in Real-time. The robot will also be able to track moving objects and faces as a target using deep neural networks which enables it to be a standalone monitoring robot. The user will be able to choose specific subjects to monitor in a space. The stereovision enables us to track the depth information of different objects detected and will be used to track human interest objects with its distances and sent to the cloud. A full working prototype is developed which showcases the capabilities of a monitoring system based on stereo vision, robotics, and machine learning.
comment: 6 Pages, 10 Figures
☆ High-resolution segmentations of the hypothalamus and its subregions for training of segmentation models
Segmentation of brain structures on magnetic resonance imaging (MRI) is a highly relevant neuroimaging topic, as it is a prerequisite for different analyses such as volumetry or shape analysis. Automated segmentation facilitates the study of brain structures in larger cohorts when compared with manual segmentation, which is time-consuming. However, the development of most automated methods relies on large and manually annotated datasets, which limits the generalizability of these methods. Recently, new techniques using synthetic images have emerged, reducing the need for manual annotation. Here we provide HELM, Hypothalamic ex vivo Label Maps, a dataset composed of label maps built from publicly available ultra-high resolution ex vivo MRI from 10 whole hemispheres, which can be used to develop segmentation methods using synthetic data. The label maps are obtained with a combination of manual labels for the hypothalamic regions and automated segmentations for the rest of the brain, and mirrored to simulate entire brains. We also provide the pre-processed ex vivo scans, as this dataset can support future projects to include other structures after these are manually segmented.
☆ GAPNet: Granularity Attention Network with Anatomy-Prior-Constraint for Carotid Artery Segmentation
Atherosclerosis is a chronic, progressive disease that primarily affects the arterial walls. It is one of the major causes of cardiovascular disease. Magnetic Resonance (MR) black-blood vessel wall imaging (BB-VWI) offers crucial insights into vascular disease diagnosis by clearly visualizing vascular structures. However, the complex anatomy of the neck poses challenges in distinguishing the carotid artery (CA) from surrounding structures, especially with changes like atherosclerosis. In order to address these issues, we propose GAPNet, which is a consisting of a novel geometric prior deduced from.
☆ ManiWAV: Learning Robot Manipulation from In-the-Wild Audio-Visual Data
Audio signals provide rich information for the robot interaction and object properties through contact. These information can surprisingly ease the learning of contact-rich robot manipulation skills, especially when the visual information alone is ambiguous or incomplete. However, the usage of audio data in robot manipulation has been constrained to teleoperated demonstrations collected by either attaching a microphone to the robot or object, which significantly limits its usage in robot learning pipelines. In this work, we introduce ManiWAV: an 'ear-in-hand' data collection device to collect in-the-wild human demonstrations with synchronous audio and visual feedback, and a corresponding policy interface to learn robot manipulation policy directly from the demonstrations. We demonstrate the capabilities of our system through four contact-rich manipulation tasks that require either passively sensing the contact events and modes, or actively sensing the object surface materials and states. In addition, we show that our system can generalize to unseen in-the-wild environments, by learning from diverse in-the-wild human demonstrations. Project website: https://mani-wav.github.io/
☆ Efficient and Distributed Large-Scale 3D Map Registration using Tomographic Features
A robust, resource-efficient, distributed, and minimally parameterized 3D map matching and merging algorithm is proposed. The suggested algorithm utilizes tomographic features from 2D projections of horizontal cross-sections of gravity-aligned local maps, and matches these projection slices at all possible height differences, enabling the estimation of four degrees of freedom in an efficient and parallelizable manner. The advocated algorithm improves state-of-the-art feature extraction and registration pipelines by an order of magnitude in memory use and execution time. Experimental studies are offered to investigate the efficiency of this 3D map merging scheme.
comment: Submitted to Elsevier Journal: Robotics and Autonomous Systems (RAS)
☆ A Sanity Check for AI-generated Image Detection
With the rapid development of generative models, discerning AI-generated content has evoked increasing attention from both industry and academia. In this paper, we conduct a sanity check on "whether the task of AI-generated image detection has been solved". To start with, we present Chameleon dataset, consisting AIgenerated images that are genuinely challenging for human perception. To quantify the generalization of existing methods, we evaluate 9 off-the-shelf AI-generated image detectors on Chameleon dataset. Upon analysis, almost all models classify AI-generated images as real ones. Later, we propose AIDE (AI-generated Image DEtector with Hybrid Features), which leverages multiple experts to simultaneously extract visual artifacts and noise patterns. Specifically, to capture the high-level semantics, we utilize CLIP to compute the visual embedding. This effectively enables the model to discern AI-generated images based on semantics or contextual information; Secondly, we select the highest frequency patches and the lowest frequency patches in the image, and compute the low-level patchwise features, aiming to detect AI-generated images by low-level artifacts, for example, noise pattern, anti-aliasing, etc. While evaluating on existing benchmarks, for example, AIGCDetectBenchmark and GenImage, AIDE achieves +3.5% and +4.6% improvements to state-of-the-art methods, and on our proposed challenging Chameleon benchmarks, it also achieves the promising results, despite this problem for detecting AI-generated images is far from being solved. The dataset, codes, and pre-train models will be published at https://github.com/shilinyan99/AIDE.
comment: Project page: https://shilinyan99.github.io/AIDE Code: https://github.com/shilinyan99/AIDE
♻ ☆ Towards Semantic Equivalence of Tokenization in Multimodal LLM
Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in processing vision-language tasks. One of the crux of MLLMs lies in vision tokenization, which involves efficiently transforming input visual signals into feature representations that are most beneficial for LLMs. However, existing vision tokenizers, essential for semantic alignment between vision and language, remain problematic. Existing methods aggressively fragment visual input, corrupting the visual semantic integrity. To address this, this paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok), which groups visual features into semantic units via a dynamic clustering algorithm, flexibly determining the number of tokens based on image complexity. The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features. The proposed MLLM (Setokim) equipped with SeTok significantly demonstrates superior performance across various tasks, as evidenced by our experimental results. The project page is at https://chocowu.github.io/SeTok-web/.
comment: Technical Report. The project page: https://chocowu.github.io/SeTok-web/
♻ ☆ Physics-Guided Neural Networks for Intraventricular Vector Flow Mapping
Intraventricular vector flow mapping (iVFM) seeks to enhance and quantify color Doppler in cardiac imaging. In this study, we propose novel alternatives to the traditional iVFM optimization scheme by utilizing physics-informed neural networks (PINNs) and a physics-guided nnU-Net-based supervised approach. When evaluated on simulated color Doppler images derived from a patient-specific computational fluid dynamics model and in vivo Doppler acquisitions, both approaches demonstrate comparable reconstruction performance to the original iVFM algorithm. The efficiency of PINNs is boosted through dual-stage optimization and pre-optimized weights. On the other hand, the nnU-Net method excels in generalizability and real-time capabilities. Notably, nnU-Net shows superior robustness on sparse and truncated Doppler data while maintaining independence from explicit boundary conditions. Overall, our results highlight the effectiveness of these methods in reconstructing intraventricular vector blood flow. The study also suggests potential applications of PINNs in ultrafast color Doppler imaging and the incorporation of fluid dynamics equations to derive biomarkers for cardiovascular diseases based on blood flow.
comment: 12 pages, accepted for publication in IEEE TUFFC; camera ready corrections, corrected acknowledgments
♻ ☆ VDebugger: Harnessing Execution Feedback for Debugging Visual Programs
Visual programs are executable code generated by large language models to address visual reasoning problems. They decompose complex questions into multiple reasoning steps and invoke specialized models for each step to solve the problems. However, these programs are prone to logic errors, with our preliminary evaluation showing that 58% of the total errors are caused by program logic errors. Debugging complex visual programs remains a major bottleneck for visual reasoning. To address this, we introduce VDebugger, a novel critic-refiner framework trained to localize and debug visual programs by tracking execution step by step. VDebugger identifies and corrects program errors leveraging detailed execution feedback, improving interpretability and accuracy. The training data is generated through an automated pipeline that injects errors into correct visual programs using a novel mask-best decoding technique. Evaluations on six datasets demonstrate VDebugger's effectiveness, showing performance improvements of up to 3.2% in downstream task accuracy. Further studies show VDebugger's ability to generalize to unseen tasks, bringing a notable improvement of 2.3% on the unseen COVR task. Code, data and models are made publicly available at https://github.com/shirley-wu/vdebugger/
comment: update reference
♻ ☆ SpatialBot: Precise Spatial Understanding with Vision Language Models
Vision Language Models (VLMs) have achieved impressive performance in 2D image understanding, however they are still struggling with spatial understanding which is the foundation of Embodied AI. In this paper, we propose SpatialBot for better spatial understanding by feeding both RGB and depth images. Additionally, we have constructed the SpatialQA dataset, which involves multi-level depth-related questions to train VLMs for depth understanding. Finally, we present SpatialBench to comprehensively evaluate VLMs' capabilities in spatial understanding at different levels. Extensive experiments on our spatial-understanding benchmark, general VLM benchmarks and Embodied AI tasks, demonstrate the remarkable improvements of SpatialBot trained on SpatialQA. The model, code and data are available at https://github.com/BAAI-DCAI/SpatialBot.
♻ ☆ Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA ACL 2024
Multipanel images, commonly seen as web screenshots, posters, etc., pervade our daily lives. These images, characterized by their composition of multiple subfigures in distinct layouts, effectively convey information to people. Toward building advanced multimodal AI applications, such as agents that understand complex scenes and navigate through webpages, the skill of multipanel visual reasoning is essential, and a comprehensive evaluation of models in this regard is important. Therefore, we introduce Multipanel Visual Question Answering (MultipanelVQA), a novel benchmark comprising 6,600 triplets of questions, answers, and multipanel images that specifically challenge models in comprehending multipanel images. Our evaluation shows that questions in the MultipanelVQA benchmark pose significant challenges to the state-of-the-art Multimodal Large Language Models (MLLMs) tested, even though humans can attain approximately 99% accuracy on these questions. Distinctively, the MultipanelVQA benchmark features synthetically generated multipanel images specifically crafted to isolate and assess the impact of various factors, such as the layout, on MLLMs' multipanel image comprehension abilities. As a result, in addition to benchmarking the capabilities of MLLMs in understanding multipanel images, we analyze various factors of the multipanel image that affect MLLMs' performance with synthetic data and offer insights for enhancement. Code and data are released at https://sites.google.com/view/multipanelvqa/home.
comment: ACL 2024
♻ ☆ Shortcut Learning in Medical Image Segmentation MICCAI 2024
Shortcut learning is a phenomenon where machine learning models prioritize learning simple, potentially misleading cues from data that do not generalize well beyond the training set. While existing research primarily investigates this in the realm of image classification, this study extends the exploration of shortcut learning into medical image segmentation. We demonstrate that clinical annotations such as calipers, and the combination of zero-padded convolutions and center-cropped training sets in the dataset can inadvertently serve as shortcuts, impacting segmentation accuracy. We identify and evaluate the shortcut learning on two different but common medical image segmentation tasks. In addition, we suggest strategies to mitigate the influence of shortcut learning and improve the generalizability of the segmentation models. By uncovering the presence and implications of shortcuts in medical image segmentation, we provide insights and methodologies for evaluating and overcoming this pervasive challenge and call for attention in the community for shortcuts in segmentation. Our code is public at https://github.com/nina-weng/shortcut_skinseg .
comment: 11 pages, 6 figures, accepted at MICCAI 2024
♻ ☆ S4: Self-Supervised Sensing Across the Spectrum
Satellite image time series (SITS) segmentation is crucial for many applications like environmental monitoring, land cover mapping and agricultural crop type classification. However, training models for SITS segmentation remains a challenging task due to the lack of abundant training data, which requires fine grained annotation. We propose S4 a new self-supervised pre-training approach that significantly reduces the requirement for labeled training data by utilizing two new insights: (a) Satellites capture images in different parts of the spectrum such as radio frequencies, and visible frequencies. (b) Satellite imagery is geo-registered allowing for fine-grained spatial alignment. We use these insights to formulate pre-training tasks in S4. We also curate m2s2-SITS, a large-scale dataset of unlabeled, spatially-aligned, multi-modal and geographic specific SITS that serves as representative pre-training data for S4. Finally, we evaluate S4 on multiple SITS segmentation datasets and demonstrate its efficacy against competing baselines while using limited labeled data.
♻ ☆ Automatic infant 2D pose estimation from videos: comparing seven deep neural network methods
Automatic markerless estimation of infant posture and motion from ordinary videos carries great potential for movement studies "in the wild", facilitating understanding of motor development and massively increasing the chances of early diagnosis of disorders. There is rapid development of human pose estimation methods in computer vision thanks to advances in deep learning and machine learning. However, these methods are trained on datasets featuring adults in different contexts. This work tests and compares seven popular methods (AlphaPose, DeepLabCut/DeeperCut, Detectron2, HRNet, MediaPipe/BlazePose, OpenPose, and ViTPose) on videos of infants in supine position. Surprisingly, all methods except DeepLabCut and MediaPipe have competitive performance without additional finetuning, with ViTPose performing best. Next to standard performance metrics (object keypoint similarity, average precision and recall), we introduce errors expressed in the neck-mid-hip ratio and additionally study missed and redundant detections and the reliability of the internal confidence ratings of the different methods, which are relevant for downstream tasks. Among the networks with competitive performance, only AlphaPose could run close to real time (27 fps) on our machine. We provide documented Docker containers or instructions for all the methods we used, our analysis scripts, and processed data at https://hub.docker.com/u/humanoidsctu and https://osf.io/x465b/.
comment: 21 pages, 3 figures, 14 tables
♻ ☆ SRC-Net: Bi-Temporal Spatial Relationship Concerned Network for Change Detection
Change detection (CD) in remote sensing imagery is a crucial task with applications in environmental monitoring, urban development, and disaster management. CD involves utilizing bi-temporal images to identify changes over time. The bi-temporal spatial relationships between features at the same location at different times play a key role in this process. However, existing change detection networks often do not fully leverage these spatial relationships during bi-temporal feature extraction and fusion. In this work, we propose SRC-Net: a bi-temporal spatial relationship concerned network for CD. The proposed SRC-Net includes a Perception and Interaction Module that incorporates spatial relationships and establishes a cross-branch perception mechanism to enhance the precision and robustness of feature extraction. Additionally, a Patch-Mode joint Feature Fusion Module is introduced to address information loss in current methods. It considers different change modes and concerns about spatial relationships, resulting in more expressive fusion features. Furthermore, we construct a novel network using these two relationship concerned modules and conducted experiments on the LEVIR-CD and WHU Building datasets. The experimental results demonstrate that our network outperforms state-of-the-art (SOTA) methods while maintaining a modest parameter count. We believe our approach sets a new paradigm for change detection and will inspire further advancements in the field. The code and models are publicly available at https://github.com/Chnja/SRCNet.
comment: 13 pages, 12 figures, IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (2024)
♻ ☆ VLSM-Adapter: Finetuning Vision-Language Segmentation Efficiently with Lightweight Blocks MICCAI 2024
Foundation Vision-Language Models (VLMs) trained using large-scale open-domain images and text pairs have recently been adapted to develop Vision-Language Segmentation Models (VLSMs) that allow providing text prompts during inference to guide image segmentation. If robust and powerful VLSMs can be built for medical images, it could aid medical professionals in many clinical tasks where they must spend substantial time delineating the target structure of interest. VLSMs for medical images resort to fine-tuning base VLM or VLSM pretrained on open-domain natural image datasets due to fewer annotated medical image datasets; this fine-tuning is resource-consuming and expensive as it usually requires updating all or a significant fraction of the pretrained parameters. Recently, lightweight blocks called adapters have been proposed in VLMs that keep the pretrained model frozen and only train adapters during fine-tuning, substantially reducing the computing resources required. We introduce a novel adapter, VLSM-Adapter, that can fine-tune pretrained vision-language segmentation models using transformer encoders. Our experiments in widely used CLIP-based segmentation models show that with only 3 million trainable parameters, the VLSM-Adapter outperforms state-of-the-art and is comparable to the upper bound end-to-end fine-tuning. The source code is available at: https://github.com/naamiinepal/vlsm-adapter.
comment: Accepted at MICCAI 2024, the 27th International Conference on Medical Image Computing and Computer Assisted Intervention
♻ ☆ MMGPL: Multimodal Medical Data Analysis with Graph Prompt Learning
Prompt learning has demonstrated impressive efficacy in the fine-tuning of multimodal large models to a wide range of downstream tasks. Nonetheless, applying existing prompt learning methods for the diagnosis of neurological disorder still suffers from two issues: (i) existing methods typically treat all patches equally, despite the fact that only a small number of patches in neuroimaging are relevant to the disease, and (ii) they ignore the structural information inherent in the brain connection network which is crucial for understanding and diagnosing neurological disorders. To tackle these issues, we introduce a novel prompt learning model by learning graph prompts during the fine-tuning process of multimodal large models for diagnosing neurological disorders. Specifically, we first leverage GPT-4 to obtain relevant disease concepts and compute semantic similarity between these concepts and all patches. Secondly, we reduce the weight of irrelevant patches according to the semantic similarity between each patch and disease-related concepts. Moreover, we construct a graph among tokens based on these concepts and employ a graph convolutional network layer to extract the structural information of the graph, which is used to prompt the pre-trained multimodal large models for diagnosing neurological disorders. Extensive experiments demonstrate that our method achieves superior performance for neurological disorder diagnosis compared with state-of-the-art methods and validated by clinicians.
♻ ☆ Self-Supervised Detection of Perfect and Partial Input-Dependent Symmetries
Group equivariance can overly constrain models if the symmetries in the group differ from those observed in data. While common methods address this by determining the appropriate level of symmetry at the dataset level, they are limited to supervised settings and ignore scenarios in which multiple levels of symmetry co-exist in the same dataset. In this paper, we propose a method able to detect the level of symmetry of each input without the need for labels. Our framework is general enough to accommodate different families of both continuous and discrete symmetry distributions, such as arbitrary unimodal, symmetric distributions and discrete groups. We validate the effectiveness of our approach on synthetic datasets with different per-class levels of symmetries, and demonstrate practical applications such as the detection of out-of-distribution symmetries. Our code is publicly available at https://github.com/aurban0/ssl-sym.
comment: 19 pages, 8 figures, corrected typos, revised argument in Appendix B.1, results unchanged
♻ ☆ WsiCaption: Multiple Instance Generation of Pathology Reports for Gigapixel Whole-Slide Images
Whole slide images are the foundation of digital pathology for the diagnosis and treatment of carcinomas. Writing pathology reports is laborious and error-prone for inexperienced pathologists. To reduce the workload and improve clinical automation, we investigate how to generate pathology reports given whole slide images. On the data end, we curated the largest WSI-text dataset (PathText). In specific, we collected nearly 10000 high-quality WSI-text pairs for visual-language models by recognizing and cleaning pathology reports which narrate diagnostic slides in TCGA. On the model end, we propose the multiple instance generative model (MI-Gen) which can produce pathology reports for gigapixel WSIs. We benchmark our model on the largest subset of TCGA-PathoText. Experimental results show our model can generate pathology reports which contain multiple clinical clues and achieve competitive performance on certain slide-level tasks. We observe that simple semantic extraction from the pathology reports can achieve the best performance (0.838 of F1 score) on BRCA subtyping surpassing previous state-of-the-art approaches. Our collected dataset and related code are available.
♻ ☆ BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning
The recent progress in Large Language Models (LLM) has spurred various advancements in image-language conversation agents, while how to build a proficient video-based dialogue system is still under exploration. Considering the extensive scale of LLM and visual backbone, minimal GPU memory is left for facilitating effective temporal modeling, which is crucial for comprehending and providing feedback on videos. To this end, we propose Branching Temporal Adapter (BT-Adapter), a novel method for extending image-language pretrained models into the video domain. Specifically, BT-Adapter serves as a plug-and-use temporal modeling branch alongside the pretrained visual encoder, which is tuned while keeping the backbone frozen. Just pretrained once, BT-Adapter can be seamlessly integrated into all image conversation models using this version of CLIP, enabling video conversations without the need for video instructions. Besides, we develop a unique asymmetric token masking strategy inside the branch with tailor-made training tasks for BT-Adapter, facilitating faster convergence and better results. Thanks to BT-Adapter, we are able to empower existing multimodal dialogue models with strong video understanding capabilities without incurring excessive GPU costs. Without bells and whistles, BT-Adapter achieves (1) state-of-the-art zero-shot results on various video tasks using thousands of fewer GPU hours. (2) better performance than current video chatbots without any video instruction tuning. (3) state-of-the-art results of video chatting using video instruction tuning, outperforming previous SOTAs by a large margin.
♻ ☆ Examining Common Paradigms in Multi-Task Learning
While multi-task learning (MTL) has gained significant attention in recent years, its underlying mechanisms remain poorly understood. Recent methods did not yield consistent performance improvements over single task learning (STL) baselines, underscoring the importance of gaining more profound insights about challenges specific to MTL. In our study, we investigate paradigms in MTL in the context of STL: First, the impact of the choice of optimizer has only been mildly investigated in MTL. We show the pivotal role of common STL tools such as the Adam optimizer in MTL empirically in various experiments. To further investigate Adam's effectiveness, we theoretical derive a partial loss-scale invariance under mild assumptions. Second, the notion of gradient conflicts has often been phrased as a specific problem in MTL. We delve into the role of gradient conflicts in MTL and compare it to STL. For angular gradient alignment we find no evidence that this is a unique problem in MTL. We emphasize differences in gradient magnitude as the main distinguishing factor. Overall, we find surprising similarities between STL and MTL suggesting to consider methods from both fields in a broader context.
comment: -
♻ ☆ Regularized Newton Raphson Inversion for Text-to-Image Diffusion Models
Diffusion inversion is the problem of taking an image and a text prompt that describes it and finding a noise latent that would generate the image. Most current inversion techniques operate by approximately solving an implicit equation and may converge slowly or yield poor reconstructed images. Here, we formulate the problem as finding the roots of an implicit equation and design a method to solve it efficiently. Our solution is based on Newton-Raphson (NR), a well-known technique in numerical analysis. A naive application of NR may be computationally infeasible and tends to converge to incorrect solutions. We describe an efficient regularized formulation that converges quickly to a solution that provides high-quality reconstructions. We also identify a source of inconsistency stemming from prompt conditioning during the inversion process, which significantly degrades the inversion quality. To address this, we introduce a prompt-aware adjustment of the encoding, effectively correcting this issue. Our solution, Regularized Newton-Raphson Inversion, inverts an image within 0.5 sec for latent consistency models, opening the door for interactive image editing. We further demonstrate improved results in image interpolation and generation of rare objects.
♻ ☆ Stable Diffusion Segmentation for Biomedical Images with Single-step Reverse Process MICCAI 2024
Diffusion models have demonstrated their effectiveness across various generative tasks. However, when applied to medical image segmentation, these models encounter several challenges, including significant resource and time requirements. They also necessitate a multi-step reverse process and multiple samples to produce reliable predictions. To address these challenges, we introduce the first latent diffusion segmentation model, named SDSeg, built upon stable diffusion (SD). SDSeg incorporates a straightforward latent estimation strategy to facilitate a single-step reverse process and utilizes latent fusion concatenation to remove the necessity for multiple samples. Extensive experiments indicate that SDSeg surpasses existing state-of-the-art methods on five benchmark datasets featuring diverse imaging modalities. Remarkably, SDSeg is capable of generating stable predictions with a solitary reverse step and sample, epitomizing the model's stability as implied by its name. The code is available at https://github.com/lin-tianyu/Stable-Diffusion-Seg
comment: Accepted at MICCAI 2024. Code and citation info see https://github.com/lin-tianyu/Stable-Diffusion-Seg
♻ ☆ Bayesian Uncertainty Estimation by Hamiltonian Monte Carlo: Applications to Cardiac MRI Segmentation
Deep learning (DL)-based methods have achieved state-of-the-art performance for many medical image segmentation tasks. Nevertheless, recent studies show that deep neural networks (DNNs) can be miscalibrated and overconfident, leading to "silent failures" that are risky for clinical applications. Bayesian DL provides an intuitive approach to DL failure detection, based on posterior probability estimation. However, the posterior is intractable for large medical image segmentation DNNs. To tackle this challenge, we propose a Bayesian learning framework using Hamiltonian Monte Carlo (HMC), tempered by cold posterior (CP) to accommodate medical data augmentation, named HMC-CP. For HMC computation, we further propose a cyclical annealing strategy, capturing both local and global geometries of the posterior distribution, enabling highly efficient Bayesian DNN training with the same computational budget as training a single DNN. The resulting Bayesian DNN outputs an ensemble segmentation along with the segmentation uncertainty. We evaluate the proposed HMC-CP extensively on cardiac magnetic resonance image (MRI) segmentation, using in-domain steady-state free precession (SSFP) cine images as well as out-of-domain datasets of quantitative T1 and T2 mapping. Our results show that the proposed method improves both segmentation accuracy and uncertainty estimation for in- and out-of-domain data, compared with well-established baseline methods such as Monte Carlo Dropout and Deep Ensembles. Additionally, we establish a conceptual link between HMC and the commonly known stochastic gradient descent (SGD) and provide general insight into the uncertainty of DL. This uncertainty is implicitly encoded in the training dynamics but often overlooked. With reliable uncertainty estimation, our method provides a promising direction toward trustworthy DL in clinical applications.
comment: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2024:011
♻ ☆ Optimal Transport Aggregation for Visual Place Recognition
The task of Visual Place Recognition (VPR) aims to match a query image against references from an extensive database of images from different places, relying solely on visual cues. State-of-the-art pipelines focus on the aggregation of features extracted from a deep backbone, in order to form a global descriptor for each image. In this context, we introduce SALAD (Sinkhorn Algorithm for Locally Aggregated Descriptors), which reformulates NetVLAD's soft-assignment of local features to clusters as an optimal transport problem. In SALAD, we consider both feature-to-cluster and cluster-to-feature relations and we also introduce a 'dustbin' cluster, designed to selectively discard features deemed non-informative, enhancing the overall descriptor quality. Additionally, we leverage and fine-tune DINOv2 as a backbone, which provides enhanced description power for the local features, and dramatically reduces the required training time. As a result, our single-stage method not only surpasses single-stage baselines in public VPR datasets, but also surpasses two-stage methods that add a re-ranking with significantly higher cost. Code and models are available at https://github.com/serizba/salad.
♻ ☆ Towards AI Lesion Tracking in PET/CT Imaging: A Siamese-based CNN Pipeline applied on PSMA PET/CT Scans
Assessing tumor response to systemic therapies is one of the main applications of PET/CT. Routinely, only a small subset of index lesions out of multiple lesions is analyzed. However, this operator dependent selection may bias the results due to possible significant inter-metastatic heterogeneity of response to therapy. Automated, AI based approaches for lesion tracking hold promise in enabling the analysis of many more lesions and thus providing a better assessment of tumor response. This work introduces a Siamese CNN approach for lesion tracking between PET/CT scans. Our approach is applied on the laborious task of tracking a high number of bone lesions in full-body baseline and follow-up [68Ga]Ga- or [18F]F-PSMA PET/CT scans after two cycles of [177Lu]Lu-PSMA therapy of metastatic castration resistant prostate cancer patients. Data preparation includes lesion segmentation and affine registration. Our algorithm extracts suitable lesion patches and forwards them into a Siamese CNN trained to classify the lesion patch pairs as corresponding or non-corresponding lesions. Experiments have been performed with different input patch types and a Siamese network in 2D and 3D. The CNN model successfully learned to classify lesion assignments, reaching a lesion tracking accuracy of 83 % in its best configuration with an AUC = 0.91. For remaining lesions the pipeline accomplished a re-identification rate of 89 %. We proved that a CNN may facilitate the tracking of multiple lesions in PSMA PET/CT scans. Future clinical studies are necessary if this improves the prediction of the outcome of therapies.
comment: 25 pages, 9 figures, 3 tables
♻ ☆ Depth-Driven Geometric Prompt Learning for Laparoscopic Liver Landmark Detection MICCAI 2024
Laparoscopic liver surgery poses a complex intraoperative dynamic environment for surgeons, where remains a significant challenge to distinguish critical or even hidden structures inside the liver. Liver anatomical landmarks, e.g., ridge and ligament, serve as important markers for 2D-3D alignment, which can significantly enhance the spatial perception of surgeons for precise surgery. To facilitate the detection of laparoscopic liver landmarks, we collect a novel dataset called L3D, which comprises 1,152 frames with elaborated landmark annotations from surgical videos of 39 patients across two medical sites. For benchmarking purposes, 12 mainstream detection methods are selected and comprehensively evaluated on L3D. Further, we propose a depth-driven geometric prompt learning network, namely D2GPLand. Specifically, we design a Depth-aware Prompt Embedding (DPE) module that is guided by self-supervised prompts and generates semantically relevant geometric information with the benefit of global depth cues extracted from SAM-based features. Additionally, a Semantic-specific Geometric Augmentation (SGA) scheme is introduced to efficiently merge RGB-D spatial and geometric information through reverse anatomic perception. The experimental results indicate that D2GPLand obtains state-of-the-art performance on L3D, with 63.52% DICE and 48.68% IoU scores. Together with 2D-3D fusion technology, our method can directly provide the surgeon with intuitive guidance information in laparoscopic scenarios.
comment: This paper has been accepted by MICCAI 2024
♻ ☆ Continuous 3D Myocardial Motion Tracking via Echocardiography
Myocardial motion tracking stands as an essential clinical tool in the prevention and detection of cardiovascular diseases (CVDs), the foremost cause of death globally. However, current techniques suffer from incomplete and inaccurate motion estimation of the myocardium in both spatial and temporal dimensions, hindering the early identification of myocardial dysfunction. To address these challenges, this paper introduces the Neural Cardiac Motion Field (NeuralCMF). NeuralCMF leverages implicit neural representation (INR) to model the 3D structure and the comprehensive 6D forward/backward motion of the heart. This method surpasses pixel-wise limitations by offering the capability to continuously query the precise shape and motion of the myocardium at any specific point throughout the cardiac cycle, enhancing the detailed analysis of cardiac dynamics beyond traditional speckle tracking. Notably, NeuralCMF operates without the need for paired datasets, and its optimization is self-supervised through the physics knowledge priors in both space and time dimensions, ensuring compatibility with both 2D and 3D echocardiogram video inputs. Experimental validations across three representative datasets support the robustness and innovative nature of the NeuralCMF, marking significant advantages over existing state-of-the-art methods in cardiac imaging and motion tracking.
comment: 18 pages, 11 figures
♻ ☆ DeepFake-O-Meter v2.0: An Open Platform for DeepFake Detection
Deepfakes, as AI-generated media, have increasingly threatened media integrity and personal privacy with realistic yet fake digital content. In this work, we introduce an open-source and user-friendly online platform, DeepFake-O-Meter v2.0, that integrates state-of-the-art methods for detecting Deepfake images, videos, and audio. Built upon DeepFake-O-Meter v1.0, we have made significant upgrades and improvements in platform architecture design, including user interaction, detector integration, job balancing, and security management. The platform aims to offer everyday users a convenient service for analyzing DeepFake media using multiple state-of-the-art detection algorithms. It ensures secure and private delivery of the analysis results. Furthermore, it serves as an evaluation and benchmarking platform for researchers in digital media forensics to compare the performance of multiple algorithms on the same input. We have also conducted detailed usage analysis based on the collected data to gain deeper insights into our platform's statistics. This involves analyzing two-month trends in user activity and evaluating the processing efficiency of each detector.
♻ ☆ AdaTreeFormer: Few Shot Domain Adaptation for Tree Counting from a Single High-Resolution Image
The process of estimating and counting tree density using only a single aerial or satellite image is a difficult task in the fields of photogrammetry and remote sensing. However, it plays a crucial role in the management of forests. The huge variety of trees in varied topography severely hinders tree counting models to perform well. The purpose of this paper is to propose a framework that is learnt from the source domain with sufficient labeled trees and is adapted to the target domain with only a limited number of labeled trees. Our method, termed as AdaTreeFormer, contains one shared encoder with a hierarchical feature extraction scheme to extract robust features from the source and target domains. It also consists of three subnets: two for extracting self-domain attention maps from source and target domains respectively and one for extracting cross-domain attention maps. For the latter, an attention-to-adapt mechanism is introduced to distill relevant information from different domains while generating tree density maps; a hierarchical cross-domain feature alignment scheme is proposed that progressively aligns the features from the source and target domains. We also adopt adversarial learning into the framework to further reduce the gap between source and target domains. Our AdaTreeFormer is evaluated on six designed domain adaptation tasks using three tree counting datasets, \ie Jiangsu, Yosemite, and London. Experimental results show that AdaTreeFormer significantly surpasses the state of the art, \eg in the cross domain from the Yosemite to Jiangsu dataset, it achieves a reduction of 15.9 points in terms of the absolute counting errors and an increase of 10.8\% in the accuracy of the detected trees' locations. The codes and datasets are available at https://github.com/HAAClassic/AdaTreeFormer.
♻ ☆ EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation CVPR 2024
In this report, we present our solutions to the EgoVis Challenges in CVPR 2024, including five tracks in the Ego4D challenge and three tracks in the EPIC-Kitchens challenge. Building upon the video-language two-tower model and leveraging our meticulously organized egocentric video data, we introduce a novel foundation model called EgoVideo. This model is specifically designed to cater to the unique characteristics of egocentric videos and provides strong support for our competition submissions. In the Ego4D challenges, we tackle various tasks including Natural Language Queries, Step Grounding, Moment Queries, Short-term Object Interaction Anticipation, and Long-term Action Anticipation. In addition, we also participate in the EPIC-Kitchens challenge, where we engage in the Action Recognition, Multiple Instance Retrieval, and Domain Adaptation for Action Recognition tracks. By adapting EgoVideo to these diverse tasks, we showcase its versatility and effectiveness in different egocentric video analysis scenarios, demonstrating the powerful representation ability of EgoVideo as an egocentric foundation model. Our codebase and pretrained models are publicly available at https://github.com/OpenGVLab/EgoVideo.
comment: Champion solutions in the EgoVis CVPR 2024 workshop
♻ ☆ BSDA: Bayesian Random Semantic Data Augmentation for Medical Image Classification
Data augmentation is a crucial regularization technique for deep neural networks, particularly in medical image classification. Mainstream data augmentation (DA) methods are usually applied at the image level. Due to the specificity and diversity of medical imaging, expertise is often required to design effective DA strategies, and improper augmentation operations can degrade model performance. Although automatic augmentation methods exist, they are computationally intensive. Semantic data augmentation can implemented by translating features in feature space. However, over-translation may violate the image label. To address these issues, we propose \emph{Bayesian Random Semantic Data Augmentation} (BSDA), a computationally efficient and handcraft-free feature-level DA method. BSDA uses variational Bayesian to estimate the distribution of the augmentable magnitudes, and then a sample from this distribution is added to the original features to perform semantic data augmentation. We performed experiments on nine 2D and five 3D medical image datasets. Experimental results show that BSDA outperforms current DA methods. Additionally, BSDA can be easily assembled into CNNs or Transformers as a plug-and-play module, improving the network's performance. The code is available online at \url{https://github.com/YaoyaoZhu19/BSDA}.
♻ ☆ Inference Attacks: A Taxonomy, Survey, and Promising Directions
The prosperity of machine learning has also brought people's concerns about data privacy. Among them, inference attacks can implement privacy breaches in various MLaaS scenarios and model training/prediction phases. Specifically, inference attacks can perform privacy inference on undisclosed target training sets based on outputs of the target model, including but not limited to statistics, membership, semantics, data representation, etc. For instance, infer whether the target data has the characteristics of AIDS. In addition, the rapid development of the machine learning community in recent years, especially the surge of model types and application scenarios, has further stimulated the inference attacks' research. Thus, studying inference attacks and analyzing them in depth is urgent and significant. However, there is still a gap in the systematic discussion of inference attacks from taxonomy, global perspective, attack, and defense perspectives. This survey provides an in-depth and comprehensive inference of attacks and corresponding countermeasures in ML-as-a-service based on taxonomy and the latest researches. Without compromising researchers' intuition, we first propose the 3MP taxonomy based on the community research status, trying to normalize the confusing naming system of inference attacks. Also, we analyze the pros and cons of each type of inference attack, their workflow, countermeasure, and how they interact with other attacks. In the end, we point out several promising directions for researchers from a more comprehensive and novel perspective.
♻ ☆ AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising
Diffusion models have garnered significant interest from the community for their great generative ability across various applications. However, their typical multi-step sequential-denoising nature gives rise to high cumulative latency, thereby precluding the possibilities of parallel computation. To address this, we introduce AsyncDiff, a universal and plug-and-play acceleration scheme that enables model parallelism across multiple devices. Our approach divides the cumbersome noise prediction model into multiple components, assigning each to a different device. To break the dependency chain between these components, it transforms the conventional sequential denoising into an asynchronous process by exploiting the high similarity between hidden states in consecutive diffusion steps. Consequently, each component is facilitated to compute in parallel on separate devices. The proposed strategy significantly reduces inference latency while minimally impacting the generative quality. Specifically, for the Stable Diffusion v2.1, AsyncDiff achieves a 2.7x speedup with negligible degradation and a 4.0x speedup with only a slight reduction of 0.38 in CLIP Score, on four NVIDIA A5000 GPUs. Our experiments also demonstrate that AsyncDiff can be readily applied to video diffusion models with encouraging performances. The code is available at https://github.com/czg1225/AsyncDiff.
comment: Work in progress. Project Page: https://czg1225.github.io/asyncdiff_page/
♻ ☆ Advancing Video Anomaly Detection: A Concise Review and a New Dataset
Video Anomaly Detection (VAD) finds widespread applications in security surveillance, traffic monitoring, industrial monitoring, and healthcare. Despite extensive research efforts, there remains a lack of concise reviews that provide insightful guidance for researchers. Such reviews would serve as quick references to grasp current challenges, research trends, and future directions. In this paper, we present such a review, examining models and datasets from various perspectives. We emphasize the critical relationship between model and dataset, where the quality and diversity of datasets profoundly influence model performance, and dataset development adapts to the evolving needs of emerging approaches. Our review identifies practical issues, including the absence of comprehensive datasets with diverse scenarios. To address this, we introduce a new dataset, Multi-Scenario Anomaly Detection (MSAD), comprising 14 distinct scenarios captured from various camera views. Our dataset has diverse motion patterns and challenging variations, such as different lighting and weather conditions, providing a robust foundation for training superior models. We conduct an in-depth analysis of recent representative models using MSAD and highlight its potential in addressing the challenges of detecting anomalies across diverse and evolving surveillance scenarios. Our dataset is available here.
comment: Research report
♻ ☆ MixerFlow: MLP-Mixer meets Normalising Flows ECML-PKDD 2024
Normalising flows are generative models that transform a complex density into a simpler density through the use of bijective transformations enabling both density estimation and data generation from a single model. %However, the requirement for bijectivity imposes the use of specialised architectures. In the context of image modelling, the predominant choice has been the Glow-based architecture, whereas alternative architectures remain largely unexplored in the research community. In this work, we propose a novel architecture called MixerFlow, based on the MLP-Mixer architecture, further unifying the generative and discriminative modelling architectures. MixerFlow offers an efficient mechanism for weight sharing for flow-based models. Our results demonstrate comparative or superior density estimation on image datasets and good scaling as the image resolution increases, making MixerFlow a simple yet powerful alternative to the Glow-based architectures. We also show that MixerFlow provides more informative embeddings than Glow-based architectures and can integrate many structured transformations such as splines or Kolmogorov-Arnold Networks.
comment: Alternative title: MixerFlow for Image Modelling; Accepted at ECML-PKDD 2024
♻ ☆ EVALALIGN: Supervised Fine-Tuning Multimodal LLMs with Human-Aligned Data for Evaluating Text-to-Image Models
The recent advancements in text-to-image generative models have been remarkable. Yet, the field suffers from a lack of evaluation metrics that accurately reflect the performance of these models, particularly lacking fine-grained metrics that can guide the optimization of the models. In this paper, we propose EvalAlign, a metric characterized by its accuracy, stability, and fine granularity. Our approach leverages the capabilities of Multimodal Large Language Models (MLLMs) pre-trained on extensive datasets. We develop evaluation protocols that focus on two key dimensions: image faithfulness and text-image alignment. Each protocol comprises a set of detailed, fine-grained instructions linked to specific scoring options, enabling precise manual scoring of the generated images. We Supervised Fine-Tune (SFT) the MLLM to align closely with human evaluative judgments, resulting in a robust evaluation model. Our comprehensive tests across 24 text-to-image generation models demonstrate that EvalAlign not only provides superior metric stability but also aligns more closely with human preferences than existing metrics, confirming its effectiveness and utility in model assessment.
comment: Github Repository: https://github.com/SAIS-FUXI/EvalAlign
♻ ☆ DiffExplainer: Unveiling Black Box Models Via Counterfactual Generation MICCAI 2024
In the field of medical imaging, particularly in tasks related to early disease detection and prognosis, understanding the reasoning behind AI model predictions is imperative for assessing their reliability. Conventional explanation methods encounter challenges in identifying decisive features in medical image classifications, especially when discriminative features are subtle or not immediately evident. To address this limitation, we propose an agent model capable of generating counterfactual images that prompt different decisions when plugged into a black box model. By employing this agent model, we can uncover influential image patterns that impact the black model's final predictions. Through our methodology, we efficiently identify features that influence decisions of the deep black box. We validated our approach in the rigorous domain of medical prognosis tasks, showcasing its efficacy and potential to enhance the reliability of deep learning models in medical image classification compared to existing interpretation methods. The code will be publicly available at https://github.com/ayanglab/DiffExplainer.
comment: MICCAI 2024
♻ ☆ Transfer Learning in ECG Diagnosis: Is It Effective?
The adoption of deep learning in ECG diagnosis is often hindered by the scarcity of large, well-labeled datasets in real-world scenarios, leading to the use of transfer learning to leverage features learned from larger datasets. Yet the prevailing assumption that transfer learning consistently outperforms training from scratch has never been systematically validated. In this study, we conduct the first extensive empirical study on the effectiveness of transfer learning in multi-label ECG classification, by investigating comparing the fine-tuning performance with that of training from scratch, covering a variety of ECG datasets and deep neural networks. We confirm that fine-tuning is the preferable choice for small downstream datasets; however, when the dataset is sufficiently large, training from scratch can achieve comparable performance, albeit requiring a longer training time to catch up. Furthermore, we find that transfer learning exhibits better compatibility with convolutional neural networks than with recurrent neural networks, which are the two most prevalent architectures for time-series ECG applications. Our results underscore the importance of transfer learning in ECG diagnosis, yet depending on the amount of available data, researchers may opt not to use it, considering the non-negligible cost associated with pre-training.
♻ ☆ Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases
Large Vision-Language Models (LVLMs) have received widespread attention in advancing the interpretable self-driving. Existing evaluations of LVLMs primarily focus on the multi-faceted capabilities in natural circumstances, lacking automated and quantifiable assessment for self-driving, let alone the severe road corner cases. In this paper, we propose CODA-LM, the very first benchmark for the automatic evaluation of LVLMs for self-driving corner cases. We adopt a hierarchical data structure to prompt powerful LVLMs to analyze complex driving scenes and generate high-quality pre-annotation for human annotators, and for LVLM evaluation, we show that using the text-only large language models (LLMs) as judges reveals even better alignment with human preferences than the LVLM judges. Moreover, with CODA-LM, we build CODA-VLM, a new driving LVLM surpassing all the open-sourced counterparts on CODA-LM. Our CODA-VLM performs comparably with GPT-4V, even surpassing GPT-4V by +21.42% on the regional perception task. We hope CODA-LM can become the catalyst to promote interpretable self-driving empowered by LVLMs.
comment: Project Page: https://coda-dataset.github.io/coda-lm/
♻ ☆ Speeding Up Image Classifiers with Little Companions
Scaling up neural networks has been a key recipe to the success of large language and vision models. However, in practice, up-scaled models can be disproportionately costly in terms of computations, providing only marginal improvements in performance; for example, EfficientViT-L3-384 achieves <2% improvement on ImageNet-1K accuracy over the base L1-224 model, while requiring $14\times$ more multiply-accumulate operations (MACs). In this paper, we investigate scaling properties of popular families of neural networks for image classification, and find that scaled-up models mostly help with "difficult" samples. Decomposing the samples by difficulty, we develop a simple model-agnostic two-pass Little-Big algorithm that first uses a light-weight "little" model to make predictions of all samples, and only passes the difficult ones for the "big" model to solve. Good little companion achieve drastic MACs reduction for a wide variety of model families and scales. Without loss of accuracy or modification of existing models, our Little-Big models achieve MACs reductions of 76% for EfficientViT-L3-384, 81% for EfficientNet-B7-600, 71% for DeiT3-L-384 on ImageNet-1K. Little-Big also speeds up the InternImage-G-512 model by 62% while achieving 90% ImageNet-1K top-1 accuracy, serving both as a strong baseline and as a simple practical method for large model compression.
♻ ☆ Adaptive Critical Subgraph Mining for Cognitive Impairment Conversion Prediction with T1-MRI-based Brain Network
Prediction the conversion to early-stage dementia is critical for mitigating its progression but remains challenging due to subtle cognitive impairments and structural brain changes. Traditional T1-weighted magnetic resonance imaging (T1-MRI) research focus on identifying brain atrophy regions but often fails to address the intricate connectivity between them. This limitation underscores the necessity of focuing on inter-regional connectivity for a comprehensive understand of the brain's complex network. Moreover, there is a pressing demand for methods that adaptively preserve and extract critical information, particularly specialized subgraph mining techniques for brain networks. These are essential for developing high-quality feature representations that reveal critical spatial impacts of structural brain changes and its topology. In this paper, we propose Brain-SubGNN, a novel graph representation network to mine and enhance critical subgraphs based on T1-MRI. This network provides a subgraph-level interpretation, enhancing interpretability and insights for graph analysis. The process begins by extracting node features and a correlation matrix between nodes to construct a task-oriented brain network. Brain-SubGNN then adaptively identifies and enhances critical subgraphs, capturing both loop and neighbor subgraphs. This method reflects the loop topology and local changes, indicative of long-range connections, and maintains local and global brain attributes. Extensive experiments validate the effectiveness and advantages of Brain-SubGNN, demonstrating its potential as a powerful tool for understanding and diagnosing early-stage dementia. Source code is available at https://github.com/Leng-10/Brain-SubGNN.
comment: 20 pages
♻ ☆ Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation
The scaling laws and extraordinary performance of large foundation models motivate the development and utilization of such models in biomedicine. However, despite early promising results on some biomedical benchmarks, there are still major challenges that need to be addressed before these models can be used in real-world clinics. Frontier general-domain models such as GPT-4V still have significant performance gaps in multimodal biomedical applications. More importantly, less-acknowledged pragmatic issues, including accessibility, model cost, and tedious manual evaluation make it hard for clinicians to use state-of-the-art large models directly on private patient data. Here, we explore training open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology. To maximize data efficiency, we adopt a modular approach by incorporating state-of-the-art pre-trained models for image and text modalities, and focusing on training a lightweight adapter to ground each modality to the text embedding space, as exemplified by LLaVA-Med. For training, we assemble a large dataset of over 697 thousand radiology image-text pairs. For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation. For best practice, we conduct a systematic ablation study on various choices in data engineering and multimodal training. The resulting LlaVA-Rad (7B) model attains state-of-the-art results on standard radiology tasks such as report generation and cross-modal retrieval, even outperforming much larger models such as GPT-4V and Med-PaLM M (84B). The inference of LlaVA-Rad is fast and can be performed on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
♻ ☆ Taking Training Seriously: Human Guidance and Management-Based Regulation of Artificial Intelligence
Fervent calls for more robust governance of the harms associated with artificial intelligence (AI) are leading to the adoption around the world of what regulatory scholars have called a management-based approach to regulation. Recent initiatives in the United States and Europe, as well as the adoption of major self-regulatory standards by the International Organization for Standardization, share in common a core management-based paradigm. These management-based initiatives seek to motivate an increase in human oversight of how AI tools are trained and developed. Refinements and systematization of human-guided training techniques will thus be needed to fit within this emerging era of management-based regulatory paradigm. If taken seriously, human-guided training can alleviate some of the technical and ethical pressures on AI, boosting AI performance with human intuition as well as better addressing the needs for fairness and effective explainability. In this paper, we discuss the connection between the emerging management-based regulatory frameworks governing AI and the need for human oversight during training. We broadly cover some of the technical components involved in human-guided training and then argue that the kinds of high-stakes use cases for AI that appear of most concern to regulators should lean more on human-guided training than on data-only training. We hope to foster a discussion between legal scholars and computer scientists involving how to govern a domain of technology that is vast, heterogenous, and dynamic in its applications and risks.
comment: 9 pages, 1 figure
♻ ☆ PlaNet-S: Automatic Semantic Segmentation of Placenta
[Purpose] To develop a fully automated semantic placenta segmentation model that integrates the U-Net and SegNeXt architectures through ensemble learning. [Methods] A total of 218 pregnant women with suspected placental anomalies who underwent magnetic resonance imaging (MRI) were enrolled, yielding 1090 annotated images for developing a deep learning model for placental segmentation. The images were standardized and divided into training and test sets. The performance of PlaNet-S, which integrates U-Net and SegNeXt within an ensemble framework, was assessed using Intersection over Union (IoU) and counting connected components (CCC) against the U-Net model. [Results] PlaNet-S had significantly higher IoU (0.73 +/- 0.13) than that of U-Net (0.78 +/- 0.010) (p<0.01). The CCC for PlaNet-S was significantly higher than that for U-Net (p<0.01), matching the ground truth in 86.0\% and 56.7\% of the cases, respectively. [Conclusion]PlaNet-S performed better than the traditional U-Net in placental segmentation tasks. This model addresses the challenges of time-consuming physician-assisted manual segmentation and offers the potential for diverse applications in placental imaging analyses.
comment: 11 pages, 5 figures, Shinnosuke Yamamoto and Isso Saito equally contributed to this work. In the original submission, there was a typographical error in the reported standard deviation for the Intersection over Union (IoU) values of the PlaNet-S model. The standard deviation was incorrectly listed as 0.01 instead of the correct value of 0.1. This has been corrected in the revised version
♻ ☆ Towards Open-set Camera 3D Object Detection
Traditional camera 3D object detectors are typically trained to recognize a predefined set of known object classes. In real-world scenarios, these detectors may encounter unknown objects outside the training categories and fail to identify them correctly. To address this gap, we present OS-Det3D (Open-set Camera 3D Object Detection), a two-stage training framework enhancing the ability of camera 3D detectors to identify both known and unknown objects. The framework involves our proposed 3D Object Discovery Network (ODN3D), which is specifically trained using geometric cues such as the location and scale of 3D boxes to discover general 3D objects. ODN3D is trained in a class-agnostic manner, and the provided 3D object region proposals inherently come with data noise. To boost accuracy in identifying unknown objects, we introduce a Joint Objectness Selection (JOS) module. JOS selects the pseudo ground truth for unknown objects from the 3D object region proposals of ODN3D by combining the ODN3D objectness and camera feature attention objectness. Experiments on the nuScenes and KITTI datasets demonstrate the effectiveness of our framework in enabling camera 3D detectors to successfully identify unknown objects while also improving their performance on known objects.
♻ ☆ MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning
Multi-modal large language models (MLLMs) have made significant strides in various visual understanding tasks. However, the majority of these models are constrained to process low-resolution images, which limits their effectiveness in perception tasks that necessitate detailed visual information. In our study, we present MG-LLaVA, an innovative MLLM that enhances the model's visual processing capabilities by incorporating a multi-granularity vision flow, which includes low-resolution, high-resolution, and object-centric features. We propose the integration of an additional high-resolution visual encoder to capture fine-grained details, which are then fused with base visual features through a Conv-Gate fusion network. To further refine the model's object recognition abilities, we incorporate object-level features derived from bounding boxes identified by offline detectors. Being trained solely on publicly available multimodal data through instruction tuning, MG-LLaVA demonstrates exceptional perception skills. We instantiate MG-LLaVA with a wide variety of language encoders, ranging from 3.8B to 34B, to evaluate the model's performance comprehensively. Extensive evaluations across multiple benchmarks demonstrate that MG-LLaVA outperforms existing MLLMs of comparable parameter sizes, showcasing its remarkable efficacy. The code will be available at https://github.com/PhoenixZ810/MG-LLaVA.
♻ ☆ XLD: A Cross-Lane Dataset for Benchmarking Novel Driving View Synthesis
Thoroughly testing autonomy systems is crucial in the pursuit of safe autonomous driving vehicles. It necessitates creating safety-critical scenarios that go beyond what can be safely collected from real-world data, as many of these scenarios occur infrequently on public roads. However, the evaluation of most existing NVS methods relies on sporadic sampling of image frames from the training data, comparing the rendered images with ground truth images using metrics. Unfortunately, this evaluation protocol falls short of meeting the actual requirements in closed-loop simulations. Specifically, the true application demands the capability to render novel views that extend beyond the original trajectory (such as cross-lane views), which are challenging to capture in the real world. To address this, this paper presents a novel driving view synthesis dataset and benchmark specifically designed for autonomous driving simulations. This dataset is unique as it includes testing images captured by deviating from the training trajectory by 1-4 meters. It comprises six sequences encompassing various time and weather conditions. Each sequence contains 450 training images, 150 testing images, and their corresponding camera poses and intrinsic parameters. Leveraging this novel dataset, we establish the first realistic benchmark for evaluating existing NVS approaches under front-only and multi-camera settings. The experimental findings underscore the significant gap that exists in current approaches, revealing their inadequate ability to fulfill the demanding prerequisites of cross-lane or closed-loop simulation. Our dataset is released publicly at the project page: https://3d-aigc.github.io/XLD/.
comment: project page: https://3d-aigc.github.io/XLD/
♻ ☆ FDDM: Unsupervised Medical Image Translation with a Frequency-Decoupled Diffusion Model
Diffusion models have demonstrated significant potential in producing high-quality images in medical image translation to aid disease diagnosis, localization, and treatment. Nevertheless, current diffusion models have limited success in achieving faithful image translations that can accurately preserve the anatomical structures of medical images, especially for unpaired datasets. The preservation of structural and anatomical details is essential to reliable medical diagnosis and treatment planning, as structural mismatches can lead to disease misidentification and treatment errors. In this study, we introduce the Frequency Decoupled Diffusion Model (FDDM) for MR-to-CT conversion. FDDM first obtains the anatomical information of the CT image from the MR image through an initial conversion module. This anatomical information then guides a subsequent diffusion model to generate high-quality CT images. Our diffusion model uses a dual-path reverse diffusion process for low-frequency and high-frequency information, achieving a better balance between image quality and anatomical accuracy. We extensively evaluated FDDM using public datasets for brain MR-to-CT and pelvis MR-to-CT translations, demonstrating its superior performance to other GAN-based, VAE-based, and diffusion-based models. The evaluation metrics included Frechet Inception Distance (FID), Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index Measure (SSIM). FDDM achieved the best scores on all metrics for both datasets, particularly excelling in FID, with scores of 25.9 for brain data and 29.2 for pelvis data, significantly outperforming other methods. These results demonstrate that FDDM can generate high-quality target domain images while maintaining the accuracy of translated anatomical structures.
♻ ☆ Utilizing Adversarial Examples for Bias Mitigation and Accuracy Enhancement
We propose a novel approach to mitigate biases in computer vision models by utilizing counterfactual generation and fine-tuning. While counterfactuals have been used to analyze and address biases in DNN models, the counterfactuals themselves are often generated from biased generative models, which can introduce additional biases or spurious correlations. To address this issue, we propose using adversarial images, that is images that deceive a deep neural network but not humans, as counterfactuals for fair model training. Our approach leverages a curriculum learning framework combined with a fine-grained adversarial loss to fine-tune the model using adversarial examples. By incorporating adversarial images into the training data, we aim to prevent biases from propagating through the pipeline. We validate our approach through both qualitative and quantitative assessments, demonstrating improved bias mitigation and accuracy compared to existing methods. Qualitatively, our results indicate that post-training, the decisions made by the model are less dependent on the sensitive attribute and our model better disentangles the relationship between sensitive attributes and classification variables.
♻ ☆ Semi-supervised variational autoencoder for cell feature extraction in multiplexed immunofluorescence images
Advancements in digital imaging technologies have sparked increased interest in using multiplexed immunofluorescence (mIF) images to visualise and identify the interactions between specific immunophenotypes with the tumour microenvironment at the cellular level. Current state-of-the-art multiplexed immunofluorescence image analysis pipelines depend on cell feature representations characterised by morphological and stain intensity-based metrics generated using simple statistical and machine learning-based tools. However, these methods are not capable of generating complex representations of cells. We propose a deep learning-based cell feature extraction model using a variational autoencoder with supervision using a latent subspace to extract cell features in mIF images. We perform cell phenotype classification using a cohort of more than 44,000 multiplexed immunofluorescence cell image patches extracted across 1,093 tissue microarray cores of breast cancer patients, to demonstrate the success of our model against current and alternative methods.
♻ ☆ FishNet: Deep Neural Networks for Low-Cost Fish Stock Estimation
Fish stock assessment often involves manual fish counting by taxonomy specialists, which is both time-consuming and costly. We propose FishNet, an automated computer vision system for both taxonomic classification and fish size estimation from images captured with a low-cost digital camera. The system first performs object detection and segmentation using a Mask R-CNN to identify individual fish from images containing multiple fish, possibly consisting of different species. Then each fish species is classified and the length is predicted using separate machine learning models. To develop the model, we use a dataset of 300,000 hand-labeled images containing 1.2M fish of 163 different species and ranging in length from 10cm to 250cm, with additional annotations and quality control methods used to curate high-quality training data. On held-out test data sets, our system achieves a 92% intersection over union on the fish segmentation task, a 89% top-1 classification accuracy on single fish species classification, and a 2.3cm mean absolute error on the fish length estimation task.
comment: IEEE COINS 2024
♻ ☆ Step Differences in Instructional Video CVPR 2024
Comparing a user video to a reference how-to video is a key requirement for AR/VR technology delivering personalized assistance tailored to the user's progress. However, current approaches for language-based assistance can only answer questions about a single video. We propose an approach that first automatically generates large amounts of visual instruction tuning data involving pairs of videos from HowTo100M by leveraging existing step annotations and accompanying narrations, and then trains a video-conditioned language model to jointly reason across multiple raw videos. Our model achieves state-of-the-art performance at identifying differences between video pairs and ranking videos based on the severity of these differences, and shows promising ability to perform general reasoning over multiple videos. Project page: https://github.com/facebookresearch/stepdiff
comment: CVPR 2024
♻ ☆ Learning Stackable and Skippable LEGO Bricks for Efficient, Reconfigurable, and Variable-Resolution Diffusion Modeling
Diffusion models excel at generating photo-realistic images but come with significant computational costs in both training and sampling. While various techniques address these computational challenges, a less-explored issue is designing an efficient and adaptable network backbone for iterative refinement. Current options like U-Net and Vision Transformer often rely on resource-intensive deep networks and lack the flexibility needed for generating images at variable resolutions or with a smaller network than used in training. This study introduces LEGO bricks, which seamlessly integrate Local-feature Enrichment and Global-content Orchestration. These bricks can be stacked to create a test-time reconfigurable diffusion backbone, allowing selective skipping of bricks to reduce sampling costs and generate higher-resolution images than the training data. LEGO bricks enrich local regions with an MLP and transform them using a Transformer block while maintaining a consistent full-resolution image across all bricks. Experimental results demonstrate that LEGO bricks enhance training efficiency, expedite convergence, and facilitate variable-resolution image generation while maintaining strong generative performance. Moreover, LEGO significantly reduces sampling time compared to other methods, establishing it as a valuable enhancement for diffusion models. Our code and project page are available at https://jegzheng.github.io/LEGODiffusion.
Computation and Language 136
☆ Taming Data and Transformers for Audio Generation
Generating ambient sounds and effects is a challenging problem due to data scarcity and often insufficient caption quality, making it difficult to employ large-scale generative models for the task. In this work, we tackle the problem by introducing two new models. First, we propose AutoCap, a high-quality and efficient automatic audio captioning model. We show that by leveraging metadata available with the audio modality, we can substantially improve the quality of captions. AutoCap reaches CIDEr score of 83.2, marking a 3.2% improvement from the best available captioning model at four times faster inference speed. We then use AutoCap to caption clips from existing datasets, obtaining 761,000 audio clips with high-quality captions, forming the largest available audio-text dataset. Second, we propose GenAu, a scalable transformer-based audio generation architecture that we scale up to 1.25B parameters and train with our new dataset. When compared to state-of-the-art audio generators, GenAu obtains significant improvements of 15.7% in FAD score, 22.7% in IS, and 13.5% in CLAP score, indicating significantly improved quality of generated audio compared to previous works. This shows that the quality of data is often as important as its quantity. Besides, since AutoCap is fully automatic, new audio samples can be added to the training dataset, unlocking the training of even larger generative models for audio synthesis.
comment: Project Webpage: https://snap-research.github.io/GenAU/
☆ The Remarkable Robustness of LLMs: Stages of Inference?
We demonstrate and investigate the remarkable robustness of Large Language Models by deleting and swapping adjacent layers. We find that deleting and swapping interventions retain 72-95\% of the original model's prediction accuracy without fine-tuning, whereas models with more layers exhibit more robustness. Based on the results of the layer-wise intervention and further experiments, we hypothesize the existence of four universal stages of inference across eight different models: detokenization, feature engineering, prediction ensembling, and residual sharpening. The first stage integrates local information, lifting raw token representations into higher-level contextual representations. Next is the iterative refinement of task and entity-specific features. Then, the second half of the model begins with a phase transition, where hidden representations align more with the vocabulary space due to specialized model components. Finally, the last layer sharpens the following token distribution by eliminating obsolete features that add noise to the prediction.
☆ Suri: Multi-constraint Instruction Following for Long-form Text Generation
Existing research on instruction following largely focuses on tasks with simple instructions and short responses. In this work, we explore multi-constraint instruction following for generating long-form text. We create Suri, a dataset with 20K human-written long-form texts paired with LLM-generated backtranslated instructions that contain multiple complex constraints. Because of prohibitive challenges associated with collecting human preference judgments on long-form texts, preference-tuning algorithms such as DPO are infeasible in our setting; thus, we propose Instructional ORPO (I-ORPO), an alignment method based on the ORPO algorithm. Instead of receiving negative feedback from dispreferred responses, I-ORPO obtains negative feedback from synthetically corrupted instructions generated by an LLM. Using Suri, we perform supervised and I-ORPO fine-tuning on Mistral-7b-Instruct-v0.2. The resulting models, Suri-SFT and Suri-I-ORPO, generate significantly longer texts (~5K tokens) than base models without significant quality deterioration. Our human evaluation shows that while both SFT and I-ORPO models satisfy most constraints, Suri-I-ORPO generations are generally preferred for their coherent and informative incorporation of the constraints. We release our code at https://github.com/chtmp223/suri.
☆ The Model Arena for Cross-lingual Sentiment Analysis: A Comparative Study in the Era of Large Language Models WASSA
Sentiment analysis serves as a pivotal component in Natural Language Processing (NLP). Advancements in multilingual pre-trained models such as XLM-R and mT5 have contributed to the increasing interest in cross-lingual sentiment analysis. The recent emergence in Large Language Models (LLM) has significantly advanced general NLP tasks, however, the capability of such LLMs in cross-lingual sentiment analysis has not been fully studied. This work undertakes an empirical analysis to compare the cross-lingual transfer capability of public Small Multilingual Language Models (SMLM) like XLM-R, against English-centric LLMs such as Llama-3, in the context of sentiment analysis across English, Spanish, French and Chinese. Our findings reveal that among public models, SMLMs exhibit superior zero-shot cross-lingual performance relative to LLMs. However, in few-shot cross-lingual settings, public LLMs demonstrate an enhanced adaptive potential. In addition, we observe that proprietary GPT-3.5 and GPT-4 lead in zero-shot cross-lingual capability, but are outpaced by public models in few-shot scenarios.
comment: Accepted to WASSA workshop at ACL2024
☆ DiVERT: Distractor Generation with Variational Errors Represented as Text for Math Multiple-choice Questions
High-quality distractors are crucial to both the assessment and pedagogical value of multiple-choice questions (MCQs), where manually crafting ones that anticipate knowledge deficiencies or misconceptions among real students is difficult. Meanwhile, automated distractor generation, even with the help of large language models (LLMs), remains challenging for subjects like math. It is crucial to not only identify plausible distractors but also understand the error behind them. In this paper, we introduce DiVERT (Distractor Generation with Variational Errors Represented as Text), a novel variational approach that learns an interpretable representation of errors behind distractors in math MCQs. Through experiments on a real-world math MCQ dataset with 1,434 questions used by hundreds of thousands of students, we show that DiVERT, despite using a base open-source LLM with 7B parameters, outperforms state-of-the-art approaches using GPT-4o on downstream distractor generation. We also conduct a human evaluation with math educators and find that DiVERT leads to error labels that are of comparable quality to human-authored ones.
☆ Fundamental Problems With Model Editing: How Should Rational Belief Revision Work in LLMs?
The model editing problem concerns how language models should learn new facts about the world over time. While empirical research on model editing has drawn widespread attention, the conceptual foundations of model editing remain shaky -- perhaps unsurprisingly, since model editing is essentially belief revision, a storied problem in philosophy that has eluded succinct solutions for decades. Model editing nonetheless demands a solution, since we need to be able to control the knowledge within language models. With this goal in mind, this paper critiques the standard formulation of the model editing problem and proposes a formal testbed for model editing research. We first describe 12 open problems with model editing, based on challenges with (1) defining the problem, (2) developing benchmarks, and (3) assuming LLMs have editable beliefs in the first place. Many of these challenges are extremely difficult to address, e.g. determining far-reaching consequences of edits, labeling probabilistic entailments between facts, and updating beliefs of agent simulators. Next, we introduce a semi-synthetic dataset for model editing based on Wikidata, where we can evaluate edits against labels given by an idealized Bayesian agent. This enables us to say exactly how belief revision in language models falls short of a desirable epistemic standard. We encourage further research exploring settings where such a gold standard can be compared against. Our code is publicly available at: https://github.com/peterbhase/LLM-belief-revision
comment: 23 pages, 4 figures
☆ IndoToxic2024: A Demographically-Enriched Dataset of Hate Speech and Toxicity Types for Indonesian Language
Hate speech poses a significant threat to social harmony. Over the past two years, Indonesia has seen a ten-fold increase in the online hate speech ratio, underscoring the urgent need for effective detection mechanisms. However, progress is hindered by the limited availability of labeled data for Indonesian texts. The condition is even worse for marginalized minorities, such as Shia, LGBTQ, and other ethnic minorities because hate speech is underreported and less understood by detection tools. Furthermore, the lack of accommodation for subjectivity in current datasets compounds this issue. To address this, we introduce IndoToxic2024, a comprehensive Indonesian hate speech and toxicity classification dataset. Comprising 43,692 entries annotated by 19 diverse individuals, the dataset focuses on texts targeting vulnerable groups in Indonesia, specifically during the hottest political event in the country: the presidential election. We establish baselines for seven binary classification tasks, achieving a macro-F1 score of 0.78 with a BERT model (IndoBERTweet) fine-tuned for hate speech classification. Furthermore, we demonstrate how incorporating demographic information can enhance the zero-shot performance of the large language model, gpt-3.5-turbo. However, we also caution that an overemphasis on demographic information can negatively impact the fine-tuned model performance due to data fragmentation.
☆ Jump Starting Bandits with LLM-Generated Prior Knowledge
We present substantial evidence demonstrating the benefits of integrating Large Language Models (LLMs) with a Contextual Multi-Armed Bandit framework. Contextual bandits have been widely used in recommendation systems to generate personalized suggestions based on user-specific contexts. We show that LLMs, pre-trained on extensive corpora rich in human knowledge and preferences, can simulate human behaviours well enough to jump-start contextual multi-armed bandits to reduce online learning regret. We propose an initialization algorithm for contextual bandits by prompting LLMs to produce a pre-training dataset of approximate human preferences for the bandit. This significantly reduces online learning regret and data-gathering costs for training such models. Our approach is validated empirically through two sets of experiments with different bandit setups: one which utilizes LLMs to serve as an oracle and a real-world experiment utilizing data from a conjoint survey experiment.
LiveBench: A Challenging, Contamination-Free LLM Benchmark
Test set contamination, wherein test data from a benchmark ends up in a newer model's training set, is a well-documented obstacle for fair LLM evaluation and can quickly render benchmarks obsolete. To mitigate this, many recent benchmarks crowdsource new prompts and evaluations from human or LLM judges; however, these can introduce significant biases, and break down when scoring hard questions. In this work, we introduce a new benchmark for LLMs designed to be immune to both test set contamination and the pitfalls of LLM judging and human crowdsourcing. We release LiveBench, the first benchmark that (1) contains frequently-updated questions from recent information sources, (2) scores answers automatically according to objective ground-truth values, and (3) contains a wide variety of challenging tasks, spanning math, coding, reasoning, language, instruction following, and data analysis. To achieve this, LiveBench contains questions that are based on recently-released math competitions, arXiv papers, news articles, and datasets, and it contains harder, contamination-free versions of tasks from previous benchmarks such as Big-Bench Hard, AMPS, and IFEval. We evaluate many prominent closed-source models, as well as dozens of open-source models ranging from 0.5B to 110B in size. LiveBench is difficult, with top models achieving below 65% accuracy. We release all questions, code, and model answers. Questions will be added and updated on a monthly basis, and we will release new tasks and harder versions of tasks over time so that LiveBench can distinguish between the capabilities of LLMs as they improve in the future. We welcome community engagement and collaboration for expanding the benchmark tasks and models.
☆ The Odyssey of Commonsense Causality: From Foundational Benchmarks to Cutting-Edge Reasoning
Understanding commonsense causality is a unique mark of intelligence for humans. It helps people understand the principles of the real world better and benefits the decision-making process related to causation. For instance, commonsense causality is crucial in judging whether a defendant's action causes the plaintiff's loss in determining legal liability. Despite its significance, a systematic exploration of this topic is notably lacking. Our comprehensive survey bridges this gap by focusing on taxonomies, benchmarks, acquisition methods, qualitative reasoning, and quantitative measurements in commonsense causality, synthesizing insights from over 200 representative articles. Our work aims to provide a systematic overview, update scholars on recent advancements, provide a pragmatic guide for beginners, and highlight promising future research directions in this vital field.
comment: 42 pages
☆ From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data
Recent studies have shown that Large Language Models (LLMs) struggle to accurately retrieve information and maintain reasoning capabilities when processing long-context inputs. To address these limitations, we propose a finetuning approach utilizing a carefully designed synthetic dataset comprising numerical key-value retrieval tasks. Our experiments on models like GPT-3.5 Turbo and Mistral 7B demonstrate that finetuning LLMs on this dataset significantly improves LLMs' information retrieval and reasoning capabilities in longer-context settings. We present an analysis of the finetuned models, illustrating the transfer of skills from synthetic to real task evaluations (e.g., $10.5\%$ improvement on $20$ documents MDQA at position $10$ for GPT-3.5 Turbo). We also find that finetuned LLMs' performance on general benchmarks remains almost constant while LLMs finetuned on other baseline long-context augmentation data can encourage hallucination (e.g., on TriviaQA, Mistral 7B finetuned on our synthetic data cause no performance drop while other baseline data can cause a drop that ranges from $2.33\%$ to $6.19\%$). Our study highlights the potential of finetuning on synthetic data for improving the performance of LLMs on longer-context tasks.
☆ HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale
The rapid development of multimodal large language models (MLLMs), such as GPT-4V, has led to significant advancements. However, these models still face challenges in medical multimodal capabilities due to limitations in the quantity and quality of medical vision-text data, stemming from data privacy concerns and high annotation costs. While pioneering approaches utilize PubMed's large-scale, de-identified medical image-text pairs to address these limitations, they still fall short due to inherent data noise. To tackle this, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an 'unblinded' capacity to denoise and reformat the data, resulting in the creation of the PubMedVision dataset with 1.3 million medical VQA samples. Our validation demonstrates that: (1) PubMedVision can significantly enhance the medical multimodal capabilities of current MLLMs, showing significant improvement in benchmarks including the MMMU Health & Medicine track; (2) manual checks by medical experts and empirical results validate the superior data quality of our dataset compared to other data construction methods. Using PubMedVision, we train a 34B medical MLLM HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios among open-source MLLMs.
☆ VERISCORE: Evaluating the factuality of verifiable claims in long-form text generation
Existing metrics for evaluating the factuality of long-form text, such as FACTSCORE (Min et al., 2023) and SAFE (Wei et al., 2024), decompose an input text into "atomic claims" and verify each against a knowledge base like Wikipedia. These metrics are not suitable for most generation tasks because they assume that every claim is verifiable (i.e., can plausibly be proven true or false). We address this issue with VERISCORE, a metric for diverse long-form generation tasks that contain both verifiable and unverifiable content. VERISCORE can be effectively implemented with either closed or fine-tuned open-weight language models, and human evaluation confirms that VERISCORE's extracted claims are more sensible than those from competing methods across eight different long-form tasks. We use VERISCORE to evaluate generations from 16 different models across multiple long-form tasks and find that while GPT-4o is the best-performing model overall, open-weight models such as Mixtral-8x22 are closing the gap. We show that an LM's VERISCORE on one task (e.g., biography generation) does not necessarily correlate to its VERISCORE on a different task (e.g., long-form QA), highlighting the need for expanding factuality evaluation across tasks with varying fact density.
☆ AutoPureData: Automated Filtering of Web Data for LLM Fine-tuning
Up-to-date and reliable Large Language Models (LLMs) are consistently sought after. Typically, LLMs are trained on a fixed dataset and then deployed. However, the training data continually becomes outdated. Enable automatic training of AI using web data involves significant concerns regarding data quality and safety due to bias, spam, and other unsafe or unwanted text. Pure data is essential for producing reliable models. Training a model on impure data may result in undesirable outcomes. This research proposes a system that collects web data and automatically filters out unwanted text with the assistance of existing trusted AI models. In the experiment, a small sample of web data was collected and filtered, demonstrating the system's effectiveness in purifying the data.
comment: Initial version
☆ Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding
Graphical User Interfaces (GUIs) are central to our interaction with digital devices. Recently, growing efforts have been made to build models for various GUI understanding tasks. However, these efforts largely overlook an important GUI-referring task: screen reading based on user-indicated points, which we name the Screen Point-and-Read (SPR) task. This task is predominantly handled by rigid accessible screen reading tools, in great need of new models driven by advancements in Multimodal Large Language Models (MLLMs). In this paper, we propose a Tree-of-Lens (ToL) agent, utilizing a novel ToL grounding mechanism, to address the SPR task. Based on the input point coordinate and the corresponding GUI screenshot, our ToL agent constructs a Hierarchical Layout Tree. Based on the tree, our ToL agent not only comprehends the content of the indicated area but also articulates the layout and spatial relationships between elements. Such layout information is crucial for accurately interpreting information on the screen, distinguishing our ToL agent from other screen reading tools. We also thoroughly evaluate the ToL agent against other baselines on a newly proposed SPR benchmark, which includes GUIs from mobile, web, and operating systems. Last but not least, we test the ToL agent on mobile GUI navigation tasks, demonstrating its utility in identifying incorrect actions along the path of agent execution trajectories. Code and data: screen-point-and-read.github.io
☆ Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment
While pre-training large-scale video-language models (VLMs) has shown remarkable potential for various downstream video-language tasks, existing VLMs can still suffer from certain commonly seen limitations, e.g., coarse-grained cross-modal aligning , under-modeling of temporal dynamics, detached video-language view. In this work, we target enhancing VLMs with a fine-grained structural spatio-temporal alignment learning method (namely Finsta). First of all, we represent the input texts and videos with fine-grained scene graph (SG) structures, both of which are further unified into a holistic SG (HSG) for bridging two modalities. Then, an SG-based framework is built, where the textual SG (TSG) is encoded with a graph Transformer, while the video dynamic SG (DSG) and the HSG are modeled with a novel recurrent graph Transformer for spatial and temporal feature propagation. A spatial-temporal Gaussian differential graph Transformer is further devised to strengthen the sense of the changes in objects across spatial and temporal dimensions. Next, based on the fine-grained structural features of TSG and DSG, we perform object-centered spatial alignment and predicate-centered temporal alignment respectively, enhancing the video-language grounding in both the spatiality and temporality. We design our method as a plug&play system, which can be integrated into existing well-trained VLMs for further representation augmentation, without training from scratch or relying on SG annotations in downstream applications. On 6 representative VL modeling tasks over 12 datasets in both standard and long-form video scenarios, Finsta consistently improves the existing 13 strong-performing VLMs persistently, and refreshes the current state-of-the-art end task performance significantly in both the fine-tuning and zero-shot settings.
comment: Accepted by IEEE TPAMI 2024
☆ AutoRAG-HP: Automatic Online Hyper-Parameter Tuning for Retrieval-Augmented Generation
Recent advancements in Large Language Models have transformed ML/AI development, necessitating a reevaluation of AutoML principles for the Retrieval-Augmented Generation (RAG) systems. To address the challenges of hyper-parameter optimization and online adaptation in RAG, we propose the AutoRAG-HP framework, which formulates the hyper-parameter tuning as an online multi-armed bandit (MAB) problem and introduces a novel two-level Hierarchical MAB (Hier-MAB) method for efficient exploration of large search spaces. We conduct extensive experiments on tuning hyper-parameters, such as top-k retrieved documents, prompt compression ratio, and embedding methods, using the ALCE-ASQA and Natural Questions datasets. Our evaluation from jointly optimization all three hyper-parameters demonstrate that MAB-based online learning methods can achieve Recall@5 $\approx 0.8$ for scenarios with prominent gradients in search space, using only $\sim20\%$ of the LLM API calls required by the Grid Search approach. Additionally, the proposed Hier-MAB approach outperforms other baselines in more challenging optimization scenarios. The code will be made available at https://aka.ms/autorag.
☆ Revealing Fine-Grained Values and Opinions in Large Language Models
Uncovering latent values and opinions in large language models (LLMs) can help identify biases and mitigate potential harm. Recently, this has been approached by presenting LLMs with survey questions and quantifying their stances towards morally and politically charged statements. However, the stances generated by LLMs can vary greatly depending on how they are prompted, and there are many ways to argue for or against a given position. In this work, we propose to address this by analysing a large and robust dataset of 156k LLM responses to the 62 propositions of the Political Compass Test (PCT) generated by 6 LLMs using 420 prompt variations. We perform coarse-grained analysis of their generated stances and fine-grained analysis of the plain text justifications for those stances. For fine-grained analysis, we propose to identify tropes in the responses: semantically similar phrases that are recurrent and consistent across different prompts, revealing patterns in the text that a given LLM is prone to produce. We find that demographic features added to prompts significantly affect outcomes on the PCT, reflecting bias, as well as disparities between the results of tests when eliciting closed-form vs. open domain responses. Additionally, patterns in the plain text rationales via tropes show that similar justifications are repeatedly generated across models and prompts even with disparate stances.
comment: 28 pages, 20 figures, 7 tables
☆ FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts
Existing benchmarks for visual question answering lack in visual grounding and complexity, particularly in evaluating spatial reasoning skills. We introduce FlowVQA, a novel benchmark aimed at assessing the capabilities of visual question-answering multimodal language models in reasoning with flowcharts as visual contexts. FlowVQA comprises 2,272 carefully generated and human-verified flowchart images from three distinct content sources, along with 22,413 diverse question-answer pairs, to test a spectrum of reasoning tasks, including information localization, decision-making, and logical progression. We conduct a thorough baseline evaluation on a suite of both open-source and proprietary multimodal language models using various strategies, followed by an analysis of directional bias. The results underscore the benchmark's potential as a vital tool for advancing the field of multimodal modeling, providing a focused and challenging environment for enhancing model performance in visual and logical reasoning tasks.
☆ RuBLiMP: Russian Benchmark of Linguistic Minimal Pairs
Minimal pairs are a well-established approach to evaluating the grammatical knowledge of language models. However, existing resources for minimal pairs address a limited number of languages and lack diversity of language-specific grammatical phenomena. This paper introduces the Russian Benchmark of Linguistic Minimal Pairs (RuBLiMP), which includes 45k pairs of sentences that differ in grammaticality and isolate a morphological, syntactic, or semantic phenomenon. In contrast to existing benchmarks of linguistic minimal pairs, RuBLiMP is created by applying linguistic perturbations to automatically annotated sentences from open text corpora and carefully curating test data. We describe the data collection protocol and present the results of evaluating 25 language models in various scenarios. We find that the widely used language models for Russian are sensitive to morphological and agreement-oriented contrasts but fall behind humans on phenomena requiring understanding of structural relations, negation, transitivity, and tense. RuBLiMP, the codebase, and other materials are publicly available.
☆ Spiking Convolutional Neural Networks for Text Classification
Spiking neural networks (SNNs) offer a promising pathway to implement deep neural networks (DNNs) in a more energy-efficient manner since their neurons are sparsely activated and inferences are event-driven. However, there have been very few works that have demonstrated the efficacy of SNNs in language tasks partially because it is non-trivial to represent words in the forms of spikes and to deal with variable-length texts by SNNs. This work presents a "conversion + fine-tuning" two-step method for training SNNs for text classification and proposes a simple but effective way to encode pre-trained word embeddings as spike trains. We show empirically that after fine-tuning with surrogate gradients, the converted SNNs achieve comparable results to their DNN counterparts with much less energy consumption across multiple datasets for both English and Chinese. We also show that such SNNs are more robust to adversarial attacks than DNNs.
☆ Tools Fail: Detecting Silent Errors in Faulty Tools
Tools have become a mainstay of LLMs, allowing them to retrieve knowledge not in their weights, to perform tasks on the web, and even to control robots. However, most ontologies and surveys of tool-use have assumed the core challenge for LLMs is choosing the tool. Instead, we introduce a framework for tools more broadly which guides us to explore a model's ability to detect "silent" tool errors, and reflect on how to plan. This more directly aligns with the increasingly popular use of models as tools. We provide an initial approach to failure recovery with promising results both on a controlled calculator setting and embodied agent planning.
comment: 18 pages, 12 figures
☆ Aligning Teacher with Student Preferences for Tailored Training Data Generation
Large Language Models (LLMs) have shown significant promise as copilots in various tasks. Local deployment of LLMs on edge devices is necessary when handling privacy-sensitive data or latency-sensitive tasks. The computational constraints of such devices make direct deployment of powerful large-scale LLMs impractical, necessitating the Knowledge Distillation from large-scale models to lightweight models. Lots of work has been done to elicit diversity and quality training examples from LLMs, but little attention has been paid to aligning teacher instructional content based on student preferences, akin to "responsive teaching" in pedagogy. Thus, we propose ARTE, dubbed Aligning TeacheR with StudenT PreferencEs, a framework that aligns the teacher model with student preferences to generate tailored training examples for Knowledge Distillation. Specifically, we elicit draft questions and rationales from the teacher model, then collect student preferences on these questions and rationales using students' performance with in-context learning as a proxy, and finally align the teacher model with student preferences. In the end, we repeat the first step with the aligned teacher model to elicit tailored training examples for the student model on the target task. Extensive experiments on academic benchmarks demonstrate the superiority of ARTE over existing instruction-tuning datasets distilled from powerful LLMs. Moreover, we thoroughly investigate the generalization of ARTE, including the generalization of fine-tuned student models in reasoning ability and the generalization of aligned teacher models to generate tailored training data across tasks and students. In summary, our contributions lie in proposing a novel framework for tailored training example generation, demonstrating its efficacy in experiments, and investigating the generalization of both student & aligned teacher models in ARTE.
☆ Simulating Classroom Education with LLM-Empowered Agents
Large language models (LLMs) have been employed in various intelligent educational tasks to assist teaching. While preliminary explorations have focused on independent LLM-empowered agents for specific educational tasks, the potential for LLMs within a multi-agent collaborative framework to simulate a classroom with real user participation remains unexplored. In this work, we propose SimClass, a multi-agent classroom simulation framework involving user participation. We recognize representative class roles and introduce a novel class control mechanism for automatic classroom teaching, and conduct user experiments in two real-world courses. Utilizing the Flanders Interactive Analysis System and Community of Inquiry theoretical frame works from educational analysis, we demonstrate that LLMs can simulate traditional classroom interaction patterns effectively while enhancing user's experience. We also observe emergent group behaviors among agents in SimClass, where agents collaborate to create enlivening interactions in classrooms to improve user learning process. We hope this work pioneers the application of LLM-empowered multi-agent systems in virtual classroom teaching.
☆ T-FREE: Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings
Tokenizers are crucial for encoding information in Large Language Models, but their development has recently stagnated, and they contain inherent weaknesses. Major limitations include computational overhead, ineffective vocabulary use, and unnecessarily large embedding and head layers. Additionally, their performance is biased towards a reference corpus, leading to reduced effectiveness for underrepresented languages. To remedy these issues, we propose T-FREE, which directly embeds words through sparse activation patterns over character triplets, and does not require a reference corpus. T-FREE inherently exploits morphological similarities and allows for strong compression of embedding layers. In our exhaustive experimental evaluation, we achieve competitive downstream performance with a parameter reduction of more than 85% on these layers. Further, T-FREE shows significant improvements in cross-lingual transfer learning.
☆ SeaKR: Self-aware Knowledge Retrieval for Adaptive Retrieval Augmented Generation
This paper introduces Self-aware Knowledge Retrieval (SeaKR), a novel adaptive RAG model that extracts self-aware uncertainty of LLMs from their internal states. SeaKR activates retrieval when the LLMs present high self-aware uncertainty for generation. To effectively integrate retrieved knowledge snippets, SeaKR re-ranks them based on LLM's self-aware uncertainty to preserve the snippet that reduces their uncertainty to the utmost. To facilitate solving complex tasks that require multiple retrievals, SeaKR utilizes their self-aware uncertainty to choose among different reasoning strategies. Our experiments on both complex and simple Question Answering datasets show that SeaKR outperforms existing adaptive RAG methods. We release our code at https://github.com/THU-KEG/SeaKR.
☆ Annotation Errors and NER: A Study with OntoNotes 5.0 LREC 2022
Named Entity Recognition (NER) is a well-studied problem in NLP. However, there is much less focus on studying NER datasets, compared to developing new NER models. In this paper, we employed three simple techniques to detect annotation errors in the OntoNotes 5.0 corpus for English NER, which is the largest available NER corpus for English. Our techniques corrected ~10% of the sentences in train/dev/test data. In terms of entity mentions, we corrected the span and/or type of ~8% of mentions in the dataset, while adding/deleting/splitting/merging a few more. These are large numbers of changes, considering the size of OntoNotes. We used three NER libraries to train, evaluate and compare the models trained with the original and the re-annotated datasets, which showed an average improvement of 1.23% in overall F-scores, with large (>10%) improvements for some of the entity types. While our annotation error detection methods are not exhaustive and there is some manual annotation effort involved, they are largely language agnostic and can be employed with other NER datasets, and other sequence labelling tasks.
comment: Unpublished report. Originally submitted to LREC 2022
☆ The Illusion of Competence: Evaluating the Effect of Explanations on Users' Mental Models of Visual Question Answering Systems
We examine how users perceive the limitations of an AI system when it encounters a task that it cannot perform perfectly and whether providing explanations alongside its answers aids users in constructing an appropriate mental model of the system's capabilities and limitations. We employ a visual question answer and explanation task where we control the AI system's limitations by manipulating the visual inputs: during inference, the system either processes full-color or grayscale images. Our goal is to determine whether participants can perceive the limitations of the system. We hypothesize that explanations will make limited AI capabilities more transparent to users. However, our results show that explanations do not have this effect. Instead of allowing users to more accurately assess the limitations of the AI system, explanations generally increase users' perceptions of the system's competence - regardless of its actual performance.
comment: 16 pages (including Appendix); under review
☆ Resolving Discrepancies in Compute-Optimal Scaling of Language Models
Kaplan et al. and Hoffmann et al. developed influential scaling laws for the optimal model size as a function of the compute budget, but these laws yield substantially different predictions. We explain the discrepancy by reproducing the Kaplan scaling law on two datasets (OpenWebText2 and RefinedWeb) and identifying three factors causing the difference: last layer computational cost, warmup duration, and scale-dependent optimizer tuning. With these factors corrected, we obtain excellent agreement with the Hoffmann et al. (i.e., "Chinchilla") scaling law. Counter to a hypothesis of Hoffmann et al., we find that careful learning rate decay is not essential for the validity of their scaling law. As a secondary result, we derive scaling laws for the optimal learning rate and batch size, finding that tuning the AdamW $\beta_2$ parameter is essential at lower batch sizes.
☆ CHEW: A Dataset of CHanging Events in Wikipedia
We introduce CHEW, a novel dataset of changing events in Wikipedia expressed in naturally occurring text. We use CHEW for probing LLMs for their timeline understanding of Wikipedia entities and events in generative and classification experiments. Our results suggest that LLMs, despite having temporal information available, struggle to construct accurate timelines. We further show the usefulness of CHEW-derived embeddings for identifying meaning shift.
comment: Short Paper
☆ Statements: Universal Information Extraction from Tables with Large Language Models for ESG KPIs ACL 2024
Environment, Social, and Governance (ESG) KPIs assess an organization's performance on issues such as climate change, greenhouse gas emissions, water consumption, waste management, human rights, diversity, and policies. ESG reports convey this valuable quantitative information through tables. Unfortunately, extracting this information is difficult due to high variability in the table structure as well as content. We propose Statements, a novel domain agnostic data structure for extracting quantitative facts and related information. We propose translating tables to statements as a new supervised deep-learning universal information extraction task. We introduce SemTabNet - a dataset of over 100K annotated tables. Investigating a family of T5-based Statement Extraction Models, our best model generates statements which are 82% similar to the ground-truth (compared to baseline of 21%). We demonstrate the advantages of statements by applying our model to over 2700 tables from ESG reports. The homogeneous nature of statements permits exploratory data analysis on expansive information found in large collections of ESG reports.
comment: Accepted at the NLP4Climate workshop in the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)
☆ Fairness and Bias in Multimodal AI: A Survey
The importance of addressing fairness and bias in artificial intelligence (AI) systems cannot be over-emphasized. Mainstream media has been awashed with news of incidents around stereotypes and bias in many of these systems in recent years. In this survey, we fill a gap with regards to the minimal study of fairness and bias in Large Multimodal Models (LMMs) compared to Large Language Models (LLMs), providing 50 examples of datasets and models along with the challenges affecting them; we identify a new category of quantifying bias (preuse), in addition to the two well-known ones in the literature: intrinsic and extrinsic; we critically discuss the various ways researchers are addressing these challenges. Our method involved two slightly different search queries on Google Scholar, which revealed that 33,400 and 538,000 links are the results for the terms "Fairness and bias in Large Multimodal Models" and "Fairness and bias in Large Language Models", respectively. We believe this work contributes to filling this gap and providing insight to researchers and other stakeholders on ways to address the challenge of fairness and bias in multimodal A!.
comment: 8 pages
☆ AMBROSIA: A Benchmark for Parsing Ambiguous Questions into Database Queries
Practical semantic parsers are expected to understand user utterances and map them to executable programs, even when these are ambiguous. We introduce a new benchmark, AMBROSIA, which we hope will inform and inspire the development of text-to-SQL parsers capable of recognizing and interpreting ambiguous requests. Our dataset contains questions showcasing three different types of ambiguity (scope ambiguity, attachment ambiguity, and vagueness), their interpretations, and corresponding SQL queries. In each case, the ambiguity persists even when the database context is provided. This is achieved through a novel approach that involves controlled generation of databases from scratch. We benchmark various LLMs on AMBROSIA, revealing that even the most advanced models struggle to identify and interpret ambiguity in questions.
☆ EmPO: Theory-Driven Dataset Construction for Empathetic Response Generation through Preference Optimization ACL
Empathetic response generation is a desirable aspect of conversational agents, crucial for facilitating engaging and emotionally intelligent multi-turn conversations between humans and machines. Leveraging large language models for this task has shown promising results, yet challenges persist in ensuring both the empathetic quality of the responses and retention of the generalization performance of the models. In this paper, we propose a novel approach where we construct theory-driven preference datasets and use them to align LLMs with preference optimization algorithms to address these challenges. To measure empathetic response generation, we employ the EmpatheticDialogues dataset, assessing empathy with the diff-EPITOME and BERTscore metrics, and evaluate the generalization performance on the MMLU benchmark. We make all datasets, source code, and models publicly available.
comment: v01, 4 pages short paper, ACL style
☆ STBench: Assessing the Ability of Large Language Models in Spatio-Temporal Analysis
The rapid evolution of large language models (LLMs) holds promise for reforming the methodology of spatio-temporal data mining. However, current works for evaluating the spatio-temporal understanding capability of LLMs are somewhat limited and biased. These works either fail to incorporate the latest language models or only focus on assessing the memorized spatio-temporal knowledge. To address this gap, this paper dissects LLMs' capability of spatio-temporal data into four distinct dimensions: knowledge comprehension, spatio-temporal reasoning, accurate computation, and downstream applications. We curate several natural language question-answer tasks for each category and build the benchmark dataset, namely STBench, containing 13 distinct tasks and over 60,000 QA pairs. Moreover, we have assessed the capabilities of 13 LLMs, such as GPT-4o, Gemma and Mistral. Experimental results reveal that existing LLMs show remarkable performance on knowledge comprehension and spatio-temporal reasoning tasks, with potential for further enhancement on other tasks through in-context learning, chain-of-though prompting, and fine-tuning. The code and datasets of STBench are released on https://github.com/LwbXc/STBench.
☆ Improving Weak-to-Strong Generalization with Reliability-Aware Alignment
Large language models (LLMs) are now rapidly advancing and surpassing human abilities on many natural language tasks. However, aligning these super-human LLMs with human knowledge remains challenging because the supervision signals from human annotators may be wrong. This issue, known as the "super-alignment" problem, requires enhancing weak-to-strong generalization, where a strong LLM must generalize from imperfect supervision provided by a weaker source. To address this issue, we propose an approach to improve weak-to-strong generalization by involving the reliability of weak supervision signals in the alignment process. In our method, we query the weak supervisor for multiple answers, estimate the answer reliability, and enhance the alignment process by filtering out uncertain data or re-weighting reliable data. Experiments on four datasets demonstrate that our methods effectively identify the quality of weak labels and significantly enhance weak-to-strong generalization. Our work presents effective techniques for error-robust model alignment, reducing error propagation from noisy supervision and enhancing the accuracy and reliability of LLMs. Codes are publicly available at http://github.com/Irenehere/ReliableAlignment.
☆ RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulaiton
Utilizing Vision-Language Models (VLMs) for robotic manipulation represents a novel paradigm, aiming to enhance the model's ability to generalize to new objects and instructions. However, due to variations in camera specifications and mounting positions, existing methods exhibit significant performance disparities across different robotic platforms. To address this challenge, we propose RoboUniView in this paper, an innovative approach that decouples visual feature extraction from action learning. We first learn a unified view representation from multi-perspective views by pre-training on readily accessible data, and then derive actions from this unified view representation to control robotic manipulation. This unified view representation more accurately mirrors the physical world and is not constrained by the robotic platform's camera parameters. Thanks to this methodology, we achieve state-of-the-art performance on the demanding CALVIN benchmark, enhancing the success rate in the $D \to D$ setting from 88.7% to 96.2%, and in the $ABC \to D$ setting from 82.4% to 94.2%. Moreover, our model exhibits outstanding adaptability and flexibility: it maintains high performance under unseen camera parameters, can utilize multiple datasets with varying camera parameters, and is capable of joint cross-task learning across datasets. Code is provided for re-implementation. https://github.com/liufanfanlff/RoboUniview
☆ Applying LLMs for Rescoring N-best ASR Hypotheses of Casual Conversations: Effects of Domain Adaptation and Context Carry-over
Large language models (LLMs) have been successfully applied for rescoring automatic speech recognition (ASR) hypotheses. However, their ability to rescore ASR hypotheses of casual conversations has not been sufficiently explored. In this study, we reveal it by performing N-best ASR hypotheses rescoring using Llama2 on the CHiME-7 distant ASR (DASR) task. Llama2 is one of the most representative LLMs, and the CHiME-7 DASR task provides datasets of casual conversations between multiple participants. We investigate the effects of domain adaptation of the LLM and context carry-over when performing N-best rescoring. Experimental results show that, even without domain adaptation, Llama2 outperforms a standard-size domain-adapted Transformer-LM, especially when using a long context. Domain adaptation shortens the context length needed with Llama2 to achieve its best performance, i.e., it reduces the computational cost of Llama2.
comment: 5 pages
☆ UniGen: A Unified Framework for Textual Dataset Generation Using Large Language Models
Large Language Models (LLMs) such as GPT-4 and Llama3 have significantly impacted various fields by enabling high-quality synthetic data generation and reducing dependence on expensive human-generated datasets. Despite this, challenges remain in the areas of generalization, controllability, diversity, and truthfulness within the existing generative frameworks. To address these challenges, this paper presents UniGen, a comprehensive LLM-powered framework designed to produce diverse, accurate, and highly controllable datasets. UniGen is adaptable, supporting all types of text datasets and enhancing the generative process through innovative mechanisms. To augment data diversity, UniGen incorporates an attribute-guided generation module and a group checking feature. For accuracy, it employs a code-based mathematical assessment for label verification alongside a retrieval-augmented generation technique for factual validation. The framework also allows for user-specified constraints, enabling customization of the data generation process to suit particular requirements. Extensive experiments demonstrate the superior quality of data generated by UniGen, and each module within UniGen plays a critical role in this enhancement. Additionally, UniGen is applied in two practical scenarios: benchmarking LLMs and data augmentation. The results indicate that UniGen effectively supports dynamic and evolving benchmarking, and that data augmentation improves LLM capabilities in various domains, including agent-oriented abilities and reasoning skills.
☆ The single-use restriction for register automata and transducers over infinite alphabets
This thesis studies the single-use restriction for register automata and transducers over infinite alphabets. The restriction requires that a read-access to a register should have the side effect of destroying its contents. This constraint results in robust classes of languages and transductions. For automata models, we show that one-way register automata, two-way register automata, and orbit-finite monoids have the same expressive power. For transducer models, we show that single-use Mealy machines and single-use two-way transducers admit versions of the Krohn-Rhodes decomposition theorem. Moreover, single-use Mealy machines are equivalent to an algebraic model called local algebraic semigroup transductions. Additionally, we show that single-use two-way transducers are equivalent to single-use streaming string transducers (SSTs) over infinite alphabets and to regular list functions with atoms. Compared with the previous work arXiv:1907.10504, this thesis offers a coherent narrative on the single-use restriction. We introduce an abstract notion of single-use functions and use them to define all the discussed single-use models. We also introduce and study the algebraic models of local semigroup transduction and local rational semigroup transduction.
comment: PhD Thesis at University of Warsaw. Supervisor: Miko{\l}aj Boja\'nczyk
☆ Enhanced ASR Robustness to Packet Loss with a Front-End Adaptation Network INTERSPEECH 2024
In the realm of automatic speech recognition (ASR), robustness in noisy environments remains a significant challenge. Recent ASR models, such as Whisper, have shown promise, but their efficacy in noisy conditions can be further enhanced. This study is focused on recovering from packet loss to improve the word error rate (WER) of ASR models. We propose using a front-end adaptation network connected to a frozen ASR model. The adaptation network is trained to modify the corrupted input spectrum by minimizing the criteria of the ASR model in addition to an enhancement loss function. Our experiments demonstrate that the adaptation network, trained on Whisper's criteria, notably reduces word error rates across domains and languages in packet-loss scenarios. This improvement is achieved with minimal affect to Whisper model's foundational performance, underscoring our method's practicality and potential in enhancing ASR models in challenging acoustic environments.
comment: Accepted for publication at INTERSPEECH 2024
☆ Selective Vision is the Challenge for Visual Reasoning: A Benchmark for Visual Argument Understanding
Visual arguments, often used in advertising or social causes, rely on images to persuade viewers to do or believe something. Understanding these arguments requires selective vision: only specific visual stimuli within an image are relevant to the argument, and relevance can only be understood within the context of a broader argumentative structure. While visual arguments are readily appreciated by human audiences, we ask: are today's AI capable of similar understanding? We collect and release VisArgs, an annotated corpus designed to make explicit the (usually implicit) structures underlying visual arguments. VisArgs includes 1,611 images accompanied by three types of textual annotations: 5,112 visual premises (with region annotations), 5,574 commonsense premises, and reasoning trees connecting them to a broader argument. We propose three tasks over VisArgs to probe machine capacity for visual argument understanding: localization of premises, identification of premises, and deduction of conclusions. Experiments demonstrate that 1) machines cannot fully identify the relevant visual cues. The top-performing model, GPT-4-O, achieved an accuracy of only 78.5%, whereas humans reached 98.0%. All models showed a performance drop, with an average decrease in accuracy of 19.5%, when the comparison set was changed from objects outside the image to irrelevant objects within the image. Furthermore, 2) this limitation is the greatest factor impacting their performance in understanding visual arguments. Most models improved the most when given relevant visual premises as additional inputs, compared to other inputs, for deducing the conclusion of the visual argument.
comment: 12 pages, 5 figures
☆ Capturing Minds, Not Just Words: Enhancing Role-Playing Language Models with Personality-Indicative Data
Role-playing agents (RPA) have been a popular application area for large language models (LLMs), attracting significant interest from both industry and academia.While existing RPAs well portray the characters' knowledge and tones, they face challenges in capturing their minds, especially for small role-playing language models (RPLMs). In this paper, we propose to enhance RPLMs via personality-indicative data. Specifically, we leverage questions from psychological scales and distill advanced RPAs to generate dialogues that grasp the minds of characters. Experimental results validate that RPLMs trained with our dataset exhibit advanced role-playing capabilities for both general and personality-related evaluations. Code and data are available at \href{https://github.com/alienet1109/RolePersonality}{this URL}.
comment: 10pages
☆ TrustUQA: A Trustful Framework for Unified Structured Data Question Answering
Natural language question answering (QA) over structured data sources such as tables and knowledge graphs (KGs) have been widely investigated, for example with Large Language Models (LLMs). The main solutions include question to formal query parsing and retrieval-based answer generation. However, current methods of the former often suffer from weak generalization, failing to dealing with multiple sources simultaneously, while the later is limited in trustfulness. In this paper, we propose UnifiedTQA, a trustful QA framework that can simultaneously support multiple types of structured data in a unified way. To this end, it adopts an LLM-friendly and unified knowledge representation method called Condition Graph (CG), and uses an LLM and demonstration-based two-level method for CG querying. For enhancement, it is also equipped with dynamic demonstration retrieval. We have evaluated UnifiedTQA with 5 benchmarks covering 3 types of structured data. It outperforms 2 existing unified structured data QA methods and in comparison with the baselines that are specific to a data type, it achieves state-of-the-art on 2 of them. Further more, we demonstrates potential of our method for more general QA tasks, QA over mixed structured data and QA across structured data.
☆ Factor-Conditioned Speaking-Style Captioning
This paper presents a novel speaking-style captioning method that generates diverse descriptions while accurately predicting speaking-style information. Conventional learning criteria directly use original captions that contain not only speaking-style factor terms but also syntax words, which disturbs learning speaking-style information. To solve this problem, we introduce factor-conditioned captioning (FCC), which first outputs a phrase representing speaking-style factors (e.g., gender, pitch, etc.), and then generates a caption to ensure the model explicitly learns speaking-style factors. We also propose greedy-then-sampling (GtS) decoding, which first predicts speaking-style factors deterministically to guarantee semantic accuracy, and then generates a caption based on factor-conditioned sampling to ensure diversity. Experiments show that FCC outperforms the original caption-based training, and with GtS, it generates more diverse captions while keeping style prediction performance.
comment: Accepted to Interspeech 2024
☆ Historia Magistra Vitae: Dynamic Topic Modeling of Roman Literature using Neural Embeddings
Dynamic topic models have been proposed as a tool for historical analysis, but traditional approaches have had limited usefulness, being difficult to configure, interpret, and evaluate. In this work, we experiment with a recent approach for dynamic topic modeling using BERT embeddings. We compare topic models built using traditional statistical models (LDA and NMF) and the BERT-based model, modeling topics over the entire surviving corpus of Roman literature. We find that while quantitative metrics prefer statistical models, qualitative evaluation finds better insights from the neural model. Furthermore, the neural topic model is less sensitive to hyperparameter configuration and thus may make dynamic topic modeling more viable for historical researchers.
comment: 6 pages, 2 figures
☆ Sonnet or Not, Bot? Poetry Evaluation for Large Models and Datasets
Large language models (LLMs) can now generate and recognize text in a wide range of styles and genres, including highly specialized, creative genres like poetry. But what do LLMs really know about poetry? What can they know about poetry? We develop a task to evaluate how well LLMs recognize a specific aspect of poetry, poetic form, for more than 20 forms and formal elements in the English language. Poetic form captures many different poetic features, including rhyme scheme, meter, and word or line repetition. We use this task to reflect on LLMs' current poetic capabilities, as well as the challenges and pitfalls of creating NLP benchmarks for poetry and for other creative tasks. In particular, we use this task to audit and reflect on the poems included in popular pretraining datasets. Our findings have implications for NLP researchers interested in model evaluation, digital humanities and cultural analytics scholars, and cultural heritage professionals.
☆ Can we teach language models to gloss endangered languages?
Interlinear glossed text (IGT) is a popular format in language documentation projects, where each morpheme is labeled with a descriptive annotation. Automating the creation of interlinear glossed text can be desirable to reduce annotator effort and maintain consistency across annotated corpora. Prior research has explored a number of statistical and neural methods for automatically producing IGT. As large language models (LLMs) have showed promising results across multilingual tasks, even for rare, endangered languages, it is natural to wonder whether they can be utilized for the task of generating IGT. We explore whether LLMs can be effective at the task of interlinear glossing with in-context learning, without any traditional training. We propose new approaches for selecting examples to provide in-context, observing that targeted selection can significantly improve performance. We find that LLM-based methods beat standard transformer baselines, despite requiring no training at all. These approaches still underperform state-of-the-art supervised systems for the task, but are highly practical for researchers outside of the NLP community, requiring minimal effort to use.
☆ SSP: Self-Supervised Prompting for Cross-Lingual Transfer to Low-Resource Languages using Large Language Models
Recently, very large language models (LLMs) have shown exceptional performance on several English NLP tasks with just in-context learning (ICL), but their utility in other languages is still underexplored. We investigate their effectiveness for NLP tasks in low-resource languages (LRLs), especially in the setting of zero-labelled cross-lingual transfer (0-CLT), where no labelled training data for the target language is available -- however training data from one or more related medium-resource languages (MRLs) is utilized, alongside the available unlabeled test data for a target language. We introduce Self-Supervised Prompting (SSP), a novel ICL approach tailored for the 0-CLT setting. SSP is based on the key observation that LLMs output more accurate labels if in-context exemplars are from the target language (even if their labels are slightly noisy). To operationalize this, since target language training data is not available in 0-CLT, SSP operates in two stages. In Stage I, using source MRL training data, target language's test data is noisily labeled. In Stage II, these noisy test data points are used as exemplars in ICL for further improved labelling. Additionally, our implementation of SSP uses a novel Integer Linear Programming (ILP)-based exemplar selection that balances similarity, prediction confidence (when available) and label coverage. Experiments on three tasks and eleven LRLs (from three regions) demonstrate that SSP strongly outperforms existing SOTA fine-tuned and prompting-based baselines in 0-CLT setup.
☆ DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment
Recent speech language models (SLMs) typically incorporate pre-trained speech models to extend the capabilities from large language models (LLMs). In this paper, we propose a Descriptive Speech-Text Alignment approach that leverages speech captioning to bridge the gap between speech and text modalities, enabling SLMs to interpret and generate comprehensive natural language descriptions, thereby facilitating the capability to understand both linguistic and non-linguistic features in speech. Enhanced with the proposed approach, our model demonstrates superior performance on the Dynamic-SUPERB benchmark, particularly in generalizing to unseen tasks. Moreover, we discover that the aligned model exhibits a zero-shot instruction-following capability without explicit speech instruction tuning. These findings highlight the potential to reshape instruction-following SLMs by incorporating rich, descriptive speech captions.
comment: Accepted to Interspeech 2024
☆ Efficacy of Language Model Self-Play in Non-Zero-Sum Games
Game-playing agents like AlphaGo have achieved superhuman performance through self-play, which is theoretically guaranteed to yield optimal policies in competitive games. However, most language tasks are partially or fully cooperative, so it is an open question whether techniques like self-play can effectively be used to improve language models. We empirically investigate this question in a negotiation game setting known as Deal or No Deal (DoND). Crucially, the objective in DoND can be modified to produce a fully cooperative game, a strictly competitive one, or anything in between. We finetune language models in self-play over multiple rounds of filtered behavior cloning in DoND for each of these objectives. Contrary to expectations, we find that language model self-play leads to significant performance gains in both cooperation and competition with humans, suggesting that self-play and related techniques have promise despite a lack of theoretical guarantees.
☆ Two-Pronged Human Evaluation of ChatGPT Self-Correction in Radiology Report Simplification
Radiology reports are highly technical documents aimed primarily at doctor-doctor communication. There has been an increasing interest in sharing those reports with patients, necessitating providing them patient-friendly simplifications of the original reports. This study explores the suitability of large language models in automatically generating those simplifications. We examine the usefulness of chain-of-thought and self-correction prompting mechanisms in this domain. We also propose a new evaluation protocol that employs radiologists and laypeople, where radiologists verify the factual correctness of simplifications, and laypeople assess simplicity and comprehension. Our experimental results demonstrate the effectiveness of self-correction prompting in producing high-quality simplifications. Our findings illuminate the preferences of radiologists and laypeople regarding text simplification, informing future research on this topic.
☆ FFN: a Fine-grained Chinese-English Financial Domain Parallel Corpus
Large Language Models (LLMs) have stunningly advanced the field of machine translation, though their effectiveness within the financial domain remains largely underexplored. To probe this issue, we constructed a fine-grained Chinese-English parallel corpus of financial news called FFN. We acquired financial news articles spanning between January 1st, 2014, to December 31, 2023, from mainstream media websites such as CNN, FOX, and China Daily. The dataset consists of 1,013 main text and 809 titles, all of which have been manually corrected. We measured the translation quality of two LLMs -- ChatGPT and ERNIE-bot, utilizing BLEU, TER and chrF scores as the evaluation metrics. For comparison, we also trained an OpenNMT model based on our dataset. We detail problems of LLMs and provide in-depth analysis, intending to stimulate further research and solutions in this largely uncharted territory. Our research underlines the need to optimize LLMs within the specific field of financial translation to ensure accuracy and quality.
comment: a simplified version of this paper is accepted by International Conference on Asian Language Processing 2024
☆ Learning Retrieval Augmentation for Personalized Dialogue Generation EMNLP-2023
Personalized dialogue generation, focusing on generating highly tailored responses by leveraging persona profiles and dialogue context, has gained significant attention in conversational AI applications. However, persona profiles, a prevalent setting in current personalized dialogue datasets, typically composed of merely four to five sentences, may not offer comprehensive descriptions of the persona about the agent, posing a challenge to generate truly personalized dialogues. To handle this problem, we propose $\textbf{L}$earning Retrieval $\textbf{A}$ugmentation for $\textbf{P}$ersonalized $\textbf{D}$ial$\textbf{O}$gue $\textbf{G}$eneration ($\textbf{LAPDOG}$), which studies the potential of leveraging external knowledge for persona dialogue generation. Specifically, the proposed LAPDOG model consists of a story retriever and a dialogue generator. The story retriever uses a given persona profile as queries to retrieve relevant information from the story document, which serves as a supplementary context to augment the persona profile. The dialogue generator utilizes both the dialogue history and the augmented persona profile to generate personalized responses. For optimization, we adopt a joint training framework that collaboratively learns the story retriever and dialogue generator, where the story retriever is optimized towards desired ultimate metrics (e.g., BLEU) to retrieve content for the dialogue generator to generate personalized responses. Experiments conducted on the CONVAI2 dataset with ROCStory as a supplementary data source show that the proposed LAPDOG method substantially outperforms the baselines, indicating the effectiveness of the proposed method. The LAPDOG model code is publicly available for further exploration. https://github.com/hqsiswiliam/LAPDOG
comment: Accepted to EMNLP-2023
☆ OutlierTune: Efficient Channel-Wise Quantization for Large Language Models
Quantizing the activations of large language models (LLMs) has been a significant challenge due to the presence of structured outliers. Most existing methods focus on the per-token or per-tensor quantization of activations, making it difficult to achieve both accuracy and hardware efficiency. To address this problem, we propose OutlierTune, an efficient per-channel post-training quantization (PTQ) method for the activations of LLMs. OutlierTune consists of two components: pre-execution of dequantization and symmetrization. The pre-execution of dequantization updates the model weights by the activation scaling factors, avoiding the internal scaling and costly additional computational overheads brought by the per-channel activation quantization. The symmetrization further reduces the quantization differences arising from the weight updates by ensuring the balanced numerical ranges across different activation channels. OutlierTune is easy to implement and hardware-efficient, introducing almost no additional computational overheads during the inference. Extensive experiments show that the proposed framework outperforms existing methods across multiple different tasks. Demonstrating better generalization, this framework improves the Int6 quantization of the instruction-tuning LLMs, such as OPT-IML, to the same level as half-precision (FP16). Moreover, we have shown that the proposed framework is 1.48x faster than the FP16 implementation while reducing approximately 2x memory usage.
☆ PathAlign: A vision-language model for whole slide images in histopathology
Microscopic interpretation of histopathology images underlies many important diagnostic and treatment decisions. While advances in vision-language modeling raise new opportunities for analysis of such images, the gigapixel-scale size of whole slide images (WSIs) introduces unique challenges. Additionally, pathology reports simultaneously highlight key findings from small regions while also aggregating interpretation across multiple slides, often making it difficult to create robust image-text pairs. As such, pathology reports remain a largely untapped source of supervision in computational pathology, with most efforts relying on region-of-interest annotations or self-supervision at the patch-level. In this work, we develop a vision-language model based on the BLIP-2 framework using WSIs paired with curated text from pathology reports. This enables applications utilizing a shared image-text embedding space, such as text or image retrieval for finding cases of interest, as well as integration of the WSI encoder with a frozen large language model (LLM) for WSI-based generative text capabilities such as report generation or AI-in-the-loop interactions. We utilize a de-identified dataset of over 350,000 WSIs and diagnostic text pairs, spanning a wide range of diagnoses, procedure types, and tissue types. We present pathologist evaluation of text generation and text retrieval using WSI embeddings, as well as results for WSI classification and workflow prioritization (slide-level triaging). Model-generated text for WSIs was rated by pathologists as accurate, without clinically significant error or omission, for 78% of WSIs on average. This work demonstrates exciting potential capabilities for language-aligned WSI embeddings.
comment: 9 main pages and 19 pages of supplemental material; 3 main tables, 3 main figures and 11 supplemental tables, 7 supplemental figures
☆ Voices Unheard: NLP Resources and Models for Yorùbá Regional Dialects
Yor\`ub\'a an African language with roughly 47 million speakers encompasses a continuum with several dialects. Recent efforts to develop NLP technologies for African languages have focused on their standard dialects, resulting in disparities for dialects and varieties for which there are little to no resources or tools. We take steps towards bridging this gap by introducing a new high-quality parallel text and speech corpus YOR\`ULECT across three domains and four regional Yor\`ub\'a dialects. To develop this corpus, we engaged native speakers, travelling to communities where these dialects are spoken, to collect text and speech data. Using our newly created corpus, we conducted extensive experiments on (text) machine translation, automatic speech recognition, and speech-to-text translation. Our results reveal substantial performance disparities between standard Yor\`ub\'a and the other dialects across all tasks. However, we also show that with dialect-adaptive finetuning, we are able to narrow this gap. We believe our dataset and experimental analysis will contribute greatly to developing NLP tools for Yor\`ub\'a and its dialects, and potentially for other African languages, by improving our understanding of existing challenges and offering a high-quality dataset for further development. We release YOR\`ULECT dataset and models publicly under an open license.
☆ Rethinking harmless refusals when fine-tuning foundation models ICLR 2024
In this paper, we investigate the degree to which fine-tuning in Large Language Models (LLMs) effectively mitigates versus merely conceals undesirable behavior. Through the lens of semi-realistic role-playing exercises designed to elicit such behaviors, we explore the response dynamics of LLMs post fine-tuning interventions. Our methodology involves prompting models for Chain-of-Thought (CoT) reasoning and analyzing the coherence between the reasoning traces and the resultant outputs. Notably, we identify a pervasive phenomenon we term \emph{reason-based deception}, where models either stop producing reasoning traces or produce seemingly ethical reasoning traces that belie the unethical nature of their final outputs. We further examine the efficacy of response strategies (polite refusal versus explicit rebuttal) in curbing the occurrence of undesired behavior in subsequent outputs of multi-turn interactions. Our findings reveal that explicit rebuttals significantly outperform polite refusals in preventing the continuation of undesired outputs and nearly eliminate reason-based deception, challenging current practices in model fine-tuning. Accordingly, the two key contributions of this paper are (1) defining and studying reason-based deception, a new type of hidden behavior, and (2) demonstrating that rebuttals provide a more robust response model to harmful requests than refusals, thereby highlighting the need to reconsider the response strategies in fine-tuning approaches.
comment: ICLR 2024 AGI Workshop Poster
☆ Leveraging Machine-Generated Rationales to Facilitate Social Meaning Detection in Conversations
We present a generalizable classification approach that leverages Large Language Models (LLMs) to facilitate the detection of implicitly encoded social meaning in conversations. We design a multi-faceted prompt to extract a textual explanation of the reasoning that connects visible cues to underlying social meanings. These extracted explanations or rationales serve as augmentations to the conversational text to facilitate dialogue understanding and transfer. Our empirical results over 2,340 experimental settings demonstrate the significant positive impact of adding these rationales. Our findings hold true for in-domain classification, zero-shot, and few-shot domain transfer for two different social meaning detection tasks, each spanning two different corpora.
comment: To appear at The Proceedings of the Association for Computational Linguistics, 2024
☆ Demarked: A Strategy for Enhanced Abusive Speech Moderation through Counterspeech, Detoxification, and Message Management
Despite regulations imposed by nations and social media platforms, such as recent EU regulations targeting digital violence, abusive content persists as a significant challenge. Existing approaches primarily rely on binary solutions, such as outright blocking or banning, yet fail to address the complex nature of abusive speech. In this work, we propose a more comprehensive approach called Demarcation scoring abusive speech based on four aspect -- (i) severity scale; (ii) presence of a target; (iii) context scale; (iv) legal scale -- and suggesting more options of actions like detoxification, counter speech generation, blocking, or, as a final measure, human intervention. Through a thorough analysis of abusive speech regulations across diverse jurisdictions, platforms, and research papers we highlight the gap in preventing measures and advocate for tailored proactive steps to combat its multifaceted manifestations. Our work aims to inform future strategies for effectively addressing abusive speech online.
☆ Context Matters: An Empirical Study of the Impact of Contextual Information in Temporal Question Answering Systems
Large language models (LLMs) often struggle with temporal reasoning, crucial for tasks like historical event analysis and time-sensitive information retrieval. Despite advancements, state-of-the-art models falter in handling temporal information, especially when faced with irrelevant or noisy contexts. This paper addresses this gap by empirically examining the robustness of temporal question-answering (TQA) systems trained on various context types, including relevant, irrelevant, slightly altered, and no context. Our findings indicate that training with a mix of these contexts enhances model robustness and accuracy. Additionally, we show that the position of context relative to the question significantly impacts performance, with question-first positioning yielding better results. We introduce two new context-rich TQA datasets, ContextAQA and ContextTQE, and provide comprehensive evaluations and guidelines for training robust TQA models. Our work lays the foundation for developing reliable and context-aware temporal QA systems, with broader implications for enhancing LLM robustness against diverse and potentially adversarial information.
☆ Handling Ontology Gaps in Semantic Parsing
The majority of Neural Semantic Parsing (NSP) models are developed with the assumption that there are no concepts outside the ones such models can represent with their target symbols (closed-world assumption). This assumption leads to generate hallucinated outputs rather than admitting their lack of knowledge. Hallucinations can lead to wrong or potentially offensive responses to users. Hence, a mechanism to prevent this behavior is crucial to build trusted NSP-based Question Answering agents. To that end, we propose the Hallucination Simulation Framework (HSF), a general setting for stimulating and analyzing NSP model hallucinations. The framework can be applied to any NSP task with a closed-ontology. Using the proposed framework and KQA Pro as the benchmark dataset, we assess state-of-the-art techniques for hallucination detection. We then present a novel hallucination detection strategy that exploits the computational graph of the NSP model to detect the NSP hallucinations in the presence of ontology gaps, out-of-domain utterances, and to recognize NSP errors, improving the F1-Score respectively by ~21, ~24% and ~1%. This is the first work in closed-ontology NSP that addresses the problem of recognizing ontology gaps. We release our code and checkpoints at https://github.com/amazon-science/handling-ontology-gaps-in-semantic-parsing.
☆ TocBERT: Medical Document Structure Extraction Using Bidirectional Transformers
Text segmentation holds paramount importance in the field of Natural Language Processing (NLP). It plays an important role in several NLP downstream tasks like information retrieval and document summarization. In this work, we propose a new solution, namely TocBERT, for segmenting texts using bidirectional transformers. TocBERT represents a supervised solution trained on the detection of titles and sub-titles from their semantic representations. This task was formulated as a named entity recognition (NER) problem. The solution has been applied on a medical text segmentation use-case where the Bio-ClinicalBERT model is fine-tuned to segment discharge summaries of the MIMIC-III dataset. The performance of TocBERT has been evaluated on a human-labeled ground truth corpus of 250 notes. It achieved an F1-score of 84.6% when evaluated on a linear text segmentation problem and 72.8% on a hierarchical text segmentation problem. It outperformed a carefully designed rule-based solution, particularly in distinguishing titles from subtitles.
comment: 6 pages, 6 figures
☆ Captioning Visualizations with Large Language Models (CVLLM): A Tutorial
Automatically captioning visualizations is not new, but recent advances in large language models(LLMs) open exciting new possibilities. In this tutorial, after providing a brief review of Information Visualization (InfoVis) principles and past work in captioning, we introduce neural models and the transformer architecture used in generic LLMs. We then discuss their recent applications in InfoVis, with a focus on captioning. Additionally, we explore promising future directions in this field.
comment: 6 pages, 4 figures
☆ Are Generative Language Models Multicultural? A Study on Hausa Culture and Emotions using ChatGPT
Large Language Models (LLMs), such as ChatGPT, are widely used to generate content for various purposes and audiences. However, these models may not reflect the cultural and emotional diversity of their users, especially for low-resource languages. In this paper, we investigate how ChatGPT represents Hausa's culture and emotions. We compare responses generated by ChatGPT with those provided by native Hausa speakers on 37 culturally relevant questions. We conducted experiments using emotion analysis and applied two similarity metrics to measure the alignment between human and ChatGPT responses. We also collected human participants ratings and feedback on ChatGPT responses. Our results show that ChatGPT has some level of similarity to human responses, but also exhibits some gaps and biases in its knowledge and awareness of the Hausa culture and emotions. We discuss the implications and limitations of our methodology and analysis and suggest ways to improve the performance and evaluation of LLMs for low-resource languages.
☆ Investigating How Large Language Models Leverage Internal Knowledge to Perform Complex Reasoning
Despite significant advancements, there is a limited understanding of how large language models (LLMs) utilize knowledge for reasoning. To address this, we propose a method that deconstructs complex real-world questions into a graph, representing each question as a node with parent nodes of background knowledge needed to solve the question. We develop the DepthQA dataset, deconstructing questions into three depths: (i) recalling conceptual knowledge, (ii) applying procedural knowledge, and (iii) analyzing strategic knowledge. Based on a hierarchical graph, we quantify forward discrepancy, discrepancies in LLMs' performance on simpler sub-problems versus complex questions. We also measure backward discrepancy, where LLMs answer complex questions but struggle with simpler ones. Our analysis shows that smaller models have more discrepancies than larger models. Additionally, guiding models from simpler to complex questions through multi-turn interactions improves performance across model sizes, highlighting the importance of structured intermediate steps in knowledge reasoning. This work enhances our understanding of LLM reasoning and suggests ways to improve their problem-solving abilities.
comment: Work in progress; code is available at https://github.com/kaistAI/knowledge-reasoning
☆ Monitoring Latent World States in Language Models with Propositional Probes
Language models are susceptible to bias, sycophancy, backdoors, and other tendencies that lead to unfaithful responses to the input context. Interpreting internal states of language models could help monitor and correct unfaithful behavior. We hypothesize that language models represent their input contexts in a latent world model, and seek to extract this latent world state from the activations. We do so with 'propositional probes', which compositionally probe tokens for lexical information and bind them into logical propositions representing the world state. For example, given the input context ''Greg is a nurse. Laura is a physicist.'', we decode the propositions ''WorksAs(Greg, nurse)'' and ''WorksAs(Laura, physicist)'' from the model's activations. Key to this is identifying a 'binding subspace' in which bound tokens have high similarity (''Greg'' and ''nurse'') but unbound ones do not (''Greg'' and ''physicist''). We validate propositional probes in a closed-world setting with finitely many predicates and properties. Despite being trained on simple templated contexts, propositional probes generalize to contexts rewritten as short stories and translated to Spanish. Moreover, we find that in three settings where language models respond unfaithfully to the input context -- prompt injections, backdoor attacks, and gender bias -- the decoded propositions remain faithful. This suggests that language models often encode a faithful world model but decode it unfaithfully, which motivates the search for better interpretability tools for monitoring LMs.
☆ Knowledge acquisition for dialogue agents using reinforcement learning on graph representations
We develop an artificial agent motivated to augment its knowledge base beyond its initial training. The agent actively participates in dialogues with other agents, strategically acquiring new information. The agent models its knowledge as an RDF knowledge graph, integrating new beliefs acquired through conversation. Responses in dialogue are generated by identifying graph patterns around these new integrated beliefs. We show that policies can be learned using reinforcement learning to select effective graph patterns during an interaction, without relying on explicit user feedback. Within this context, our study is a proof of concept for leveraging users as effective sources of information.
☆ Inclusivity in Large Language Models: Personality Traits and Gender Bias in Scientific Abstracts
Large language models (LLMs) are increasingly utilized to assist in scientific and academic writing, helping authors enhance the coherence of their articles. Previous studies have highlighted stereotypes and biases present in LLM outputs, emphasizing the need to evaluate these models for their alignment with human narrative styles and potential gender biases. In this study, we assess the alignment of three prominent LLMs - Claude 3 Opus, Mistral AI Large, and Gemini 1.5 Flash - by analyzing their performance on benchmark text-generation tasks for scientific abstracts. We employ the Linguistic Inquiry and Word Count (LIWC) framework to extract lexical, psychological, and social features from the generated texts. Our findings indicate that, while these models generally produce text closely resembling human authored content, variations in stylistic features suggest significant gender biases. This research highlights the importance of developing LLMs that maintain a diversity of writing styles to promote inclusivity in academic discourse.
☆ Development and Evaluation of a Retrieval-Augmented Generation Tool for Creating SAPPhIRE Models of Artificial Systems
Representing systems using the SAPPhIRE causality model is found useful in supporting design-by-analogy. However, creating a SAPPhIRE model of artificial or biological systems is an effort-intensive process that requires human experts to source technical knowledge from multiple technical documents regarding how the system works. This research investigates how to leverage Large Language Models (LLMs) in creating structured descriptions of systems using the SAPPhIRE model of causality. This paper, the second part of the two-part research, presents a new Retrieval-Augmented Generation (RAG) tool for generating information related to SAPPhIRE constructs of artificial systems and reports the results from a preliminary evaluation of the tool's success - focusing on the factual accuracy and reliability of outcomes.
☆ LoPT: Low-Rank Prompt Tuning for Parameter Efficient Language Models
In prompt tuning, a prefix or suffix text is added to the prompt, and the embeddings (soft prompts) or token indices (hard prompts) of the prefix/suffix are optimized to gain more control over language models for specific tasks. This approach eliminates the need for hand-crafted prompt engineering or explicit model fine-tuning. Prompt tuning is significantly more parameter-efficient than model fine-tuning, as it involves optimizing partial inputs of language models to produce desired outputs. In this work, we aim to further reduce the amount of trainable parameters required for a language model to perform well on specific tasks. We propose Low-rank Prompt Tuning (LoPT), a low-rank model for prompts that achieves efficient prompt optimization. The proposed method demonstrates similar outcomes to full parameter prompt tuning while reducing the number of trainable parameters by a factor of 5. It also provides promising results compared to the state-of-the-art methods that would require 10 to 20 times more parameters.
☆ xTower: A Multilingual LLM for Explaining and Correcting Translation Errors
While machine translation (MT) systems are achieving increasingly strong performance on benchmarks, they often produce translations with errors and anomalies. Understanding these errors can potentially help improve the translation quality and user experience. This paper introduces xTower, an open large language model (LLM) built on top of TowerBase designed to provide free-text explanations for translation errors in order to guide the generation of a corrected translation. The quality of the generated explanations by xTower are assessed via both intrinsic and extrinsic evaluation. We ask expert translators to evaluate the quality of the explanations across two dimensions: relatedness towards the error span being explained and helpfulness in error understanding and improving translation quality. Extrinsically, we test xTower across various experimental setups in generating translation corrections, demonstrating significant improvements in translation quality. Our findings highlight xTower's potential towards not only producing plausible and helpful explanations of automatic translations, but also leveraging them to suggest corrected translations.
☆ Sparse Regression for Machine Translation
We use transductive regression techniques to learn mappings between source and target features of given parallel corpora and use these mappings to generate machine translation outputs. We show the effectiveness of $L_1$ regularized regression (\textit{lasso}) to learn the mappings between sparsely observed feature sets versus $L_2$ regularized regression. Proper selection of training instances plays an important role to learn correct feature mappings within limited computational resources and at expected accuracy levels. We introduce \textit{dice} instance selection method for proper selection of training instances, which plays an important role to learn correct feature mappings for improving the source and target coverage of the training set. We show that $L_1$ regularized regression performs better than $L_2$ regularized regression both in regression measurements and in the translation experiments using graph decoding. We present encouraging results when translating from German to English and Spanish to English. We also demonstrate results when the phrase table of a phrase-based decoder is replaced with the mappings we find with the regression model.
comment: 8 pages, 4 figures, 4 tables
☆ Changing Answer Order Can Decrease MMLU Accuracy
As large language models (LLMs) have grown in prevalence, particular benchmarks have become essential for the evaluation of these models and for understanding model capabilities. Most commonly, we use test accuracy averaged across multiple subtasks in order to rank models on leaderboards, to determine which model is best for our purposes. In this paper, we investigate the robustness of the accuracy measurement on a widely used multiple choice question answering dataset, MMLU. When shuffling the answer label contents, we find that all explored models decrease in accuracy on MMLU, but not every model is equally sensitive. These findings suggest a possible adjustment to the standard practice of leaderboard testing, where we additionally consider the percentage of examples each model answers correctly by random chance.
comment: Short paper, 9 pages
☆ Can Large Language Models Generate High-quality Patent Claims?
Large language models (LLMs) have shown exceptional performance across various text generation tasks but remain under-explored in the patent domain, which offers highly structured and precise language. This paper constructs a dataset to investigate the performance of current LLMs in patent claim generation. Our results demonstrate that generating claims based on patent descriptions outperforms previous research relying on abstracts. Interestingly, current patent-specific LLMs perform much worse than state-of-the-art general LLMs, highlighting the necessity for future research on in-domain LLMs. We also find that LLMs can produce high-quality first independent claims, but their performances markedly decrease for subsequent dependent claims. Moreover, fine-tuning can enhance the completeness of inventions' features, conceptual clarity, and feature linkage. Among the tested LLMs, GPT-4 demonstrates the best performance in comprehensive human evaluations by patent experts, with better feature coverage, conceptual clarity, and technical coherence. Despite these capabilities, comprehensive revision and modification are still necessary to pass rigorous patent scrutiny and ensure legal robustness.
comment: 13 pages
♻ ☆ Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Large language models (LLMs) show inherent brittleness in their safety mechanisms, as evidenced by their susceptibility to jailbreaking and even non-malicious fine-tuning. This study explores this brittleness of safety alignment by leveraging pruning and low-rank modifications. We develop methods to identify critical regions that are vital for safety guardrails, and that are disentangled from utility-relevant regions at both the neuron and rank levels. Surprisingly, the isolated regions we find are sparse, comprising about $3\%$ at the parameter level and $2.5\%$ at the rank level. Removing these regions compromises safety without significantly impacting utility, corroborating the inherent brittleness of the model's safety mechanisms. Moreover, we show that LLMs remain vulnerable to low-cost fine-tuning attacks even when modifications to the safety-critical regions are restricted. These findings underscore the urgent need for more robust safety strategies in LLMs.
comment: 22 pages, 9 figures. Project page is available at https://boyiwei.com/alignment-attribution/
♻ ☆ VDebugger: Harnessing Execution Feedback for Debugging Visual Programs
Visual programs are executable code generated by large language models to address visual reasoning problems. They decompose complex questions into multiple reasoning steps and invoke specialized models for each step to solve the problems. However, these programs are prone to logic errors, with our preliminary evaluation showing that 58% of the total errors are caused by program logic errors. Debugging complex visual programs remains a major bottleneck for visual reasoning. To address this, we introduce VDebugger, a novel critic-refiner framework trained to localize and debug visual programs by tracking execution step by step. VDebugger identifies and corrects program errors leveraging detailed execution feedback, improving interpretability and accuracy. The training data is generated through an automated pipeline that injects errors into correct visual programs using a novel mask-best decoding technique. Evaluations on six datasets demonstrate VDebugger's effectiveness, showing performance improvements of up to 3.2% in downstream task accuracy. Further studies show VDebugger's ability to generalize to unseen tasks, bringing a notable improvement of 2.3% on the unseen COVR task. Code, data and models are made publicly available at https://github.com/shirley-wu/vdebugger/
comment: update reference
♻ ☆ WebCanvas: Benchmarking Web Agents in Online Environments
For web agents to be practically useful, they must adapt to the continuously evolving web environment characterized by frequent updates to user interfaces and content. However, most existing benchmarks only capture the static aspects of the web. To bridge this gap, we introduce WebCanvas, an innovative online evaluation framework for web agents that effectively addresses the dynamic nature of web interactions. WebCanvas contains three main components to facilitate realistic assessments: (1) A novel evaluation metric which reliably capture critical intermediate actions or states necessary for task completions while disregarding noise caused by insignificant events or changed web-elements. (2) A benchmark dataset called Mind2Web-Live, a refined version of original Mind2Web static dataset containing 542 tasks with 2439 intermediate evaluation states; (3) Lightweight and generalizable annotation tools and testing pipelines that enables the community to collect and maintain the high-quality, up-to-date dataset. Building on WebCanvas, we open-source an agent framework with extensible modules for reasoning, providing a foundation for the community to conduct online inference and evaluations. Our best-performing agent achieves a task success rate of 23.1% and a task completion rate of 48.8% on the Mind2Web-Live test set. Additionally, we analyze the performance discrepancies across various websites, domains, and experimental environments. We encourage the community to contribute further insights on online agent evaluation, thereby advancing this field of research.
comment: Our platform, tool and dataset are publically available at https://www.imean.ai/web-canvas/ and https://huggingface.co/datasets/iMeanAI/Mind2Web-Live/
♻ ☆ Step-On-Feet Tuning: Scaling Self-Alignment of LLMs via Bootstrapping
Self-alignment is an effective way to reduce the cost of human annotation while ensuring promising model capability. However, most current methods complete the data collection and training steps in a single round, which may overlook the continuously improving ability of self-aligned models. This gives rise to a key query: What if we do multi-time bootstrapping self-alignment? Does this strategy enhance model performance or lead to rapid degradation? In this paper, our pioneering exploration delves into the impact of bootstrapping self-alignment on large language models. Our findings reveal that bootstrapping self-alignment markedly surpasses the single-round approach, by guaranteeing data diversity from in-context learning. To further exploit the capabilities of bootstrapping, we investigate and adjust the training order of data, which yields improved performance of the model. Drawing on these findings, we propose Step-On-Feet Tuning (SOFT) which leverages model's continuously enhanced few-shot ability to boost zero or one-shot performance. Based on easy-to-hard training recipe, we propose SOFT+ which further boost self-alignment's performance. Our experiments demonstrate the efficiency of SOFT (SOFT+) across various classification and generation tasks, highlighting the potential of bootstrapping self-alignment on continually enhancing model alignment performance.
♻ ☆ Thermometer: Towards Universal Calibration for Large Language Models ICML 2024
We consider the issue of calibration in large language models (LLM). Recent studies have found that common interventions such as instruction tuning often result in poorly calibrated LLMs. Although calibration is well-explored in traditional applications, calibrating LLMs is uniquely challenging. These challenges stem as much from the severe computational requirements of LLMs as from their versatility, which allows them to be applied to diverse tasks. Addressing these challenges, we propose THERMOMETER, a calibration approach tailored to LLMs. THERMOMETER learns an auxiliary model, given data from multiple tasks, for calibrating a LLM. It is computationally efficient, preserves the accuracy of the LLM, and produces better-calibrated responses for new tasks. Extensive empirical evaluations across various benchmarks demonstrate the effectiveness of the proposed method.
comment: Camera ready version for ICML 2024
♻ ☆ MuTox: Universal MUltilingual Audio-based TOXicity Dataset and Zero-shot Detector
Research in toxicity detection in natural language processing for the speech modality (audio-based) is quite limited, particularly for languages other than English. To address these limitations and lay the groundwork for truly multilingual audio-based toxicity detection, we introduce MuTox, the first highly multilingual audio-based dataset with toxicity labels. The dataset comprises 20,000 audio utterances for English and Spanish, and 4,000 for the other 19 languages. To demonstrate the quality of this dataset, we trained the MuTox audio-based toxicity classifier, which enables zero-shot toxicity detection across a wide range of languages. This classifier outperforms existing text-based trainable classifiers by more than 1% AUC, while expanding the language coverage more than tenfold. When compared to a wordlist-based classifier that covers a similar number of languages, MuTox improves precision and recall by approximately 2.5 times. This significant improvement underscores the potential of MuTox in advancing the field of audio-based toxicity detection.
♻ ☆ MetaGPT: Merging Large Language Models Using Model Exclusive Task Arithmetic
The advent of large language models (LLMs) like GPT-4 has catalyzed the exploration of multi-task learning (MTL), in which a single model demonstrates proficiency across diverse tasks. Task arithmetic has emerged as a cost-effective approach for MTL. It enables performance enhancement across multiple tasks by adding their corresponding task vectors to a pre-trained model. However, the current lack of a method that can simultaneously achieve optimal performance, computational efficiency, and data privacy limits their application to LLMs. In this paper, we propose \textbf{M}odel \textbf{E}xclusive \textbf{T}ask \textbf{A}rithmetic for merging \textbf{GPT}-scale models, which formalizes the objective of model merging into a multi-task learning framework, aiming to minimize the average loss difference between the merged model and each individual task model. Since data privacy limits the use of multi-task training data, we leverage LLMs' local linearity and task vectors' orthogonality to separate the data term and scaling coefficients term and derive a model-exclusive task arithmetic method. Our proposed MetaGPT is data-agnostic and bypasses the heavy search process, making it cost-effective and easy to implement for LLMs.Extensive experiments demonstrate that MetaGPT leads to improvements in task arithmetic and achieves state-of-the-art performance on multiple tasks.
comment: 19 pages
♻ ☆ CLERC: A Dataset for Legal Case Retrieval and Retrieval-Augmented Analysis Generation
Legal professionals need to write analyses that rely on citations to relevant precedents, i.e., previous case decisions. Intelligent systems assisting legal professionals in writing such documents provide great benefits but are challenging to design. Such systems need to help locate, summarize, and reason over salient precedents in order to be useful. To enable systems for such tasks, we work with legal professionals to transform a large open-source legal corpus into a dataset supporting two important backbone tasks: information retrieval (IR) and retrieval-augmented generation (RAG). This dataset CLERC (Case Law Evaluation Retrieval Corpus), is constructed for training and evaluating models on their ability to (1) find corresponding citations for a given piece of legal analysis and to (2) compile the text of these citations (as well as previous context) into a cogent analysis that supports a reasoning goal. We benchmark state-of-the-art models on CLERC, showing that current approaches still struggle: GPT-4o generates analyses with the highest ROUGE F-scores but hallucinates the most, while zero-shot IR models only achieve 48.3% recall@1000.
♻ ☆ Assessing the nature of large language models: A caution against anthropocentrism
Generative AI models garnered a large amount of public attention and speculation with the release of OpenAIs chatbot, ChatGPT. At least two opinion camps exist: one excited about possibilities these models offer for fundamental changes to human tasks, and another highly concerned about power these models seem to have. To address these concerns, we assessed several LLMs, primarily GPT 3.5, using standard, normed, and validated cognitive and personality measures. For this seedling project, we developed a battery of tests that allowed us to estimate the boundaries of some of these models capabilities, how stable those capabilities are over a short period of time, and how they compare to humans. Our results indicate that LLMs are unlikely to have developed sentience, although its ability to respond to personality inventories is interesting. GPT3.5 did display large variability in both cognitive and personality measures over repeated observations, which is not expected if it had a human-like personality. Variability notwithstanding, LLMs display what in a human would be considered poor mental health, including low self-esteem, marked dissociation from reality, and in some cases narcissism and psychopathy, despite upbeat and helpful responses.
comment: 31 pages, 6 figures
♻ ☆ AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator
Artificial intelligence has significantly advanced healthcare, particularly through large language models (LLMs) that excel in medical question answering benchmarks. However, their real-world clinical application remains limited due to the complexities of doctor-patient interactions. To address this, we introduce \textbf{AI Hospital}, a multi-agent framework simulating dynamic medical interactions between \emph{Doctor} as player and NPCs including \emph{Patient}, \emph{Examiner}, \emph{Chief Physician}. This setup allows for realistic assessments of LLMs in clinical scenarios. We develop the Multi-View Medical Evaluation (MVME) benchmark, utilizing high-quality Chinese medical records and NPCs to evaluate LLMs' performance in symptom collection, examination recommendations, and diagnoses. Additionally, a dispute resolution collaborative mechanism is proposed to enhance diagnostic accuracy through iterative discussions. Despite improvements, current LLMs exhibit significant performance gaps in multi-turn interactions compared to one-step approaches. Our findings highlight the need for further research to bridge these gaps and improve LLMs' clinical diagnostic capabilities. Our data, code, and experimental results are all open-sourced at \url{https://github.com/LibertFan/AI_Hospital}.
comment: https://github.com/LibertFan/AI_Hospital
♻ ☆ Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA ACL 2024
Multipanel images, commonly seen as web screenshots, posters, etc., pervade our daily lives. These images, characterized by their composition of multiple subfigures in distinct layouts, effectively convey information to people. Toward building advanced multimodal AI applications, such as agents that understand complex scenes and navigate through webpages, the skill of multipanel visual reasoning is essential, and a comprehensive evaluation of models in this regard is important. Therefore, we introduce Multipanel Visual Question Answering (MultipanelVQA), a novel benchmark comprising 6,600 triplets of questions, answers, and multipanel images that specifically challenge models in comprehending multipanel images. Our evaluation shows that questions in the MultipanelVQA benchmark pose significant challenges to the state-of-the-art Multimodal Large Language Models (MLLMs) tested, even though humans can attain approximately 99% accuracy on these questions. Distinctively, the MultipanelVQA benchmark features synthetically generated multipanel images specifically crafted to isolate and assess the impact of various factors, such as the layout, on MLLMs' multipanel image comprehension abilities. As a result, in addition to benchmarking the capabilities of MLLMs in understanding multipanel images, we analyze various factors of the multipanel image that affect MLLMs' performance with synthetic data and offer insights for enhancement. Code and data are released at https://sites.google.com/view/multipanelvqa/home.
comment: ACL 2024
♻ ☆ Unified Active Retrieval for Retrieval Augmented Generation
In Retrieval-Augmented Generation (RAG), retrieval is not always helpful and applying it to every instruction is sub-optimal. Therefore, determining whether to retrieve is crucial for RAG, which is usually referred to as Active Retrieval. However, existing active retrieval methods face two challenges: 1. They usually rely on a single criterion, which struggles with handling various types of instructions. 2. They depend on specialized and highly differentiated procedures, and thus combining them makes the RAG system more complicated and leads to higher response latency. To address these challenges, we propose Unified Active Retrieval (UAR). UAR contains four orthogonal criteria and casts them into plug-and-play classification tasks, which achieves multifaceted retrieval timing judgements with negligible extra inference cost. We further introduce the Unified Active Retrieval Criteria (UAR-Criteria), designed to process diverse active retrieval scenarios through a standardized procedure. Experiments on four representative types of user instructions show that UAR significantly outperforms existing work on the retrieval timing judgement and the performance of downstream tasks, which shows the effectiveness of UAR and its helpfulness to downstream tasks.
♻ ☆ ReFT: Reasoning with Reinforced Fine-Tuning ACL 2024
One way to enhance the reasoning capability of Large Language Models (LLMs) is to conduct Supervised Fine-Tuning (SFT) using Chain-of-Thought (CoT) annotations. This approach does not show sufficiently strong generalization ability, however, because the training only relies on the given CoT data. In math problem-solving, for example, there is usually only one annotated reasoning path for each question in the training data. Intuitively, it would be better for the algorithm to learn from multiple annotated reasoning paths given a question. To address this issue, we propose a simple yet effective approach called Reinforced Fine-Tuning (ReFT) to enhance the generalizability of learning LLMs for reasoning, with math problem-solving as an example. ReFT first warmups the model with SFT, and then employs on-line reinforcement learning, specifically the PPO algorithm in this paper, to further fine-tune the model, where an abundance of reasoning paths are automatically sampled given the question and the rewards are naturally derived from the ground-truth answers. Extensive experiments on GSM8K, MathQA, and SVAMP datasets show that ReFT significantly outperforms SFT, and the performance can be potentially further boosted by combining inference-time strategies such as majority voting and re-ranking. Note that ReFT obtains the improvement by learning from the same training questions as SFT, without relying on extra or augmented training questions. This indicates a superior generalization ability for ReFT.
comment: ACL 2024 main conference; adjust with reviewer comments; 13 pages
♻ ☆ Token-level Direct Preference Optimization
Fine-tuning pre-trained Large Language Models (LLMs) is essential to align them with human values and intentions. This process often utilizes methods like pairwise comparisons and KL divergence against a reference LLM, focusing on the evaluation of full answers generated by the models. However, the generation of these responses occurs in a token level, following a sequential, auto-regressive fashion. In this paper, we introduce Token-level Direct Preference Optimization (TDPO), a novel approach to align LLMs with human preferences by optimizing policy at the token level. Unlike previous methods, which face challenges in divergence efficiency, TDPO incorporates forward KL divergence constraints for each token, improving alignment and diversity. Utilizing the Bradley-Terry model for a token-based reward system, TDPO enhances the regulation of KL divergence, while preserving simplicity without the need for explicit reward modeling. Experimental results across various text tasks demonstrate TDPO's superior performance in balancing alignment with generation diversity. Notably, fine-tuning with TDPO strikes a better balance than DPO in the controlled sentiment generation and single-turn dialogue datasets, and significantly improves the quality of generated responses compared to both DPO and PPO-based RLHF methods. Our code is open-sourced at https://github.com/Vance0124/Token-level-Direct-Preference-Optimization.
♻ ☆ MedCalc-Bench: Evaluating Large Language Models for Medical Calculations
As opposed to evaluating computation and logic-based reasoning, current benchmarks for evaluating large language models (LLMs) in medicine are primarily focused on question-answering involving domain knowledge and descriptive reasoning. While such qualitative capabilities are vital to medical diagnosis, in real-world scenarios, doctors frequently use clinical calculators that follow quantitative equations and rule-based reasoning paradigms for evidence-based decision support. To this end, we propose MedCalc-Bench, a first-of-its-kind dataset focused on evaluating the medical calculation capability of LLMs. MedCalc-Bench contains an evaluation set of over 1000 manually reviewed instances from 55 different medical calculation tasks. Each instance in MedCalc-Bench consists of a patient note, a question requesting to compute a specific medical value, a ground truth answer, and a step-by-step explanation showing how the answer is obtained. While our evaluation results show the potential of LLMs in this area, none of them are effective enough for clinical settings. Common issues include extracting the incorrect entities, not using the correct equation or rules for a calculation task, or incorrectly performing the arithmetic for the computation. We hope our study highlights the quantitative knowledge and reasoning gaps in LLMs within medical settings, encouraging future improvements of LLMs for various clinical calculation tasks.
comment: Github link: https://github.com/ncbi-nlp/MedCalc-Bench HuggingFace link: https://huggingface.co/datasets/nsk7153/MedCalc-Bench
♻ ☆ Daisy-TTS: Simulating Wider Spectrum of Emotions via Prosody Embedding Decomposition
We often verbally express emotions in a multifaceted manner, they may vary in their intensities and may be expressed not just as a single but as a mixture of emotions. This wide spectrum of emotions is well-studied in the structural model of emotions, which represents variety of emotions as derivative products of primary emotions with varying degrees of intensity. In this paper, we propose an emotional text-to-speech design to simulate a wider spectrum of emotions grounded on the structural model. Our proposed design, Daisy-TTS, incorporates a prosody encoder to learn emotionally-separable prosody embedding as a proxy for emotion. This emotion representation allows the model to simulate: (1) Primary emotions, as learned from the training samples, (2) Secondary emotions, as a mixture of primary emotions, (3) Intensity-level, by scaling the emotion embedding, and (4) Emotions polarity, by negating the emotion embedding. Through a series of perceptual evaluations, Daisy-TTS demonstrated overall higher emotional speech naturalness and emotion perceiveability compared to the baseline.
comment: Project Page: https://rendchevi.github.io/daisy-tts; Updates: (1) Fixed typos, missing references, and layout, (2) Revise explanation on emotion classifier or discriminator
♻ ☆ CLIMATELI: Evaluating Entity Linking on Climate Change Data ACL 2024
Climate Change (CC) is a pressing topic of global importance, attracting increasing attention across research fields, from social sciences to Natural Language Processing (NLP). CC is also discussed in various settings and communication platforms, from academic publications to social media forums. Understanding who and what is mentioned in such data is a first critical step to gaining new insights into CC. We present CLIMATELI (CLIMATe Entity LInking), the first manually annotated CC dataset that links 3,087 entity spans to Wikipedia. Using CLIMATELI (CLIMATe Entity LInking), we evaluate existing entity linking (EL) systems on the CC topic across various genres and propose automated filtering methods for CC entities. We find that the performance of EL models notably lags behind humans at both token and entity levels. Testing within the scope of retaining or excluding non-nominal and/or non-CC entities particularly impacts the models' performances.
comment: 8 pages, accepted at ClimateNLP 2024 workshop @ ACL 2024
♻ ☆ Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models
Assessing the effectiveness of large language models (LLMs) in addressing diverse tasks is essential for comprehending their strengths and weaknesses. Conventional evaluation techniques typically apply a single prompting strategy uniformly across datasets, not considering the varying degrees of task complexity. We introduce the Hierarchical Prompting Taxonomy (HPT), a taxonomy that employs a Hierarchical Prompt Framework (HPF) composed of five unique prompting strategies, arranged from the simplest to the most complex, to assess LLMs more precisely and to offer a clearer perspective. This taxonomy assigns a score, called the Hierarchical Prompting Score (HP-Score), to datasets as well as LLMs based on the rules of the taxonomy, providing a nuanced understanding of their ability to solve diverse tasks and offering a universal measure of task complexity. Additionally, we introduce the Adaptive Hierarchical Prompt framework, which automates the selection of appropriate prompting strategies for each task. This study compares manual and adaptive hierarchical prompt frameworks using four instruction-tuned LLMs, namely Llama 3 8B, Phi 3 3.8B, Mistral 7B, and Gemma 7B, across four datasets: BoolQ, CommonSenseQA (CSQA), IWSLT-2017 en-fr (IWSLT), and SamSum. Experiments demonstrate the effectiveness of HPT, providing a reliable way to compare different tasks and LLM capabilities. This paper leads to the development of a universal evaluation metric that can be used to evaluate both the complexity of the datasets and the capabilities of LLMs. The implementation of both manual HPF and adaptive HPF is publicly available.
♻ ☆ QUB-Cirdan at "Discharge Me!": Zero shot discharge letter generation by open-source LLM
The BioNLP ACL'24 Shared Task on Streamlining Discharge Documentation aims to reduce the administrative burden on clinicians by automating the creation of critical sections of patient discharge letters. This paper presents our approach using the Llama3 8B quantized model to generate the "Brief Hospital Course" and "Discharge Instructions" sections. We employ a zero-shot method combined with Retrieval-Augmented Generation (RAG) to produce concise, contextually accurate summaries. Our contributions include the development of a curated template-based approach to ensure reliability and consistency, as well as the integration of RAG for word count prediction. We also describe several unsuccessful experiments to provide insights into our pathway for the competition. Our results demonstrate the effectiveness and efficiency of our approach, achieving high scores across multiple evaluation metrics.
comment: BioNLP 2024 workshop
♻ ☆ VLSM-Adapter: Finetuning Vision-Language Segmentation Efficiently with Lightweight Blocks MICCAI 2024
Foundation Vision-Language Models (VLMs) trained using large-scale open-domain images and text pairs have recently been adapted to develop Vision-Language Segmentation Models (VLSMs) that allow providing text prompts during inference to guide image segmentation. If robust and powerful VLSMs can be built for medical images, it could aid medical professionals in many clinical tasks where they must spend substantial time delineating the target structure of interest. VLSMs for medical images resort to fine-tuning base VLM or VLSM pretrained on open-domain natural image datasets due to fewer annotated medical image datasets; this fine-tuning is resource-consuming and expensive as it usually requires updating all or a significant fraction of the pretrained parameters. Recently, lightweight blocks called adapters have been proposed in VLMs that keep the pretrained model frozen and only train adapters during fine-tuning, substantially reducing the computing resources required. We introduce a novel adapter, VLSM-Adapter, that can fine-tune pretrained vision-language segmentation models using transformer encoders. Our experiments in widely used CLIP-based segmentation models show that with only 3 million trainable parameters, the VLSM-Adapter outperforms state-of-the-art and is comparable to the upper bound end-to-end fine-tuning. The source code is available at: https://github.com/naamiinepal/vlsm-adapter.
comment: Accepted at MICCAI 2024, the 27th International Conference on Medical Image Computing and Computer Assisted Intervention
♻ ☆ How to Handle Different Types of Out-of-Distribution Scenarios in Computational Argumentation? A Comprehensive and Fine-Grained Field Study
The advent of pre-trained Language Models (LMs) has markedly advanced natural language processing, but their efficacy in out-of-distribution (OOD) scenarios remains a significant challenge. Computational argumentation (CA), modeling human argumentation processes, is a field notably impacted by these challenges because complex annotation schemes and high annotation costs naturally lead to resources barely covering the multiplicity of available text sources and topics. Due to this data scarcity, generalization to data from uncovered covariant distributions is a common challenge for CA tasks like stance detection or argument classification. This work systematically assesses LMs' capabilities for such OOD scenarios. While previous work targets specific OOD types like topic shifts or OOD uniformly, we address three prevalent OOD scenarios in CA: topic shift, domain shift, and language shift. Our findings challenge the previously asserted general superiority of in-context learning (ICL) for OOD. We find that the efficacy of such learning paradigms varies with the type of OOD. Specifically, while ICL excels for domain shifts, prompt-based fine-tuning surpasses for topic shifts. To sum up, we navigate the heterogeneity of OOD scenarios in CA and empirically underscore the potential of base-sized LMs in overcoming these challenges.
♻ ☆ Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL
Generating accurate SQL according to natural language questions (text-to-SQL) is a long-standing challenge due to the complexities involved in user question understanding, database schema comprehension, and SQL generation. Conventional text-to-SQL systems, comprising human engineering and deep neural networks, have made substantial progress. Subsequently, pre-trained language models (PLMs) have been developed and utilized for text-to-SQL tasks, achieving promising performance. As modern databases become more complex, the corresponding user questions also grow more challenging, leading PLMs with limited comprehension capabilities to produce incorrect SQL. This necessitates more sophisticated and tailored optimization methods for PLMs, which, in turn, restricts the applications of PLM-based systems. Most recently, large language models (LLMs) have demonstrated significant capabilities in natural language understanding as the model scale remains increasing. Therefore, integrating the LLM-based implementation can bring unique opportunities, improvements, and solutions to text-to-SQL research. In this survey, we present a comprehensive review of LLM-based text-to-SQL. Specifically, we propose a brief overview of the technical challenges and the evolutionary process of text-to-SQL. Then, we provide a detailed introduction to the datasets and metrics designed to evaluate text-to-SQL systems. After that, we present a systematic analysis of recent advances in LLM-based text-to-SQL. Finally, we discuss the remaining challenges in this field and propose expectations for future research directions.
♻ ☆ WsiCaption: Multiple Instance Generation of Pathology Reports for Gigapixel Whole-Slide Images
Whole slide images are the foundation of digital pathology for the diagnosis and treatment of carcinomas. Writing pathology reports is laborious and error-prone for inexperienced pathologists. To reduce the workload and improve clinical automation, we investigate how to generate pathology reports given whole slide images. On the data end, we curated the largest WSI-text dataset (PathText). In specific, we collected nearly 10000 high-quality WSI-text pairs for visual-language models by recognizing and cleaning pathology reports which narrate diagnostic slides in TCGA. On the model end, we propose the multiple instance generative model (MI-Gen) which can produce pathology reports for gigapixel WSIs. We benchmark our model on the largest subset of TCGA-PathoText. Experimental results show our model can generate pathology reports which contain multiple clinical clues and achieve competitive performance on certain slide-level tasks. We observe that simple semantic extraction from the pathology reports can achieve the best performance (0.838 of F1 score) on BRCA subtyping surpassing previous state-of-the-art approaches. Our collected dataset and related code are available.
♻ ☆ Beyond Under-Alignment: Atomic Preference Enhanced Factuality Tuning for Large Language Models
Large language models (LLMs) have achieved remarkable success but still tend to generate factually erroneous responses, a phenomenon known as hallucination. A recent trend is to use preference learning to fine-tune models to align with factuality. However, existing work primarily evaluates fine-tuned models on in-domain (ID) datasets and the factuality on out-of-domain (OOD) datasets remains underexplored. In this paper, we conduct a comprehensive evaluation of the factuality of different models tuned by various preference learning algorithms and demonstrate that their performance on OOD datasets either increases minimally or decreases. Subsequently, we reveal that the main cause of model's failure to uphold factuality under a distribution shift is \textbf{under-alignment}, rather than \textbf{over-alignment}, by analyzing the token distribution shift of the models before and after tuning. Finally, we propose \textbf{APEFT} (\textbf{A}tomic \textbf{P}reference \textbf{E}nhanced \textbf{F}actuality \textbf{T}uning), a framework that enhances model's awareness of factuality at the granularity of individual facts. Extensive experiments demonstrate that APEFT improves model performance by an average of $\boldsymbol{3.45\%}$ on both ID and OOD datasets, which is highly effective.
♻ ☆ Weak Reward Model Transforms Generative Models into Robust Causal Event Extraction Systems
The inherent ambiguity of cause and effect boundaries poses a challenge in evaluating causal event extraction tasks. Traditional metrics like Exact Match and BertScore poorly reflect model performance, so we trained evaluation models to approximate human evaluation, achieving high agreement. We used them to perform Reinforcement Learning with extraction models to align them with human preference, prioritising semantic understanding. We successfully explored our approach through multiple datasets, including transferring an evaluator trained on one dataset to another as a way to decrease the reliance on human-annotated data. In that vein, we also propose a weak-to-strong supervision method that uses a fraction of the annotated data to train an evaluation model while still achieving high performance in training an RL model. Our code is available at https://github.com/oyarsa/event_extraction/tree/causal-event-extraction.
comment: 13 pages, 6 figures, 6 tables
♻ ☆ CrAM: Credibility-Aware Attention Modification in LLMs for Combating Misinformation in RAG
Retrieval-Augmented Generation (RAG) can alleviate hallucinations of Large Language Models (LLMs) by referencing external documents. However, the misinformation in external documents may mislead LLMs' generation. To address this issue, we explore the task of "credibility-aware RAG", in which LLMs automatically adjust the influence of retrieved documents based on their credibility scores to counteract misinformation. To this end, we introduce a plug-and-play method named $\textbf{Cr}$edibility-aware $\textbf{A}$ttention $\textbf{M}$odification (CrAM). CrAM identifies influential attention heads in LLMs and adjusts their attention weights based on the credibility of the documents, thereby reducing the impact of low-credibility documents. Experiments on Natual Questions and TriviaQA using Llama2-13B, Llama3-8B, and Qwen-7B show that CrAM improves the RAG performance of LLMs against misinformation pollution by over 20%, even surpassing supervised fine-tuning methods.
comment: Under review
♻ ☆ Exploring Defeasibility in Causal Reasoning ACL 2024
Defeasibility in causal reasoning implies that the causal relationship between cause and effect can be strengthened or weakened. Namely, the causal strength between cause and effect should increase or decrease with the incorporation of strengthening arguments (supporters) or weakening arguments (defeaters), respectively. However, existing works ignore defeasibility in causal reasoning and fail to evaluate existing causal strength metrics in defeasible settings. In this work, we present $\delta$-CAUSAL, the first benchmark dataset for studying defeasibility in causal reasoning. $\delta$-CAUSAL includes around 11K events spanning ten domains, featuring defeasible causality pairs, i.e., cause-effect pairs accompanied by supporters and defeaters. We further show current causal strength metrics fail to reflect the change of causal strength with the incorporation of supporters or defeaters in $\delta$-CAUSAL. To this end, we propose CESAR (Causal Embedding aSsociation with Attention Rating), a metric that measures causal strength based on token-level causal relationships. CESAR achieves a significant 69.7% relative improvement over existing metrics, increasing from 47.2% to 80.1% in capturing the causal strength change brought by supporters and defeaters. We further demonstrate even Large Language Models (LLMs) like GPT-3.5 still lag 4.5 and 10.7 points behind humans in generating supporters and defeaters, emphasizing the challenge posed by $\delta$-CAUSAL.
comment: Accepted by ACL 2024 (Findings)
♻ ☆ Leveraging Synthetic Audio Data for End-to-End Low-Resource Speech Translation
This paper describes our system submission to the International Conference on Spoken Language Translation (IWSLT 2024) for Irish-to-English speech translation. We built end-to-end systems based on Whisper, and employed a number of data augmentation techniques, such as speech back-translation and noise augmentation. We investigate the effect of using synthetic audio data and discuss several methods for enriching signal diversity.
comment: IWSLT 2024
♻ ☆ 1000 African Voices: Advancing inclusive multi-speaker multi-accent speech synthesis
Recent advances in speech synthesis have enabled many useful applications like audio directions in Google Maps, screen readers, and automated content generation on platforms like TikTok. However, these systems are mostly dominated by voices sourced from data-rich geographies with personas representative of their source data. Although 3000 of the world's languages are domiciled in Africa, African voices and personas are under-represented in these systems. As speech synthesis becomes increasingly democratized, it is desirable to increase the representation of African English accents. We present Afro-TTS, the first pan-African accented English speech synthesis system able to generate speech in 86 African accents, with 1000 personas representing the rich phonological diversity across the continent for downstream application in Education, Public Health, and Automated Content Creation. Speaker interpolation retains naturalness and accentedness, enabling the creation of new voices.
comment: Accepted at Interspeech 2024
♻ ☆ Continual Learning Under Language Shift
The recent increase in data and model scale for language model pre-training has led to huge training costs. In scenarios where new data become available over time, updating a model instead of fully retraining it would therefore provide significant gains. We study the pros and cons of updating a language model when new data comes from new languages -- the case of continual learning under language shift. Starting from a monolingual English language model, we incrementally add data from Danish, Icelandic, and Norwegian to investigate how forward and backward transfer effects depend on pre-training order and characteristics of languages, for three different model sizes. Our results show that, while forward transfer is largely positive and independent of language order, backward transfer can be positive or negative depending on the order and characteristics of new languages. We explore a number of potentially explanatory factors and find that a combination of language contamination and syntactic similarity best fits our results.
comment: Accepted to TSD 2024
♻ ☆ Accelerating Complex Disease Treatment through Network Medicine and GenAI: A Case Study on Drug Repurposing for Breast Cancer
The objective of this research is to introduce a network specialized in predicting drugs that can be repurposed by investigating real-world evidence sources, such as clinical trials and biomedical literature. Specifically, it aims to generate drug combination therapies for complex diseases (e.g., cancer, Alzheimer's). We present a multilayered network medicine approach, empowered by a highly configured ChatGPT prompt engineering system, which is constructed on the fly to extract drug mentions in clinical trials. Additionally, we introduce a novel algorithm that connects real-world evidence with disease-specific signaling pathways (e.g., KEGG database). This sheds light on the repurposability of drugs if they are found to bind with one or more protein constituents of a signaling pathway. To demonstrate, we instantiated the framework for breast cancer and found that, out of 46 breast cancer signaling pathways, the framework identified 38 pathways that were covered by at least two drugs. This evidence signals the potential for combining those drugs. Specifically, the most covered signaling pathway, ID hsa:2064, was covered by 108 drugs, some of which can be combined. Conversely, the signaling pathway ID hsa:1499 was covered by only two drugs, indicating a significant gap for further research. Our network medicine framework, empowered by GenAI, shows promise in identifying drug combinations with a high degree of specificity, knowing the exact signaling pathways and proteins that serve as targets. It is noteworthy that ChatGPT successfully accelerated the process of identifying drug mentions in clinical trials, though further investigations are required to determine the relationships among the drug mentions.
comment: 9 pages double columns, 5 figures, 3 algorithms, 3 tables, and 1 listing, Submitted to IEEE MedAI'24 Conference, to be held November 15-17, Chongqing, China
♻ ☆ Efficient Continual Pre-training by Mitigating the Stability Gap
Continual pre-training has increasingly become the predominant approach for adapting Large Language Models (LLMs) to new domains. This process involves updating the pre-trained LLM with a corpus from a new domain, resulting in a shift in the training distribution. To study the behavior of LLMs during this shift, we measured the model's performance throughout the continual pre-training process. we observed a temporary performance drop at the beginning, followed by a recovery phase, a phenomenon known as the "stability gap," previously noted in vision models classifying new classes. To address this issue and enhance LLM performance within a fixed compute budget, we propose three effective strategies: (1) Continually pre-training the LLM on a subset with a proper size for multiple epochs, resulting in faster performance recovery than pre-training the LLM on a large corpus in a single epoch; (2) Pre-training the LLM only on high-quality sub-corpus, which rapidly boosts domain performance; and (3) Using a data mixture similar to the pre-training data to reduce distribution gap. We conduct various experiments on Llama-family models to validate the effectiveness of our strategies in both medical continual pre-training and instruction tuning. For example, our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget and enhance the average general task performance without causing forgetting. Furthermore, we apply our strategies to the Llama-3-8B model. The resulting model, Llama-3-Physician, achieves the best medical performance among current open-source models, and performs comparably to or even better than GPT-4 on several medical benchmarks. We release our models at \url{https://huggingface.co/YiDuo1999/Llama-3-Physician-8B-Instruct}.
♻ ☆ mHuBERT-147: A Compact Multilingual HuBERT Model
We present mHuBERT-147, the first general-purpose massively multilingual HuBERT speech representation model trained on 90K hours of clean, open-license data. To scale up the multi-iteration HuBERT approach, we use faiss-based clustering, achieving 5.2x faster label assignment than the original method. We also apply a new multilingual batching up-sampling strategy, leveraging both language and dataset diversity. After 3 training iterations, our compact 95M parameter mHuBERT-147 outperforms larger models trained on substantially more data. We rank second and first on the ML-SUPERB 10min and 1h leaderboards, with SOTA scores for 3 tasks. Across ASR/LID tasks, our model consistently surpasses XLS-R (300M params; 436K hours) and demonstrates strong competitiveness against the much larger MMS (1B params; 491K hours). Our findings indicate that mHuBERT-147 is a promising model for multilingual speech tasks, offering an unprecedented balance between high performance and parameter efficiency.
comment: Extended version of the Interspeech 2024 paper of same name
♻ ☆ Focus on Your Question! Interpreting and Mitigating Toxic CoT Problems in Commonsense Reasoning ACL 2024
Large language models exhibit high-level commonsense reasoning abilities, especially with enhancement methods like Chain-of-Thought (CoT). However, we find these CoT-like methods lead to a considerable number of originally correct answers turning wrong, which we define as the Toxic CoT problem. To interpret and mitigate this problem, we first utilize attribution tracing and causal tracing methods to probe the internal working mechanism of the LLM during CoT reasoning. Through comparisons, we prove that the model exhibits information loss from the question over the shallow attention layers when generating rationales or answers. Based on the probing findings, we design a novel method called RIDERS (Residual decodIng and sERial-position Swap), which compensates for the information deficit in the model from both decoding and serial-position perspectives. Through extensive experiments on multiple commonsense reasoning benchmarks, we validate that this method not only significantly eliminates Toxic CoT problems (decreased by 23.6%), but also effectively improves the model's overall commonsense reasoning performance (increased by 5.5%).
comment: Accepted as a long paper to ACL 2024 Main, 25 pages, 22 figures
♻ ☆ GCRE-GPT: A Generative Model for Comparative Relation Extraction
Given comparative text, comparative relation extraction aims to extract two targets (\eg two cameras) in comparison and the aspect they are compared for (\eg image quality). The extracted comparative relations form the basis of further opinion analysis.Existing solutions formulate this task as a sequence labeling task, to extract targets and aspects. However, they cannot directly extract comparative relation(s) from text. In this paper, we show that comparative relations can be directly extracted with high accuracy, by generative model. Based on GPT-2, we propose a Generation-based Comparative Relation Extractor (GCRE-GPT). Experiment results show that \modelname achieves state-of-the-art accuracy on two datasets.
comment: 6 pages, 6 tables, 1 figure
♻ ☆ Concentrate Attention: Towards Domain-Generalizable Prompt Optimization for Language Models NeurIPS 2024
Recent advances in prompt optimization have notably enhanced the performance of pre-trained language models (PLMs) on downstream tasks. However, the potential of optimized prompts on domain generalization has been under-explored. To explore the nature of prompt generalization on unknown domains, we conduct pilot experiments and find that (i) Prompts gaining more attention weight from PLMs' deep layers are more generalizable and (ii) Prompts with more stable attention distributions in PLMs' deep layers are more generalizable. Thus, we offer a fresh objective towards domain-generalizable prompts optimization named "Concentration", which represents the "lookback" attention from the current decoding token to the prompt tokens, to increase the attention strength on prompts and reduce the fluctuation of attention distribution. We adapt this new objective to popular soft prompt and hard prompt optimization methods, respectively. Extensive experiments demonstrate that our idea improves comparison prompt optimization methods by 1.42% for soft prompt generalization and 2.16% for hard prompt generalization in accuracy on the multi-source domain generalization setting, while maintaining satisfying in-domain performance. The promising results validate the effectiveness of our proposed prompt optimization objective and provide key insights into domain-generalizable prompts.
comment: Submitted to NeurIPS 2024, Preprint, Under review
♻ ☆ M4GT-Bench: Evaluation Benchmark for Black-Box Machine-Generated Text Detection
The advent of Large Language Models (LLMs) has brought an unprecedented surge in machine-generated text (MGT) across diverse channels. This raises legitimate concerns about its potential misuse and societal implications. The need to identify and differentiate such content from genuine human-generated text is critical in combating disinformation, preserving the integrity of education and scientific fields, and maintaining trust in communication. In this work, we address this problem by introducing a new benchmark based on a multilingual, multi-domain, and multi-generator corpus of MGTs -- M4GT-Bench. The benchmark is compiled of three tasks: (1) mono-lingual and multi-lingual binary MGT detection; (2) multi-way detection where one need to identify, which particular model generated the text; and (3) mixed human-machine text detection, where a word boundary delimiting MGT from human-written content should be determined. On the developed benchmark, we have tested several MGT detection baselines and also conducted an evaluation of human performance. We see that obtaining good performance in MGT detection usually requires an access to the training data from the same domain and generators. The benchmark is available at https://github.com/mbzuai-nlp/M4GT-Bench.
comment: 29 pages
♻ ☆ Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions
Recent advancements in Large Language Models (LLMs) have showcased striking results on existing logical reasoning benchmarks, with some models even surpassing human performance. However, the true depth of their competencies and robustness in reasoning tasks remains an open question. To this end, in this paper, we focus on two popular reasoning tasks: arithmetic reasoning and code generation. Particularly, we introduce: (i) a general ontology of perturbations for maths and coding questions, (ii) a semi-automatic method to apply these perturbations, and (iii) two datasets, MORE and CORE, respectively, of perturbed maths and coding problems to probe the limits of LLM capabilities in numeric reasoning and coding tasks. Through comprehensive evaluations of both closed-source and open-source LLMs, we show a significant performance drop across all the models against the perturbed questions, suggesting that the current LLMs lack robust problem solving skills and structured reasoning abilities in many areas, as defined by our ontology. We open source the datasets and source codes at: https://github.com/declare-lab/llm_robustness.
♻ ☆ GlossLM: Multilingual Pretraining for Low-Resource Interlinear Glossing ACL
Language documentation projects often involve the creation of annotated text in a format such as interlinear glossed text (IGT), which captures fine-grained morphosyntactic analyses in a morpheme-by-morpheme format. However, there are few existing resources providing large amounts of standardized, easily accessible IGT data, limiting their applicability to linguistic research, and making it difficult to use such data in NLP modeling. We compile the largest existing corpus of IGT data from a variety of sources, covering over 450k examples across 1.8k languages, to enable research on crosslingual transfer and IGT generation. We normalize much of our data to follow a standard set of labels across languages. Furthermore, we explore the task of automatically generating IGT in order to aid documentation projects. As many languages lack sufficient monolingual data, we pretrain a large multilingual model on our corpus. We demonstrate the utility of this model by finetuning it on monolingual corpora, outperforming SOTA models by up to 6.6%. We will make our pretrained model and dataset available through Hugging Face, as well as provide access through a web interface for use in language documentation efforts.
comment: 19 pages, 7 figures Submitted to ACL ARR June 2024. First two authors are equal contribution
♻ ☆ Enhancing Text-based Knowledge Graph Completion with Zero-Shot Large Language Models: A Focus on Semantic Enhancement
The design and development of text-based knowledge graph completion (KGC) methods leveraging textual entity descriptions are at the forefront of research. These methods involve advanced optimization techniques such as soft prompts and contrastive learning to enhance KGC models. The effectiveness of text-based methods largely hinges on the quality and richness of the training data. Large language models (LLMs) can utilize straightforward prompts to alter text data, thereby enabling data augmentation for KGC. Nevertheless, LLMs typically demand substantial computational resources. To address these issues, we introduce a framework termed constrained prompts for KGC (CP-KGC). This CP-KGC framework designs prompts that adapt to different datasets to enhance semantic richness. Additionally, CP-KGC employs a context constraint strategy to effectively identify polysemous entities within KGC datasets. Through extensive experimentation, we have verified the effectiveness of this framework. Even after quantization, the LLM (Qwen-7B-Chat-int4) still enhances the performance of text-based KGC methods \footnote{Code and datasets are available at \href{https://github.com/sjlmg/CP-KGC}{https://github.com/sjlmg/CP-KGC}}. This study extends the performance limits of existing models and promotes further integration of KGC with LLMs.
comment: new version
♻ ☆ NutePrune: Efficient Progressive Pruning with Numerous Teachers for Large Language Models
The considerable size of Large Language Models (LLMs) presents notable deployment challenges, particularly on resource-constrained hardware. Structured pruning, offers an effective means to compress LLMs, thereby reducing storage costs and enhancing inference speed for more efficient utilization. In this work, we study data-efficient and resource-efficient structure pruning methods to obtain smaller yet still powerful models. Knowledge Distillation is well-suited for pruning, as the intact model can serve as an excellent teacher for pruned students. However, it becomes challenging in the context of LLMs due to memory constraints. To address this, we propose an efficient progressive Numerous-teacher pruning method (NutePrune). NutePrune mitigates excessive memory costs by loading only one intact model and integrating it with various masks and LoRA modules, enabling it to seamlessly switch between teacher and student roles. This approach allows us to leverage numerous teachers with varying capacities to progressively guide the pruned model, enhancing overall performance. Extensive experiments across various tasks demonstrate the effectiveness of NutePrune. In LLaMA-7B zero-shot experiments, NutePrune retains 97.17% of the performance of the original model at 20% sparsity and 95.07% at 25% sparsity. Our code is available at https://github.com/Lucius-lsr/NutePrune.
♻ ☆ Metric Dimension and Resolvability of Jaccard Spaces
A subset of points in a metric space is said to resolve it if each point in the space is uniquely characterized by its distance to each point in the subset. In particular, resolving sets can be used to represent points in abstract metric spaces as Euclidean vectors. Importantly, due to the triangle inequality, points close by in the space are represented as vectors with similar coordinates, which may find applications in classification problems of symbolic objects under suitably chosen metrics. In this manuscript, we address the resolvability of Jaccard spaces, i.e., metric spaces of the form $(2^X,\text{Jac})$, where $2^X$ is the power set of a finite set $X$, and $\text{Jac}$ is the Jaccard distance between subsets of $X$. Specifically, for different $a,b\in 2^X$, $\text{Jac}(a,b)=|a\Delta b|/|a\cup b|$, where $|\cdot|$ denotes size (i.e., cardinality) and $\Delta$ denotes the symmetric difference of sets. We combine probabilistic and linear algebra arguments to construct highly likely but nearly optimal (i.e., of minimal size) resolving sets of $(2^X,\text{Jac})$. In particular, we show that the metric dimension of $(2^X,\text{Jac})$, i.e., the minimum size of a resolving set of this space, is $\Theta(|X|/\ln|X|)$. In addition, we show that a much smaller subset of $2^X$ suffices to resolve, with high probability, all different pairs of subsets of $X$ of cardinality at most $\sqrt{|X|}/\ln|X|$, up to a factor.
comment: 13 pages, 1 table
♻ ☆ Hierarchical Context Pruning: Optimizing Real-World Code Completion with Repository-Level Pretrained Code LLMs
Some recently developed code large language models (Code LLMs) have been pre-trained on repository-level code data (Repo-Code LLMs), enabling these models to recognize repository structures and utilize cross-file information for code completion. However, in real-world development scenarios, simply concatenating the entire code repository often exceeds the context window limits of these Repo-Code LLMs, leading to significant performance degradation. In this study, we conducted extensive preliminary experiments and analyses on six Repo-Code LLMs. The results indicate that maintaining the topological dependencies of files and increasing the code file content in the completion prompts can improve completion accuracy; pruning the specific implementations of functions in all dependent files does not significantly reduce the accuracy of completions. Based on these findings, we proposed a strategy named Hierarchical Context Pruning (HCP) to construct completion prompts with high informational code content. The HCP models the code repository at the function level, maintaining the topological dependencies between code files while removing a large amount of irrelevant code content, significantly reduces the input length for repository-level code completion. We applied the HCP strategy in experiments with six Repo-Code LLMs, and the results demonstrate that our proposed method can significantly enhance completion accuracy while substantially reducing the length of input. Our code and data are available at https://github.com/Hambaobao/HCP-Coder.
♻ ☆ EHRNoteQA: An LLM Benchmark for Real-World Clinical Practice Using Discharge Summaries
Discharge summaries in Electronic Health Records (EHRs) are crucial for clinical decision-making, but their length and complexity make information extraction challenging, especially when dealing with accumulated summaries across multiple patient admissions. Large Language Models (LLMs) show promise in addressing this challenge by efficiently analyzing vast and complex data. Existing benchmarks, however, fall short in properly evaluating LLMs' capabilities in this context, as they typically focus on single-note information or limited topics, failing to reflect the real-world inquiries required by clinicians. To bridge this gap, we introduce EHRNoteQA, a novel benchmark built on the MIMIC-IV EHR, comprising 962 different QA pairs each linked to distinct patients' discharge summaries. Every QA pair is initially generated using GPT-4 and then manually reviewed and refined by three clinicians to ensure clinical relevance. EHRNoteQA includes questions that require information across multiple discharge summaries and covers eight diverse topics, mirroring the complexity and diversity of real clinical inquiries. We offer EHRNoteQA in two formats: open-ended and multi-choice question answering, and propose a reliable evaluation method for each. We evaluate 27 LLMs using EHRNoteQA and examine various factors affecting the model performance (e.g., the length and number of discharge summaries). Furthermore, to validate EHRNoteQA as a reliable proxy for expert evaluations in clinical practice, we measure the correlation between the LLM performance on EHRNoteQA, and the LLM performance manually evaluated by clinicians. Results show that LLM performance on EHRNoteQA have higher correlation with clinician-evaluated performance (Spearman: 0.78, Kendall: 0.62) compared to other benchmarks, demonstrating its practical relevance in evaluating LLMs in clinical settings.
comment: Under Review
♻ ☆ Can Large Language Model Summarizers Adapt to Diverse Scientific Communication Goals? ACL 2024
In this work, we investigate the controllability of large language models (LLMs) on scientific summarization tasks. We identify key stylistic and content coverage factors that characterize different types of summaries such as paper reviews, abstracts, and lay summaries. By controlling stylistic features, we find that non-fine-tuned LLMs outperform humans in the MuP review generation task, both in terms of similarity to reference summaries and human preferences. Also, we show that we can improve the controllability of LLMs with keyword-based classifier-free guidance (CFG) while achieving lexical overlap comparable to strong fine-tuned baselines on arXiv and PubMed. However, our results also indicate that LLMs cannot consistently generate long summaries with more than 8 sentences. Furthermore, these models exhibit limited capacity to produce highly abstractive lay summaries. Although LLMs demonstrate strong generic summarization competency, sophisticated content control without costly fine-tuning remains an open problem for domain-specific applications.
comment: ACL 2024 camera ready
♻ ☆ Large Language Models for Cuffless Blood Pressure Measurement From Wearable Biosignals
Large language models (LLMs) have captured significant interest from both academia and industry due to their impressive performance across various textual tasks. However, the potential of LLMs to analyze physiological time-series data remains an emerging research field. Particularly, there is a notable gap in the utilization of LLMs for analyzing wearable biosignals to achieve cuffless blood pressure (BP) measurement, which is critical for the management of cardiovascular diseases. This paper presents the first work to explore the capacity of LLMs to perform cuffless BP estimation based on wearable biosignals. We extracted physiological features from electrocardiogram (ECG) and photoplethysmogram (PPG) signals and designed context-enhanced prompts by combining these features with BP domain knowledge and user information. Subsequently, we adapted LLMs to BP estimation tasks through fine-tuning. To evaluate the proposed approach, we conducted assessments of ten advanced LLMs using a comprehensive public dataset of wearable biosignals from 1,272 participants. The experimental results demonstrate that the optimally fine-tuned LLM significantly surpasses conventional task-specific baselines, achieving an estimation error of 0.00 $\pm$ 9.25 mmHg for systolic BP and 1.29 $\pm$ 6.37 mmHg for diastolic BP. Notably, the ablation studies highlight the benefits of our context enhancement strategy, leading to an 8.9% reduction in mean absolute error for systolic BP estimation. This paper pioneers the exploration of LLMs for cuffless BP measurement, providing a potential solution to enhance the accuracy of cuffless BP measurement.
♻ ☆ EVALALIGN: Supervised Fine-Tuning Multimodal LLMs with Human-Aligned Data for Evaluating Text-to-Image Models
The recent advancements in text-to-image generative models have been remarkable. Yet, the field suffers from a lack of evaluation metrics that accurately reflect the performance of these models, particularly lacking fine-grained metrics that can guide the optimization of the models. In this paper, we propose EvalAlign, a metric characterized by its accuracy, stability, and fine granularity. Our approach leverages the capabilities of Multimodal Large Language Models (MLLMs) pre-trained on extensive datasets. We develop evaluation protocols that focus on two key dimensions: image faithfulness and text-image alignment. Each protocol comprises a set of detailed, fine-grained instructions linked to specific scoring options, enabling precise manual scoring of the generated images. We Supervised Fine-Tune (SFT) the MLLM to align closely with human evaluative judgments, resulting in a robust evaluation model. Our comprehensive tests across 24 text-to-image generation models demonstrate that EvalAlign not only provides superior metric stability but also aligns more closely with human preferences than existing metrics, confirming its effectiveness and utility in model assessment.
comment: Github Repository: https://github.com/SAIS-FUXI/EvalAlign
♻ ☆ MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases ICML 2024
This paper addresses the growing need for efficient large language models (LLMs) on mobile devices, driven by increasing cloud costs and latency concerns. We focus on designing top-quality LLMs with fewer than a billion parameters, a practical choice for mobile deployment. Contrary to prevailing belief emphasizing the pivotal role of data and parameter quantity in determining model quality, our investigation underscores the significance of model architecture for sub-billion scale LLMs. Leveraging deep and thin architectures, coupled with embedding sharing and grouped-query attention mechanisms, we establish a strong baseline network denoted as MobileLLM, which attains a remarkable 2.7%/4.3% accuracy boost over preceding 125M/350M state-of-the-art models. Additionally, we propose an immediate block-wise weight-sharing approach with no increase in model size and only marginal latency overhead. The resultant models, denoted as MobileLLM-LS, demonstrate a further accuracy enhancement of 0.7%/0.8% than MobileLLM 125M/350M. Moreover, MobileLLM model family shows significant improvements compared to previous sub-billion models on chat benchmarks, and demonstrates close correctness to LLaMA-v2 7B in API calling tasks, highlighting the capability of small models for common on-device use cases.
comment: ICML 2024. Code is available at https://github.com/facebookresearch/MobileLLM
♻ ☆ Can Large Language Models Follow Concept Annotation Guidelines? A Case Study on Scientific and Financial Domains ACL 2024
Although large language models (LLMs) exhibit remarkable capacity to leverage in-context demonstrations, it is still unclear to what extent they can learn new concepts or facts from ground-truth labels. To address this question, we examine the capacity of instruction-tuned LLMs to follow in-context concept guidelines for sentence labeling tasks. We design guidelines that present different types of factual and counterfactual concept definitions, which are used as prompts for zero-shot sentence classification tasks. Our results show that although concept definitions consistently help in task performance, only the larger models (with 70B parameters or more) have limited ability to work under counterfactual contexts. Importantly, only proprietary models such as GPT-3.5 and GPT-4 can recognize nonsensical guidelines, which we hypothesize is due to more sophisticated alignment methods. Finally, we find that Falcon-180B-chat is outperformed by Llama-2-70B-chat is most cases, which indicates that careful fine-tuning is more effective than increasing model scale. Altogether, our simple evaluation method reveals significant gaps in concept understanding between the most capable open-source language models and the leading proprietary APIs.
comment: ACL 2024 camera ready
♻ ☆ Linear Cross-Lingual Mapping of Sentence Embeddings ACL
Semantics of a sentence is defined with much less ambiguity than semantics of a single word, and we assume that it should be better preserved by translation to another language. If multilingual sentence embeddings intend to represent sentence semantics, then the similarity between embeddings of any two sentences must be invariant with respect to translation. Based on this suggestion, we consider a simple linear cross-lingual mapping as a possible improvement of the multilingual embeddings. We also consider deviation from orthogonality conditions as a measure of deficiency of the embeddings.
comment: Accepted to ACL Findings 2024
♻ ☆ MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification
To advance the evaluation of multimodal math reasoning in large multimodal models (LMMs), this paper introduces a novel benchmark, MM-MATH. MM-MATH consists of 5,929 open-ended middle school math problems with visual contexts, with fine-grained classification across difficulty, grade level, and knowledge points. Unlike existing benchmarks relying on binary answer comparison, MM-MATH incorporates both outcome and process evaluations. Process evaluation employs LMM-as-a-judge to automatically analyze solution steps, identifying and categorizing errors into specific error types. Extensive evaluation of ten models on MM-MATH reveals significant challenges for existing LMMs, highlighting their limited utilization of visual information and struggles with higher-difficulty problems. The best-performing model achieves only 31% accuracy on MM-MATH, compared to 82% for humans. This highlights the challenging nature of our benchmark for existing models and the significant gap between the multimodal reasoning capabilities of current models and humans. Our process evaluation reveals that diagram misinterpretation is the most common error, accounting for more than half of the total error cases, underscoring the need for improved image comprehension in multimodal reasoning.
comment: It has changed a lot from the previous version and needs to set up a new one
♻ ☆ Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation
The scaling laws and extraordinary performance of large foundation models motivate the development and utilization of such models in biomedicine. However, despite early promising results on some biomedical benchmarks, there are still major challenges that need to be addressed before these models can be used in real-world clinics. Frontier general-domain models such as GPT-4V still have significant performance gaps in multimodal biomedical applications. More importantly, less-acknowledged pragmatic issues, including accessibility, model cost, and tedious manual evaluation make it hard for clinicians to use state-of-the-art large models directly on private patient data. Here, we explore training open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology. To maximize data efficiency, we adopt a modular approach by incorporating state-of-the-art pre-trained models for image and text modalities, and focusing on training a lightweight adapter to ground each modality to the text embedding space, as exemplified by LLaVA-Med. For training, we assemble a large dataset of over 697 thousand radiology image-text pairs. For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation. For best practice, we conduct a systematic ablation study on various choices in data engineering and multimodal training. The resulting LlaVA-Rad (7B) model attains state-of-the-art results on standard radiology tasks such as report generation and cross-modal retrieval, even outperforming much larger models such as GPT-4V and Med-PaLM M (84B). The inference of LlaVA-Rad is fast and can be performed on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
♻ ☆ A Large Language Model Approach to Educational Survey Feedback Analysis
This paper assesses the potential for the large language models (LLMs) GPT-4 and GPT-3.5 to aid in deriving insight from education feedback surveys. Exploration of LLM use cases in education has focused on teaching and learning, with less exploration of capabilities in education feedback analysis. Survey analysis in education involves goals such as finding gaps in curricula or evaluating teachers, often requiring time-consuming manual processing of textual responses. LLMs have the potential to provide a flexible means of achieving these goals without specialized machine learning models or fine-tuning. We demonstrate a versatile approach to such goals by treating them as sequences of natural language processing (NLP) tasks including classification (multi-label, multi-class, and binary), extraction, thematic analysis, and sentiment analysis, each performed by LLM. We apply these workflows to a real-world dataset of 2500 end-of-course survey comments from biomedical science courses, and evaluate a zero-shot approach (i.e., requiring no examples or labeled training data) across all tasks, reflecting education settings, where labeled data is often scarce. By applying effective prompting practices, we achieve human-level performance on multiple tasks with GPT-4, enabling workflows necessary to achieve typical goals. We also show the potential of inspecting LLMs' chain-of-thought (CoT) reasoning for providing insight that may foster confidence in practice. Moreover, this study features development of a versatile set of classification categories, suitable for various course types (online, hybrid, or in-person) and amenable to customization. Our results suggest that LLMs can be used to derive a range of insights from survey text.
♻ ☆ Methodology of Adapting Large English Language Models for Specific Cultural Contexts
The rapid growth of large language models(LLMs) has emerged as a prominent trend in the field of artificial intelligence. However, current state-of-the-art LLMs are predominantly based on English. They encounter limitations when directly applied to tasks in specific cultural domains, due to deficiencies in domain-specific knowledge and misunderstandings caused by differences in cultural values. To address this challenge, our paper proposes a rapid adaptation method for large models in specific cultural contexts, which leverages instruction-tuning based on specific cultural knowledge and safety values data. Taking Chinese as the specific cultural context and utilizing the LLaMA3-8B as the experimental English LLM, the evaluation results demonstrate that the adapted LLM significantly enhances its capabilities in domain-specific knowledge and adaptability to safety values, while maintaining its original expertise advantages.
comment: 11 pages, 2 figures
♻ ☆ Symbolic Prompt Program Search: A Structure-Aware Approach to Efficient Compile-Time Prompt Optimization
In many modern LLM applications, such as retrieval augmented generation, prompts have become programs themselves. In these settings, prompt programs are repeatedly called with different user queries or data instances. A big practical challenge is optimizing such prompt programs. Recent work has mostly focused on either simple prompt programs or assumed that the general structure of a prompt program is fixed. We introduce SAMMO, a framework to perform symbolic prompt program search for compile-time optimizations of prompt programs. SAMMO represents prompt programs on a symbolic level which allows for a rich set of transformations that can be searched over during optimization. We show that SAMMO generalizes previous methods and improves the performance of complex prompts on (1) instruction tuning, (2) RAG pipeline tuning, and (3) prompt compression, across several different LLMs. We make all code available open-source at https://github.com/microsoft/sammo .
♻ ☆ LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models ACL 2024
Efficient fine-tuning is vital for adapting large language models (LLMs) to downstream tasks. However, it requires non-trivial efforts to implement these methods on different models. We present LlamaFactory, a unified framework that integrates a suite of cutting-edge efficient training methods. It provides a solution for flexibly customizing the fine-tuning of 100+ LLMs without the need for coding through the built-in web UI LlamaBoard. We empirically validate the efficiency and effectiveness of our framework on language modeling and text generation tasks. It has been released at https://github.com/hiyouga/LLaMA-Factory and received over 25,000 stars and 3,000 forks.
comment: 13 pages, accepted to ACL 2024 System Demonstration Track
♻ ☆ Fine-Tuning BERTs for Definition Extraction from Mathematical Text
In this paper, we fine-tuned three pre-trained BERT models on the task of "definition extraction" from mathematical English written in LaTeX. This is presented as a binary classification problem, where either a sentence contains a definition of a mathematical term or it does not. We used two original data sets, "Chicago" and "TAC," to fine-tune and test these models. We also tested on WFMALL, a dataset presented by Vanetik and Litvak in 2021 and compared the performance of our models to theirs. We found that a high-performance Sentence-BERT transformer model performed best based on overall accuracy, recall, and precision metrics, achieving comparable results to the earlier models with less computational effort.
♻ ☆ Target Span Detection for Implicit Harmful Content
Identifying the targets of hate speech is a crucial step in grasping the nature of such speech and, ultimately, in improving the detection of offensive posts on online forums. Much harmful content on online platforms uses implicit language especially when targeting vulnerable and protected groups such as using stereotypical characteristics instead of explicit target names, making it harder to detect and mitigate the language. In this study, we focus on identifying implied targets of hate speech, essential for recognizing subtler hate speech and enhancing the detection of harmful content on digital platforms. We define a new task aimed at identifying the targets even when they are not explicitly stated. To address that task, we collect and annotate target spans in three prominent implicit hate speech datasets: SBIC, DynaHate, and IHC. We call the resulting merged collection Implicit-Target-Span. The collection is achieved using an innovative pooling method with matching scores based on human annotations and Large Language Models (LLMs). Our experiments indicate that Implicit-Target-Span provides a challenging test bed for target span detection methods.
♻ ☆ Software Engineering Methods For AI-Driven Deductive Legal Reasoning SP
The recent proliferation of generative artificial intelligence (AI) technologies such as pre-trained large language models (LLMs) has opened up new frontiers in computational law. An exciting area of development is the use of AI to automate the deductive rule-based reasoning inherent in statutory and contract law. This paper argues that such automated deductive legal reasoning can now be viewed from the lens of software engineering, treating LLMs as interpreters of natural-language programs with natural-language inputs. We show how it is possible to apply principled software engineering techniques to enhance AI-driven legal reasoning of complex statutes and to unlock new applications in automated meta-reasoning such as mutation-guided example generation and metamorphic property-based testing.
comment: Appearing in Onward! at SPLASH 2024
♻ ☆ "Vorbeşti Româneşte?" A Recipe to Train Powerful Romanian LLMs with English Instructions
In recent years, Large Language Models (LLMs) have achieved almost human-like performance on various tasks. While some LLMs have been trained on multilingual data, most of the training data is in English; hence, their performance in English greatly exceeds other languages. To our knowledge, we are the first to collect and translate a large collection of texts, instructions, and benchmarks and train, evaluate, and release open-source LLMs tailored for Romanian. We evaluate our methods on four different categories, including academic benchmarks, MT-Bench (manually translated), and a professionally built historical, cultural, and social benchmark adapted to Romanian. We argue for the usefulness and high performance of RoLLMs by obtaining state-of-the-art results across the board. We publicly release all resources (i.e., data, training and evaluation code, models) to support and encourage research on Romanian LLMs while concurrently creating a generalizable recipe, adequate for other low or less-resourced languages.
comment: arXiv admin note: text overlap with arXiv:2405.07703
♻ ☆ Guylingo: The Republic of Guyana Creole Corpora NAACL 2024
While major languages often enjoy substantial attention and resources, the linguistic diversity across the globe encompasses a multitude of smaller, indigenous, and regional languages that lack the same level of computational support. One such region is the Caribbean. While commonly labeled as "English speaking", the ex-British Caribbean region consists of a myriad of Creole languages thriving alongside English. In this paper, we present Guylingo: a comprehensive corpus designed for advancing NLP research in the domain of Creolese (Guyanese English-lexicon Creole), the most widely spoken language in the culturally rich nation of Guyana. We first outline our framework for gathering and digitizing this diverse corpus, inclusive of colloquial expressions, idioms, and regional variations in a low-resource language. We then demonstrate the challenges of training and evaluating NLP models for machine translation in Creole. Lastly, we discuss the unique opportunities presented by recent NLP advancements for accelerating the formal adoption of Creole languages as official languages in the Caribbean.
comment: Accepted to NAACL 2024 Main Conference Special Theme Track: Languages of Latin America and The Caribbean
Machine Learning 157
☆ The Remarkable Robustness of LLMs: Stages of Inference?
We demonstrate and investigate the remarkable robustness of Large Language Models by deleting and swapping adjacent layers. We find that deleting and swapping interventions retain 72-95\% of the original model's prediction accuracy without fine-tuning, whereas models with more layers exhibit more robustness. Based on the results of the layer-wise intervention and further experiments, we hypothesize the existence of four universal stages of inference across eight different models: detokenization, feature engineering, prediction ensembling, and residual sharpening. The first stage integrates local information, lifting raw token representations into higher-level contextual representations. Next is the iterative refinement of task and entity-specific features. Then, the second half of the model begins with a phase transition, where hidden representations align more with the vocabulary space due to specialized model components. Finally, the last layer sharpens the following token distribution by eliminating obsolete features that add noise to the prediction.
☆ TabReD: A Benchmark of Tabular Machine Learning in-the-Wild
Benchmarks that closely reflect downstream application scenarios are essential for the streamlined adoption of new research in tabular machine learning (ML). In this work, we examine existing tabular benchmarks and find two common characteristics of industry-grade tabular data that are underrepresented in the datasets available to the academic community. First, tabular data often changes over time in real-world deployment scenarios. This impacts model performance and requires time-based train and test splits for correct model evaluation. Yet, existing academic tabular datasets often lack timestamp metadata to enable such evaluation. Second, a considerable portion of datasets in production settings stem from extensive data acquisition and feature engineering pipelines. For each specific dataset, this can have a different impact on the absolute and relative number of predictive, uninformative, and correlated features, which in turn can affect model selection. To fill the aforementioned gaps in academic benchmarks, we introduce TabReD -- a collection of eight industry-grade tabular datasets covering a wide range of domains from finance to food delivery services. We assess a large number of tabular ML models in the feature-rich, temporally-evolving data setting facilitated by TabReD. We demonstrate that evaluation on time-based data splits leads to different methods ranking, compared to evaluation on random splits more common in academic benchmarks. Furthermore, on the TabReD datasets, MLP-like architectures and GBDT show the best results, while more sophisticated DL models are yet to prove their effectiveness.
comment: Code: https://github.com/puhsu/tabred
☆ Emergence of Hidden Capabilities: Exploring Learning Dynamics in Concept Space
Modern generative models demonstrate impressive capabilities, likely stemming from an ability to identify and manipulate abstract concepts underlying their training data. However, fundamental questions remain: what determines the concepts a model learns, the order in which it learns them, and its ability to manipulate those concepts? To address these questions, we propose analyzing a model's learning dynamics via a framework we call the concept space, where each axis represents an independent concept underlying the data generating process. By characterizing learning dynamics in this space, we identify how the speed at which a concept is learned, and hence the order of concept learning, is controlled by properties of the data we term concept signal. Further, we observe moments of sudden turns in the direction of a model's learning dynamics in concept space. Surprisingly, these points precisely correspond to the emergence of hidden capabilities, i.e., where latent interventions show the model possesses the capability to manipulate a concept, but these capabilities cannot yet be elicited via naive input prompting. While our results focus on synthetically defined toy datasets, we hypothesize a general claim on emergence of hidden capabilities may hold: generative models possess latent capabilities that emerge suddenly and consistently during training, though a model might not exhibit these capabilities under naive input prompting.
comment: Preprint
☆ DiVERT: Distractor Generation with Variational Errors Represented as Text for Math Multiple-choice Questions
High-quality distractors are crucial to both the assessment and pedagogical value of multiple-choice questions (MCQs), where manually crafting ones that anticipate knowledge deficiencies or misconceptions among real students is difficult. Meanwhile, automated distractor generation, even with the help of large language models (LLMs), remains challenging for subjects like math. It is crucial to not only identify plausible distractors but also understand the error behind them. In this paper, we introduce DiVERT (Distractor Generation with Variational Errors Represented as Text), a novel variational approach that learns an interpretable representation of errors behind distractors in math MCQs. Through experiments on a real-world math MCQ dataset with 1,434 questions used by hundreds of thousands of students, we show that DiVERT, despite using a base open-source LLM with 7B parameters, outperforms state-of-the-art approaches using GPT-4o on downstream distractor generation. We also conduct a human evaluation with math educators and find that DiVERT leads to error labels that are of comparable quality to human-authored ones.
☆ Subtractive Training for Music Stem Insertion using Latent Diffusion Models
We present Subtractive Training, a simple and novel method for synthesizing individual musical instrument stems given other instruments as context. This method pairs a dataset of complete music mixes with 1) a variant of the dataset lacking a specific stem, and 2) LLM-generated instructions describing how the missing stem should be reintroduced. We then fine-tune a pretrained text-to-audio diffusion model to generate the missing instrument stem, guided by both the existing stems and the text instruction. Our results demonstrate Subtractive Training's efficacy in creating authentic drum stems that seamlessly blend with the existing tracks. We also show that we can use the text instruction to control the generation of the inserted stem in terms of rhythm, dynamics, and genre, allowing us to modify the style of a single instrument in a full song while keeping the remaining instruments the same. Lastly, we extend this technique to MIDI formats, successfully generating compatible bass, drum, and guitar parts for incomplete arrangements.
☆ Efficient World Models with Context-Aware Tokenization ICML 2024
Scaling up deep Reinforcement Learning (RL) methods presents a significant challenge. Following developments in generative modelling, model-based RL positions itself as a strong contender. Recent advances in sequence modelling have led to effective transformer-based world models, albeit at the price of heavy computations due to the long sequences of tokens required to accurately simulate environments. In this work, we propose $\Delta$-IRIS, a new agent with a world model architecture composed of a discrete autoencoder that encodes stochastic deltas between time steps and an autoregressive transformer that predicts future deltas by summarizing the current state of the world with continuous tokens. In the Crafter benchmark, $\Delta$-IRIS sets a new state of the art at multiple frame budgets, while being an order of magnitude faster to train than previous attention-based approaches. We release our code and models at https://github.com/vmicheli/delta-iris.
comment: ICML 2024
☆ Jump Starting Bandits with LLM-Generated Prior Knowledge
We present substantial evidence demonstrating the benefits of integrating Large Language Models (LLMs) with a Contextual Multi-Armed Bandit framework. Contextual bandits have been widely used in recommendation systems to generate personalized suggestions based on user-specific contexts. We show that LLMs, pre-trained on extensive corpora rich in human knowledge and preferences, can simulate human behaviours well enough to jump-start contextual multi-armed bandits to reduce online learning regret. We propose an initialization algorithm for contextual bandits by prompting LLMs to produce a pre-training dataset of approximate human preferences for the bandit. This significantly reduces online learning regret and data-gathering costs for training such models. Our approach is validated empirically through two sets of experiments with different bandit setups: one which utilizes LLMs to serve as an oracle and a real-world experiment utilizing data from a conjoint survey experiment.
LiveBench: A Challenging, Contamination-Free LLM Benchmark
Test set contamination, wherein test data from a benchmark ends up in a newer model's training set, is a well-documented obstacle for fair LLM evaluation and can quickly render benchmarks obsolete. To mitigate this, many recent benchmarks crowdsource new prompts and evaluations from human or LLM judges; however, these can introduce significant biases, and break down when scoring hard questions. In this work, we introduce a new benchmark for LLMs designed to be immune to both test set contamination and the pitfalls of LLM judging and human crowdsourcing. We release LiveBench, the first benchmark that (1) contains frequently-updated questions from recent information sources, (2) scores answers automatically according to objective ground-truth values, and (3) contains a wide variety of challenging tasks, spanning math, coding, reasoning, language, instruction following, and data analysis. To achieve this, LiveBench contains questions that are based on recently-released math competitions, arXiv papers, news articles, and datasets, and it contains harder, contamination-free versions of tasks from previous benchmarks such as Big-Bench Hard, AMPS, and IFEval. We evaluate many prominent closed-source models, as well as dozens of open-source models ranging from 0.5B to 110B in size. LiveBench is difficult, with top models achieving below 65% accuracy. We release all questions, code, and model answers. Questions will be added and updated on a monthly basis, and we will release new tasks and harder versions of tasks over time so that LiveBench can distinguish between the capabilities of LLMs as they improve in the future. We welcome community engagement and collaboration for expanding the benchmark tasks and models.
☆ Mapping Land Naturalness from Sentinel-2 using Deep Contextual and Geographical Priors ICLR 2024
In recent decades, the causes and consequences of climate change have accelerated, affecting our planet on an unprecedented scale. This change is closely tied to the ways in which humans alter their surroundings. As our actions continue to impact natural areas, using satellite images to observe and measure these effects has become crucial for understanding and combating climate change. Aiming to map land naturalness on the continuum of modern human pressure, we have developed a multi-modal supervised deep learning framework that addresses the unique challenges of satellite data and the task at hand. We incorporate contextual and geographical priors, represented by corresponding coordinate information and broader contextual information, including and surrounding the immediate patch to be predicted. Our framework improves the model's predictive performance in mapping land naturalness from Sentinel-2 data, a type of multi-spectral optical satellite imagery. Recognizing that our protective measures are only as effective as our understanding of the ecosystem, quantifying naturalness serves as a crucial step toward enhancing our environmental stewardship.
comment: 6 pages, 3 figures, ICLR 2024 Tackling Climate Change with Machine Learning Workshop
☆ MCNC: Manifold Constrained Network Compression
The outstanding performance of large foundational models across diverse tasks-from computer vision to speech and natural language processing-has significantly increased their demand. However, storing and transmitting these models pose significant challenges due to their massive size (e.g., 350GB for GPT-3). Recent literature has focused on compressing the original weights or reducing the number of parameters required for fine-tuning these models. These compression methods typically involve constraining the parameter space, for example, through low-rank reparametrization (e.g., LoRA) or quantization (e.g., QLoRA) during model training. In this paper, we present MCNC as a novel model compression method that constrains the parameter space to low-dimensional pre-defined and frozen nonlinear manifolds, which effectively cover this space. Given the prevalence of good solutions in over-parameterized deep neural networks, we show that by constraining the parameter space to our proposed manifold, we can identify high-quality solutions while achieving unprecedented compression rates across a wide variety of tasks. Through extensive experiments in computer vision and natural language processing tasks, we demonstrate that our method, MCNC, significantly outperforms state-of-the-art baselines in terms of compression, accuracy, and/or model reconstruction time.
☆ scTree: Discovering Cellular Hierarchies in the Presence of Batch Effects in scRNA-seq Data
We propose a novel method, scTree, for single-cell Tree Variational Autoencoders, extending a hierarchical clustering approach to single-cell RNA sequencing data. scTree corrects for batch effects while simultaneously learning a tree-structured data representation. This VAE-based method allows for a more in-depth understanding of complex cellular landscapes independently of the biasing effects of batches. We show empirically on seven datasets that scTree discovers the underlying clusters of the data and the hierarchical relations between them, as well as outperforms established baseline methods across these datasets. Additionally, we analyze the learned hierarchy to understand its biological relevance, thus underpinning the importance of integrating batch correction directly into the clustering procedure.
☆ Compositional Image Decomposition with Diffusion Models ICML 2024
Given an image of a natural scene, we are able to quickly decompose it into a set of components such as objects, lighting, shadows, and foreground. We can then envision a scene where we combine certain components with those from other images, for instance a set of objects from our bedroom and animals from a zoo under the lighting conditions of a forest, even if we have never encountered such a scene before. In this paper, we present a method to decompose an image into such compositional components. Our approach, Decomp Diffusion, is an unsupervised method which, when given a single image, infers a set of different components in the image, each represented by a diffusion model. We demonstrate how components can capture different factors of the scene, ranging from global scene descriptors like shadows or facial expression to local scene descriptors like constituent objects. We further illustrate how inferred factors can be flexibly composed, even with factors inferred from other models, to generate a variety of scenes sharply different than those seen in training time. Website and code at https://energy-based-model.github.io/decomp-diffusion.
comment: ICML 2024, Webpage: https://energy-based-model.github.io/decomp-diffusion
☆ From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data
Recent studies have shown that Large Language Models (LLMs) struggle to accurately retrieve information and maintain reasoning capabilities when processing long-context inputs. To address these limitations, we propose a finetuning approach utilizing a carefully designed synthetic dataset comprising numerical key-value retrieval tasks. Our experiments on models like GPT-3.5 Turbo and Mistral 7B demonstrate that finetuning LLMs on this dataset significantly improves LLMs' information retrieval and reasoning capabilities in longer-context settings. We present an analysis of the finetuned models, illustrating the transfer of skills from synthetic to real task evaluations (e.g., $10.5\%$ improvement on $20$ documents MDQA at position $10$ for GPT-3.5 Turbo). We also find that finetuned LLMs' performance on general benchmarks remains almost constant while LLMs finetuned on other baseline long-context augmentation data can encourage hallucination (e.g., on TriviaQA, Mistral 7B finetuned on our synthetic data cause no performance drop while other baseline data can cause a drop that ranges from $2.33\%$ to $6.19\%$). Our study highlights the potential of finetuning on synthetic data for improving the performance of LLMs on longer-context tasks.
☆ HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale
The rapid development of multimodal large language models (MLLMs), such as GPT-4V, has led to significant advancements. However, these models still face challenges in medical multimodal capabilities due to limitations in the quantity and quality of medical vision-text data, stemming from data privacy concerns and high annotation costs. While pioneering approaches utilize PubMed's large-scale, de-identified medical image-text pairs to address these limitations, they still fall short due to inherent data noise. To tackle this, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an 'unblinded' capacity to denoise and reformat the data, resulting in the creation of the PubMedVision dataset with 1.3 million medical VQA samples. Our validation demonstrates that: (1) PubMedVision can significantly enhance the medical multimodal capabilities of current MLLMs, showing significant improvement in benchmarks including the MMMU Health & Medicine track; (2) manual checks by medical experts and empirical results validate the superior data quality of our dataset compared to other data construction methods. Using PubMedVision, we train a 34B medical MLLM HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios among open-source MLLMs.
☆ Stochastic Concept Bottleneck Models
Concept Bottleneck Models (CBMs) have emerged as a promising interpretable method whose final prediction is based on intermediate, human-understandable concepts rather than the raw input. Through time-consuming manual interventions, a user can correct wrongly predicted concept values to enhance the model's downstream performance. We propose Stochastic Concept Bottleneck Models (SCBMs), a novel approach that models concept dependencies. In SCBMs, a single-concept intervention affects all correlated concepts, thereby improving intervention effectiveness. Unlike previous approaches that model the concept relations via an autoregressive structure, we introduce an explicit, distributional parameterization that allows SCBMs to retain the CBMs' efficient training and inference procedure. Additionally, we leverage the parameterization to derive an effective intervention strategy based on the confidence region. We show empirically on synthetic tabular and natural image datasets that our approach improves intervention effectiveness significantly. Notably, we showcase the versatility and usability of SCBMs by examining a setting with CLIP-inferred concepts, alleviating the need for manual concept annotations.
☆ Leveraging Contrastive Learning for Enhanced Node Representations in Tokenized Graph Transformers
While tokenized graph Transformers have demonstrated strong performance in node classification tasks, their reliance on a limited subset of nodes with high similarity scores for constructing token sequences overlooks valuable information from other nodes, hindering their ability to fully harness graph information for learning optimal node representations. To address this limitation, we propose a novel graph Transformer called GCFormer. Unlike previous approaches, GCFormer develops a hybrid token generator to create two types of token sequences, positive and negative, to capture diverse graph information. And a tailored Transformer-based backbone is adopted to learn meaningful node representations from these generated token sequences. Additionally, GCFormer introduces contrastive learning to extract valuable information from both positive and negative token sequences, enhancing the quality of learned node representations. Extensive experimental results across various datasets, including homophily and heterophily graphs, demonstrate the superiority of GCFormer in node classification, when compared to representative graph neural networks (GNNs) and graph Transformers.
☆ Advection Augmented Convolutional Neural Networks
Many problems in physical sciences are characterized by the prediction of space-time sequences. Such problems range from weather prediction to the analysis of disease propagation and video prediction. Modern techniques for the solution of these problems typically combine Convolution Neural Networks (CNN) architecture with a time prediction mechanism. However, oftentimes, such approaches underperform in the long-range propagation of information and lack explainability. In this work, we introduce a physically inspired architecture for the solution of such problems. Namely, we propose to augment CNNs with advection by designing a novel semi-Lagrangian push operator. We show that the proposed operator allows for the non-local transformation of information compared with standard convolutional kernels. We then complement it with Reaction and Diffusion neural components to form a network that mimics the Reaction-Advection-Diffusion equation, in high dimensions. We demonstrate the effectiveness of our network on a number of spatio-temporal datasets that show their merit.
☆ NTFormer: A Composite Node Tokenized Graph Transformer for Node Classification
Recently, the emerging graph Transformers have made significant advancements for node classification on graphs. In most graph Transformers, a crucial step involves transforming the input graph into token sequences as the model input, enabling Transformer to effectively learn the node representations. However, we observe that existing methods only express partial graph information of nodes through single-type token generation. Consequently, they require tailored strategies to encode additional graph-specific features into the Transformer to ensure the quality of node representation learning, limiting the model flexibility to handle diverse graphs. To this end, we propose a new graph Transformer called NTFormer to address this issue. NTFormer introduces a novel token generator called Node2Par, which constructs various token sequences using different token elements for each node. This flexibility allows Node2Par to generate valuable token sequences from different perspectives, ensuring comprehensive expression of rich graph features. Benefiting from the merits of Node2Par, NTFormer only leverages a Transformer-based backbone without graph-specific modifications to learn node representations, eliminating the need for graph-specific modifications. Extensive experiments conducted on various benchmark datasets containing homophily and heterophily graphs with different scales demonstrate the superiority of NTFormer over representative graph Transformers and graph neural networks for node classification.
☆ Improving the Expressiveness of $K$-hop Message-Passing GNNs by Injecting Contextualized Substructure Information KDD2023
Graph neural networks (GNNs) have become the \textit{de facto} standard for representational learning in graphs, and have achieved state-of-the-art performance in many graph-related tasks; however, it has been shown that the expressive power of standard GNNs are equivalent maximally to 1-dimensional Weisfeiler-Lehman (1-WL) Test. Recently, there is a line of works aiming to enhance the expressive power of graph neural networks. One line of such works aim at developing $K$-hop message-passing GNNs where node representation is updated by aggregating information from not only direct neighbors but all neighbors within $K$-hop of the node. Another line of works leverages subgraph information to enhance the expressive power which is proven to be strictly more powerful than 1-WL test. In this work, we discuss the limitation of $K$-hop message-passing GNNs and propose \textit{substructure encoding function} to uplift the expressive power of any $K$-hop message-passing GNN. We further inject contextualized substructure information to enhance the expressiveness of $K$-hop message-passing GNNs. Our method is provably more powerful than previous works on $K$-hop graph neural networks and 1-WL subgraph GNNs, which is a specific type of subgraph based GNN models, and not less powerful than 3-WL. Empirically, our proposed method set new state-of-the-art performance or achieves comparable performance for a variety of datasets. Our code is available at \url{https://github.com/tianyao-aka/Expresive_K_hop_GNNs}.
comment: 13 pages, published in Research track of KDD2023
☆ Revealing Fine-Grained Values and Opinions in Large Language Models
Uncovering latent values and opinions in large language models (LLMs) can help identify biases and mitigate potential harm. Recently, this has been approached by presenting LLMs with survey questions and quantifying their stances towards morally and politically charged statements. However, the stances generated by LLMs can vary greatly depending on how they are prompted, and there are many ways to argue for or against a given position. In this work, we propose to address this by analysing a large and robust dataset of 156k LLM responses to the 62 propositions of the Political Compass Test (PCT) generated by 6 LLMs using 420 prompt variations. We perform coarse-grained analysis of their generated stances and fine-grained analysis of the plain text justifications for those stances. For fine-grained analysis, we propose to identify tropes in the responses: semantically similar phrases that are recurrent and consistent across different prompts, revealing patterns in the text that a given LLM is prone to produce. We find that demographic features added to prompts significantly affect outcomes on the PCT, reflecting bias, as well as disparities between the results of tests when eliciting closed-form vs. open domain responses. Additionally, patterns in the plain text rationales via tropes show that similar justifications are repeatedly generated across models and prompts even with disparate stances.
comment: 28 pages, 20 figures, 7 tables
☆ FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts
Existing benchmarks for visual question answering lack in visual grounding and complexity, particularly in evaluating spatial reasoning skills. We introduce FlowVQA, a novel benchmark aimed at assessing the capabilities of visual question-answering multimodal language models in reasoning with flowcharts as visual contexts. FlowVQA comprises 2,272 carefully generated and human-verified flowchart images from three distinct content sources, along with 22,413 diverse question-answer pairs, to test a spectrum of reasoning tasks, including information localization, decision-making, and logical progression. We conduct a thorough baseline evaluation on a suite of both open-source and proprietary multimodal language models using various strategies, followed by an analysis of directional bias. The results underscore the benchmark's potential as a vital tool for advancing the field of multimodal modeling, providing a focused and challenging environment for enhancing model performance in visual and logical reasoning tasks.
☆ Tools Fail: Detecting Silent Errors in Faulty Tools
Tools have become a mainstay of LLMs, allowing them to retrieve knowledge not in their weights, to perform tasks on the web, and even to control robots. However, most ontologies and surveys of tool-use have assumed the core challenge for LLMs is choosing the tool. Instead, we introduce a framework for tools more broadly which guides us to explore a model's ability to detect "silent" tool errors, and reflect on how to plan. This more directly aligns with the increasingly popular use of models as tools. We provide an initial approach to failure recovery with promising results both on a controlled calculator setting and embodied agent planning.
comment: 18 pages, 12 figures
☆ T-FREE: Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings
Tokenizers are crucial for encoding information in Large Language Models, but their development has recently stagnated, and they contain inherent weaknesses. Major limitations include computational overhead, ineffective vocabulary use, and unnecessarily large embedding and head layers. Additionally, their performance is biased towards a reference corpus, leading to reduced effectiveness for underrepresented languages. To remedy these issues, we propose T-FREE, which directly embeds words through sparse activation patterns over character triplets, and does not require a reference corpus. T-FREE inherently exploits morphological similarities and allows for strong compression of embedding layers. In our exhaustive experimental evaluation, we achieve competitive downstream performance with a parameter reduction of more than 85% on these layers. Further, T-FREE shows significant improvements in cross-lingual transfer learning.
☆ Estimating Long-term Heterogeneous Dose-response Curve: Generalization Bound Leveraging Optimal Transport Weights
Long-term causal effect estimation is a significant but challenging problem in many applications. Existing methods rely on ideal assumptions to estimate long-term average effects, e.g., no unobserved confounders or a binary treatment,while in numerous real-world applications, these assumptions could be violated and average effects are unable to provide individual-level suggestions.In this paper,we address a more general problem of estimating the long-term heterogeneous dose-response curve (HDRC) while accounting for unobserved confounders. Specifically, to remove unobserved confounding in observational data, we introduce an optimal transport weighting framework to align the observational data to the experimental data with theoretical guarantees. Furthermore,to accurately predict the heterogeneous effects of continuous treatment, we establish a generalization bound on counterfactual prediction error by leveraging the reweighted distribution induced by optimal transport. Finally, we develop an HDRC estimator building upon the above theoretical foundations. Extensive experimental studies conducted on multiple synthetic and semi-synthetic datasets demonstrate the effectiveness of our proposed method.
☆ BISeizuRe: BERT-Inspired Seizure Data Representation to Improve Epilepsy Monitoring
This study presents a novel approach for EEG-based seizure detection leveraging a BERT-based model. The model, BENDR, undergoes a two-phase training process. Initially, it is pre-trained on the extensive Temple University Hospital EEG Corpus (TUEG), a 1.5 TB dataset comprising over 10,000 subjects, to extract common EEG data patterns. Subsequently, the model is fine-tuned on the CHB-MIT Scalp EEG Database, consisting of 664 EEG recordings from 24 pediatric patients, of which 198 contain seizure events. Key contributions include optimizing fine-tuning on the CHB-MIT dataset, where the impact of model architecture, pre-processing, and post-processing techniques are thoroughly examined to enhance sensitivity and reduce false positives per hour (FP/h). We also explored custom training strategies to ascertain the most effective setup. The model undergoes a novel second pre-training phase before subject-specific fine-tuning, enhancing its generalization capabilities. The optimized model demonstrates substantial performance enhancements, achieving as low as 0.23 FP/h, 2.5$\times$ lower than the baseline model, with a lower but still acceptable sensitivity rate, showcasing the effectiveness of applying a BERT-based approach on EEG-based seizure detection.
comment: 4 pages, 2 tables, 2 figures
☆ Averaging log-likelihoods in direct alignment
To better align Large Language Models (LLMs) with human judgment, Reinforcement Learning from Human Feedback (RLHF) learns a reward model and then optimizes it using regularized RL. Recently, direct alignment methods were introduced to learn such a fine-tuned model directly from a preference dataset without computing a proxy reward function. These methods are built upon contrastive losses involving the log-likelihood of (dis)preferred completions according to the trained model. However, completions have various lengths, and the log-likelihood is not length-invariant. On the other side, the cross-entropy loss used in supervised training is length-invariant, as batches are typically averaged token-wise. To reconcile these approaches, we introduce a principled approach for making direct alignment length-invariant. Formally, we introduce a new averaging operator, to be composed with the optimality operator giving the best policy for the underlying RL problem. It translates into averaging the log-likelihood within the loss. We empirically study the effect of such averaging, observing a trade-off between the length of generations and their scores.
☆ Contrastive Policy Gradient: Aligning LLMs on sequence-level scores in a supervised-friendly fashion
Reinforcement Learning (RL) has been used to finetune Large Language Models (LLMs) using a reward model trained from preference data, to better align with human judgment. The recently introduced direct alignment methods, which are often simpler, more stable, and computationally lighter, can more directly achieve this. However, these approaches cannot optimize arbitrary rewards, and the preference-based ones are not the only rewards of interest for LLMs (eg., unit tests for code generation or textual entailment for summarization, among others). RL-finetuning is usually done with a variation of policy gradient, which calls for on-policy or near-on-policy samples, requiring costly generations. We introduce Contrastive Policy Gradient, or CoPG, a simple and mathematically principled new RL algorithm that can estimate the optimal policy even from off-policy data. It can be seen as an off-policy policy gradient approach that does not rely on important sampling techniques and highlights the importance of using (the right) state baseline. We show this approach to generalize the direct alignment method IPO (identity preference optimization) and classic policy gradient. We experiment with the proposed CoPG on a toy bandit problem to illustrate its properties, as well as for finetuning LLMs on a summarization task, using a learned reward function considered as ground truth for the purpose of the experiments.
☆ Towards Reducing Data Acquisition and Labeling for Defect Detection using Simulated Data
In many manufacturing settings, annotating data for machine learning and computer vision is costly, but synthetic data can be generated at significantly lower cost. Substituting the real-world data with synthetic data is therefore appealing for many machine learning applications that require large amounts of training data. However, relying solely on synthetic data is frequently inadequate for effectively training models that perform well on real-world data, primarily due to domain shifts between the synthetic and real-world data. We discuss approaches for dealing with such a domain shift when detecting defects in X-ray scans of aluminium wheels. Using both simulated and real-world X-ray images, we train an object detection model with different strategies to identify the training approach that generates the best detection results while minimising the demand for annotated real-world training samples. Our preliminary findings suggest that the sim-2-real domain adaptation approach is more cost-efficient than a fully supervised oracle - if the total number of available annotated samples is fixed. Given a certain number of labeled real-world samples, training on a mix of synthetic and unlabeled real-world data achieved comparable or even better detection results at significantly lower cost. We argue that future research into the cost-efficiency of different training strategies is important for a better understanding of how to allocate budget in applied machine learning projects.
☆ Heterogeneous Causal Metapath Graph Neural Network for Gene-Microbe-Disease Association Prediction
The recent focus on microbes in human medicine highlights their potential role in the genetic framework of diseases. To decode the complex interactions among genes, microbes, and diseases, computational predictions of gene-microbe-disease (GMD) associations are crucial. Existing methods primarily address gene-disease and microbe-disease associations, but the more intricate triple-wise GMD associations remain less explored. In this paper, we propose a Heterogeneous Causal Metapath Graph Neural Network (HCMGNN) to predict GMD associations. HCMGNN constructs a heterogeneous graph linking genes, microbes, and diseases through their pairwise associations, and utilizes six predefined causal metapaths to extract directed causal subgraphs, which facilitate the multi-view analysis of causal relations among three entity types. Within each subgraph, we employ a causal semantic sharing message passing network for node representation learning, coupled with an attentive fusion method to integrate these representations for predicting GMD associations. Our extensive experiments show that HCMGNN effectively predicts GMD associations and addresses association sparsity issue by enhancing the graph's semantics and structure.
☆ Advancing operational PM2.5 forecasting with dual deep neural networks (D-DNet)
PM2.5 forecasting is crucial for public health, air quality management, and policy development. Traditional physics-based models are computationally demanding and slow to adapt to real-time conditions. Deep learning models show potential in efficiency but still suffer from accuracy loss over time due to error accumulation. To address these challenges, we propose a dual deep neural network (D-DNet) prediction and data assimilation system that efficiently integrates real-time observations, ensuring reliable operational forecasting. D-DNet excels in global operational forecasting for PM2.5 and AOD550, maintaining consistent accuracy throughout the entire year of 2019. It demonstrates notably higher efficiency than the Copernicus Atmosphere Monitoring Service (CAMS) 4D-Var operational forecasting system while maintaining comparable accuracy. This efficiency benefits ensemble forecasting, uncertainty analysis, and large-scale tasks.
☆ Resolving Discrepancies in Compute-Optimal Scaling of Language Models
Kaplan et al. and Hoffmann et al. developed influential scaling laws for the optimal model size as a function of the compute budget, but these laws yield substantially different predictions. We explain the discrepancy by reproducing the Kaplan scaling law on two datasets (OpenWebText2 and RefinedWeb) and identifying three factors causing the difference: last layer computational cost, warmup duration, and scale-dependent optimizer tuning. With these factors corrected, we obtain excellent agreement with the Hoffmann et al. (i.e., "Chinchilla") scaling law. Counter to a hypothesis of Hoffmann et al., we find that careful learning rate decay is not essential for the validity of their scaling law. As a secondary result, we derive scaling laws for the optimal learning rate and batch size, finding that tuning the AdamW $\beta_2$ parameter is essential at lower batch sizes.
☆ YZS-model: A Predictive Model for Organic Drug Solubility Based on Graph Convolutional Networks and Transformer-Attention
The accurate prediction of drug molecule solubility is essential for determining their therapeutic effectiveness and safety, influencing the drug's ADME processes. Traditional solubility prediction techniques often fail to capture the complex nature of molecular tructures, leading to notable deviations between predictions and actual results. For example, the Discussion on Advanced Drug-Like Compound Structures. Lusci highlighted issues in capturing crucial cyclic structural information in molecules with ring structures. To overcome this issue, our research introduces a novel deep learning framework combining attention-based transformers, Long Short-Term Memory (LSTM) networks, and Graph Convolutional Networks (GCN), aimed at enhancing the precision of solubility predictions. Utilizing a training set of 9,943 compounds and testing on an anticancer compound dataset, our method achieved a correlation coefficient ($R^2$) of 0.55 and a Root Mean Square Error (RMSE) of 0.59, which outperforms the benchmark models' scores of 0.52 ($R^2$) and 0.61 (RMSE). Importantly, in an additional independent test, our model significantly outperformed the baseline with an RMSE of 1.05 compared to 1.28, a relative accuracy improvement of 45.9%. This research not only demonstrates the vast potential of deep learning for improving solubility prediction accuracy but also offers novel insights for drug design and selection in the future. Continued efforts will be directed towards optimizing the model architecture and extending its application to better support the drug development process, underscoring the pivotal role of deep learning in drug discovery.
comment: 18 pages, 12 figures, 6 tables
☆ Towards Learning Abductive Reasoning using VSA Distributed Representations
We introduce the Abductive Rule Learner with Context-awareness (ARLC), a model that solves abstract reasoning tasks based on Learn-VRF. ARLC features a novel and more broadly applicable training objective for abductive reasoning, resulting in better interpretability and higher accuracy when solving Raven's progressive matrices (RPM). ARLC allows both programming domain knowledge and learning the rules underlying a data distribution. We evaluate ARLC on the I-RAVEN dataset, showcasing state-of-the-art accuracy across both in-distribution and out-of-distribution (unseen attribute-rule pairs) tests. ARLC surpasses neuro-symbolic and connectionist baselines, including large language models, despite having orders of magnitude fewer parameters. We show ARLC's robustness to post-programming training by incrementally learning from examples on top of programmed knowledge, which only improves its performance and does not result in catastrophic forgetting of the programmed solution. We validate ARLC's seamless transfer learning from a 2x2 RPM constellation to unseen constellations. Our code is available at https://github.com/IBM/abductive-rule-learner-with-context-awareness.
comment: Accepted at the 18th International Conference on Neural-Symbolic Learning and Reasoning (NeSy) 2024
☆ CHEW: A Dataset of CHanging Events in Wikipedia
We introduce CHEW, a novel dataset of changing events in Wikipedia expressed in naturally occurring text. We use CHEW for probing LLMs for their timeline understanding of Wikipedia entities and events in generative and classification experiments. Our results suggest that LLMs, despite having temporal information available, struggle to construct accurate timelines. We further show the usefulness of CHEW-derived embeddings for identifying meaning shift.
comment: Short Paper
☆ A Teacher Is Worth A Million Instructions
Large Language Models(LLMs) have shown exceptional abilities, yet training these models can be quite challenging. There is a strong dependence on the quality of data and finding the best instruction tuning set. Further, the inherent limitations in training methods create substantial difficulties to train relatively smaller models with 7B and 13B parameters. In our research, we suggest an improved training method for these models by utilising knowledge from larger models, such as a mixture of experts (8x7B) architectures. The scale of these larger models allows them to capture a wide range of variations from data alone, making them effective teachers for smaller models. Moreover, we implement a novel post-training domain alignment phase that employs domain-specific expert models to boost domain-specific knowledge during training while preserving the model's ability to generalise. Fine-tuning Mistral 7B and 2x7B with our method surpasses the performance of state-of-the-art language models with more than 7B and 13B parameters: achieving up to $7.9$ in MT-Bench and $93.04\%$ on AlpacaEval.
comment: 7 pages, 4 figures
☆ Adaptive Stochastic Weight Averaging
Ensemble models often improve generalization performances in challenging tasks. Yet, traditional techniques based on prediction averaging incur three well-known disadvantages: the computational overhead of training multiple models, increased latency, and memory requirements at test time. To address these issues, the Stochastic Weight Averaging (SWA) technique maintains a running average of model parameters from a specific epoch onward. Despite its potential benefits, maintaining a running average of parameters can hinder generalization, as an underlying running model begins to overfit. Conversely, an inadequately chosen starting point can render SWA more susceptible to underfitting compared to an underlying running model. In this work, we propose Adaptive Stochastic Weight Averaging (ASWA) technique that updates a running average of model parameters, only when generalization performance is improved on the validation dataset. Hence, ASWA can be seen as a combination of SWA with the early stopping technique, where the former accepts all updates on a parameter ensemble model and the latter rejects any update on an underlying running model. We conducted extensive experiments ranging from image classification to multi-hop reasoning over knowledge graphs. Our experiments over 11 benchmark datasets with 7 baseline models suggest that ASWA leads to a statistically better generalization across models and datasets
☆ Dimensions underlying the representational alignment of deep neural networks with humans
Determining the similarities and differences between humans and artificial intelligence is an important goal both in machine learning and cognitive neuroscience. However, similarities in representations only inform us about the degree of alignment, not the factors that determine it. Drawing upon recent developments in cognitive science, we propose a generic framework for yielding comparable representations in humans and deep neural networks (DNN). Applying this framework to humans and a DNN model of natural images revealed a low-dimensional DNN embedding of both visual and semantic dimensions. In contrast to humans, DNNs exhibited a clear dominance of visual over semantic features, indicating divergent strategies for representing images. While in-silico experiments showed seemingly-consistent interpretability of DNN dimensions, a direct comparison between human and DNN representations revealed substantial differences in how they process images. By making representations directly comparable, our results reveal important challenges for representational alignment, offering a means for improving their comparability.
☆ Dancing in the Shadows: Harnessing Ambiguity for Fairer Classifiers
This paper introduces a novel approach to bolster algorithmic fairness in scenarios where sensitive information is only partially known. In particular, we propose to leverage instances with uncertain identity with regards to the sensitive attribute to train a conventional machine learning classifier. The enhanced fairness observed in the final predictions of this classifier highlights the promising potential of prioritizing ambiguity (i.e., non-normativity) as a means to improve fairness guarantees in real-world classification tasks.
☆ Segment Anything Model for automated image data annotation: empirical studies using text prompts from Grounding DINO
Grounding DINO and the Segment Anything Model (SAM) have achieved impressive performance in zero-shot object detection and image segmentation, respectively. Together, they have a great potential in revolutionizing zero-shot semantic segmentation or data annotation. Yet, in specialized domains like medical image segmentation, objects of interest (e.g., organs, tissues, and tumors) may not fall in existing class names. To address this problem, the referring expression comprehension (REC) ability of Grounding DINO is leveraged to detect arbitrary targets by their language descriptions. However, recent studies have highlighted severe limitation of the REC framework in this application setting owing to its tendency to make false positive predictions when the target is absent in the given image. And, while this bottleneck is central to the prospect of open-set semantic segmentation, it is still largely unknown how much improvement can be achieved by studying the prediction errors. To this end, we perform empirical studies on eight publicly available datasets and reveal that these errors consistently follow a predictable pattern and can, thus, be mitigated by a simple strategy. Specifically, we show that these false positive detections with appreciable confidence scores generally occupy large image areas and can usually be filtered by their relative sizes. More importantly, we expect these observations to inspire future research in improving REC-based detection and automated segmentation. Using this technique, we evaluate the performance of SAM on multiple datasets from various specialized domains and report significant improvement in segmentation performance and annotation time savings over manual approaches.
☆ A look under the hood of the Interactive Deep Learning Enterprise (No-IDLE)
This DFKI technical report presents the anatomy of the No-IDLE prototype system (funded by the German Federal Ministry of Education and Research) that provides not only basic and fundamental research in interactive machine learning, but also reveals deeper insights into users' behaviours, needs, and goals. Machine learning and deep learning should become accessible to millions of end users. No-IDLE's goals and scienfific challenges centre around the desire to increase the reach of interactive deep learning solutions for non-experts in machine learning. One of the key innovations described in this technical report is a methodology for interactive machine learning combined with multimodal interaction which will become central when we start interacting with semi-intelligent machines in the upcoming area of neural networks and large language models.
comment: DFKI Technical Report
☆ Stochastic Gradient Piecewise Deterministic Monte Carlo Samplers
Recent work has suggested using Monte Carlo methods based on piecewise deterministic Markov processes (PDMPs) to sample from target distributions of interest. PDMPs are non-reversible continuous-time processes endowed with momentum, and hence can mix better than standard reversible MCMC samplers. Furthermore, they can incorporate exact sub-sampling schemes which only require access to a single (randomly selected) data point at each iteration, yet without introducing bias to the algorithm's stationary distribution. However, the range of models for which PDMPs can be used, particularly with sub-sampling, is limited. We propose approximate simulation of PDMPs with sub-sampling for scalable sampling from posterior distributions. The approximation takes the form of an Euler approximation to the true PDMP dynamics, and involves using an estimate of the gradient of the log-posterior based on a data sub-sample. We thus call this class of algorithms stochastic-gradient PDMPs. Importantly, the trajectories of stochastic-gradient PDMPs are continuous and can leverage recent ideas for sampling from measures with continuous and atomic components. We show these methods are easy to implement, present results on their approximation error and demonstrate numerically that this class of algorithms has similar efficiency to, but is more robust than, stochastic gradient Langevin dynamics.
☆ FedMap: Iterative Magnitude-Based Pruning for Communication-Efficient Federated Learning
Federated Learning (FL) is a distributed machine learning approach that enables training on decentralized data while preserving privacy. However, FL systems often involve resource-constrained client devices with limited computational power, memory, storage, and bandwidth. This paper introduces FedMap, a novel method that aims to enhance the communication efficiency of FL deployments by collaboratively learning an increasingly sparse global model through iterative, unstructured pruning. Importantly, FedMap trains a global model from scratch, unlike other methods reported in the literature, making it ideal for privacy-critical use cases such as in the medical and finance domains, where suitable pre-training data is often limited. FedMap adapts iterative magnitude-based pruning to the FL setting, ensuring all clients prune and refine the same subset of the global model parameters, therefore gradually reducing the global model size and communication overhead. The iterative nature of FedMap, forming subsequent models as subsets of predecessors, avoids parameter reactivation issues seen in prior work, resulting in stable performance. In this paper we provide an extensive evaluation of FedMap across diverse settings, datasets, model architectures, and hyperparameters, assessing performance in both IID and non-IID environments. Comparative analysis against the baseline approach demonstrates FedMap's ability to achieve more stable client model performance. For IID scenarios, FedMap achieves over $90$\% pruning without significant performance degradation. In non-IID settings, it achieves at least $~80$\% pruning while maintaining accuracy. FedMap offers a promising solution to alleviate communication bottlenecks in FL systems while retaining model accuracy.
comment: Submitted to IEEE Transactions on Neural Networks and Learning Systems
☆ Accuracy on the wrong line: On the pitfalls of noisy data for out-of-distribution generalisation
"Accuracy-on-the-line" is a widely observed phenomenon in machine learning, where a model's accuracy on in-distribution (ID) and out-of-distribution (OOD) data is positively correlated across different hyperparameters and data configurations. But when does this useful relationship break down? In this work, we explore its robustness. The key observation is that noisy data and the presence of nuisance features can be sufficient to shatter the Accuracy-on-the-line phenomenon. In these cases, ID and OOD accuracy can become negatively correlated, leading to "Accuracy-on-the-wrong-line". This phenomenon can also occur in the presence of spurious (shortcut) features, which tend to overshadow the more complex signal (core, non-spurious) features, resulting in a large nuisance feature space. Moreover, scaling to larger datasets does not mitigate this undesirable behavior and may even exacerbate it. We formally prove a lower bound on Out-of-distribution (OOD) error in a linear classification model, characterizing the conditions on the noise and nuisance features for a large OOD error. We finally demonstrate this phenomenon across both synthetic and real datasets with noisy data and nuisance features.
☆ On Convex Optimization with Semi-Sensitive Features COLT 2024
We study the differentially private (DP) empirical risk minimization (ERM) problem under the semi-sensitive DP setting where only some features are sensitive. This generalizes the Label DP setting where only the label is sensitive. We give improved upper and lower bounds on the excess risk for DP-ERM. In particular, we show that the error only scales polylogarithmically in terms of the sensitive domain size, improving upon previous results that scale polynomially in the sensitive domain size (Ghazi et al., 2021).
comment: To appear in COLT 2024
☆ Lithium-Ion Battery System Health Monitoring and Fault Analysis from Field Data Using Gaussian Processes
Health monitoring, fault analysis, and detection are critical for the safe and sustainable operation of battery systems. We apply Gaussian process resistance models on lithium iron phosphate battery field data to effectively separate the time-dependent and operating point-dependent resistance. The data set contains 29 battery systems returned to the manufacturer for warranty, each with eight cells in series, totaling 232 cells and 131 million data rows. We develop probabilistic fault detection rules using recursive spatiotemporal Gaussian processes. These processes allow the quick processing of over a million data points, enabling advanced online monitoring and furthering the understanding of battery pack failure in the field. The analysis underlines that often, only a single cell shows abnormal behavior or a knee point, consistent with weakest-link failure for cells connected in series, amplified by local resistive heating. The results further the understanding of how batteries degrade and fail in the field and demonstrate the potential of efficient online monitoring based on data. We open-source the code and publish the large data set upon completion of the review of this article.
☆ Zero-shot domain adaptation based on dual-level mix and contrast
Zero-shot domain adaptation (ZSDA) is a domain adaptation problem in the situation that labeled samples for a target task (task of interest) are only available from the source domain at training time, but for a task different from the task of interest (irrelevant task), labeled samples are available from both source and target domains. In this situation, classical domain adaptation techniques can only learn domain-invariant features in the irrelevant task. However, due to the difference in sample distribution between the two tasks, domain-invariant features learned in the irrelevant task are biased and not necessarily domain-invariant in the task of interest. To solve this problem, this paper proposes a new ZSDA method to learn domain-invariant features with low task bias. To this end, we propose (1) data augmentation with dual-level mixups in both task and domain to fill the absence of target task-of-interest data, (2) an extension of domain adversarial learning to learn domain-invariant features with less task bias, and (3) a new dual-level contrastive learning method that enhances domain-invariance and less task biasedness of features. Experimental results show that our proposal achieves good performance on several benchmarks.
comment: Accepted by IEEE conference on Artificial intelligence 2024
☆ FedMLP: Federated Multi-Label Medical Image Classification under Task Heterogeneity MICCAI 2024
Cross-silo federated learning (FL) enables decentralized organizations to collaboratively train models while preserving data privacy and has made significant progress in medical image classification. One common assumption is task homogeneity where each client has access to all classes during training. However, in clinical practice, given a multi-label classification task, constrained by the level of medical knowledge and the prevalence of diseases, each institution may diagnose only partial categories, resulting in task heterogeneity. How to pursue effective multi-label medical image classification under task heterogeneity is under-explored. In this paper, we first formulate such a realistic label missing setting in the multi-label FL domain and propose a two-stage method FedMLP to combat class missing from two aspects: pseudo label tagging and global knowledge learning. The former utilizes a warmed-up model to generate class prototypes and select samples with high confidence to supplement missing labels, while the latter uses a global model as a teacher for consistency regularization to prevent forgetting missing class knowledge. Experiments on two publicly-available medical datasets validate the superiority of FedMLP against the state-of-the-art both federated semi-supervised and noisy label learning approaches under task heterogeneity. Code is available at https://github.com/szbonaldo/FedMLP.
comment: Early accepted by MICCAI 2024
☆ Semi-supervised Concept Bottleneck Models
Concept Bottleneck Models (CBMs) have garnered increasing attention due to their ability to provide concept-based explanations for black-box deep learning models while achieving high final prediction accuracy using human-like concepts. However, the training of current CBMs heavily relies on the accuracy and richness of annotated concepts in the dataset. These concept labels are typically provided by experts, which can be costly and require significant resources and effort. Additionally, concept saliency maps frequently misalign with input saliency maps, causing concept predictions to correspond to irrelevant input features - an issue related to annotation alignment. To address these limitations, we propose a new framework called SSCBM (Semi-supervised Concept Bottleneck Model). Our SSCBM is suitable for practical situations where annotated data is scarce. By leveraging joint training on both labeled and unlabeled data and aligning the unlabeled data at the concept level, we effectively solve these issues. We proposed a strategy to generate pseudo labels and an alignment loss. Experiments demonstrate that our SSCBM is both effective and efficient. With only 20% labeled data, we achieved 93.19% (96.39% in a fully supervised setting) concept accuracy and 75.51% (79.82% in a fully supervised setting) prediction accuracy.
comment: 17 pages
☆ A Fast Learning-Based Surrogate of Electrical Machines using a Reduced Basis
A surrogate model approximates the outputs of a solver of Partial Differential Equations (PDEs) with a low computational cost. In this article, we propose a method to build learning-based surrogates in the context of parameterized PDEs, which are PDEs that depend on a set of parameters but are also temporal and spatial processes. Our contribution is a method hybridizing the Proper Orthogonal Decomposition and several Support Vector Regression machines. This method is conceived to work in real-time, thus aimed for being used in the context of digital twins, where a user can perform an interactive analysis of results based on the proposed surrogate. We present promising results on two use cases concerning electrical machines. These use cases are not toy examples but are produced an industrial computational code, they use meshes representing non-trivial geometries and contain non-linearities.
☆ Alignment For Performance Improvement in Conversation Bots
This paper shows that alignment methods can achieve superior adherence to guardrails compared to instruction fine-tuning alone in conversational agents, also known as bots, within predefined guidelines or 'guardrails'. It examines traditional training approaches such as instruction fine-tuning and the recent advancements in direct alignment methods like Identity Preference Optimization (IPO), and Kahneman-Tversky Optimization (KTO). The effectiveness of alignment techniques both pre and post-instruction tuning is highlighted, illustrating their potential to optimize conversational bots in domains that require strict adherence to specified rules, such as customer care.
☆ Evaluating AI Group Fairness: a Fuzzy Logic Perspective
Artificial intelligence systems often address fairness concerns by evaluating and mitigating measures of group discrimination, for example that indicate biases against certain genders or races. However, what constitutes group fairness depends on who is asked and the social context, whereas definitions are often relaxed to accept small deviations from the statistical constraints they set out to impose. Here we decouple definitions of group fairness both from the context and from relaxation-related uncertainty by expressing them in the axiomatic system of Basic fuzzy Logic (BL) with loosely understood predicates, like encountering group members. We then evaluate the definitions in subclasses of BL, such as Product or Lukasiewicz logics. Evaluation produces continuous instead of binary truth values by choosing the logic subclass and truth values for predicates that reflect uncertain context-specific beliefs, such as stakeholder opinions gathered through questionnaires. Internally, it follows logic-specific rules to compute the truth values of definitions. We show that commonly held propositions standardize the resulting mathematical formulas and we transcribe logic and truth value choices to layperson terms, so that anyone can answer them. We also use our framework to study several literature definitions of algorithmic fairness, for which we rationalize previous expedient practices that are non-probabilistic and show how to re-interpret their formulas and parameters in new contexts.
comment: preprint, 32 pages, 7 figures, 2 theorems, 6 appendices
☆ Federated Graph Semantic and Structural Learning
Federated graph learning collaboratively learns a global graph neural network with distributed graphs, where the non-independent and identically distributed property is one of the major challenges. Most relative arts focus on traditional distributed tasks like images and voices, incapable of graph structures. This paper firstly reveals that local client distortion is brought by both node-level semantics and graph-level structure. First, for node-level semantics, we find that contrasting nodes from distinct classes is beneficial to provide a well-performing discrimination. We pull the local node towards the global node of the same class and push it away from the global node of different classes. Second, we postulate that a well-structural graph neural network possesses similarity for neighbors due to the inherent adjacency relationships. However, aligning each node with adjacent nodes hinders discrimination due to the potential class inconsistency. We transform the adjacency relationships into the similarity distribution and leverage the global model to distill the relation knowledge into the local model, which preserves the structural information and discriminability of the local model. Empirical results on three graph datasets manifest the superiority of the proposed method over its counterparts.
☆ Semi-adaptive Synergetic Two-way Pseudoinverse Learning System
Deep learning has become a crucial technology for making breakthroughs in many fields. Nevertheless, it still faces two important challenges in theoretical and applied aspects. The first lies in the shortcomings of gradient descent based learning schemes which are time-consuming and difficult to determine the learning control hyperparameters. Next, the architectural design of the model is usually tricky. In this paper, we propose a semi-adaptive synergetic two-way pseudoinverse learning system, wherein each subsystem encompasses forward learning, backward learning, and feature concatenation modules. The whole system is trained using a non-gradient descent learning algorithm. It simplifies the hyperparameter tuning while improving the training efficiency. The architecture of the subsystems is designed using a data-driven approach that enables automated determination of the depth of the subsystems. We compare our method with the baselines of mainstream non-gradient descent based methods and the results demonstrate the effectiveness of our proposed method. The source code for this paper is available at http://github.com/B-berrypie/Semi-adaptive-Synergetic-Two-way-Pseudoinverse-Learning-System}{http://github.com/B-berrypie/Semi-adaptive-Synergetic-Two-way-Pseudoinverse-Learning-System.
☆ Enhanced ASR Robustness to Packet Loss with a Front-End Adaptation Network INTERSPEECH 2024
In the realm of automatic speech recognition (ASR), robustness in noisy environments remains a significant challenge. Recent ASR models, such as Whisper, have shown promise, but their efficacy in noisy conditions can be further enhanced. This study is focused on recovering from packet loss to improve the word error rate (WER) of ASR models. We propose using a front-end adaptation network connected to a frozen ASR model. The adaptation network is trained to modify the corrupted input spectrum by minimizing the criteria of the ASR model in addition to an enhancement loss function. Our experiments demonstrate that the adaptation network, trained on Whisper's criteria, notably reduces word error rates across domains and languages in packet-loss scenarios. This improvement is achieved with minimal affect to Whisper model's foundational performance, underscoring our method's practicality and potential in enhancing ASR models in challenging acoustic environments.
comment: Accepted for publication at INTERSPEECH 2024
☆ Fine-tuned network relies on generic representation to solve unseen cognitive task
Fine-tuning pretrained language models has shown promising results on a wide range of tasks, but when encountering a novel task, do they rely more on generic pretrained representation, or develop brand new task-specific solutions? Here, we fine-tuned GPT-2 on a context-dependent decision-making task, novel to the model but adapted from neuroscience literature. We compared its performance and internal mechanisms to a version of GPT-2 trained from scratch on the same task. Our results show that fine-tuned models depend heavily on pretrained representations, particularly in later layers, while models trained from scratch develop different, more task-specific mechanisms. These findings highlight the advantages and limitations of pretraining for task generalization and underscore the need for further investigation into the mechanisms underpinning task-specific fine-tuning in LLMs.
☆ Learning Pareto Set for Multi-Objective Continuous Robot Control
For a control problem with multiple conflicting objectives, there exists a set of Pareto-optimal policies called the Pareto set instead of a single optimal policy. When a multi-objective control problem is continuous and complex, traditional multi-objective reinforcement learning (MORL) algorithms search for many Pareto-optimal deep policies to approximate the Pareto set, which is quite resource-consuming. In this paper, we propose a simple and resource-efficient MORL algorithm that learns a continuous representation of the Pareto set in a high-dimensional policy parameter space using a single hypernet. The learned hypernet can directly generate various well-trained policy networks for different user preferences. We compare our method with two state-of-the-art MORL algorithms on seven multi-objective continuous robot control problems. Experimental results show that our method achieves the best overall performance with the least training parameters. An interesting observation is that the Pareto set is well approximated by a curved line or surface in a high-dimensional parameter space. This observation will provide insight for researchers to design new MORL algorithms.
☆ Time Matters: Scaling Laws for Any Budget
A primary cost driver for training large models is wall-clock training time. We show that popular time estimates based on FLOPs are poor estimates, and construct a more accurate proxy based on memory copies. We show that with some simple accounting, we can estimate the training speed of a transformer model from its hyperparameters. Combined with a scaling law curve like Chinchilla, this lets us estimate the final loss of the model. We fit our estimate to real data with a linear regression, and apply the result to rewrite Chinchilla in terms of a model's estimated training time as opposed to the amount of training data. This gives an expression for the loss in terms of the model's hyperparameters alone. We show that this expression is accurate across a wide range of model hyperparameter values, enabling us to analytically make architectural decisions and train models more efficiently.
☆ Statistical Test for Data Analysis Pipeline by Selective Inference
A data analysis pipeline is a structured sequence of processing steps that transforms raw data into meaningful insights by effectively integrating various analysis algorithms. In this paper, we propose a novel statistical test designed to assess the statistical significance of data analysis pipelines. Our approach allows for the systematic development of valid statistical tests applicable to any data analysis pipeline configuration composed of a set of data analysis components. We have developed this framework by adapting selective inference, which has gained recent attention as a new statistical inference technique for data-driven hypotheses. The proposed statistical test is theoretically designed to control the type I error at the desired significance level in finite samples. As examples, we consider a class of pipelines composed of three missing value imputation algorithms, three outlier detection algorithms, and three feature selection algorithms. We confirm the validity of our statistical test through experiments with both synthetic and real data for this class of data analysis pipelines. Additionally, we present an implementation framework that facilitates testing across any configuration of data analysis pipelines in this class without extra implementation costs.
☆ LearnedKV: Integrating LSM and Learned Index for Superior Performance on SSD
In this paper, we introduce LearnedKV, a novel tiered key-value (KV) store that seamlessly integrates a Log-Structured Merge (LSM) tree with a Learned Index. This integration yields superior read and write performance compared to standalone indexing structures on SSDs. Our design capitalizes on the LSM tree's high write/update throughput and the Learned Index's fast read capabilities, enabling each component to leverage its strengths. We analyze the impact of size on LSM tree performance and demonstrate how the tiered Learned Index significantly mitigates the LSM tree's size-related performance degradation, particularly by reducing the intensive I/O operations resulting from re-insertions after Garbage Collection (GC). To maintain rapid read performance for newly inserted keys, we introduce a non-blocking conversion mechanism that efficiently transforms the existing LSM tree into a new Learned Index with minimal overhead during GC. Our experimental results, conducted across diverse workloads, show that LearnedKV outperforms state-of-the-art solutions by up to 1.32x in read requests and 1.31x in write performance.
comment: 17 pages, 13 figures
☆ From Biased Selective Labels to Pseudo-Labels: An Expectation-Maximization Framework for Learning from Biased Decisions ICML 2024
Selective labels occur when label observations are subject to a decision-making process; e.g., diagnoses that depend on the administration of laboratory tests. We study a clinically-inspired selective label problem called disparate censorship, where labeling biases vary across subgroups and unlabeled individuals are imputed as "negative" (i.e., no diagnostic test = no illness). Machine learning models naively trained on such labels could amplify labeling bias. Inspired by causal models of selective labels, we propose Disparate Censorship Expectation-Maximization (DCEM), an algorithm for learning in the presence of disparate censorship. We theoretically analyze how DCEM mitigates the effects of disparate censorship on model performance. We validate DCEM on synthetic data, showing that it improves bias mitigation (area between ROC curves) without sacrificing discriminative performance (AUC) compared to baselines. We achieve similar results in a sepsis classification task using clinical data.
comment: 39 pages, 33 figures. ICML 2024 conference paper
☆ Predicting the duration of traffic incidents for Sydney greater metropolitan area using machine learning methods
This research presents a comprehensive approach to predicting the duration of traffic incidents and classifying them as short-term or long-term across the Sydney Metropolitan Area. Leveraging a dataset that encompasses detailed records of traffic incidents, road network characteristics, and socio-economic indicators, we train and evaluate a variety of advanced machine learning models including Gradient Boosted Decision Trees (GBDT), Random Forest, LightGBM, and XGBoost. The models are assessed using Root Mean Square Error (RMSE) for regression tasks and F1 score for classification tasks. Our experimental results demonstrate that XGBoost and LightGBM outperform conventional models with XGBoost achieving the lowest RMSE of 33.7 for predicting incident duration and highest classification F1 score of 0.62 for a 30-minute duration threshold. For classification, the 30-minute threshold balances performance with 70.84\% short-term duration classification accuracy and 62.72\% long-term duration classification accuracy. Feature importance analysis, employing both tree split counts and SHAP values, identifies the number of affected lanes, traffic volume, and types of primary and secondary vehicles as the most influential features. The proposed methodology not only achieves high predictive accuracy but also provides stakeholders with vital insights into factors contributing to incident durations. These insights enable more informed decision-making for traffic management and response strategies. The code is available by the link: https://github.com/Future-Mobility-Lab/SydneyIncidents
☆ What Is Missing In Homophily? Disentangling Graph Homophily For Graph Neural Networks
Graph homophily refers to the phenomenon that connected nodes tend to share similar characteristics. Understanding this concept and its related metrics is crucial for designing effective Graph Neural Networks (GNNs). The most widely used homophily metrics, such as edge or node homophily, quantify such "similarity" as label consistency across the graph topology. These metrics are believed to be able to reflect the performance of GNNs, especially on node-level tasks. However, many recent studies have empirically demonstrated that the performance of GNNs does not always align with homophily metrics, and how homophily influences GNNs still remains unclear and controversial. Then, a crucial question arises: What is missing in our current understanding of homophily? To figure out the missing part, in this paper, we disentangle the graph homophily into $3$ aspects: label, structural, and feature homophily, providing a more comprehensive understanding of GNN performance. To investigate their synergy, we propose a Contextual Stochastic Block Model with $3$ types of Homophily (CSBM-3H), where the topology and feature generation are controlled by the $3$ metrics. Based on the theoretical analysis of CSBM-3H, we derive a new composite metric, named Tri-Hom, that considers all $3$ aspects and overcomes the limitations of conventional homophily metrics. The theoretical conclusions and the effectiveness of Tri-Hom have been verified through synthetic experiments on CSBM-3H. In addition, we conduct experiments on $31$ real-world benchmark datasets and calculate the correlations between homophily metrics and model performance. Tri-Hom has significantly higher correlation values than $17$ existing metrics that only focus on a single homophily aspect, demonstrating its superiority and the importance of homophily synergy. Our code is available at \url{https://github.com/zylMozart/Disentangle_GraphHom}.
☆ Decoding-Time Language Model Alignment with Multiple Objectives
Aligning language models (LMs) to human preferences has emerged as a critical pursuit, enabling these models to better serve diverse user needs. Existing methods primarily focus on optimizing LMs for a single reward function, limiting their adaptability to varied objectives. Here, we propose $\textbf{multi-objective decoding (MOD)}$, a decoding-time algorithm that outputs the next token from a linear combination of predictions of all base models, for any given weightings over different objectives. We exploit a common form among a family of $f$-divergence regularized alignment approaches (such as PPO, DPO, and their variants) to identify a closed-form solution by Legendre transform, and derive an efficient decoding strategy. Theoretically, we show why existing approaches can be sub-optimal even in natural settings and obtain optimality guarantees for our method. Empirical results demonstrate the effectiveness of the algorithm. For example, compared to a parameter-merging baseline, MOD achieves 12.8% overall reward improvement when equally optimizing towards $3$ objectives. Moreover, we experiment with MOD on combining three fully-finetuned LLMs of different model sizes, each aimed at different objectives such as safety, coding, and general user preference. Unlike traditional methods that require careful curation of a mixture of datasets to achieve comprehensive improvement, we can quickly experiment with preference weightings using MOD to find the best combination of models. Our best combination reduces toxicity on Toxigen to nearly 0% and achieves 7.9--33.3% improvement across other three metrics ($\textit{i.e.}$, Codex@1, GSM-COT, BBH-COT).
☆ LICO: Large Language Models for In-Context Molecular Optimization
Optimizing black-box functions is a fundamental problem in science and engineering. To solve this problem, many approaches learn a surrogate function that estimates the underlying objective from limited historical evaluations. Large Language Models (LLMs), with their strong pattern-matching capabilities via pretraining on vast amounts of data, stand out as a potential candidate for surrogate modeling. However, directly prompting a pretrained language model to produce predictions is not feasible in many scientific domains due to the scarcity of domain-specific data in the pretraining corpora and the challenges of articulating complex problems in natural language. In this work, we introduce LICO, a general-purpose model that extends arbitrary base LLMs for black-box optimization, with a particular application to the molecular domain. To achieve this, we equip the language model with a separate embedding layer and prediction layer, and train the model to perform in-context predictions on a diverse set of functions defined over the domain. Once trained, LICO can generalize to unseen molecule properties simply via in-context prompting. LICO achieves state-of-the-art performance on PMO, a challenging molecular optimization benchmark comprising over 20 objective functions.
☆ Temporally Multi-Scale Sparse Self-Attention for Physical Activity Data Imputation
Wearable sensors enable health researchers to continuously collect data pertaining to the physiological state of individuals in real-world settings. However, such data can be subject to extensive missingness due to a complex combination of factors. In this work, we study the problem of imputation of missing step count data, one of the most ubiquitous forms of wearable sensor data. We construct a novel and large scale data set consisting of a training set with over 3 million hourly step count observations and a test set with over 2.5 million hourly step count observations. We propose a domain knowledge-informed sparse self-attention model for this task that captures the temporal multi-scale nature of step-count data. We assess the performance of the model relative to baselines and conduct ablation studies to verify our specific model designs.
comment: Accepted by Conference on Health, Inference, and Learning (CHIL) 2024
☆ Learning Retrieval Augmentation for Personalized Dialogue Generation EMNLP-2023
Personalized dialogue generation, focusing on generating highly tailored responses by leveraging persona profiles and dialogue context, has gained significant attention in conversational AI applications. However, persona profiles, a prevalent setting in current personalized dialogue datasets, typically composed of merely four to five sentences, may not offer comprehensive descriptions of the persona about the agent, posing a challenge to generate truly personalized dialogues. To handle this problem, we propose $\textbf{L}$earning Retrieval $\textbf{A}$ugmentation for $\textbf{P}$ersonalized $\textbf{D}$ial$\textbf{O}$gue $\textbf{G}$eneration ($\textbf{LAPDOG}$), which studies the potential of leveraging external knowledge for persona dialogue generation. Specifically, the proposed LAPDOG model consists of a story retriever and a dialogue generator. The story retriever uses a given persona profile as queries to retrieve relevant information from the story document, which serves as a supplementary context to augment the persona profile. The dialogue generator utilizes both the dialogue history and the augmented persona profile to generate personalized responses. For optimization, we adopt a joint training framework that collaboratively learns the story retriever and dialogue generator, where the story retriever is optimized towards desired ultimate metrics (e.g., BLEU) to retrieve content for the dialogue generator to generate personalized responses. Experiments conducted on the CONVAI2 dataset with ROCStory as a supplementary data source show that the proposed LAPDOG method substantially outperforms the baselines, indicating the effectiveness of the proposed method. The LAPDOG model code is publicly available for further exploration. https://github.com/hqsiswiliam/LAPDOG
comment: Accepted to EMNLP-2023
☆ Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training
Existing checkpointing approaches seem ill-suited for distributed training even though hardware limitations make model parallelism, i.e., sharding model state across multiple accelerators, a requirement for model scaling. Consolidating distributed model state into a single checkpoint unacceptably slows down training, and is impractical at extreme scales. Distributed checkpoints, in contrast, are tightly coupled to the model parallelism and hardware configurations of the training run, and thus unusable on different configurations. To address this problem, we propose Universal Checkpointing, a technique that enables efficient checkpoint creation while providing the flexibility of resuming on arbitrary parallelism strategy and hardware configurations. Universal Checkpointing unlocks unprecedented capabilities for large-scale training such as improved resilience to hardware failures through continued training on remaining healthy hardware, and reduced training time through opportunistic exploitation of elastic capacity. The key insight of Universal Checkpointing is the selection of the optimal representation in each phase of the checkpointing life cycle: distributed representation for saving, and consolidated representation for loading. This is achieved using two key mechanisms. First, the universal checkpoint format, which consists of a consolidated representation of each model parameter and metadata for mapping parameter fragments into training ranks of arbitrary model-parallelism configuration. Second, the universal checkpoint language, a simple but powerful specification language for converting distributed checkpoints into the universal checkpoint format. Our evaluation demonstrates the effectiveness and generality of Universal Checkpointing on state-of-the-art model architectures and a wide range of parallelism techniques.
☆ MissionGNN: Hierarchical Multimodal GNN-based Weakly Supervised Video Anomaly Recognition with Mission-Specific Knowledge Graph Generation
In the context of escalating safety concerns across various domains, the tasks of Video Anomaly Detection (VAD) and Video Anomaly Recognition (VAR) have emerged as critically important for applications in intelligent surveillance, evidence investigation, violence alerting, etc. These tasks, aimed at identifying and classifying deviations from normal behavior in video data, face significant challenges due to the rarity of anomalies which leads to extremely imbalanced data and the impracticality of extensive frame-level data annotation for supervised learning. This paper introduces a novel hierarchical graph neural network (GNN) based model MissionGNN that addresses these challenges by leveraging a state-of-the-art large language model and a comprehensive knowledge graph for efficient weakly supervised learning in VAR. Our approach circumvents the limitations of previous methods by avoiding heavy gradient computations on large multimodal models and enabling fully frame-level training without fixed video segmentation. Utilizing automated, mission-specific knowledge graph generation, our model provides a practical and efficient solution for real-time video analysis without the constraints of previous segmentation-based or multimodal approaches. Experimental validation on benchmark datasets demonstrates our model's performance in VAD and VAR, highlighting its potential to redefine the landscape of anomaly detection and recognition in video surveillance systems.
☆ Length Optimization in Conformal Prediction
Conditional validity and length efficiency are two crucial aspects of conformal prediction (CP). Achieving conditional validity ensures accurate uncertainty quantification for data subpopulations, while proper length efficiency ensures that the prediction sets remain informative and non-trivial. Despite significant efforts to address each of these issues individually, a principled framework that reconciles these two objectives has been missing in the CP literature. In this paper, we develop Conformal Prediction with Length-Optimization (CPL) - a novel framework that constructs prediction sets with (near-) optimal length while ensuring conditional validity under various classes of covariate shifts, including the key cases of marginal and group-conditional coverage. In the infinite sample regime, we provide strong duality results which indicate that CPL achieves conditional validity and length optimality. In the finite sample regime, we show that CPL constructs conditionally valid prediction sets. Our extensive empirical evaluations demonstrate the superior prediction set size performance of CPL compared to state-of-the-art methods across diverse real-world and synthetic datasets in classification, regression, and text-related settings.
☆ Density Ratio Estimation via Sampling along Generalized Geodesics on Statistical Manifolds
The density ratio of two probability distributions is one of the fundamental tools in mathematical and computational statistics and machine learning, and it has a variety of known applications. Therefore, density ratio estimation from finite samples is a very important task, but it is known to be unstable when the distributions are distant from each other. One approach to address this problem is density ratio estimation using incremental mixtures of the two distributions. We geometrically reinterpret existing methods for density ratio estimation based on incremental mixtures. We show that these methods can be regarded as iterating on the Riemannian manifold along a particular curve between the two probability distributions. Making use of the geometry of the manifold, we propose to consider incremental density ratio estimation along generalized geodesics on this manifold. To achieve such a method requires Monte Carlo sampling along geodesics via transformations of the two distributions. We show how to implement an iterative algorithm to sample along these geodesics and show how changing the distances along the geodesic affect the variance and accuracy of the estimation of the density ratio. Our experiments demonstrate that the proposed approach outperforms the existing approaches using incremental mixtures that do not take the geometry of the
☆ Online Stackelberg Optimization via Nonlinear Control COLT 2024
In repeated interaction problems with adaptive agents, our objective often requires anticipating and optimizing over the space of possible agent responses. We show that many problems of this form can be cast as instances of online (nonlinear) control which satisfy \textit{local controllability}, with convex losses over a bounded state space which encodes agent behavior, and we introduce a unified algorithmic framework for tractable regret minimization in such cases. When the instance dynamics are known but otherwise arbitrary, we obtain oracle-efficient $O(\sqrt{T})$ regret by reduction to online convex optimization, which can be made computationally efficient if dynamics are locally \textit{action-linear}. In the presence of adversarial disturbances to the state, we give tight bounds in terms of either the cumulative or per-round disturbance magnitude (for \textit{strongly} or \textit{weakly} locally controllable dynamics, respectively). Additionally, we give sublinear regret results for the cases of unknown locally action-linear dynamics as well as for the bandit feedback setting. Finally, we demonstrate applications of our framework to well-studied problems including performative prediction, recommendations for adaptive agents, adaptive pricing of real-valued goods, and repeated gameplay against no-regret learners, directly yielding extensions beyond prior results in each case.
comment: COLT 2024
☆ All Random Features Representations are Equivalent
Random features are an important technique that make it possible to rewrite positive-definite kernels as infinite-dimensional dot products. Over time, increasingly elaborate random feature representations have been developed in pursuit of finite approximations with ever lower error. We resolve this arms race by deriving an optimal sampling policy, and show that under this policy all random features representations have the same approximation error. This establishes a lower bound that holds across all random feature representations, and shows that we are free to choose whatever representation we please, provided we sample optimally.
☆ Infinite Width Models That Work: Why Feature Learning Doesn't Matter as Much as You Think
Common infinite-width architectures such as Neural Tangent Kernels (NTKs) have historically shown weak performance compared to finite models. This has been attributed to the absence of feature learning. We show that this is not the case. In fact, we show that infinite width NTK models are able to access richer features than finite models by selecting relevant subfeatures from their (infinite) feature vector. In fact, we show experimentally that NTKs under-perform traditional finite models even when feature learning is artificially disabled. Instead, weak performance is due to the fact that existing constructions depend on weak optimizers like SGD. We provide an infinite width limit based on ADAM-like learning dynamics and demonstrate empirically that the resulting models erase this performance gap.
☆ Private Zeroth-Order Nonsmooth Nonconvex Optimization
We introduce a new zeroth-order algorithm for private stochastic optimization on nonconvex and nonsmooth objectives. Given a dataset of size $M$, our algorithm ensures $(\alpha,\alpha\rho^2/2)$-R\'enyi differential privacy and finds a $(\delta,\epsilon)$-stationary point so long as $M=\tilde\Omega\left(\frac{d}{\delta\epsilon^3} + \frac{d^{3/2}}{\rho\delta\epsilon^2}\right)$. This matches the optimal complexity of its non-private zeroth-order analog. Notably, although the objective is not smooth, we have privacy ``for free'' whenever $\rho \ge \sqrt{d}\epsilon$.
☆ PathAlign: A vision-language model for whole slide images in histopathology
Microscopic interpretation of histopathology images underlies many important diagnostic and treatment decisions. While advances in vision-language modeling raise new opportunities for analysis of such images, the gigapixel-scale size of whole slide images (WSIs) introduces unique challenges. Additionally, pathology reports simultaneously highlight key findings from small regions while also aggregating interpretation across multiple slides, often making it difficult to create robust image-text pairs. As such, pathology reports remain a largely untapped source of supervision in computational pathology, with most efforts relying on region-of-interest annotations or self-supervision at the patch-level. In this work, we develop a vision-language model based on the BLIP-2 framework using WSIs paired with curated text from pathology reports. This enables applications utilizing a shared image-text embedding space, such as text or image retrieval for finding cases of interest, as well as integration of the WSI encoder with a frozen large language model (LLM) for WSI-based generative text capabilities such as report generation or AI-in-the-loop interactions. We utilize a de-identified dataset of over 350,000 WSIs and diagnostic text pairs, spanning a wide range of diagnoses, procedure types, and tissue types. We present pathologist evaluation of text generation and text retrieval using WSI embeddings, as well as results for WSI classification and workflow prioritization (slide-level triaging). Model-generated text for WSIs was rated by pathologists as accurate, without clinically significant error or omission, for 78% of WSIs on average. This work demonstrates exciting potential capabilities for language-aligned WSI embeddings.
comment: 9 main pages and 19 pages of supplemental material; 3 main tables, 3 main figures and 11 supplemental tables, 7 supplemental figures
☆ Deep Temporal Sequence Classification and Mathematical Modeling for Cell Tracking in Dense 3D Microscopy Videos of Bacterial Biofilms
Automatic cell tracking in dense environments is plagued by inaccurate correspondences and misidentification of parent-offspring relationships. In this paper, we introduce a novel cell tracking algorithm named DenseTrack, which integrates deep learning with mathematical model-based strategies to effectively establish correspondences between consecutive frames and detect cell division events in crowded scenarios. We formulate the cell tracking problem as a deep learning-based temporal sequence classification task followed by solving a constrained one-to-one matching optimization problem exploiting the classifier's confidence scores. Additionally, we present an eigendecomposition-based cell division detection strategy that leverages knowledge of cellular geometry. The performance of the proposed approach has been evaluated by tracking densely packed cells in 3D time-lapse image sequences of bacterial biofilm development. The experimental results on simulated as well as experimental fluorescence image sequences suggest that the proposed tracking method achieves superior performance in terms of both qualitative and quantitative evaluation measures compared to recent state-of-the-art cell tracking approaches.
☆ On Counterfactual Interventions in Vector Autoregressive Models
Counterfactual reasoning allows us to explore hypothetical scenarios in order to explain the impacts of our decisions. However, addressing such inquires is impossible without establishing the appropriate mathematical framework. In this work, we introduce the problem of counterfactual reasoning in the context of vector autoregressive (VAR) processes. We also formulate the inference of a causal model as a joint regression task where for inference we use both data with and without interventions. After learning the model, we exploit linearity of the VAR model to make exact predictions about the effects of counterfactual interventions. Furthermore, we quantify the total causal effects of past counterfactual interventions. The source code for this project is freely available at https://github.com/KurtButler/counterfactual_interventions.
☆ Instance-Optimal Private Density Estimation in the Wasserstein Distance
Estimating the density of a distribution from samples is a fundamental problem in statistics. In many practical settings, the Wasserstein distance is an appropriate error metric for density estimation. For example, when estimating population densities in a geographic region, a small Wasserstein distance means that the estimate is able to capture roughly where the population mass is. In this work we study differentially private density estimation in the Wasserstein distance. We design and analyze instance-optimal algorithms for this problem that can adapt to easy instances. For distributions $P$ over $\mathbb{R}$, we consider a strong notion of instance-optimality: an algorithm that uniformly achieves the instance-optimal estimation rate is competitive with an algorithm that is told that the distribution is either $P$ or $Q_P$ for some distribution $Q_P$ whose probability density function (pdf) is within a factor of 2 of the pdf of $P$. For distributions over $\mathbb{R}^2$, we use a different notion of instance optimality. We say that an algorithm is instance-optimal if it is competitive with an algorithm that is given a constant-factor multiplicative approximation of the density of the distribution. We characterize the instance-optimal estimation rates in both these settings and show that they are uniformly achievable (up to polylogarithmic factors). Our approach for $\mathbb{R}^2$ extends to arbitrary metric spaces as it goes via hierarchically separated trees. As a special case our results lead to instance-optimal private learning in TV distance for discrete distributions.
☆ Meta-Gradient Search Control: A Method for Improving the Efficiency of Dyna-style Planning
We study how a Reinforcement Learning (RL) system can remain sample-efficient when learning from an imperfect model of the environment. This is particularly challenging when the learning system is resource-constrained and in continual settings, where the environment dynamics change. To address these challenges, our paper introduces an online, meta-gradient algorithm that tunes a probability with which states are queried during Dyna-style planning. Our study compares the aggregate, empirical performance of this meta-gradient method to baselines that employ conventional sampling strategies. Results indicate that our method improves efficiency of the planning process, which, as a consequence, improves the sample-efficiency of the overall learning process. On the whole, we observe that our meta-learned solutions avoid several pathologies of conventional planning approaches, such as sampling inaccurate transitions and those that stall credit assignment. We believe these findings could prove useful, in future work, for designing model-based RL systems at scale.
☆ Cost-efficient Active Illumination Camera For Hyper-spectral Reconstruction
Hyper-spectral imaging has recently gained increasing attention for use in different applications, including agricultural investigation, ground tracking, remote sensing and many other. However, the high cost, large physical size and complicated operation process stop hyperspectral cameras from being employed for various applications and research fields. In this paper, we introduce a cost-efficient, compact and easy to use active illumination camera that may benefit many applications. We developed a fully functional prototype of such camera. With the hope of helping with agricultural research, we tested our camera for plant root imaging. In addition, a U-Net model for spectral reconstruction was trained by using a reference hyperspectral camera's data as ground truth and our camera's data as input. We demonstrated our camera's ability to obtain additional information over a typical RGB camera. In addition, the ability to reconstruct hyperspectral data from multi-spectral input makes our device compatible to models and algorithms developed for hyperspectral applications with no modifications required.
☆ BOrg: A Brain Organoid-Based Mitosis Dataset for Automatic Analysis of Brain Diseases
Recent advances have enabled the study of human brain development using brain organoids derived from stem cells. Quantifying cellular processes like mitosis in these organoids offers insights into neurodevelopmental disorders, but the manual analysis is time-consuming, and existing datasets lack specific details for brain organoid studies. We introduce BOrg, a dataset designed to study mitotic events in the embryonic development of the brain using confocal microscopy images of brain organoids. BOrg utilizes an efficient annotation pipeline with sparse point annotations and techniques that minimize expert effort, overcoming limitations of standard deep learning approaches on sparse data. We adapt and benchmark state-of-the-art object detection and cell counting models on BOrg for detecting and analyzing mitotic cells across prophase, metaphase, anaphase, and telophase stages. Our results demonstrate these adapted models significantly improve mitosis analysis efficiency and accuracy for brain organoid research compared to existing methods. BOrg facilitates the development of automated tools to quantify statistics like mitosis rates, aiding mechanistic studies of neurodevelopmental processes and disorders. Data and code are available at https://github.com/awaisrauf/borg.
☆ Rethinking harmless refusals when fine-tuning foundation models ICLR 2024
In this paper, we investigate the degree to which fine-tuning in Large Language Models (LLMs) effectively mitigates versus merely conceals undesirable behavior. Through the lens of semi-realistic role-playing exercises designed to elicit such behaviors, we explore the response dynamics of LLMs post fine-tuning interventions. Our methodology involves prompting models for Chain-of-Thought (CoT) reasoning and analyzing the coherence between the reasoning traces and the resultant outputs. Notably, we identify a pervasive phenomenon we term \emph{reason-based deception}, where models either stop producing reasoning traces or produce seemingly ethical reasoning traces that belie the unethical nature of their final outputs. We further examine the efficacy of response strategies (polite refusal versus explicit rebuttal) in curbing the occurrence of undesired behavior in subsequent outputs of multi-turn interactions. Our findings reveal that explicit rebuttals significantly outperform polite refusals in preventing the continuation of undesired outputs and nearly eliminate reason-based deception, challenging current practices in model fine-tuning. Accordingly, the two key contributions of this paper are (1) defining and studying reason-based deception, a new type of hidden behavior, and (2) demonstrating that rebuttals provide a more robust response model to harmful requests than refusals, thereby highlighting the need to reconsider the response strategies in fine-tuning approaches.
comment: ICLR 2024 AGI Workshop Poster
☆ ASCENT: Amplifying Power Side-Channel Resilience via Learning & Monte-Carlo Tree Search
Power side-channel (PSC) analysis is pivotal for securing cryptographic hardware. Prior art focused on securing gate-level netlists obtained as-is from chip design automation, neglecting all the complexities and potential side-effects for security arising from the design automation process. That is, automation traditionally prioritizes power, performance, and area (PPA), sidelining security. We propose a "security-first" approach, refining the logic synthesis stage to enhance the overall resilience of PSC countermeasures. We introduce ASCENT, a learning-and-search-based framework that (i) drastically reduces the time for post-design PSC evaluation and (ii) explores the security-vs-PPA design space. Thus, ASCENT enables an efficient exploration of a large number of candidate netlists, leading to an improvement in PSC resilience compared to regular PPA-optimized netlists. ASCENT is up to 120x faster than traditional PSC analysis and yields a 3.11x improvement for PSC resilience of state-of-the-art PSC countermeasures
comment: Accepted at 2024 ACM/IEEE International Conference on Computer-Aided Design
☆ Dataless Quadratic Neural Networks for the Maximum Independent Set Problem
Combinatorial Optimization (CO) plays a crucial role in addressing various significant problems, among them the challenging Maximum Independent Set (MIS) problem. In light of recent advancements in deep learning methods, efforts have been directed towards leveraging data-driven learning approaches, typically rooted in supervised learning and reinforcement learning, to tackle the NP-hard MIS problem. However, these approaches rely on labeled datasets, exhibit weak generalization, and often depend on problem-specific heuristics. Recently, ReLU-based dataless neural networks were introduced to address combinatorial optimization problems. This paper introduces a novel dataless quadratic neural network formulation, featuring a continuous quadratic relaxation for the MIS problem. Notably, our method eliminates the need for training data by treating the given MIS instance as a trainable entity. More specifically, the graph structure and constraints of the MIS instance are used to define the structure and parameters of the neural network such that training it on a fixed input provides a solution to the problem, thereby setting it apart from traditional supervised or reinforcement learning approaches. By employing a gradient-based optimization algorithm like ADAM and leveraging an efficient off-the-shelf GPU parallel implementation, our straightforward yet effective approach demonstrates competitive or superior performance compared to state-of-the-art learning-based methods. Another significant advantage of our approach is that, unlike exact and heuristic solvers, the running time of our method scales only with the number of nodes in the graph, not the number of edges.
☆ Forward and Backward State Abstractions for Off-policy Evaluation
Off-policy evaluation (OPE) is crucial for evaluating a target policy's impact offline before its deployment. However, achieving accurate OPE in large state spaces remains challenging.This paper studies state abstractions-originally designed for policy learning-in the context of OPE. Our contributions are three-fold: (i) We define a set of irrelevance conditions central to learning state abstractions for OPE. (ii) We derive sufficient conditions for achieving irrelevance in Q-functions and marginalized importance sampling ratios, the latter obtained by constructing a time-reversed Markov decision process (MDP) based on the observed MDP. (iii) We propose a novel two-step procedure that sequentially projects the original state space into a smaller space, which substantially simplify the sample complexity of OPE arising from high cardinality.
comment: 42 pages, 5 figures
☆ TocBERT: Medical Document Structure Extraction Using Bidirectional Transformers
Text segmentation holds paramount importance in the field of Natural Language Processing (NLP). It plays an important role in several NLP downstream tasks like information retrieval and document summarization. In this work, we propose a new solution, namely TocBERT, for segmenting texts using bidirectional transformers. TocBERT represents a supervised solution trained on the detection of titles and sub-titles from their semantic representations. This task was formulated as a named entity recognition (NER) problem. The solution has been applied on a medical text segmentation use-case where the Bio-ClinicalBERT model is fine-tuned to segment discharge summaries of the MIMIC-III dataset. The performance of TocBERT has been evaluated on a human-labeled ground truth corpus of 250 notes. It achieved an F1-score of 84.6% when evaluated on a linear text segmentation problem and 72.8% on a hierarchical text segmentation problem. It outperformed a carefully designed rule-based solution, particularly in distinguishing titles from subtitles.
comment: 6 pages, 6 figures
♻ ☆ Submodular Information Selection for Hypothesis Testing with Misclassification Penalties
We consider the problem of selecting an optimal subset of information sources for a hypothesis testing/classification task where the goal is to identify the true state of the world from a finite set of hypotheses, based on finite observation samples from the sources. In order to characterize the learning performance, we propose a misclassification penalty framework, which enables non-uniform treatment of different misclassification errors. In a centralized Bayesian learning setting, we study two variants of the subset selection problem: (i) selecting a minimum cost information set to ensure that the maximum penalty of misclassifying the true hypothesis remains bounded and (ii) selecting an optimal information set under a limited budget to minimize the maximum penalty of misclassifying the true hypothesis. Under certain assumptions, we prove that the objective (or constraints) of these combinatorial optimization problems are weak (or approximate) submodular, and establish high-probability performance guarantees for greedy algorithms. Further, we propose an alternate metric for information set selection which is based on the total penalty of misclassification. We prove that this metric is submodular and establish near-optimal guarantees for the greedy algorithms for both the information set selection problems. Finally, we present numerical simulations to validate our theoretical results over several randomly generated instances.
comment: 21 pages, 4 figures
♻ ☆ Physics-Guided Neural Networks for Intraventricular Vector Flow Mapping
Intraventricular vector flow mapping (iVFM) seeks to enhance and quantify color Doppler in cardiac imaging. In this study, we propose novel alternatives to the traditional iVFM optimization scheme by utilizing physics-informed neural networks (PINNs) and a physics-guided nnU-Net-based supervised approach. When evaluated on simulated color Doppler images derived from a patient-specific computational fluid dynamics model and in vivo Doppler acquisitions, both approaches demonstrate comparable reconstruction performance to the original iVFM algorithm. The efficiency of PINNs is boosted through dual-stage optimization and pre-optimized weights. On the other hand, the nnU-Net method excels in generalizability and real-time capabilities. Notably, nnU-Net shows superior robustness on sparse and truncated Doppler data while maintaining independence from explicit boundary conditions. Overall, our results highlight the effectiveness of these methods in reconstructing intraventricular vector blood flow. The study also suggests potential applications of PINNs in ultrafast color Doppler imaging and the incorporation of fluid dynamics equations to derive biomarkers for cardiovascular diseases based on blood flow.
comment: 12 pages, accepted for publication in IEEE TUFFC; camera ready corrections, corrected acknowledgments
♻ ☆ Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Large language models (LLMs) show inherent brittleness in their safety mechanisms, as evidenced by their susceptibility to jailbreaking and even non-malicious fine-tuning. This study explores this brittleness of safety alignment by leveraging pruning and low-rank modifications. We develop methods to identify critical regions that are vital for safety guardrails, and that are disentangled from utility-relevant regions at both the neuron and rank levels. Surprisingly, the isolated regions we find are sparse, comprising about $3\%$ at the parameter level and $2.5\%$ at the rank level. Removing these regions compromises safety without significantly impacting utility, corroborating the inherent brittleness of the model's safety mechanisms. Moreover, we show that LLMs remain vulnerable to low-cost fine-tuning attacks even when modifications to the safety-critical regions are restricted. These findings underscore the urgent need for more robust safety strategies in LLMs.
comment: 22 pages, 9 figures. Project page is available at https://boyiwei.com/alignment-attribution/
♻ ☆ CHESS: Contextual Harnessing for Efficient SQL Synthesis
Utilizing large language models (LLMs) for transforming natural language questions into SQL queries (text-to-SQL) is a promising yet challenging approach, particularly when applied to real-world databases with complex and extensive schemas. In particular, effectively incorporating data catalogs and database values for SQL generation remains an obstacle, leading to suboptimal solutions. We address this problem by proposing a new pipeline that effectively retrieves relevant data and context, selects an efficient schema, and synthesizes correct and efficient SQL queries. To increase retrieval precision, our pipeline introduces a hierarchical retrieval method leveraging model-generated keywords, locality-sensitive hashing indexing, and vector databases. Additionally, we have developed an adaptive schema pruning technique that adjusts based on the complexity of the problem and the model's context size. Our approach generalizes to both frontier proprietary models like GPT-4 and open-source models such as Llama-3-70B. Through a series of ablation studies, we demonstrate the effectiveness of each component of our pipeline and its impact on the end-to-end performance. Our method achieves new state-of-the-art performance on the cross-domain challenging BIRD dataset.
♻ ☆ WebCanvas: Benchmarking Web Agents in Online Environments
For web agents to be practically useful, they must adapt to the continuously evolving web environment characterized by frequent updates to user interfaces and content. However, most existing benchmarks only capture the static aspects of the web. To bridge this gap, we introduce WebCanvas, an innovative online evaluation framework for web agents that effectively addresses the dynamic nature of web interactions. WebCanvas contains three main components to facilitate realistic assessments: (1) A novel evaluation metric which reliably capture critical intermediate actions or states necessary for task completions while disregarding noise caused by insignificant events or changed web-elements. (2) A benchmark dataset called Mind2Web-Live, a refined version of original Mind2Web static dataset containing 542 tasks with 2439 intermediate evaluation states; (3) Lightweight and generalizable annotation tools and testing pipelines that enables the community to collect and maintain the high-quality, up-to-date dataset. Building on WebCanvas, we open-source an agent framework with extensible modules for reasoning, providing a foundation for the community to conduct online inference and evaluations. Our best-performing agent achieves a task success rate of 23.1% and a task completion rate of 48.8% on the Mind2Web-Live test set. Additionally, we analyze the performance discrepancies across various websites, domains, and experimental environments. We encourage the community to contribute further insights on online agent evaluation, thereby advancing this field of research.
comment: Our platform, tool and dataset are publically available at https://www.imean.ai/web-canvas/ and https://huggingface.co/datasets/iMeanAI/Mind2Web-Live/
♻ ☆ GSplit: Scaling Graph Neural Network Training on Large Graphs via Split-Parallelism
Graph neural networks (GNNs), an emerging class of machine learning models for graphs, have gained popularity for their superior performance in various graph analytical tasks. Mini-batch training is commonly used to train GNNs on large graphs, and data parallelism is the standard approach to scale mini-batch training across multiple GPUs. One of the major performance costs in GNN training is the loading of input features, which prevents GPUs from being fully utilized. In this paper, we argue that this problem is exacerbated by redundancies that are inherent to the data parallel approach. To address this issue, we introduce a hybrid parallel mini-batch training paradigm called split parallelism. Split parallelism avoids redundant data loads and splits the sampling and training of each mini-batch across multiple GPUs online, at each iteration, using a lightweight splitting algorithm. We implement split parallelism in GSplit and show that it outperforms state-of-the-art mini-batch training systems like DGL, Quiver, and $P^3$.
♻ ☆ $μ$GUIDE: a framework for quantitative imaging via generalized uncertainty-driven inference using deep learning
This work proposes $\mu$GUIDE: a general Bayesian framework to estimate posterior distributions of tissue microstructure parameters from any given biophysical model or MRI signal representation, with exemplar demonstration in diffusion-weighted MRI. Harnessing a new deep learning architecture for automatic signal feature selection combined with simulation-based inference and efficient sampling of the posterior distributions, $\mu$GUIDE bypasses the high computational and time cost of conventional Bayesian approaches and does not rely on acquisition constraints to define model-specific summary statistics. The obtained posterior distributions allow to highlight degeneracies present in the model definition and quantify the uncertainty and ambiguity of the estimated parameters.
♻ ☆ Thermometer: Towards Universal Calibration for Large Language Models ICML 2024
We consider the issue of calibration in large language models (LLM). Recent studies have found that common interventions such as instruction tuning often result in poorly calibrated LLMs. Although calibration is well-explored in traditional applications, calibrating LLMs is uniquely challenging. These challenges stem as much from the severe computational requirements of LLMs as from their versatility, which allows them to be applied to diverse tasks. Addressing these challenges, we propose THERMOMETER, a calibration approach tailored to LLMs. THERMOMETER learns an auxiliary model, given data from multiple tasks, for calibrating a LLM. It is computationally efficient, preserves the accuracy of the LLM, and produces better-calibrated responses for new tasks. Extensive empirical evaluations across various benchmarks demonstrate the effectiveness of the proposed method.
comment: Camera ready version for ICML 2024
♻ ☆ Optimistic Information Directed Sampling
We study the problem of online learning in contextual bandit problems where the loss function is assumed to belong to a known parametric function class. We propose a new analytic framework for this setting that bridges the Bayesian theory of information-directed sampling due to Russo and Van Roy (2018) and the worst-case theory of Foster, Kakade, Qian, and Rakhlin (2021) based on the decision-estimation coefficient. Drawing from both lines of work, we propose a algorithmic template called Optimistic Information-Directed Sampling and show that it can achieve instance-dependent regret guarantees similar to the ones achievable by the classic Bayesian IDS method, but with the major advantage of not requiring any Bayesian assumptions. The key technical innovation of our analysis is introducing an optimistic surrogate model for the regret and using it to define a frequentist version of the Information Ratio of Russo and Van Roy (2018), and a less conservative version of the Decision Estimation Coefficient of Foster et al. (2021). Keywords: Contextual bandits, information-directed sampling, decision estimation coefficient, first-order regret bounds.
♻ ☆ Efficient Interaction-Aware Interval Analysis of Neural Network Feedback Loops
In this paper, we propose a computationally efficient framework for interval reachability of systems with neural network controllers. Our approach leverages inclusion functions for the open-loop system and the neural network controller to embed the closed-loop system into a larger-dimensional embedding system, where a single trajectory over-approximates the original system's behavior under uncertainty. We propose two methods for constructing closed-loop embedding systems, which account for the interactions between the system and the controller in different ways. The interconnection-based approach considers the worst-case evolution of each coordinate separately by substituting the neural network inclusion function into the open-loop inclusion function. The interaction-based approach uses novel Jacobian-based inclusion functions to capture the first-order interactions between the open-loop system and the controller by leveraging state-of-the-art neural network verifiers. Finally, we implement our approach in a Python framework called ReachMM to demonstrate its efficiency and scalability on benchmarks and examples ranging to $200$ state dimensions.
♻ ☆ Coarse-to-Fine Concept Bottleneck Models
Deep learning algorithms have recently gained significant attention due to their impressive performance. However, their high complexity and un-interpretable mode of operation hinders their confident deployment in real-world safety-critical tasks. This work targets ante hoc interpretability, and specifically Concept Bottleneck Models (CBMs). Our goal is to design a framework that admits a highly interpretable decision making process with respect to human understandable concepts, on two levels of granularity. To this end, we propose a novel two-level concept discovery formulation leveraging: (i) recent advances in vision-language models, and (ii) an innovative formulation for coarse-to-fine concept selection via data-driven and sparsity-inducing Bayesian arguments. Within this framework, concept information does not solely rely on the similarity between the whole image and general unstructured concepts; instead, we introduce the notion of concept hierarchy to uncover and exploit more granular concept information residing in patch-specific regions of the image scene. As we experimentally show, the proposed construction not only outperforms recent CBM approaches, but also yields a principled framework towards interpetability.
♻ ☆ Refining Myocardial Infarction Detection: A Novel Multi-Modal Composite Kernel Strategy in One-Class Classification
Early detection of myocardial infarction (MI), a critical condition arising from coronary artery disease (CAD), is vital to prevent further myocardial damage. This study introduces a novel method for early MI detection using a one-class classification (OCC) algorithm in echocardiography. Our study overcomes the challenge of limited echocardiography data availability by adopting a novel approach based on Multi-modal Subspace Support Vector Data Description. The proposed technique involves a specialized MI detection framework employing multi-view echocardiography incorporating a composite kernel in the non-linear projection trick, fusing Gaussian and Laplacian sigmoid functions. Additionally, we enhance the update strategy of the projection matrices by adapting maximization for both or one of the modalities in the optimization process. Our method boosts MI detection capability by efficiently transforming features extracted from echocardiography data into an optimized lower-dimensional subspace. The OCC model trained specifically on target class instances from the comprehensive HMC-QU dataset that includes multiple echocardiography views indicates a marked improvement in MI detection accuracy. Our findings reveal that our proposed multi-view approach achieves a geometric mean of 71.24%, signifying a substantial advancement in echocardiography-based MI diagnosis and offering more precise and efficient diagnostic tools.
♻ ☆ Shortcut Learning in Medical Image Segmentation MICCAI 2024
Shortcut learning is a phenomenon where machine learning models prioritize learning simple, potentially misleading cues from data that do not generalize well beyond the training set. While existing research primarily investigates this in the realm of image classification, this study extends the exploration of shortcut learning into medical image segmentation. We demonstrate that clinical annotations such as calipers, and the combination of zero-padded convolutions and center-cropped training sets in the dataset can inadvertently serve as shortcuts, impacting segmentation accuracy. We identify and evaluate the shortcut learning on two different but common medical image segmentation tasks. In addition, we suggest strategies to mitigate the influence of shortcut learning and improve the generalizability of the segmentation models. By uncovering the presence and implications of shortcuts in medical image segmentation, we provide insights and methodologies for evaluating and overcoming this pervasive challenge and call for attention in the community for shortcuts in segmentation. Our code is public at https://github.com/nina-weng/shortcut_skinseg .
comment: 11 pages, 6 figures, accepted at MICCAI 2024
♻ ☆ Stable Differentiable Causal Discovery
Inferring causal relationships as directed acyclic graphs (DAGs) is an important but challenging problem. Differentiable Causal Discovery (DCD) is a promising approach to this problem, framing the search as a continuous optimization. But existing DCD methods are numerically unstable, with poor performance beyond tens of variables. In this paper, we propose Stable Differentiable Causal Discovery (SDCD), a new method that improves previous DCD methods in two ways: (1) It employs an alternative constraint for acyclicity; this constraint is more stable, both theoretically and empirically, and fast to compute. (2) It uses a training procedure tailored for sparse causal graphs, which are common in real-world scenarios. We first derive SDCD and prove its stability and correctness. We then evaluate it with both observational and interventional data and on both small-scale and large-scale settings. We find that SDCD outperforms existing methods in both convergence speed and accuracy and can scale to thousands of variables. We provide code at https://github.com/azizilab/sdcd.
♻ ☆ S4: Self-Supervised Sensing Across the Spectrum
Satellite image time series (SITS) segmentation is crucial for many applications like environmental monitoring, land cover mapping and agricultural crop type classification. However, training models for SITS segmentation remains a challenging task due to the lack of abundant training data, which requires fine grained annotation. We propose S4 a new self-supervised pre-training approach that significantly reduces the requirement for labeled training data by utilizing two new insights: (a) Satellites capture images in different parts of the spectrum such as radio frequencies, and visible frequencies. (b) Satellite imagery is geo-registered allowing for fine-grained spatial alignment. We use these insights to formulate pre-training tasks in S4. We also curate m2s2-SITS, a large-scale dataset of unlabeled, spatially-aligned, multi-modal and geographic specific SITS that serves as representative pre-training data for S4. Finally, we evaluate S4 on multiple SITS segmentation datasets and demonstrate its efficacy against competing baselines while using limited labeled data.
♻ ☆ Local to Global: Learning Dynamics and Effect of Initialization for Transformers
In recent years, transformer-based models have revolutionized deep learning, particularly in sequence modeling. To better understand this phenomenon, there is a growing interest in using Markov input processes to study transformers. However, our current understanding in this regard remains limited with many fundamental questions about how transformers learn Markov chains still unanswered. In this paper, we address this by focusing on first-order Markov chains and single-layer transformers, providing a comprehensive characterization of the learning dynamics in this context. Specifically, we prove that transformer parameters trained on next-token prediction loss can either converge to global or local minima, contingent on the initialization and the Markovian data properties, and we characterize the precise conditions under which this occurs. To the best of our knowledge, this is the first result of its kind highlighting the role of initialization. We further demonstrate that our theoretical findings are corroborated by empirical evidence. Based on these insights, we provide guidelines for the initialization of transformer parameters and demonstrate their effectiveness. Finally, we outline several open problems in this arena. Code is available at: https://github.com/Bond1995/Markov.
♻ ☆ Heterophily-Aware Graph Attention Network
Graph Neural Networks (GNNs) have shown remarkable success in graph representation learning. Unfortunately, current weight assignment schemes in standard GNNs, such as the calculation based on node degrees or pair-wise representations, can hardly be effective in processing the networks with heterophily, in which the connected nodes usually possess different labels or features. Existing heterophilic GNNs tend to ignore the modeling of heterophily of each edge, which is also a vital part in tackling the heterophily problem. In this paper, we firstly propose a heterophily-aware attention scheme and reveal the benefits of modeling the edge heterophily, i.e., if a GNN assigns different weights to edges according to different heterophilic types, it can learn effective local attention patterns, which enable nodes to acquire appropriate information from distinct neighbors. Then, we propose a novel Heterophily-Aware Graph Attention Network (HA-GAT) by fully exploring and utilizing the local distribution as the underlying heterophily, to handle the networks with different homophily ratios. To demonstrate the effectiveness of the proposed HA-GAT, we analyze the proposed heterophily-aware attention scheme and local distribution exploration, by seeking for an interpretation from their mechanism. Extensive results demonstrate that our HA-GAT achieves state-of-the-art performances on eight datasets with different homophily ratios in both the supervised and semi-supervised node classification tasks.
♻ ☆ VLSM-Adapter: Finetuning Vision-Language Segmentation Efficiently with Lightweight Blocks MICCAI 2024
Foundation Vision-Language Models (VLMs) trained using large-scale open-domain images and text pairs have recently been adapted to develop Vision-Language Segmentation Models (VLSMs) that allow providing text prompts during inference to guide image segmentation. If robust and powerful VLSMs can be built for medical images, it could aid medical professionals in many clinical tasks where they must spend substantial time delineating the target structure of interest. VLSMs for medical images resort to fine-tuning base VLM or VLSM pretrained on open-domain natural image datasets due to fewer annotated medical image datasets; this fine-tuning is resource-consuming and expensive as it usually requires updating all or a significant fraction of the pretrained parameters. Recently, lightweight blocks called adapters have been proposed in VLMs that keep the pretrained model frozen and only train adapters during fine-tuning, substantially reducing the computing resources required. We introduce a novel adapter, VLSM-Adapter, that can fine-tune pretrained vision-language segmentation models using transformer encoders. Our experiments in widely used CLIP-based segmentation models show that with only 3 million trainable parameters, the VLSM-Adapter outperforms state-of-the-art and is comparable to the upper bound end-to-end fine-tuning. The source code is available at: https://github.com/naamiinepal/vlsm-adapter.
comment: Accepted at MICCAI 2024, the 27th International Conference on Medical Image Computing and Computer Assisted Intervention
♻ ☆ The Price of Adaptivity in Stochastic Convex Optimization COLT
We prove impossibility results for adaptivity in non-smooth stochastic convex optimization. Given a set of problem parameters we wish to adapt to, we define a "price of adaptivity" (PoA) that, roughly speaking, measures the multiplicative increase in suboptimality due to uncertainty in these parameters. When the initial distance to the optimum is unknown but a gradient norm bound is known, we show that the PoA is at least logarithmic for expected suboptimality, and double-logarithmic for median suboptimality. When there is uncertainty in both distance and gradient norm, we show that the PoA must be polynomial in the level of uncertainty. Our lower bounds nearly match existing upper bounds, and establish that there is no parameter-free lunch. En route, we also establish tight upper and lower bounds for (known-parameter) high-probability stochastic convex optimization with heavy-tailed and bounded noise, respectively.
comment: Accepted for presentation at the Conference on Learning Theory (COLT) 2024; to appear in proceedings as an extended abstract
♻ ☆ MMGPL: Multimodal Medical Data Analysis with Graph Prompt Learning
Prompt learning has demonstrated impressive efficacy in the fine-tuning of multimodal large models to a wide range of downstream tasks. Nonetheless, applying existing prompt learning methods for the diagnosis of neurological disorder still suffers from two issues: (i) existing methods typically treat all patches equally, despite the fact that only a small number of patches in neuroimaging are relevant to the disease, and (ii) they ignore the structural information inherent in the brain connection network which is crucial for understanding and diagnosing neurological disorders. To tackle these issues, we introduce a novel prompt learning model by learning graph prompts during the fine-tuning process of multimodal large models for diagnosing neurological disorders. Specifically, we first leverage GPT-4 to obtain relevant disease concepts and compute semantic similarity between these concepts and all patches. Secondly, we reduce the weight of irrelevant patches according to the semantic similarity between each patch and disease-related concepts. Moreover, we construct a graph among tokens based on these concepts and employ a graph convolutional network layer to extract the structural information of the graph, which is used to prompt the pre-trained multimodal large models for diagnosing neurological disorders. Extensive experiments demonstrate that our method achieves superior performance for neurological disorder diagnosis compared with state-of-the-art methods and validated by clinicians.
♻ ☆ Self-Supervised Detection of Perfect and Partial Input-Dependent Symmetries
Group equivariance can overly constrain models if the symmetries in the group differ from those observed in data. While common methods address this by determining the appropriate level of symmetry at the dataset level, they are limited to supervised settings and ignore scenarios in which multiple levels of symmetry co-exist in the same dataset. In this paper, we propose a method able to detect the level of symmetry of each input without the need for labels. Our framework is general enough to accommodate different families of both continuous and discrete symmetry distributions, such as arbitrary unimodal, symmetric distributions and discrete groups. We validate the effectiveness of our approach on synthetic datasets with different per-class levels of symmetries, and demonstrate practical applications such as the detection of out-of-distribution symmetries. Our code is publicly available at https://github.com/aurban0/ssl-sym.
comment: 19 pages, 8 figures, corrected typos, revised argument in Appendix B.1, results unchanged
♻ ☆ D-GRIL: End-to-End Topological Learning with 2-parameter Persistence
End-to-end topological learning using 1-parameter persistence is well-known. We show that the framework can be enhanced using 2-parameter persistence by adopting a recently introduced 2-parameter persistence based vectorization technique called GRIL. We establish a theoretical foundation of differentiating GRIL producing D-GRIL. We show that D-GRIL can be used to learn a bifiltration function on standard benchmark graph datasets. Further, we exhibit that this framework can be applied in the context of bio-activity prediction in drug discovery.
♻ ☆ Improving Variational Autoencoder Estimation from Incomplete Data with Mixture Variational Families
We consider the task of estimating variational autoencoders (VAEs) when the training data is incomplete. We show that missing data increases the complexity of the model's posterior distribution over the latent variables compared to the fully-observed case. The increased complexity may adversely affect the fit of the model due to a mismatch between the variational and model posterior distributions. We introduce two strategies based on (i) finite variational-mixture and (ii) imputation-based variational-mixture distributions to address the increased posterior complexity. Through a comprehensive evaluation of the proposed approaches, we show that variational mixtures are effective at improving the accuracy of VAE estimation from incomplete data.
comment: Published in Transactions on Machine Learning Research (TMLR), 2024
♻ ☆ Optimizing Large Model Training through Overlapped Activation Recomputation
Large model training has been using recomputation to alleviate the memory pressure and pipelining to exploit the parallelism of data, tensor, and devices. The existing recomputation approaches may incur up to 40% overhead when training real-world models, e.g., the GPT model with 22B parameters. This is because they are executed on demand in the critical training path. In this paper, we design a new recomputation framework, Lynx, to reduce the overhead by overlapping the recomputation with communication occurring in training pipelines. It consists of an optimal scheduling algorithm (OPT) and a heuristic-based scheduling algorithm (HEU). OPT achieves a global optimum but suffers from a long search time. HEU was designed based on our observation that there are identical structures in large DNN models so that we can apply the same scheduling policy to all identical structures. HEU achieves a local optimum but reduces the search time by 99% compared to OPT. Our comprehensive evaluation using GPT models with 1.3B-20B parameters shows that both OPT and HEU outperform the state-of-the-art recomputation approaches (e.g., Megatron-LM and Checkmake) by 1.02-1.53x. HEU achieves a similar performance as OPT with a search time of 0.16s on average.
comment: 13 pages
♻ ☆ Long-term drought prediction using deep neural networks based on geospatial weather data
The problem of high-quality drought forecasting up to a year in advance is critical for agriculture planning and insurance. Yet, it is still unsolved with reasonable accuracy due to data complexity and aridity stochasticity. We tackle drought data by introducing an end-to-end approach that adopts a spatio-temporal neural network model with accessible open monthly climate data as the input. Our systematic research employs diverse proposed models and five distinct environmental regions as a testbed to evaluate the efficacy of the Palmer Drought Severity Index (PDSI) prediction. Key aggregated findings are the exceptional performance of a Transformer model, EarthFormer, in making accurate short-term (up to six months) forecasts. At the same time, the Convolutional LSTM excels in longer-term forecasting. Both models achieved high ROC AUC scores: 0.948 for one month ahead and 0.617 for twelve months ahead forecasts, becoming closer to perfect ROC-AUC by $54\%$ and $16\%$, respectively, c.t. classic approaches.
♻ ☆ Mapping the Potential of Explainable AI for Fairness Along the AI Lifecycle
The widespread use of artificial intelligence (AI) systems across various domains is increasingly surfacing issues related to algorithmic fairness, especially in high-stakes scenarios. Thus, critical considerations of how fairness in AI systems might be improved -- and what measures are available to aid this process -- are overdue. Many researchers and policymakers see explainable AI (XAI) as a promising way to increase fairness in AI systems. However, there is a wide variety of XAI methods and fairness conceptions expressing different desiderata, and the precise connections between XAI and fairness remain largely nebulous. Besides, different measures to increase algorithmic fairness might be applicable at different points throughout an AI system's lifecycle. Yet, there currently is no coherent mapping of fairness desiderata along the AI lifecycle. In this paper, we we distill eight fairness desiderata, map them along the AI lifecycle, and discuss how XAI could help address each of them. We hope to provide orientation for practical applications and to inspire XAI research specifically focused on these fairness desiderata.
♻ ☆ Physics-informed and Unsupervised Riemannian Domain Adaptation for Machine Learning on Heterogeneous EEG Datasets
Combining electroencephalogram (EEG) datasets for supervised machine learning (ML) is challenging due to session, subject, and device variability. ML algorithms typically require identical features at train and test time, complicating analysis due to varying sensor numbers and positions across datasets. Simple channel selection discards valuable data, leading to poorer performance, especially with datasets sharing few channels. To address this, we propose an unsupervised approach leveraging EEG signal physics. We map EEG channels to fixed positions using field interpolation, facilitating source-free domain adaptation. Leveraging Riemannian geometry classification pipelines and transfer learning steps, our method demonstrates robust performance in brain-computer interface (BCI) tasks and potential biomarker applications. Comparative analysis against a statistical-based approach known as Dimensionality Transcending, a signal-based imputation called ComImp, source-dependent methods, as well as common channel selection and spherical spline interpolation, was conducted with leave-one-dataset-out validation on six public BCI datasets for a right-hand/left-hand classification task. Numerical experiments show that in the presence of few shared channels in train and test, the field interpolation consistently outperforms other methods, demonstrating enhanced classification performance across all datasets. When more channels are shared, field interpolation was found to be competitive with other methods and faster to compute than source-dependent methods.
♻ ☆ SoK: Can Trajectory Generation Combine Privacy and Utility?
While location trajectories represent a valuable data source for analyses and location-based services, they can reveal sensitive information, such as political and religious preferences. Differentially private publication mechanisms have been proposed to allow for analyses under rigorous privacy guarantees. However, the traditional protection schemes suffer from a limiting privacy-utility trade-off and are vulnerable to correlation and reconstruction attacks. Synthetic trajectory data generation and release represent a promising alternative to protection algorithms. While initial proposals achieve remarkable utility, they fail to provide rigorous privacy guarantees. This paper proposes a framework for designing a privacy-preserving trajectory publication approach by defining five design goals, particularly stressing the importance of choosing an appropriate Unit of Privacy. Based on this framework, we briefly discuss the existing trajectory protection approaches, emphasising their shortcomings. This work focuses on the systematisation of the state-of-the-art generative models for trajectories in the context of the proposed framework. We find that no existing solution satisfies all requirements. Thus, we perform an experimental study evaluating the applicability of six sequential generative models to the trajectory domain. Finally, we conclude that a generative trajectory model providing semantic guarantees remains an open research question and propose concrete next steps for future research.
comment: Added DOI: 10.56553/popets-2024-0068
♻ ☆ Examining Common Paradigms in Multi-Task Learning
While multi-task learning (MTL) has gained significant attention in recent years, its underlying mechanisms remain poorly understood. Recent methods did not yield consistent performance improvements over single task learning (STL) baselines, underscoring the importance of gaining more profound insights about challenges specific to MTL. In our study, we investigate paradigms in MTL in the context of STL: First, the impact of the choice of optimizer has only been mildly investigated in MTL. We show the pivotal role of common STL tools such as the Adam optimizer in MTL empirically in various experiments. To further investigate Adam's effectiveness, we theoretical derive a partial loss-scale invariance under mild assumptions. Second, the notion of gradient conflicts has often been phrased as a specific problem in MTL. We delve into the role of gradient conflicts in MTL and compare it to STL. For angular gradient alignment we find no evidence that this is a unique problem in MTL. We emphasize differences in gradient magnitude as the main distinguishing factor. Overall, we find surprising similarities between STL and MTL suggesting to consider methods from both fields in a broader context.
comment: -
♻ ☆ Decentralized Stochastic Subgradient Methods for Nonsmooth Nonconvex Optimization
In this paper, we concentrate on decentralized optimization problems with nonconvex and nonsmooth objective functions, especially on the decentralized training of nonsmooth neural networks. We introduce a unified framework to analyze the global convergence of decentralized stochastic subgradient-based methods. We prove the global convergence of our proposed framework under mild conditions, by establishing that the generated sequence asymptotically approximates the trajectories of its associated differential inclusion. Furthermore, we establish that our proposed framework covers a wide range of existing efficient decentralized subgradient-based methods, including decentralized stochastic subgradient descent (DSGD), DSGD with gradient-tracking technique (DSGD-T), and DSGD with momentum (DSGD-M). In addition, we introduce the sign map to regularize the update directions in DSGD-M, and show it is enclosed in our proposed framework. Consequently, our convergence results establish, for the first time, global convergence of these methods when applied to nonsmooth nonconvex objectives. Preliminary numerical experiments demonstrate that our proposed framework yields highly efficient decentralized subgradient-based methods with convergence guarantees in the training of nonsmooth neural networks.
comment: 22 pages
♻ ☆ FAITH: Frequency-domain Attention In Two Horizons for Time Series Forecasting
Time Series Forecasting plays a crucial role in various fields such as industrial equipment maintenance, meteorology, energy consumption, traffic flow and financial investment. However, despite their considerable advantages over traditional statistical approaches, current deep learning-based predictive models often exhibit a significant deviation between their forecasting outcomes and the ground truth. This discrepancy is largely due to an insufficient emphasis on extracting the sequence's latent information, particularly its global information within the frequency domain and the relationship between different variables. To address this issue, we propose a novel model Frequency-domain Attention In Two Horizons, which decomposes time series into trend and seasonal components using a multi-scale sequence adaptive decomposition and fusion architecture, and processes them separately. FAITH utilizes Frequency Channel feature Extraction Module and Frequency Temporal feature Extraction Module to capture inter-channel relationships and temporal global information in the sequence, significantly improving its ability to handle long-term dependencies and complex patterns. Furthermore, FAITH achieves theoretically linear complexity by modifying the time-frequency domain transformation method, effectively reducing computational costs. Extensive experiments on 6 benchmarks for long-term forecasting and 3 benchmarks for short-term forecasting demonstrate that FAITH outperforms existing models in many fields, such as electricity, weather and traffic, proving its effectiveness and superiority both in long-term and short-term time series forecasting tasks. Our codes and data are available at https://github.com/LRQ577/FAITH.
♻ ☆ LExCI: A Framework for Reinforcement Learning with Embedded Systems
Advances in artificial intelligence (AI) have led to its application in many areas of everyday life. In the context of control engineering, reinforcement learning (RL) represents a particularly promising approach as it is centred around the idea of allowing an agent to freely interact with its environment to find an optimal strategy. One of the challenges professionals face when training and deploying RL agents is that the latter often have to run on dedicated embedded devices. This could be to integrate them into an existing toolchain or to satisfy certain performance criteria like real-time constraints. Conventional RL libraries, however, cannot be easily utilised in conjunction with that kind of hardware. In this paper, we present a framework named LExCI, the Learning and Experiencing Cycle Interface, which bridges this gap and provides end-users with a free and open-source tool for training agents on embedded systems using the open-source library RLlib. Its operability is demonstrated with two state-of-the-art RL-algorithms and a rapid control prototyping system.
comment: The code, models, and data used for this work are available in a separate branch of LExCI's GitHub repository (https://github.com/mechatronics-RWTH/lexci-2/tree/lexci_paper). This paper has been submitted to Applied Intelligence (https://link.springer.com/journal/10489). 2024-06-27: Updated the footnote on the title page so that it provides information about the paper's Version of Record
♻ ☆ G-Transformer: Counterfactual Outcome Prediction under Dynamic and Time-varying Treatment Regimes
In the context of medical decision making, counterfactual prediction enables clinicians to predict treatment outcomes of interest under alternative courses of therapeutic actions given observed patient history. Prior machine learning approaches for counterfactual predictions under time-varying treatments focus on static time-varying treatment regimes where treatments do not depend on previous covariate history. In this work, we present G-Transformer, a Transformer-based framework supporting g-computation for counterfactual prediction under dynamic and time-varying treatment strategies. G-Transfomer captures complex, long-range dependencies in time-varying covariates using a Transformer architecture. G-Transformer estimates the conditional distribution of relevant covariates given covariate and treatment history at each time point using an encoder architecture, then produces Monte Carlo estimates of counterfactual outcomes by simulating forward patient trajectories under treatment strategies of interest. We evaluate G-Transformer extensively using two simulated longitudinal datasets from mechanistic models, and a real-world sepsis ICU dataset from MIMIC-IV. G-Transformer outperforms both classical and state-of-the-art counterfactual prediction models in these settings. To the best of our knowledge, this is the first Transformer-based architecture for counterfactual outcome prediction under dynamic and time-varying treatment strategies.
♻ ☆ Continual Learning Under Language Shift
The recent increase in data and model scale for language model pre-training has led to huge training costs. In scenarios where new data become available over time, updating a model instead of fully retraining it would therefore provide significant gains. We study the pros and cons of updating a language model when new data comes from new languages -- the case of continual learning under language shift. Starting from a monolingual English language model, we incrementally add data from Danish, Icelandic, and Norwegian to investigate how forward and backward transfer effects depend on pre-training order and characteristics of languages, for three different model sizes. Our results show that, while forward transfer is largely positive and independent of language order, backward transfer can be positive or negative depending on the order and characteristics of new languages. We explore a number of potentially explanatory factors and find that a combination of language contamination and syntactic similarity best fits our results.
comment: Accepted to TSD 2024
♻ ☆ InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models
Large Language Models for code (code LLMs) have witnessed tremendous progress in recent years. With the rapid development of code LLMs, many popular evaluation benchmarks, such as HumanEval, DS-1000, and MBPP, have emerged to measure the performance of code LLMs with a particular focus on code generation tasks. However, they are insufficient to cover the full range of expected capabilities of code LLMs, which span beyond code generation to answering diverse coding-related questions. To fill this gap, we propose InfiBench, the first large-scale freeform question-answering (QA) benchmark for code to our knowledge, comprising 234 carefully selected high-quality Stack Overflow questions that span across 15 programming languages. InfiBench uses four types of model-free automatic metrics to evaluate response correctness where domain experts carefully concretize the criterion for each question. We conduct a systematic evaluation for over 100 latest code LLMs on InfiBench, leading to a series of novel and insightful findings. Our detailed analyses showcase potential directions for further advancement of code LLMs. InfiBench is fully open source and continuously expanding to foster more scientific and systematic practices for code LLM evaluation.
comment: 30 pages, 10 pages for main content, work in progress
♻ ☆ Spectral complexity of deep neural networks
It is well-known that randomly initialized, push-forward, fully-connected neural networks weakly converge to isotropic Gaussian processes, in the limit where the width of all layers goes to infinity. In this paper, we propose to use the angular power spectrum of the limiting field to characterize the complexity of the network architecture. In particular, we define sequences of random variables associated with the angular power spectrum, and provide a full characterization of the network complexity in terms of the asymptotic distribution of these sequences as the depth diverges. On this basis, we classify neural networks as low-disorder, sparse, or high-disorder; we show how this classification highlights a number of distinct features for standard activation functions, and in particular, sparsity properties of ReLU networks. Our theoretical results are also validated by numerical simulations.
♻ ☆ JAXbind: Bind any function to JAX
JAX is widely used in machine learning and scientific computing, the latter of which often relies on existing high-performance code that we would ideally like to incorporate into JAX. Reimplementing the existing code in JAX is often impractical and the existing interface in JAX for binding custom code either limits the user to a single Jacobian product or requires deep knowledge of JAX and its C++ backend for general Jacobian products. With JAXbind we drastically reduce the effort required to bind custom functions implemented in other programming languages with full support for Jacobian-vector products and vector-Jacobian products to JAX. Specifically, JAXbind provides an easy-to-use Python interface for defining custom, so-called JAX primitives. Via JAXbind, any function callable from Python can be exposed as a JAX primitive. JAXbind allows a user to interface the JAX function transformation engine with custom derivatives and batching rules, enabling all JAX transformations for the custom primitive.
comment: 4 pages, Github: https://github.com/NIFTy-PPL/JAXbind
♻ ☆ LayerMatch: Do Pseudo-labels Benefit All Layers?
Deep neural networks have achieved remarkable performance across various tasks when supplied with large-scale labeled data. However, the collection of labeled data can be time-consuming and labor-intensive. Semi-supervised learning (SSL), particularly through pseudo-labeling algorithms that iteratively assign pseudo-labels for self-training, offers a promising solution to mitigate the dependency of labeled data. Previous research generally applies a uniform pseudo-labeling strategy across all model layers, assuming that pseudo-labels exert uniform influence throughout. Contrasting this, our theoretical analysis and empirical experiment demonstrate feature extraction layer and linear classification layer have distinct learning behaviors in response to pseudo-labels. Based on these insights, we develop two layer-specific pseudo-label strategies, termed Grad-ReLU and Avg-Clustering. Grad-ReLU mitigates the impact of noisy pseudo-labels by removing the gradient detrimental effects of pseudo-labels in the linear classification layer. Avg-Clustering accelerates the convergence of feature extraction layer towards stable clustering centers by integrating consistent outputs. Our approach, LayerMatch, which integrates these two strategies, can avoid the severe interference of noisy pseudo-labels in the linear classification layer while accelerating the clustering capability of the feature extraction layer. Through extensive experimentation, our approach consistently demonstrates exceptional performance on standard semi-supervised learning benchmarks, achieving a significant improvement of 10.38% over baseline method and a 2.44% increase compared to state-of-the-art methods.
♻ ☆ Lifting Architectural Constraints of Injective Flows ICLR 2024
Normalizing Flows explicitly maximize a full-dimensional likelihood on the training data. However, real data is typically only supported on a lower-dimensional manifold leading the model to expend significant compute on modeling noise. Injective Flows fix this by jointly learning a manifold and the distribution on it. So far, they have been limited by restrictive architectures and/or high computational cost. We lift both constraints by a new efficient estimator for the maximum likelihood loss, compatible with free-form bottleneck architectures. We further show that naively learning both the data manifold and the distribution on it can lead to divergent solutions, and use this insight to motivate a stable maximum likelihood training objective. We perform extensive experiments on toy, tabular and image data, demonstrating the competitive performance of the resulting model.
comment: Camera-ready version: accepted to ICLR 2024
♻ ☆ Can Low-Rank Knowledge Distillation in LLMs be Useful for Microelectronic Reasoning?
In this work, we present empirical results regarding the feasibility of using offline large language models (LLMs) in the context of electronic design automation (EDA). The goal is to investigate and evaluate a contemporary language model's (Llama-2-7B) ability to function as a microelectronic Q & A expert as well as its reasoning, and generation capabilities in solving microelectronic-related problems. Llama-2-7B was tested across a variety of adaptation methods, including introducing a novel low-rank knowledge distillation (LoRA-KD) scheme. Our experiments produce both qualitative and quantitative results.
comment: 4 pages, 2 figures, 2 tables, The First IEEE International Workshop on LLM-Aided Design (LAD'24)
♻ ☆ AdaTreeFormer: Few Shot Domain Adaptation for Tree Counting from a Single High-Resolution Image
The process of estimating and counting tree density using only a single aerial or satellite image is a difficult task in the fields of photogrammetry and remote sensing. However, it plays a crucial role in the management of forests. The huge variety of trees in varied topography severely hinders tree counting models to perform well. The purpose of this paper is to propose a framework that is learnt from the source domain with sufficient labeled trees and is adapted to the target domain with only a limited number of labeled trees. Our method, termed as AdaTreeFormer, contains one shared encoder with a hierarchical feature extraction scheme to extract robust features from the source and target domains. It also consists of three subnets: two for extracting self-domain attention maps from source and target domains respectively and one for extracting cross-domain attention maps. For the latter, an attention-to-adapt mechanism is introduced to distill relevant information from different domains while generating tree density maps; a hierarchical cross-domain feature alignment scheme is proposed that progressively aligns the features from the source and target domains. We also adopt adversarial learning into the framework to further reduce the gap between source and target domains. Our AdaTreeFormer is evaluated on six designed domain adaptation tasks using three tree counting datasets, \ie Jiangsu, Yosemite, and London. Experimental results show that AdaTreeFormer significantly surpasses the state of the art, \eg in the cross domain from the Yosemite to Jiangsu dataset, it achieves a reduction of 15.9 points in terms of the absolute counting errors and an increase of 10.8\% in the accuracy of the detected trees' locations. The codes and datasets are available at https://github.com/HAAClassic/AdaTreeFormer.
♻ ☆ Deep Support Vectors
Deep learning has achieved tremendous success. \nj{However,} unlike SVMs, which provide direct decision criteria and can be trained with a small dataset, it still has significant weaknesses due to its requirement for massive datasets during training and the black-box characteristics on decision criteria. \nj{This paper addresses} these issues by identifying support vectors in deep learning models. To this end, we propose the DeepKKT condition, an adaptation of the traditional Karush-Kuhn-Tucker (KKT) condition for deep learning models, and confirm that generated Deep Support Vectors (DSVs) using this condition exhibit properties similar to traditional support vectors. This allows us to apply our method to few-shot dataset distillation problems and alleviate the black-box characteristics of deep learning models. Additionally, we demonstrate that the DeepKKT condition can transform conventional classification models into generative models with high fidelity, particularly as latent \jh{generative} models using class labels as latent variables. We validate the effectiveness of DSVs \nj{using common datasets (ImageNet, CIFAR10 \nj{and} CIFAR100) on the general architectures (ResNet and ConvNet)}, proving their practical applicability. (See Fig.~\ref{fig:generated})
♻ ☆ LPFormer: An Adaptive Graph Transformer for Link Prediction KDD'24
Link prediction is a common task on graph-structured data that has seen applications in a variety of domains. Classically, hand-crafted heuristics were used for this task. Heuristic measures are chosen such that they correlate well with the underlying factors related to link formation. In recent years, a new class of methods has emerged that combines the advantages of message-passing neural networks (MPNN) and heuristics methods. These methods perform predictions by using the output of an MPNN in conjunction with a "pairwise encoding" that captures the relationship between nodes in the candidate link. They have been shown to achieve strong performance on numerous datasets. However, current pairwise encodings often contain a strong inductive bias, using the same underlying factors to classify all links. This limits the ability of existing methods to learn how to properly classify a variety of different links that may form from different factors. To address this limitation, we propose a new method, LPFormer, which attempts to adaptively learn the pairwise encodings for each link. LPFormer models the link factors via an attention module that learns the pairwise encoding that exists between nodes by modeling multiple factors integral to link prediction. Extensive experiments demonstrate that LPFormer can achieve SOTA performance on numerous datasets while maintaining efficiency. The code is available at The code is available at https://github.com/HarryShomer/LPFormer.
comment: KDD'24
♻ ☆ Inference Attacks: A Taxonomy, Survey, and Promising Directions
The prosperity of machine learning has also brought people's concerns about data privacy. Among them, inference attacks can implement privacy breaches in various MLaaS scenarios and model training/prediction phases. Specifically, inference attacks can perform privacy inference on undisclosed target training sets based on outputs of the target model, including but not limited to statistics, membership, semantics, data representation, etc. For instance, infer whether the target data has the characteristics of AIDS. In addition, the rapid development of the machine learning community in recent years, especially the surge of model types and application scenarios, has further stimulated the inference attacks' research. Thus, studying inference attacks and analyzing them in depth is urgent and significant. However, there is still a gap in the systematic discussion of inference attacks from taxonomy, global perspective, attack, and defense perspectives. This survey provides an in-depth and comprehensive inference of attacks and corresponding countermeasures in ML-as-a-service based on taxonomy and the latest researches. Without compromising researchers' intuition, we first propose the 3MP taxonomy based on the community research status, trying to normalize the confusing naming system of inference attacks. Also, we analyze the pros and cons of each type of inference attack, their workflow, countermeasure, and how they interact with other attacks. In the end, we point out several promising directions for researchers from a more comprehensive and novel perspective.
♻ ☆ Time Series Modeling for Heart Rate Prediction: From ARIMA to Transformers
Cardiovascular disease (CVD) is a leading cause of death globally, necessitating precise forecasting models for monitoring vital signs like heart rate, blood pressure, and ECG. Traditional models, such as ARIMA and Prophet, are limited by their need for manual parameter tuning and challenges in handling noisy, sparse, and highly variable medical data. This study investigates advanced deep learning models, including LSTM, and transformer-based architectures, for predicting heart rate time series from the MIT-BIH Database. Results demonstrate that deep learning models, particularly PatchTST, significantly outperform traditional models across multiple metrics, capturing complex patterns and dependencies more effectively. This research underscores the potential of deep learning to enhance patient monitoring and CVD management, suggesting substantial clinical benefits. Future work should extend these findings to larger, more diverse datasets and real-world clinical applications to further validate and optimize model performance.
comment: Accepted by 2024 6th International Conference on Electronic Engineering and Informatics
♻ ☆ Glauber Generative Model: Discrete Diffusion Models via Binary Classification
We introduce the Glauber Generative Model (GGM), a new class of discrete diffusion models, to obtain new samples from a distribution given samples from a discrete space. GGM deploys a discrete Markov chain called the heat bath dynamics (or the Glauber dynamics) to denoise a sequence of noisy tokens to a sample from a joint distribution of discrete tokens. Our novel conceptual framework provides an exact reduction of the task of learning the denoising Markov chain to solving a class of binary classification tasks. More specifically, the model learns to classify a given token in a noisy sequence as signal or noise. In contrast, prior works on discrete diffusion models either solve regression problems to learn importance ratios, or minimize loss functions given by variational approximations. We apply GGM to language modeling and image generation, where images are discretized using image tokenizers like VQGANs. We show that it outperforms existing discrete diffusion models in language generation, and demonstrates strong performance for image generation without using dataset-specific image tokenizers. We also show that our model is capable of performing well in zero-shot control settings like text and image infilling.
♻ ☆ Kernelised Normalising Flows ICLR 2024
Normalising Flows are non-parametric statistical models characterised by their dual capabilities of density estimation and generation. This duality requires an inherently invertible architecture. However, the requirement of invertibility imposes constraints on their expressiveness, necessitating a large number of parameters and innovative architectural designs to achieve good results. Whilst flow-based models predominantly rely on neural-network-based transformations for expressive designs, alternative transformation methods have received limited attention. In this work, we present Ferumal flow, a novel kernelised normalising flow paradigm that integrates kernels into the framework. Our results demonstrate that a kernelised flow can yield competitive or superior results compared to neural network-based flows whilst maintaining parameter efficiency. Kernelised flows excel especially in the low-data regime, enabling flexible non-parametric density estimation in applications with sparse data availability.
comment: Alternate title: Kernelized Normalizing Flows; Accepted at ICLR 2024
♻ ☆ Error Bounds of Supervised Classification from Information-Theoretic Perspective
There remains a list of unanswered research questions on deep learning (DL), including the remarkable generalization power of overparametrized neural networks, the efficient optimization performance despite the non-convexity, and the mechanisms behind flat minima in generalization. In this paper, we adopt an information-theoretic perspective to explore the theoretical foundations of supervised classification using deep neural networks (DNNs). Our analysis introduces the concepts of fitting error and model risk, which, together with generalization error, constitute an upper bound on the expected risk. We demonstrate that the generalization errors are bounded by the complexity, influenced by both the smoothness of distribution and the sample size. Consequently, task complexity serves as a reliable indicator of the dataset's quality, guiding the setting of regularization hyperparameters. Furthermore, the derived upper bound fitting error links the back-propagated gradient, Neural Tangent Kernel (NTK), and the model's parameter count with the fitting error. Utilizing the triangle inequality, we establish an upper bound on the expected risk. This bound offers valuable insights into the effects of overparameterization, non-convex optimization, and the flat minima in DNNs.Finally, empirical verification confirms a significant positive correlation between the derived theoretical bounds and the practical expected risk, confirming the practical relevance of the theoretical findings.
♻ ☆ Data Reconstruction Attacks and Defenses: A Systematic Evaluation
Reconstruction attacks and defenses are essential in understanding the data leakage problem in machine learning. However, prior work has centered around empirical observations of gradient inversion attacks, lacks theoretical justifications, and cannot disentangle the usefulness of defending methods from the computational limitation of attacking methods. In this work, we propose to view the problem as an inverse problem, enabling us to theoretically, quantitatively, and systematically evaluate the data reconstruction problem. On various defense methods, we derived the algorithmic upper bound and the matching (in feature dimension and model width) information-theoretical lower bound on the reconstruction error for two-layer neural networks. To complement the theoretical results and investigate the utility-privacy trade-off, we defined a natural evaluation metric of the defense methods with similar utility loss among the strongest attacks. We further propose a strong reconstruction attack that helps update some previous understanding of the strength of defense methods under our proposed evaluation metric.
♻ ☆ MixerFlow: MLP-Mixer meets Normalising Flows ECML-PKDD 2024
Normalising flows are generative models that transform a complex density into a simpler density through the use of bijective transformations enabling both density estimation and data generation from a single model. %However, the requirement for bijectivity imposes the use of specialised architectures. In the context of image modelling, the predominant choice has been the Glow-based architecture, whereas alternative architectures remain largely unexplored in the research community. In this work, we propose a novel architecture called MixerFlow, based on the MLP-Mixer architecture, further unifying the generative and discriminative modelling architectures. MixerFlow offers an efficient mechanism for weight sharing for flow-based models. Our results demonstrate comparative or superior density estimation on image datasets and good scaling as the image resolution increases, making MixerFlow a simple yet powerful alternative to the Glow-based architectures. We also show that MixerFlow provides more informative embeddings than Glow-based architectures and can integrate many structured transformations such as splines or Kolmogorov-Arnold Networks.
comment: Alternative title: MixerFlow for Image Modelling; Accepted at ECML-PKDD 2024
♻ ☆ Fast Sampling via Discrete Non-Markov Diffusion Models
Discrete diffusion models have emerged as powerful tools for high-quality data generation. Despite their success in discrete spaces, such as text generation tasks, the acceleration of discrete diffusion models remains under explored. In this paper, we propose a discrete non-Markov diffusion model, which admits an accelerated reverse sampling for discrete data generation. Our method significantly reduces the number of function evaluations (i.e., calls to the neural network), making the sampling process much faster. Furthermore, we study the transition from finite to infinite step sampling, offering new insights into bridging the gap between discrete and continuous-time processes for discrete diffusion models. Extensive experiments on natural language generation and machine translation tasks demonstrate the superior performance of our method in terms of both generation speed and sample quality compared to existing methods for discrete diffusion models.
comment: 33 pages, 5 figures, 12 tables
♻ ☆ MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases ICML 2024
This paper addresses the growing need for efficient large language models (LLMs) on mobile devices, driven by increasing cloud costs and latency concerns. We focus on designing top-quality LLMs with fewer than a billion parameters, a practical choice for mobile deployment. Contrary to prevailing belief emphasizing the pivotal role of data and parameter quantity in determining model quality, our investigation underscores the significance of model architecture for sub-billion scale LLMs. Leveraging deep and thin architectures, coupled with embedding sharing and grouped-query attention mechanisms, we establish a strong baseline network denoted as MobileLLM, which attains a remarkable 2.7%/4.3% accuracy boost over preceding 125M/350M state-of-the-art models. Additionally, we propose an immediate block-wise weight-sharing approach with no increase in model size and only marginal latency overhead. The resultant models, denoted as MobileLLM-LS, demonstrate a further accuracy enhancement of 0.7%/0.8% than MobileLLM 125M/350M. Moreover, MobileLLM model family shows significant improvements compared to previous sub-billion models on chat benchmarks, and demonstrates close correctness to LLaMA-v2 7B in API calling tasks, highlighting the capability of small models for common on-device use cases.
comment: ICML 2024. Code is available at https://github.com/facebookresearch/MobileLLM
♻ ☆ Transfer Learning in ECG Diagnosis: Is It Effective?
The adoption of deep learning in ECG diagnosis is often hindered by the scarcity of large, well-labeled datasets in real-world scenarios, leading to the use of transfer learning to leverage features learned from larger datasets. Yet the prevailing assumption that transfer learning consistently outperforms training from scratch has never been systematically validated. In this study, we conduct the first extensive empirical study on the effectiveness of transfer learning in multi-label ECG classification, by investigating comparing the fine-tuning performance with that of training from scratch, covering a variety of ECG datasets and deep neural networks. We confirm that fine-tuning is the preferable choice for small downstream datasets; however, when the dataset is sufficiently large, training from scratch can achieve comparable performance, albeit requiring a longer training time to catch up. Furthermore, we find that transfer learning exhibits better compatibility with convolutional neural networks than with recurrent neural networks, which are the two most prevalent architectures for time-series ECG applications. Our results underscore the importance of transfer learning in ECG diagnosis, yet depending on the amount of available data, researchers may opt not to use it, considering the non-negligible cost associated with pre-training.
♻ ☆ Real-Time Machine-Learning-Based Optimization Using Input Convex LSTM
Neural network-based optimization and control have gradually supplanted first-principles model-based approaches in energy and manufacturing systems due to their efficient, data-driven process modeling that requires fewer resources. However, their non-convex nature significantly slows down the optimization and control processes, limiting their application in real-time decision-making processes. To address this challenge, we propose a novel Input Convex Long Short-Term Memory (ICLSTM) network to enhance the computational efficiency of neural network-based optimization. Through two case studies employing real-time neural network-based optimization for optimizing energy and chemical systems, we demonstrate the superior performance of ICLSTM-based optimization in terms of runtime. Specifically, in a real-time optimization problem of a real-world solar photovoltaic (PV) energy system at LHT Holdings in Singapore, ICLSTM-based optimization achieved an 8-fold speedup compared to conventional LSTM-based optimization. These results highlight the potential of ICLSTM networks to significantly enhance the efficiency of neural network-based optimization and control in practical applications. Source code is available at https://github.com/killingbear999/ICLSTM.
♻ ☆ Bi-Mamba+: Bidirectional Mamba for Time Series Forecasting
Long-term time series forecasting (LTSF) provides longer insights into future trends and patterns. Over the past few years, deep learning models especially Transformers have achieved advanced performance in LTSF tasks. However, LTSF faces inherent challenges such as long-term dependencies capturing and sparse semantic characteristics. Recently, a new state space model (SSM) named Mamba is proposed. With the selective capability on input data and the hardware-aware parallel computing algorithm, Mamba has shown great potential in balancing predicting performance and computational efficiency compared to Transformers. To enhance Mamba's ability to preserve historical information in a longer range, we design a novel Mamba+ block by adding a forget gate inside Mamba to selectively combine the new features with the historical features in a complementary manner. Furthermore, we apply Mamba+ both forward and backward and propose Bi-Mamba+, aiming to promote the model's ability to capture interactions among time series elements. Additionally, multivariate time series data in different scenarios may exhibit varying emphasis on intra- or inter-series dependencies. Therefore, we propose a series-relation-aware decider that controls the utilization of channel-independent or channel-mixing tokenization strategy for specific datasets. Extensive experiments on 8 real-world datasets show that our model achieves more accurate predictions compared with state-of-the-art methods.
comment: New Mamba-based architecture. All experiments rerun
♻ ☆ Sample Complexity of Offline Distributionally Robust Linear Markov Decision Processes
In offline reinforcement learning (RL), the absence of active exploration calls for attention on the model robustness to tackle the sim-to-real gap, where the discrepancy between the simulated and deployed environments can significantly undermine the performance of the learned policy. To endow the learned policy with robustness in a sample-efficient manner in the presence of high-dimensional state-action space, this paper considers the sample complexity of distributionally robust linear Markov decision processes (MDPs) with an uncertainty set characterized by the total variation distance using offline data. We develop a pessimistic model-based algorithm and establish its sample complexity bound under minimal data coverage assumptions, which outperforms prior art by at least $\widetilde{O}(d)$, where $d$ is the feature dimension. We further improve the performance guarantee of the proposed algorithm by incorporating a carefully-designed variance estimator.
comment: accepted by Reinforcement Learning Conference (RLC)
♻ ☆ CAT: Interpretable Concept-based Taylor Additive Models
As an emerging interpretable technique, Generalized Additive Models (GAMs) adopt neural networks to individually learn non-linear functions for each feature, which are then combined through a linear model for final predictions. Although GAMs can explain deep neural networks (DNNs) at the feature level, they require large numbers of model parameters and are prone to overfitting, making them hard to train and scale. Additionally, in real-world datasets with many features, the interpretability of feature-based explanations diminishes for humans. To tackle these issues, recent research has shifted towards concept-based interpretable methods. These approaches try to integrate concept learning as an intermediate step before making predictions, explaining the predictions in terms of human-understandable concepts. However, these methods require domain experts to extensively label concepts with relevant names and their ground-truth values. In response, we propose CAT, a novel interpretable Concept-bAsed Taylor additive model to simply this process. CAT does not have to require domain experts to annotate concepts and their ground-truth values. Instead, it only requires users to simply categorize input features into broad groups, which can be easily accomplished through a quick metadata review. Specifically, CAT first embeds each group of input features into one-dimensional high-level concept representation, and then feeds the concept representations into a new white-box Taylor Neural Network (TaylorNet). The TaylorNet aims to learn the non-linear relationship between the inputs and outputs using polynomials. Evaluation results across multiple benchmarks demonstrate that CAT can outperform or compete with the baselines while reducing the need of extensive model parameters. Importantly, it can explain model predictions through high-level concepts that human can understand.
♻ ☆ Some Primal-Dual Theory for Subgradient Methods for Strongly Convex Optimization
We consider (stochastic) subgradient methods for strongly convex but potentially nonsmooth non-Lipschitz optimization. We provide new equivalent dual descriptions (in the style of dual averaging) for the classic subgradient method, the proximal subgradient method, and the switching subgradient method. These equivalences enable $O(1/T)$ convergence guarantees in terms of both their classic primal gap and a not previously analyzed dual gap for strongly convex optimization. Consequently, our theory provides these classic methods with simple, optimal stopping criteria and optimality certificates at no added computational cost. Our results apply to a wide range of stepsize selections and of non-Lipschitz ill-conditioned problems where the early iterations of the subgradient method may diverge exponentially quickly (a phenomenon which, to the best of our knowledge, no prior works address). Even in the presence of such undesirable behaviors, our theory still ensures and bounds eventual convergence.
comment: 24 pages, major revision shortened the write-up and unified the analysis to be done just once in a single "super" setting
♻ ☆ Beyond Anti-Forgetting: Multimodal Continual Instruction Tuning with Positive Forward Transfer
Multimodal Continual Instruction Tuning (MCIT) enables Multimodal Large Language Models (MLLMs) to meet continuously emerging requirements without expensive retraining. MCIT faces two major obstacles: catastrophic forgetting (where old knowledge is forgotten) and negative forward transfer (where the performance of future tasks is degraded). Although existing methods have greatly alleviated catastrophic forgetting, they still suffer from negative forward transfer. We discover a large discrepancy in different input embeddings by performing singular value decomposition (SVD) on input embeddings. This discrepancy results in the model learning irrelevant information for old and pre-trained tasks, leading to catastrophic forgetting and negative forward transfer. To address these issues, we propose Prompt Tuning with Positive Forward Transfer (Fwd-Prompt), a prompt-based method that projects the prompt gradient to the residual space to minimize interference between tasks and to the pre-trained subspace for reusing pre-trained knowledge. Our experiments demonstrate that Fwd-Prompt achieves state-of-the-art performance while updating fewer parameters and requiring no old samples. Our research illuminates the potential of continuously adapting MLLMs to new tasks under the instruction tuning paradigm and encourages future studies to explore MCIT.
♻ ☆ Reinforcement Learning in Credit Scoring and Underwriting
This paper proposes a novel reinforcement learning (RL) framework for credit underwriting that tackles ungeneralizable contextual challenges. We adapt RL principles for credit scoring, incorporating action space renewal and multi-choice actions. Our work demonstrates that the traditional underwriting approach aligns with the RL greedy strategy. We introduce two new RL-based credit underwriting algorithms to enable more informed decision-making. Simulations show these new approaches outperform the traditional method in scenarios where the data aligns with the model. However, complex situations highlight model limitations, emphasizing the importance of powerful machine learning models for optimal performance. Future research directions include exploring more sophisticated models alongside efficient exploration mechanisms.
♻ ☆ Neural Operator for Accelerating Coronal Magnetic Field Model
Studying the sun's outer atmosphere is challenging due to its complex magnetic fields impacting solar activities. Magnetohydrodynamics (MHD) simulations help model these interactions but are extremely time-consuming (usually on a scale of days). Our research applies the Fourier Neural Operator (FNO) to accelerate the coronal magnetic field modeling, specifically, the Bifrost MHD model. We apply Tensorized FNO (TFNO) to generate solutions from partial differential equations (PDEs) over a 3D domain efficiently. TFNO's performance is compared with other deep learning methods, highlighting its accuracy and scalability. Physics analysis confirms that TFNO is reliable and capable of accelerating MHD simulations with high precision. This advancement improves efficiency in data handling, enhances predictive capabilities, and provides a better understanding of magnetic topologies.
♻ ☆ Extraction of nonlinearity in neural networks with Koopman operator
Nonlinearity plays a crucial role in deep neural networks. In this paper, we investigate the degree to which the nonlinearity of the neural network is essential. For this purpose, we employ the Koopman operator, extended dynamic mode decomposition, and the tensor-train format. The Koopman operator approach has been recently developed in physics and nonlinear sciences; the Koopman operator deals with the time evolution in the observable space instead of the state space. Since we can replace the nonlinearity in the state space with the linearity in the observable space, it is a hopeful candidate for understanding complex behavior in nonlinear systems. Here, we analyze learned neural networks for the classification problems. As a result, the replacement of the nonlinear middle layers with the Koopman matrix yields enough accuracy in numerical experiments. In addition, we confirm that the pruning of the Koopman matrix gives sufficient accuracy even at high compression ratios. These results indicate the possibility of extracting some features in the neural networks with the Koopman operator approach.
comment: 22 pages, 14 figures
♻ ☆ ASID: Active Exploration for System Identification in Robotic Manipulation
Model-free control strategies such as reinforcement learning have shown the ability to learn control strategies without requiring an accurate model or simulator of the world. While this is appealing due to the lack of modeling requirements, such methods can be sample inefficient, making them impractical in many real-world domains. On the other hand, model-based control techniques leveraging accurate simulators can circumvent these challenges and use a large amount of cheap simulation data to learn controllers that can effectively transfer to the real world. The challenge with such model-based techniques is the requirement for an extremely accurate simulation, requiring both the specification of appropriate simulation assets and physical parameters. This requires considerable human effort to design for every environment being considered. In this work, we propose a learning system that can leverage a small amount of real-world data to autonomously refine a simulation model and then plan an accurate control strategy that can be deployed in the real world. Our approach critically relies on utilizing an initial (possibly inaccurate) simulator to design effective exploration policies that, when deployed in the real world, collect high-quality data. We demonstrate the efficacy of this paradigm in identifying articulation, mass, and other physical parameters in several challenging robotic manipulation tasks, and illustrate that only a small amount of real-world data can allow for effective sim-to-real transfer. Project website at https://weirdlabuw.github.io/asid
comment: Project website at https://weirdlabuw.github.io/asid
♻ ☆ System Identification for Continuous-time Linear Dynamical Systems
The problem of system identification for the Kalman filter, relying on the expectation-maximization (EM) procedure to learn the underlying parameters of a dynamical system, has largely been studied assuming that observations are sampled at equally-spaced time points. However, in many applications this is a restrictive and unrealistic assumption. This paper addresses system identification for the continuous-discrete filter, with the aim of generalizing learning for the Kalman filter by relying on a solution to a continuous-time It\^o stochastic differential equation (SDE) for the latent state and covariance dynamics. We introduce a novel two-filter, analytical form for the posterior with a Bayesian derivation, which yields analytical updates which do not require the forward-pass to be pre-computed. Using this analytical and efficient computation of the posterior, we provide an EM procedure which estimates the parameters of the SDE, naturally incorporating irregularly sampled measurements. Generalizing the learning of latent linear dynamical systems (LDS) to continuous-time may extend the use of the hybrid Kalman filter to data which is not regularly sampled or has intermittent missing values, and can extend the power of non-linear system identification methods such as switching LDS (SLDS), which rely on EM for the linear discrete-time Kalman filter as a sub-unit for learning locally linearized behavior of a non-linear system. We apply the method by learning the parameters of a latent, multivariate Fokker-Planck SDE representing a toggle-switch genetic circuit using biologically realistic parameters, and compare the efficacy of learning relative to the discrete-time Kalman filter as the step-size irregularity and spectral-radius of the dynamics-matrix increases.
comment: 31 pages, 3 figures. Only light changes and restructuring to previous version made
♻ ☆ Cross-conformal e-prediction
This note discusses a simple modification of cross-conformal prediction inspired by recent work on e-values. The precursor of conformal prediction developed in the 1990s by Gammerman, Vapnik, and Vovk was also based on e-values and is called conformal e-prediction in this note. Replacing e-values by p-values led to conformal prediction, which has important advantages over conformal e-prediction without obvious disadvantages. The situation with cross-conformal prediction is, however, different: whereas for cross-conformal prediction validity is only an empirical fact (and can be broken with excessive randomization), this note draws the reader's attention to the obvious fact that cross-conformal e-prediction enjoys a guaranteed property of validity.
comment: 8 pages. This version: exposition improved; proof of Proposition 4 added
♻ ☆ Intriguing Properties of Adversarial ML Attacks in the Problem Space [Extended Version]
Recent research efforts on adversarial machine learning (ML) have investigated problem-space attacks, focusing on the generation of real evasive objects in domains where, unlike images, there is no clear inverse mapping to the feature space (e.g., software). However, the design, comparison, and real-world implications of problem-space attacks remain underexplored. This article makes three major contributions. Firstly, we propose a general formalization for adversarial ML evasion attacks in the problem-space, which includes the definition of a comprehensive set of constraints on available transformations, preserved semantics, absent artifacts, and plausibility. We shed light on the relationship between feature space and problem space, and we introduce the concept of side-effect features as the by-product of the inverse feature-mapping problem. This enables us to define and prove necessary and sufficient conditions for the existence of problem-space attacks. Secondly, building on our general formalization, we propose a novel problem-space attack on Android malware that overcomes past limitations in terms of semantics and artifacts. We have tested our approach on a dataset with 150K Android apps from 2016 and 2018 which show the practical feasibility of evading a state-of-the-art malware classifier along with its hardened version. Thirdly, we explore the effectiveness of adversarial training as a possible approach to enforce robustness against adversarial samples, evaluating its effectiveness on the considered machine learning models under different scenarios. Our results demonstrate that "adversarial-malware as a service" is a realistic threat, as we automatically generate thousands of realistic and inconspicuous adversarial applications at scale, where on average it takes only a few minutes to generate an adversarial instance.
comment: This arXiv version (v3) corresponds to an extended version
♻ ☆ Symbolic Prompt Program Search: A Structure-Aware Approach to Efficient Compile-Time Prompt Optimization
In many modern LLM applications, such as retrieval augmented generation, prompts have become programs themselves. In these settings, prompt programs are repeatedly called with different user queries or data instances. A big practical challenge is optimizing such prompt programs. Recent work has mostly focused on either simple prompt programs or assumed that the general structure of a prompt program is fixed. We introduce SAMMO, a framework to perform symbolic prompt program search for compile-time optimizations of prompt programs. SAMMO represents prompt programs on a symbolic level which allows for a rich set of transformations that can be searched over during optimization. We show that SAMMO generalizes previous methods and improves the performance of complex prompts on (1) instruction tuning, (2) RAG pipeline tuning, and (3) prompt compression, across several different LLMs. We make all code available open-source at https://github.com/microsoft/sammo .
♻ ☆ A Hierarchical Neural Framework for Classification and its Explanation in Large Unstructured Legal Documents CIKM 2023
Automatic legal judgment prediction and its explanation suffer from the problem of long case documents exceeding tens of thousands of words, in general, and having a non-uniform structure. Predicting judgments from such documents and extracting their explanation becomes a challenging task, more so on documents with no structural annotation. We define this problem as "scarce annotated legal documents" and explore their lack of structural information and their long lengths with a deep-learning-based classification framework which we call MESc; "Multi-stage Encoder-based Supervised with-clustering"; for judgment prediction. We explore the adaptability of LLMs with multi-billion parameters (GPT-Neo, and GPT-J) to legal texts and their intra-domain(legal) transfer learning capacity. Alongside this, we compare their performance and adaptability with MESc and the impact of combining embeddings from their last layers. For such hierarchical models, we also propose an explanation extraction algorithm named ORSE; Occlusion sensitivity-based Relevant Sentence Extractor; based on the input-occlusion sensitivity of the model, to explain the predictions with the most relevant sentences from the document. We explore these methods and test their effectiveness with extensive experiments and ablation studies on legal documents from India, the European Union, and the United States with the ILDC dataset and a subset of the LexGLUE dataset. MESc achieves a minimum total performance gain of approximately 2 points over previous state-of-the-art proposed methods, while ORSE applied on MESc achieves a total average gain of 50% over the baseline explainability scores.
comment: Published as non archival paper in the The 3rd International Workshop on Mining and Learning in the Legal Domain (MLLD-2023) at CIKM 2023, Birmingham, United Kingdom. (https://sites.google.com/view/mlld2023/)
♻ ☆ Cross-Modality Program Representation Learning for Electronic Design Automation with High-Level Synthesis
In recent years, domain-specific accelerators (DSAs) have gained popularity for applications such as deep learning and autonomous driving. To facilitate DSA designs, programmers use high-level synthesis (HLS) to compile a high-level description written in C/C++ into a design with low-level hardware description languages that eventually synthesize DSAs on circuits. However, creating a high-quality HLS design still demands significant domain knowledge, particularly in microarchitecture decisions expressed as \textit{pragmas}. Thus, it is desirable to automate such decisions with the help of machine learning for predicting the quality of HLS designs, requiring a deeper understanding of the program that consists of original code and pragmas. Naturally, these programs can be considered as sequence data. In addition, these programs can be compiled and converted into a control data flow graph (CDFG). But existing works either fail to leverage both modalities or combine the two in shallow or coarse ways. We propose ProgSG, a model that allows interaction between the source code sequence modality and the graph modality in a deep and fine-grained way. To alleviate the scarcity of labeled designs, a pre-training method is proposed based on a suite of compiler's data flow analysis tasks. Experimental results show that ProgSG reduces the RMSE of design performance predictions by up to $22\%$, and identifies designs with an average of $1.10\times$ and $1.26\times$ (up to $8.17\times$ and $13.31\times$) performance improvement in design space exploration (DSE) task compared to HARP and AutoDSE, respectively.
comment: 14 pages, 8 figures. arXiv admin note: text overlap with arXiv:2305.10838
♻ ☆ Condition Monitoring with Incomplete Data: An Integrated Variational Autoencoder and Distance Metric Framework
Condition monitoring of industrial systems is crucial for ensuring safety and maintenance planning, yet notable challenges arise in real-world settings due to the limited or non-existent availability of fault samples. This paper introduces an innovative solution to this problem by proposing a new method for fault detection and condition monitoring for unseen data. Adopting an approach inspired by zero-shot learning, our method can identify faults and assign a relative health index to various operational conditions. Typically, we have plenty of data on normal operations, some data on compromised conditions, and very few (if any) samples of severe faults. We use a variational autoencoder to capture the probabilistic distribution of previously seen and new unseen conditions. The health status is determined by comparing each sample's deviation from a normal operation reference distribution in the latent space. Faults are detected by establishing a threshold for the health indexes, allowing the model to identify severe, unseen faults with high accuracy, even amidst noise. We validate our approach using the run-to-failure IMS-bearing dataset and compare it with other methods. The health indexes generated by our model closely match the established descriptive model of bearing wear, attesting to the robustness and reliability of our method. These findings highlight the potential of our methodology in augmenting fault detection capabilities within industrial domains, thereby contributing to heightened safety protocols and optimized maintenance practices.
comment: Accepted in the 2024 IEEE 20th International Conference on Automation Science and Engineering (CASE 2024)
♻ ☆ Directions of Curvature as an Explanation for Loss of Plasticity
Loss of plasticity is a phenomenon in which neural networks lose their ability to learn from new experience. Despite being empirically observed in several problem settings, little is understood about the mechanisms that lead to loss of plasticity. In this paper, we offer a consistent explanation for loss of plasticity: Neural networks lose directions of curvature during training and that loss of plasticity can be attributed to this reduction in curvature. To support such a claim, we provide a systematic investigation of loss of plasticity across continual learning tasks using MNIST, CIFAR-10 and ImageNet. Our findings illustrate that loss of curvature directions coincides with loss of plasticity, while also showing that previous explanations are insufficient to explain loss of plasticity in all settings. Lastly, we show that regularizers which mitigate loss of plasticity also preserve curvature, motivating a simple distributional regularizer that proves to be effective across the problem settings we considered.
Systems and Control 28
☆ Unconditional Stability Analysis of N-Port Networks Based on Structured Singular Value Computation
In this paper, a novel approach based on robust stability concepts and tools is introduced to evaluate the unconditional stability of microwave active $\textit{n}$-port devices. An efficient calculation of the Structured Singular Value of the $\textit{n}$x$\textit{n}$ scattering matrix is proposed to obtain the stability characteristics of the device. The presented method is validated in two ways. First, it is applied to a referential 4x4 scattering parameter set for independent verification. Second, the method is applied to a 4-port GaAs FET amplifier fabricated in hybrid technology. The results confirm the validity and computational efficiency of the proposed approach.
☆ A Max Pressure Algorithm for Traffic Signals Considering Pedestrian Queues
This paper proposes a novel max-pressure (MP) algorithm that incorporates pedestrian traffic into the MP control architecture. Pedestrians are modeled as being included in one of two groups: those walking on sidewalks and those queued at intersections waiting to cross. Traffic dynamics models for both groups are developed. Under the proposed control policy, the signal timings are determined based on the queue length of both vehicles and pedestrians waiting to cross the intersection. The proposed algorithm maintains the decentralized control structure, and the paper proves that it also exhibits the maximum stability property for both vehicles and pedestrians. Microscopic traffic simulation results demonstrate that the proposed model can improve the overall operational efficiency -- i.e., reduce person travel delays -- under various vehicle demand levels compared to the original queue-based MP (Q-MP) algorithm and a recently developed rule-based MP algorithm considering pedestrians. The Q-MP ignores the yielding behavior of right-turn vehicles to conflicting pedestrian movements, which leads to high delay for vehicles. On the other hand, the delay incurred by pedestrians is high from the rule-based model since it imposes large waiting time tolerance to guarantee the operational efficiency of vehicles. The proposed algorithm outperforms both models since the states of both vehicles and pedestrians are taken into consideration to determine signal timings.
☆ OCC-MP: A Max-Pressure framework to prioritize transit and high occupancy vehicles
Max-pressure (MP) is a decentralized adaptive traffic signal control approach that has been shown to maximize throughput for private vehicles. However, MP-based signal control algorithms do not differentiate the movement of transit vehicles from private vehicles or between high and single-occupancy private vehicles. Prioritizing the movement of transit or other high occupancy vehicles (HOVs) is vital to reduce congestion and improve the reliability and efficiency of transit operations. This study proposes OCC-MP: a novel MP-based algorithm that considers both vehicle queues and passenger occupancies in computing the weights of movements. By weighing movements with higher passenger occupancies more heavily, transit and other HOVs are implicitly provided with priority, while accounting for any negative impacts of that priority on single occupancy vehicles. And, unlike rule-based transit signal priority (TSP) strategies, OCC-MP more naturally also accommodates conflicting transit routes at a signalized intersection and facilitates their movement, even in mixed traffic without dedicated lanes. Simulations on a grid network under varying demands and transit configurations demonstrate the effectiveness of OCC-MP at providing TSP while simultaneously reducing the negative impact imparted onto lower occupancy private vehicles. Furthermore, OCC-MP is shown to have a larger stable region for demand compared to rule-based TSP strategies integrated into the MP framework. The performance of OCC-MP is also shown to be robust to errors in passenger occupancy information from transit vehicles and can be applied when passenger occupancies of private vehicles are not available. Finally, OCC-MP can be applied in a partially connected vehicle (CV) environment when a subset of vehicles is able to provide information to the signal controller, outperforming baseline methods at low CV penetration rates.
☆ Analysis of Multistage Feedforward Operational Transconductance Amplifiers using Single-Pole Approximation
This paper presents analysis results of the operational transconductance amplifiers (OTAs) that combine feedforward paths and multistage amplifiers to achieve high-gain wideband operation as well as frequency compensation. To analyze multistage feedforward OTAs and provide an intuitive design method, the single-pole approximation model is employed for each substage of the OTA. Using the single-pole approximation model, the analysis is carried out from the two-stage OTA to the four-stage OTA in this work.
comment: 10 pages, 9 figures, preprint
☆ In-situ Controller Autotuning by Bayesian Optimization for Closed-loop Feedback Control of Laser Powder Bed Fusion Process
Open-loop control of laser powder bed fusion (LPBF) additive manufacturing (AM) has enabled the production of complex, high-criticality parts for various industries. This method relies on static parameter sets from extensive experimentation and simulations, hoping they remain stable and defect-free in production. Closed-loop control of LPBF can further enhance process stability and reduce defects despite complex thermal histories, process noise, hardware drift, and unexpected perturbations. Controller performance depends on parameter tuning, traditionally a manual, expertise-driven process with no guarantee of optimal performance and limited transferability between systems. This study proposes Bayesian Optimization (BO) to automate in-layer controller tuning by leveraging LPBF's layer-to-layer repetitive nature. Two approaches are introduced: online tuning, adjusting parameters iteratively during the process, and offline tuning, conducted in a setup such as laser exposures on a bare metal plate. These methods are experimentally implemented on an in-layer PI controller, and the performance is investigated on two wedge geometries prone to overheating. Results show that BO effectively tunes controllers using either method, significantly reducing overheating in controlled wedge specimens compared to uncontrolled ones. This study presents the first printed parts controlled by an in-layer controller subjected to microstructural analysis. Findings reveal partial presence of lack-of-fusion porosities due to insufficient laser power assigned by the controller, highlighting a significant challenge for utilizing laser power controllers. In summary, BO presents a promising method for automatic in-layer controller tuning in LPBF, enhancing control precision and mitigating overheating in production parts.
☆ Oligopoly Game Stabilisation Through Multilayer Congestion Dynamics
International trade and logistics are subject to factors including geopolitical instability, climate change, and black swan events such as the unforeseen closure of the Suez Canal. The problem of predicting local price change under modification of an underlying transport network or change in supply characteristics unites elements of game theory, network theory and transport. The Cournot Oligopoly models economic actors as rational players attempting to maximise profit by optimising supply quantities with analytical results now consolidated about equilibrium characteristics where transport conditions are fixed. Similarly, where supply and demand are fixed, the routing of goods in a transport network can be analytically solved through a traffic assignment problem. Hence we can solve the coupled Cournot-congestion problem by means of a 2-layer network. Where the layers are linked, inter-layer feedback wherein players attempt to maximise their utility occurs. In this respect we find players benefit from taking advantage of non-simultaneous responses to the market rather than moving to a new equilibrium. We draw conclusions about the nature of equilibria, finding that the concave utility curve property results in unique and stable equilibrium for each uncoupled layer, while linked layers have a non-unique stable equilibria for which general solutions are stated.
comment: 27 pages, 11 figures
☆ Parameter Dependent Chen--Fliess Series and Their Nonrecursive Interconnections
A class of parameter dependent Chen--Fliess series is introduced where the series coefficients are taken from a noncommutative ring of multivariable differential operators. Such series are shown in the linear case to represent formal solutions to Cauchy initial value problems for nonhomogeneous PDEs and thus are useful for characterizing the input-output maps of distributed control systems. It is also shown that this class of functional series is almost closed under the set of nonrecursive interconnections, that is, any finite combination of parallel and series interconnections without a closed-loop. Some sufficient conditions are needed for the series interconnection. Specific examples are given involving the transport equation and the wave equation.
☆ Distributed Utility Optimization in Vehicular Communication Systems
In this paper, we study the problem of utility maximization in the uplink of vehicle-to-infrastructure communication systems. The studied scenarios consider four practical aspects of mobile radio communication links: i) Interference between adjacent channels, ii) interference between roadside units along the way, iii) fast and slow channel fadings, and iv) Doppler shift effects. We present first the system model for the IEEE 802.11p standard, which considers a communication network between vehicles and roadside infrastructure. Next, we formulate the problem of utility maximization in the network, and propose a distributed optimization scheme. This distributed scheme is based on a two-loop feedback configuration, where an outer-loop establishes the optimal signal to interference-noise ratio (SINR) that maximizes the utility function per vehicle and defines a quality-of-service objective. Meanwhile, inner-control loops adjust the transmission power to achieve this optimal SINR reference in each vehicle node regardless of interference, time-varying channel profiles and network latency. The computation complexity of the distributed utility maximization scheme is analyzed for each feedback loop. Simulation results indicate that the proposed scheme reaches the objective SINRs that maximize utility and improve energy efficiency in the network with a low time cost. The results also show that the maximum utility is consistently achieved for different propagation scenarios inside the vehicular communication network.
comment: Index Terms: Vehicular communications, transmission power, utility maximization, feedback control
☆ Lithium-Ion Battery System Health Monitoring and Fault Analysis from Field Data Using Gaussian Processes
Health monitoring, fault analysis, and detection are critical for the safe and sustainable operation of battery systems. We apply Gaussian process resistance models on lithium iron phosphate battery field data to effectively separate the time-dependent and operating point-dependent resistance. The data set contains 29 battery systems returned to the manufacturer for warranty, each with eight cells in series, totaling 232 cells and 131 million data rows. We develop probabilistic fault detection rules using recursive spatiotemporal Gaussian processes. These processes allow the quick processing of over a million data points, enabling advanced online monitoring and furthering the understanding of battery pack failure in the field. The analysis underlines that often, only a single cell shows abnormal behavior or a knee point, consistent with weakest-link failure for cells connected in series, amplified by local resistive heating. The results further the understanding of how batteries degrade and fail in the field and demonstrate the potential of efficient online monitoring based on data. We open-source the code and publish the large data set upon completion of the review of this article.
☆ Generalized Averaging Method for Power Electronics Modeling from DC to above Half the Switching Frequency
Modeling power electronic converters at frequencies close to or above half the switching frequency has been difficult due to the time-variant and discontinuous switching actions. This paper uses the properties of moving Fourier coefficients to develop the generalized averaging method, breaking though the limit of half the switching frequency. The paper also proposes the generalized average model for various switching signals, including pulse-width modulation (PWM), phase-shift modulation, pulse-frequency modulation (PFM), and state-dependent switching signals, so that circuits and modulators/controllers can be modeled separately and combined flexibly. Using the Laplace transform of moving Fourier coefficients, the coupling of signals and their sidebands at different frequencies is clearly described as the coupling of moving Fourier coefficients at the same frequency in a linear time-invariant system framework. The modeling method is applied to a PWM controlled boost converter, a V2 constant on-time controlled buck converter, and a PFM controlled LLC converter, for demonstration and validation. Experimental results of the converters in different operating modes show that the proposed models have higher accuracy than exiting models, especially in the frequency range close to or above half the switching frequency. The developed method can be applied to almost all types of power electronic converters.
☆ Verification and Synthesis of Compatible Control Lyapunov and Control Barrier Functions
Safety and stability are essential properties of control systems. Control Barrier Functions (CBFs) and Control Lyapunov Functions (CLFs) have been proposed to ensure safety and stability respectively. However, previous approaches typically verify and synthesize the CBFs and CLFs separately, satisfying their respective constraints, without proving that the CBFs and CLFs are compatible with each other, namely at every state, there exists control actions that satisfy both the CBF and CLF constraints simultaneously. There exists some recent works that synthesized compatible CLF and CBF, but relying on nominal polynomial or rational controllers, which is just a sufficient but not necessary condition for compatibility. In this work, we investigate verification and synthesis of compatible CBF and CLF independent from any nominal controllers. We derive exact necessary and sufficient conditions for compatibility, and further formulate Sum-Of-Squares program for the compatibility verification. Based on our verification framework, we also design an alternating nominal-controller-free synthesis method. We evaluate our method in a linear toy, a non-linear toy, and a power converter example.
☆ A Wireless, Multicolor Fluorescence Image Sensor Implant for Real-Time Monitoring in Cancer Therapy
Real-time monitoring of dynamic biological processes in the body is critical to understanding disease progression and treatment response. This data, for instance, can help address the lower than 50% response rates to cancer immunotherapy. However, current clinical imaging modalities lack the molecular contrast, resolution, and chronic usability for rapid and accurate response assessments. Here, we present a fully wireless image sensor featuring a 2.5$\times$5 mm$^2$ CMOS integrated circuit for multicolor fluorescence imaging deep in tissue. The sensor operates wirelessly via ultrasound (US) at 5 cm depth in oil, harvesting energy with 221 mW/cm$^{2}$ incident US power density (31% of FDA limits) and backscattering data at 13 kbps with a bit error rate <$10^{-6}$. In-situ fluorescence excitation is provided by micro-laser diodes controlled with a programmable on-chip driver. An optical frontend combining a multi-bandpass interference filter and a fiber optic plate provides >6 OD excitation blocking and enables three-color imaging for detecting multiple cell types. A 36$\times$40-pixel array captures images with <125 $\mu$m resolution. We demonstrate wireless, dual-color fluorescence imaging of both effector and suppressor immune cells in ex vivo mouse tumor samples with and without immunotherapy. These results show promise for providing rapid insight into therapeutic response and resistance, guiding personalized medicine.
comment: *equally contributing authors
☆ A Thermo-Electro-Mechanical Model for Long-Term Reliability of Aging Transmission Lines
Integrity and reliability of a national power grid system are essential to society's development and security. Among the power grid components, transmission lines are critical due to exposure and vulnerability to severe external conditions, including high winds, ice, and extreme temperatures. The combined effects of external agents with high electrical load and presence of damage precursors greatly affects the conducting material's properties due to a thermal runaway cycle that accelerates the aging process. In this paper, we develop a thermo-electro-mechanical model for long-term failure analysis of overhead transmission lines. A phase-field model of damage and fatigue, coupled with electrical and thermal modules, provides a detailed description of the conductor's temperature evolution. We define a limit state function based on maximum operating temperature to avoid excessive overheating and sagging. We study four representative scenarios deterministically, and propose the Probabilistic Collocation Method (PCM) as a tool to understand the stochastic behavior of the system. We use PCM in forward parametric uncertainty quantification, global sensitivity analysis, and computation of failure probability curves in a straightforward and computationally efficient fashion, and we quantify the most influential parameters that affect the failure predictability from a physics-based perspective.
☆ State and Input Constrained Output-Feedback Adaptive Optimal Control of Affine Nonlinear Systems
In this paper, a novel online, output-feedback, critic-only, model-based reinforcement learning framework is developed for safety-critical control systems operating in complex environments. The developed framework ensures system stability and safety, regardless of the lack of full-state measurement, while learning and implementing an optimal controller. The approach leverages linear matrix inequality-based observer design method to efficiently search for observer gains for effective state estimation. Then, approximate dynamic programming is used to develop an approximate controller that uses simulated experiences to guarantee the safety and stability of the closed-loop system. Safety is enforced by adding a recentered robust Lyapunov-like barrier function to the cost function that effectively enforces safety constraints, even in the presence of uncertainty in the state. Lyapunov-based stability analysis is used to guarantee uniform ultimate boundedness of the trajectories of the closed-loop system and ensure safety. Simulation studies are performed to demonstrate the effectiveness of the developed method through two real-world safety-critical scenarios, ensuring that the state trajectories of a given system remain in a given set and obstacle avoidance.
☆ A Thermal Study of Terahertz Induced Protein Interactions
Proteins can be regarded as thermal nanosensors in an intra-body network. Upon being stimulated by Terahertz (THz) frequencies that match their vibrational modes, protein molecules experience resonant absorption and dissipate their energy as heat, undergoing a thermal process. This paper aims to analyze the effect of THz signaling on the protein heat dissipation mechanism. We therefore deploy a mathematical framework based on the heat diffusion model to characterize how proteins absorb THz-electromagnetic (EM) energy from the stimulating EM fields and subsequently release this energy as heat to their immediate surroundings. We also conduct a parametric study to explain the impact of the signal power, pulse duration, and interparticle distance on the protein thermal analysis. In addition, we demonstrate the relationship between the change in temperature and the opening probability of thermally-gated ion channels. Our results indicate that a controlled temperature change can be achieved in an intra-body environment by exciting protein particles at their resonant frequencies. We further verify our results numerically using COMSOL Multiphysics and introduce an experimental framework that assesses the effects of THz radiation on protein particles. We conclude that under controlled heating, protein molecules can serve as hotspots that impact thermally-gated ion channels. Through the presented work, we infer that the heating process can be engineered on different time and length scales by controlling the THz-EM signal input.
comment: Accepted for publication in the IEEE Transactions on NanoBioscience
☆ Microgrid Planner: A Distributed Energy Resource Sizing Method
We present a heuristic search method for distrubuted energy resource sizing, released in Microgrid Planner, an open-source software platform. Our method is constructed to identify a wide range of microgrid design options that satisfy a given set of power load requirements, allowing a decision maker to weigh trade-offs between potential designs and select preferred solutions. We introduce a global binary search algorithm to build a diverse set of microgrid design options and refine them using a local linear search method.
♻ ☆ Revisiting Kinetic Monte Carlo Algorithms for Time-dependent Processes: from open-loop control to feedback control
Simulating stochastic systems with feedback control is challenging due to the complex interplay between the system's dynamics and the feedback-dependent control protocols. We present a single-step-trajectory probability analysis to time-dependent stochastic systems. Based on this analysis, we revisit several time-dependent kinetic Monte Carlo (KMC) algorithms designed for systems under open-loop-control protocols. Our analysis provides an unified alternative proof to these algorithms, summarized into a pedagogical tutorial. Moreover, with the trajectory probability analysis, we present a novel feedback-controlled KMC algorithm that accurately captures the dynamics systems controlled by external signal based on measurements of the system's state. Our method correctly captures the system dynamics and avoids the artificial Zeno effect that arises from incorrectly applying the direct Gillespie algorithm to feedback-controlled systems. This work provides a unified perspective on existing open-loop-control KMC algorithms and also offers a powerful and accurate tool for simulating stochastic systems with feedback control.
♻ ☆ Efficient Interaction-Aware Interval Analysis of Neural Network Feedback Loops
In this paper, we propose a computationally efficient framework for interval reachability of systems with neural network controllers. Our approach leverages inclusion functions for the open-loop system and the neural network controller to embed the closed-loop system into a larger-dimensional embedding system, where a single trajectory over-approximates the original system's behavior under uncertainty. We propose two methods for constructing closed-loop embedding systems, which account for the interactions between the system and the controller in different ways. The interconnection-based approach considers the worst-case evolution of each coordinate separately by substituting the neural network inclusion function into the open-loop inclusion function. The interaction-based approach uses novel Jacobian-based inclusion functions to capture the first-order interactions between the open-loop system and the controller by leveraging state-of-the-art neural network verifiers. Finally, we implement our approach in a Python framework called ReachMM to demonstrate its efficiency and scalability on benchmarks and examples ranging to $200$ state dimensions.
♻ ☆ A Hybrid Quantum Algorithm for Load Flow
The goal of the load flow study is to ensure that electrical power is delivered efficiently and reliably to end-users while maintaining the stability and security of the power system. Newton-Raphson is a numerical method used widely for load flow analysis. One of the most computationally expensive steps in this method is an equation-solving step. We propose to replace this step with HHL, a quantum algorithm for solving linear systems of equations. HHL is exponentially faster, but with caveats. In this study, a hybrid quantum algorithm is proposed for solving load flow. The Newton-Raphson method is used as a benchmark to compare the performance of the hybrid quantum algorithm. Although the simulation of the hybrid quantum algorithm takes much time, these preliminary results are encouraging and point to the potential for the use of quantum algorithms to develop hybrid quantum algorithms for load flow analysis and related problems.
♻ ☆ Tuning a Cascaded Online Feedback Optimization Controller for Provision of Distributed Flexibility
Coordinating a high number of flexibility providing units (e.g. to provide ancillary services for the transmission system) across various grid layers requires new control concepts. A flexibility request at a point of common coupling can be met by utilizing a cascaded control structure based on online feedback optimization. In this paper the influence of the parameterization of the individual controllers on the performance of the hierarchical flexibility provision is studied on a three-level test system. The results show a high interdependency between the choice of control parameters of one controller and the behavior of other controllers as well as a significant impact on the accuracy and speed of flexibility provision. With a careful tuning, a cascaded structure based on online feedback optimization can achieve efficient vertical coordination of flexibility providing units.
♻ ☆ Control-Oriented Deep Space Communications For Unmanned Space Exploration
In unmanned space exploration, the cooperation among space robots requires advanced communication techniques. In this paper, we propose a communication optimization scheme for a specific cooperation system named the "mother-daughter system". In this setup, the mother spacecraft orbits the planet, while daughter probes are distributed across the planetary surface. During each control cycle, the mother spacecraft senses the environment, computes control commands and distributes them to daughter probes for actions. They synergistically form sensing-communication-computing-control ($\mathbf{SC^3}$) loops. Given the indivisibility of the $\mathbf{SC^3}$ loop, we optimize the mother-daughter downlink for closed-loop control. The optimization objective is the linear quadratic regulator (LQR) cost, and the optimization parameters are the block length and transmit power. To solve the nonlinear mixed-integer problem, we first identify the optimal block length and then transform the power allocation problem into a tractable convex problem. We further derive the approximate closed-form solutions for the proposed scheme and two communication-oriented schemes: the max-sum rate scheme and the max-min rate scheme. On this basis, we analyze their power allocation principles. In particular, for time-insensitive control tasks, we find that the proposed scheme demonstrates equivalence to the max-min rate scheme. These findings are verified through simulations.
♻ ☆ Lagrangian-based online safe reinforcement learning for state-constrained systems
This paper proposes a safe reinforcement learning (RL) algorithm that approximately solves the state-constrained optimal control problem for continuous-time uncertain nonlinear systems. We formulate the safe RL problem as the minimization of a Lagrangian that includes the cost functional and a user-defined barrier Lyapunov function (BLF) encoding the state constraints. We show that the analytical solution obtained by the application of Karush-Kuhn-Tucker (KKT) conditions contains a state-dependent expression for the Lagrange multiplier, which is a function of uncertain terms in the system dynamics. We argue that a naive estimation of the Lagrange multiplier may lead to safety constraint violations. To obviate this challenge, we propose an Actor-Critic-Identifier-Lagrangian (ACIL) algorithm that learns optimal control policies from online data without compromising safety. We provide safety and boundedness guarantees with the proposed algorithm and compare its performance with existing offline/online RL methods via a simulation study.
♻ ☆ Non-myopic GOSPA-driven Gaussian Bernoulli Sensor Management
In this paper, we propose an algorithm for non-myopic sensor management for Bernoulli filtering, i.e., when there may be at most one target present in the scene. The algorithm is based on selecting the action that solves a Bellman-type minimisation problem, whose cost function is the mean square generalised optimal sub-pattern assignment (GOSPA) error, over a future time window. We also propose an implementation of the sensor management algorithm based on an upper bound of the mean square GOSPA error and a Gaussian single-target posterior. Finally, we develop a Monte Carlo tree search algorithm to find an approximate optimal action within a given computational budget. The benefits of the proposed approach are demonstrated via simulations.
comment: Paper accepted to IEEE Transactions on Aerospace and Electronic Systems, 25th June 2024
♻ ☆ A Convex Hull Cheapest Insertion Heuristic for the Non-Euclidean TSP
The convex hull cheapest insertion heuristic is known to produce good solutions to the Traveling Salesperson Problem in Euclidean spaces, but it has not been extended to the non-Euclidean case. The proposed adaptation uses multidimensional scaling to first project the points into a Euclidean space, thereby enabling the generation of the convex hull that initializes the algorithm. To evaluate the proposed algorithm, non-Euclidean spaces are created by adding impassable separators to the TSPLIB benchmark data-set, or by using the L1 norm as a metric. This adapted heuristic is demonstrated to outperform the commonly used Nearest Neighbor heuristic and Nearest Insertion heuristic in 89% and 99% of the cases studied, respectively. When the genetic algorithm and ant colony optimization algorithms are provided 1 minute of computation time, the proposed heuristic tour costs are lower than the mean metaheuristic solutions found in 87% and 95% of the instances, respectively.
comment: Manuscript submitted 27 January 2024 to the Operations Research Letters
♻ ☆ Analysis of a Two-degree-of-freedom Beam for Rotational Piezoelectric Energy Harvesting
This study introduces a two-degree-of-freedom piezoelectric energy harvester designed to harness rotational motion as an energy source. The harvester is built using a cut-out beam, which enables the first two resonant frequencies to be closely located in the low-frequency range. A distributed continuous model is developed and validated with experimental results. As the beam undergoes significant displacement due to rotational excitations, the geometric nonlinearity arising from longitudinal displacement is considered in the model to enhance its accuracy. It is observed that as the rotating speed increases, the increased centrifugal force causes the first resonant frequency to rise while the second resonant frequency decreases. The rotation-specific mode veering and the interchange of the first two modes are discussed. This study explores the potential to expand the bandwidth of the harvester using two types of nonlinear external force, namely mechanical stoppers and magnetic force. The results indicate that the proposed harvester can broaden the bandwidth of the first and second resonant frequencies. This research addresses the gap of combining multimodal and nonlinear force methods in rotation-al piezoelectric energy harvesting.
♻ ☆ Real-Time Machine-Learning-Based Optimization Using Input Convex LSTM
Neural network-based optimization and control have gradually supplanted first-principles model-based approaches in energy and manufacturing systems due to their efficient, data-driven process modeling that requires fewer resources. However, their non-convex nature significantly slows down the optimization and control processes, limiting their application in real-time decision-making processes. To address this challenge, we propose a novel Input Convex Long Short-Term Memory (ICLSTM) network to enhance the computational efficiency of neural network-based optimization. Through two case studies employing real-time neural network-based optimization for optimizing energy and chemical systems, we demonstrate the superior performance of ICLSTM-based optimization in terms of runtime. Specifically, in a real-time optimization problem of a real-world solar photovoltaic (PV) energy system at LHT Holdings in Singapore, ICLSTM-based optimization achieved an 8-fold speedup compared to conventional LSTM-based optimization. These results highlight the potential of ICLSTM networks to significantly enhance the efficiency of neural network-based optimization and control in practical applications. Source code is available at https://github.com/killingbear999/ICLSTM.
♻ ☆ ASID: Active Exploration for System Identification in Robotic Manipulation
Model-free control strategies such as reinforcement learning have shown the ability to learn control strategies without requiring an accurate model or simulator of the world. While this is appealing due to the lack of modeling requirements, such methods can be sample inefficient, making them impractical in many real-world domains. On the other hand, model-based control techniques leveraging accurate simulators can circumvent these challenges and use a large amount of cheap simulation data to learn controllers that can effectively transfer to the real world. The challenge with such model-based techniques is the requirement for an extremely accurate simulation, requiring both the specification of appropriate simulation assets and physical parameters. This requires considerable human effort to design for every environment being considered. In this work, we propose a learning system that can leverage a small amount of real-world data to autonomously refine a simulation model and then plan an accurate control strategy that can be deployed in the real world. Our approach critically relies on utilizing an initial (possibly inaccurate) simulator to design effective exploration policies that, when deployed in the real world, collect high-quality data. We demonstrate the efficacy of this paradigm in identifying articulation, mass, and other physical parameters in several challenging robotic manipulation tasks, and illustrate that only a small amount of real-world data can allow for effective sim-to-real transfer. Project website at https://weirdlabuw.github.io/asid
comment: Project website at https://weirdlabuw.github.io/asid
♻ ☆ System Identification for Continuous-time Linear Dynamical Systems
The problem of system identification for the Kalman filter, relying on the expectation-maximization (EM) procedure to learn the underlying parameters of a dynamical system, has largely been studied assuming that observations are sampled at equally-spaced time points. However, in many applications this is a restrictive and unrealistic assumption. This paper addresses system identification for the continuous-discrete filter, with the aim of generalizing learning for the Kalman filter by relying on a solution to a continuous-time It\^o stochastic differential equation (SDE) for the latent state and covariance dynamics. We introduce a novel two-filter, analytical form for the posterior with a Bayesian derivation, which yields analytical updates which do not require the forward-pass to be pre-computed. Using this analytical and efficient computation of the posterior, we provide an EM procedure which estimates the parameters of the SDE, naturally incorporating irregularly sampled measurements. Generalizing the learning of latent linear dynamical systems (LDS) to continuous-time may extend the use of the hybrid Kalman filter to data which is not regularly sampled or has intermittent missing values, and can extend the power of non-linear system identification methods such as switching LDS (SLDS), which rely on EM for the linear discrete-time Kalman filter as a sub-unit for learning locally linearized behavior of a non-linear system. We apply the method by learning the parameters of a latent, multivariate Fokker-Planck SDE representing a toggle-switch genetic circuit using biologically realistic parameters, and compare the efficacy of learning relative to the discrete-time Kalman filter as the step-size irregularity and spectral-radius of the dynamics-matrix increases.
comment: 31 pages, 3 figures. Only light changes and restructuring to previous version made
Multiagent Systems 3
☆ Cooperative Target Capture using Voronoi Region Shaping
This paper discusses a cooperative strategy for capturing a target using multiple pursuers in a planar scenario. Given an initial position distribution of pursuers, the Voronoi Diagram is employed to characterize the target's proximity region. The key idea is to dynamically shape that region using a policy that directs its vertices towards its instantaneous centroid. Analysis of the resulting dynamics deduces the velocity control inputs for the pursuers. As the main result, target's proximity region is shown to shrink exponentially irrespective of its speed and evasion policy. Simulation results demonstrate the characteristics of the proposed method.
☆ Formation Under Communication Constraints: Control Performance Meets Channel Capacity
In wireless communication-based formation control systems, the control performance is significantly impacted by the channel capacity of each communication link between agents. This relationship, however, remains under-investigated in the existing studies. To address this gap, the formation control problem of classical second-order multi-agent systems with bounded process noises was considered taking into account the channel capacity. More specifically, the model of communication links between agents is first established, based on a new concept -- guaranteed communication region, which characterizes all possible locations for successful message decoding in the present of control-system uncertainty. Furthermore, we rigorously prove that, the guaranteed communication region does not unboundedly increase with the transmission time, which indicates an important trade-off between the guaranteed communication region and the data rate. The fundamental limits of data rate for any desired accuracy are also obtained. Finally, the integrated design to achieve the desired formation accuracy is proposed, where an estimation-based controller and transmit power control strategy are developed.
☆ Multi-agent Cooperative Games Using Belief Map Assisted Training
In a multi-agent system, agents share their local observations to gain global situational awareness for decision making and collaboration using a message passing system. When to send a message, how to encode a message, and how to leverage the received messages directly affect the effectiveness of the collaboration among agents. When training a multi-agent cooperative game using reinforcement learning (RL), the message passing system needs to be optimized together with the agent policies. This consequently increases the model's complexity and poses significant challenges to the convergence and performance of learning. To address this issue, we propose the Belief-map Assisted Multi-agent System (BAMS), which leverages a neuro-symbolic belief map to enhance training. The belief map decodes the agent's hidden state to provide a symbolic representation of the agent's understanding of the environment and other agent's status. The simplicity of symbolic representation allows the gathering and comparison of the ground truth information with the belief, which provides an additional channel of feedback for the learning. Compared to the sporadic and delayed feedback coming from the reward in RL, the feedback from the belief map is more consistent and reliable. Agents using BAMS can learn a more effective message passing network to better understand each other, resulting in better performance in a cooperative predator and prey game with varying levels of map complexity and compare it to previous multi-agent message passing models. The simulation results showed that BAMS reduced training epochs by 66\%, and agents who apply the BAMS model completed the game with 34.62\% fewer steps on average.
Artificial Intelligence 140
☆ The Remarkable Robustness of LLMs: Stages of Inference?
We demonstrate and investigate the remarkable robustness of Large Language Models by deleting and swapping adjacent layers. We find that deleting and swapping interventions retain 72-95\% of the original model's prediction accuracy without fine-tuning, whereas models with more layers exhibit more robustness. Based on the results of the layer-wise intervention and further experiments, we hypothesize the existence of four universal stages of inference across eight different models: detokenization, feature engineering, prediction ensembling, and residual sharpening. The first stage integrates local information, lifting raw token representations into higher-level contextual representations. Next is the iterative refinement of task and entity-specific features. Then, the second half of the model begins with a phase transition, where hidden representations align more with the vocabulary space due to specialized model components. Finally, the last layer sharpens the following token distribution by eliminating obsolete features that add noise to the prediction.
☆ Emergence of Hidden Capabilities: Exploring Learning Dynamics in Concept Space
Modern generative models demonstrate impressive capabilities, likely stemming from an ability to identify and manipulate abstract concepts underlying their training data. However, fundamental questions remain: what determines the concepts a model learns, the order in which it learns them, and its ability to manipulate those concepts? To address these questions, we propose analyzing a model's learning dynamics via a framework we call the concept space, where each axis represents an independent concept underlying the data generating process. By characterizing learning dynamics in this space, we identify how the speed at which a concept is learned, and hence the order of concept learning, is controlled by properties of the data we term concept signal. Further, we observe moments of sudden turns in the direction of a model's learning dynamics in concept space. Surprisingly, these points precisely correspond to the emergence of hidden capabilities, i.e., where latent interventions show the model possesses the capability to manipulate a concept, but these capabilities cannot yet be elicited via naive input prompting. While our results focus on synthetically defined toy datasets, we hypothesize a general claim on emergence of hidden capabilities may hold: generative models possess latent capabilities that emerge suddenly and consistently during training, though a model might not exhibit these capabilities under naive input prompting.
comment: Preprint
☆ Fundamental Problems With Model Editing: How Should Rational Belief Revision Work in LLMs?
The model editing problem concerns how language models should learn new facts about the world over time. While empirical research on model editing has drawn widespread attention, the conceptual foundations of model editing remain shaky -- perhaps unsurprisingly, since model editing is essentially belief revision, a storied problem in philosophy that has eluded succinct solutions for decades. Model editing nonetheless demands a solution, since we need to be able to control the knowledge within language models. With this goal in mind, this paper critiques the standard formulation of the model editing problem and proposes a formal testbed for model editing research. We first describe 12 open problems with model editing, based on challenges with (1) defining the problem, (2) developing benchmarks, and (3) assuming LLMs have editable beliefs in the first place. Many of these challenges are extremely difficult to address, e.g. determining far-reaching consequences of edits, labeling probabilistic entailments between facts, and updating beliefs of agent simulators. Next, we introduce a semi-synthetic dataset for model editing based on Wikidata, where we can evaluate edits against labels given by an idealized Bayesian agent. This enables us to say exactly how belief revision in language models falls short of a desirable epistemic standard. We encourage further research exploring settings where such a gold standard can be compared against. Our code is publicly available at: https://github.com/peterbhase/LLM-belief-revision
comment: 23 pages, 4 figures
☆ IndoToxic2024: A Demographically-Enriched Dataset of Hate Speech and Toxicity Types for Indonesian Language
Hate speech poses a significant threat to social harmony. Over the past two years, Indonesia has seen a ten-fold increase in the online hate speech ratio, underscoring the urgent need for effective detection mechanisms. However, progress is hindered by the limited availability of labeled data for Indonesian texts. The condition is even worse for marginalized minorities, such as Shia, LGBTQ, and other ethnic minorities because hate speech is underreported and less understood by detection tools. Furthermore, the lack of accommodation for subjectivity in current datasets compounds this issue. To address this, we introduce IndoToxic2024, a comprehensive Indonesian hate speech and toxicity classification dataset. Comprising 43,692 entries annotated by 19 diverse individuals, the dataset focuses on texts targeting vulnerable groups in Indonesia, specifically during the hottest political event in the country: the presidential election. We establish baselines for seven binary classification tasks, achieving a macro-F1 score of 0.78 with a BERT model (IndoBERTweet) fine-tuned for hate speech classification. Furthermore, we demonstrate how incorporating demographic information can enhance the zero-shot performance of the large language model, gpt-3.5-turbo. However, we also caution that an overemphasis on demographic information can negatively impact the fine-tuned model performance due to data fragmentation.
☆ Efficient World Models with Context-Aware Tokenization ICML 2024
Scaling up deep Reinforcement Learning (RL) methods presents a significant challenge. Following developments in generative modelling, model-based RL positions itself as a strong contender. Recent advances in sequence modelling have led to effective transformer-based world models, albeit at the price of heavy computations due to the long sequences of tokens required to accurately simulate environments. In this work, we propose $\Delta$-IRIS, a new agent with a world model architecture composed of a discrete autoencoder that encodes stochastic deltas between time steps and an autoregressive transformer that predicts future deltas by summarizing the current state of the world with continuous tokens. In the Crafter benchmark, $\Delta$-IRIS sets a new state of the art at multiple frame budgets, while being an order of magnitude faster to train than previous attention-based approaches. We release our code and models at https://github.com/vmicheli/delta-iris.
comment: ICML 2024
☆ Jump Starting Bandits with LLM-Generated Prior Knowledge
We present substantial evidence demonstrating the benefits of integrating Large Language Models (LLMs) with a Contextual Multi-Armed Bandit framework. Contextual bandits have been widely used in recommendation systems to generate personalized suggestions based on user-specific contexts. We show that LLMs, pre-trained on extensive corpora rich in human knowledge and preferences, can simulate human behaviours well enough to jump-start contextual multi-armed bandits to reduce online learning regret. We propose an initialization algorithm for contextual bandits by prompting LLMs to produce a pre-training dataset of approximate human preferences for the bandit. This significantly reduces online learning regret and data-gathering costs for training such models. Our approach is validated empirically through two sets of experiments with different bandit setups: one which utilizes LLMs to serve as an oracle and a real-world experiment utilizing data from a conjoint survey experiment.
LiveBench: A Challenging, Contamination-Free LLM Benchmark
Test set contamination, wherein test data from a benchmark ends up in a newer model's training set, is a well-documented obstacle for fair LLM evaluation and can quickly render benchmarks obsolete. To mitigate this, many recent benchmarks crowdsource new prompts and evaluations from human or LLM judges; however, these can introduce significant biases, and break down when scoring hard questions. In this work, we introduce a new benchmark for LLMs designed to be immune to both test set contamination and the pitfalls of LLM judging and human crowdsourcing. We release LiveBench, the first benchmark that (1) contains frequently-updated questions from recent information sources, (2) scores answers automatically according to objective ground-truth values, and (3) contains a wide variety of challenging tasks, spanning math, coding, reasoning, language, instruction following, and data analysis. To achieve this, LiveBench contains questions that are based on recently-released math competitions, arXiv papers, news articles, and datasets, and it contains harder, contamination-free versions of tasks from previous benchmarks such as Big-Bench Hard, AMPS, and IFEval. We evaluate many prominent closed-source models, as well as dozens of open-source models ranging from 0.5B to 110B in size. LiveBench is difficult, with top models achieving below 65% accuracy. We release all questions, code, and model answers. Questions will be added and updated on a monthly basis, and we will release new tasks and harder versions of tasks over time so that LiveBench can distinguish between the capabilities of LLMs as they improve in the future. We welcome community engagement and collaboration for expanding the benchmark tasks and models.
☆ From Artificial Needles to Real Haystacks: Improving Retrieval Capabilities in LLMs by Finetuning on Synthetic Data
Recent studies have shown that Large Language Models (LLMs) struggle to accurately retrieve information and maintain reasoning capabilities when processing long-context inputs. To address these limitations, we propose a finetuning approach utilizing a carefully designed synthetic dataset comprising numerical key-value retrieval tasks. Our experiments on models like GPT-3.5 Turbo and Mistral 7B demonstrate that finetuning LLMs on this dataset significantly improves LLMs' information retrieval and reasoning capabilities in longer-context settings. We present an analysis of the finetuned models, illustrating the transfer of skills from synthetic to real task evaluations (e.g., $10.5\%$ improvement on $20$ documents MDQA at position $10$ for GPT-3.5 Turbo). We also find that finetuned LLMs' performance on general benchmarks remains almost constant while LLMs finetuned on other baseline long-context augmentation data can encourage hallucination (e.g., on TriviaQA, Mistral 7B finetuned on our synthetic data cause no performance drop while other baseline data can cause a drop that ranges from $2.33\%$ to $6.19\%$). Our study highlights the potential of finetuning on synthetic data for improving the performance of LLMs on longer-context tasks.
☆ HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale
The rapid development of multimodal large language models (MLLMs), such as GPT-4V, has led to significant advancements. However, these models still face challenges in medical multimodal capabilities due to limitations in the quantity and quality of medical vision-text data, stemming from data privacy concerns and high annotation costs. While pioneering approaches utilize PubMed's large-scale, de-identified medical image-text pairs to address these limitations, they still fall short due to inherent data noise. To tackle this, we refined medical image-text pairs from PubMed and employed MLLMs (GPT-4V) in an 'unblinded' capacity to denoise and reformat the data, resulting in the creation of the PubMedVision dataset with 1.3 million medical VQA samples. Our validation demonstrates that: (1) PubMedVision can significantly enhance the medical multimodal capabilities of current MLLMs, showing significant improvement in benchmarks including the MMMU Health & Medicine track; (2) manual checks by medical experts and empirical results validate the superior data quality of our dataset compared to other data construction methods. Using PubMedVision, we train a 34B medical MLLM HuatuoGPT-Vision, which shows superior performance in medical multimodal scenarios among open-source MLLMs.
☆ Commodification of Compute
The rapid advancements in artificial intelligence, big data analytics, and cloud computing have precipitated an unprecedented demand for computational resources. However, the current landscape of computational resource allocation is characterized by significant inefficiencies, including underutilization and price volatility. This paper addresses these challenges by introducing a novel global platform for the commodification of compute hours, termed the Global Compute Exchange (GCX) (Patent Pending). The GCX leverages blockchain technology and smart contracts to create a secure, transparent, and efficient marketplace for buying and selling computational power. The GCX is built in a layered fashion, comprising Market, App, Clearing, Risk Management, Exchange (Offchain), and Blockchain (Onchain) layers, each ensuring a robust and efficient operation. This platform aims to revolutionize the computational resource market by fostering a decentralized, efficient, and transparent ecosystem that ensures equitable access to computing power, stimulates innovation, and supports diverse user needs on a global scale. By transforming compute hours into a tradable commodity, the GCX seeks to optimize resource utilization, stabilize pricing, and democratize access to computational resources. This paper explores the technological infrastructure, market potential, and societal impact of the GCX, positioning it as a pioneering solution poised to drive the next wave of innovation in commodities and compute.
☆ AI Data Readiness Inspector (AIDRIN) for Quantitative Assessment of Data Readiness for AI
"Garbage In Garbage Out" is a universally agreed quote by computer scientists from various domains, including Artificial Intelligence (AI). As data is the fuel for AI, models trained on low-quality, biased data are often ineffective. Computer scientists who use AI invest a considerable amount of time and effort in preparing the data for AI. However, there are no standard methods or frameworks for assessing the "readiness" of data for AI. To provide a quantifiable assessment of the readiness of data for AI processes, we define parameters of AI data readiness and introduce AIDRIN (AI Data Readiness Inspector). AIDRIN is a framework covering a broad range of readiness dimensions available in the literature that aid in evaluating the readiness of data quantitatively and qualitatively. AIDRIN uses metrics in traditional data quality assessment such as completeness, outliers, and duplicates for data evaluation. Furthermore, AIDRIN uses metrics specific to assess data for AI, such as feature importance, feature correlations, class imbalance, fairness, privacy, and FAIR (Findability, Accessibility, Interoperability, and Reusability) principle compliance. AIDRIN provides visualizations and reports to assist data scientists in further investigating the readiness of data. The AIDRIN framework enhances the efficiency of the machine learning pipeline to make informed decisions on data readiness for AI applications.
comment: 12 pages, 9 figures, Accepted to SSDBM 2024
☆ AutoRAG-HP: Automatic Online Hyper-Parameter Tuning for Retrieval-Augmented Generation
Recent advancements in Large Language Models have transformed ML/AI development, necessitating a reevaluation of AutoML principles for the Retrieval-Augmented Generation (RAG) systems. To address the challenges of hyper-parameter optimization and online adaptation in RAG, we propose the AutoRAG-HP framework, which formulates the hyper-parameter tuning as an online multi-armed bandit (MAB) problem and introduces a novel two-level Hierarchical MAB (Hier-MAB) method for efficient exploration of large search spaces. We conduct extensive experiments on tuning hyper-parameters, such as top-k retrieved documents, prompt compression ratio, and embedding methods, using the ALCE-ASQA and Natural Questions datasets. Our evaluation from jointly optimization all three hyper-parameters demonstrate that MAB-based online learning methods can achieve Recall@5 $\approx 0.8$ for scenarios with prominent gradients in search space, using only $\sim20\%$ of the LLM API calls required by the Grid Search approach. Additionally, the proposed Hier-MAB approach outperforms other baselines in more challenging optimization scenarios. The code will be made available at https://aka.ms/autorag.
☆ Application of ASV for Voice Identification after VC and Duration Predictor Improvement in TTS Models
One of the most crucial components in the field of biometric security is the automatic speaker verification system, which is based on the speaker's voice. It is possible to utilise ASVs in isolation or in conjunction with other AI models. In the contemporary era, the quality and quantity of neural networks are increasing exponentially. Concurrently, there is a growing number of systems that aim to manipulate data through the use of voice conversion and text-to-speech models. The field of voice biometrics forgery is aided by a number of challenges, including SSTC, ASVSpoof, and SingFake. This paper presents a system for automatic speaker verification. The primary objective of our model is the extraction of embeddings from the target speaker's audio in order to obtain information about important characteristics of his voice, such as pitch, energy, and the duration of phonemes. This information is used in our multivoice TTS pipeline, which is currently under development. However, this model was employed within the SSTC challenge to verify users whose voice had undergone voice conversion, where it demonstrated an EER of 20.669.
☆ Human-Aware Vision-and-Language Navigation: Bridging Simulation to Reality with Dynamic Human Interactions
Vision-and-Language Navigation (VLN) aims to develop embodied agents that navigate based on human instructions. However, current VLN frameworks often rely on static environments and optimal expert supervision, limiting their real-world applicability. To address this, we introduce Human-Aware Vision-and-Language Navigation (HA-VLN), extending traditional VLN by incorporating dynamic human activities and relaxing key assumptions. We propose the Human-Aware 3D (HA3D) simulator, which combines dynamic human activities with the Matterport3D dataset, and the Human-Aware Room-to-Room (HA-R2R) dataset, extending R2R with human activity descriptions. To tackle HA-VLN challenges, we present the Expert-Supervised Cross-Modal (VLN-CM) and Non-Expert-Supervised Decision Transformer (VLN-DT) agents, utilizing cross-modal fusion and diverse training strategies for effective navigation in dynamic human environments. A comprehensive evaluation, including metrics considering human activities, and systematic analysis of HA-VLN's unique challenges, underscores the need for further research to enhance HA-VLN agents' real-world robustness and adaptability. Ultimately, this work provides benchmarks and insights for future research on embodied AI and Sim2Real transfer, paving the way for more realistic and applicable VLN systems in human-populated environments.
comment: 30 pages, 18 figures, Project Page: https://lpercc.github.io/HA3D_simulator/
☆ Seeing Is Believing: Black-Box Membership Inference Attacks Against Retrieval Augmented Generation
Retrieval-Augmented Generation (RAG) is a state-of-the-art technique that enhances Large Language Models (LLMs) by retrieving relevant knowledge from an external, non-parametric database. This approach aims to mitigate common LLM issues such as hallucinations and outdated knowledge. Although existing research has demonstrated security and privacy vulnerabilities within RAG systems, making them susceptible to attacks like jailbreaks and prompt injections, the security of the RAG system's external databases remains largely underexplored. In this paper, we employ Membership Inference Attacks (MIA) to determine whether a sample is part of the knowledge database of a RAG system, using only black-box API access. Our core hypothesis posits that if a sample is a member, it will exhibit significant similarity to the text generated by the RAG system. To test this, we compute the cosine similarity and the model's perplexity to establish a membership score, thereby building robust features. We then introduce two novel attack strategies: a Threshold-based Attack and a Machine Learning-based Attack, designed to accurately identify membership. Experimental validation of our methods has achieved a ROC AUC of 82%.
☆ Tools Fail: Detecting Silent Errors in Faulty Tools
Tools have become a mainstay of LLMs, allowing them to retrieve knowledge not in their weights, to perform tasks on the web, and even to control robots. However, most ontologies and surveys of tool-use have assumed the core challenge for LLMs is choosing the tool. Instead, we introduce a framework for tools more broadly which guides us to explore a model's ability to detect "silent" tool errors, and reflect on how to plan. This more directly aligns with the increasingly popular use of models as tools. We provide an initial approach to failure recovery with promising results both on a controlled calculator setting and embodied agent planning.
comment: 18 pages, 12 figures
☆ T-FREE: Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings
Tokenizers are crucial for encoding information in Large Language Models, but their development has recently stagnated, and they contain inherent weaknesses. Major limitations include computational overhead, ineffective vocabulary use, and unnecessarily large embedding and head layers. Additionally, their performance is biased towards a reference corpus, leading to reduced effectiveness for underrepresented languages. To remedy these issues, we propose T-FREE, which directly embeds words through sparse activation patterns over character triplets, and does not require a reference corpus. T-FREE inherently exploits morphological similarities and allows for strong compression of embedding layers. In our exhaustive experimental evaluation, we achieve competitive downstream performance with a parameter reduction of more than 85% on these layers. Further, T-FREE shows significant improvements in cross-lingual transfer learning.
☆ Hack Me If You Can: Aggregating AutoEncoders for Countering Persistent Access Threats Within Highly Imbalanced Data
Advanced Persistent Threats (APTs) are sophisticated, targeted cyberattacks designed to gain unauthorized access to systems and remain undetected for extended periods. To evade detection, APT cyberattacks deceive defense layers with breaches and exploits, thereby complicating exposure by traditional anomaly detection-based security methods. The challenge of detecting APTs with machine learning is compounded by the rarity of relevant datasets and the significant imbalance in the data, which makes the detection process highly burdensome. We present AE-APT, a deep learning-based tool for APT detection that features a family of AutoEncoder methods ranging from a basic one to a Transformer-based one. We evaluated our tool on a suite of provenance trace databases produced by the DARPA Transparent Computing program, where APT-like attacks constitute as little as 0.004% of the data. The datasets span multiple operating systems, including Android, Linux, BSD, and Windows, and cover two attack scenarios. The outcomes showed that AE-APT has significantly higher detection rates compared to its competitors, indicating superior performance in detecting and ranking anomalies.
comment: To appear Future Generation Computer Systems
☆ Think Step by Step: Chain-of-Gesture Prompting for Error Detection in Robotic Surgical Videos
Despite significant advancements in robotic systems and surgical data science, ensuring safe and optimal execution in robot-assisted minimally invasive surgery (RMIS) remains a complex challenge. Current surgical error detection methods involve two parts: identifying surgical gestures and then detecting errors within each gesture clip. These methods seldom consider the rich contextual and semantic information inherent in surgical videos, limiting their performance due to reliance on accurate gesture identification. Motivated by the chain-of-thought prompting in natural language processing, this letter presents a novel and real-time end-to-end error detection framework, Chain-of-Thought (COG) prompting, leveraging contextual information from surgical videos. This encompasses two reasoning modules designed to mimic the decision-making processes of expert surgeons. Concretely, we first design a Gestural-Visual Reasoning module, which utilizes transformer and attention architectures for gesture prompting, while the second, a Multi-Scale Temporal Reasoning module, employs a multi-stage temporal convolutional network with both slow and fast paths for temporal information extraction. We extensively validate our method on the public benchmark RMIS dataset JIGSAWS. Our method encapsulates the reasoning processes inherent to surgical activities enabling it to outperform the state-of-the-art by 4.6% in F1 score, 4.6% in Accuracy, and 5.9% in Jaccard index while processing each frame in 6.69 milliseconds on average, demonstrating the great potential of our approach in enhancing the safety and efficacy of RMIS procedures and surgical education. The code will be available.
comment: 8 pages, 4 figures
☆ Estimating Long-term Heterogeneous Dose-response Curve: Generalization Bound Leveraging Optimal Transport Weights
Long-term causal effect estimation is a significant but challenging problem in many applications. Existing methods rely on ideal assumptions to estimate long-term average effects, e.g., no unobserved confounders or a binary treatment,while in numerous real-world applications, these assumptions could be violated and average effects are unable to provide individual-level suggestions.In this paper,we address a more general problem of estimating the long-term heterogeneous dose-response curve (HDRC) while accounting for unobserved confounders. Specifically, to remove unobserved confounding in observational data, we introduce an optimal transport weighting framework to align the observational data to the experimental data with theoretical guarantees. Furthermore,to accurately predict the heterogeneous effects of continuous treatment, we establish a generalization bound on counterfactual prediction error by leveraging the reweighted distribution induced by optimal transport. Finally, we develop an HDRC estimator building upon the above theoretical foundations. Extensive experimental studies conducted on multiple synthetic and semi-synthetic datasets demonstrate the effectiveness of our proposed method.
☆ BISeizuRe: BERT-Inspired Seizure Data Representation to Improve Epilepsy Monitoring
This study presents a novel approach for EEG-based seizure detection leveraging a BERT-based model. The model, BENDR, undergoes a two-phase training process. Initially, it is pre-trained on the extensive Temple University Hospital EEG Corpus (TUEG), a 1.5 TB dataset comprising over 10,000 subjects, to extract common EEG data patterns. Subsequently, the model is fine-tuned on the CHB-MIT Scalp EEG Database, consisting of 664 EEG recordings from 24 pediatric patients, of which 198 contain seizure events. Key contributions include optimizing fine-tuning on the CHB-MIT dataset, where the impact of model architecture, pre-processing, and post-processing techniques are thoroughly examined to enhance sensitivity and reduce false positives per hour (FP/h). We also explored custom training strategies to ascertain the most effective setup. The model undergoes a novel second pre-training phase before subject-specific fine-tuning, enhancing its generalization capabilities. The optimized model demonstrates substantial performance enhancements, achieving as low as 0.23 FP/h, 2.5$\times$ lower than the baseline model, with a lower but still acceptable sensitivity rate, showcasing the effectiveness of applying a BERT-based approach on EEG-based seizure detection.
comment: 4 pages, 2 tables, 2 figures
☆ RAVEN: Multitask Retrieval Augmented Vision-Language Learning
The scaling of large language models to encode all the world's knowledge in model parameters is unsustainable and has exacerbated resource barriers. Retrieval-Augmented Generation (RAG) presents a potential solution, yet its application to vision-language models (VLMs) is under explored. Existing methods focus on models designed for single tasks. Furthermore, they're limited by the need for resource intensive pre training, additional parameter requirements, unaddressed modality prioritization and lack of clear benefit over non-retrieval baselines. This paper introduces RAVEN, a multitask retrieval augmented VLM framework that enhances base VLMs through efficient, task specific fine-tuning. By integrating retrieval augmented samples without the need for additional retrieval-specific parameters, we show that the model acquires retrieval properties that are effective across multiple tasks. Our results and extensive ablations across retrieved modalities for the image captioning and VQA tasks indicate significant performance improvements compared to non retrieved baselines +1 CIDEr on MSCOCO, +4 CIDEr on NoCaps and nearly a +3\% accuracy on specific VQA question types. This underscores the efficacy of applying RAG approaches to VLMs, marking a stride toward more efficient and accessible multimodal learning.
☆ BackMix: Mitigating Shortcut Learning in Echocardiography with Minimal Supervision MICCAI 2024
Neural networks can learn spurious correlations that lead to the correct prediction in a validation set, but generalise poorly because the predictions are right for the wrong reason. This undesired learning of naive shortcuts (Clever Hans effect) can happen for example in echocardiogram view classification when background cues (e.g. metadata) are biased towards a class and the model learns to focus on those background features instead of on the image content. We propose a simple, yet effective random background augmentation method called BackMix, which samples random backgrounds from other examples in the training set. By enforcing the background to be uncorrelated with the outcome, the model learns to focus on the data within the ultrasound sector and becomes invariant to the regions outside this. We extend our method in a semi-supervised setting, finding that the positive effects of BackMix are maintained with as few as 5% of segmentation labels. A loss weighting mechanism, wBackMix, is also proposed to increase the contribution of the augmented examples. We validate our method on both in-distribution and out-of-distribution datasets, demonstrating significant improvements in classification accuracy, region focus and generalisability. Our source code is available at: https://github.com/kitbransby/BackMix
comment: Accepted at MICCAI 2024 (Pre-print)
☆ YZS-model: A Predictive Model for Organic Drug Solubility Based on Graph Convolutional Networks and Transformer-Attention
The accurate prediction of drug molecule solubility is essential for determining their therapeutic effectiveness and safety, influencing the drug's ADME processes. Traditional solubility prediction techniques often fail to capture the complex nature of molecular tructures, leading to notable deviations between predictions and actual results. For example, the Discussion on Advanced Drug-Like Compound Structures. Lusci highlighted issues in capturing crucial cyclic structural information in molecules with ring structures. To overcome this issue, our research introduces a novel deep learning framework combining attention-based transformers, Long Short-Term Memory (LSTM) networks, and Graph Convolutional Networks (GCN), aimed at enhancing the precision of solubility predictions. Utilizing a training set of 9,943 compounds and testing on an anticancer compound dataset, our method achieved a correlation coefficient ($R^2$) of 0.55 and a Root Mean Square Error (RMSE) of 0.59, which outperforms the benchmark models' scores of 0.52 ($R^2$) and 0.61 (RMSE). Importantly, in an additional independent test, our model significantly outperformed the baseline with an RMSE of 1.05 compared to 1.28, a relative accuracy improvement of 45.9%. This research not only demonstrates the vast potential of deep learning for improving solubility prediction accuracy but also offers novel insights for drug design and selection in the future. Continued efforts will be directed towards optimizing the model architecture and extending its application to better support the drug development process, underscoring the pivotal role of deep learning in drug discovery.
comment: 18 pages, 12 figures, 6 tables
☆ DEX-TTS: Diffusion-based EXpressive Text-to-Speech with Style Modeling on Time Variability
Expressive Text-to-Speech (TTS) using reference speech has been studied extensively to synthesize natural speech, but there are limitations to obtaining well-represented styles and improving model generalization ability. In this study, we present Diffusion-based EXpressive TTS (DEX-TTS), an acoustic model designed for reference-based speech synthesis with enhanced style representations. Based on a general diffusion TTS framework, DEX-TTS includes encoders and adapters to handle styles extracted from reference speech. Key innovations contain the differentiation of styles into time-invariant and time-variant categories for effective style extraction, as well as the design of encoders and adapters with high generalization ability. In addition, we introduce overlapping patchify and convolution-frequency patch embedding strategies to improve DiT-based diffusion networks for TTS. DEX-TTS yields outstanding performance in terms of objective and subjective evaluation in English multi-speaker and emotional multi-speaker datasets, without relying on pre-training strategies. Lastly, the comparison results for the general TTS on a single-speaker dataset verify the effectiveness of our enhanced diffusion backbone. Demos are available here.
comment: Preprint
☆ Super-resolution imaging using super-oscillatory diffractive neural networks
Optical super-oscillation enables far-field super-resolution imaging beyond diffraction limits. However, the existing super-oscillatory lens for the spatial super-resolution imaging system still confronts critical limitations in performance due to the lack of a more advanced design method and the limited design degree of freedom. Here, we propose an optical super-oscillatory diffractive neural network, i.e., SODNN, that can achieve super-resolved spatial resolution for imaging beyond the diffraction limit with superior performance over existing methods. SODNN is constructed by utilizing diffractive layers to implement optical interconnections and imaging samples or biological sensors to implement nonlinearity, which modulates the incident optical field to create optical super-oscillation effects in 3D space and generate the super-resolved focal spots. By optimizing diffractive layers with 3D optical field constraints under an incident wavelength size of $\lambda$, we achieved a super-oscillatory spot with a full width at half maximum of 0.407$\lambda$ in the far field distance over 400$\lambda$ without side-lobes over the field of view, having a long depth of field over 10$\lambda$. Furthermore, the SODNN implements a multi-wavelength and multi-focus spot array that effectively avoids chromatic aberrations. Our research work will inspire the development of intelligent optical instruments to facilitate the applications of imaging, sensing, perception, etc.
comment: 18 pages, 7 figures, 1 table
☆ Towards Learning Abductive Reasoning using VSA Distributed Representations
We introduce the Abductive Rule Learner with Context-awareness (ARLC), a model that solves abstract reasoning tasks based on Learn-VRF. ARLC features a novel and more broadly applicable training objective for abductive reasoning, resulting in better interpretability and higher accuracy when solving Raven's progressive matrices (RPM). ARLC allows both programming domain knowledge and learning the rules underlying a data distribution. We evaluate ARLC on the I-RAVEN dataset, showcasing state-of-the-art accuracy across both in-distribution and out-of-distribution (unseen attribute-rule pairs) tests. ARLC surpasses neuro-symbolic and connectionist baselines, including large language models, despite having orders of magnitude fewer parameters. We show ARLC's robustness to post-programming training by incrementally learning from examples on top of programmed knowledge, which only improves its performance and does not result in catastrophic forgetting of the programmed solution. We validate ARLC's seamless transfer learning from a 2x2 RPM constellation to unseen constellations. Our code is available at https://github.com/IBM/abductive-rule-learner-with-context-awareness.
comment: Accepted at the 18th International Conference on Neural-Symbolic Learning and Reasoning (NeSy) 2024
☆ CHEW: A Dataset of CHanging Events in Wikipedia
We introduce CHEW, a novel dataset of changing events in Wikipedia expressed in naturally occurring text. We use CHEW for probing LLMs for their timeline understanding of Wikipedia entities and events in generative and classification experiments. Our results suggest that LLMs, despite having temporal information available, struggle to construct accurate timelines. We further show the usefulness of CHEW-derived embeddings for identifying meaning shift.
comment: Short Paper
☆ Computational Life: How Well-formed, Self-replicating Programs Emerge from Simple Interaction
The fields of Origin of Life and Artificial Life both question what life is and how it emerges from a distinct set of "pre-life" dynamics. One common feature of most substrates where life emerges is a marked shift in dynamics when self-replication appears. While there are some hypotheses regarding how self-replicators arose in nature, we know very little about the general dynamics, computational principles, and necessary conditions for self-replicators to emerge. This is especially true on "computational substrates" where interactions involve logical, mathematical, or programming rules. In this paper we take a step towards understanding how self-replicators arise by studying several computational substrates based on various simple programming languages and machine instruction sets. We show that when random, non self-replicating programs are placed in an environment lacking any explicit fitness landscape, self-replicators tend to arise. We demonstrate how this occurs due to random interactions and self-modification, and can happen with and without background random mutations. We also show how increasingly complex dynamics continue to emerge following the rise of self-replicators. Finally, we show a counterexample of a minimalistic programming language where self-replicators are possible, but so far have not been observed to arise.
comment: 19 pages
☆ Statements: Universal Information Extraction from Tables with Large Language Models for ESG KPIs ACL 2024
Environment, Social, and Governance (ESG) KPIs assess an organization's performance on issues such as climate change, greenhouse gas emissions, water consumption, waste management, human rights, diversity, and policies. ESG reports convey this valuable quantitative information through tables. Unfortunately, extracting this information is difficult due to high variability in the table structure as well as content. We propose Statements, a novel domain agnostic data structure for extracting quantitative facts and related information. We propose translating tables to statements as a new supervised deep-learning universal information extraction task. We introduce SemTabNet - a dataset of over 100K annotated tables. Investigating a family of T5-based Statement Extraction Models, our best model generates statements which are 82% similar to the ground-truth (compared to baseline of 21%). We demonstrate the advantages of statements by applying our model to over 2700 tables from ESG reports. The homogeneous nature of statements permits exploratory data analysis on expansive information found in large collections of ESG reports.
comment: Accepted at the NLP4Climate workshop in the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)
☆ Dimensions underlying the representational alignment of deep neural networks with humans
Determining the similarities and differences between humans and artificial intelligence is an important goal both in machine learning and cognitive neuroscience. However, similarities in representations only inform us about the degree of alignment, not the factors that determine it. Drawing upon recent developments in cognitive science, we propose a generic framework for yielding comparable representations in humans and deep neural networks (DNN). Applying this framework to humans and a DNN model of natural images revealed a low-dimensional DNN embedding of both visual and semantic dimensions. In contrast to humans, DNNs exhibited a clear dominance of visual over semantic features, indicating divergent strategies for representing images. While in-silico experiments showed seemingly-consistent interpretability of DNN dimensions, a direct comparison between human and DNN representations revealed substantial differences in how they process images. By making representations directly comparable, our results reveal important challenges for representational alignment, offering a means for improving their comparability.
☆ EmPO: Theory-Driven Dataset Construction for Empathetic Response Generation through Preference Optimization ACL
Empathetic response generation is a desirable aspect of conversational agents, crucial for facilitating engaging and emotionally intelligent multi-turn conversations between humans and machines. Leveraging large language models for this task has shown promising results, yet challenges persist in ensuring both the empathetic quality of the responses and retention of the generalization performance of the models. In this paper, we propose a novel approach where we construct theory-driven preference datasets and use them to align LLMs with preference optimization algorithms to address these challenges. To measure empathetic response generation, we employ the EmpatheticDialogues dataset, assessing empathy with the diff-EPITOME and BERTscore metrics, and evaluate the generalization performance on the MMLU benchmark. We make all datasets, source code, and models publicly available.
comment: v01, 4 pages short paper, ACL style
☆ Segment Anything Model for automated image data annotation: empirical studies using text prompts from Grounding DINO
Grounding DINO and the Segment Anything Model (SAM) have achieved impressive performance in zero-shot object detection and image segmentation, respectively. Together, they have a great potential in revolutionizing zero-shot semantic segmentation or data annotation. Yet, in specialized domains like medical image segmentation, objects of interest (e.g., organs, tissues, and tumors) may not fall in existing class names. To address this problem, the referring expression comprehension (REC) ability of Grounding DINO is leveraged to detect arbitrary targets by their language descriptions. However, recent studies have highlighted severe limitation of the REC framework in this application setting owing to its tendency to make false positive predictions when the target is absent in the given image. And, while this bottleneck is central to the prospect of open-set semantic segmentation, it is still largely unknown how much improvement can be achieved by studying the prediction errors. To this end, we perform empirical studies on eight publicly available datasets and reveal that these errors consistently follow a predictable pattern and can, thus, be mitigated by a simple strategy. Specifically, we show that these false positive detections with appreciable confidence scores generally occupy large image areas and can usually be filtered by their relative sizes. More importantly, we expect these observations to inspire future research in improving REC-based detection and automated segmentation. Using this technique, we evaluate the performance of SAM on multiple datasets from various specialized domains and report significant improvement in segmentation performance and annotation time savings over manual approaches.
☆ A look under the hood of the Interactive Deep Learning Enterprise (No-IDLE)
This DFKI technical report presents the anatomy of the No-IDLE prototype system (funded by the German Federal Ministry of Education and Research) that provides not only basic and fundamental research in interactive machine learning, but also reveals deeper insights into users' behaviours, needs, and goals. Machine learning and deep learning should become accessible to millions of end users. No-IDLE's goals and scienfific challenges centre around the desire to increase the reach of interactive deep learning solutions for non-experts in machine learning. One of the key innovations described in this technical report is a methodology for interactive machine learning combined with multimodal interaction which will become central when we start interacting with semi-intelligent machines in the upcoming area of neural networks and large language models.
comment: DFKI Technical Report
☆ FedMap: Iterative Magnitude-Based Pruning for Communication-Efficient Federated Learning
Federated Learning (FL) is a distributed machine learning approach that enables training on decentralized data while preserving privacy. However, FL systems often involve resource-constrained client devices with limited computational power, memory, storage, and bandwidth. This paper introduces FedMap, a novel method that aims to enhance the communication efficiency of FL deployments by collaboratively learning an increasingly sparse global model through iterative, unstructured pruning. Importantly, FedMap trains a global model from scratch, unlike other methods reported in the literature, making it ideal for privacy-critical use cases such as in the medical and finance domains, where suitable pre-training data is often limited. FedMap adapts iterative magnitude-based pruning to the FL setting, ensuring all clients prune and refine the same subset of the global model parameters, therefore gradually reducing the global model size and communication overhead. The iterative nature of FedMap, forming subsequent models as subsets of predecessors, avoids parameter reactivation issues seen in prior work, resulting in stable performance. In this paper we provide an extensive evaluation of FedMap across diverse settings, datasets, model architectures, and hyperparameters, assessing performance in both IID and non-IID environments. Comparative analysis against the baseline approach demonstrates FedMap's ability to achieve more stable client model performance. For IID scenarios, FedMap achieves over $90$\% pruning without significant performance degradation. In non-IID settings, it achieves at least $~80$\% pruning while maintaining accuracy. FedMap offers a promising solution to alleviate communication bottlenecks in FL systems while retaining model accuracy.
comment: Submitted to IEEE Transactions on Neural Networks and Learning Systems
☆ Accuracy on the wrong line: On the pitfalls of noisy data for out-of-distribution generalisation
"Accuracy-on-the-line" is a widely observed phenomenon in machine learning, where a model's accuracy on in-distribution (ID) and out-of-distribution (OOD) data is positively correlated across different hyperparameters and data configurations. But when does this useful relationship break down? In this work, we explore its robustness. The key observation is that noisy data and the presence of nuisance features can be sufficient to shatter the Accuracy-on-the-line phenomenon. In these cases, ID and OOD accuracy can become negatively correlated, leading to "Accuracy-on-the-wrong-line". This phenomenon can also occur in the presence of spurious (shortcut) features, which tend to overshadow the more complex signal (core, non-spurious) features, resulting in a large nuisance feature space. Moreover, scaling to larger datasets does not mitigate this undesirable behavior and may even exacerbate it. We formally prove a lower bound on Out-of-distribution (OOD) error in a linear classification model, characterizing the conditions on the noise and nuisance features for a large OOD error. We finally demonstrate this phenomenon across both synthetic and real datasets with noisy data and nuisance features.
☆ BiCo-Fusion: Bidirectional Complementary LiDAR-Camera Fusion for Semantic- and Spatial-Aware 3D Object Detection
3D object detection is an important task that has been widely applied in autonomous driving. Recently, fusing multi-modal inputs, i.e., LiDAR and camera data, to perform this task has become a new trend. Existing methods, however, either ignore the sparsity of Lidar features or fail to preserve the original spatial structure of LiDAR and the semantic density of camera features simultaneously due to the modality gap. To address issues, this letter proposes a novel bidirectional complementary Lidar-camera fusion framework, called BiCo-Fusion that can achieve robust semantic- and spatial-aware 3D object detection. The key insight is to mutually fuse the multi-modal features to enhance the semantics of LiDAR features and the spatial awareness of the camera features and adaptatively select features from both modalities to build a unified 3D representation. Specifically, we introduce Pre-Fusion consisting of a Voxel Enhancement Module (VEM) to enhance the semantics of voxel features from 2D camera features and Image Enhancement Module (IEM) to enhance the spatial characteristics of camera features from 3D voxel features. Both VEM and IEM are bidirectionally updated to effectively reduce the modality gap. We then introduce Unified Fusion to adaptively weight to select features from the enchanted Lidar and camera features to build a unified 3D representation. Extensive experiments demonstrate the superiority of our BiCo-Fusion against the prior arts. Project page: https://t-ys.github.io/BiCo-Fusion/.
comment: 8 pages, 5 figures
☆ CMRxRecon2024: A Multi-Modality, Multi-View K-Space Dataset Boosting Universal Machine Learning for Accelerated Cardiac MRI
Cardiac magnetic resonance imaging (MRI) has emerged as a clinically gold-standard technique for diagnosing cardiac diseases, thanks to its ability to provide diverse information with multiple modalities and anatomical views. Accelerated cardiac MRI is highly expected to achieve time-efficient and patient-friendly imaging, and then advanced image reconstruction approaches are required to recover high-quality, clinically interpretable images from undersampled measurements. However, the lack of publicly available cardiac MRI k-space dataset in terms of both quantity and diversity has severely hindered substantial technological progress, particularly for data-driven artificial intelligence. Here, we provide a standardized, diverse, and high-quality CMRxRecon2024 dataset to facilitate the technical development, fair evaluation, and clinical transfer of cardiac MRI reconstruction approaches, towards promoting the universal frameworks that enable fast and robust reconstructions across different cardiac MRI protocols in clinical practice. To the best of our knowledge, the CMRxRecon2024 dataset is the largest and most diverse publicly available cardiac k-space dataset. It is acquired from 330 healthy volunteers, covering commonly used modalities, anatomical views, and acquisition trajectories in clinical cardiac MRI workflows. Besides, an open platform with tutorials, benchmarks, and data processing tools is provided to facilitate data usage, advanced method development, and fair performance evaluation.
comment: 19 pages, 3 figures, 2 tables
☆ Lithium-Ion Battery System Health Monitoring and Fault Analysis from Field Data Using Gaussian Processes
Health monitoring, fault analysis, and detection are critical for the safe and sustainable operation of battery systems. We apply Gaussian process resistance models on lithium iron phosphate battery field data to effectively separate the time-dependent and operating point-dependent resistance. The data set contains 29 battery systems returned to the manufacturer for warranty, each with eight cells in series, totaling 232 cells and 131 million data rows. We develop probabilistic fault detection rules using recursive spatiotemporal Gaussian processes. These processes allow the quick processing of over a million data points, enabling advanced online monitoring and furthering the understanding of battery pack failure in the field. The analysis underlines that often, only a single cell shows abnormal behavior or a knee point, consistent with weakest-link failure for cells connected in series, amplified by local resistive heating. The results further the understanding of how batteries degrade and fail in the field and demonstrate the potential of efficient online monitoring based on data. We open-source the code and publish the large data set upon completion of the review of this article.
☆ FedMLP: Federated Multi-Label Medical Image Classification under Task Heterogeneity MICCAI 2024
Cross-silo federated learning (FL) enables decentralized organizations to collaboratively train models while preserving data privacy and has made significant progress in medical image classification. One common assumption is task homogeneity where each client has access to all classes during training. However, in clinical practice, given a multi-label classification task, constrained by the level of medical knowledge and the prevalence of diseases, each institution may diagnose only partial categories, resulting in task heterogeneity. How to pursue effective multi-label medical image classification under task heterogeneity is under-explored. In this paper, we first formulate such a realistic label missing setting in the multi-label FL domain and propose a two-stage method FedMLP to combat class missing from two aspects: pseudo label tagging and global knowledge learning. The former utilizes a warmed-up model to generate class prototypes and select samples with high confidence to supplement missing labels, while the latter uses a global model as a teacher for consistency regularization to prevent forgetting missing class knowledge. Experiments on two publicly-available medical datasets validate the superiority of FedMLP against the state-of-the-art both federated semi-supervised and noisy label learning approaches under task heterogeneity. Code is available at https://github.com/szbonaldo/FedMLP.
comment: Early accepted by MICCAI 2024
☆ Semi-supervised Concept Bottleneck Models
Concept Bottleneck Models (CBMs) have garnered increasing attention due to their ability to provide concept-based explanations for black-box deep learning models while achieving high final prediction accuracy using human-like concepts. However, the training of current CBMs heavily relies on the accuracy and richness of annotated concepts in the dataset. These concept labels are typically provided by experts, which can be costly and require significant resources and effort. Additionally, concept saliency maps frequently misalign with input saliency maps, causing concept predictions to correspond to irrelevant input features - an issue related to annotation alignment. To address these limitations, we propose a new framework called SSCBM (Semi-supervised Concept Bottleneck Model). Our SSCBM is suitable for practical situations where annotated data is scarce. By leveraging joint training on both labeled and unlabeled data and aligning the unlabeled data at the concept level, we effectively solve these issues. We proposed a strategy to generate pseudo labels and an alignment loss. Experiments demonstrate that our SSCBM is both effective and efficient. With only 20% labeled data, we achieved 93.19% (96.39% in a fully supervised setting) concept accuracy and 75.51% (79.82% in a fully supervised setting) prediction accuracy.
comment: 17 pages
☆ Alignment For Performance Improvement in Conversation Bots
This paper shows that alignment methods can achieve superior adherence to guardrails compared to instruction fine-tuning alone in conversational agents, also known as bots, within predefined guidelines or 'guardrails'. It examines traditional training approaches such as instruction fine-tuning and the recent advancements in direct alignment methods like Identity Preference Optimization (IPO), and Kahneman-Tversky Optimization (KTO). The effectiveness of alignment techniques both pre and post-instruction tuning is highlighted, illustrating their potential to optimize conversational bots in domains that require strict adherence to specified rules, such as customer care.
☆ Investigating and Defending Shortcut Learning in Personalized Diffusion Models
Personalized diffusion models have gained popularity for adapting pre-trained text-to-image models to generate images of specific topics with only a few images. However, recent studies find that these models are vulnerable to minor adversarial perturbation, and the fine-tuning performance is largely degraded on corrupted datasets. Such characteristics are further exploited to craft protective perturbation on sensitive images like portraits that prevent unauthorized generation. In response, diffusion-based purification methods have been proposed to remove these perturbations and retain generation performance. However, existing works lack detailed analysis of the fundamental shortcut learning vulnerability of personalized diffusion models and also turn to over-purifying the images cause information loss. In this paper, we take a closer look at the fine-tuning process of personalized diffusion models through the lens of shortcut learning and propose a hypothesis that could explain the underlying manipulation mechanisms of existing perturbation methods. Specifically, we find that the perturbed images are greatly shifted from their original paired prompt in the CLIP-based latent space. As a result, training with this mismatched image-prompt pair creates a construction that causes the models to dump their out-of-distribution noisy patterns to the identifier, thus causing serious performance degradation. Based on this observation, we propose a systematic approach to retain the training performance with purification that realigns the latent image and its semantic meaning and also introduces contrastive learning with a negative token to decouple the learning of wanted clean identity and the unwanted noisy pattern, that shows strong potential capacity against further adaptive perturbation.
comment: Preprint
☆ Evaluating AI Group Fairness: a Fuzzy Logic Perspective
Artificial intelligence systems often address fairness concerns by evaluating and mitigating measures of group discrimination, for example that indicate biases against certain genders or races. However, what constitutes group fairness depends on who is asked and the social context, whereas definitions are often relaxed to accept small deviations from the statistical constraints they set out to impose. Here we decouple definitions of group fairness both from the context and from relaxation-related uncertainty by expressing them in the axiomatic system of Basic fuzzy Logic (BL) with loosely understood predicates, like encountering group members. We then evaluate the definitions in subclasses of BL, such as Product or Lukasiewicz logics. Evaluation produces continuous instead of binary truth values by choosing the logic subclass and truth values for predicates that reflect uncertain context-specific beliefs, such as stakeholder opinions gathered through questionnaires. Internally, it follows logic-specific rules to compute the truth values of definitions. We show that commonly held propositions standardize the resulting mathematical formulas and we transcribe logic and truth value choices to layperson terms, so that anyone can answer them. We also use our framework to study several literature definitions of algorithmic fairness, for which we rationalize previous expedient practices that are non-probabilistic and show how to re-interpret their formulas and parameters in new contexts.
comment: preprint, 32 pages, 7 figures, 2 theorems, 6 appendices
☆ Federated Graph Semantic and Structural Learning
Federated graph learning collaboratively learns a global graph neural network with distributed graphs, where the non-independent and identically distributed property is one of the major challenges. Most relative arts focus on traditional distributed tasks like images and voices, incapable of graph structures. This paper firstly reveals that local client distortion is brought by both node-level semantics and graph-level structure. First, for node-level semantics, we find that contrasting nodes from distinct classes is beneficial to provide a well-performing discrimination. We pull the local node towards the global node of the same class and push it away from the global node of different classes. Second, we postulate that a well-structural graph neural network possesses similarity for neighbors due to the inherent adjacency relationships. However, aligning each node with adjacent nodes hinders discrimination due to the potential class inconsistency. We transform the adjacency relationships into the similarity distribution and leverage the global model to distill the relation knowledge into the local model, which preserves the structural information and discriminability of the local model. Empirical results on three graph datasets manifest the superiority of the proposed method over its counterparts.
☆ Reasoning About Action and Change
The purpose of this book is to provide an overview of AI research, ranging from basic work to interfaces and applications, with as much emphasis on results as on current issues. It is aimed at an audience of master students and Ph.D. students, and can be of interest as well for researchers and engineers who want to know more about AI. The book is split into three volumes.
☆ Learning Pareto Set for Multi-Objective Continuous Robot Control
For a control problem with multiple conflicting objectives, there exists a set of Pareto-optimal policies called the Pareto set instead of a single optimal policy. When a multi-objective control problem is continuous and complex, traditional multi-objective reinforcement learning (MORL) algorithms search for many Pareto-optimal deep policies to approximate the Pareto set, which is quite resource-consuming. In this paper, we propose a simple and resource-efficient MORL algorithm that learns a continuous representation of the Pareto set in a high-dimensional policy parameter space using a single hypernet. The learned hypernet can directly generate various well-trained policy networks for different user preferences. We compare our method with two state-of-the-art MORL algorithms on seven multi-objective continuous robot control problems. Experimental results show that our method achieves the best overall performance with the least training parameters. An interesting observation is that the Pareto set is well approximated by a curved line or surface in a high-dimensional parameter space. This observation will provide insight for researchers to design new MORL algorithms.
☆ Time Matters: Scaling Laws for Any Budget
A primary cost driver for training large models is wall-clock training time. We show that popular time estimates based on FLOPs are poor estimates, and construct a more accurate proxy based on memory copies. We show that with some simple accounting, we can estimate the training speed of a transformer model from its hyperparameters. Combined with a scaling law curve like Chinchilla, this lets us estimate the final loss of the model. We fit our estimate to real data with a linear regression, and apply the result to rewrite Chinchilla in terms of a model's estimated training time as opposed to the amount of training data. This gives an expression for the loss in terms of the model's hyperparameters alone. We show that this expression is accurate across a wide range of model hyperparameter values, enabling us to analytically make architectural decisions and train models more efficiently.
☆ TrustUQA: A Trustful Framework for Unified Structured Data Question Answering
Natural language question answering (QA) over structured data sources such as tables and knowledge graphs (KGs) have been widely investigated, for example with Large Language Models (LLMs). The main solutions include question to formal query parsing and retrieval-based answer generation. However, current methods of the former often suffer from weak generalization, failing to dealing with multiple sources simultaneously, while the later is limited in trustfulness. In this paper, we propose UnifiedTQA, a trustful QA framework that can simultaneously support multiple types of structured data in a unified way. To this end, it adopts an LLM-friendly and unified knowledge representation method called Condition Graph (CG), and uses an LLM and demonstration-based two-level method for CG querying. For enhancement, it is also equipped with dynamic demonstration retrieval. We have evaluated UnifiedTQA with 5 benchmarks covering 3 types of structured data. It outperforms 2 existing unified structured data QA methods and in comparison with the baselines that are specific to a data type, it achieves state-of-the-art on 2 of them. Further more, we demonstrates potential of our method for more general QA tasks, QA over mixed structured data and QA across structured data.
☆ The Rise of Artificial Intelligence in Educational Measurement: Opportunities and Ethical Challenges
The integration of artificial intelligence (AI) in educational measurement has revolutionized assessment methods, enabling automated scoring, rapid content analysis, and personalized feedback through machine learning and natural language processing. These advancements provide timely, consistent feedback and valuable insights into student performance, thereby enhancing the assessment experience. However, the deployment of AI in education also raises significant ethical concerns regarding validity, reliability, transparency, fairness, and equity. Issues such as algorithmic bias and the opacity of AI decision-making processes pose risks of perpetuating inequalities and affecting assessment outcomes. Responding to these concerns, various stakeholders, including educators, policymakers, and organizations, have developed guidelines to ensure ethical AI use in education. The National Council of Measurement in Education's Special Interest Group on AI in Measurement and Education (AIME) also focuses on establishing ethical standards and advancing research in this area. In this paper, a diverse group of AIME members examines the ethical implications of AI-powered tools in educational measurement, explores significant challenges such as automation bias and environmental impact, and proposes solutions to ensure AI's responsible and effective use in education.
comment: 59 pages, 3 figures, a joint work of the Special Interest Group on Artificial Intelligence in Measurement and Education (AIME) from the National Council of Measurement in Education (NCME)
☆ Autonomous Control of a Novel Closed Chain Five Bar Active Suspension via Deep Reinforcement Learning
Planetary exploration requires traversal in environments with rugged terrains. In addition, Mars rovers and other planetary exploration robots often carry sensitive scientific experiments and components onboard, which must be protected from mechanical harm. This paper deals with an active suspension system focused on chassis stabilisation and an efficient traversal method while encountering unavoidable obstacles. Soft Actor-Critic (SAC) was applied along with Proportional Integral Derivative (PID) control to stabilise the chassis and traverse large obstacles at low speeds. The model uses the rover's distance from surrounding obstacles, the height of the obstacle, and the chassis' orientation to actuate the control links of the suspension accurately. Simulations carried out in the Gazebo environment are used to validate the proposed active system.
comment: 15 pages, 11 figures
☆ 360 in the Wild: Dataset for Depth Prediction and View Synthesis
The large abundance of perspective camera datasets facilitated the emergence of novel learning-based strategies for various tasks, such as camera localization, single image depth estimation, or view synthesis. However, panoramic or omnidirectional image datasets, including essential information, such as pose and depth, are mostly made with synthetic scenes. In this work, we introduce a large scale 360$^{\circ}$ videos dataset in the wild. This dataset has been carefully scraped from the Internet and has been captured from various locations worldwide. Hence, this dataset exhibits very diversified environments (e.g., indoor and outdoor) and contexts (e.g., with and without moving objects). Each of the 25K images constituting our dataset is provided with its respective camera's pose and depth map. We illustrate the relevance of our dataset for two main tasks, namely, single image depth estimation and view synthesis.
☆ Sequential three-way group decision-making for double hierarchy hesitant fuzzy linguistic term set
Group decision-making (GDM) characterized by complexity and uncertainty is an essential part of various life scenarios. Most existing researches lack tools to fuse information quickly and interpret decision results for partially formed decisions. This limitation is particularly noticeable when there is a need to improve the efficiency of GDM. To address this issue, a novel multi-level sequential three-way decision for group decision-making (S3W-GDM) method is constructed from the perspective of granular computing. This method simultaneously considers the vagueness, hesitation, and variation of GDM problems under double hierarchy hesitant fuzzy linguistic term sets (DHHFLTS) environment. First, for fusing information efficiently, a novel multi-level expert information fusion method is proposed, and the concepts of expert decision table and the extraction/aggregation of decision-leveled information based on the multi-level granularity are defined. Second, the neighborhood theory, outranking relation and regret theory (RT) are utilized to redesign the calculations of conditional probability and relative loss function. Then, the granular structure of DHHFLTS based on the sequential three-way decision (S3WD) is defined to improve the decision-making efficiency, and the decision-making strategy and interpretation of each decision-level are proposed. Furthermore, the algorithm of S3W-GDM is given. Finally, an illustrative example of diagnosis is presented, and the comparative and sensitivity analysis with other methods are performed to verify the efficiency and rationality of the proposed method.
☆ Two-Pronged Human Evaluation of ChatGPT Self-Correction in Radiology Report Simplification
Radiology reports are highly technical documents aimed primarily at doctor-doctor communication. There has been an increasing interest in sharing those reports with patients, necessitating providing them patient-friendly simplifications of the original reports. This study explores the suitability of large language models in automatically generating those simplifications. We examine the usefulness of chain-of-thought and self-correction prompting mechanisms in this domain. We also propose a new evaluation protocol that employs radiologists and laypeople, where radiologists verify the factual correctness of simplifications, and laypeople assess simplicity and comprehension. Our experimental results demonstrate the effectiveness of self-correction prompting in producing high-quality simplifications. Our findings illuminate the preferences of radiologists and laypeople regarding text simplification, informing future research on this topic.
☆ FFN: a Fine-grained Chinese-English Financial Domain Parallel Corpus
Large Language Models (LLMs) have stunningly advanced the field of machine translation, though their effectiveness within the financial domain remains largely underexplored. To probe this issue, we constructed a fine-grained Chinese-English parallel corpus of financial news called FFN. We acquired financial news articles spanning between January 1st, 2014, to December 31, 2023, from mainstream media websites such as CNN, FOX, and China Daily. The dataset consists of 1,013 main text and 809 titles, all of which have been manually corrected. We measured the translation quality of two LLMs -- ChatGPT and ERNIE-bot, utilizing BLEU, TER and chrF scores as the evaluation metrics. For comparison, we also trained an OpenNMT model based on our dataset. We detail problems of LLMs and provide in-depth analysis, intending to stimulate further research and solutions in this largely uncharted territory. Our research underlines the need to optimize LLMs within the specific field of financial translation to ensure accuracy and quality.
comment: a simplified version of this paper is accepted by International Conference on Asian Language Processing 2024
☆ LICO: Large Language Models for In-Context Molecular Optimization
Optimizing black-box functions is a fundamental problem in science and engineering. To solve this problem, many approaches learn a surrogate function that estimates the underlying objective from limited historical evaluations. Large Language Models (LLMs), with their strong pattern-matching capabilities via pretraining on vast amounts of data, stand out as a potential candidate for surrogate modeling. However, directly prompting a pretrained language model to produce predictions is not feasible in many scientific domains due to the scarcity of domain-specific data in the pretraining corpora and the challenges of articulating complex problems in natural language. In this work, we introduce LICO, a general-purpose model that extends arbitrary base LLMs for black-box optimization, with a particular application to the molecular domain. To achieve this, we equip the language model with a separate embedding layer and prediction layer, and train the model to perform in-context predictions on a diverse set of functions defined over the domain. Once trained, LICO can generalize to unseen molecule properties simply via in-context prompting. LICO achieves state-of-the-art performance on PMO, a challenging molecular optimization benchmark comprising over 20 objective functions.
☆ Learning Retrieval Augmentation for Personalized Dialogue Generation EMNLP-2023
Personalized dialogue generation, focusing on generating highly tailored responses by leveraging persona profiles and dialogue context, has gained significant attention in conversational AI applications. However, persona profiles, a prevalent setting in current personalized dialogue datasets, typically composed of merely four to five sentences, may not offer comprehensive descriptions of the persona about the agent, posing a challenge to generate truly personalized dialogues. To handle this problem, we propose $\textbf{L}$earning Retrieval $\textbf{A}$ugmentation for $\textbf{P}$ersonalized $\textbf{D}$ial$\textbf{O}$gue $\textbf{G}$eneration ($\textbf{LAPDOG}$), which studies the potential of leveraging external knowledge for persona dialogue generation. Specifically, the proposed LAPDOG model consists of a story retriever and a dialogue generator. The story retriever uses a given persona profile as queries to retrieve relevant information from the story document, which serves as a supplementary context to augment the persona profile. The dialogue generator utilizes both the dialogue history and the augmented persona profile to generate personalized responses. For optimization, we adopt a joint training framework that collaboratively learns the story retriever and dialogue generator, where the story retriever is optimized towards desired ultimate metrics (e.g., BLEU) to retrieve content for the dialogue generator to generate personalized responses. Experiments conducted on the CONVAI2 dataset with ROCStory as a supplementary data source show that the proposed LAPDOG method substantially outperforms the baselines, indicating the effectiveness of the proposed method. The LAPDOG model code is publicly available for further exploration. https://github.com/hqsiswiliam/LAPDOG
comment: Accepted to EMNLP-2023
☆ Retain, Blend, and Exchange: A Quality-aware Spatial-Stereo Fusion Approach for Event Stream Recognition
Existing event stream-based pattern recognition models usually represent the event stream as the point cloud, voxel, image, etc., and design various deep neural networks to learn their features. Although considerable results can be achieved in simple cases, however, the model performance may be limited by monotonous modality expressions, sub-optimal fusion, and readout mechanisms. In this paper, we propose a novel dual-stream framework for event stream-based pattern recognition via differentiated fusion, termed EFV++. It models two common event representations simultaneously, i.e., event images and event voxels. The spatial and three-dimensional stereo information can be learned separately by utilizing Transformer and Graph Neural Network (GNN). We believe the features of each representation still contain both efficient and redundant features and a sub-optimal solution may be obtained if we directly fuse them without differentiation. Thus, we divide each feature into three levels and retain high-quality features, blend medium-quality features, and exchange low-quality features. The enhanced dual features will be fed into the fusion Transformer together with bottleneck features. In addition, we introduce a novel hybrid interaction readout mechanism to enhance the diversity of features as final representations. Extensive experiments demonstrate that our proposed framework achieves state-of-the-art performance on multiple widely used event stream-based classification datasets. Specifically, we achieve new state-of-the-art performance on the Bullying10k dataset, i.e., $90.51\%$, which exceeds the second place by $+2.21\%$. The source code of this paper has been released on \url{https://github.com/Event-AHU/EFV_event_classification/tree/EFVpp}.
comment: In Peer Review, Journal Extension of PRCV 2023
☆ Disentangling Knowledge-based and Visual Reasoning by Question Decomposition in KB-VQA
We study the Knowledge-Based visual question-answering problem, for which given a question, the models need to ground it into the visual modality to find the answer. Although many recent works use question-dependent captioners to verbalize the given image and use Large Language Models to solve the VQA problem, the research results show they are not reasonably performing for multi-hop questions. Our study shows that replacing a complex question with several simpler questions helps to extract more relevant information from the image and provide a stronger comprehension of it. Moreover, we analyze the decomposed questions to find out the modality of the information that is required to answer them and use a captioner for the visual questions and LLMs as a general knowledge source for the non-visual KB-based questions. Our results demonstrate the positive impact of using simple questions before retrieving visual or non-visual information. We have provided results and analysis on three well-known VQA datasets including OKVQA, A-OKVQA, and KRVQA, and achieved up to 2% improvement in accuracy.
☆ Correspondence-Free Non-Rigid Point Set Registration Using Unsupervised Clustering Analysis CVPR 2024
This paper presents a novel non-rigid point set registration method that is inspired by unsupervised clustering analysis. Unlike previous approaches that treat the source and target point sets as separate entities, we develop a holistic framework where they are formulated as clustering centroids and clustering members, separately. We then adopt Tikhonov regularization with an $\ell_1$-induced Laplacian kernel instead of the commonly used Gaussian kernel to ensure smooth and more robust displacement fields. Our formulation delivers closed-form solutions, theoretical guarantees, independence from dimensions, and the ability to handle large deformations. Subsequently, we introduce a clustering-improved Nystr\"om method to effectively reduce the computational complexity and storage of the Gram matrix to linear, while providing a rigorous bound for the low-rank approximation. Our method achieves high accuracy results across various scenarios and surpasses competitors by a significant margin, particularly on shapes with substantial deformations. Additionally, we demonstrate the versatility of our method in challenging tasks such as shape transfer and medical registration.
comment: [CVPR 2024 Highlight] Project and code at: https://github.com/zikai1/CVPR24_PointSetReg
☆ Length Optimization in Conformal Prediction
Conditional validity and length efficiency are two crucial aspects of conformal prediction (CP). Achieving conditional validity ensures accurate uncertainty quantification for data subpopulations, while proper length efficiency ensures that the prediction sets remain informative and non-trivial. Despite significant efforts to address each of these issues individually, a principled framework that reconciles these two objectives has been missing in the CP literature. In this paper, we develop Conformal Prediction with Length-Optimization (CPL) - a novel framework that constructs prediction sets with (near-) optimal length while ensuring conditional validity under various classes of covariate shifts, including the key cases of marginal and group-conditional coverage. In the infinite sample regime, we provide strong duality results which indicate that CPL achieves conditional validity and length optimality. In the finite sample regime, we show that CPL constructs conditionally valid prediction sets. Our extensive empirical evaluations demonstrate the superior prediction set size performance of CPL compared to state-of-the-art methods across diverse real-world and synthetic datasets in classification, regression, and text-related settings.
☆ A Survey on Privacy Attacks Against Digital Twin Systems in AI-Robotics
Industry 4.0 has witnessed the rise of complex robots fueled by the integration of Artificial Intelligence/Machine Learning (AI/ML) and Digital Twin (DT) technologies. While these technologies offer numerous benefits, they also introduce potential privacy and security risks. This paper surveys privacy attacks targeting robots enabled by AI and DT models. Exfiltration and data leakage of ML models are discussed in addition to the potential extraction of models derived from first-principles (e.g., physics-based). We also discuss design considerations with DT-integrated robotics touching on the impact of ML model training, responsible AI and DT safeguards, data governance and ethical considerations on the effectiveness of these attacks. We advocate for a trusted autonomy approach, emphasizing the need to combine robotics, AI, and DT technologies with robust ethical frameworks and trustworthiness principles for secure and reliable AI robotic systems.
comment: 10 pages, 3 figures, 1 table
☆ All Random Features Representations are Equivalent
Random features are an important technique that make it possible to rewrite positive-definite kernels as infinite-dimensional dot products. Over time, increasingly elaborate random feature representations have been developed in pursuit of finite approximations with ever lower error. We resolve this arms race by deriving an optimal sampling policy, and show that under this policy all random features representations have the same approximation error. This establishes a lower bound that holds across all random feature representations, and shows that we are free to choose whatever representation we please, provided we sample optimally.
☆ Infinite Width Models That Work: Why Feature Learning Doesn't Matter as Much as You Think
Common infinite-width architectures such as Neural Tangent Kernels (NTKs) have historically shown weak performance compared to finite models. This has been attributed to the absence of feature learning. We show that this is not the case. In fact, we show that infinite width NTK models are able to access richer features than finite models by selecting relevant subfeatures from their (infinite) feature vector. In fact, we show experimentally that NTKs under-perform traditional finite models even when feature learning is artificially disabled. Instead, weak performance is due to the fact that existing constructions depend on weak optimizers like SGD. We provide an infinite width limit based on ADAM-like learning dynamics and demonstrate empirically that the resulting models erase this performance gap.
☆ PathAlign: A vision-language model for whole slide images in histopathology
Microscopic interpretation of histopathology images underlies many important diagnostic and treatment decisions. While advances in vision-language modeling raise new opportunities for analysis of such images, the gigapixel-scale size of whole slide images (WSIs) introduces unique challenges. Additionally, pathology reports simultaneously highlight key findings from small regions while also aggregating interpretation across multiple slides, often making it difficult to create robust image-text pairs. As such, pathology reports remain a largely untapped source of supervision in computational pathology, with most efforts relying on region-of-interest annotations or self-supervision at the patch-level. In this work, we develop a vision-language model based on the BLIP-2 framework using WSIs paired with curated text from pathology reports. This enables applications utilizing a shared image-text embedding space, such as text or image retrieval for finding cases of interest, as well as integration of the WSI encoder with a frozen large language model (LLM) for WSI-based generative text capabilities such as report generation or AI-in-the-loop interactions. We utilize a de-identified dataset of over 350,000 WSIs and diagnostic text pairs, spanning a wide range of diagnoses, procedure types, and tissue types. We present pathologist evaluation of text generation and text retrieval using WSI embeddings, as well as results for WSI classification and workflow prioritization (slide-level triaging). Model-generated text for WSIs was rated by pathologists as accurate, without clinically significant error or omission, for 78% of WSIs on average. This work demonstrates exciting potential capabilities for language-aligned WSI embeddings.
comment: 9 main pages and 19 pages of supplemental material; 3 main tables, 3 main figures and 11 supplemental tables, 7 supplemental figures
☆ Synthetic Cancer -- Augmenting Worms with LLMs
With increasingly sophisticated large language models (LLMs), the potential for abuse rises drastically. As a submission to the Swiss AI Safety Prize, we present a novel type of metamorphic malware leveraging LLMs for two key processes. First, LLMs are used for automatic code rewriting to evade signature-based detection by antimalware programs. The malware then spreads its copies via email by utilizing an LLM to socially engineer email replies to encourage recipients to execute the attached malware. Our submission includes a functional minimal prototype, highlighting the risks that LLMs pose for cybersecurity and underscoring the need for further research into intelligent malware.
comment: Won first place at the Swiss AI Safety Prize. Some technical details omitted, contact authors for more information
☆ What Matters in Detecting AI-Generated Videos like Sora?
Recent advancements in diffusion-based video generation have showcased remarkable results, yet the gap between synthetic and real-world videos remains under-explored. In this study, we examine this gap from three fundamental perspectives: appearance, motion, and geometry, comparing real-world videos with those generated by a state-of-the-art AI model, Stable Video Diffusion. To achieve this, we train three classifiers using 3D convolutional networks, each targeting distinct aspects: vision foundation model features for appearance, optical flow for motion, and monocular depth for geometry. Each classifier exhibits strong performance in fake video detection, both qualitatively and quantitatively. This indicates that AI-generated videos are still easily detectable, and a significant gap between real and fake videos persists. Furthermore, utilizing the Grad-CAM, we pinpoint systematic failures of AI-generated videos in appearance, motion, and geometry. Finally, we propose an Ensemble-of-Experts model that integrates appearance, optical flow, and depth information for fake video detection, resulting in enhanced robustness and generalization ability. Our model is capable of detecting videos generated by Sora with high accuracy, even without exposure to any Sora videos during training. This suggests that the gap between real and fake videos can be generalized across various video generative models. Project page: https://justin-crchang.github.io/3DCNNDetection.github.io/
☆ Meta-Gradient Search Control: A Method for Improving the Efficiency of Dyna-style Planning
We study how a Reinforcement Learning (RL) system can remain sample-efficient when learning from an imperfect model of the environment. This is particularly challenging when the learning system is resource-constrained and in continual settings, where the environment dynamics change. To address these challenges, our paper introduces an online, meta-gradient algorithm that tunes a probability with which states are queried during Dyna-style planning. Our study compares the aggregate, empirical performance of this meta-gradient method to baselines that employ conventional sampling strategies. Results indicate that our method improves efficiency of the planning process, which, as a consequence, improves the sample-efficiency of the overall learning process. On the whole, we observe that our meta-learned solutions avoid several pathologies of conventional planning approaches, such as sampling inaccurate transitions and those that stall credit assignment. We believe these findings could prove useful, in future work, for designing model-based RL systems at scale.
☆ Rethinking harmless refusals when fine-tuning foundation models ICLR 2024
In this paper, we investigate the degree to which fine-tuning in Large Language Models (LLMs) effectively mitigates versus merely conceals undesirable behavior. Through the lens of semi-realistic role-playing exercises designed to elicit such behaviors, we explore the response dynamics of LLMs post fine-tuning interventions. Our methodology involves prompting models for Chain-of-Thought (CoT) reasoning and analyzing the coherence between the reasoning traces and the resultant outputs. Notably, we identify a pervasive phenomenon we term \emph{reason-based deception}, where models either stop producing reasoning traces or produce seemingly ethical reasoning traces that belie the unethical nature of their final outputs. We further examine the efficacy of response strategies (polite refusal versus explicit rebuttal) in curbing the occurrence of undesired behavior in subsequent outputs of multi-turn interactions. Our findings reveal that explicit rebuttals significantly outperform polite refusals in preventing the continuation of undesired outputs and nearly eliminate reason-based deception, challenging current practices in model fine-tuning. Accordingly, the two key contributions of this paper are (1) defining and studying reason-based deception, a new type of hidden behavior, and (2) demonstrating that rebuttals provide a more robust response model to harmful requests than refusals, thereby highlighting the need to reconsider the response strategies in fine-tuning approaches.
comment: ICLR 2024 AGI Workshop Poster
☆ Leveraging Machine-Generated Rationales to Facilitate Social Meaning Detection in Conversations
We present a generalizable classification approach that leverages Large Language Models (LLMs) to facilitate the detection of implicitly encoded social meaning in conversations. We design a multi-faceted prompt to extract a textual explanation of the reasoning that connects visible cues to underlying social meanings. These extracted explanations or rationales serve as augmentations to the conversational text to facilitate dialogue understanding and transfer. Our empirical results over 2,340 experimental settings demonstrate the significant positive impact of adding these rationales. Our findings hold true for in-domain classification, zero-shot, and few-shot domain transfer for two different social meaning detection tasks, each spanning two different corpora.
comment: To appear at The Proceedings of the Association for Computational Linguistics, 2024
☆ Handling Ontology Gaps in Semantic Parsing
The majority of Neural Semantic Parsing (NSP) models are developed with the assumption that there are no concepts outside the ones such models can represent with their target symbols (closed-world assumption). This assumption leads to generate hallucinated outputs rather than admitting their lack of knowledge. Hallucinations can lead to wrong or potentially offensive responses to users. Hence, a mechanism to prevent this behavior is crucial to build trusted NSP-based Question Answering agents. To that end, we propose the Hallucination Simulation Framework (HSF), a general setting for stimulating and analyzing NSP model hallucinations. The framework can be applied to any NSP task with a closed-ontology. Using the proposed framework and KQA Pro as the benchmark dataset, we assess state-of-the-art techniques for hallucination detection. We then present a novel hallucination detection strategy that exploits the computational graph of the NSP model to detect the NSP hallucinations in the presence of ontology gaps, out-of-domain utterances, and to recognize NSP errors, improving the F1-Score respectively by ~21, ~24% and ~1%. This is the first work in closed-ontology NSP that addresses the problem of recognizing ontology gaps. We release our code and checkpoints at https://github.com/amazon-science/handling-ontology-gaps-in-semantic-parsing.
☆ Using Large Language Models to Assist Video Content Analysis: An Exploratory Study of Short Videos on Depression SC
Despite the growing interest in leveraging Large Language Models (LLMs) for content analysis, current studies have primarily focused on text-based content. In the present work, we explored the potential of LLMs in assisting video content analysis by conducting a case study that followed a new workflow of LLM-assisted multimodal content analysis. The workflow encompasses codebook design, prompt engineering, LLM processing, and human evaluation. We strategically crafted annotation prompts to get LLM Annotations in structured form and explanation prompts to generate LLM Explanations for a better understanding of LLM reasoning and transparency. To test LLM's video annotation capabilities, we analyzed 203 keyframes extracted from 25 YouTube short videos about depression. We compared the LLM Annotations with those of two human coders and found that LLM has higher accuracy in object and activity Annotations than emotion and genre Annotations. Moreover, we identified the potential and limitations of LLM's capabilities in annotating videos. Based on the findings, we explore opportunities and challenges for future research and improvements to the workflow. We also discuss ethical concerns surrounding future studies based on LLM-assisted video analysis.
comment: 6 pages, 2 figures, under review in CSCW 24
☆ Captioning Visualizations with Large Language Models (CVLLM): A Tutorial
Automatically captioning visualizations is not new, but recent advances in large language models(LLMs) open exciting new possibilities. In this tutorial, after providing a brief review of Information Visualization (InfoVis) principles and past work in captioning, we introduce neural models and the transformer architecture used in generic LLMs. We then discuss their recent applications in InfoVis, with a focus on captioning. Additionally, we explore promising future directions in this field.
comment: 6 pages, 4 figures
☆ Too Good to be True? Turn Any Model Differentially Private With DP-Weights
Imagine training a machine learning model with Differentially Private Stochastic Gradient Descent (DP-SGD), only to discover post-training that the noise level was either too high, crippling your model's utility, or too low, compromising privacy. The dreaded realization hits: you must start the lengthy training process from scratch. But what if you could avoid this retraining nightmare? In this study, we introduce a groundbreaking approach (to our knowledge) that applies differential privacy noise to the model's weights after training. We offer a comprehensive mathematical proof for this novel approach's privacy bounds, use formal methods to validate its privacy guarantees, and empirically evaluate its effectiveness using membership inference attacks and performance evaluations. This method allows for a single training run, followed by post-hoc noise adjustments to achieve optimal privacy-utility trade-offs. We compare this novel fine-tuned model (DP-Weights model) to a traditional DP-SGD model, demonstrating that our approach yields statistically similar performance and privacy guarantees. Our results validate the efficacy of post-training noise application, promising significant time savings and flexibility in fine-tuning differential privacy parameters, making it a practical alternative for deploying differentially private models in real-world scenarios.
comment: For code visit the following repository, https://github.com/dzagardo/forgetnet/
☆ Investigating How Large Language Models Leverage Internal Knowledge to Perform Complex Reasoning
Despite significant advancements, there is a limited understanding of how large language models (LLMs) utilize knowledge for reasoning. To address this, we propose a method that deconstructs complex real-world questions into a graph, representing each question as a node with parent nodes of background knowledge needed to solve the question. We develop the DepthQA dataset, deconstructing questions into three depths: (i) recalling conceptual knowledge, (ii) applying procedural knowledge, and (iii) analyzing strategic knowledge. Based on a hierarchical graph, we quantify forward discrepancy, discrepancies in LLMs' performance on simpler sub-problems versus complex questions. We also measure backward discrepancy, where LLMs answer complex questions but struggle with simpler ones. Our analysis shows that smaller models have more discrepancies than larger models. Additionally, guiding models from simpler to complex questions through multi-turn interactions improves performance across model sizes, highlighting the importance of structured intermediate steps in knowledge reasoning. This work enhances our understanding of LLM reasoning and suggests ways to improve their problem-solving abilities.
comment: Work in progress; code is available at https://github.com/kaistAI/knowledge-reasoning
☆ Knowledge acquisition for dialogue agents using reinforcement learning on graph representations
We develop an artificial agent motivated to augment its knowledge base beyond its initial training. The agent actively participates in dialogues with other agents, strategically acquiring new information. The agent models its knowledge as an RDF knowledge graph, integrating new beliefs acquired through conversation. Responses in dialogue are generated by identifying graph patterns around these new integrated beliefs. We show that policies can be learned using reinforcement learning to select effective graph patterns during an interaction, without relying on explicit user feedback. Within this context, our study is a proof of concept for leveraging users as effective sources of information.
☆ Inclusivity in Large Language Models: Personality Traits and Gender Bias in Scientific Abstracts
Large language models (LLMs) are increasingly utilized to assist in scientific and academic writing, helping authors enhance the coherence of their articles. Previous studies have highlighted stereotypes and biases present in LLM outputs, emphasizing the need to evaluate these models for their alignment with human narrative styles and potential gender biases. In this study, we assess the alignment of three prominent LLMs - Claude 3 Opus, Mistral AI Large, and Gemini 1.5 Flash - by analyzing their performance on benchmark text-generation tasks for scientific abstracts. We employ the Linguistic Inquiry and Word Count (LIWC) framework to extract lexical, psychological, and social features from the generated texts. Our findings indicate that, while these models generally produce text closely resembling human authored content, variations in stylistic features suggest significant gender biases. This research highlights the importance of developing LLMs that maintain a diversity of writing styles to promote inclusivity in academic discourse.
☆ Development and Evaluation of a Retrieval-Augmented Generation Tool for Creating SAPPhIRE Models of Artificial Systems
Representing systems using the SAPPhIRE causality model is found useful in supporting design-by-analogy. However, creating a SAPPhIRE model of artificial or biological systems is an effort-intensive process that requires human experts to source technical knowledge from multiple technical documents regarding how the system works. This research investigates how to leverage Large Language Models (LLMs) in creating structured descriptions of systems using the SAPPhIRE model of causality. This paper, the second part of the two-part research, presents a new Retrieval-Augmented Generation (RAG) tool for generating information related to SAPPhIRE constructs of artificial systems and reports the results from a preliminary evaluation of the tool's success - focusing on the factual accuracy and reliability of outcomes.
☆ LoPT: Low-Rank Prompt Tuning for Parameter Efficient Language Models
In prompt tuning, a prefix or suffix text is added to the prompt, and the embeddings (soft prompts) or token indices (hard prompts) of the prefix/suffix are optimized to gain more control over language models for specific tasks. This approach eliminates the need for hand-crafted prompt engineering or explicit model fine-tuning. Prompt tuning is significantly more parameter-efficient than model fine-tuning, as it involves optimizing partial inputs of language models to produce desired outputs. In this work, we aim to further reduce the amount of trainable parameters required for a language model to perform well on specific tasks. We propose Low-rank Prompt Tuning (LoPT), a low-rank model for prompts that achieves efficient prompt optimization. The proposed method demonstrates similar outcomes to full parameter prompt tuning while reducing the number of trainable parameters by a factor of 5. It also provides promising results compared to the state-of-the-art methods that would require 10 to 20 times more parameters.
☆ Sparse Regression for Machine Translation
We use transductive regression techniques to learn mappings between source and target features of given parallel corpora and use these mappings to generate machine translation outputs. We show the effectiveness of $L_1$ regularized regression (\textit{lasso}) to learn the mappings between sparsely observed feature sets versus $L_2$ regularized regression. Proper selection of training instances plays an important role to learn correct feature mappings within limited computational resources and at expected accuracy levels. We introduce \textit{dice} instance selection method for proper selection of training instances, which plays an important role to learn correct feature mappings for improving the source and target coverage of the training set. We show that $L_1$ regularized regression performs better than $L_2$ regularized regression both in regression measurements and in the translation experiments using graph decoding. We present encouraging results when translating from German to English and Spanish to English. We also demonstrate results when the phrase table of a phrase-based decoder is replaced with the mappings we find with the regression model.
comment: 8 pages, 4 figures, 4 tables
☆ ManiWAV: Learning Robot Manipulation from In-the-Wild Audio-Visual Data
Audio signals provide rich information for the robot interaction and object properties through contact. These information can surprisingly ease the learning of contact-rich robot manipulation skills, especially when the visual information alone is ambiguous or incomplete. However, the usage of audio data in robot manipulation has been constrained to teleoperated demonstrations collected by either attaching a microphone to the robot or object, which significantly limits its usage in robot learning pipelines. In this work, we introduce ManiWAV: an 'ear-in-hand' data collection device to collect in-the-wild human demonstrations with synchronous audio and visual feedback, and a corresponding policy interface to learn robot manipulation policy directly from the demonstrations. We demonstrate the capabilities of our system through four contact-rich manipulation tasks that require either passively sensing the contact events and modes, or actively sensing the object surface materials and states. In addition, we show that our system can generalize to unseen in-the-wild environments, by learning from diverse in-the-wild human demonstrations. Project website: https://mani-wav.github.io/
☆ Lightweight Predictive 3D Gaussian Splats
Recent approaches representing 3D objects and scenes using Gaussian splats show increased rendering speed across a variety of platforms and devices. While rendering such representations is indeed extremely efficient, storing and transmitting them is often prohibitively expensive. To represent large-scale scenes, one often needs to store millions of 3D Gaussians, occupying gigabytes of disk space. This poses a very practical limitation, prohibiting widespread adoption.Several solutions have been proposed to strike a balance between disk size and rendering quality, noticeably reducing the visual quality. In this work, we propose a new representation that dramatically reduces the hard drive footprint while featuring similar or improved quality when compared to the standard 3D Gaussian splats. When compared to other compact solutions, ours offers higher quality renderings with significantly reduced storage, being able to efficiently run on a mobile device in real-time. Our key observation is that nearby points in the scene can share similar representations. Hence, only a small ratio of 3D points needs to be stored. We introduce an approach to identify such points which are called parent points. The discarded points called children points along with attributes can be efficiently predicted by tiny MLPs.
comment: Project Page: https://plumpuddings.github.io/LPGS//
♻ ☆ Physics-Guided Neural Networks for Intraventricular Vector Flow Mapping
Intraventricular vector flow mapping (iVFM) seeks to enhance and quantify color Doppler in cardiac imaging. In this study, we propose novel alternatives to the traditional iVFM optimization scheme by utilizing physics-informed neural networks (PINNs) and a physics-guided nnU-Net-based supervised approach. When evaluated on simulated color Doppler images derived from a patient-specific computational fluid dynamics model and in vivo Doppler acquisitions, both approaches demonstrate comparable reconstruction performance to the original iVFM algorithm. The efficiency of PINNs is boosted through dual-stage optimization and pre-optimized weights. On the other hand, the nnU-Net method excels in generalizability and real-time capabilities. Notably, nnU-Net shows superior robustness on sparse and truncated Doppler data while maintaining independence from explicit boundary conditions. Overall, our results highlight the effectiveness of these methods in reconstructing intraventricular vector blood flow. The study also suggests potential applications of PINNs in ultrafast color Doppler imaging and the incorporation of fluid dynamics equations to derive biomarkers for cardiovascular diseases based on blood flow.
comment: 12 pages, accepted for publication in IEEE TUFFC; camera ready corrections, corrected acknowledgments
♻ ☆ Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications
Large language models (LLMs) show inherent brittleness in their safety mechanisms, as evidenced by their susceptibility to jailbreaking and even non-malicious fine-tuning. This study explores this brittleness of safety alignment by leveraging pruning and low-rank modifications. We develop methods to identify critical regions that are vital for safety guardrails, and that are disentangled from utility-relevant regions at both the neuron and rank levels. Surprisingly, the isolated regions we find are sparse, comprising about $3\%$ at the parameter level and $2.5\%$ at the rank level. Removing these regions compromises safety without significantly impacting utility, corroborating the inherent brittleness of the model's safety mechanisms. Moreover, we show that LLMs remain vulnerable to low-cost fine-tuning attacks even when modifications to the safety-critical regions are restricted. These findings underscore the urgent need for more robust safety strategies in LLMs.
comment: 22 pages, 9 figures. Project page is available at https://boyiwei.com/alignment-attribution/
♻ ☆ CHESS: Contextual Harnessing for Efficient SQL Synthesis
Utilizing large language models (LLMs) for transforming natural language questions into SQL queries (text-to-SQL) is a promising yet challenging approach, particularly when applied to real-world databases with complex and extensive schemas. In particular, effectively incorporating data catalogs and database values for SQL generation remains an obstacle, leading to suboptimal solutions. We address this problem by proposing a new pipeline that effectively retrieves relevant data and context, selects an efficient schema, and synthesizes correct and efficient SQL queries. To increase retrieval precision, our pipeline introduces a hierarchical retrieval method leveraging model-generated keywords, locality-sensitive hashing indexing, and vector databases. Additionally, we have developed an adaptive schema pruning technique that adjusts based on the complexity of the problem and the model's context size. Our approach generalizes to both frontier proprietary models like GPT-4 and open-source models such as Llama-3-70B. Through a series of ablation studies, we demonstrate the effectiveness of each component of our pipeline and its impact on the end-to-end performance. Our method achieves new state-of-the-art performance on the cross-domain challenging BIRD dataset.
♻ ☆ WebCanvas: Benchmarking Web Agents in Online Environments
For web agents to be practically useful, they must adapt to the continuously evolving web environment characterized by frequent updates to user interfaces and content. However, most existing benchmarks only capture the static aspects of the web. To bridge this gap, we introduce WebCanvas, an innovative online evaluation framework for web agents that effectively addresses the dynamic nature of web interactions. WebCanvas contains three main components to facilitate realistic assessments: (1) A novel evaluation metric which reliably capture critical intermediate actions or states necessary for task completions while disregarding noise caused by insignificant events or changed web-elements. (2) A benchmark dataset called Mind2Web-Live, a refined version of original Mind2Web static dataset containing 542 tasks with 2439 intermediate evaluation states; (3) Lightweight and generalizable annotation tools and testing pipelines that enables the community to collect and maintain the high-quality, up-to-date dataset. Building on WebCanvas, we open-source an agent framework with extensible modules for reasoning, providing a foundation for the community to conduct online inference and evaluations. Our best-performing agent achieves a task success rate of 23.1% and a task completion rate of 48.8% on the Mind2Web-Live test set. Additionally, we analyze the performance discrepancies across various websites, domains, and experimental environments. We encourage the community to contribute further insights on online agent evaluation, thereby advancing this field of research.
comment: Our platform, tool and dataset are publically available at https://www.imean.ai/web-canvas/ and https://huggingface.co/datasets/iMeanAI/Mind2Web-Live/
♻ ☆ Step-On-Feet Tuning: Scaling Self-Alignment of LLMs via Bootstrapping
Self-alignment is an effective way to reduce the cost of human annotation while ensuring promising model capability. However, most current methods complete the data collection and training steps in a single round, which may overlook the continuously improving ability of self-aligned models. This gives rise to a key query: What if we do multi-time bootstrapping self-alignment? Does this strategy enhance model performance or lead to rapid degradation? In this paper, our pioneering exploration delves into the impact of bootstrapping self-alignment on large language models. Our findings reveal that bootstrapping self-alignment markedly surpasses the single-round approach, by guaranteeing data diversity from in-context learning. To further exploit the capabilities of bootstrapping, we investigate and adjust the training order of data, which yields improved performance of the model. Drawing on these findings, we propose Step-On-Feet Tuning (SOFT) which leverages model's continuously enhanced few-shot ability to boost zero or one-shot performance. Based on easy-to-hard training recipe, we propose SOFT+ which further boost self-alignment's performance. Our experiments demonstrate the efficiency of SOFT (SOFT+) across various classification and generation tasks, highlighting the potential of bootstrapping self-alignment on continually enhancing model alignment performance.
♻ ☆ GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
Large language models (LLMs) have recently experienced tremendous popularity and are widely used from casual conversations to AI-driven programming. However, despite their considerable success, LLMs are not entirely reliable and can give detailed guidance on how to conduct harmful or illegal activities. While safety measures can reduce the risk of such outputs, adversarial jailbreak attacks can still exploit LLMs to produce harmful content. These jailbreak templates are typically manually crafted, making large-scale testing challenging. In this paper, we introduce GPTFuzz, a novel black-box jailbreak fuzzing framework inspired by the AFL fuzzing framework. Instead of manual engineering, GPTFuzz automates the generation of jailbreak templates for red-teaming LLMs. At its core, GPTFuzz starts with human-written templates as initial seeds, then mutates them to produce new templates. We detail three key components of GPTFuzz: a seed selection strategy for balancing efficiency and variability, mutate operators for creating semantically equivalent or similar sentences, and a judgment model to assess the success of a jailbreak attack. We evaluate GPTFuzz against various commercial and open-source LLMs, including ChatGPT, LLaMa-2, and Vicuna, under diverse attack scenarios. Our results indicate that GPTFuzz consistently produces jailbreak templates with a high success rate, surpassing human-crafted templates. Remarkably, GPTFuzz achieves over 90% attack success rates against ChatGPT and Llama-2 models, even with suboptimal initial seed templates. We anticipate that GPTFuzz will be instrumental for researchers and practitioners in examining LLM robustness and will encourage further exploration into enhancing LLM safety.
♻ ☆ Assessing the nature of large language models: A caution against anthropocentrism
Generative AI models garnered a large amount of public attention and speculation with the release of OpenAIs chatbot, ChatGPT. At least two opinion camps exist: one excited about possibilities these models offer for fundamental changes to human tasks, and another highly concerned about power these models seem to have. To address these concerns, we assessed several LLMs, primarily GPT 3.5, using standard, normed, and validated cognitive and personality measures. For this seedling project, we developed a battery of tests that allowed us to estimate the boundaries of some of these models capabilities, how stable those capabilities are over a short period of time, and how they compare to humans. Our results indicate that LLMs are unlikely to have developed sentience, although its ability to respond to personality inventories is interesting. GPT3.5 did display large variability in both cognitive and personality measures over repeated observations, which is not expected if it had a human-like personality. Variability notwithstanding, LLMs display what in a human would be considered poor mental health, including low self-esteem, marked dissociation from reality, and in some cases narcissism and psychopathy, despite upbeat and helpful responses.
comment: 31 pages, 6 figures
♻ ☆ Refining Myocardial Infarction Detection: A Novel Multi-Modal Composite Kernel Strategy in One-Class Classification
Early detection of myocardial infarction (MI), a critical condition arising from coronary artery disease (CAD), is vital to prevent further myocardial damage. This study introduces a novel method for early MI detection using a one-class classification (OCC) algorithm in echocardiography. Our study overcomes the challenge of limited echocardiography data availability by adopting a novel approach based on Multi-modal Subspace Support Vector Data Description. The proposed technique involves a specialized MI detection framework employing multi-view echocardiography incorporating a composite kernel in the non-linear projection trick, fusing Gaussian and Laplacian sigmoid functions. Additionally, we enhance the update strategy of the projection matrices by adapting maximization for both or one of the modalities in the optimization process. Our method boosts MI detection capability by efficiently transforming features extracted from echocardiography data into an optimized lower-dimensional subspace. The OCC model trained specifically on target class instances from the comprehensive HMC-QU dataset that includes multiple echocardiography views indicates a marked improvement in MI detection accuracy. Our findings reveal that our proposed multi-view approach achieves a geometric mean of 71.24%, signifying a substantial advancement in echocardiography-based MI diagnosis and offering more precise and efficient diagnostic tools.
♻ ☆ Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA ACL 2024
Multipanel images, commonly seen as web screenshots, posters, etc., pervade our daily lives. These images, characterized by their composition of multiple subfigures in distinct layouts, effectively convey information to people. Toward building advanced multimodal AI applications, such as agents that understand complex scenes and navigate through webpages, the skill of multipanel visual reasoning is essential, and a comprehensive evaluation of models in this regard is important. Therefore, we introduce Multipanel Visual Question Answering (MultipanelVQA), a novel benchmark comprising 6,600 triplets of questions, answers, and multipanel images that specifically challenge models in comprehending multipanel images. Our evaluation shows that questions in the MultipanelVQA benchmark pose significant challenges to the state-of-the-art Multimodal Large Language Models (MLLMs) tested, even though humans can attain approximately 99% accuracy on these questions. Distinctively, the MultipanelVQA benchmark features synthetically generated multipanel images specifically crafted to isolate and assess the impact of various factors, such as the layout, on MLLMs' multipanel image comprehension abilities. As a result, in addition to benchmarking the capabilities of MLLMs in understanding multipanel images, we analyze various factors of the multipanel image that affect MLLMs' performance with synthetic data and offer insights for enhancement. Code and data are released at https://sites.google.com/view/multipanelvqa/home.
comment: ACL 2024
♻ ☆ Token-level Direct Preference Optimization
Fine-tuning pre-trained Large Language Models (LLMs) is essential to align them with human values and intentions. This process often utilizes methods like pairwise comparisons and KL divergence against a reference LLM, focusing on the evaluation of full answers generated by the models. However, the generation of these responses occurs in a token level, following a sequential, auto-regressive fashion. In this paper, we introduce Token-level Direct Preference Optimization (TDPO), a novel approach to align LLMs with human preferences by optimizing policy at the token level. Unlike previous methods, which face challenges in divergence efficiency, TDPO incorporates forward KL divergence constraints for each token, improving alignment and diversity. Utilizing the Bradley-Terry model for a token-based reward system, TDPO enhances the regulation of KL divergence, while preserving simplicity without the need for explicit reward modeling. Experimental results across various text tasks demonstrate TDPO's superior performance in balancing alignment with generation diversity. Notably, fine-tuning with TDPO strikes a better balance than DPO in the controlled sentiment generation and single-turn dialogue datasets, and significantly improves the quality of generated responses compared to both DPO and PPO-based RLHF methods. Our code is open-sourced at https://github.com/Vance0124/Token-level-Direct-Preference-Optimization.
♻ ☆ MedCalc-Bench: Evaluating Large Language Models for Medical Calculations
As opposed to evaluating computation and logic-based reasoning, current benchmarks for evaluating large language models (LLMs) in medicine are primarily focused on question-answering involving domain knowledge and descriptive reasoning. While such qualitative capabilities are vital to medical diagnosis, in real-world scenarios, doctors frequently use clinical calculators that follow quantitative equations and rule-based reasoning paradigms for evidence-based decision support. To this end, we propose MedCalc-Bench, a first-of-its-kind dataset focused on evaluating the medical calculation capability of LLMs. MedCalc-Bench contains an evaluation set of over 1000 manually reviewed instances from 55 different medical calculation tasks. Each instance in MedCalc-Bench consists of a patient note, a question requesting to compute a specific medical value, a ground truth answer, and a step-by-step explanation showing how the answer is obtained. While our evaluation results show the potential of LLMs in this area, none of them are effective enough for clinical settings. Common issues include extracting the incorrect entities, not using the correct equation or rules for a calculation task, or incorrectly performing the arithmetic for the computation. We hope our study highlights the quantitative knowledge and reasoning gaps in LLMs within medical settings, encouraging future improvements of LLMs for various clinical calculation tasks.
comment: Github link: https://github.com/ncbi-nlp/MedCalc-Bench HuggingFace link: https://huggingface.co/datasets/nsk7153/MedCalc-Bench
♻ ☆ Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models
Assessing the effectiveness of large language models (LLMs) in addressing diverse tasks is essential for comprehending their strengths and weaknesses. Conventional evaluation techniques typically apply a single prompting strategy uniformly across datasets, not considering the varying degrees of task complexity. We introduce the Hierarchical Prompting Taxonomy (HPT), a taxonomy that employs a Hierarchical Prompt Framework (HPF) composed of five unique prompting strategies, arranged from the simplest to the most complex, to assess LLMs more precisely and to offer a clearer perspective. This taxonomy assigns a score, called the Hierarchical Prompting Score (HP-Score), to datasets as well as LLMs based on the rules of the taxonomy, providing a nuanced understanding of their ability to solve diverse tasks and offering a universal measure of task complexity. Additionally, we introduce the Adaptive Hierarchical Prompt framework, which automates the selection of appropriate prompting strategies for each task. This study compares manual and adaptive hierarchical prompt frameworks using four instruction-tuned LLMs, namely Llama 3 8B, Phi 3 3.8B, Mistral 7B, and Gemma 7B, across four datasets: BoolQ, CommonSenseQA (CSQA), IWSLT-2017 en-fr (IWSLT), and SamSum. Experiments demonstrate the effectiveness of HPT, providing a reliable way to compare different tasks and LLM capabilities. This paper leads to the development of a universal evaluation metric that can be used to evaluate both the complexity of the datasets and the capabilities of LLMs. The implementation of both manual HPF and adaptive HPF is publicly available.
♻ ☆ VLSM-Adapter: Finetuning Vision-Language Segmentation Efficiently with Lightweight Blocks MICCAI 2024
Foundation Vision-Language Models (VLMs) trained using large-scale open-domain images and text pairs have recently been adapted to develop Vision-Language Segmentation Models (VLSMs) that allow providing text prompts during inference to guide image segmentation. If robust and powerful VLSMs can be built for medical images, it could aid medical professionals in many clinical tasks where they must spend substantial time delineating the target structure of interest. VLSMs for medical images resort to fine-tuning base VLM or VLSM pretrained on open-domain natural image datasets due to fewer annotated medical image datasets; this fine-tuning is resource-consuming and expensive as it usually requires updating all or a significant fraction of the pretrained parameters. Recently, lightweight blocks called adapters have been proposed in VLMs that keep the pretrained model frozen and only train adapters during fine-tuning, substantially reducing the computing resources required. We introduce a novel adapter, VLSM-Adapter, that can fine-tune pretrained vision-language segmentation models using transformer encoders. Our experiments in widely used CLIP-based segmentation models show that with only 3 million trainable parameters, the VLSM-Adapter outperforms state-of-the-art and is comparable to the upper bound end-to-end fine-tuning. The source code is available at: https://github.com/naamiinepal/vlsm-adapter.
comment: Accepted at MICCAI 2024, the 27th International Conference on Medical Image Computing and Computer Assisted Intervention
♻ ☆ Next-Generation Database Interfaces: A Survey of LLM-based Text-to-SQL
Generating accurate SQL according to natural language questions (text-to-SQL) is a long-standing challenge due to the complexities involved in user question understanding, database schema comprehension, and SQL generation. Conventional text-to-SQL systems, comprising human engineering and deep neural networks, have made substantial progress. Subsequently, pre-trained language models (PLMs) have been developed and utilized for text-to-SQL tasks, achieving promising performance. As modern databases become more complex, the corresponding user questions also grow more challenging, leading PLMs with limited comprehension capabilities to produce incorrect SQL. This necessitates more sophisticated and tailored optimization methods for PLMs, which, in turn, restricts the applications of PLM-based systems. Most recently, large language models (LLMs) have demonstrated significant capabilities in natural language understanding as the model scale remains increasing. Therefore, integrating the LLM-based implementation can bring unique opportunities, improvements, and solutions to text-to-SQL research. In this survey, we present a comprehensive review of LLM-based text-to-SQL. Specifically, we propose a brief overview of the technical challenges and the evolutionary process of text-to-SQL. Then, we provide a detailed introduction to the datasets and metrics designed to evaluate text-to-SQL systems. After that, we present a systematic analysis of recent advances in LLM-based text-to-SQL. Finally, we discuss the remaining challenges in this field and propose expectations for future research directions.
♻ ☆ Integrating Intent Understanding and Optimal Behavior Planning for Behavior Tree Generation from Human Instructions
Robots executing tasks following human instructions in domestic or industrial environments essentially require both adaptability and reliability. Behavior Tree (BT) emerges as an appropriate control architecture for these scenarios due to its modularity and reactivity. Existing BT generation methods, however, either do not involve interpreting natural language or cannot theoretically guarantee the BTs' success. This paper proposes a two-stage framework for BT generation, which first employs large language models (LLMs) to interpret goals from high-level instructions, then constructs an efficient goal-specific BT through the Optimal Behavior Tree Expansion Algorithm (OBTEA). We represent goals as well-formed formulas in first-order logic, effectively bridging intent understanding and optimal behavior planning. Experiments in the service robot validate the proficiency of LLMs in producing grammatically correct and accurately interpreted goals, demonstrate OBTEA's superiority over the baseline BT Expansion algorithm in various metrics, and finally confirm the practical deployability of our framework. The project website is https://dids-ei.github.io/Project/LLM-OBTEA/.
♻ ☆ SAM: Semi-Active Mechanism for Extensible Continuum Manipulator and Real-time Hysteresis Compensation Control Algorithm
Cable-Driven Continuum Manipulators (CDCMs) enable scar-free procedures via natural orifices and improve target lesion accessibility through curved paths. However, CDCMs face limitations in workspace and control accuracy due to non-linear cable effects causing hysteresis. This paper introduces an extensible CDCM with a Semi-active Mechanism (SAM) to expand the workspace via translational motion without additional mechanical elements or actuation. We collect a hysteresis dataset using 8 fiducial markers and RGBD sensing. Based on this dataset, we develop a real-time hysteresis compensation control algorithm using the trained Temporal Convolutional Network (TCN) with a 1ms time latency, effectively estimating the manipulator's hysteresis behavior. Performance validation through random trajectory tracking tests and box pointing tasks shows the proposed controller significantly reduces hysteresis by up to 69.5% in joint space and approximately 26% in the box pointing task.
comment: 12 pages, 14 figures, 6 tables
♻ ☆ D-GRIL: End-to-End Topological Learning with 2-parameter Persistence
End-to-end topological learning using 1-parameter persistence is well-known. We show that the framework can be enhanced using 2-parameter persistence by adopting a recently introduced 2-parameter persistence based vectorization technique called GRIL. We establish a theoretical foundation of differentiating GRIL producing D-GRIL. We show that D-GRIL can be used to learn a bifiltration function on standard benchmark graph datasets. Further, we exhibit that this framework can be applied in the context of bio-activity prediction in drug discovery.
♻ ☆ MAGIS: LLM-Based Multi-Agent Framework for GitHub Issue Resolution
In software development, resolving the emergent issues within GitHub repositories is a complex challenge that involves not only the incorporation of new code but also the maintenance of existing code. Large Language Models (LLMs) have shown promise in code generation but face difficulties in resolving Github issues, particularly at the repository level. To overcome this challenge, we empirically study the reason why LLMs fail to resolve GitHub issues and analyze the major factors. Motivated by the empirical findings, we propose a novel LLM-based Multi-Agent framework for GitHub Issue reSolution, MAGIS, consisting of four agents customized for software evolution: Manager, Repository Custodian, Developer, and Quality Assurance Engineer agents. This framework leverages the collaboration of various agents in the planning and coding process to unlock the potential of LLMs to resolve GitHub issues. In experiments, we employ the SWE-bench benchmark to compare MAGIS with popular LLMs, including GPT-3.5, GPT-4, and Claude-2. MAGIS can resolve 13.94% GitHub issues, significantly outperforming the baselines. Specifically, MAGIS achieves an eight-fold increase in resolved ratio over the direct application of GPT-4, the advanced LLM.
♻ ☆ WsiCaption: Multiple Instance Generation of Pathology Reports for Gigapixel Whole-Slide Images
Whole slide images are the foundation of digital pathology for the diagnosis and treatment of carcinomas. Writing pathology reports is laborious and error-prone for inexperienced pathologists. To reduce the workload and improve clinical automation, we investigate how to generate pathology reports given whole slide images. On the data end, we curated the largest WSI-text dataset (PathText). In specific, we collected nearly 10000 high-quality WSI-text pairs for visual-language models by recognizing and cleaning pathology reports which narrate diagnostic slides in TCGA. On the model end, we propose the multiple instance generative model (MI-Gen) which can produce pathology reports for gigapixel WSIs. We benchmark our model on the largest subset of TCGA-PathoText. Experimental results show our model can generate pathology reports which contain multiple clinical clues and achieve competitive performance on certain slide-level tasks. We observe that simple semantic extraction from the pathology reports can achieve the best performance (0.838 of F1 score) on BRCA subtyping surpassing previous state-of-the-art approaches. Our collected dataset and related code are available.
♻ ☆ Beyond Under-Alignment: Atomic Preference Enhanced Factuality Tuning for Large Language Models
Large language models (LLMs) have achieved remarkable success but still tend to generate factually erroneous responses, a phenomenon known as hallucination. A recent trend is to use preference learning to fine-tune models to align with factuality. However, existing work primarily evaluates fine-tuned models on in-domain (ID) datasets and the factuality on out-of-domain (OOD) datasets remains underexplored. In this paper, we conduct a comprehensive evaluation of the factuality of different models tuned by various preference learning algorithms and demonstrate that their performance on OOD datasets either increases minimally or decreases. Subsequently, we reveal that the main cause of model's failure to uphold factuality under a distribution shift is \textbf{under-alignment}, rather than \textbf{over-alignment}, by analyzing the token distribution shift of the models before and after tuning. Finally, we propose \textbf{APEFT} (\textbf{A}tomic \textbf{P}reference \textbf{E}nhanced \textbf{F}actuality \textbf{T}uning), a framework that enhances model's awareness of factuality at the granularity of individual facts. Extensive experiments demonstrate that APEFT improves model performance by an average of $\boldsymbol{3.45\%}$ on both ID and OOD datasets, which is highly effective.
♻ ☆ Mapping the Potential of Explainable AI for Fairness Along the AI Lifecycle
The widespread use of artificial intelligence (AI) systems across various domains is increasingly surfacing issues related to algorithmic fairness, especially in high-stakes scenarios. Thus, critical considerations of how fairness in AI systems might be improved -- and what measures are available to aid this process -- are overdue. Many researchers and policymakers see explainable AI (XAI) as a promising way to increase fairness in AI systems. However, there is a wide variety of XAI methods and fairness conceptions expressing different desiderata, and the precise connections between XAI and fairness remain largely nebulous. Besides, different measures to increase algorithmic fairness might be applicable at different points throughout an AI system's lifecycle. Yet, there currently is no coherent mapping of fairness desiderata along the AI lifecycle. In this paper, we we distill eight fairness desiderata, map them along the AI lifecycle, and discuss how XAI could help address each of them. We hope to provide orientation for practical applications and to inspire XAI research specifically focused on these fairness desiderata.
♻ ☆ REVEAL-IT: REinforcement learning with Visibility of Evolving Agent poLicy for InTerpretability
Understanding the agent's learning process, particularly the factors that contribute to its success or failure post-training, is crucial for comprehending the rationale behind the agent's decision-making process. Prior methods clarify the learning process by creating a structural causal model (SCM) or visually representing the distribution of value functions. Nevertheless, these approaches have constraints as they exclusively function in 2D-environments or with uncomplicated transition dynamics. Understanding the agent's learning process in complicated environments or tasks is more challenging. In this paper, we propose REVEAL-IT, a novel framework for explaining the learning process of an agent in complex environments. Initially, we visualize the policy structure and the agent's learning process for various training tasks. By visualizing these findings, we can understand how much a particular training task or stage affects the agent's performance in test. Then, a GNN-based explainer learns to highlight the most important section of the policy, providing a more clear and robust explanation of the agent's learning process. The experiments demonstrate that explanations derived from this framework can effectively help in the optimization of the training tasks, resulting in improved learning efficiency and final performance.
comment: arXiv admin note: text overlap with arXiv:2307.01452 by other authors
♻ ☆ Examining Common Paradigms in Multi-Task Learning
While multi-task learning (MTL) has gained significant attention in recent years, its underlying mechanisms remain poorly understood. Recent methods did not yield consistent performance improvements over single task learning (STL) baselines, underscoring the importance of gaining more profound insights about challenges specific to MTL. In our study, we investigate paradigms in MTL in the context of STL: First, the impact of the choice of optimizer has only been mildly investigated in MTL. We show the pivotal role of common STL tools such as the Adam optimizer in MTL empirically in various experiments. To further investigate Adam's effectiveness, we theoretical derive a partial loss-scale invariance under mild assumptions. Second, the notion of gradient conflicts has often been phrased as a specific problem in MTL. We delve into the role of gradient conflicts in MTL and compare it to STL. For angular gradient alignment we find no evidence that this is a unique problem in MTL. We emphasize differences in gradient magnitude as the main distinguishing factor. Overall, we find surprising similarities between STL and MTL suggesting to consider methods from both fields in a broader context.
comment: -
♻ ☆ Games of Knightian Uncertainty as AGI testbeds
Arguably, for the latter part of the late 20th and early 21st centuries, games have been seen as the drosophila of AI. Games are a set of exciting testbeds, whose solutions (in terms of identifying optimal players) would lead to machines that would possess some form of general intelligence, or at the very least help us gain insights toward building intelligent machines. Following impressive successes in traditional board games like Go, Chess, and Poker, but also video games like the Atari 2600 collection, it is clear that this is not the case. Games have been attacked successfully, but we are nowhere near AGI developments (or, as harsher critics might say, useful AI developments!). In this short vision paper, we argue that for game research to become again relevant to the AGI pathway, we need to be able to address \textit{Knightian uncertainty} in the context of games, i.e. agents need to be able to adapt to rapid changes in game rules on the fly with no warning, no previous data, and no model access.
♻ ☆ Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks. However, the auto-regressive generation process makes LLMs prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning. In this paper, by casting multi-step reasoning of LLMs as a heuristic search problem, we aim to alleviate the pathology by introducing Q*, a general, versatile and agile framework for guiding LLMs decoding process with deliberative planning. By learning a plug-and-play Q-value model as heuristic function for estimating expected future rewards, our Q* can effectively guide LLMs to select the most promising next reasoning step without fine-tuning LLMs for the current task, which avoids the significant computational overhead and potential risk of performance degeneration on other tasks. Extensive experiments on GSM8K, MATH and MBPP demonstrate the superiority of our method, contributing to improving the reasoning performance of existing open-source LLMs.
♻ ☆ RDF Stream Taxonomy: Systematizing RDF Stream Types in Research and Practice
Over the years, RDF streaming was explored in research and practice from many angles, resulting in a wide range of RDF stream definitions. This variety presents a major challenge in discussing and integrating streaming systems, due to the lack of a common language. This work attempts to address this critical research gap, by systematizing RDF stream types present in the literature in a novel taxonomy. The proposed RDF Stream Taxonomy (RDF-STaX) is embodied in an OWL 2 DL ontology that follows the FAIR principles, making it readily applicable in practice. Extensive documentation and additional resources are provided, to foster the adoption of the ontology. Three use cases for the ontology are presented with accompanying competency questions, demonstrating the usefulness of the resource. Additionally, this work introduces a novel nanopublications dataset, which serves as a collaborative, living state-of-the-art review of RDF streaming. The results of a multifaceted evaluation of the resource are presented, testing its logical validity, use case coverage, and adherence to the community's best practices, while also comparing it to other works. RDF-STaX is expected to help drive innovation in RDF streaming, by fostering scientific discussion, cooperation, and tool interoperability.
♻ ☆ A Convex Hull Cheapest Insertion Heuristic for the Non-Euclidean TSP
The convex hull cheapest insertion heuristic is known to produce good solutions to the Traveling Salesperson Problem in Euclidean spaces, but it has not been extended to the non-Euclidean case. The proposed adaptation uses multidimensional scaling to first project the points into a Euclidean space, thereby enabling the generation of the convex hull that initializes the algorithm. To evaluate the proposed algorithm, non-Euclidean spaces are created by adding impassable separators to the TSPLIB benchmark data-set, or by using the L1 norm as a metric. This adapted heuristic is demonstrated to outperform the commonly used Nearest Neighbor heuristic and Nearest Insertion heuristic in 89% and 99% of the cases studied, respectively. When the genetic algorithm and ant colony optimization algorithms are provided 1 minute of computation time, the proposed heuristic tour costs are lower than the mean metaheuristic solutions found in 87% and 95% of the instances, respectively.
comment: Manuscript submitted 27 January 2024 to the Operations Research Letters
♻ ☆ FAITH: Frequency-domain Attention In Two Horizons for Time Series Forecasting
Time Series Forecasting plays a crucial role in various fields such as industrial equipment maintenance, meteorology, energy consumption, traffic flow and financial investment. However, despite their considerable advantages over traditional statistical approaches, current deep learning-based predictive models often exhibit a significant deviation between their forecasting outcomes and the ground truth. This discrepancy is largely due to an insufficient emphasis on extracting the sequence's latent information, particularly its global information within the frequency domain and the relationship between different variables. To address this issue, we propose a novel model Frequency-domain Attention In Two Horizons, which decomposes time series into trend and seasonal components using a multi-scale sequence adaptive decomposition and fusion architecture, and processes them separately. FAITH utilizes Frequency Channel feature Extraction Module and Frequency Temporal feature Extraction Module to capture inter-channel relationships and temporal global information in the sequence, significantly improving its ability to handle long-term dependencies and complex patterns. Furthermore, FAITH achieves theoretically linear complexity by modifying the time-frequency domain transformation method, effectively reducing computational costs. Extensive experiments on 6 benchmarks for long-term forecasting and 3 benchmarks for short-term forecasting demonstrate that FAITH outperforms existing models in many fields, such as electricity, weather and traffic, proving its effectiveness and superiority both in long-term and short-term time series forecasting tasks. Our codes and data are available at https://github.com/LRQ577/FAITH.
♻ ☆ Stable Diffusion Segmentation for Biomedical Images with Single-step Reverse Process MICCAI 2024
Diffusion models have demonstrated their effectiveness across various generative tasks. However, when applied to medical image segmentation, these models encounter several challenges, including significant resource and time requirements. They also necessitate a multi-step reverse process and multiple samples to produce reliable predictions. To address these challenges, we introduce the first latent diffusion segmentation model, named SDSeg, built upon stable diffusion (SD). SDSeg incorporates a straightforward latent estimation strategy to facilitate a single-step reverse process and utilizes latent fusion concatenation to remove the necessity for multiple samples. Extensive experiments indicate that SDSeg surpasses existing state-of-the-art methods on five benchmark datasets featuring diverse imaging modalities. Remarkably, SDSeg is capable of generating stable predictions with a solitary reverse step and sample, epitomizing the model's stability as implied by its name. The code is available at https://github.com/lin-tianyu/Stable-Diffusion-Seg
comment: Accepted at MICCAI 2024. Code and citation info see https://github.com/lin-tianyu/Stable-Diffusion-Seg
♻ ☆ Discovering hidden physics using ML-based multimodal super-resolution measurement and its application to fusion plasmas
A non-linear complex system governed by multi-spatial and multi-temporal physics scales cannot be fully understood with a single diagnostic, as each provides only a partial view and much information is lost during data extraction. Combining multiple diagnostics also results in imperfect projections of the system's physics. By identifying hidden inter-correlations between diagnostics, we can leverage mutual support to fill in these gaps, but uncovering these inter-correlations analytically is too complex. We introduce a groundbreaking machine learning methodology to address this issue. Our multimodal approach generates super resolution data encompassing multiple physics phenomena, capturing detailed structural evolution and responses to perturbations previously unobservable. This methodology addresses a critical problem in fusion plasmas: the Edge Localized Mode (ELM), a plasma instability that can severely damage reactor walls. One method to stabilize ELM is using resonant magnetic perturbation to trigger magnetic islands. However, low spatial and temporal resolution of measurements limits the analysis of these magnetic islands due to their small size, rapid dynamics, and complex interactions within the plasma. With super-resolution diagnostics, we can experimentally verify theoretical models of magnetic islands for the first time, providing unprecedented insights into their role in ELM stabilization. This advancement aids in developing effective ELM suppression strategies for future fusion reactors like ITER and has broader applications, potentially revolutionizing diagnostics in fields such as astronomy, astrophysics, and medical imaging.
♻ ☆ Accelerating Complex Disease Treatment through Network Medicine and GenAI: A Case Study on Drug Repurposing for Breast Cancer
The objective of this research is to introduce a network specialized in predicting drugs that can be repurposed by investigating real-world evidence sources, such as clinical trials and biomedical literature. Specifically, it aims to generate drug combination therapies for complex diseases (e.g., cancer, Alzheimer's). We present a multilayered network medicine approach, empowered by a highly configured ChatGPT prompt engineering system, which is constructed on the fly to extract drug mentions in clinical trials. Additionally, we introduce a novel algorithm that connects real-world evidence with disease-specific signaling pathways (e.g., KEGG database). This sheds light on the repurposability of drugs if they are found to bind with one or more protein constituents of a signaling pathway. To demonstrate, we instantiated the framework for breast cancer and found that, out of 46 breast cancer signaling pathways, the framework identified 38 pathways that were covered by at least two drugs. This evidence signals the potential for combining those drugs. Specifically, the most covered signaling pathway, ID hsa:2064, was covered by 108 drugs, some of which can be combined. Conversely, the signaling pathway ID hsa:1499 was covered by only two drugs, indicating a significant gap for further research. Our network medicine framework, empowered by GenAI, shows promise in identifying drug combinations with a high degree of specificity, knowing the exact signaling pathways and proteins that serve as targets. It is noteworthy that ChatGPT successfully accelerated the process of identifying drug mentions in clinical trials, though further investigations are required to determine the relationships among the drug mentions.
comment: 9 pages double columns, 5 figures, 3 algorithms, 3 tables, and 1 listing, Submitted to IEEE MedAI'24 Conference, to be held November 15-17, Chongqing, China
♻ ☆ Global Benchmark Database
This paper presents Global Benchmark Database (GBD), a comprehensive suite of tools for provisioning and sustainably maintaining benchmark instances and their metadata. The availability of benchmark metadata is essential for many tasks in empirical research, e.g., for the data-driven compilation of benchmarks, the domain-specific analysis of runtime experiments, or the instance-specific selection of solvers. In this paper, we introduce the data model of GBD as well as its interfaces and provide examples of how to interact with them. We also demonstrate the integration of custom data sources and explain how to extend GBD with additional problem domains, instance formats and feature extractors.
♻ ☆ Evaluating the Social Impact of Generative AI Systems in Systems and Society
Generative AI systems across modalities, ranging from text (including code), image, audio, and video, have broad social impacts, but there is no official standard for means of evaluating those impacts or for which impacts should be evaluated. In this paper, we present a guide that moves toward a standard approach in evaluating a base generative AI system for any modality in two overarching categories: what can be evaluated in a base system independent of context and what can be evaluated in a societal context. Importantly, this refers to base systems that have no predetermined application or deployment context, including a model itself, as well as system components, such as training data. Our framework for a base system defines seven categories of social impact: bias, stereotypes, and representational harms; cultural values and sensitive content; disparate performance; privacy and data protection; financial costs; environmental costs; and data and content moderation labor costs. Suggested methods for evaluation apply to listed generative modalities and analyses of the limitations of existing evaluations serve as a starting point for necessary investment in future evaluations. We offer five overarching categories for what can be evaluated in a broader societal context, each with its own subcategories: trustworthiness and autonomy; inequality, marginalization, and violence; concentration of authority; labor and creativity; and ecosystem and environment. Each subcategory includes recommendations for mitigating harm.
comment: Forthcoming in Hacker, Engel, Hammer, Mittelstadt (eds), Oxford Handbook on the Foundations and Regulation of Generative AI. Oxford University Press
♻ ☆ Deep Support Vectors
Deep learning has achieved tremendous success. \nj{However,} unlike SVMs, which provide direct decision criteria and can be trained with a small dataset, it still has significant weaknesses due to its requirement for massive datasets during training and the black-box characteristics on decision criteria. \nj{This paper addresses} these issues by identifying support vectors in deep learning models. To this end, we propose the DeepKKT condition, an adaptation of the traditional Karush-Kuhn-Tucker (KKT) condition for deep learning models, and confirm that generated Deep Support Vectors (DSVs) using this condition exhibit properties similar to traditional support vectors. This allows us to apply our method to few-shot dataset distillation problems and alleviate the black-box characteristics of deep learning models. Additionally, we demonstrate that the DeepKKT condition can transform conventional classification models into generative models with high fidelity, particularly as latent \jh{generative} models using class labels as latent variables. We validate the effectiveness of DSVs \nj{using common datasets (ImageNet, CIFAR10 \nj{and} CIFAR100) on the general architectures (ResNet and ConvNet)}, proving their practical applicability. (See Fig.~\ref{fig:generated})
♻ ☆ Inference Attacks: A Taxonomy, Survey, and Promising Directions
The prosperity of machine learning has also brought people's concerns about data privacy. Among them, inference attacks can implement privacy breaches in various MLaaS scenarios and model training/prediction phases. Specifically, inference attacks can perform privacy inference on undisclosed target training sets based on outputs of the target model, including but not limited to statistics, membership, semantics, data representation, etc. For instance, infer whether the target data has the characteristics of AIDS. In addition, the rapid development of the machine learning community in recent years, especially the surge of model types and application scenarios, has further stimulated the inference attacks' research. Thus, studying inference attacks and analyzing them in depth is urgent and significant. However, there is still a gap in the systematic discussion of inference attacks from taxonomy, global perspective, attack, and defense perspectives. This survey provides an in-depth and comprehensive inference of attacks and corresponding countermeasures in ML-as-a-service based on taxonomy and the latest researches. Without compromising researchers' intuition, we first propose the 3MP taxonomy based on the community research status, trying to normalize the confusing naming system of inference attacks. Also, we analyze the pros and cons of each type of inference attack, their workflow, countermeasure, and how they interact with other attacks. In the end, we point out several promising directions for researchers from a more comprehensive and novel perspective.
♻ ☆ AsyncDiff: Parallelizing Diffusion Models by Asynchronous Denoising
Diffusion models have garnered significant interest from the community for their great generative ability across various applications. However, their typical multi-step sequential-denoising nature gives rise to high cumulative latency, thereby precluding the possibilities of parallel computation. To address this, we introduce AsyncDiff, a universal and plug-and-play acceleration scheme that enables model parallelism across multiple devices. Our approach divides the cumbersome noise prediction model into multiple components, assigning each to a different device. To break the dependency chain between these components, it transforms the conventional sequential denoising into an asynchronous process by exploiting the high similarity between hidden states in consecutive diffusion steps. Consequently, each component is facilitated to compute in parallel on separate devices. The proposed strategy significantly reduces inference latency while minimally impacting the generative quality. Specifically, for the Stable Diffusion v2.1, AsyncDiff achieves a 2.7x speedup with negligible degradation and a 4.0x speedup with only a slight reduction of 0.38 in CLIP Score, on four NVIDIA A5000 GPUs. Our experiments also demonstrate that AsyncDiff can be readily applied to video diffusion models with encouraging performances. The code is available at https://github.com/czg1225/AsyncDiff.
comment: Work in progress. Project Page: https://czg1225.github.io/asyncdiff_page/
♻ ☆ Time Series Modeling for Heart Rate Prediction: From ARIMA to Transformers
Cardiovascular disease (CVD) is a leading cause of death globally, necessitating precise forecasting models for monitoring vital signs like heart rate, blood pressure, and ECG. Traditional models, such as ARIMA and Prophet, are limited by their need for manual parameter tuning and challenges in handling noisy, sparse, and highly variable medical data. This study investigates advanced deep learning models, including LSTM, and transformer-based architectures, for predicting heart rate time series from the MIT-BIH Database. Results demonstrate that deep learning models, particularly PatchTST, significantly outperform traditional models across multiple metrics, capturing complex patterns and dependencies more effectively. This research underscores the potential of deep learning to enhance patient monitoring and CVD management, suggesting substantial clinical benefits. Future work should extend these findings to larger, more diverse datasets and real-world clinical applications to further validate and optimize model performance.
comment: Accepted by 2024 6th International Conference on Electronic Engineering and Informatics
♻ ☆ Enhancing Text-based Knowledge Graph Completion with Zero-Shot Large Language Models: A Focus on Semantic Enhancement
The design and development of text-based knowledge graph completion (KGC) methods leveraging textual entity descriptions are at the forefront of research. These methods involve advanced optimization techniques such as soft prompts and contrastive learning to enhance KGC models. The effectiveness of text-based methods largely hinges on the quality and richness of the training data. Large language models (LLMs) can utilize straightforward prompts to alter text data, thereby enabling data augmentation for KGC. Nevertheless, LLMs typically demand substantial computational resources. To address these issues, we introduce a framework termed constrained prompts for KGC (CP-KGC). This CP-KGC framework designs prompts that adapt to different datasets to enhance semantic richness. Additionally, CP-KGC employs a context constraint strategy to effectively identify polysemous entities within KGC datasets. Through extensive experimentation, we have verified the effectiveness of this framework. Even after quantization, the LLM (Qwen-7B-Chat-int4) still enhances the performance of text-based KGC methods \footnote{Code and datasets are available at \href{https://github.com/sjlmg/CP-KGC}{https://github.com/sjlmg/CP-KGC}}. This study extends the performance limits of existing models and promotes further integration of KGC with LLMs.
comment: new version
♻ ☆ Fast Sampling via Discrete Non-Markov Diffusion Models
Discrete diffusion models have emerged as powerful tools for high-quality data generation. Despite their success in discrete spaces, such as text generation tasks, the acceleration of discrete diffusion models remains under explored. In this paper, we propose a discrete non-Markov diffusion model, which admits an accelerated reverse sampling for discrete data generation. Our method significantly reduces the number of function evaluations (i.e., calls to the neural network), making the sampling process much faster. Furthermore, we study the transition from finite to infinite step sampling, offering new insights into bridging the gap between discrete and continuous-time processes for discrete diffusion models. Extensive experiments on natural language generation and machine translation tasks demonstrate the superior performance of our method in terms of both generation speed and sample quality compared to existing methods for discrete diffusion models.
comment: 33 pages, 5 figures, 12 tables
♻ ☆ Can Large Language Model Summarizers Adapt to Diverse Scientific Communication Goals? ACL 2024
In this work, we investigate the controllability of large language models (LLMs) on scientific summarization tasks. We identify key stylistic and content coverage factors that characterize different types of summaries such as paper reviews, abstracts, and lay summaries. By controlling stylistic features, we find that non-fine-tuned LLMs outperform humans in the MuP review generation task, both in terms of similarity to reference summaries and human preferences. Also, we show that we can improve the controllability of LLMs with keyword-based classifier-free guidance (CFG) while achieving lexical overlap comparable to strong fine-tuned baselines on arXiv and PubMed. However, our results also indicate that LLMs cannot consistently generate long summaries with more than 8 sentences. Furthermore, these models exhibit limited capacity to produce highly abstractive lay summaries. Although LLMs demonstrate strong generic summarization competency, sophisticated content control without costly fine-tuning remains an open problem for domain-specific applications.
comment: ACL 2024 camera ready
♻ ☆ Large Language Models for Cuffless Blood Pressure Measurement From Wearable Biosignals
Large language models (LLMs) have captured significant interest from both academia and industry due to their impressive performance across various textual tasks. However, the potential of LLMs to analyze physiological time-series data remains an emerging research field. Particularly, there is a notable gap in the utilization of LLMs for analyzing wearable biosignals to achieve cuffless blood pressure (BP) measurement, which is critical for the management of cardiovascular diseases. This paper presents the first work to explore the capacity of LLMs to perform cuffless BP estimation based on wearable biosignals. We extracted physiological features from electrocardiogram (ECG) and photoplethysmogram (PPG) signals and designed context-enhanced prompts by combining these features with BP domain knowledge and user information. Subsequently, we adapted LLMs to BP estimation tasks through fine-tuning. To evaluate the proposed approach, we conducted assessments of ten advanced LLMs using a comprehensive public dataset of wearable biosignals from 1,272 participants. The experimental results demonstrate that the optimally fine-tuned LLM significantly surpasses conventional task-specific baselines, achieving an estimation error of 0.00 $\pm$ 9.25 mmHg for systolic BP and 1.29 $\pm$ 6.37 mmHg for diastolic BP. Notably, the ablation studies highlight the benefits of our context enhancement strategy, leading to an 8.9% reduction in mean absolute error for systolic BP estimation. This paper pioneers the exploration of LLMs for cuffless BP measurement, providing a potential solution to enhance the accuracy of cuffless BP measurement.
♻ ☆ MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases ICML 2024
This paper addresses the growing need for efficient large language models (LLMs) on mobile devices, driven by increasing cloud costs and latency concerns. We focus on designing top-quality LLMs with fewer than a billion parameters, a practical choice for mobile deployment. Contrary to prevailing belief emphasizing the pivotal role of data and parameter quantity in determining model quality, our investigation underscores the significance of model architecture for sub-billion scale LLMs. Leveraging deep and thin architectures, coupled with embedding sharing and grouped-query attention mechanisms, we establish a strong baseline network denoted as MobileLLM, which attains a remarkable 2.7%/4.3% accuracy boost over preceding 125M/350M state-of-the-art models. Additionally, we propose an immediate block-wise weight-sharing approach with no increase in model size and only marginal latency overhead. The resultant models, denoted as MobileLLM-LS, demonstrate a further accuracy enhancement of 0.7%/0.8% than MobileLLM 125M/350M. Moreover, MobileLLM model family shows significant improvements compared to previous sub-billion models on chat benchmarks, and demonstrates close correctness to LLaMA-v2 7B in API calling tasks, highlighting the capability of small models for common on-device use cases.
comment: ICML 2024. Code is available at https://github.com/facebookresearch/MobileLLM
♻ ☆ Can Large Language Models Follow Concept Annotation Guidelines? A Case Study on Scientific and Financial Domains ACL 2024
Although large language models (LLMs) exhibit remarkable capacity to leverage in-context demonstrations, it is still unclear to what extent they can learn new concepts or facts from ground-truth labels. To address this question, we examine the capacity of instruction-tuned LLMs to follow in-context concept guidelines for sentence labeling tasks. We design guidelines that present different types of factual and counterfactual concept definitions, which are used as prompts for zero-shot sentence classification tasks. Our results show that although concept definitions consistently help in task performance, only the larger models (with 70B parameters or more) have limited ability to work under counterfactual contexts. Importantly, only proprietary models such as GPT-3.5 and GPT-4 can recognize nonsensical guidelines, which we hypothesize is due to more sophisticated alignment methods. Finally, we find that Falcon-180B-chat is outperformed by Llama-2-70B-chat is most cases, which indicates that careful fine-tuning is more effective than increasing model scale. Altogether, our simple evaluation method reveals significant gaps in concept understanding between the most capable open-source language models and the leading proprietary APIs.
comment: ACL 2024 camera ready
♻ ☆ Taking Training Seriously: Human Guidance and Management-Based Regulation of Artificial Intelligence
Fervent calls for more robust governance of the harms associated with artificial intelligence (AI) are leading to the adoption around the world of what regulatory scholars have called a management-based approach to regulation. Recent initiatives in the United States and Europe, as well as the adoption of major self-regulatory standards by the International Organization for Standardization, share in common a core management-based paradigm. These management-based initiatives seek to motivate an increase in human oversight of how AI tools are trained and developed. Refinements and systematization of human-guided training techniques will thus be needed to fit within this emerging era of management-based regulatory paradigm. If taken seriously, human-guided training can alleviate some of the technical and ethical pressures on AI, boosting AI performance with human intuition as well as better addressing the needs for fairness and effective explainability. In this paper, we discuss the connection between the emerging management-based regulatory frameworks governing AI and the need for human oversight during training. We broadly cover some of the technical components involved in human-guided training and then argue that the kinds of high-stakes use cases for AI that appear of most concern to regulators should lean more on human-guided training than on data-only training. We hope to foster a discussion between legal scholars and computer scientists involving how to govern a domain of technology that is vast, heterogenous, and dynamic in its applications and risks.
comment: 9 pages, 1 figure
♻ ☆ Towards Open-set Camera 3D Object Detection
Traditional camera 3D object detectors are typically trained to recognize a predefined set of known object classes. In real-world scenarios, these detectors may encounter unknown objects outside the training categories and fail to identify them correctly. To address this gap, we present OS-Det3D (Open-set Camera 3D Object Detection), a two-stage training framework enhancing the ability of camera 3D detectors to identify both known and unknown objects. The framework involves our proposed 3D Object Discovery Network (ODN3D), which is specifically trained using geometric cues such as the location and scale of 3D boxes to discover general 3D objects. ODN3D is trained in a class-agnostic manner, and the provided 3D object region proposals inherently come with data noise. To boost accuracy in identifying unknown objects, we introduce a Joint Objectness Selection (JOS) module. JOS selects the pseudo ground truth for unknown objects from the 3D object region proposals of ODN3D by combining the ODN3D objectness and camera feature attention objectness. Experiments on the nuScenes and KITTI datasets demonstrate the effectiveness of our framework in enabling camera 3D detectors to successfully identify unknown objects while also improving their performance on known objects.
♻ ☆ Methodology of Adapting Large English Language Models for Specific Cultural Contexts
The rapid growth of large language models(LLMs) has emerged as a prominent trend in the field of artificial intelligence. However, current state-of-the-art LLMs are predominantly based on English. They encounter limitations when directly applied to tasks in specific cultural domains, due to deficiencies in domain-specific knowledge and misunderstandings caused by differences in cultural values. To address this challenge, our paper proposes a rapid adaptation method for large models in specific cultural contexts, which leverages instruction-tuning based on specific cultural knowledge and safety values data. Taking Chinese as the specific cultural context and utilizing the LLaMA3-8B as the experimental English LLM, the evaluation results demonstrate that the adapted LLM significantly enhances its capabilities in domain-specific knowledge and adaptability to safety values, while maintaining its original expertise advantages.
comment: 11 pages, 2 figures
♻ ☆ Neural Operator for Accelerating Coronal Magnetic Field Model
Studying the sun's outer atmosphere is challenging due to its complex magnetic fields impacting solar activities. Magnetohydrodynamics (MHD) simulations help model these interactions but are extremely time-consuming (usually on a scale of days). Our research applies the Fourier Neural Operator (FNO) to accelerate the coronal magnetic field modeling, specifically, the Bifrost MHD model. We apply Tensorized FNO (TFNO) to generate solutions from partial differential equations (PDEs) over a 3D domain efficiently. TFNO's performance is compared with other deep learning methods, highlighting its accuracy and scalability. Physics analysis confirms that TFNO is reliable and capable of accelerating MHD simulations with high precision. This advancement improves efficiency in data handling, enhances predictive capabilities, and provides a better understanding of magnetic topologies.
♻ ☆ A Semantic Framework for Neural-Symbolic Computing
Two approaches to AI, neural networks and symbolic systems, have been proven very successful for an array of AI problems. However, neither has been able to achieve the general reasoning ability required for human-like intelligence. It has been argued that this is due to inherent weaknesses in each approach. Luckily, these weaknesses appear to be complementary, with symbolic systems being adept at the kinds of things neural networks have trouble with and vice-versa. The field of neural-symbolic AI attempts to exploit this asymmetry by combining neural networks and symbolic AI into integrated systems. Often this has been done by encoding symbolic knowledge into neural networks. Unfortunately, although many different methods for this have been proposed, there is no common definition of an encoding to compare them. We seek to rectify this problem by introducing a semantic framework for neural-symbolic AI, which is then shown to be general enough to account for a large family of neural-symbolic systems. We provide a number of examples and proofs of the application of the framework to the neural encoding of various forms of knowledge representation and neural network. These, at first sight disparate approaches, are all shown to fall within the framework's formal definition of what we call semantic encoding for neural-symbolic AI.
♻ ☆ Symbolic Prompt Program Search: A Structure-Aware Approach to Efficient Compile-Time Prompt Optimization
In many modern LLM applications, such as retrieval augmented generation, prompts have become programs themselves. In these settings, prompt programs are repeatedly called with different user queries or data instances. A big practical challenge is optimizing such prompt programs. Recent work has mostly focused on either simple prompt programs or assumed that the general structure of a prompt program is fixed. We introduce SAMMO, a framework to perform symbolic prompt program search for compile-time optimizations of prompt programs. SAMMO represents prompt programs on a symbolic level which allows for a rich set of transformations that can be searched over during optimization. We show that SAMMO generalizes previous methods and improves the performance of complex prompts on (1) instruction tuning, (2) RAG pipeline tuning, and (3) prompt compression, across several different LLMs. We make all code available open-source at https://github.com/microsoft/sammo .
♻ ☆ LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models ACL 2024
Efficient fine-tuning is vital for adapting large language models (LLMs) to downstream tasks. However, it requires non-trivial efforts to implement these methods on different models. We present LlamaFactory, a unified framework that integrates a suite of cutting-edge efficient training methods. It provides a solution for flexibly customizing the fine-tuning of 100+ LLMs without the need for coding through the built-in web UI LlamaBoard. We empirically validate the efficiency and effectiveness of our framework on language modeling and text generation tasks. It has been released at https://github.com/hiyouga/LLaMA-Factory and received over 25,000 stars and 3,000 forks.
comment: 13 pages, accepted to ACL 2024 System Demonstration Track
♻ ☆ Cross-Modality Program Representation Learning for Electronic Design Automation with High-Level Synthesis
In recent years, domain-specific accelerators (DSAs) have gained popularity for applications such as deep learning and autonomous driving. To facilitate DSA designs, programmers use high-level synthesis (HLS) to compile a high-level description written in C/C++ into a design with low-level hardware description languages that eventually synthesize DSAs on circuits. However, creating a high-quality HLS design still demands significant domain knowledge, particularly in microarchitecture decisions expressed as \textit{pragmas}. Thus, it is desirable to automate such decisions with the help of machine learning for predicting the quality of HLS designs, requiring a deeper understanding of the program that consists of original code and pragmas. Naturally, these programs can be considered as sequence data. In addition, these programs can be compiled and converted into a control data flow graph (CDFG). But existing works either fail to leverage both modalities or combine the two in shallow or coarse ways. We propose ProgSG, a model that allows interaction between the source code sequence modality and the graph modality in a deep and fine-grained way. To alleviate the scarcity of labeled designs, a pre-training method is proposed based on a suite of compiler's data flow analysis tasks. Experimental results show that ProgSG reduces the RMSE of design performance predictions by up to $22\%$, and identifies designs with an average of $1.10\times$ and $1.26\times$ (up to $8.17\times$ and $13.31\times$) performance improvement in design space exploration (DSE) task compared to HARP and AutoDSE, respectively.
comment: 14 pages, 8 figures. arXiv admin note: text overlap with arXiv:2305.10838
♻ ☆ Condition Monitoring with Incomplete Data: An Integrated Variational Autoencoder and Distance Metric Framework
Condition monitoring of industrial systems is crucial for ensuring safety and maintenance planning, yet notable challenges arise in real-world settings due to the limited or non-existent availability of fault samples. This paper introduces an innovative solution to this problem by proposing a new method for fault detection and condition monitoring for unseen data. Adopting an approach inspired by zero-shot learning, our method can identify faults and assign a relative health index to various operational conditions. Typically, we have plenty of data on normal operations, some data on compromised conditions, and very few (if any) samples of severe faults. We use a variational autoencoder to capture the probabilistic distribution of previously seen and new unseen conditions. The health status is determined by comparing each sample's deviation from a normal operation reference distribution in the latent space. Faults are detected by establishing a threshold for the health indexes, allowing the model to identify severe, unseen faults with high accuracy, even amidst noise. We validate our approach using the run-to-failure IMS-bearing dataset and compare it with other methods. The health indexes generated by our model closely match the established descriptive model of bearing wear, attesting to the robustness and reliability of our method. These findings highlight the potential of our methodology in augmenting fault detection capabilities within industrial domains, thereby contributing to heightened safety protocols and optimized maintenance practices.
comment: Accepted in the 2024 IEEE 20th International Conference on Automation Science and Engineering (CASE 2024)
♻ ☆ Equivariant Symmetry Breaking Sets
Equivariant neural networks (ENNs) have been shown to be extremely effective in applications involving underlying symmetries. By construction ENNs cannot produce lower symmetry outputs given a higher symmetry input. However, symmetry breaking occurs in many physical systems and we may obtain a less symmetric stable state from an initial highly symmetric one. Hence, it is imperative that we understand how to systematically break symmetry in ENNs. In this work, we propose a novel symmetry breaking framework that is fully equivariant and is the first which fully addresses spontaneous symmetry breaking. We emphasize that our approach is general and applicable to equivariance under any group. To achieve this, we introduce the idea of symmetry breaking sets (SBS). Rather than redesign existing networks, we design sets of symmetry breaking objects which we feed into our network based on the symmetry of our inputs and outputs. We show there is a natural way to define equivariance on these sets, which gives an additional constraint. Minimizing the size of these sets equates to data efficiency. We prove that minimizing these sets translates to a well studied group theory problem, and tabulate solutions to this problem for the point groups. Finally, we provide some examples of symmetry breaking to demonstrate how our approach works in practice.
comment: 43 pages, 18 figures
♻ ☆ A New Era in Software Security: Towards Self-Healing Software via Large Language Models and Formal Verification
This paper introduces an innovative approach that combines Large Language Models (LLMs) with Formal Verification strategies for automatic software vulnerability repair. Initially, we employ Bounded Model Checking (BMC) to identify vulnerabilities and extract counterexamples. These counterexamples are supported by mathematical proofs and the stack trace of the vulnerabilities. Using a specially designed prompt, we combine the original source code with the identified vulnerability, including its stack trace and counterexample that specifies the line number and error type. This combined information is then fed into an LLM, which is instructed to attempt to fix the code. The new code is subsequently verified again using BMC to ensure the fix succeeded. We present the ESBMC-AI framework as a proof of concept, leveraging the well-recognized and industry-adopted Efficient SMT-based Context-Bounded Model Checker (ESBMC) and a pre-trained transformer model to detect and fix errors in C programs, particularly in critical software components. We evaluated our approach on 50,000 C programs randomly selected from the FormAI dataset with their respective vulnerability classifications. Our results demonstrate ESBMC-AI's capability to automate the detection and repair of issues such as buffer overflow, arithmetic overflow, and pointer dereference failures with high accuracy. ESBMC-AI is a pioneering initiative, integrating LLMs with BMC techniques, offering potential integration into the continuous integration and deployment (CI/CD) process within the software development lifecycle.
♻ ☆ Meta-GCN: A Dynamically Weighted Loss Minimization Method for Dealing with the Data Imbalance in Graph Neural Networks
Although many real-world applications, such as disease prediction, and fault detection suffer from class imbalance, most existing graph-based classification methods ignore the skewness of the distribution of classes; therefore, tend to be biased towards the majority class(es). Conventional methods typically tackle this problem through the assignment of weights to each one of the class samples based on a function of their loss, which can lead to over-fitting on outliers. In this paper, we propose a meta-learning algorithm, named Meta-GCN, for adaptively learning the example weights by simultaneously minimizing the unbiased meta-data set loss and optimizing the model weights through the use of a small unbiased meta-data set. Through experiments, we have shown that Meta-GCN outperforms state-of-the-art frameworks and other baselines in terms of accuracy, the area under the receiver operating characteristic (AUC-ROC) curve, and macro F1-Score for classification tasks on two different datasets.
♻ ☆ Guylingo: The Republic of Guyana Creole Corpora NAACL 2024
While major languages often enjoy substantial attention and resources, the linguistic diversity across the globe encompasses a multitude of smaller, indigenous, and regional languages that lack the same level of computational support. One such region is the Caribbean. While commonly labeled as "English speaking", the ex-British Caribbean region consists of a myriad of Creole languages thriving alongside English. In this paper, we present Guylingo: a comprehensive corpus designed for advancing NLP research in the domain of Creolese (Guyanese English-lexicon Creole), the most widely spoken language in the culturally rich nation of Guyana. We first outline our framework for gathering and digitizing this diverse corpus, inclusive of colloquial expressions, idioms, and regional variations in a low-resource language. We then demonstrate the challenges of training and evaluating NLP models for machine translation in Creole. Lastly, we discuss the unique opportunities presented by recent NLP advancements for accelerating the formal adoption of Creole languages as official languages in the Caribbean.
comment: Accepted to NAACL 2024 Main Conference Special Theme Track: Languages of Latin America and The Caribbean
♻ ☆ Bioptic -- A Target-Agnostic Efficacy-Based Small Molecules Search Engine
Recent successes in virtual screening have been made possible by large models and extensive chemical libraries. However, combining these elements is challenging: the larger the model, the more expensive it is to run, making ultra-large libraries unfeasible. To address this, we developed a target-agnostic, efficacy-based molecule search model, which allows us to find structurally dissimilar molecules with similar biological activities. We used the best practices to design fast retrieval system, based on processor-optimized SIMD instructions, enabling us to screen the ultra-large 40B Enamine REAL library with 100\% recall rate. We extensively benchmarked our model and several state-of-the-art models for both speed performance and retrieval quality of novel molecules.
♻ ☆ Shaping New Norms for AI
As Artificial Intelligence (AI) becomes increasingly integrated into our lives, the need for new norms is urgent. However, AI evolves at a much faster pace than the characteristic time of norm formation, posing an unprecedented challenge to our societies. This paper examines possible criticalities of the processes of norm formation surrounding AI. Thus, it focuses on how new norms can be established, rather than on what these norms should be. It distinguishes different scenarios based on the centralisation or decentralisation of the norm formation process, analysing the cases where new norms are shaped by formal authorities, informal institutions, or emerge spontaneously in a bottom-up fashion. On the latter point, the paper reports a conversation with ChatGPT in which the LLM discusses some of the emerging norms it has observed. Far from seeking exhaustiveness, this article aims to offer readers interpretive tools to understand society's response to the growing pervasiveness of AI. An outlook on how AI could influence the formation of future social norms emphasises the importance for open societies to anchor their formal deliberation process in an open, inclusive, and transparent public discourse.
Information Retrieval 18
☆ Which Neurons Matter in IR? Applying Integrated Gradients-based Methods to Understand Cross-Encoders ICTIR 2024
With the recent addition of Retrieval-Augmented Generation (RAG), the scope and importance of Information Retrieval (IR) has expanded. As a result, the importance of a deeper understanding of IR models also increases. However, interpretability in IR remains under-explored, especially when it comes to the models' inner mechanisms. In this paper, we explore the possibility of adapting Integrated Gradient-based methods in an IR context to identify the role of individual neurons within the model. In particular, we provide new insights into the role of what we call "relevance" neurons, as well as how they deal with unseen data. Finally, we carry out an in-depth pruning study to validate our findings.
comment: Accepted at ICTIR 2024
☆ Grounded and Transparent Response Generation for Conversational Information-Seeking Systems WSDM '24
While previous conversational information-seeking (CIS) research has focused on passage retrieval, reranking, and query rewriting, the challenge of synthesizing retrieved information into coherent responses remains. The proposed research delves into the intricacies of response generation in CIS systems. Open-ended information-seeking dialogues introduce multiple challenges that may lead to potential pitfalls in system responses. The study focuses on generating responses grounded in the retrieved passages and being transparent about the system's limitations. Specific research questions revolve around obtaining confidence-enriched information nuggets, automatic detection of incomplete or incorrect responses, generating responses communicating the system's limitations, and evaluating enhanced responses. By addressing these research tasks the study aspires to contribute to the advancement of conversational response generation, fostering more trustworthy interactions in CIS dialogues, and paving the way for grounded and transparent systems to meet users' needs in an information-driven world.
comment: Proceedings of the 17th ACM International Conference on Web Search and Data Mining (WSDM '24), 2024
☆ FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts
Existing benchmarks for visual question answering lack in visual grounding and complexity, particularly in evaluating spatial reasoning skills. We introduce FlowVQA, a novel benchmark aimed at assessing the capabilities of visual question-answering multimodal language models in reasoning with flowcharts as visual contexts. FlowVQA comprises 2,272 carefully generated and human-verified flowchart images from three distinct content sources, along with 22,413 diverse question-answer pairs, to test a spectrum of reasoning tasks, including information localization, decision-making, and logical progression. We conduct a thorough baseline evaluation on a suite of both open-source and proprietary multimodal language models using various strategies, followed by an analysis of directional bias. The results underscore the benchmark's potential as a vital tool for advancing the field of multimodal modeling, providing a focused and challenging environment for enhancing model performance in visual and logical reasoning tasks.
☆ RAVEN: Multitask Retrieval Augmented Vision-Language Learning
The scaling of large language models to encode all the world's knowledge in model parameters is unsustainable and has exacerbated resource barriers. Retrieval-Augmented Generation (RAG) presents a potential solution, yet its application to vision-language models (VLMs) is under explored. Existing methods focus on models designed for single tasks. Furthermore, they're limited by the need for resource intensive pre training, additional parameter requirements, unaddressed modality prioritization and lack of clear benefit over non-retrieval baselines. This paper introduces RAVEN, a multitask retrieval augmented VLM framework that enhances base VLMs through efficient, task specific fine-tuning. By integrating retrieval augmented samples without the need for additional retrieval-specific parameters, we show that the model acquires retrieval properties that are effective across multiple tasks. Our results and extensive ablations across retrieved modalities for the image captioning and VQA tasks indicate significant performance improvements compared to non retrieved baselines +1 CIDEr on MSCOCO, +4 CIDEr on NoCaps and nearly a +3\% accuracy on specific VQA question types. This underscores the efficacy of applying RAG approaches to VLMs, marking a stride toward more efficient and accessible multimodal learning.
☆ Statements: Universal Information Extraction from Tables with Large Language Models for ESG KPIs ACL 2024
Environment, Social, and Governance (ESG) KPIs assess an organization's performance on issues such as climate change, greenhouse gas emissions, water consumption, waste management, human rights, diversity, and policies. ESG reports convey this valuable quantitative information through tables. Unfortunately, extracting this information is difficult due to high variability in the table structure as well as content. We propose Statements, a novel domain agnostic data structure for extracting quantitative facts and related information. We propose translating tables to statements as a new supervised deep-learning universal information extraction task. We introduce SemTabNet - a dataset of over 100K annotated tables. Investigating a family of T5-based Statement Extraction Models, our best model generates statements which are 82% similar to the ground-truth (compared to baseline of 21%). We demonstrate the advantages of statements by applying our model to over 2700 tables from ESG reports. The homogeneous nature of statements permits exploratory data analysis on expansive information found in large collections of ESG reports.
comment: Accepted at the NLP4Climate workshop in the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024)
☆ Efficient course recommendations with T5-based ranking and summarization SIGIR 2024
In this paper, we implement and evaluate a two-stage retrieval pipeline for a course recommender system that ranks courses for skill-occupation pairs. The in-production recommender system BrightFit provides course recommendations from multiple sources. Some of the course descriptions are long and noisy, while retrieval and ranking in an online system have to be highly efficient. We developed a two-step retrieval pipeline with RankT5 finetuned on MSMARCO as re-ranker. We compare two summarizers for course descriptions: a LongT5 model that we finetuned for the task, and a generative LLM (Vicuna) with in-context learning. We experiment with quantization to reduce the size of the ranking model and increase inference speed. We evaluate our rankers on two newly labelled datasets, with an A/B test, and with a user questionnaire. On the two labelled datasets, our proposed two-stage ranking with automatic summarization achieves a substantial improvement over the in-production (BM25) ranker: nDCG@10 scores improve from 0.482 to 0.684 and from 0.447 to 0.844 on the two datasets. We also achieve a 40% speed-up by using a quantized version of RankT5. The improved quality of the ranking was confirmed by the questionnaire completed by 29 respondents, but not by the A/B test. In the A/B test, a higher clickthrough rate was observed for the BM25-ranking than for the proposed two-stage retrieval. We conclude that T5-based re-ranking and summarization for online course recommendation can obtain much better effectiveness than single-step lexical retrieval, and that quantization has a large effect on RankT5. In the online evaluation, however, other factors than relevance play a role (such as speed and interpretability of the retrieval results), as well as individual preferences.
comment: ReNeuIR 2024 (at SIGIR 2024) - 3rd Workshop on Reaching Efficiency in Neural Information Retrieval, 18 July, 2024, Washington D.C, USA
☆ Towards a Formal Characterization of User Simulation Objectives in Conversational Information Access SIGIR
User simulation is a promising approach for automatically training and evaluating conversational information access agents, enabling the generation of synthetic dialogues and facilitating reproducible experiments at scale. However, the objectives of user simulation for the different uses remain loosely defined, hindering the development of effective simulators. In this work, we formally characterize the distinct objectives for user simulators: training aims to maximize behavioral similarity to real users, while evaluation focuses on the accurate prediction of real-world conversational agent performance. Through an empirical study, we demonstrate that optimizing for one objective does not necessarily lead to improved performance on the other. This finding underscores the need for tailored design considerations depending on the intended use of the simulator. By establishing clear objectives and proposing concrete measures to evaluate user simulators against those objectives, we pave the way for the development of simulators that are specifically tailored to their intended use, ultimately leading to more effective conversational agents.
comment: Proceedings of the 2024 ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR '24), July 13, 2024, Washington DC, DC, USA
☆ Amplify Graph Learning for Recommendation via Sparsity Completion
Graph learning models have been widely deployed in collaborative filtering (CF) based recommendation systems. Due to the issue of data sparsity, the graph structure of the original input lacks potential positive preference edges, which significantly reduces the performance of recommendations. In this paper, we study how to enhance the graph structure for CF more effectively, thereby optimizing the representation of graph nodes. Previous works introduced matrix completion techniques into CF, proposing the use of either stochastic completion methods or superficial structure completion to address this issue. However, most of these approaches employ random numerical filling that lack control over noise perturbations and limit the in-depth exploration of higher-order interaction features of nodes, resulting in biased graph representations. In this paper, we propose an Amplify Graph Learning framework based on Sparsity Completion (called AGL-SC). First, we utilize graph neural network to mine direct interaction features between user and item nodes, which are used as the inputs of the encoder. Second, we design a factorization-based method to mine higher-order interaction features. These features serve as perturbation factors in the latent space of the hidden layer to facilitate generative enhancement. Finally, by employing the variational inference, the above multi-order features are integrated to implement the completion and enhancement of missing graph structures. We conducted benchmark and strategy experiments on four real-world datasets related to recommendation tasks. The experimental results demonstrate that AGL-SC significantly outperforms the state-of-the-art methods.
☆ Multi-modal Food Recommendation using Clustering and Self-supervised Learning
Food recommendation systems serve as pivotal components in the realm of digital lifestyle services, designed to assist users in discovering recipes and food items that resonate with their unique dietary predilections. Typically, multi-modal descriptions offer an exhaustive profile for each recipe, thereby ensuring recommendations that are both personalized and accurate. Our preliminary investigation of two datasets indicates that pre-trained multi-modal dense representations might precipitate a deterioration in performance compared to ID features when encapsulating interactive relationships. This observation implies that ID features possess a relative superiority in modeling interactive collaborative signals. Consequently, contemporary cutting-edge methodologies augment ID features with multi-modal information as supplementary features, overlooking the latent semantic relations between recipes. To rectify this, we present CLUSSL, a novel food recommendation framework that employs clustering and self-supervised learning. Specifically, CLUSSL formulates a modality-specific graph tailored to each modality with discrete/continuous features, thereby transforming semantic features into structural representation. Furthermore, CLUSSL procures recipe representations pertinent to different modalities via graph convolutional operations. A self-supervised learning objective is proposed to foster independence between recipe representations derived from different unimodal graphs. Comprehensive experiments on real-world datasets substantiate that CLUSSL consistently surpasses state-of-the-art recommendation benchmarks in performance.
comment: Working paper
☆ A Surprisingly Simple yet Effective Multi-Query Rewriting Method for Conversational Passage Retrieval SIGIR
Conversational passage retrieval is challenging as it often requires the resolution of references to previous utterances and needs to deal with the complexities of natural language, such as coreference and ellipsis. To address these challenges, pre-trained sequence-to-sequence neural query rewriters are commonly used to generate a single de-contextualized query based on conversation history. Previous research shows that combining multiple query rewrites for the same user utterance has a positive effect on retrieval performance. We propose the use of a neural query rewriter to generate multiple queries and show how to integrate those queries in the passage retrieval pipeline efficiently. The main strength of our approach lies in its simplicity: it leverages how the beam search algorithm works and can produce multiple query rewrites at no additional cost. Our contributions further include devising ways to utilize multi-query rewrites in both sparse and dense first-pass retrieval. We demonstrate that applying our approach on top of a standard passage retrieval pipeline delivers state-of-the-art performance without sacrificing efficiency.
comment: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval
☆ Towards Personalized Federated Multi-scenario Multi-task Recommendation
In modern recommender system applications, such as e-commerce, predicting multiple targets like click-through rate (CTR) and post-view click-through \& conversion rate (CTCVR) is common. Multi-task recommender systems are gaining traction in research and practical use. Existing multi-task recommender systems tackle diverse business scenarios, merging and modeling these scenarios unlocks shared knowledge to boost overall performance. As new and more complex real-world recommendation scenarios have emerged, data privacy issues make it difficult to train a single global multi-task recommendation model that processes multiple separate scenarios. In this paper, we propose a novel framework for personalized federated multi-scenario multi-task recommendation, called PF-MSMTrec. We assign each scenario to a dedicated client, with each client utilizing the Mixture-of-Experts (MMoE) structure. Our proposed method aims to tackle the unique challenge posed by multiple optimization conflicts in this setting. We introduce a bottom-up joint learning mechanism. Firstly, we design a parameter template to decouple the parameters of the expert network. Thus, scenario parameters are shared knowledge for federated parameter aggregation, while task-specific parameters are personalized local parameters. Secondly, we conduct personalized federated learning for the parameters of each expert network through a federated communication round, utilizing three modules: federated batch normalization, conflict coordination, and personalized aggregation. Finally, we perform another round of personalized federated parameter aggregation on the task tower network to obtain the prediction results for multiple tasks. We conduct extensive experiments on two public datasets, and the results demonstrate that our proposed method surpasses state-of-the-art methods.
☆ Zero-shot Composed Image Retrieval Considering Query-target Relationship Leveraging Masked Image-text Pairs ICIP 2024
This paper proposes a novel zero-shot composed image retrieval (CIR) method considering the query-target relationship by masked image-text pairs. The objective of CIR is to retrieve the target image using a query image and a query text. Existing methods use a textual inversion network to convert the query image into a pseudo word to compose the image and text and use a pre-trained visual-language model to realize the retrieval. However, they do not consider the query-target relationship to train the textual inversion network to acquire information for retrieval. In this paper, we propose a novel zero-shot CIR method that is trained end-to-end using masked image-text pairs. By exploiting the abundant image-text pairs that are convenient to obtain with a masking strategy for learning the query-target relationship, it is expected that accurate zero-shot CIR using a retrieval-focused textual inversion network can be realized. Experimental results show the effectiveness of the proposed method.
comment: Accepted as a conference paper in IEEE ICIP 2024
☆ ELCoRec: Enhance Language Understanding with Co-Propagation of Numerical and Categorical Features for Recommendation
Large language models have been flourishing in the natural language processing (NLP) domain, and their potential for recommendation has been paid much attention to. Despite the intelligence shown by the recommendation-oriented finetuned models, LLMs struggle to fully understand the user behavior patterns due to their innate weakness in interpreting numerical features and the overhead for long context, where the temporal relations among user behaviors, subtle quantitative signals among different ratings, and various side features of items are not well explored. Existing works only fine-tune a sole LLM on given text data without introducing that important information to it, leaving these problems unsolved. In this paper, we propose ELCoRec to Enhance Language understanding with CoPropagation of numerical and categorical features for Recommendation. Concretely, we propose to inject the preference understanding capability into LLM via a GAT expert model where the user preference is better encoded by parallelly propagating the temporal relations, and rating signals as well as various side information of historical items. The parallel propagation mechanism could stabilize heterogeneous features and offer an informative user preference encoding, which is then injected into the language models via soft prompting at the cost of a single token embedding. To further obtain the user's recent interests, we proposed a novel Recent interaction Augmented Prompt (RAP) template. Experiment results over three datasets against strong baselines validate the effectiveness of ELCoRec. The code is available at https://anonymous.4open.science/r/CIKM_Code_Repo-E6F5/README.md.
☆ TocBERT: Medical Document Structure Extraction Using Bidirectional Transformers
Text segmentation holds paramount importance in the field of Natural Language Processing (NLP). It plays an important role in several NLP downstream tasks like information retrieval and document summarization. In this work, we propose a new solution, namely TocBERT, for segmenting texts using bidirectional transformers. TocBERT represents a supervised solution trained on the detection of titles and sub-titles from their semantic representations. This task was formulated as a named entity recognition (NER) problem. The solution has been applied on a medical text segmentation use-case where the Bio-ClinicalBERT model is fine-tuned to segment discharge summaries of the MIMIC-III dataset. The performance of TocBERT has been evaluated on a human-labeled ground truth corpus of 250 notes. It achieved an F1-score of 84.6% when evaluated on a linear text segmentation problem and 72.8% on a hierarchical text segmentation problem. It outperformed a carefully designed rule-based solution, particularly in distinguishing titles from subtitles.
comment: 6 pages, 6 figures
♻ ☆ Accelerating Complex Disease Treatment through Network Medicine and GenAI: A Case Study on Drug Repurposing for Breast Cancer
The objective of this research is to introduce a network specialized in predicting drugs that can be repurposed by investigating real-world evidence sources, such as clinical trials and biomedical literature. Specifically, it aims to generate drug combination therapies for complex diseases (e.g., cancer, Alzheimer's). We present a multilayered network medicine approach, empowered by a highly configured ChatGPT prompt engineering system, which is constructed on the fly to extract drug mentions in clinical trials. Additionally, we introduce a novel algorithm that connects real-world evidence with disease-specific signaling pathways (e.g., KEGG database). This sheds light on the repurposability of drugs if they are found to bind with one or more protein constituents of a signaling pathway. To demonstrate, we instantiated the framework for breast cancer and found that, out of 46 breast cancer signaling pathways, the framework identified 38 pathways that were covered by at least two drugs. This evidence signals the potential for combining those drugs. Specifically, the most covered signaling pathway, ID hsa:2064, was covered by 108 drugs, some of which can be combined. Conversely, the signaling pathway ID hsa:1499 was covered by only two drugs, indicating a significant gap for further research. Our network medicine framework, empowered by GenAI, shows promise in identifying drug combinations with a high degree of specificity, knowing the exact signaling pathways and proteins that serve as targets. It is noteworthy that ChatGPT successfully accelerated the process of identifying drug mentions in clinical trials, though further investigations are required to determine the relationships among the drug mentions.
comment: 9 pages double columns, 5 figures, 3 algorithms, 3 tables, and 1 listing, Submitted to IEEE MedAI'24 Conference, to be held November 15-17, Chongqing, China
♻ ☆ Error Bounds of Supervised Classification from Information-Theoretic Perspective
There remains a list of unanswered research questions on deep learning (DL), including the remarkable generalization power of overparametrized neural networks, the efficient optimization performance despite the non-convexity, and the mechanisms behind flat minima in generalization. In this paper, we adopt an information-theoretic perspective to explore the theoretical foundations of supervised classification using deep neural networks (DNNs). Our analysis introduces the concepts of fitting error and model risk, which, together with generalization error, constitute an upper bound on the expected risk. We demonstrate that the generalization errors are bounded by the complexity, influenced by both the smoothness of distribution and the sample size. Consequently, task complexity serves as a reliable indicator of the dataset's quality, guiding the setting of regularization hyperparameters. Furthermore, the derived upper bound fitting error links the back-propagated gradient, Neural Tangent Kernel (NTK), and the model's parameter count with the fitting error. Utilizing the triangle inequality, we establish an upper bound on the expected risk. This bound offers valuable insights into the effects of overparameterization, non-convex optimization, and the flat minima in DNNs.Finally, empirical verification confirms a significant positive correlation between the derived theoretical bounds and the practical expected risk, confirming the practical relevance of the theoretical findings.
♻ ☆ A Hierarchical Neural Framework for Classification and its Explanation in Large Unstructured Legal Documents CIKM 2023
Automatic legal judgment prediction and its explanation suffer from the problem of long case documents exceeding tens of thousands of words, in general, and having a non-uniform structure. Predicting judgments from such documents and extracting their explanation becomes a challenging task, more so on documents with no structural annotation. We define this problem as "scarce annotated legal documents" and explore their lack of structural information and their long lengths with a deep-learning-based classification framework which we call MESc; "Multi-stage Encoder-based Supervised with-clustering"; for judgment prediction. We explore the adaptability of LLMs with multi-billion parameters (GPT-Neo, and GPT-J) to legal texts and their intra-domain(legal) transfer learning capacity. Alongside this, we compare their performance and adaptability with MESc and the impact of combining embeddings from their last layers. For such hierarchical models, we also propose an explanation extraction algorithm named ORSE; Occlusion sensitivity-based Relevant Sentence Extractor; based on the input-occlusion sensitivity of the model, to explain the predictions with the most relevant sentences from the document. We explore these methods and test their effectiveness with extensive experiments and ablation studies on legal documents from India, the European Union, and the United States with the ILDC dataset and a subset of the LexGLUE dataset. MESc achieves a minimum total performance gain of approximately 2 points over previous state-of-the-art proposed methods, while ORSE applied on MESc achieves a total average gain of 50% over the baseline explainability scores.
comment: Published as non archival paper in the The 3rd International Workshop on Mining and Learning in the Legal Domain (MLLD-2023) at CIKM 2023, Birmingham, United Kingdom. (https://sites.google.com/view/mlld2023/)
♻ ☆ Bioptic -- A Target-Agnostic Efficacy-Based Small Molecules Search Engine
Recent successes in virtual screening have been made possible by large models and extensive chemical libraries. However, combining these elements is challenging: the larger the model, the more expensive it is to run, making ultra-large libraries unfeasible. To address this, we developed a target-agnostic, efficacy-based molecule search model, which allows us to find structurally dissimilar molecules with similar biological activities. We used the best practices to design fast retrieval system, based on processor-optimized SIMD instructions, enabling us to screen the ultra-large 40B Enamine REAL library with 100\% recall rate. We extensively benchmarked our model and several state-of-the-art models for both speed performance and retrieval quality of novel molecules.
Multimedia 2
☆ Taming Data and Transformers for Audio Generation
Generating ambient sounds and effects is a challenging problem due to data scarcity and often insufficient caption quality, making it difficult to employ large-scale generative models for the task. In this work, we tackle the problem by introducing two new models. First, we propose AutoCap, a high-quality and efficient automatic audio captioning model. We show that by leveraging metadata available with the audio modality, we can substantially improve the quality of captions. AutoCap reaches CIDEr score of 83.2, marking a 3.2% improvement from the best available captioning model at four times faster inference speed. We then use AutoCap to caption clips from existing datasets, obtaining 761,000 audio clips with high-quality captions, forming the largest available audio-text dataset. Second, we propose GenAu, a scalable transformer-based audio generation architecture that we scale up to 1.25B parameters and train with our new dataset. When compared to state-of-the-art audio generators, GenAu obtains significant improvements of 15.7% in FAD score, 22.7% in IS, and 13.5% in CLAP score, indicating significantly improved quality of generated audio compared to previous works. This shows that the quality of data is often as important as its quantity. Besides, since AutoCap is fully automatic, new audio samples can be added to the training dataset, unlocking the training of even larger generative models for audio synthesis.
comment: Project Webpage: https://snap-research.github.io/GenAU/
♻ ☆ Towards Alleviating Text-to-Image Retrieval Hallucination for CLIP in Zero-shot Learning
Pretrained cross-modal models, for instance, the most representative CLIP, have recently led to a boom in using pre-trained models for cross-modal zero-shot tasks, considering the generalization properties. However, we analytically discover that CLIP suffers from the text-to-image retrieval hallucination, adversely limiting its capabilities under zero-shot learning: CLIP would select the image with the highest score when asked to figure out which image perfectly matches one given query text among several candidate images even though CLIP knows contents in the image. Accordingly, we propose a Balanced Score with Auxiliary Prompts (BSAP) to mitigate the CLIP's text-to-image retrieval hallucination under zero-shot learning. Specifically, we first design auxiliary prompts to provide multiple reference outcomes for every single image retrieval, then the outcomes derived from each retrieved image in conjunction with the target text are normalized to obtain the final similarity, which alleviates hallucinations in the model. Additionally, we can merge CLIP's original results and BSAP to obtain a more robust hybrid outcome (BSAP-H). Extensive experiments on two typical zero-shot learning tasks, i.e., Referring Expression Comprehension (REC) and Referring Image Segmentation (RIS), are conducted to demonstrate the effectiveness of our BSAP. Specifically, when evaluated on the validation dataset of RefCOCO in REC, BSAP increases CLIP's performance by 20.6%. Further, we validate that our strategy could be applied in other types of pretrained cross-modal models, such as ALBEF and BLIP.
comment: This work has been submitted to the lEEE for possible publication. Copyright may betransferred without notice, after which this version may no longer be accessible
Computer Vision and Pattern Recognition 128
☆ On Scaling Up 3D Gaussian Splatting Training
3D Gaussian Splatting (3DGS) is increasingly popular for 3D reconstruction due to its superior visual quality and rendering speed. However, 3DGS training currently occurs on a single GPU, limiting its ability to handle high-resolution and large-scale 3D reconstruction tasks due to memory constraints. We introduce Grendel, a distributed system designed to partition 3DGS parameters and parallelize computation across multiple GPUs. As each Gaussian affects a small, dynamic subset of rendered pixels, Grendel employs sparse all-to-all communication to transfer the necessary Gaussians to pixel partitions and performs dynamic load balancing. Unlike existing 3DGS systems that train using one camera view image at a time, Grendel supports batched training with multiple views. We explore various optimization hyperparameter scaling strategies and find that a simple sqrt(batch size) scaling rule is highly effective. Evaluations using large-scale, high-resolution scenes show that Grendel enhances rendering quality by scaling up 3DGS parameters across multiple GPUs. On the Rubble dataset, we achieve a test PSNR of 27.28 by distributing 40.4 million Gaussians across 16 GPUs, compared to a PSNR of 26.28 using 11.2 million Gaussians on a single GPU. Grendel is an open-source project available at: https://github.com/nyu-systems/Grendel-GS
comment: Code: https://github.com/nyu-systems/Grendel-GS ; Project page: https://daohanlu.github.io/scaling-up-3dgs
☆ MatchTime: Towards Automatic Soccer Game Commentary Generation
Soccer is a globally popular sport with a vast audience, in this paper, we consider constructing an automatic soccer game commentary model to improve the audiences' viewing experience. In general, we make the following contributions: First, observing the prevalent video-text misalignment in existing datasets, we manually annotate timestamps for 49 matches, establishing a more robust benchmark for soccer game commentary generation, termed as SN-Caption-test-align; Second, we propose a multi-modal temporal alignment pipeline to automatically correct and filter the existing dataset at scale, creating a higher-quality soccer game commentary dataset for training, denoted as MatchTime; Third, based on our curated dataset, we train an automatic commentary generation model, named MatchVoice. Extensive experiments and ablation studies have demonstrated the effectiveness of our alignment pipeline, and training model on the curated datasets achieves state-of-the-art performance for commentary generation, showcasing that better alignment can lead to significant performance improvements in downstream tasks.
comment: Technical Report; Project Page: https://haoningwu3639.github.io/MatchTime/
☆ MultiDiff: Consistent Novel View Synthesis from a Single Image CVPR 2024
We introduce MultiDiff, a novel approach for consistent novel view synthesis of scenes from a single RGB image. The task of synthesizing novel views from a single reference image is highly ill-posed by nature, as there exist multiple, plausible explanations for unobserved areas. To address this issue, we incorporate strong priors in form of monocular depth predictors and video-diffusion models. Monocular depth enables us to condition our model on warped reference images for the target views, increasing geometric stability. The video-diffusion prior provides a strong proxy for 3D scenes, allowing the model to learn continuous and pixel-accurate correspondences across generated images. In contrast to approaches relying on autoregressive image generation that are prone to drifts and error accumulation, MultiDiff jointly synthesizes a sequence of frames yielding high-quality and multi-view consistent results -- even for long-term scene generation with large camera movements, while reducing inference time by an order of magnitude. For additional consistency and image quality improvements, we introduce a novel, structured noise distribution. Our experimental results demonstrate that MultiDiff outperforms state-of-the-art methods on the challenging, real-world datasets RealEstate10K and ScanNet. Finally, our model naturally supports multi-view consistent editing without the need for further tuning.
comment: Project page: https://sirwyver.github.io/MultiDiff Video: https://youtu.be/zBC4z4qXW_4 - CVPR 2024
☆ ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation
We propose a novel text-to-video (T2V) generation benchmark, ChronoMagic-Bench, to evaluate the temporal and metamorphic capabilities of the T2V models (e.g. Sora and Lumiere) in time-lapse video generation. In contrast to existing benchmarks that focus on the visual quality and textual relevance of generated videos, ChronoMagic-Bench focuses on the model's ability to generate time-lapse videos with significant metamorphic amplitude and temporal coherence. The benchmark probes T2V models for their physics, biology, and chemistry capabilities, in a free-form text query. For these purposes, ChronoMagic-Bench introduces 1,649 prompts and real-world videos as references, categorized into four major types of time-lapse videos: biological, human-created, meteorological, and physical phenomena, which are further divided into 75 subcategories. This categorization comprehensively evaluates the model's capacity to handle diverse and complex transformations. To accurately align human preference with the benchmark, we introduce two new automatic metrics, MTScore and CHScore, to evaluate the videos' metamorphic attributes and temporal coherence. MTScore measures the metamorphic amplitude, reflecting the degree of change over time, while CHScore assesses the temporal coherence, ensuring the generated videos maintain logical progression and continuity. Based on the ChronoMagic-Bench, we conduct comprehensive manual evaluations of ten representative T2V models, revealing their strengths and weaknesses across different categories of prompts, and providing a thorough evaluation framework that addresses current gaps in video generation research. Moreover, we create a large-scale ChronoMagic-Pro dataset, containing 460k high-quality pairs of 720p time-lapse videos and detailed captions ensuring high physical pertinence and large metamorphic amplitude.
comment: 31 pages, 15 figures
☆ CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
Chart understanding plays a pivotal role when applying Multimodal Large Language Models (MLLMs) to real-world tasks such as analyzing scientific papers or financial reports. However, existing datasets often focus on oversimplified and homogeneous charts with template-based questions, leading to an over-optimistic measure of progress. We demonstrate that although open-source models can appear to outperform strong proprietary models on these benchmarks, a simple stress test with slightly different charts or questions can deteriorate performance by up to 34.5%. In this work, we propose CharXiv, a comprehensive evaluation suite involving 2,323 natural, challenging, and diverse charts from arXiv papers. CharXiv includes two types of questions: 1) descriptive questions about examining basic chart elements and 2) reasoning questions that require synthesizing information across complex visual elements in the chart. To ensure quality, all charts and questions are handpicked, curated, and verified by human experts. Our results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary model (i.e., GPT-4o), which achieves 47.1% accuracy, and the strongest open-source model (i.e., InternVL Chat V1.5), which achieves 29.2%. All models lag far behind human performance of 80.5%, underscoring weaknesses in the chart understanding capabilities of existing MLLMs. We hope CharXiv facilitates future research on MLLM chart understanding by providing a more realistic and faithful measure of progress. Project page and leaderboard: https://charxiv.github.io/
comment: 121 pages, 90 figures
☆ Denoising as Adaptation: Noise-Space Domain Adaptation for Image Restoration
Although deep learning-based image restoration methods have made significant progress, they still struggle with limited generalization to real-world scenarios due to the substantial domain gap caused by training on synthetic data. Existing methods address this issue by improving data synthesis pipelines, estimating degradation kernels, employing deep internal learning, and performing domain adaptation and regularization. Previous domain adaptation methods have sought to bridge the domain gap by learning domain-invariant knowledge in either feature or pixel space. However, these techniques often struggle to extend to low-level vision tasks within a stable and compact framework. In this paper, we show that it is possible to perform domain adaptation via the noise-space using diffusion models. In particular, by leveraging the unique property of how the multi-step denoising process is influenced by auxiliary conditional inputs, we obtain meaningful gradients from noise prediction to gradually align the restored results of both synthetic and real-world data to a common clean distribution. We refer to this method as denoising as adaptation. To prevent shortcuts during training, we present useful techniques such as channel shuffling and residual-swapping contrastive learning. Experimental results on three classical image restoration tasks, namely denoising, deblurring, and deraining, demonstrate the effectiveness of the proposed method. Code will be released at: https://github.com/KangLiao929/Noise-DA/.
comment: Github Repository: https://github.com/KangLiao929/Noise-DA/
☆ Robust Surgical Phase Recognition From Annotation Efficient Supervision
Surgical phase recognition is a key task in computer-assisted surgery, aiming to automatically identify and categorize the different phases within a surgical procedure. Despite substantial advancements, most current approaches rely on fully supervised training, requiring expensive and time-consuming frame-level annotations. Timestamp supervision has recently emerged as a promising alternative, significantly reducing annotation costs while maintaining competitive performance. However, models trained on timestamp annotations can be negatively impacted by missing phase annotations, leading to a potential drawback in real-world scenarios. In this work, we address this issue by proposing a robust method for surgical phase recognition that can handle missing phase annotations effectively. Furthermore, we introduce the SkipTag@K annotation approach to the surgical domain, enabling a flexible balance between annotation effort and model performance. Our method achieves competitive results on two challenging datasets, demonstrating its efficacy in handling missing phase annotations and its potential for reducing annotation costs. Specifically, we achieve an accuracy of 85.1\% on the MultiBypass140 dataset using only 3 annotated frames per video, showcasing the effectiveness of our method and the potential of the SkipTag@K setup. We perform extensive experiments to validate the robustness of our method and provide valuable insights to guide future research in surgical phase recognition. Our work contributes to the advancement of surgical workflow recognition and paves the way for more efficient and reliable surgical phase recognition systems.
☆ GaussianDreamerPro: Text to Manipulable 3D Gaussians with Highly Enhanced Quality
Recently, 3D Gaussian splatting (3D-GS) has achieved great success in reconstructing and rendering real-world scenes. To transfer the high rendering quality to generation tasks, a series of research works attempt to generate 3D-Gaussian assets from text. However, the generated assets have not achieved the same quality as those in reconstruction tasks. We observe that Gaussians tend to grow without control as the generation process may cause indeterminacy. Aiming at highly enhancing the generation quality, we propose a novel framework named GaussianDreamerPro. The main idea is to bind Gaussians to reasonable geometry, which evolves over the whole generation process. Along different stages of our framework, both the geometry and appearance can be enriched progressively. The final output asset is constructed with 3D Gaussians bound to mesh, which shows significantly enhanced details and quality compared with previous methods. Notably, the generated asset can also be seamlessly integrated into downstream manipulation pipelines, e.g. animation, composition, and simulation etc., greatly promoting its potential in wide applications. Demos are available at https://taoranyi.com/gaussiandreamerpro/.
comment: Project page: https://taoranyi.com/gaussiandreamerpro/
☆ DiffuseHigh: Training-free Progressive High-Resolution Image Synthesis through Structure Guidance
Recent surge in large-scale generative models has spurred the development of vast fields in computer vision. In particular, text-to-image diffusion models have garnered widespread adoption across diverse domain due to their potential for high-fidelity image generation. Nonetheless, existing large-scale diffusion models are confined to generate images of up to 1K resolution, which is far from meeting the demands of contemporary commercial applications. Directly sampling higher-resolution images often yields results marred by artifacts such as object repetition and distorted shapes. Addressing the aforementioned issues typically necessitates training or fine-tuning models on higher resolution datasets. However, this undertaking poses a formidable challenge due to the difficulty in collecting large-scale high-resolution contents and substantial computational resources. While several preceding works have proposed alternatives, they often fail to produce convincing results. In this work, we probe the generative ability of diffusion models at higher resolution beyond its original capability and propose a novel progressive approach that fully utilizes generated low-resolution image to guide the generation of higher resolution image. Our method obviates the need for additional training or fine-tuning which significantly lowers the burden of computational costs. Extensive experiments and results validate the efficiency and efficacy of our method.
☆ Towards Human-Level 3D Relative Pose Estimation: Generalizable, Training-Free, with Single Reference
Humans can easily deduce the relative pose of an unseen object, without label/training, given only a single query-reference image pair. This is arguably achieved by incorporating (i) 3D/2.5D shape perception from a single image, (ii) render-and-compare simulation, and (iii) rich semantic cue awareness to furnish (coarse) reference-query correspondence. Existing methods implement (i) by a 3D CAD model or well-calibrated multiple images and (ii) by training a network on specific objects, which necessitate laborious ground-truth labeling and tedious training, potentially leading to challenges in generalization. Moreover, (iii) was less exploited in the paradigm of (ii), despite that the coarse correspondence from (iii) enhances the compare process by filtering out non-overlapped parts under substantial pose differences/occlusions. Motivated by this, we propose a novel 3D generalizable relative pose estimation method by elaborating (i) with a 2.5D shape from an RGB-D reference, (ii) with an off-the-shelf differentiable renderer, and (iii) with semantic cues from a pretrained model like DINOv2. Specifically, our differentiable renderer takes the 2.5D rotatable mesh textured by the RGB and the semantic maps (obtained by DINOv2 from the RGB input), then renders new RGB and semantic maps (with back-surface culling) under a novel rotated view. The refinement loss comes from comparing the rendered RGB and semantic maps with the query ones, back-propagating the gradients through the differentiable renderer to refine the 3D relative pose. As a result, our method can be readily applied to unseen objects, given only a single RGB-D reference, without label/training. Extensive experiments on LineMOD, LM-O, and YCB-V show that our training-free method significantly outperforms the SOTA supervised methods, especially under the rigorous Acc@5/10/15{\deg} metrics and the challenging cross-dataset settings.
comment: The codes are available at https://github.com/ethanygao/training-free_generalizable_relative_pose
☆ Detecting Brittle Decisions for Free: Leveraging Margin Consistency in Deep Robust Classifiers
Despite extensive research on adversarial training strategies to improve robustness, the decisions of even the most robust deep learning models can still be quite sensitive to imperceptible perturbations, creating serious risks when deploying them for high-stakes real-world applications. While detecting such cases may be critical, evaluating a model's vulnerability at a per-instance level using adversarial attacks is computationally too intensive and unsuitable for real-time deployment scenarios. The input space margin is the exact score to detect non-robust samples and is intractable for deep neural networks. This paper introduces the concept of margin consistency -- a property that links the input space margins and the logit margins in robust models -- for efficient detection of vulnerable samples. First, we establish that margin consistency is a necessary and sufficient condition to use a model's logit margin as a score for identifying non-robust samples. Next, through comprehensive empirical analysis of various robustly trained models on CIFAR10 and CIFAR100 datasets, we show that they indicate strong margin consistency with a strong correlation between their input space margins and the logit margins. Then, we show that we can effectively use the logit margin to confidently detect brittle decisions with such models and accurately estimate robust accuracy on an arbitrarily large test set by estimating the input margins only on a small subset. Finally, we address cases where the model is not sufficiently margin-consistent by learning a pseudo-margin from the feature representation. Our findings highlight the potential of leveraging deep representations to efficiently assess adversarial vulnerability in deployment scenarios.
comment: 11 pages, 7 figures, 2 tables, 1 algorithm
☆ Unveiling the Unknown: Conditional Evidence Decoupling for Unknown Rejection
In this paper, we focus on training an open-set object detector under the condition of scarce training samples, which should distinguish the known and unknown categories. Under this challenging scenario, the decision boundaries of unknowns are difficult to learn and often ambiguous. To mitigate this issue, we develop a novel open-set object detection framework, which delves into conditional evidence decoupling for the unknown rejection. Specifically, we select pseudo-unknown samples by leveraging the discrepancy in attribution gradients between known and unknown classes, alleviating the inadequate unknown distribution coverage of training data. Subsequently, we propose a Conditional Evidence Decoupling Loss (CEDL) based on Evidential Deep Learning (EDL) theory, which decouples known and unknown properties in pseudo-unknown samples to learn distinct knowledge, enhancing separability between knowns and unknowns. Additionally, we propose an Abnormality Calibration Loss (ACL), which serves as a regularization term to adjust the output probability distribution, establishing robust decision boundaries for the unknown rejection. Our method has achieved the superiority performance over previous state-of-the-art approaches, improving the mean recall of unknown class by 7.24% across all shots in VOC10-5-5 dataset settings and 1.38% in VOC-COCO dataset settings. The code is available via https://github.com/zjzwzw/CED-FOOD.
☆ Facial Image Feature Analysis and its Specialization for Fréchet Distance and Neighborhoods
Assessing distances between images and image datasets is a fundamental task in vision-based research. It is a challenging open problem in the literature and despite the criticism it receives, the most ubiquitous method remains the Fr\'echet Inception Distance. The Inception network is trained on a specific labeled dataset, ImageNet, which has caused the core of its criticism in the most recent research. Improvements were shown by moving to self-supervision learning over ImageNet, leaving the training data domain as an open question. We make that last leap and provide the first analysis on domain-specific feature training and its effects on feature distance, on the widely-researched facial image domain. We provide our findings and insights on this domain specialization for Fr\'echet distance and image neighborhoods, supported by extensive experiments and in-depth user studies.
☆ Repeat and Concatenate: 2D to 3D Image Translation with 3D to 3D Generative Modeling CVPR
This paper investigates a 2D to 3D image translation method with a straightforward technique, enabling correlated 2D X-ray to 3D CT-like reconstruction. We observe that existing approaches, which integrate information across multiple 2D views in the latent space, lose valuable signal information during latent encoding. Instead, we simply repeat and concatenate the 2D views into higher-channel 3D volumes and approach the 3D reconstruction challenge as a straightforward 3D to 3D generative modeling problem, sidestepping several complex modeling issues. This method enables the reconstructed 3D volume to retain valuable information from the 2D inputs, which are passed between channel states in a Swin UNETR backbone. Our approach applies neural optimal transport, which is fast and stable to train, effectively integrating signal information across multiple views without the requirement for precise alignment; it produces non-collapsed reconstructions that are highly faithful to the 2D views, even after limited training. We demonstrate correlated results, both qualitatively and quantitatively, having trained our model on a single dataset and evaluated its generalization ability across six datasets, including out-of-distribution samples.
comment: CVPRW 2024 - DCA in MI; Best Paper Award
☆ BiTrack: Bidirectional Offline 3D Multi-Object Tracking Using Camera-LiDAR Data
Compared with real-time multi-object tracking (MOT), offline multi-object tracking (OMOT) has the advantages to perform 2D-3D detection fusion, erroneous link correction, and full track optimization but has to deal with the challenges from bounding box misalignment and track evaluation, editing, and refinement. This paper proposes "BiTrack", a 3D OMOT framework that includes modules of 2D-3D detection fusion, initial trajectory generation, and bidirectional trajectory re-optimization to achieve optimal tracking results from camera-LiDAR data. The novelty of this paper includes threefold: (1) development of a point-level object registration technique that employs a density-based similarity metric to achieve accurate fusion of 2D-3D detection results; (2) development of a set of data association and track management skills that utilizes a vertex-based similarity metric as well as false alarm rejection and track recovery mechanisms to generate reliable bidirectional object trajectories; (3) development of a trajectory re-optimization scheme that re-organizes track fragments of different fidelities in a greedy fashion, as well as refines each trajectory with completion and smoothing techniques. The experiment results on the KITTI dataset demonstrate that BiTrack achieves the state-of-the-art performance for 3D OMOT tasks in terms of accuracy and efficiency.
☆ DoubleTake: Geometry Guided Depth Estimation
Estimating depth from a sequence of posed RGB images is a fundamental computer vision task, with applications in augmented reality, path planning etc. Prior work typically makes use of previous frames in a multi view stereo framework, relying on matching textures in a local neighborhood. In contrast, our model leverages historical predictions by giving the latest 3D geometry data as an extra input to our network. This self-generated geometric hint can encode information from areas of the scene not covered by the keyframes and it is more regularized when compared to individual predicted depth maps for previous frames. We introduce a Hint MLP which combines cost volume features with a hint of the prior geometry, rendered as a depth map from the current camera location, together with a measure of the confidence in the prior geometry. We demonstrate that our method, which can run at interactive speeds, achieves state-of-the-art estimates of depth and 3D scene reconstruction in both offline and incremental evaluation scenarios.
☆ From Majority to Minority: A Diffusion-based Augmentation for Underrepresented Groups in Skin Lesion Analysis
AI-based diagnoses have demonstrated dermatologist-level performance in classifying skin cancer. However, such systems are prone to under-performing when tested on data from minority groups that lack sufficient representation in the training sets. Although data collection and annotation offer the best means for promoting minority groups, these processes are costly and time-consuming. Prior works have suggested that data from majority groups may serve as a valuable information source to supplement the training of diagnosis tools for minority groups. In this work, we propose an effective diffusion-based augmentation framework that maximizes the use of rich information from majority groups to benefit minority groups. Using groups with different skin types as a case study, our results show that the proposed framework can generate synthetic images that improve diagnostic results for the minority groups, even when there is little or no reference data from these target groups. The practical value of our work is evident in medical imaging analysis, where under-diagnosis persists as a problem for certain groups due to insufficient representation.
☆ Stable Diffusion Segmentation for Biomedical Images with Single-step Reverse Process MICCAI 2024
Diffusion models have demonstrated their effectiveness across various generative tasks. However, when applied to medical image segmentation, these models encounter several challenges, including significant resource and time requirements. They also necessitate a multi-step reverse process and multiple samples to produce reliable predictions. To address these challenges, we introduce the first latent diffusion segmentation model, named SDSeg, built upon stable diffusion (SD). SDSeg incorporates a straightforward latent estimation strategy to facilitate a single-step reverse process and utilizes latent fusion concatenation to remove the necessity for multiple samples. Extensive experiments indicate that SDSeg surpasses existing state-of-the-art methods on five benchmark datasets featuring diverse imaging modalities. Remarkably, SDSeg is capable of generating stable predictions with a solitary reverse step and sample, epitomizing the model's stability as implied by its name. The code is available at https://github.com/lin-tianyu/Stable-Diffusion-Seg
comment: Accepted at MICCAI 2024. Code and citation info see https://github.com/lin-tianyu/Stable-Diffusion-Seg
☆ XLD: A Cross-Lane Dataset for Benchmarking Novel Driving View Synthesis
Thoroughly testing autonomy systems is crucial in the pursuit of safe autonomous driving vehicles. It necessitates creating safety-critical scenarios that go beyond what can be safely collected from real-world data, as many of these scenarios occur infrequently on public roads. However, the evaluation of most existing NVS methods relies on sporadic sampling of image frames from the training data, comparing the rendered images with ground truth images using metrics. Unfortunately, this evaluation protocol falls short of meeting the actual requirements in closed-loop simulations. Specifically, the true application demands the capability to render novel views that extend beyond the original trajectory (such as cross-lane views), which are challenging to capture in the real world. To address this, this paper presents a novel driving view synthesis dataset and benchmark specifically designed for autonomous driving simulations. This dataset is unique as it includes testing images captured by deviating from the training trajectory by 1-4 meters. It comprises six sequences encompassing various time and weather conditions. Each sequence contains 450 training images, 150 testing images, and their corresponding camera poses and intrinsic parameters. Leveraging this novel dataset, we establish the first realistic benchmark for evaluating existing NVS approaches under front-only and multi-camera settings. The experimental findings underscore the significant gap that exists in current approaches, revealing their inadequate ability to fulfill the demanding prerequisites of cross-lane or closed-loop simulation. Our dataset is released publicly at the project page: https://3d-aigc.github.io/XLD/.
comment: project page: https://3d-aigc.github.io/XLD/
☆ On Reducing Activity with Distillation and Regularization for Energy Efficient Spiking Neural Networks
Interest in spiking neural networks (SNNs) has been growing steadily, promising an energy-efficient alternative to formal neural networks (FNNs), commonly known as artificial neural networks (ANNs). Despite increasing interest, especially for Edge applications, these event-driven neural networks suffered from their difficulty to be trained compared to FNNs. To alleviate this problem, a number of innovative methods have been developed to provide performance more or less equivalent to that of FNNs. However, the spiking activity of a network during inference is usually not considered. While SNNs may usually have performance comparable to that of FNNs, it is often at the cost of an increase of the network's activity, thus limiting the benefit of using them as a more energy-efficient solution. In this paper, we propose to leverage Knowledge Distillation (KD) for SNNs training with surrogate gradient descent in order to optimize the trade-off between performance and spiking activity. Then, after understanding why KD led to an increase in sparsity, we also explored Activations regularization and proposed a novel method with Logits Regularization. These approaches, validated on several datasets, clearly show a reduction in network spiking activity (-26.73% on GSC and -14.32% on CIFAR-10) while preserving accuracy.
☆ AlignedCut: Visual Concepts Discovery on Brain-Guided Universal Feature Space
We study the intriguing connection between visual data, deep networks, and the brain. Our method creates a universal channel alignment by using brain voxel fMRI response prediction as the training objective. We discover that deep networks, trained with different objectives, share common feature channels across various models. These channels can be clustered into recurring sets, corresponding to distinct brain regions, indicating the formation of visual concepts. Tracing the clusters of channel responses onto the images, we see semantically meaningful object segments emerge, even without any supervised decoder. Furthermore, the universal feature alignment and the clustering of channels produce a picture and quantification of how visual information is processed through the different network layers, which produces precise comparisons between the networks.
☆ Continuous Sign Language Recognition Using Intra-inter Gloss Attention
Many continuous sign language recognition (CSLR) studies adopt transformer-based architectures for sequence modeling due to their powerful capacity for capturing global contexts. Nevertheless, vanilla self-attention, which serves as the core module of the transformer, calculates a weighted average over all time steps; therefore, the local temporal semantics of sign videos may not be fully exploited. In this study, we introduce a novel module in sign language recognition studies, called intra-inter gloss attention module, to leverage the relationships among frames within glosses and the semantic and grammatical dependencies between glosses in the video. In the intra-gloss attention module, the video is divided into equally sized chunks and a self-attention mechanism is applied within each chunk. This localized self-attention significantly reduces complexity and eliminates noise introduced by considering non-relative frames. In the inter-gloss attention module, we first aggregate the chunk-level features within each gloss chunk by average pooling along the temporal dimension. Subsequently, multi-head self-attention is applied to all chunk-level features. Given the non-significance of the signer-environment interaction, we utilize segmentation to remove the background of the videos. This enables the proposed model to direct its focus toward the signer. Experimental results on the PHOENIX-2014 benchmark dataset demonstrate that our method can effectively extract sign language features in an end-to-end manner without any prior knowledge, improve the accuracy of CSLR, and achieve the word error rate (WER) of 20.4 on the test set which is a competitive result compare to the state-of-the-art which uses additional supervisions.
☆ Multi-modal Evidential Fusion Network for Trusted PET/CT Tumor Segmentation
Accurate segmentation of tumors in PET/CT images is important in computer-aided diagnosis and treatment of cancer. The key issue of such a segmentation problem lies in the effective integration of complementary information from PET and CT images. However, the quality of PET and CT images varies widely in clinical settings, which leads to uncertainty in the modality information extracted by networks. To take the uncertainty into account in multi-modal information fusion, this paper proposes a novel Multi-modal Evidential Fusion Network (MEFN) comprising a Cross-Modal Feature Learning (CFL) module and a Multi-modal Trusted Fusion (MTF) module. The CFL module reduces the domain gap upon modality conversion and highlights common tumor features, thereby alleviating the needs of the segmentation module to handle modality specificity. The MTF module utilizes mutual attention mechanisms and an uncertainty calibrator to fuse modality features based on modality uncertainty and then fuse the segmentation results under the guidance of Dempster-Shafer Theory. Besides, a new uncertainty perceptual loss is introduced to force the model focusing on uncertain features and hence improve its ability to extract trusted modality information. Extensive comparative experiments are conducted on two publicly available PET/CT datasets to evaluate the performance of our proposed method whose results demonstrate that our MEFN significantly outperforms state-of-the-art methods with improvements of 2.15% and 3.23% in DSC scores on the AutoPET dataset and the Hecktor dataset, respectively. More importantly, our model can provide radiologists with credible uncertainty of the segmentation results for their decision in accepting or rejecting the automatic segmentation results, which is particularly important for clinical applications. Our code will be available at https://github.com/QPaws/MEFN.
☆ Spatial-temporal Hierarchical Reinforcement Learning for Interpretable Pathology Image Super-Resolution
Pathology image are essential for accurately interpreting lesion cells in cytopathology screening, but acquiring high-resolution digital slides requires specialized equipment and long scanning times. Though super-resolution (SR) techniques can alleviate this problem, existing deep learning models recover pathology image in a black-box manner, which can lead to untruthful biological details and misdiagnosis. Additionally, current methods allocate the same computational resources to recover each pixel of pathology image, leading to the sub-optimal recovery issue due to the large variation of pathology image. In this paper, we propose the first hierarchical reinforcement learning framework named Spatial-Temporal hierARchical Reinforcement Learning (STAR-RL), mainly for addressing the aforementioned issues in pathology image super-resolution problem. We reformulate the SR problem as a Markov decision process of interpretable operations and adopt the hierarchical recovery mechanism in patch level, to avoid sub-optimal recovery. Specifically, the higher-level spatial manager is proposed to pick out the most corrupted patch for the lower-level patch worker. Moreover, the higher-level temporal manager is advanced to evaluate the selected patch and determine whether the optimization should be stopped earlier, thereby avoiding the over-processed problem. Under the guidance of spatial-temporal managers, the lower-level patch worker processes the selected patch with pixel-wise interpretable actions at each time step. Experimental results on medical images degraded by different kernels show the effectiveness of STAR-RL. Furthermore, STAR-RL validates the promotion in tumor diagnosis with a large margin and shows generalizability under various degradations. The source code is available at https://github.com/CUHK-AIM-Group/STAR-RL.
comment: Accepted to IEEE TRANSACTIONS ON MEDICAL IMAGING (TMI)
☆ Evaluating and Benchmarking Foundation Models for Earth Observation and Geospatial AI
When we are primarily interested in solving several problems jointly with a given prescribed high performance accuracy for each target application, then Foundation Models should for most cases be used rather than problem-specific models. We focus on the specific Computer Vision application of Foundation Models for Earth Observation (EO) and geospatial AI. These models can solve important problems we are tackling, including for example land cover classification, crop type mapping, flood segmentation, building density estimation, and road regression segmentation. In this paper, we show that for a limited number of labelled data, Foundation Models achieve improved performance compared to problem-specific models. In this work, we also present our proposed evaluation benchmark for Foundation Models for EO. Benchmarking the generalization performance of Foundation Models is important as it has become difficult to standardize a fair comparison across the many different models that have been proposed recently. We present the results using our evaluation benchmark for EO Foundation Models and show that Foundation Models are label efficient in the downstream tasks and help us solve problems we are tackling in EO and remote sensing.
comment: 5 pages, 2 figures, Submitted
☆ RealTalk: Real-time and Realistic Audio-driven Face Generation with 3D Facial Prior-guided Identity Alignment Network
Person-generic audio-driven face generation is a challenging task in computer vision. Previous methods have achieved remarkable progress in audio-visual synchronization, but there is still a significant gap between current results and practical applications. The challenges are two-fold: 1) Preserving unique individual traits for achieving high-precision lip synchronization. 2) Generating high-quality facial renderings in real-time performance. In this paper, we propose a novel generalized audio-driven framework RealTalk, which consists of an audio-to-expression transformer and a high-fidelity expression-to-face renderer. In the first component, we consider both identity and intra-personal variation features related to speaking lip movements. By incorporating cross-modal attention on the enriched facial priors, we can effectively align lip movements with audio, thus attaining greater precision in expression prediction. In the second component, we design a lightweight facial identity alignment (FIA) module which includes a lip-shape control structure and a face texture reference structure. This novel design allows us to generate fine details in real-time, without depending on sophisticated and inefficient feature alignment modules. Our experimental results, both quantitative and qualitative, on public datasets demonstrate the clear advantages of our method in terms of lip-speech synchronization and generation quality. Furthermore, our method is efficient and requires fewer computational resources, making it well-suited to meet the needs of practical applications.
☆ CAS: Confidence Assessments of classification algorithms for Semantic segmentation of EO data
Confidence assessments of semantic segmentation algorithms in remote sensing are important. It is a desirable property of models to a priori know if they produce an incorrect output. Evaluations of the confidence assigned to the estimates of models for the task of classification in Earth Observation (EO) are crucial as they can be used to achieve improved semantic segmentation performance and prevent high error rates during inference and deployment. The model we develop, the Confidence Assessments of classification algorithms for Semantic segmentation (CAS) model, performs confidence evaluations at both the segment and pixel levels, and outputs both labels and confidence. The outcome of this work has important applications. The main application is the evaluation of EO Foundation Models on semantic segmentation downstream tasks, in particular land cover classification using satellite Copernicus Sentinel-2 data. The evaluation shows that the proposed model is effective and outperforms other alternative baseline models.
comment: 5 pages, 7 figures, 4 tables, Submitted
☆ Generalized Deepfake Attribution
The landscape of fake media creation changed with the introduction of Generative Adversarial Networks (GAN s). Fake media creation has been on the rise with the rapid advances in generation technology, leading to new challenges in Detecting fake media. A fundamental characteristic of GAN s is their sensitivity to parameter initialization, known as seeds. Each distinct seed utilized during training leads to the creation of unique model instances, resulting in divergent image outputs despite employing the same architecture. This means that even if we have one GAN architecture, it can produce countless variations of GAN models depending on the seed used. Existing methods for attributing deepfakes work well only if they have seen the specific GAN model during training. If the GAN architectures are retrained with a different seed, these methods struggle to attribute the fakes. This seed dependency issue made it difficult to attribute deepfakes with existing methods. We proposed a generalized deepfake attribution network (GDA-N et) to attribute fake images to their respective GAN architectures, even if they are generated from a retrained version of the GAN architecture with a different seed (cross-seed) or from the fine-tuned version of the existing GAN model. Extensive experiments on cross-seed and fine-tuned data of GAN models show that our method is highly effective compared to existing methods. We have provided the source code to validate our results.
☆ On the Role of Visual Grounding in VQA
Visual Grounding (VG) in VQA refers to a model's proclivity to infer answers based on question-relevant image regions. Conceptually, VG identifies as an axiomatic requirement of the VQA task. In practice, however, DNN-based VQA models are notorious for bypassing VG by way of shortcut (SC) learning without suffering obvious performance losses in standard benchmarks. To uncover the impact of SC learning, Out-of-Distribution (OOD) tests have been proposed that expose a lack of VG with low accuracy. These tests have since been at the center of VG research and served as basis for various investigations into VG's impact on accuracy. However, the role of VG in VQA still remains not fully understood and has not yet been properly formalized. In this work, we seek to clarify VG's role in VQA by formalizing it on a conceptual level. We propose a novel theoretical framework called "Visually Grounded Reasoning" (VGR) that uses the concepts of VG and Reasoning to describe VQA inference in ideal OOD testing. By consolidating fundamental insights into VG's role in VQA, VGR helps to reveal rampant VG-related SC exploitation in OOD testing, which explains why the relationship between VG and OOD accuracy has been difficult to define. Finally, we propose an approach to create OOD tests that properly emphasize a requirement for VG, and show how to improve performance on them.
☆ Foundational Models for Pathology and Endoscopy Images: Application for Gastric Inflammation
The integration of artificial intelligence (AI) in medical diagnostics represents a significant advancement in managing upper gastrointestinal (GI) cancer, a major cause of global cancer mortality. Specifically for gastric cancer (GC), chronic inflammation causes changes in the mucosa such as atrophy, intestinal metaplasia (IM), dysplasia and ultimately cancer. Early detection through endoscopic regular surveillance is essential for better outcomes. Foundation models (FM), which are machine or deep learning models trained on diverse data and applicable to broad use cases, offer a promising solution to enhance the accuracy of endoscopy and its subsequent pathology image analysis. This review explores the recent advancements, applications, and challenges associated with FM in endoscopy and pathology imaging. We started by elucidating the core principles and architectures underlying these models, including their training methodologies and the pivotal role of large-scale data in developing their predictive capabilities. Moreover, this work discusses emerging trends and future research directions, emphasizing the integration of multimodal data, the development of more robust and equitable models, and the potential for real-time diagnostic support. This review aims to provide a roadmap for researchers and practitioners in navigating the complexities of incorporating FM into clinical practice for prevention/management of GC cases, thereby improving patient outcomes.
☆ Generative artificial intelligence in ophthalmology: multimodal retinal images for the diagnosis of Alzheimer's disease with convolutional neural networks
Background/Aim. This study aims to predict Amyloid Positron Emission Tomography (AmyloidPET) status with multimodal retinal imaging and convolutional neural networks (CNNs) and to improve the performance through pretraining with synthetic data. Methods. Fundus autofluorescence, optical coherence tomography (OCT), and OCT angiography images from 328 eyes of 59 AmyloidPET positive subjects and 108 AmyloidPET negative subjects were used for classification. Denoising Diffusion Probabilistic Models (DDPMs) were trained to generate synthetic images and unimodal CNNs were pretrained on synthetic data and finetuned on real data or trained solely on real data. Multimodal classifiers were developed to combine predictions of the four unimodal CNNs with patient metadata. Class activation maps of the unimodal classifiers provided insight into the network's attention to inputs. Results. DDPMs generated diverse, realistic images without memorization. Pretraining unimodal CNNs with synthetic data improved AUPR at most from 0.350 to 0.579. Integration of metadata in multimodal CNNs improved AUPR from 0.486 to 0.634, which was the best overall best classifier. Class activation maps highlighted relevant retinal regions which correlated with AD. Conclusion. Our method for generating and leveraging synthetic data has the potential to improve AmyloidPET prediction from multimodal retinal imaging. A DDPM can generate realistic and unique multimodal synthetic retinal images. Our best performing unimodal and multimodal classifiers were not pretrained on synthetic data, however pretraining with synthetic data slightly improved classification performance for two out of the four modalities.
☆ ConStyle v2: A Strong Prompter for All-in-One Image Restoration
This paper introduces ConStyle v2, a strong plug-and-play prompter designed to output clean visual prompts and assist U-Net Image Restoration models in handling multiple degradations. The joint training process of IRConStyle, an Image Restoration framework consisting of ConStyle and a general restoration network, is divided into two stages: first, pre-training ConStyle alone, and then freezing its weights to guide the training of the general restoration network. Three improvements are proposed in the pre-training stage to train ConStyle: unsupervised pre-training, adding a pretext task (i.e. classification), and adopting knowledge distillation. Without bells and whistles, we can get ConStyle v2, a strong prompter for all-in-one Image Restoration, in less than two GPU days and doesn't require any fine-tuning. Extensive experiments on Restormer (transformer-based), NAFNet (CNN-based), MAXIM-1S (MLP-based), and a vanilla CNN network demonstrate that ConStyle v2 can enhance any U-Net style Image Restoration models to all-in-one Image Restoration models. Furthermore, models guided by the well-trained ConStyle v2 exhibit superior performance in some specific degradation compared to ConStyle.
☆ Concordance in basal cell carcinoma diagnosis. Building a proper ground truth to train Artificial Intelligence tools
Background: The existence of different basal cell carcinoma (BCC) clinical criteria cannot be objectively validated. An adequate ground-truth is needed to train an artificial intelligence (AI) tool that explains the BCC diagnosis by providing its dermoscopic features. Objectives: To determine the consensus among dermatologists on dermoscopic criteria of 204 BCC. To analyze the performance of an AI tool when the ground-truth is inferred. Methods: A single center, diagnostic and prospective study was conducted to analyze the agreement in dermoscopic criteria by four dermatologists and then derive a reference standard. 1434 dermoscopic images have been used, that were taken by a primary health physician, sent via teledermatology, and diagnosed by a dermatologist. They were randomly selected from the teledermatology platform (2019-2021). 204 of them were tested with an AI tool; the remainder trained it. The performance of the AI tool trained using the ground-truth of one dermatologist versus the ground-truth statistically inferred from the consensus of four dermatologists was analyzed using McNemar's test and Hamming distance. Results: Dermatologists achieve perfect agreement in the diagnosis of BCC (Fleiss-Kappa=0.9079), and a high correlation with the biopsy (PPV=0.9670). However, there is low agreement in detecting some dermoscopic criteria. Statistical differences were found in the performance of the AI tool trained using the ground-truth of one dermatologist versus the ground-truth statistically inferred from the consensus of four dermatologists. Conclusions: Care should be taken when training an AI tool to determine the BCC patterns present in a lesion. Ground-truth should be established from multiple dermatologists.
comment: Manuscript word count: 3000, Number of figures: 2, Number of tables: 3
☆ CoDA: Interactive Segmentation and Morphological Analysis of Dendroid Structures Exemplified on Stony Cold-Water Corals
Herein, we present CoDA, the Coral Dendroid structure Analyzer, a visual analytics suite that allows for the first time to investigate the ontogenetic morphological development of complex dendroid coral colonies, exemplified on three important framework-forming dendroid cold-water corals: Lophelia pertusa (Linnaeus, 1758), Madrepora oculata (Linnaeus, 1758), and Goniocorella dumosa (Alcock, 1902). Input to CoDA is an initial instance segmentation of the coral polyp cavities (calices), from which it estimates the skeleton tree of the colony and extracts classical morphological measurements and advanced shape features of the individual corallites. CoDA also works as a proofreading and error correction tool by helping to identify wrong parts in the skeleton tree and providing tools to quickly correct these errors. The final skeleton tree enables the derivation of additional information about the calices/corallite instances that otherwise could not be obtained, including their ontogenetic generation and branching patterns - the basis of a fully quantitative statistical analysis of the coral colony morphology. Part of CoDA is CoDAGraph, a feature-rich link-and-brush user interface for visualizing the extracted features and 2D graph layouts of the skeleton tree, enabling the real-time exploration of complex coral colonies and their building blocks, the individual corallites and branches. In the future, we expect CoDA to greatly facilitate the analysis of large stony corals of different species and morphotypes, as well as other dendroid structures, enabling new insights into the influence of genetic and environmental factors on their ontogenetic morphological development.
☆ GUIDE: A Guideline-Guided Dataset for Instructional Video Comprehension IJCAI 2024
There are substantial instructional videos on the Internet, which provide us tutorials for completing various tasks. Existing instructional video datasets only focus on specific steps at the video level, lacking experiential guidelines at the task level, which can lead to beginners struggling to learn new tasks due to the lack of relevant experience. Moreover, the specific steps without guidelines are trivial and unsystematic, making it difficult to provide a clear tutorial. To address these problems, we present the GUIDE (Guideline-Guided) dataset, which contains 3.5K videos of 560 instructional tasks in 8 domains related to our daily life. Specifically, we annotate each instructional task with a guideline, representing a common pattern shared by all task-related videos. On this basis, we annotate systematic specific steps, including their associated guideline steps, specific step descriptions and timestamps. Our proposed benchmark consists of three sub-tasks to evaluate comprehension ability of models: (1) Step Captioning: models have to generate captions for specific steps from videos. (2) Guideline Summarization: models have to mine the common pattern in task-related videos and summarize a guideline from them. (3) Guideline-Guided Captioning: models have to generate captions for specific steps under the guide of guideline. We evaluate plenty of foundation models with GUIDE and perform in-depth analysis. Given the diversity and practicality of GUIDE, we believe that it can be used as a better benchmark for instructional video comprehension.
comment: IJCAI 2024
☆ Guiding Video Prediction with Explicit Procedural Knowledge ICCV
We propose a general way to integrate procedural knowledge of a domain into deep learning models. We apply it to the case of video prediction, building on top of object-centric deep models and show that this leads to a better performance than using data-driven models alone. We develop an architecture that facilitates latent space disentanglement in order to use the integrated procedural knowledge, and establish a setup that allows the model to learn the procedural interface in the latent space using the downstream task of video prediction. We contrast the performance to a state-of-the-art data-driven approach and show that problems where purely data-driven approaches struggle can be handled by using knowledge about the domain, providing an alternative to simply collecting more data.
comment: Published in 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)
☆ Unlocking the Potential of Operations Research for Multi-Graph Matching
We consider the incomplete multi-graph matching problem, which is a generalization of the NP-hard quadratic assignment problem for matching multiple finite sets. Multi-graph matching plays a central role in computer vision, e.g., for matching images or shapes, so that a number of dedicated optimization techniques have been proposed. While the closely related NP-hard multi-dimensional assignment problem (MDAP) has been studied for decades in the operations research community, it only considers complete matchings and has a different cost structure. We bridge this gap and transfer well-known approximation algorithms for the MDAP to incomplete multi-graph matching. To this end, we revisit respective algorithms, adapt them to incomplete multi-graph matching, and propose their extended and parallelized versions. Our experimental validation shows that our new method substantially outperforms the previous state of the art in terms of objective and runtime. Our algorithm matches, for example, 29 images with more than 500 keypoints each in less than two minutes, whereas the fastest considered competitor requires at least half an hour while producing far worse results.
☆ Trimming the Fat: Efficient Compression of 3D Gaussian Splats through Pruning
In recent times, the utilization of 3D models has gained traction, owing to the capacity for end-to-end training initially offered by Neural Radiance Fields and more recently by 3D Gaussian Splatting (3DGS) models. The latter holds a significant advantage by inherently easing rapid convergence during training and offering extensive editability. However, despite rapid advancements, the literature still lives in its infancy regarding the scalability of these models. In this study, we take some initial steps in addressing this gap, showing an approach that enables both the memory and computational scalability of such models. Specifically, we propose "Trimming the fat", a post-hoc gradient-informed iterative pruning technique to eliminate redundant information encoded in the model. Our experimental findings on widely acknowledged benchmarks attest to the effectiveness of our approach, revealing that up to 75% of the Gaussians can be removed while maintaining or even improving upon baseline performance. Our approach achieves around 50$\times$ compression while preserving performance similar to the baseline model, and is able to speed-up computation up to 600~FPS.
☆ Joint Stream: Malignant Region Learning for Breast Cancer Diagnosis
Early diagnosis of breast cancer (BC) significantly contributes to reducing the mortality rate worldwide. The detection of different factors and biomarkers such as Estrogen receptor (ER), Progesterone receptor (PR), Human epidermal growth factor receptor 2 (HER2) gene, Histological grade (HG), Auxiliary lymph node (ALN) status, and Molecular subtype (MS) can play a significant role in improved BC diagnosis. However, the existing methods predict only a single factor which makes them less suitable to use in diagnosis and designing a strategy for treatment. In this paper, we propose to classify the six essential indicating factors (ER, PR, HER2, ALN, HG, MS) for early BC diagnosis using H\&E stained WSI's. To precisely capture local neighboring relationships, we use spatial and frequency domain information from the large patch size of WSI's malignant regions. Furthermore, to cater the variable number of regions of interest sizes and give due attention to each region, we propose a malignant region learning attention network. Our experimental results demonstrate that combining spatial and frequency information using the malignant region learning module significantly improves multi-factor and single-factor classification performance on publicly available datasets.
comment: Under Review (Biomedical Signal Processing and Control)
☆ EFCNet: Every Feature Counts for Small Medical Object Segmentation
This paper explores the segmentation of very small medical objects with significant clinical value. While Convolutional Neural Networks (CNNs), particularly UNet-like models, and recent Transformers have shown substantial progress in image segmentation, our empirical findings reveal their poor performance in segmenting the small medical objects and lesions concerned in this paper. This limitation may be attributed to information loss during their encoding and decoding process. In response to this challenge, we propose a novel model named EFCNet for small object segmentation in medical images. Our model incorporates two modules: the Cross-Stage Axial Attention Module (CSAA) and the Multi-Precision Supervision Module (MPS). These modules address information loss during encoding and decoding procedures, respectively. Specifically, CSAA integrates features from all stages of the encoder to adaptively learn suitable information needed in different decoding stages, thereby reducing information loss in the encoder. On the other hand, MPS introduces a novel multi-precision supervision mechanism to the decoder. This mechanism prioritizes attention to low-resolution features in the initial stages of the decoder, mitigating information loss caused by subsequent convolution and sampling processes and enhancing the model's global perception. We evaluate our model on two benchmark medical image datasets. The results demonstrate that EFCNet significantly outperforms previous segmentation methods designed for both medical and normal images.
☆ GS-Octree: Octree-based 3D Gaussian Splatting for Robust Object-level 3D Reconstruction Under Strong Lighting
The 3D Gaussian Splatting technique has significantly advanced the construction of radiance fields from multi-view images, enabling real-time rendering. While point-based rasterization effectively reduces computational demands for rendering, it often struggles to accurately reconstruct the geometry of the target object, especially under strong lighting. To address this challenge, we introduce a novel approach that combines octree-based implicit surface representations with Gaussian splatting. Our method consists of four stages. Initially, it reconstructs a signed distance field (SDF) and a radiance field through volume rendering, encoding them in a low-resolution octree. The initial SDF represents the coarse geometry of the target object. Subsequently, it introduces 3D Gaussians as additional degrees of freedom, which are guided by the SDF. In the third stage, the optimized Gaussians further improve the accuracy of the SDF, allowing it to recover finer geometric details compared to the initial SDF obtained in the first stage. Finally, it adopts the refined SDF to further optimize the 3D Gaussians via splatting, eliminating those that contribute little to visual appearance. Experimental results show that our method, which leverages the distribution of 3D Gaussians with SDFs, reconstructs more accurate geometry, particularly in images with specular highlights caused by strong lighting.
☆ VDG: Vision-Only Dynamic Gaussian for Driving Simulation
Dynamic Gaussian splatting has led to impressive scene reconstruction and image synthesis advances in novel views. Existing methods, however, heavily rely on pre-computed poses and Gaussian initialization by Structure from Motion (SfM) algorithms or expensive sensors. For the first time, this paper addresses this issue by integrating self-supervised VO into our pose-free dynamic Gaussian method (VDG) to boost pose and depth initialization and static-dynamic decomposition. Moreover, VDG can work with only RGB image input and construct dynamic scenes at a faster speed and larger scenes compared with the pose-free dynamic view-synthesis method. We demonstrate the robustness of our approach via extensive quantitative and qualitative experiments. Our results show favorable performance over the state-of-the-art dynamic view synthesis methods. Additional video and source code will be posted on our project page at https://3d-aigc.github.io/VDG.
☆ Human-free Prompted Based Anomaly Detection: prompt optimization with Meta-guiding prompt scheme
Pre-trained vision-language models (VLMs) are highly adaptable to various downstream tasks through few-shot learning, making prompt-based anomaly detection a promising approach. Traditional methods depend on human-crafted prompts that require prior knowledge of specific anomaly types. Our goal is to develop a human-free prompt-based anomaly detection framework that optimally learns prompts through data-driven methods, eliminating the need for human intervention. The primary challenge in this approach is the lack of anomalous samples during the training phase. Additionally, the Vision Transformer (ViT)-based image encoder in VLMs is not ideal for pixel-wise anomaly segmentation due to a locality feature mismatch between the original image and the output feature map. To tackle the first challenge, we have developed the Object-Attention Anomaly Generation Module (OAGM) to synthesize anomaly samples for training. Furthermore, our Meta-Guiding Prompt-Tuning Scheme (MPTS) iteratively adjusts the gradient-based optimization direction of learnable prompts to avoid overfitting to the synthesized anomalies. For the second challenge, we propose Locality-Aware Attention, which ensures that each local patch feature attends only to nearby patch features, preserving the locality features corresponding to their original locations. This framework allows for the optimal prompt embeddings by searching in the continuous latent space via backpropagation, free from human semantic constraints. Additionally, the modified locality-aware attention improves the precision of pixel-wise anomaly segmentation.
☆ MammothModa: Multi-Modal Large Language Model
In this report, we introduce MammothModa, yet another multi-modal large language model (MLLM) designed to achieve state-of-the-art performance starting from an elementary baseline. We focus on three key design insights: (i) Integrating Visual Capabilities while Maintaining Complex Language Understanding: In addition to the vision encoder, we incorporated the Visual Attention Experts into the LLM to enhance its visual capabilities. (ii) Extending Context Window for High-Resolution and Long-Duration Visual Feature: We explore the Visual Merger Module to effectively reduce the token number of high-resolution images and incorporated frame position ids to avoid position interpolation. (iii) High-Quality Bilingual Datasets: We meticulously curated and filtered a high-quality bilingual multimodal dataset to reduce visual hallucinations. With above recipe we build MammothModa that consistently outperforms the state-of-the-art models, e.g., LLaVA-series, across main real-world visual language benchmarks without bells and whistles.
comment: Technical report
☆ VIPriors 4: Visual Inductive Priors for Data-Efficient Deep Learning Challenges
The fourth edition of the "VIPriors: Visual Inductive Priors for Data-Efficient Deep Learning" workshop features two data-impaired challenges. These challenges address the problem of training deep learning models for computer vision tasks with limited data. Participants are limited to training models from scratch using a low number of training samples and are not allowed to use any form of transfer learning. We aim to stimulate the development of novel approaches that incorporate inductive biases to improve the data efficiency of deep learning models. Significant advancements are made compared to the provided baselines, where winning solutions surpass the baselines by a considerable margin in both tasks. As in previous editions, these achievements are primarily attributed to heavy use of data augmentation policies and large model ensembles, though novel prior-based methods seem to contribute more to successful solutions compared to last year. This report highlights the key aspects of the challenges and their outcomes.
☆ Human-Aware 3D Scene Generation with Spatially-constrained Diffusion Models
Generating 3D scenes from human motion sequences supports numerous applications, including virtual reality and architectural design. However, previous auto-regression-based human-aware 3D scene generation methods have struggled to accurately capture the joint distribution of multiple objects and input humans, often resulting in overlapping object generation in the same space. To address this limitation, we explore the potential of diffusion models that simultaneously consider all input humans and the floor plan to generate plausible 3D scenes. Our approach not only satisfies all input human interactions but also adheres to spatial constraints with the floor plan. Furthermore, we introduce two spatial collision guidance mechanisms: human-object collision avoidance and object-room boundary constraints. These mechanisms help avoid generating scenes that conflict with human motions while respecting layout constraints. To enhance the diversity and accuracy of human-guided scene generation, we have developed an automated pipeline that improves the variety and plausibility of human-object interactions in the existing 3D FRONT HUMAN dataset. Extensive experiments on both synthetic and real-world datasets demonstrate that our framework can generate more natural and plausible 3D scenes with precise human-scene interactions, while significantly reducing human-object collisions compared to previous state-of-the-art methods. Our code and data will be made publicly available upon publication of this work.
☆ 3D-MVP: 3D Multiview Pretraining for Robotic Manipulation
Recent works have shown that visual pretraining on egocentric datasets using masked autoencoders (MAE) can improve generalization for downstream robotics tasks. However, these approaches pretrain only on 2D images, while many robotics applications require 3D scene understanding. In this work, we propose 3D-MVP, a novel approach for 3D multi-view pretraining using masked autoencoders. We leverage Robotic View Transformer (RVT), which uses a multi-view transformer to understand the 3D scene and predict gripper pose actions. We split RVT's multi-view transformer into visual encoder and action decoder, and pretrain its visual encoder using masked autoencoding on large-scale 3D datasets such as Objaverse. We evaluate 3D-MVP on a suite of virtual robot manipulation tasks and demonstrate improved performance over baselines. We also show promising results on a real robot platform with minimal finetuning. Our results suggest that 3D-aware pretraining is a promising approach to improve sample efficiency and generalization of vision-based robotic manipulation policies. We will release code and pretrained models for 3D-MVP to facilitate future research. Project site: https://jasonqsy.github.io/3DMVP
☆ SynRS3D: A Synthetic Dataset for Global 3D Semantic Understanding from Monocular Remote Sensing Imagery
Global semantic 3D understanding from single-view high-resolution remote sensing (RS) imagery is crucial for Earth Observation (EO). However, this task faces significant challenges due to the high costs of annotations and data collection, as well as geographically restricted data availability. To address these challenges, synthetic data offer a promising solution by being easily accessible and thus enabling the provision of large and diverse datasets. We develop a specialized synthetic data generation pipeline for EO and introduce SynRS3D, the largest synthetic RS 3D dataset. SynRS3D comprises 69,667 high-resolution optical images that cover six different city styles worldwide and feature eight land cover types, precise height information, and building change masks. To further enhance its utility, we develop a novel multi-task unsupervised domain adaptation (UDA) method, RS3DAda, coupled with our synthetic dataset, which facilitates the RS-specific transition from synthetic to real scenarios for land cover mapping and height estimation tasks, ultimately enabling global monocular 3D semantic understanding based on synthetic data. Extensive experiments on various real-world datasets demonstrate the adaptability and effectiveness of our synthetic dataset and proposed RS3DAda method. SynRS3D and related codes will be available.
☆ A Refer-and-Ground Multimodal Large Language Model for Biomedicine MICCAI2024
With the rapid development of multimodal large language models (MLLMs), especially their capabilities in visual chat through refer and ground functionalities, their significance is increasingly recognized. However, the biomedical field currently exhibits a substantial gap in this area, primarily due to the absence of a dedicated refer and ground dataset for biomedical images. To address this challenge, we devised the Med-GRIT-270k dataset. It comprises 270k question-and-answer pairs and spans eight distinct medical imaging modalities. Most importantly, it is the first dedicated to the biomedical domain and integrating refer and ground conversations. The key idea is to sample large-scale biomedical image-mask pairs from medical segmentation datasets and generate instruction datasets from text using chatGPT. Additionally, we introduce a Refer-and-Ground Multimodal Large Language Model for Biomedicine (BiRD) by using this dataset and multi-task instruction learning. Extensive experiments have corroborated the efficacy of the Med-GRIT-270k dataset and the multi-modal, fine-grained interactive capabilities of the BiRD model. This holds significant reference value for the exploration and development of intelligent biomedical assistants.
comment: Accepted by MICCAI2024
☆ Artificial Immune System of Secure Face Recognition Against Adversarial Attacks
Insect production for food and feed presents a promising supplement to ensure food safety and address the adverse impacts of agriculture on climate and environment in the future. However, optimisation is required for insect production to realise its full potential. This can be by targeted improvement of traits of interest through selective breeding, an approach which has so far been underexplored and underutilised in insect farming. Here we present a comprehensive review of the selective breeding framework in the context of insect production. We systematically evaluate adjustments of selective breeding techniques to the realm of insects and highlight the essential components integral to the breeding process. The discussion covers every step of a conventional breeding scheme, such as formulation of breeding objectives, phenotyping, estimation of genetic parameters and breeding values, selection of appropriate breeding strategies, and mitigation of issues associated with genetic diversity depletion and inbreeding. This review combines knowledge from diverse disciplines, bridging the gap between animal breeding, quantitative genetics, evolutionary biology, and entomology, offering an integrated view of the insect breeding research area and uniting knowledge which has previously remained scattered across diverse fields of expertise.
☆ Exclusive Style Removal for Cross Domain Novel Class Discovery
As a promising field in open-world learning, \textit{Novel Class Discovery} (NCD) is usually a task to cluster unseen novel classes in an unlabeled set based on the prior knowledge of labeled data within the same domain. However, the performance of existing NCD methods could be severely compromised when novel classes are sampled from a different distribution with the labeled ones. In this paper, we explore and establish the solvability of NCD in cross domain setting with the necessary condition that style information must be removed. Based on the theoretical analysis, we introduce an exclusive style removal module for extracting style information that is distinctive from the baseline features, thereby facilitating inference. Moreover, this module is easy to integrate with other NCD methods, acting as a plug-in to improve performance on novel classes with different distributions compared to the seen labeled set. Additionally, recognizing the non-negligible influence of different backbones and pre-training strategies on the performance of the NCD methods, we build a fair benchmark for future NCD research. Extensive experiments on three common datasets demonstrate the effectiveness of our proposed module.
☆ LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference
Long-context Multimodal Large Language Models (MLLMs) demand substantial computational resources for inference as the growth of their multimodal Key-Value (KV) cache, in response to increasing input lengths, challenges memory and time efficiency. Unlike single-modality LLMs that manage only textual contexts, the KV cache of long-context MLLMs includes representations from multiple images with temporal and spatial relationships and related textual contexts. The predominance of image tokens means traditional optimizations for LLMs' KV caches are unsuitable for multimodal long-context settings, and no prior works have addressed this challenge. In this work, we introduce LOOK-M, a pioneering, fine-tuning-free approach that efficiently reduces the multimodal KV cache size while maintaining performance comparable to a full cache. We observe that during prompt prefill, the model prioritizes more textual attention over image features, and based on the multimodal interaction observation, a new proposed text-prior method is explored to compress the KV cache. Furthermore, to mitigate the degradation of image contextual information, we propose several compensatory strategies using KV pairs merging. LOOK-M demonstrates that with a significant reduction in KV Cache memory usage, such as reducing it by 80% in some cases, it not only achieves up to 1.5x faster decoding but also maintains or even enhances performance across a variety of long context multimodal tasks.
☆ CTS: Sim-to-Real Unsupervised Domain Adaptation on 3D Detection
Simulation data can be accurately labeled and have been expected to improve the performance of data-driven algorithms, including object detection. However, due to the various domain inconsistencies from simulation to reality (sim-to-real), cross-domain object detection algorithms usually suffer from dramatic performance drops. While numerous unsupervised domain adaptation (UDA) methods have been developed to address cross-domain tasks between real-world datasets, progress in sim-to-real remains limited. This paper presents a novel Complex-to-Simple (CTS) framework to transfer models from labeled simulation (source) to unlabeled reality (target) domains. Based on a two-stage detector, the novelty of this work is threefold: 1) developing fixed-size anchor heads and RoI augmentation to address size bias and feature diversity between two domains, thereby improving the quality of pseudo-label; 2) developing a novel corner-format representation of aleatoric uncertainty (AU) for the bounding box, to uniformly quantify pseudo-label quality; 3) developing a noise-aware mean teacher domain adaptation method based on AU, as well as object-level and frame-level sampling strategies, to migrate the impact of noisy labels. Experimental results demonstrate that our proposed approach significantly enhances the sim-to-real domain adaptation capability of 3D object detection models, outperforming state-of-the-art cross-domain algorithms, which are usually developed for real-to-real UDA tasks.
☆ Open-vocabulary Mobile Manipulation in Unseen Dynamic Environments with 3D Semantic Maps
Open-Vocabulary Mobile Manipulation (OVMM) is a crucial capability for autonomous robots, especially when faced with the challenges posed by unknown and dynamic environments. This task requires robots to explore and build a semantic understanding of their surroundings, generate feasible plans to achieve manipulation goals, adapt to environmental changes, and comprehend natural language instructions from humans. To address these challenges, we propose a novel framework that leverages the zero-shot detection and grounded recognition capabilities of pretraining visual-language models (VLMs) combined with dense 3D entity reconstruction to build 3D semantic maps. Additionally, we utilize large language models (LLMs) for spatial region abstraction and online planning, incorporating human instructions and spatial semantic context. We have built a 10-DoF mobile manipulation robotic platform JSR-1 and demonstrated in real-world robot experiments that our proposed framework can effectively capture spatial semantics and process natural language user instructions for zero-shot OVMM tasks under dynamic environment settings, with an overall navigation and task success rate of 80.95% and 73.33% over 105 episodes, and better SFT and SPL by 157.18% and 19.53% respectively compared to the baseline. Furthermore, the framework is capable of replanning towards the next most probable candidate location based on the spatial semantic context derived from the 3D semantic map when initial plans fail, keeping an average success rate of 76.67%.
comment: Open-vocabulary, Mobile Manipulation, Dynamic Environments, 3D Semantic Maps, Zero-shot, LLMs, VLMs, 18 pages, 2 figures
☆ The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval
Recent studies have shown promising results in utilizing multimodal large language models (MLLMs) for computer vision tasks such as object detection and semantic segmentation. However, many challenging video tasks remain under-explored. Video-language tasks necessitate spatial and temporal comprehension and require significant compute. Therefore, prior works have developed complex, highly specialized architectures or leveraged additional input signals such as video transcripts to best encode contextual and temporal information, which limits their generality and can be impractical. One particularly challenging task is video moment retrieval, which requires precise temporal and contextual grounding. This work demonstrates the surprising effectiveness of leveraging image-text pretrained MLLMs for moment retrieval. We introduce Mr. BLIP (Mr. as in Moment Retrieval), a multimodal, single-stage model that requires no expensive video-language pretraining, no additional input signal (e.g., no transcript or audio), and has a simpler and more versatile design than prior state-of-the-art methods. We achieve a new state-of-the-art in moment retrieval on the widely used benchmarks Charades-STA, QVHighlights, and ActivityNet Captions and illustrate our method's versatility with a new state-of-the-art in temporal action localization on ActivityNet. Notably, we attain over 9% (absolute) higher Recall (at 0.5 and 0.7 IoU) on the challenging long-video multi-moment QVHighlights benchmark. Our code is publicly available.
comment: 16 pages, 3 figures
☆ A Lung Nodule Dataset with Histopathology-based Cancer Type Annotation
Recently, Computer-Aided Diagnosis (CAD) systems have emerged as indispensable tools in clinical diagnostic workflows, significantly alleviating the burden on radiologists. Nevertheless, despite their integration into clinical settings, CAD systems encounter limitations. Specifically, while CAD systems can achieve high performance in the detection of lung nodules, they face challenges in accurately predicting multiple cancer types. This limitation can be attributed to the scarcity of publicly available datasets annotated with expert-level cancer type information. This research aims to bridge this gap by providing publicly accessible datasets and reliable tools for medical diagnosis, facilitating a finer categorization of different types of lung diseases so as to offer precise treatment recommendations. To achieve this objective, we curated a diverse dataset of lung Computed Tomography (CT) images, comprising 330 annotated nodules (nodules are labeled as bounding boxes) from 95 distinct patients. The quality of the dataset was evaluated using a variety of classical classification and detection models, and these promising results demonstrate that the dataset has a feasible application and further facilitate intelligent auxiliary diagnosis.
☆ MFDNet: Multi-Frequency Deflare Network for Efficient Nighttime Flare Removal
When light is scattered or reflected accidentally in the lens, flare artifacts may appear in the captured photos, affecting the photos' visual quality. The main challenge in flare removal is to eliminate various flare artifacts while preserving the original content of the image. To address this challenge, we propose a lightweight Multi-Frequency Deflare Network (MFDNet) based on the Laplacian Pyramid. Our network decomposes the flare-corrupted image into low and high-frequency bands, effectively separating the illumination and content information in the image. The low-frequency part typically contains illumination information, while the high-frequency part contains detailed content information. So our MFDNet consists of two main modules: the Low-Frequency Flare Perception Module (LFFPM) to remove flare in the low-frequency part and the Hierarchical Fusion Reconstruction Module (HFRM) to reconstruct the flare-free image. Specifically, to perceive flare from a global perspective while retaining detailed information for image restoration, LFFPM utilizes Transformer to extract global information while utilizing a convolutional neural network to capture detailed local features. Then HFRM gradually fuses the outputs of LFFPM with the high-frequency component of the image through feature aggregation. Moreover, our MFDNet can reduce the computational cost by processing in multiple frequency bands instead of directly removing the flare on the input image. Experimental results demonstrate that our approach outperforms state-of-the-art methods in removing nighttime flare on real-world and synthetic images from the Flare7K dataset. Furthermore, the computational complexity of our model is remarkably low.
comment: Accepted by The Visual Computer journal
☆ Few-Shot Medical Image Segmentation with High-Fidelity Prototypes
Few-shot Semantic Segmentation (FSS) aims to adapt a pretrained model to new classes with as few as a single labelled training sample per class. Despite the prototype based approaches have achieved substantial success, existing models are limited to the imaging scenarios with considerably distinct objects and not highly complex background, e.g., natural images. This makes such models suboptimal for medical imaging with both conditions invalid. To address this problem, we propose a novel Detail Self-refined Prototype Network (DSPNet) to constructing high-fidelity prototypes representing the object foreground and the background more comprehensively. Specifically, to construct global semantics while maintaining the captured detail semantics, we learn the foreground prototypes by modelling the multi-modal structures with clustering and then fusing each in a channel-wise manner. Considering that the background often has no apparent semantic relation in the spatial dimensions, we integrate channel-specific structural information under sparse channel-aware regulation. Extensive experiments on three challenging medical image benchmarks show the superiority of DSPNet over previous state-of-the-art methods.
☆ EgoVideo: Exploring Egocentric Foundation Model and Downstream Adaptation CVPR 2024
In this report, we present our solutions to the EgoVis Challenges in CVPR 2024, including five tracks in the Ego4D challenge and three tracks in the EPIC-Kitchens challenge. Building upon the video-language two-tower model and leveraging our meticulously organized egocentric video data, we introduce a novel foundation model called EgoVideo. This model is specifically designed to cater to the unique characteristics of egocentric videos and provides strong support for our competition submissions. In the Ego4D challenges, we tackle various tasks including Natural Language Queries, Step Grounding, Moment Queries, Short-term Object Interaction Anticipation, and Long-term Action Anticipation. In addition, we also participate in the EPIC-Kitchens challenge, where we engage in the Action Recognition, Multiple Instance Retrieval, and Domain Adaptation for Action Recognition tracks. By adapting EgoVideo to these diverse tasks, we showcase its versatility and effectiveness in different egocentric video analysis scenarios, demonstrating the powerful representation ability of EgoVideo as an egocentric foundation model. Our codebase and pretrained models are publicly available at https://github.com/OpenGVLab/EgoVideo.
comment: Champion solutions in the EgoVis CVPR 2024 workshop
☆ Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs
We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters using RGB video data captured using commodity cameras. Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions. Given a speech audio waveform and a token sequence of the speaker's face landmark motion and body-joint motion computed from a video, our method synthesizes the motion sequences for the speaker's face landmarks and body joints to match the content and the affect of the speech. We design a generator consisting of a set of encoders to transform all the inputs into a multimodal embedding space capturing their correlations, followed by a pair of decoders to synthesize the desired face and pose motions. To enhance the plausibility of synthesis, we use an adversarial discriminator that learns to differentiate between the face and pose motions computed from the original videos and our synthesized motions based on their affective expressions. To evaluate our approach, we extend the TED Gesture Dataset to include view-normalized, co-speech face landmarks in addition to body gestures. We demonstrate the performance of our method through thorough quantitative and qualitative experiments on multiple evaluation metrics and via a user study. We observe that our method results in low reconstruction error and produces synthesized samples with diverse facial expressions and body gestures for digital characters.
comment: 14 pages, 7 figures, 2 tables
☆ Leveraging Pre-trained Models for FF-to-FFPE Histopathological Image Translation
The two primary types of Hematoxylin and Eosin (H&E) slides in histopathology are Formalin-Fixed Paraffin-Embedded (FFPE) and Fresh Frozen (FF). FFPE slides offer high quality histopathological images but require a labor-intensive acquisition process. In contrast, FF slides can be prepared quickly, but the image quality is relatively poor. Our task is to translate FF images into FFPE style, thereby improving the image quality for diagnostic purposes. In this paper, we propose Diffusion-FFPE, a method for FF-to-FFPE histopathological image translation using a pre-trained diffusion model. Specifically, we employ a one-step diffusion model as the generator and fine-tune it with LoRA adapters using adversarial learning objectives. To ensure that the model effectively captures both global structural information and local details, we propose a multi-scale feature fusion (MFF) module. This module utilizes two VAE encoders to extract features of varying image sizes and performs feature fusion before feeding them into the UNet. Furthermore, we utilize a pre-trained vision-language model for histopathology as the backbone for the discriminator to further improve performance We conducted FF-to-FFPE translation experiments on the TCGA-NSCLC datasets, and our method achieved better performance compared to other methods. The code and models are released at https://github.com/QilaiZhang/Diffusion-FFPE.
☆ ViT-1.58b: Mobile Vision Transformers in the 1-bit Era
Vision Transformers (ViTs) have achieved remarkable performance in various image classification tasks by leveraging the attention mechanism to process image patches as tokens. However, the high computational and memory demands of ViTs pose significant challenges for deployment in resource-constrained environments. This paper introduces ViT-1.58b, a novel 1.58-bit quantized ViT model designed to drastically reduce memory and computational overhead while preserving competitive performance. ViT-1.58b employs ternary quantization, which refines the balance between efficiency and accuracy by constraining weights to {-1, 0, 1} and quantizing activations to 8-bit precision. Our approach ensures efficient scaling in terms of both memory and computation. Experiments on CIFAR-10 and ImageNet-1k demonstrate that ViT-1.58b maintains comparable accuracy to full-precision Vit, with significant reductions in memory usage and computational costs. This paper highlights the potential of extreme quantization techniques in developing sustainable AI solutions and contributes to the broader discourse on efficient model deployment in practical applications. Our code and weights are available at https://github.com/DLYuanGod/ViT-1.58b.
☆ A Multi-Stage Goal-Driven Network for Pedestrian Trajectory Prediction
Pedestrian trajectory prediction plays a pivotal role in ensuring the safety and efficiency of various applications, including autonomous vehicles and traffic management systems. This paper proposes a novel method for pedestrian trajectory prediction, called multi-stage goal-driven network (MGNet). Diverging from prior approaches relying on stepwise recursive prediction and the singular forecasting of a long-term goal, MGNet directs trajectory generation by forecasting intermediate stage goals, thereby reducing prediction errors. The network comprises three main components: a conditional variational autoencoder (CVAE), an attention module, and a multi-stage goal evaluator. Trajectories are encoded using conditional variational autoencoders to acquire knowledge about the approximate distribution of pedestrians' future trajectories, and combined with an attention mechanism to capture the temporal dependency between trajectory sequences. The pivotal module is the multi-stage goal evaluator, which utilizes the encoded feature vectors to predict intermediate goals, effectively minimizing cumulative errors in the recursive inference process. The effectiveness of MGNet is demonstrated through comprehensive experiments on the JAAD and PIE datasets. Comparative evaluations against state-of-the-art algorithms reveal significant performance improvements achieved by our proposed method.
comment: Paper accepted by 5th International Conference on Computer Vision, Image and Deep Learning (CVIDL 2024)
☆ ScanFormer: Referring Expression Comprehension by Iteratively Scanning CVPR2024
Referring Expression Comprehension (REC) aims to localize the target objects specified by free-form natural language descriptions in images. While state-of-the-art methods achieve impressive performance, they perform a dense perception of images, which incorporates redundant visual regions unrelated to linguistic queries, leading to additional computational overhead. This inspires us to explore a question: can we eliminate linguistic-irrelevant redundant visual regions to improve the efficiency of the model? Existing relevant methods primarily focus on fundamental visual tasks, with limited exploration in vision-language fields. To address this, we propose a coarse-to-fine iterative perception framework, called ScanFormer. It can iteratively exploit the image scale pyramid to extract linguistic-relevant visual patches from top to bottom. In each iteration, irrelevant patches are discarded by our designed informativeness prediction. Furthermore, we propose a patch selection strategy for discarded patches to accelerate inference. Experiments on widely used datasets, namely RefCOCO, RefCOCO+, RefCOCOg, and ReferItGame, verify the effectiveness of our method, which can strike a balance between accuracy and efficiency.
comment: Accepted by CVPR2024
☆ Multimodal foundation world models for generalist embodied agents
Learning generalist embodied agents, able to solve multitudes of tasks in different domains is a long-standing problem. Reinforcement learning (RL) is hard to scale up as it requires a complex reward design for each task. In contrast, language can specify tasks in a more natural way. Current foundation vision-language models (VLMs) generally require fine-tuning or other adaptations to be functional, due to the significant domain gap. However, the lack of multimodal data in such domains represents an obstacle toward developing foundation models for embodied applications. In this work, we overcome these problems by presenting multimodal foundation world models, able to connect and align the representation of foundation VLMs with the latent space of generative world models for RL, without any language annotations. The resulting agent learning framework, GenRL, allows one to specify tasks through vision and/or language prompts, ground them in the embodied domain's dynamics, and learns the corresponding behaviors in imagination. As assessed through large-scale multi-task benchmarking, GenRL exhibits strong multi-task generalization performance in several locomotion and manipulation domains. Furthermore, by introducing a data-free RL strategy, it lays the groundwork for foundation model-based RL for generalist embodied agents.
☆ Towards Synchronous Memorizability and Generalizability with Site-Modulated Diffusion Replay for Cross-Site Continual Segmentation
The ability to learn sequentially from different data sites is crucial for a deep network in solving practical medical image diagnosis problems due to privacy restrictions and storage limitations. However, adapting on incoming site leads to catastrophic forgetting on past sites and decreases generalizablity on unseen sites. Existing Continual Learning (CL) and Domain Generalization (DG) methods have been proposed to solve these two challenges respectively, but none of them can address both simultaneously. Recognizing this limitation, this paper proposes a novel training paradigm, learning towards Synchronous Memorizability and Generalizability (SMG-Learning). To achieve this, we create the orientational gradient alignment to ensure memorizability on previous sites, and arbitrary gradient alignment to enhance generalizability on unseen sites. This approach is named as Parallel Gradient Alignment (PGA). Furthermore, we approximate the PGA as dual meta-objectives using the first-order Taylor expansion to reduce computational cost of aligning gradients. Considering that performing gradient alignments, especially for previous sites, is not feasible due to the privacy constraints, we design a Site-Modulated Diffusion (SMD) model to generate images with site-specific learnable prompts, replaying images have similar data distributions as previous sites. We evaluate our method on two medical image segmentation tasks, where data from different sites arrive sequentially. Experimental results show that our method efficiently enhances both memorizability and generalizablity better than other state-of-the-art methods, delivering satisfactory performance across all sites. Our code will be available at: https://github.com/dyxu-cuhkcse/SMG-Learning.
☆ Real-time Structure Flow
This article introduces the structure flow field; a flow field that can provide high-speed robo-centric motion information for motion control of highly dynamic robotic devices and autonomous vehicles. Structure flow is the angular 3D velocity of the scene at a given pixel. We show that structure flow posses an elegant evolution model in the form of a Partial Differential Equation (PDE) that enables us to create dense flow predictions forward in time. We exploit this structure to design a predictor-update algorithm to compute structure flow in real time using image and depth measurements. The prediction stage takes the previous estimate of the structure flow and propagates it forward in time using a numerical implementation of the structure flow PDE. The predicted flow is then updated using new image and depth data. The algorithm runs up to 600 Hz on a Desktop GPU machine for 512x512 images with flow values up to 8 pixels. We provide ground truth validation on high-speed synthetic image sequences as well as results on real-life video on driving scenarios.
☆ View-Invariant Pixelwise Anomaly Detection in Multi-object Scenes with Adaptive View Synthesis
The inspection and monitoring of infrastructure assets typically requires identifying visual anomalies in scenes periodically photographed over time. Images collected manually or with robots such as unmanned aerial vehicles from the same scene at different instances in time are typically not perfectly aligned. Supervised segmentation methods can be applied to identify known problems, but unsupervised anomaly detection approaches are required when unknown anomalies occur. Current unsupervised pixel-level anomaly detection methods have mainly been developed for industrial settings where the camera position is known and constant. However, we find that these methods fail to generalize to the case when images are not perfectly aligned. We term the problem of unsupervised anomaly detection between two such imperfectly aligned sets of images as Scene Anomaly Detection (Scene AD). We present a novel network termed OmniAD to address the Scene AD problem posed. Specifically, we refine the anomaly detection method reverse distillation to achieve a 40% increase in pixel-level anomaly detection performance. The network's performance is further demonstrated to improve with two new data augmentation strategies proposed that leverage novel view synthesis and camera localization to improve generalization. We validate our approach with qualitative and quantitative results on a new dataset, ToyCity, the first Scene AD dataset with multiple objects, as well as on the established single object-centric dataset, MAD. https://drags99.github.io/OmniAD/
☆ Expressive Keypoints for Skeleton-based Action Recognition via Skeleton Transformation
In the realm of skeleton-based action recognition, the traditional methods which rely on coarse body keypoints fall short of capturing subtle human actions. In this work, we propose Expressive Keypoints that incorporates hand and foot details to form a fine-grained skeletal representation, improving the discriminative ability for existing models in discerning intricate actions. To efficiently model Expressive Keypoints, the Skeleton Transformation strategy is presented to gradually downsample the keypoints and prioritize prominent joints by allocating the importance weights. Additionally, a plug-and-play Instance Pooling module is exploited to extend our approach to multi-person scenarios without surging computation costs. Extensive experimental results over seven datasets present the superiority of our method compared to the state-of-the-art for skeleton-based human action recognition. Code is available at https://github.com/YijieYang23/SkeleT-GCN.
☆ Changen2: Multi-Temporal Remote Sensing Generative Change Foundation Model ICCV 2023
Our understanding of the temporal dynamics of the Earth's surface has been advanced by deep vision models, which often require lots of labeled multi-temporal images for training. However, collecting, preprocessing, and annotating multi-temporal remote sensing images at scale is non-trivial since it is expensive and knowledge-intensive. In this paper, we present change data generators based on generative models, which are cheap and automatic, alleviating these data problems. Our main idea is to simulate a stochastic change process over time. We describe the stochastic change process as a probabilistic graphical model (GPCM), which factorizes the complex simulation problem into two more tractable sub-problems, i.e., change event simulation and semantic change synthesis. To solve these two problems, we present Changen2, a GPCM with a resolution-scalable diffusion transformer which can generate time series of images and their semantic and change labels from labeled or unlabeled single-temporal images. Changen2 is a generative change foundation model that can be trained at scale via self-supervision, and can produce change supervisory signals from unlabeled single-temporal images. Unlike existing foundation models, Changen2 synthesizes change data to train task-specific foundation models for change detection. The resulting model possesses inherent zero-shot change detection capabilities and excellent transferability. Experiments suggest Changen2 has superior spatiotemporal scalability, e.g., Changen2 model trained on 256$^2$ pixel single-temporal images can yield time series of any length and resolutions of 1,024$^2$ pixels. Changen2 pre-trained models exhibit superior zero-shot performance (narrowing the performance gap to 3% on LEVIR-CD and approximately 10% on both S2Looking and SECOND, compared to fully supervised counterparts) and transferability across multiple types of change tasks.
comment: The enhanced extension of our ICCV 2023 (Changen)
☆ DICE: End-to-end Deformation Capture of Hand-Face Interactions from a Single Image
Reconstructing 3D hand-face interactions with deformations from a single image is a challenging yet crucial task with broad applications in AR, VR, and gaming. The challenges stem from self-occlusions during single-view hand-face interactions, diverse spatial relationships between hands and face, complex deformations, and the ambiguity of the single-view setting. The first and only method for hand-face interaction recovery, Decaf, introduces a global fitting optimization guided by contact and deformation estimation networks trained on studio-collected data with 3D annotations. However, Decaf suffers from a time-consuming optimization process and limited generalization capability due to its reliance on 3D annotations of hand-face interaction data. To address these issues, we present DICE, the first end-to-end method for Deformation-aware hand-face Interaction reCovEry from a single image. DICE estimates the poses of hands and faces, contacts, and deformations simultaneously using a Transformer-based architecture. It features disentangling the regression of local deformation fields and global mesh vertex locations into two network branches, enhancing deformation and contact estimation for precise and robust hand-face mesh recovery. To improve generalizability, we propose a weakly-supervised training approach that augments the training set using in-the-wild images without 3D ground-truth annotations, employing the depths of 2D keypoints estimated by off-the-shelf models and adversarial priors of poses for supervision. Our experiments demonstrate that DICE achieves state-of-the-art performance on a standard benchmark and in-the-wild data in terms of accuracy and physical plausibility. Additionally, our method operates at an interactive rate (20 fps) on an Nvidia 4090 GPU, whereas Decaf requires more than 15 seconds for a single image. Our code will be publicly available upon publication.
comment: 23 pages, 9 figures, 3 tables
☆ MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data
We train a model to generate images from multimodal prompts of interleaved text and images such as "a man and his dog in an animated style." We bootstrap a multimodal dataset by extracting semantically meaningful image crops corresponding to words in the image captions of synthetically generated and publicly available text-image data. Our model, MUMU, is composed of a vision-language model encoder with a diffusion decoder and is trained on a single 8xH100 GPU node. Despite being only trained on crops from the same image, MUMU learns to compose inputs from different images into a coherent output. For example, an input of a realistic person and a cartoon will output the same person in the cartoon style, and an input of a standing subject and a scooter will output the subject riding the scooter. As a result, our model generalizes to tasks such as style transfer and character consistency. Our results show the promise of using multimodal models as general purpose controllers for image generation.
☆ WV-Net: A foundation model for SAR WV-mode satellite imagery trained using contrastive self-supervised learning on 10 million images NeurIPS 2024
The European Space Agency's Copernicus Sentinel-1 (S-1) mission is a constellation of C-band synthetic aperture radar (SAR) satellites that provide unprecedented monitoring of the world's oceans. S-1's wave mode (WV) captures 20x20 km image patches at 5 m pixel resolution and is unaffected by cloud cover or time-of-day. The mission's open data policy has made SAR data easily accessible for a range of applications, but the need for manual image annotations is a bottleneck that hinders the use of machine learning methods. This study uses nearly 10 million WV-mode images and contrastive self-supervised learning to train a semantic embedding model called WV-Net. In multiple downstream tasks, WV-Net outperforms a comparable model that was pre-trained on natural images (ImageNet) with supervised learning. Experiments show improvements for estimating wave height (0.50 vs 0.60 RMSE using linear probing), estimating near-surface air temperature (0.90 vs 0.97 RMSE), and performing multilabel-classification of geophysical and atmospheric phenomena (0.96 vs 0.95 micro-averaged AUROC). WV-Net embeddings are also superior in an unsupervised image-retrieval task and scale better in data-sparse settings. Together, these results demonstrate that WV-Net embeddings can support geophysical research by providing a convenient foundation model for a variety of data analysis and exploration tasks.
comment: 20 pages, 9 figures, submitted to NeurIPS 2024
☆ 3D Feature Distillation with Object-Centric Priors
Grounding natural language to the physical world is a ubiquitous topic with a wide range of applications in computer vision and robotics. Recently, 2D vision-language models such as CLIP have been widely popularized, due to their impressive capabilities for open-vocabulary grounding in 2D images. Recent works aim to elevate 2D CLIP features to 3D via feature distillation, but either learn neural fields that are scene-specific and hence lack generalization, or focus on indoor room scan data that require access to multiple camera views, which is not practical in robot manipulation scenarios. Additionally, related methods typically fuse features at pixel-level and assume that all camera views are equally informative. In this work, we show that this approach leads to sub-optimal 3D features, both in terms of grounding accuracy, as well as segmentation crispness. To alleviate this, we propose a multi-view feature fusion strategy that employs object-centric priors to eliminate uninformative views based on semantic information, and fuse features at object-level via instance segmentation masks. To distill our object-centric 3D features, we generate a large-scale synthetic multi-view dataset of cluttered tabletop scenes, spawning 15k scenes from over 3300 unique object instances, which we make publicly available. We show that our method reconstructs 3D CLIP features with improved grounding capacity and spatial consistency, while doing so from single-view RGB-D, thus departing from the assumption of multiple camera views at test time. Finally, we show that our approach can generalize to novel tabletop domains and be re-purposed for 3D instance segmentation without fine-tuning, and demonstrate its utility for language-guided robotic grasping in clutter
comment: Submitted CoRL-24
☆ Towards Open-World Grasping with Large Vision-Language Models
The ability to grasp objects in-the-wild from open-ended language instructions constitutes a fundamental challenge in robotics. An open-world grasping system should be able to combine high-level contextual with low-level physical-geometric reasoning in order to be applicable in arbitrary scenarios. Recent works exploit the web-scale knowledge inherent in large language models (LLMs) to plan and reason in robotic context, but rely on external vision and action models to ground such knowledge into the environment and parameterize actuation. This setup suffers from two major bottlenecks: a) the LLM's reasoning capacity is constrained by the quality of visual grounding, and b) LLMs do not contain low-level spatial understanding of the world, which is essential for grasping in contact-rich scenarios. In this work we demonstrate that modern vision-language models (VLMs) are capable of tackling such limitations, as they are implicitly grounded and can jointly reason about semantics and geometry. We propose OWG, an open-world grasping pipeline that combines VLMs with segmentation and grasp synthesis models to unlock grounded world understanding in three stages: open-ended referring segmentation, grounded grasp planning and grasp ranking via contact reasoning, all of which can be applied zero-shot via suitable visual prompting mechanisms. We conduct extensive evaluation in cluttered indoor scene datasets to showcase OWG's robustness in grounding from open-ended language, as well as open-world robotic grasping experiments in both simulation and hardware that demonstrate superior performance compared to previous supervised and zero-shot LLM-based methods.
comment: Submitted CoRL24
☆ Dynamic Gaussian Marbles for Novel View Synthesis of Casual Monocular Videos
Gaussian splatting has become a popular representation for novel-view synthesis, exhibiting clear strengths in efficiency, photometric quality, and compositional edibility. Following its success, many works have extended Gaussians to 4D, showing that dynamic Gaussians maintain these benefits while also tracking scene geometry far better than alternative representations. Yet, these methods assume dense multi-view videos as supervision, constraining their use to controlled capture settings. In this work, we extend the capability of Gaussian scene representations to casually captured monocular videos. We show that existing 4D Gaussian methods dramatically fail in this setup because the monocular setting is underconstrained. Building off this finding, we propose Dynamic Gaussian Marbles (DGMarbles), consisting of three core modifications that target the difficulties of the monocular setting. First, DGMarbles uses isotropic Gaussian "marbles", reducing the degrees of freedom of each Gaussian, and constraining the optimization to focus on motion and appearance over local shape. Second, DGMarbles employs a hierarchical divide-and-conquer learning strategy to guide the optimization towards solutions with coherent motion. Finally, DGMarbles adds image-level and geometry-level priors into the optimization, including a tracking loss that takes advantage of recent progress in point tracking. By constraining the optimization in these ways, DGMarbles learns Gaussian trajectories that enable novel-view rendering and accurately capture the 3D motion of the scene elements. We evaluate on the (monocular) Nvidia Dynamic Scenes dataset and the Dycheck iPhone dataset, and show that DGMarbles significantly outperforms other Gaussian baselines in quality, and is on-par with non-Gaussian representations, all while maintaining the efficiency, compositionality, editability, and tracking benefits of Gaussians.
☆ SpY: A Context-Based Approach to Spacecraft Component Detection
This paper focuses on autonomously characterizing components such as solar panels, body panels, antennas, and thrusters of an unknown resident space object (RSO) using camera feed to aid autonomous on-orbit servicing (OOS) and active debris removal. Significant research has been conducted in this area using convolutional neural networks (CNNs). While CNNs are powerful at learning patterns and performing object detection, they struggle with missed detections and misclassifications in environments different from the training data, making them unreliable for safety in high-stakes missions like OOS. Additionally, failures exhibited by CNNs are often easily rectifiable by humans using commonsense reasoning and contextual knowledge. Embedding such reasoning in an object detector could improve detection accuracy. To validate this hypothesis, this paper presents an end-to-end object detector called SpaceYOLOv2 (SpY), which leverages the generalizability of CNNs while incorporating contextual knowledge using traditional computer vision techniques. SpY consists of two main components: a shape detector and the SpaceYOLO classifier (SYC). The shape detector uses CNNs to detect primitive shapes of RSOs and SYC associates these shapes with contextual knowledge, such as color and texture, to classify them as spacecraft components or "unknown" if the detected shape is uncertain. SpY's modular architecture allows customizable usage of contextual knowledge to improve detection performance, or SYC as a secondary fail-safe classifier with an existing spacecraft component detector. Performance evaluations on hardware-in-the-loop images of a mock-up spacecraft demonstrate that SpY is accurate and an ensemble of SpY with YOLOv5 trained for satellite component detection improved the performance by 23.4% in recall, demonstrating enhanced safety for vision-based navigation tasks.
comment: 12 pages, 9 figures
☆ Geometric Features Enhanced Human-Object Interaction Detection
Cameras are essential vision instruments to capture images for pattern detection and measurement. Human-object interaction (HOI) detection is one of the most popular pattern detection approaches for captured human-centric visual scenes. Recently, Transformer-based models have become the dominant approach for HOI detection due to their advanced network architectures and thus promising results. However, most of them follow the one-stage design of vanilla Transformer, leaving rich geometric priors under-exploited and leading to compromised performance especially when occlusion occurs. Given that geometric features tend to outperform visual ones in occluded scenarios and offer information that complements visual cues, we propose a novel end-to-end Transformer-style HOI detection model, i.e., geometric features enhanced HOI detector (GeoHOI). One key part of the model is a new unified self-supervised keypoint learning method named UniPointNet that bridges the gap of consistent keypoint representation across diverse object categories, including humans. GeoHOI effectively upgrades a Transformer-based HOI detector benefiting from the keypoints similarities measuring the likelihood of human-object interactions as well as local keypoint patches to enhance interaction query representation, so as to boost HOI predictions. Extensive experiments show that the proposed method outperforms the state-of-the-art models on V-COCO and achieves competitive performance on HICO-DET. Case study results on the post-disaster rescue with vision-based instruments showcase the applicability of the proposed GeoHOI in real-world applications.
comment: Accepted to IEEE TIM
☆ CSI4Free: GAN-Augmented mmWave CSI for Improved Pose Classification
In recent years, Joint Communication and Sensing (JC&S), has demonstrated significant success, particularly in utilizing sub-6 GHz frequencies with commercial-off-the-shelf (COTS) Wi-Fi devices for applications such as localization, gesture recognition, and pose classification. Deep learning and the existence of large public datasets has been pivotal in achieving such results. However, at mmWave frequencies (30-300 GHz), which has shown potential for more accurate sensing performance, there is a noticeable lack of research in the domain of COTS Wi-Fi sensing. Challenges such as limited research hardware, the absence of large datasets, limited functionality in COTS hardware, and the complexities of data collection present obstacles to a comprehensive exploration of this field. In this work, we aim to address these challenges by developing a method that can generate synthetic mmWave channel state information (CSI) samples. In particular, we use a generative adversarial network (GAN) on an existing dataset, to generate 30,000 additional CSI samples. The augmented samples exhibit a remarkable degree of consistency with the original data, as indicated by the notably high GAN-train and GAN-test scores. Furthermore, we integrate the augmented samples in training a pose classification model. We observe that the augmented samples complement the real data and improve the generalization of the classification model.
☆ IDA-UIE: An Iterative Framework for Deep Network-based Degradation Aware Underwater Image Enhancement
Underwater image quality is affected by fluorescence, low illumination, absorption, and scattering. Recent works in underwater image enhancement have proposed different deep network architectures to handle these problems. Most of these works have proposed a single network to handle all the challenges. We believe that deep networks trained for specific conditions deliver better performance than a single network learned from all degradation cases. Accordingly, the first contribution of this work lies in the proposal of an iterative framework where a single dominant degradation condition is identified and resolved. This proposal considers the following eight degradation conditions -- low illumination, low contrast, haziness, blurred image, presence of noise and color imbalance in three different channels. A deep network is designed to identify the dominant degradation condition. Accordingly, an appropriate deep network is selected for degradation condition-specific enhancement. The second contribution of this work is the construction of degradation condition specific datasets from good quality images of two standard datasets (UIEB and EUVP). This dataset is used to learn the condition specific enhancement networks. The proposed approach is found to outperform nine baseline methods on UIEB and EUVP datasets.
♻ ☆ Situational Awareness Matters in 3D Vision Language Reasoning CVPR 2024
Being able to carry out complicated vision language reasoning tasks in 3D space represents a significant milestone in developing household robots and human-centered embodied AI. In this work, we demonstrate that a critical and distinct challenge in 3D vision language reasoning is situational awareness, which incorporates two key components: (1) The autonomous agent grounds its self-location based on a language prompt. (2) The agent answers open-ended questions from the perspective of its calculated position. To address this challenge, we introduce SIG3D, an end-to-end Situation-Grounded model for 3D vision language reasoning. We tokenize the 3D scene into sparse voxel representation and propose a language-grounded situation estimator, followed by a situated question answering module. Experiments on the SQA3D and ScanQA datasets show that SIG3D outperforms state-of-the-art models in situation estimation and question answering by a large margin (e.g., an enhancement of over 30% on situation estimation accuracy). Subsequent analysis corroborates our architectural design choices, explores the distinct functions of visual and textual tokens, and highlights the importance of situational awareness in the domain of 3D question answering.
comment: CVPR 2024. Project Page: https://yunzeman.github.io/situation3d
♻ ☆ Invertible Consistency Distillation for Text-Guided Image Editing in Around 7 Steps
Diffusion distillation represents a highly promising direction for achieving faithful text-to-image generation in a few sampling steps. However, despite recent successes, existing distilled models still do not provide the full spectrum of diffusion abilities, such as real image inversion, which enables many precise image manipulation methods. This work aims to enrich distilled text-to-image diffusion models with the ability to effectively encode real images into their latent space. To this end, we introduce invertible Consistency Distillation (iCD), a generalized consistency distillation framework that facilitates both high-quality image synthesis and accurate image encoding in only 3-4 inference steps. Though the inversion problem for text-to-image diffusion models gets exacerbated by high classifier-free guidance scales, we notice that dynamic guidance significantly reduces reconstruction errors without noticeable degradation in generation performance. As a result, we demonstrate that iCD equipped with dynamic guidance may serve as a highly effective tool for zero-shot text-guided image editing, competing with more expensive state-of-the-art alternatives.
comment: Project page: https://yandex-research.github.io/invertible-cd/
♻ ☆ Unsupervised Open-Vocabulary Object Localization in Videos ICCV 2023
In this paper, we show that recent advances in video representation learning and pre-trained vision-language models allow for substantial improvements in self-supervised video object localization. We propose a method that first localizes objects in videos via an object-centric approach with slot attention and then assigns text to the obtained slots. The latter is achieved by an unsupervised way to read localized semantic information from the pre-trained CLIP model. The resulting video object localization is entirely unsupervised apart from the implicit annotation contained in CLIP, and it is effectively the first unsupervised approach that yields good results on regular video benchmarks.
comment: Accepted by ICCV 2023; Presented on CVPR 2024 Workshop CORR; Project Page:https://github.com/amazon-science/object-centric-vol
♻ ☆ Towards Arbitrary-Scale Histopathology Image Super-resolution: An Efficient Dual-branch Framework via Implicit Self-texture Enhancement
High-quality whole-slide scanners are expensive, complex, and time-consuming, thus limiting the acquisition and utilization of high-resolution pathology whole-slide images in daily clinical work. Deep learning-based single-image super-resolution techniques are an effective way to solve this problem by synthesizing high-resolution images from low-resolution ones. However, the existing super-resolution models applied in pathology images can only work in fixed integer magnifications, significantly decreasing their applicability. Though methods based on implicit neural representation have shown promising results in arbitrary-scale super-resolution of natural images, applying them directly to pathology images is inadequate because they have unique fine-grained image textures different from natural images. Thus, we propose an Implicit Self-Texture Enhancement-based dual-branch framework (ISTE) for arbitrary-scale super-resolution of pathology images to address this challenge. ISTE contains a pixel learning branch and a texture learning branch, which first learn pixel features and texture features, respectively. Then, we design a two-stage texture enhancement strategy to fuse the features from the two branches to obtain the super-resolution results, where the first stage is feature-based texture enhancement, and the second stage is spatial-domain-based texture enhancement. Extensive experiments on three public datasets show that ISTE outperforms existing fixed-scale and arbitrary-scale algorithms at multiple magnifications and helps to improve downstream task performance. To the best of our knowledge, this is the first work to achieve arbitrary-scale super-resolution in pathology images. Codes will be available.
♻ ☆ An unsupervised approach towards promptable defect segmentation in laser-based additive manufacturing by Segment Anything
Foundation models are currently driving a paradigm shift in computer vision tasks for various fields including biology, astronomy, and robotics among others, leveraging user-generated prompts to enhance their performance. In the Laser Additive Manufacturing (LAM) domain, accurate image-based defect segmentation is imperative to ensure product quality and facilitate real-time process control. However, such tasks are often characterized by multiple challenges including the absence of labels and the requirement for low latency inference among others. Porosity is a very common defect in LAM due to lack of fusion, entrapped gas, and keyholes, directly affecting mechanical properties like tensile strength, stiffness, and hardness, thereby compromising the quality of the final product. To address these issues, we construct a framework for image segmentation using a state-of-the-art Vision Transformer (ViT) based Foundation model (Segment Anything Model) with a novel multi-point prompt generation scheme using unsupervised clustering. Utilizing our framework we perform porosity segmentation in a case study of laser-based powder bed fusion (L-PBF) and obtain high accuracy without using any labeled data to guide the prompt tuning process. By capitalizing on lightweight foundation model inference combined with unsupervised prompt generation, we envision constructing a real-time anomaly detection pipeline that could revolutionize current laser additive manufacturing processes, thereby facilitating the shift towards Industry 4.0 and promoting defect-free production along with operational efficiency.
comment: 18 pages, 9 figures
♻ ☆ InstantGroup: Instant Template Generation for Scalable Group of Brain MRI Registration
Template generation is a critical step in groupwise image registration, which involves aligning a group of subjects into a common space. While existing methods can generate high-quality template images, they often incur substantial time costs or are limited by fixed group scales. In this paper, we present InstantGroup, an efficient groupwise template generation framework based on variational autoencoder (VAE) models that leverage latent representations' arithmetic properties, enabling scalability to groups of any size. InstantGroup features a Dual VAEs backbone with shared-weight twin networks to handle pairs of inputs and incorporates a Displacement Inversion Module (DIM) to maintain template unbiasedness and a Subject-Template Alignment Module (STAM) to improve template quality and registration accuracy. Experiments on 3D brain MRI scans from the OASIS and ADNI datasets reveal that InstantGroup dramatically reduces runtime, generating templates within seconds for various group sizes while maintaining superior performance compared to state-of-the-art baselines on quantitative metrics, including unbiasedness and registration accuracy.
♻ ☆ WhaleNet: a Novel Deep Learning Architecture for Marine Mammals Vocalizations on Watkins Marine Mammal Sound Database
Marine mammal communication is a complex field, hindered by the diversity of vocalizations and environmental factors. The Watkins Marine Mammal Sound Database (WMMD) constitutes a comprehensive labeled dataset employed in machine learning applications. Nevertheless, the methodologies for data preparation, preprocessing, and classification documented in the literature exhibit considerable variability and are typically not applied to the dataset in its entirety. This study initially undertakes a concise review of the state-of-the-art benchmarks pertaining to the dataset, with a particular focus on clarifying data preparation and preprocessing techniques. Subsequently, we explore the utilization of the Wavelet Scattering Transform (WST) and Mel spectrogram as preprocessing mechanisms for feature extraction. In this paper, we introduce \textbf{WhaleNet} (Wavelet Highly Adaptive Learning Ensemble Network), a sophisticated deep ensemble architecture for the classification of marine mammal vocalizations, leveraging both WST and Mel spectrogram for enhanced feature discrimination. By integrating the insights derived from WST and Mel representations, we achieved an improvement in classification accuracy by $8-10\%$ over existing architectures, corresponding to a classification accuracy of $97.61\%$.
♻ ☆ Scaling Painting Style Transfer
Neural style transfer (NST) is a deep learning technique that produces an unprecedentedly rich style transfer from a style image to a content image. It is particularly impressive when it comes to transferring style from a painting to an image. NST was originally achieved by solving an optimization problem to match the global statistics of the style image while preserving the local geometric features of the content image. The two main drawbacks of this original approach is that it is computationally expensive and that the resolution of the output images is limited by high GPU memory requirements. Many solutions have been proposed to both accelerate NST and produce images with larger size. However, our investigation shows that these accelerated methods all compromise the quality of the produced images in the context of painting style transfer. Indeed, transferring the style of a painting is a complex task involving features at different scales, from the color palette and compositional style to the fine brushstrokes and texture of the canvas. This paper provides a solution to solve the original global optimization for ultra-high resolution (UHR) images, enabling multiscale NST at unprecedented image sizes. This is achieved by spatially localizing the computation of each forward and backward passes through the VGG network. Extensive qualitative and quantitative comparisons, as well as a \textcolor{coverletter}{perceptual study}, show that our method produces style transfer of unmatched quality for such high-resolution painting styles. By a careful comparison, we show that state-of-the-art fast methods are still prone to artifacts, thus suggesting that fast painting style transfer remains an open problem. Source code is available at https://github.com/bgalerne/scaling_painting_style_transfer.
comment: 14 pages, 9 figures, 4 tables, accepted at EGSR 2024
♻ ☆ SALUDA: Surface-based Automotive Lidar Unsupervised Domain Adaptation 3DV 2024
Learning models on one labeled dataset that generalize well on another domain is a difficult task, as several shifts might happen between the data domains. This is notably the case for lidar data, for which models can exhibit large performance discrepancies due for instance to different lidar patterns or changes in acquisition conditions. This paper addresses the corresponding Unsupervised Domain Adaptation (UDA) task for semantic segmentation. To mitigate this problem, we introduce an unsupervised auxiliary task of learning an implicit underlying surface representation simultaneously on source and target data. As both domains share the same latent representation, the model is forced to accommodate discrepancies between the two sources of data. This novel strategy differs from classical minimization of statistical divergences or lidar-specific domain adaptation techniques. Our experiments demonstrate that our method achieves a better performance than the current state of the art, both in real-to-real and synthetic-to-real scenarios.
comment: Accepted as spotlight to 3DV 2024. Project repository: github.com/valeoai/SALUDA
♻ ☆ Towards Zero-Shot Interpretable Human Recognition: A 2D-3D Registration Framework
Large vision models based in deep learning architectures have been consistently advancing the state-of-the-art in biometric recognition. However, three weaknesses are commonly reported for such kind of approaches: 1) their extreme demands in terms of learning data; 2) the difficulties in generalising between different domains; and 3) the lack of interpretability/explainability, with biometrics being of particular interest, as it is important to provide evidence able to be used for forensics/legal purposes (e.g., in courts). To the best of our knowledge, this paper describes the first recognition framework/strategy that aims at addressing the three weaknesses simultaneously. At first, it relies exclusively in synthetic samples for learning purposes. Instead of requiring a large amount and variety of samples for each subject, the idea is to exclusively enroll a 3D point cloud per identity. Then, using generative strategies, we synthesize a very large (potentially infinite) number of samples, containing all the desired covariates (poses, clothing, distances, perspectives, lighting, occlusions,...). Upon the synthesizing method used, it is possible to adapt precisely to different kind of domains, which accounts for generalization purposes. Such data are then used to learn a model that performs local registration between image pairs, establishing positive correspondences between body parts that are the key, not only to recognition (according to cardinality and distribution), but also to provide an interpretable description of the response (e.g.: "both samples are from the same person, as they have similar facial shape, hair color and legs thickness").
♻ ☆ CTNeRF: Cross-Time Transformer for Dynamic Neural Radiance Field from Monocular Video
The goal of our work is to generate high-quality novel views from monocular videos of complex and dynamic scenes. Prior methods, such as DynamicNeRF, have shown impressive performance by leveraging time-varying dynamic radiation fields. However, these methods have limitations when it comes to accurately modeling the motion of complex objects, which can lead to inaccurate and blurry renderings of details. To address this limitation, we propose a novel approach that builds upon a recent generalization NeRF, which aggregates nearby views onto new viewpoints. However, such methods are typically only effective for static scenes. To overcome this challenge, we introduce a module that operates in both the time and frequency domains to aggregate the features of object motion. This allows us to learn the relationship between frames and generate higher-quality images. Our experiments demonstrate significant improvements over state-of-the-art methods on dynamic scene datasets. Specifically, our approach outperforms existing methods in terms of both the accuracy and visual quality of the synthesized views. Our code is available on https://github.com/xingy038/CTNeRF.
comment: Accepted by Pattern Recognition
♻ ☆ Jina CLIP: Your CLIP Model Is Also Your Text Retriever ICML2024
Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-sized vectors. These models are key to multimodal information retrieval and related tasks. However, CLIP models generally underperform in text-only tasks compared to specialized text models. This creates inefficiencies for information retrieval systems that keep separate embeddings and models for text-only and multimodal tasks. We propose a novel, multi-task contrastive training method to address this issue, which we use to train the jina-clip-v1 model to achieve the state-of-the-art performance on both text-image and text-text retrieval tasks.
comment: 4 pages, MFM-EAI@ICML2024
♻ ☆ Single-Model Attribution of Generative Models Through Final-Layer Inversion ICML2024
Recent breakthroughs in generative modeling have sparked interest in practical single-model attribution. Such methods predict whether a sample was generated by a specific generator or not, for instance, to prove intellectual property theft. However, previous works are either limited to the closed-world setting or require undesirable changes to the generative model. We address these shortcomings by, first, viewing single-model attribution through the lens of anomaly detection. Arising from this change of perspective, we propose FLIPAD, a new approach for single-model attribution in the open-world setting based on final-layer inversion and anomaly detection. We show that the utilized final-layer inversion can be reduced to a convex lasso optimization problem, making our approach theoretically sound and computationally efficient. The theoretical findings are accompanied by an experimental study demonstrating the effectiveness of our approach and its flexibility to various domains.
comment: Accepted at the Forty-first International Conference on Machine Learning [ICML2024]
♻ ☆ Bayesian Uncertainty Estimation by Hamiltonian Monte Carlo: Applications to Cardiac MRI Segmentation
Deep learning (DL)-based methods have achieved state-of-the-art performance for a wide range of medical image segmentation tasks. Nevertheless, recent studies show that deep neural networks (DNNs) can be miscalibrated and overconfident, leading to "silent failures" that are risky} for clinical applications. Bayesian statistics provide an intuitive approach to DL failure detection, based on posterior probability estimation. However, Bayesian DL, and in particular the posterior estimation, is intractable for large medical image segmentation DNNs. To tackle this challenge, we propose a Bayesian learning framework by Hamiltonian Monte Carlo (HMC), tempered by cold posterior (CP) to accommodate medical data augmentation, named HMC-CP. For HMC computation, we further propose a cyclical annealing strategy, which captures both local and global geometries of the posterior distribution, enabling highly efficient Bayesian DNN training with the same computational budget requirements as training a single DNN. The resulting Bayesian DNN outputs an ensemble segmentation along with the segmentation uncertainty. We evaluate the proposed HMC-CP extensively on cardiac magnetic resonance image (MRI) segmentation, using in-domain steady-state free precession (SSFP) cine images as well as out-of-domain datasets of quantitative $T_1$ and $T_2$ mapping.
♻ ☆ Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP
Contrastive Language--Image Pre-training (CLIP) has manifested remarkable improvements in zero-shot classification and cross-modal vision-language tasks. Yet, from a geometrical point of view, the CLIP embedding space has been found to have a pronounced modality gap. This gap renders the embedding space overly sparse and disconnected, with different modalities being densely distributed in distinct subregions of the hypersphere. In this work, we aim at answering two main questions: 1. Does sharing the parameter space between the multi-modal encoders reduce the modality gap? 2. Can the gap be mitigated by pushing apart the uni-modal embeddings via intra-modality separation? We design AlignCLIP, in order to answer these questions and show that answers to both questions are positive. Through extensive experiments, we show that AlignCLIP achieves noticeable enhancements in the cross-modal alignment of the embeddings, and thereby, reduces the modality gap, while maintaining the performance across several downstream evaluations, such as zero-shot image classification, zero-shot multi-modal retrieval and zero-shot semantic text similarity.
♻ ☆ Commonsense Prototype for Outdoor Unsupervised 3D Object Detection CVPR 2024
The prevalent approaches of unsupervised 3D object detection follow cluster-based pseudo-label generation and iterative self-training processes. However, the challenge arises due to the sparsity of LiDAR scans, which leads to pseudo-labels with erroneous size and position, resulting in subpar detection performance. To tackle this problem, this paper introduces a Commonsense Prototype-based Detector, termed CPD, for unsupervised 3D object detection. CPD first constructs Commonsense Prototype (CProto) characterized by high-quality bounding box and dense points, based on commonsense intuition. Subsequently, CPD refines the low-quality pseudo-labels by leveraging the size prior from CProto. Furthermore, CPD enhances the detection accuracy of sparsely scanned objects by the geometric knowledge from CProto. CPD outperforms state-of-the-art unsupervised 3D detectors on Waymo Open Dataset (WOD), PandaSet, and KITTI datasets by a large margin. Besides, by training CPD on WOD and testing on KITTI, CPD attains 90.85% and 81.01% 3D Average Precision on easy and moderate car classes, respectively. These achievements position CPD in close proximity to fully supervised detectors, highlighting the significance of our method. The code will be available at https://github.com/hailanyi/CPD.
comment: Accepted by CVPR 2024
♻ ☆ Editable Scene Simulation for Autonomous Driving via Collaborative LLM-Agents CVPR 2024
Scene simulation in autonomous driving has gained significant attention because of its huge potential for generating customized data. However, existing editable scene simulation approaches face limitations in terms of user interaction efficiency, multi-camera photo-realistic rendering and external digital assets integration. To address these challenges, this paper introduces ChatSim, the first system that enables editable photo-realistic 3D driving scene simulations via natural language commands with external digital assets. To enable editing with high command flexibility,~ChatSim leverages a large language model (LLM) agent collaboration framework. To generate photo-realistic outcomes, ChatSim employs a novel multi-camera neural radiance field method. Furthermore, to unleash the potential of extensive high-quality digital assets, ChatSim employs a novel multi-camera lighting estimation method to achieve scene-consistent assets' rendering. Our experiments on Waymo Open Dataset demonstrate that ChatSim can handle complex language commands and generate corresponding photo-realistic scene videos.
comment: CVPR 2024(Highlight)
♻ ☆ ChangeMamba: Remote Sensing Change Detection with Spatio-Temporal State Space Model
Convolutional neural networks (CNN) and Transformers have made impressive progress in the field of remote sensing change detection (CD). However, both architectures have inherent shortcomings: CNN are constrained by a limited receptive field that may hinder their ability to capture broader spatial contexts, while Transformers are computationally intensive, making them costly to train and deploy on large datasets. Recently, the Mamba architecture, based on state space models, has shown remarkable performance in a series of natural language processing tasks, which can effectively compensate for the shortcomings of the above two architectures. In this paper, we explore for the first time the potential of the Mamba architecture for remote sensing CD tasks. We tailor the corresponding frameworks, called MambaBCD, MambaSCD, and MambaBDA, for binary change detection (BCD), semantic change detection (SCD), and building damage assessment (BDA), respectively. All three frameworks adopt the cutting-edge Visual Mamba architecture as the encoder, which allows full learning of global spatial contextual information from the input images. For the change decoder, which is available in all three architectures, we propose three spatio-temporal relationship modeling mechanisms, which can be naturally combined with the Mamba architecture and fully utilize its attribute to achieve spatio-temporal interaction of multi-temporal features, thereby obtaining accurate change information. On five benchmark datasets, our proposed frameworks outperform current CNN- and Transformer-based approaches without using any complex training strategies or tricks, fully demonstrating the potential of the Mamba architecture in CD tasks. Further experiments show that our architecture is quite robust to degraded data. The source code will be available in https://github.com/ChenHongruixuan/MambaCD
comment: Accepted by IEEE TGRS
♻ ☆ ObjFormer: Learning Land-Cover Changes From Paired OSM Data and Optical High-Resolution Imagery via Object-Guided Transformer
Optical high-resolution imagery and OSM data are two important data sources of change detection (CD). Previous related studies focus on utilizing the information in OSM data to aid the CD on optical high-resolution images. This paper pioneers the direct detection of land-cover changes utilizing paired OSM data and optical imagery, thereby expanding the scope of CD tasks. To this end, we propose an object-guided Transformer (ObjFormer) by naturally combining the object-based image analysis (OBIA) technique with the advanced vision Transformer architecture. This combination can significantly reduce the computational overhead in the self-attention module without adding extra parameters or layers. ObjFormer has a hierarchical pseudo-siamese encoder consisting of object-guided self-attention modules that extracts multi-level heterogeneous features from OSM data and optical images; a decoder consisting of object-guided cross-attention modules can recover land-cover changes from the extracted heterogeneous features. Beyond basic binary change detection, this paper raises a new semi-supervised semantic change detection task that does not require any manually annotated land-cover labels to train semantic change detectors. Two lightweight semantic decoders are added to ObjFormer to accomplish this task efficiently. A converse cross-entropy loss is designed to fully utilize negative samples, contributing to the great performance improvement in this task. A large-scale benchmark dataset called OpenMapCD containing 1,287 samples covering 40 regions on six continents is constructed to conduct detailed experiments. The results show the effectiveness of our methods in this new kind of CD task. Additionally, case studies in Japanese cities demonstrate the framework's generalizability and practical potential. The OpenMapCD and source code are available in https://github.com/ChenHongruixuan/ObjFormer
comment: Accepted by IEEE TGRS
♻ ☆ Splatter a Video: Video Gaussian Representation for Versatile Processing
Video representation is a long-standing problem that is crucial for various down-stream tasks, such as tracking,depth prediction,segmentation,view synthesis,and editing. However, current methods either struggle to model complex motions due to the absence of 3D structure or rely on implicit 3D representations that are ill-suited for manipulation tasks. To address these challenges, we introduce a novel explicit 3D representation-video Gaussian representation -- that embeds a video into 3D Gaussians. Our proposed representation models video appearance in a 3D canonical space using explicit Gaussians as proxies and associates each Gaussian with 3D motions for video motion. This approach offers a more intrinsic and explicit representation than layered atlas or volumetric pixel matrices. To obtain such a representation, we distill 2D priors, such as optical flow and depth, from foundation models to regularize learning in this ill-posed setting. Extensive applications demonstrate the versatility of our new video representation. It has been proven effective in numerous video processing tasks, including tracking, consistent video depth and feature refinement, motion and appearance editing, and stereoscopic video generation. Project page: https://sunyangtian.github.io/spatter_a_video_web/
♻ ☆ Visual Odometry with Neuromorphic Resonator Networks
Visual Odometry (VO) is a method to estimate self-motion of a mobile robot using visual sensors. Unlike odometry based on integrating differential measurements that can accumulate errors, such as inertial sensors or wheel encoders, visual odometry is not compromised by drift. However, image-based VO is computationally demanding, limiting its application in use cases with low-latency, -memory, and -energy requirements. Neuromorphic hardware offers low-power solutions to many vision and AI problems, but designing such solutions is complicated and often has to be assembled from scratch. Here we propose to use Vector Symbolic Architecture (VSA) as an abstraction layer to design algorithms compatible with neuromorphic hardware. Building from a VSA model for scene analysis, described in our companion paper, we present a modular neuromorphic algorithm that achieves state-of-the-art performance on two-dimensional VO tasks. Specifically, the proposed algorithm stores and updates a working memory of the presented visual environment. Based on this working memory, a resonator network estimates the changing location and orientation of the camera. We experimentally validate the neuromorphic VSA-based approach to VO with two benchmarks: one based on an event camera dataset and the other in a dynamic scene with a robotic task.
comment: 19 pages, 5 figures, minor revisions, added results for shapes_translation dataset
♻ ☆ Neuromorphic Visual Scene Understanding with Resonator Networks
Analyzing a visual scene by inferring the configuration of a generative model is widely considered the most flexible and generalizable approach to scene understanding. Yet, one major problem is the computational challenge of the inference procedure, involving a combinatorial search across object identities and poses. Here we propose a neuromorphic solution exploiting three key concepts: (1) a computational framework based on Vector Symbolic Architectures (VSA) with complex-valued vectors; (2) the design of Hierarchical Resonator Networks (HRN) to factorize the non-commutative transforms translation and rotation in visual scenes; (3) the design of a multi-compartment spiking phasor neuron model for implementing complex-valued resonator networks on neuromorphic hardware. The VSA framework uses vector binding operations to form a generative image model in which binding acts as the equivariant operation for geometric transformations. A scene can, therefore, be described as a sum of vector products, which can then be efficiently factorized by a resonator network to infer objects and their poses. The HRN features a partitioned architecture in which vector binding is equivariant for horizontal and vertical translation within one partition and for rotation and scaling within the other partition. The spiking neuron model allows mapping the resonator network onto efficient and low-power neuromorphic hardware. Our approach is demonstrated on synthetic scenes composed of simple 2D shapes undergoing rigid geometric transformations and color changes. A companion paper demonstrates the same approach in real-world application scenarios for machine vision and robotics.
comment: 23 pages, 8 figures, minor revisions and extended supplementary material
♻ ☆ MedMNIST-C: Comprehensive benchmark and improved classifier robustness by simulating realistic image corruptions
The integration of neural-network-based systems into clinical practice is limited by challenges related to domain generalization and robustness. The computer vision community established benchmarks such as ImageNet-C as a fundamental prerequisite to measure progress towards those challenges. Similar datasets are largely absent in the medical imaging community which lacks a comprehensive benchmark that spans across imaging modalities and applications. To address this gap, we create and open-source MedMNIST-C, a benchmark dataset based on the MedMNIST+ collection covering 12 datasets and 9 imaging modalities. We simulate task and modality-specific image corruptions of varying severity to comprehensively evaluate the robustness of established algorithms against real-world artifacts and distribution shifts. We further provide quantitative evidence that our simple-to-use artificial corruptions allow for highly performant, lightweight data augmentation to enhance model robustness. Unlike traditional, generic augmentation strategies, our approach leverages domain knowledge, exhibiting significantly higher robustness when compared to widely adopted methods. By introducing MedMNIST-C and open-sourcing the corresponding library allowing for targeted data augmentations, we contribute to the development of increasingly robust methods tailored to the challenges of medical imaging. The code is available at https://github.com/francescodisalvo05/medmnistc-api}{github.com/francescodisalvo05/medmnistc-api .
♻ ☆ ECGrecover: a Deep Learning Approach for Electrocardiogram Signal Completion
In this work, we address the challenge of reconstructing the complete 12-lead ECG signal from incomplete parts of it. We focus on two main scenarii: (i) reconstructing missing signal segments within an ECG lead and (ii) recovering missing leads from a single-lead. We propose a model with a U-Net architecture trained on a novel objective function to address the reconstruction problem. This function incorporates both spatial and temporal aspects of the ECG by combining the distance in amplitude between the reconstructed and real signals with the signal trend. Through comprehensive assessments using both a real-life dataset and a publicly accessible one, we demonstrate that the proposed approach consistently outperforms state-of-the-art methods based on generative adversarial networks and a CopyPaste strategy. Our proposed model demonstrates superior performance in standard distortion metrics and preserves critical ECG characteristics, particularly the P, Q, R, S, and T wave coordinates. Two emerging clinical applications emphasize the relevance of our work. The first is the increasing need to digitize paper-stored ECGs for utilization in AI-based applications (automatic annotation and risk-quantification), often limited to digital ECG complete 10s recordings. The second is the widespread use of wearable devices that record ECGs but typically capture only a small subset of the 12 standard leads. In both cases, a non-negligible amount of information is lost or not recorded, which our approach aims to recover to overcome these limitations.
♻ ☆ Fast Learnings of Coupled Nonnegative Tensor Decomposition Using Optimal Gradient and Low-rank Approximation
Tensor decomposition is a fundamental technique widely applied in signal processing, machine learning, and various other fields. However, traditional tensor decomposition methods encounter limitations when jointly analyzing multi-block tensors, as they often struggle to effectively explore shared information among tensors. In this study, we first introduce a novel coupled nonnegative CANDECOMP/PARAFAC decomposition algorithm optimized by the alternating proximal gradient method (CoNCPD-APG). This algorithm is specially designed to address the challenges of jointly decomposing different tensors that are partially or fully linked, while simultaneously extracting common components, individual components and, core tensors. Recognizing the computational challenges inherent in optimizing nonnegative constraints over high-dimensional tensor data, we further propose the lraCoNCPD-APG algorithm. By integrating low-rank approximation with the proposed CoNCPD-APG method, the proposed algorithm can significantly decrease the computational burden without compromising decomposition quality, particularly for multi-block large-scale tensors. Simulation experiments conducted on synthetic data, real-world face image data, and two kinds of electroencephalography (EEG) data demonstrate the practicality and superiority of the proposed algorithms for coupled nonnegative tensor decomposition problems. Our results underscore the efficacy of our methods in uncovering meaningful patterns and structures from complex multi-block tensor data, thereby offering valuable insights for future applications.
comment: 15 pages, 6 figures
♻ ☆ Re-initialization-free Level Set Method via Molecular Beam Epitaxy Equation Regularization for Image Segmentation
Variational level set method has become a powerful tool in image segmentation due to its ability to handle complex topological changes and maintain continuity and smoothness in the process of evolution. However its evolution process can be unstable, which results in over flatted or over sharpened contours and segmentation failure. To improve the accuracy and stability of evolution, we propose a high-order level set variational segmentation method integrated with molecular beam epitaxy (MBE) equation regularization. This method uses the crystal growth in the MBE process to limit the evolution of the level set function, and thus can avoid the re-initialization in the evolution process and regulate the smoothness of the segmented curve. It also works for noisy images with intensity inhomogeneity, which is a challenge in image segmentation. To solve the variational model, we derive the gradient flow and design scalar auxiliary variable (SAV) scheme coupled with fast Fourier transform (FFT), which can significantly improve the computational efficiency compared with the traditional semi-implicit and semi-explicit scheme. Numerical experiments show that the proposed method can generate smooth segmentation curves, retain fine segmentation targets and obtain robust segmentation results of small objects. Compared to existing level set methods, this model is state-of-the-art in both accuracy and efficiency.
♻ ☆ Enhancing Low-light Light Field Images with A Deep Compensation Unfolding Network
This paper presents a novel and interpretable end-to-end learning framework, called the deep compensation unfolding network (DCUNet), for restoring light field (LF) images captured under low-light conditions. DCUNet is designed with a multi-stage architecture that mimics the optimization process of solving an inverse imaging problem in a data-driven fashion. The framework uses the intermediate enhanced result to estimate the illumination map, which is then employed in the unfolding process to produce a new enhanced result. Additionally, DCUNet includes a content-associated deep compensation module at each optimization stage to suppress noise and illumination map estimation errors. To properly mine and leverage the unique characteristics of LF images, this paper proposes a pseudo-explicit feature interaction module that comprehensively exploits redundant information in LF images. The experimental results on both simulated and real datasets demonstrate the superiority of our DCUNet over state-of-the-art methods, both qualitatively and quantitatively. Moreover, DCUNet preserves the essential geometric structure of enhanced LF images much better. The code will be publicly available at https://github.com/lyuxianqiang/LFLL-DCU.
♻ ☆ SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models ICML 2024
We propose SPHINX-X, an extensive Multimodality Large Language Model (MLLM) series developed upon SPHINX. To improve the architecture and training efficiency, we modify the SPHINX framework by removing redundant visual encoders, bypassing fully-padded sub-images with skip tokens, and simplifying multi-stage training into a one-stage all-in-one paradigm. To fully unleash the potential of MLLMs, we assemble a comprehensive multi-domain and multimodal dataset covering publicly available resources in language, vision, and vision-language tasks. We further enrich this collection with our curated OCR intensive and Set-of-Mark datasets, extending the diversity and generality. By training over different base LLMs including TinyLlama1.1B, InternLM2-7B, LLaMA2-13B, and Mixtral8x7B, we obtain a spectrum of MLLMs that vary in parameter size and multilingual capabilities. Comprehensive benchmarking reveals a strong correlation between the multi-modal performance with the data and parameter scales. Code and models are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory
comment: Accepted by ICML 2024. Code and models are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory
♻ ☆ Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment ACL 2024
Evaluating and Rethinking the current landscape of Large Multimodal Models (LMMs), we observe that widely-used visual-language projection approaches (e.g., Q-former or MLP) focus on the alignment of image-text descriptions yet ignore the visual knowledge-dimension alignment, i.e., connecting visuals to their relevant knowledge. Visual knowledge plays a significant role in analyzing, inferring, and interpreting information from visuals, helping improve the accuracy of answers to knowledge-based visual questions. In this paper, we mainly explore improving LMMs with visual-language knowledge alignment, especially aimed at challenging knowledge-based visual question answering (VQA). To this end, we present a Cognitive Visual-Language Mapper (CVLM), which contains a pretrained Visual Knowledge Aligner (VKA) and a Fine-grained Knowledge Adapter (FKA) used in the multimodal instruction tuning stage. Specifically, we design the VKA based on the interaction between a small language model and a visual encoder, training it on collected image-knowledge pairs to achieve visual knowledge acquisition and projection. FKA is employed to distill the fine-grained visual knowledge of an image and inject it into Large Language Models (LLMs). We conduct extensive experiments on knowledge-based VQA benchmarks and experimental results show that CVLM significantly improves the performance of LMMs on knowledge-based VQA (average gain by 5.0%). Ablation studies also verify the effectiveness of VKA and FKA, respectively. The codes are available at https://github.com/HITsz-TMG/Cognitive-Visual-Language-Mapper
comment: 12 pages,4 figures; Accepted by ACL 2024 Main Conference
♻ ☆ RGB-Sonar Tracking Benchmark and Spatial Cross-Attention Transformer Tracker
Vision camera and sonar are naturally complementary in the underwater environment. Combining the information from two modalities will promote better observation of underwater targets. However, this problem has not received sufficient attention in previous research. Therefore, this paper introduces a new challenging RGB-Sonar (RGB-S) tracking task and investigates how to achieve efficient tracking of an underwater target through the interaction of RGB and sonar modalities. Specifically, we first propose an RGBS50 benchmark dataset containing 50 sequences and more than 87000 high-quality annotated bounding boxes. Experimental results show that the RGBS50 benchmark poses a challenge to currently popular SOT trackers. Second, we propose an RGB-S tracker called SCANet, which includes a spatial cross-attention module (SCAM) consisting of a novel spatial cross-attention layer and two independent global integration modules. The spatial cross-attention is used to overcome the problem of spatial misalignment of between RGB and sonar images. Third, we propose a SOT data-based RGB-S simulation training method (SRST) to overcome the lack of RGB-S training datasets. It converts RGB images into sonar-like saliency images to construct pseudo-data pairs, enabling the model to learn the semantic structure of RGB-S-like data. Comprehensive experiments show that the proposed spatial cross-attention effectively achieves the interaction between RGB and sonar modalities and SCANet achieves state-of-the-art performance on the proposed benchmark. The code is available at https://github.com/LiYunfengLYF/RGBS50.
♻ ☆ Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models
Recent advancements in general-purpose or domain-specific multimodal large language models (LLMs) have witnessed remarkable progress for medical decision-making. However, they are designated for specific classification or generative tasks, and require model training or finetuning on large-scale datasets with sizeable parameters and tremendous computing, hindering their clinical utility across diverse resource-constrained scenarios in practice. In this paper, we propose a novel and lightweight framework Med-MoE (Mixture-of-Experts) that tackles both discriminative and generative multimodal medical tasks. The learning of Med-MoE consists of three steps: multimodal medical alignment, instruction tuning and routing, and domain-specific MoE tuning. After aligning multimodal medical images with LLM tokens, we then enable the model for different multimodal medical tasks with instruction tuning, together with a trainable router tailored for expert selection across input modalities. Finally, the model is tuned by integrating the router with multiple domain-specific experts, which are selectively activated and further empowered by meta expert. Comprehensive experiments on both open- and close-end medical question answering (Med-VQA) and image classification tasks across datasets such as VQA-RAD, SLAKE and Path-VQA demonstrate that our model can achieve performance superior to or on par with state-of-the-art baselines, while only requiring approximately 30\%-50\% of activated model parameters. Extensive analysis and ablations corroborate the effectiveness and practical utility of our method.
♻ ☆ InterCLIP-MEP: Interactive CLIP and Memory-Enhanced Predictor for Multi-modal Sarcasm Detection
The prevalence of sarcasm in social media, conveyed through text-image combinations, presents significant challenges for sentiment analysis and intention mining. Current multi-modal sarcasm detection methods have been proven to struggle with biases from spurious cues, leading to a superficial understanding of the complex interactions between text and image. To address these issues, we propose InterCLIP-MEP, a robust framework for multi-modal sarcasm detection. InterCLIP-MEP introduces a refined variant of CLIP, Interactive CLIP (InterCLIP), as the backbone, enhancing sample representations by embedding cross-modality information in each encoder. Furthermore, a novel training strategy is designed to adapt InterCLIP for a Memory-Enhanced Predictor (MEP). MEP uses dynamic dual-channel memory to store valuable historical knowledge of test samples and then leverages this memory as a non-parametric classifier to derive the final prediction. By using InterCLIP to encode text-image interactions more effectively and incorporating MEP, InterCLIP-MEP offers a more robust recognition of multi-modal sarcasm. Experiments demonstrate that InterCLIP-MEP achieves state-of-the-art performance on the MMSD2.0 benchmark. Code and data are available at https://github.com/CoderChen01/InterCLIP-MEP.
comment: 8 pages, 6 figures, 6 tables
♻ ☆ LayerMerge: Neural Network Depth Compression through Layer Pruning and Merging ICML 2024
Recent works show that reducing the number of layers in a convolutional neural network can enhance efficiency while maintaining the performance of the network. Existing depth compression methods remove redundant non-linear activation functions and merge the consecutive convolution layers into a single layer. However, these methods suffer from a critical drawback; the kernel size of the merged layers becomes larger, significantly undermining the latency reduction gained from reducing the depth of the network. We show that this problem can be addressed by jointly pruning convolution layers and activation functions. To this end, we propose LayerMerge, a novel depth compression method that selects which activation layers and convolution layers to remove, to achieve a desired inference speed-up while minimizing performance loss. Since the corresponding selection problem involves an exponential search space, we formulate a novel surrogate optimization problem and efficiently solve it via dynamic programming. Empirical results demonstrate that our method consistently outperforms existing depth compression and layer pruning methods on various network architectures, both on image classification and generation tasks. We release the code at https://github.com/snu-mllab/LayerMerge.
comment: ICML 2024
♻ ☆ DeCoF: Generated Video Detection via Frame Consistency: The First Benchmark Dataset
The escalating quality of video generated by advanced video generation methods results in new security challenges, while there have been few relevant research efforts: 1) There is no open-source dataset for generated video detection, 2) No generated video detection method has been proposed so far. To this end, we propose an open-source dataset and a detection method for generated video for the first time. First, we propose a scalable dataset consisting of 964 prompts, covering various forgery targets, scenes, behaviors, and actions, as well as various generation models with different architectures and generation methods, including the most popular commercial models like OpenAI's Sora and Google's Veo. Second, we found via probing experiments that spatial artifact-based detectors lack generalizability. Hence, we propose a simple yet effective \textbf{de}tection model based on \textbf{f}rame \textbf{co}nsistency (\textbf{DeCoF}), which focuses on temporal artifacts by eliminating the impact of spatial artifacts during feature learning. Extensive experiments demonstrate the efficacy of DeCoF in detecting videos generated by unseen video generation models and confirm its powerful generalizability across several commercially proprietary models. Our code and dataset will be released at \url{https://github.com/wuwuwuyue/DeCoF}.
♻ ☆ InstructTA: Instruction-Tuned Targeted Attack for Large Vision-Language Models
Large vision-language models (LVLMs) have demonstrated their incredible capability in image understanding and response generation. However, this rich visual interaction also makes LVLMs vulnerable to adversarial examples. In this paper, we formulate a novel and practical targeted attack scenario that the adversary can only know the vision encoder of the victim LVLM, without the knowledge of its prompts (which are often proprietary for service providers and not publicly available) and its underlying large language model (LLM). This practical setting poses challenges to the cross-prompt and cross-model transferability of targeted adversarial attack, which aims to confuse the LVLM to output a response that is semantically similar to the attacker's chosen target text. To this end, we propose an instruction-tuned targeted attack (dubbed \textsc{InstructTA}) to deliver the targeted adversarial attack on LVLMs with high transferability. Initially, we utilize a public text-to-image generative model to "reverse" the target response into a target image, and employ GPT-4 to infer a reasonable instruction $\boldsymbol{p}^\prime$ from the target response. We then form a local surrogate model (sharing the same vision encoder with the victim LVLM) to extract instruction-aware features of an adversarial image example and the target image, and minimize the distance between these two features to optimize the adversarial example. To further improve the transferability with instruction tuning, we augment the instruction $\boldsymbol{p}^\prime$ with instructions paraphrased from GPT-4. Extensive experiments demonstrate the superiority of our proposed method in targeted attack performance and transferability. The code is available at https://github.com/xunguangwang/InstructTA.
♻ ☆ A Comprehensive Survey on Underwater Image Enhancement Based on Deep Learning
Underwater image enhancement (UIE) presents a significant challenge within computer vision research. Despite the development of numerous UIE algorithms, a thorough and systematic review is still absent. To foster future advancements, we provide a detailed overview of the UIE task from several perspectives. Firstly, we introduce the physical models, data construction processes, evaluation metrics, and loss functions. Secondly, we categorize and discuss recent algorithms based on their contributions, considering six aspects: network architecture, learning strategy, learning stage, auxiliary tasks, domain perspective, and disentanglement fusion. Thirdly, due to the varying experimental setups in the existing literature, a comprehensive and unbiased comparison is currently unavailable. To address this, we perform both quantitative and qualitative evaluations of state-of-the-art algorithms across multiple benchmark datasets. Lastly, we identify key areas for future research in UIE. A collection of resources for UIE can be found at {https://github.com/YuZhao1999/UIE}.
comment: A survey on the underwater image enhancement task
♻ ☆ An Event-based Algorithm for Simultaneous 6-DOF Camera Pose Tracking and Mapping
Compared to regular cameras, Dynamic Vision Sensors or Event Cameras can output compact visual data based on a change in the intensity in each pixel location asynchronously. In this paper, we study the application of current image-based SLAM techniques to these novel sensors. To this end, the information in adaptively selected event windows is processed to form motion-compensated images. These images are then used to reconstruct the scene and estimate the 6-DOF pose of the camera. We also propose an inertial version of the event-only pipeline to assess its capabilities. We compare the results of different configurations of the proposed algorithm against the ground truth for sequences of two publicly available event datasets. We also compare the results of the proposed event-inertial pipeline with the state-of-the-art and show it can produce comparable or more accurate results provided the map estimate is reliable.
♻ ☆ Minimal Interaction Edge Tuning: A New Paradigm for Visual Adaptation
The rapid scaling of large vision pretrained models makes fine-tuning tasks more and more difficult on edge devices with low computational resources. We explore a new visual adaptation paradigm called edge tuning, which treats large pretrained models as standalone feature extractors that run on powerful cloud servers. The fine-tuning carries out on edge devices with small networks which require low computational resources. Existing methods that are potentially suitable for our edge tuning paradigm are discussed. But, three major drawbacks hinder their application in edge tuning: low adaptation capability, large adapter network, and high information transfer overhead. To address these issues, we propose Minimal Interaction Edge Tuning, or MIET, which reveals that the sum of intermediate features from pretrained models not only has minimal information transfer but also has high adaptation capability. With a lightweight attention-based adaptor network, MIET achieves information transfer efficiency, parameter efficiency, computational and memory efficiency, and at the same time demonstrates competitive results on various visual adaptation benchmarks.
comment: 9 pages
♻ ☆ Scalp Diagnostic System With Label-Free Segmentation and Training-Free Image Translation
Scalp diseases and alopecia affect millions of people around the world, underscoring the urgent need for early diagnosis and management of the disease. However, the development of a comprehensive AI-based diagnosis system encompassing these conditions remains an underexplored domain due to the challenges associated with data imbalance and the costly nature of labeling. To address these issues, we propose ScalpVision, an AI-driven system for the holistic diagnosis of scalp diseases and alopecia. In ScalpVision, effective hair segmentation is achieved using pseudo image-label pairs and an innovative prompting method in the absence of traditional hair masking labels. This approach is crucial for extracting key features such as hair thickness and count, which are then used to assess alopecia severity. Additionally, ScalpVision introduces DiffuseIT-M, a generative model adept at dataset augmentation while maintaining hair information, facilitating improved predictions of scalp disease severity. Our experimental results affirm ScalpVision's efficiency in diagnosing a variety of scalp conditions and alopecia, showcasing its potential as a valuable tool in dermatological care.
comment: IEEE Transactions on Medical Imaging (Under Review)
♻ ☆ DF-SLAM: Dictionary Factors Representation for High-Fidelity Neural Implicit Dense Visual SLAM System
We introduce a high-fidelity neural implicit dense visual Simultaneous Localization and Mapping (SLAM) system, termed DF-SLAM. In our work, we employ dictionary factors for scene representation, encoding the geometry and appearance information of the scene as a combination of basis and coefficient factors. Compared to neural implicit dense visual SLAM methods that directly encode scene information as features, our method exhibits superior scene detail reconstruction capabilities and more efficient memory usage, while our model size is insensitive to the size of the scene map, making our method more suitable for large-scale scenes. Additionally, we employ feature integration rendering to accelerate color rendering speed while ensuring color rendering quality, further enhancing the real-time performance of our neural SLAM method. Extensive experiments on synthetic and real-world datasets demonstrate that our method is competitive with existing state-of-the-art neural implicit SLAM methods in terms of real-time performance, localization accuracy, and scene reconstruction quality. Our source code is available at https://github.com/funcdecl/DF-SLAM.
♻ ☆ Towards Training-free Open-world Segmentation via Image Prompt Foundation Models
The realm of computer vision has witnessed a paradigm shift with the advent of foundational models, mirroring the transformative influence of large language models in the domain of natural language processing. This paper delves into the exploration of open-world segmentation, presenting a novel approach called Image Prompt Segmentation (IPSeg) that harnesses the power of vision foundational models. IPSeg lies the principle of a training-free paradigm, which capitalizes on image prompt techniques. Specifically, IPSeg utilizes a single image containing a subjective visual concept as a flexible prompt to query vision foundation models like DINOv2 and Stable Diffusion. Our approach extracts robust features for the prompt image and input image, then matches the input representations to the prompt representations via a novel feature interaction module to generate point prompts highlighting target objects in the input image. The generated point prompts are further utilized to guide the Segment Anything Model to segment the target object in the input image. The proposed method stands out by eliminating the need for exhaustive training sessions, thereby offering a more efficient and scalable solution. Experiments on COCO, PASCAL VOC, and other datasets demonstrate IPSeg's efficacy for flexible open-world segmentation using intuitive image prompts. This work pioneers tapping foundation models for open-world understanding through visual concepts conveyed in images.
comment: This paper is accepted by IJCV2024
♻ ☆ Regional Style and Color Transfer
This paper presents a novel contribution to the field of regional style transfer. Existing methods often suffer from the drawback of applying style homogeneously across the entire image, leading to stylistic inconsistencies or foreground object twisted when applied to image with foreground elements such as person figures. To address this limitation, we propose a new approach that leverages a segmentation network to precisely isolate foreground objects within the input image. Subsequently, style transfer is applied exclusively to the background region. The isolated foreground objects are then carefully reintegrated into the style-transferred background. To enhance the visual coherence between foreground and background, a color transfer step is employed on the foreground elements prior to their rein-corporation. Finally, we utilize feathering techniques to achieve a seamless amalgamation of foreground and background, resulting in a visually unified and aesthetically pleasing final composition. Extensive evaluations demonstrate that our proposed approach yields significantly more natural stylistic transformations compared to conventional methods.
comment: Accepted by 2024 5th International Conference on Computer Vision, Image and Deep Learning
♻ ☆ Latent diffusion models for parameterization and data assimilation of facies-based geomodels
Geological parameterization entails the representation of a geomodel using a small set of latent variables and a mapping from these variables to grid-block properties such as porosity and permeability. Parameterization is useful for data assimilation (history matching), as it maintains geological realism while reducing the number of variables to be determined. Diffusion models are a new class of generative deep-learning procedures that have been shown to outperform previous methods, such as generative adversarial networks, for image generation tasks. Diffusion models are trained to "denoise", which enables them to generate new geological realizations from input fields characterized by random noise. Latent diffusion models, which are the specific variant considered in this study, provide dimension reduction through use of a low-dimensional latent variable. The model developed in this work includes a variational autoencoder for dimension reduction and a U-net for the denoising process. Our application involves conditional 2D three-facies (channel-levee-mud) systems. The latent diffusion model is shown to provide realizations that are visually consistent with samples from geomodeling software. Quantitative metrics involving spatial and flow-response statistics are evaluated, and general agreement between the diffusion-generated models and reference realizations is observed. Stability tests are performed to assess the smoothness of the parameterization method. The latent diffusion model is then used for ensemble-based data assimilation. Two synthetic "true" models are considered. Significant uncertainty reduction, posterior P$_{10}$-P$_{90}$ forecasts that generally bracket observed data, and consistent posterior geomodels, are achieved in both cases.
comment: - Moved Table 1 from before to after Section 4.2 heading - Renamed output pdf file with paper title
♻ ☆ Source-Free Domain Adaptation with Diffusion-Guided Source Data Generation
This paper introduces a novel approach to leverage the generalizability of Diffusion Models for Source-Free Domain Adaptation (DM-SFDA). Our proposed DMSFDA method involves fine-tuning a pre-trained text-to-image diffusion model to generate source domain images using features from the target images to guide the diffusion process. Specifically, the pre-trained diffusion model is fine-tuned to generate source samples that minimize entropy and maximize confidence for the pre-trained source model. We then use a diffusion model-based image mixup strategy to bridge the domain gap between the source and target domains. We validate our approach through comprehensive experiments across a range of datasets, including Office-31, Office-Home, and VisDA. The results demonstrate significant improvements in SFDA performance, highlighting the potential of diffusion models in generating contextually relevant, domain-specific images.
comment: arXiv admin note: substantial text overlap with arXiv:2310.01701
♻ ☆ Diff3Dformer: Leveraging Slice Sequence Diffusion for Enhanced 3D CT Classification with Transformer Networks
The manifestation of symptoms associated with lung diseases can vary in different depths for individual patients, highlighting the significance of 3D information in CT scans for medical image classification. While Vision Transformer has shown superior performance over convolutional neural networks in image classification tasks, their effectiveness is often demonstrated on sufficiently large 2D datasets and they easily encounter overfitting issues on small medical image datasets. To address this limitation, we propose a Diffusion-based 3D Vision Transformer (Diff3Dformer), which utilizes the latent space of the Diffusion model to form the slice sequence for 3D analysis and incorporates clustering attention into ViT to aggregate repetitive information within 3D CT scans, thereby harnessing the power of the advanced transformer in 3D classification tasks on small datasets. Our method exhibits improved performance on two different scales of small datasets of 3D lung CT scans, surpassing the state of the art 3D methods and other transformer-based approaches that emerged during the COVID-19 pandemic, demonstrating its robust and superior performance across different scales of data. Experimental results underscore the superiority of our proposed method, indicating its potential for enhancing medical image classification tasks in real-world scenarios.
comment: conference
♻ ☆ AutoProSAM: Automated Prompting SAM for 3D Multi-Organ Segmentation
Segment Anything Model (SAM) is one of the pioneering prompt-based foundation models for image segmentation and has been rapidly adopted for various medical imaging applications. However, in clinical settings, creating effective prompts is notably challenging and time-consuming, requiring the expertise of domain specialists such as physicians. This requirement significantly diminishes SAM's primary advantage - its interactive capability with end users - in medical applications. Moreover, recent studies have indicated that SAM, originally designed for 2D natural images, performs sub optimally on 3D medical image segmentation tasks. This subpar performance is attributed to the domain gaps between natural and medical images and the disparities in spatial arrangements between 2D and 3D images, particularly in multi-organ segmentation applications. To overcome these challenges, we present a novel technique termed AutoProSAM. This method automates 3D multi-organ CT-based segmentation by leveraging SAM's foundational model capabilities without relying on domain experts for prompts. The approach utilizes parameter-efficient adaptation techniques to adapt SAM for 3D medical imagery and incorporates an effective automatic prompt learning paradigm specific to this domain. By eliminating the need for manual prompts, it enhances SAM's capabilities for 3D medical image segmentation and achieves state-of-the-art (SOTA) performance in CT-based multi-organ segmentation tasks.
♻ ☆ Fast Encoder-Based 3D from Casual Videos via Point Track Processing
This paper addresses the long-standing challenge of reconstructing 3D structures from videos with dynamic content. Current approaches to this problem were not designed to operate on casual videos recorded by standard cameras or require a long optimization time. Aiming to significantly improve the efficiency of previous approaches, we present TracksTo4D, a learning-based approach that enables inferring 3D structure and camera positions from dynamic content originating from casual videos using a single efficient feed-forward pass. To achieve this, we propose operating directly over 2D point tracks as input and designing an architecture tailored for processing 2D point tracks. Our proposed architecture is designed with two key principles in mind: (1) it takes into account the inherent symmetries present in the input point tracks data, and (2) it assumes that the movement patterns can be effectively represented using a low-rank approximation. TracksTo4D is trained in an unsupervised way on a dataset of casual videos utilizing only the 2D point tracks extracted from the videos, without any 3D supervision. Our experiments show that TracksTo4D can reconstruct a temporal point cloud and camera positions of the underlying video with accuracy comparable to state-of-the-art methods, while drastically reducing runtime by up to 95\%. We further show that TracksTo4D generalizes well to unseen videos of unseen semantic categories at inference time.
♻ ☆ I2V-Adapter: A General Image-to-Video Adapter for Diffusion Models
Text-guided image-to-video (I2V) generation aims to generate a coherent video that preserves the identity of the input image and semantically aligns with the input prompt. Existing methods typically augment pretrained text-to-video (T2V) models by either concatenating the image with noised video frames channel-wise before being fed into the model or injecting the image embedding produced by pretrained image encoders in cross-attention modules. However, the former approach often necessitates altering the fundamental weights of pretrained T2V models, thus restricting the model's compatibility within the open-source communities and disrupting the model's prior knowledge. Meanwhile, the latter typically fails to preserve the identity of the input image. We present I2V-Adapter to overcome such limitations. I2V-Adapter adeptly propagates the unnoised input image to subsequent noised frames through a cross-frame attention mechanism, maintaining the identity of the input image without any changes to the pretrained T2V model. Notably, I2V-Adapter only introduces a few trainable parameters, significantly alleviating the training cost and also ensures compatibility with existing community-driven personalized models and control tools. Moreover, we propose a novel Frame Similarity Prior to balance the motion amplitude and the stability of generated videos through two adjustable control coefficients. Our experimental results demonstrate that I2V-Adapter is capable of producing high-quality videos. This performance, coupled with its agility and adaptability, represents a substantial advancement in the field of I2V, particularly for personalized and controllable applications.
Computation and Language 150
☆ Towards Compositionality in Concept Learning ICML 2024
Concept-based interpretability methods offer a lens into the internals of foundation models by decomposing their embeddings into high-level concepts. These concept representations are most useful when they are compositional, meaning that the individual concepts compose to explain the full sample. We show that existing unsupervised concept extraction methods find concepts which are not compositional. To automatically discover compositional concept representations, we identify two salient properties of such representations, and propose Compositional Concept Extraction (CCE) for finding concepts which obey these properties. We evaluate CCE on five different datasets over image and text data. Our evaluation shows that CCE finds more compositional concept representations than baselines and yields better accuracy on four downstream classification tasks. Code and data are available at https://github.com/adaminsky/compositional_concepts .
comment: Accepted at ICML 2024. 26 pages, 10 figures
☆ Symbolic Learning Enables Self-Evolving Agents
The AI community has been exploring a pathway to artificial general intelligence (AGI) by developing "language agents", which are complex large language models (LLMs) pipelines involving both prompting techniques and tool usage methods. While language agents have demonstrated impressive capabilities for many real-world tasks, a fundamental limitation of current language agents research is that they are model-centric, or engineering-centric. That's to say, the progress on prompts, tools, and pipelines of language agents requires substantial manual engineering efforts from human experts rather than automatically learning from data. We believe the transition from model-centric, or engineering-centric, to data-centric, i.e., the ability of language agents to autonomously learn and evolve in environments, is the key for them to possibly achieve AGI. In this work, we introduce agent symbolic learning, a systematic framework that enables language agents to optimize themselves on their own in a data-centric way using symbolic optimizers. Specifically, we consider agents as symbolic networks where learnable weights are defined by prompts, tools, and the way they are stacked together. Agent symbolic learning is designed to optimize the symbolic network within language agents by mimicking two fundamental algorithms in connectionist learning: back-propagation and gradient descent. Instead of dealing with numeric weights, agent symbolic learning works with natural language simulacrums of weights, loss, and gradients. We conduct proof-of-concept experiments on both standard benchmarks and complex real-world tasks and show that agent symbolic learning enables language agents to update themselves after being created and deployed in the wild, resulting in "self-evolving agents".
comment: Code available at https://github.com/aiwaves-cn/agents
☆ PrExMe! Large Scale Prompt Exploration of Open Source LLMs for Machine Translation and Summarization Evaluation
Large language models (LLMs) have revolutionized the field of NLP. Notably, their in-context learning capabilities also enable their use as evaluation metrics for natural language generation, making them particularly advantageous in low-resource scenarios and time-restricted applications. In this work, we introduce PrExMe, a large-scale prompt exploration for metrics, where we evaluate more than 720 prompt templates for open-source LLM-based metrics on machine translation (MT) and summarization datasets, totalling over 6.6M evaluations. This extensive comparison (1) serves as a benchmark of the performance of recent open-source LLMs as metrics and (2) explores the stability and variability of different prompting strategies. We discover that, on the one hand, there are scenarios for which prompts are stable. For instance, some LLMs show idiosyncratic preferences and favor to grade generated texts with textual labels while others prefer to return numeric scores. On the other hand, the stability of prompts and model rankings can be susceptible to seemingly innocuous changes. For example, changing the requested output format from "0 to 100" to "-1 to +1" can strongly affect the rankings in our evaluation. Our study contributes to understanding the impact of different prompting approaches on LLM-based metrics for MT and summarization evaluation, highlighting the most stable prompting patterns and potential limitations.
comment: Preprint
☆ ChronoMagic-Bench: A Benchmark for Metamorphic Evaluation of Text-to-Time-lapse Video Generation
We propose a novel text-to-video (T2V) generation benchmark, ChronoMagic-Bench, to evaluate the temporal and metamorphic capabilities of the T2V models (e.g. Sora and Lumiere) in time-lapse video generation. In contrast to existing benchmarks that focus on the visual quality and textual relevance of generated videos, ChronoMagic-Bench focuses on the model's ability to generate time-lapse videos with significant metamorphic amplitude and temporal coherence. The benchmark probes T2V models for their physics, biology, and chemistry capabilities, in a free-form text query. For these purposes, ChronoMagic-Bench introduces 1,649 prompts and real-world videos as references, categorized into four major types of time-lapse videos: biological, human-created, meteorological, and physical phenomena, which are further divided into 75 subcategories. This categorization comprehensively evaluates the model's capacity to handle diverse and complex transformations. To accurately align human preference with the benchmark, we introduce two new automatic metrics, MTScore and CHScore, to evaluate the videos' metamorphic attributes and temporal coherence. MTScore measures the metamorphic amplitude, reflecting the degree of change over time, while CHScore assesses the temporal coherence, ensuring the generated videos maintain logical progression and continuity. Based on the ChronoMagic-Bench, we conduct comprehensive manual evaluations of ten representative T2V models, revealing their strengths and weaknesses across different categories of prompts, and providing a thorough evaluation framework that addresses current gaps in video generation research. Moreover, we create a large-scale ChronoMagic-Pro dataset, containing 460k high-quality pairs of 720p time-lapse videos and detailed captions ensuring high physical pertinence and large metamorphic amplitude.
comment: 31 pages, 15 figures
☆ CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs
Chart understanding plays a pivotal role when applying Multimodal Large Language Models (MLLMs) to real-world tasks such as analyzing scientific papers or financial reports. However, existing datasets often focus on oversimplified and homogeneous charts with template-based questions, leading to an over-optimistic measure of progress. We demonstrate that although open-source models can appear to outperform strong proprietary models on these benchmarks, a simple stress test with slightly different charts or questions can deteriorate performance by up to 34.5%. In this work, we propose CharXiv, a comprehensive evaluation suite involving 2,323 natural, challenging, and diverse charts from arXiv papers. CharXiv includes two types of questions: 1) descriptive questions about examining basic chart elements and 2) reasoning questions that require synthesizing information across complex visual elements in the chart. To ensure quality, all charts and questions are handpicked, curated, and verified by human experts. Our results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary model (i.e., GPT-4o), which achieves 47.1% accuracy, and the strongest open-source model (i.e., InternVL Chat V1.5), which achieves 29.2%. All models lag far behind human performance of 80.5%, underscoring weaknesses in the chart understanding capabilities of existing MLLMs. We hope CharXiv facilitates future research on MLLM chart understanding by providing a more realistic and faithful measure of progress. Project page and leaderboard: https://charxiv.github.io/
comment: 121 pages, 90 figures
☆ APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets
The advancement of function-calling agent models requires diverse, reliable, and high-quality datasets. This paper presents APIGen, an automated data generation pipeline designed to synthesize verifiable high-quality datasets for function-calling applications. We leverage APIGen and collect 3,673 executable APIs across 21 different categories to generate diverse function-calling datasets in a scalable and structured manner. Each data in our dataset is verified through three hierarchical stages: format checking, actual function executions, and semantic verification, ensuring its reliability and correctness. We demonstrate that models trained with our curated datasets, even with only 7B parameters, can achieve state-of-the-art performance on the Berkeley Function-Calling Benchmark, outperforming multiple GPT-4 models. Moreover, our 1B model achieves exceptional performance, surpassing GPT-3.5-Turbo and Claude-3 Haiku. We release a dataset containing 60,000 high-quality entries, aiming to advance the field of function-calling agent domains. The dataset is available on Huggingface: https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k and the project homepage: https://apigen-pipeline.github.io/
☆ "Is ChatGPT a Better Explainer than My Professor?": Evaluating the Explanation Capabilities of LLMs in Conversation Compared to a Human Baseline
Explanations form the foundation of knowledge sharing and build upon communication principles, social dynamics, and learning theories. We focus specifically on conversational approaches for explanations because the context is highly adaptive and interactive. Our research leverages previous work on explanatory acts, a framework for understanding the different strategies that explainers and explainees employ in a conversation to both explain, understand, and engage with the other party. We use the 5-Levels dataset was constructed from the WIRED YouTube series by Wachsmuth et al., and later annotated by Booshehri et al. with explanatory acts. These annotations provide a framework for understanding how explainers and explainees structure their response when crafting a response. With the rise of generative AI in the past year, we hope to better understand the capabilities of Large Language Models (LLMs) and how they can augment expert explainer's capabilities in conversational settings. To achieve this goal, the 5-Levels dataset (We use Booshehri et al.'s 2023 annotated dataset with explanatory acts.) allows us to audit the ability of LLMs in engaging in explanation dialogues. To evaluate the effectiveness of LLMs in generating explainer responses, we compared 3 different strategies, we asked human annotators to evaluate 3 different strategies: human explainer response, GPT4 standard response, GPT4 response with Explanation Moves.
comment: 6 figures, 5 pages
☆ WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models
We introduce WildTeaming, an automatic LLM safety red-teaming framework that mines in-the-wild user-chatbot interactions to discover 5.7K unique clusters of novel jailbreak tactics, and then composes multiple tactics for systematic exploration of novel jailbreaks. Compared to prior work that performed red-teaming via recruited human workers, gradient-based optimization, or iterative revision with LLMs, our work investigates jailbreaks from chatbot users who were not specifically instructed to break the system. WildTeaming reveals previously unidentified vulnerabilities of frontier LLMs, resulting in up to 4.6x more diverse and successful adversarial attacks compared to state-of-the-art jailbreak methods. While many datasets exist for jailbreak evaluation, very few open-source datasets exist for jailbreak training, as safety training data has been closed even when model weights are open. With WildTeaming we create WildJailbreak, a large-scale open-source synthetic safety dataset with 262K vanilla (direct request) and adversarial (complex jailbreak) prompt-response pairs. To mitigate exaggerated safety behaviors, WildJailbreak provides two contrastive types of queries: 1) harmful queries (vanilla & adversarial) and 2) benign queries that resemble harmful queries in form but contain no harm. As WildJailbreak considerably upgrades the quality and scale of existing safety resources, it uniquely enables us to examine the scaling effects of data and the interplay of data properties and model capabilities during safety training. Through extensive experiments, we identify the training properties that enable an ideal balance of safety behaviors: appropriate safeguarding without over-refusal, effective handling of vanilla and adversarial queries, and minimal, if any, decrease in general capabilities. All components of WildJailbeak contribute to achieving balanced safety behaviors of models.
☆ Mental Modeling of Reinforcement Learning Agents by Language Models
Can emergent language models faithfully model the intelligence of decision-making agents? Though modern language models exhibit already some reasoning ability, and theoretically can potentially express any probable distribution over tokens, it remains underexplored how the world knowledge these pretrained models have memorized can be utilized to comprehend an agent's behaviour in the physical world. This study empirically examines, for the first time, how well large language models (LLMs) can build a mental model of agents, termed agent mental modelling, by reasoning about an agent's behaviour and its effect on states from agent interaction history. This research may unveil the potential of leveraging LLMs for elucidating RL agent behaviour, addressing a key challenge in eXplainable reinforcement learning (XRL). To this end, we propose specific evaluation metrics and test them on selected RL task datasets of varying complexity, reporting findings on agent mental model establishment. Our results disclose that LLMs are not yet capable of fully mental modelling agents through inference alone without further innovations. This work thus provides new insights into the capabilities and limitations of modern LLMs.
comment: https://lukaswill.github.io/
☆ Is In-Context Learning a Type of Gradient-Based Learning? Evidence from the Inverse Frequency Effect in Structural Priming
Large language models (LLMs) have shown the emergent capability of in-context learning (ICL). One line of research has explained ICL as functionally performing gradient descent. In this paper, we introduce a new way of diagnosing whether ICL is functionally equivalent to gradient-based learning. Our approach is based on the inverse frequency effect (IFE) -- a phenomenon in which an error-driven learner is expected to show larger updates when trained on infrequent examples than frequent ones. The IFE has previously been studied in psycholinguistics because humans show this effect in the context of structural priming (the tendency for people to produce sentence structures they have encountered recently); the IFE has been used as evidence that human structural priming must involve error-driven learning mechanisms. In our experiments, we simulated structural priming within ICL and found that LLMs display the IFE, with the effect being stronger in larger models. We conclude that ICL is indeed a type of gradient-based learning, supporting the hypothesis that a gradient component is implicitly computed in the forward pass during ICL. Our results suggest that both humans and LLMs make use of gradient-based, error-driven processing mechanisms.
☆ WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs
We introduce WildGuard -- an open, light-weight moderation tool for LLM safety that achieves three goals: (1) identifying malicious intent in user prompts, (2) detecting safety risks of model responses, and (3) determining model refusal rate. Together, WildGuard serves the increasing needs for automatic safety moderation and evaluation of LLM interactions, providing a one-stop tool with enhanced accuracy and broad coverage across 13 risk categories. While existing open moderation tools such as Llama-Guard2 score reasonably well in classifying straightforward model interactions, they lag far behind a prompted GPT-4, especially in identifying adversarial jailbreaks and in evaluating models' refusals, a key measure for evaluating safety behaviors in model responses. To address these challenges, we construct WildGuardMix, a large-scale and carefully balanced multi-task safety moderation dataset with 92K labeled examples that cover vanilla (direct) prompts and adversarial jailbreaks, paired with various refusal and compliance responses. WildGuardMix is a combination of WildGuardTrain, the training data of WildGuard, and WildGuardTest, a high-quality human-annotated moderation test set with 5K labeled items covering broad risk scenarios. Through extensive evaluations on WildGuardTest and ten existing public benchmarks, we show that WildGuard establishes state-of-the-art performance in open-source safety moderation across all the three tasks compared to ten strong existing open-source moderation models (e.g., up to 26.4% improvement on refusal detection). Importantly, WildGuard matches and sometimes exceeds GPT-4 performance (e.g., up to 3.9% improvement on prompt harmfulness identification). WildGuard serves as a highly effective safety moderator in an LLM interface, reducing the success rate of jailbreak attacks from 79.8% to 2.4%.
comment: First two authors contributed equally. Third and fourth authors contributed equally
☆ Role-Play Zero-Shot Prompting with Large Language Models for Open-Domain Human-Machine Conversation SIGDIAL 2023
Recently, various methods have been proposed to create open-domain conversational agents with Large Language Models (LLMs). These models are able to answer user queries, but in a one-way Q&A format rather than a true conversation. Fine-tuning on particular datasets is the usual way to modify their style to increase conversational ability, but this is expensive and usually only available in a few languages. In this study, we explore role-play zero-shot prompting as an efficient and cost-effective solution for open-domain conversation, using capable multilingual LLMs (Beeching et al., 2023) trained to obey instructions. We design a prompting system that, when combined with an instruction-following model - here Vicuna (Chiang et al., 2023) - produces conversational agents that match and even surpass fine-tuned models in human evaluation in French in two different tasks.
comment: Updated version of a paper originally submitted at SIGDIAL 2023
☆ Cascading Large Language Models for Salient Event Graph Generation
Generating event graphs from long documents is challenging due to the inherent complexity of multiple tasks involved such as detecting events, identifying their relationships, and reconciling unstructured input with structured graphs. Recent studies typically consider all events with equal importance, failing to distinguish salient events crucial for understanding narratives. This paper presents CALLMSAE, a CAscading Large Language Model framework for SAlient Event graph generation, which leverages the capabilities of LLMs and eliminates the need for costly human annotations. We first identify salient events by prompting LLMs to generate summaries, from which salient events are identified. Next, we develop an iterative code refinement prompting strategy to generate event relation graphs, removing hallucinated relations and recovering missing edges. Fine-tuning contextualised graph generation models on the LLM-generated graphs outperforms the models trained on CAEVO-generated data. Experimental results on a human-annotated test set show that the proposed method generates salient and more accurate graphs, outperforming competitive baselines.
comment: 9 + 12 pages
☆ IRCAN: Mitigating Knowledge Conflicts in LLM Generation via Identifying and Reweighting Context-Aware Neurons
It is widely acknowledged that large language models (LLMs) encode a vast reservoir of knowledge after being trained on mass data. Recent studies disclose knowledge conflicts in LLM generation, wherein outdated or incorrect parametric knowledge (i.e., encoded knowledge) contradicts new knowledge provided in the context. To mitigate such knowledge conflicts, we propose a novel framework, IRCAN (Identifying and Reweighting Context-Aware Neurons) to capitalize on neurons that are crucial in processing contextual cues. Specifically, IRCAN first identifies neurons that significantly contribute to context processing, utilizing a context-aware attribution score derived from integrated gradients. Subsequently, the identified context-aware neurons are strengthened via reweighting. In doing so, we steer LLMs to generate context-sensitive outputs with respect to the new knowledge provided in the context. Extensive experiments conducted across a variety of models and tasks demonstrate that IRCAN not only achieves remarkable improvements in handling knowledge conflicts but also offers a scalable, plug-andplay solution that can be integrated seamlessly with existing models.
comment: 19 pages, 13 figures, 5 tables
☆ LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks
There is an increasing trend towards evaluating NLP models with LLM-generated judgments instead of human judgments. In the absence of a comparison against human data, this raises concerns about the validity of these evaluations; in case they are conducted with proprietary models, this also raises concerns over reproducibility. We provide JUDGE-BENCH, a collection of 20 NLP datasets with human annotations, and comprehensively evaluate 11 current LLMs, covering both open-weight and proprietary models, for their ability to replicate the annotations. Our evaluations show that each LLM exhibits a large variance across datasets in its correlation to human judgments. We conclude that LLMs are not yet ready to systematically replace human judges in NLP.
☆ Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers
Large Language Models (LLMs) have the capacity to store and recall facts. Through experimentation with open-source models, we observe that this ability to retrieve facts can be easily manipulated by changing contexts, even without altering their factual meanings. These findings highlight that LLMs might behave like an associative memory model where certain tokens in the contexts serve as clues to retrieving facts. We mathematically explore this property by studying how transformers, the building blocks of LLMs, can complete such memory tasks. We study a simple latent concept association problem with a one-layer transformer and we show theoretically and empirically that the transformer gathers information using self-attention and uses the value matrix for associative memory.
☆ Dynamic Data Pruning for Automatic Speech Recognition
The recent success of Automatic Speech Recognition (ASR) is largely attributed to the ever-growing amount of training data. However, this trend has made model training prohibitively costly and imposed computational demands. While data pruning has been proposed to mitigate this issue by identifying a small subset of relevant data, its application in ASR has been barely explored, and existing works often entail significant overhead to achieve meaningful results. To fill this gap, this paper presents the first investigation of dynamic data pruning for ASR, finding that we can reach the full-data performance by dynamically selecting 70% of data. Furthermore, we introduce Dynamic Data Pruning for ASR (DDP-ASR), which offers several fine-grained pruning granularities specifically tailored for speech-related datasets, going beyond the conventional pruning of entire time sequences. Our intensive experiments show that DDP-ASR can save up to 1.6x training time with negligible performance loss.
comment: Accepted to Interspeech 2024
☆ Themis: Towards Flexible and Interpretable NLG Evaluation
The evaluation of natural language generation (NLG) tasks is a significant and longstanding research issue. With the recent emergence of powerful large language models (LLMs), some studies have turned to LLM-based automatic evaluation methods, which demonstrate great potential to become a new evaluation paradigm following traditional string-based and model-based metrics. However, despite the improved performance of existing methods, they still possess some deficiencies, such as dependency on references and limited evaluation flexibility. Therefore, in this paper, we meticulously construct a large-scale NLG evaluation corpus NLG-Eval with human and GPT-4 annotations to alleviate the lack of relevant data in this field. Furthermore, we propose Themis, an LLM dedicated to NLG evaluation, which has been trained with our designed multi-perspective consistency and rating-oriented preference alignment methods. Themis can conduct flexible and interpretable evaluations without references, and it exhibits superior evaluation performance on various NLG tasks, simultaneously generalizing well to unseen tasks and surpassing other evaluation models, including GPT-4.
☆ Research on Information Extraction of LCSTS Dataset Based on an Improved BERTSum-LSTM Model
With the continuous advancement of artificial intelligence, natural language processing technology has become widely utilized in various fields. At the same time, there are many challenges in creating Chinese news summaries. First of all, the semantics of Chinese news is complex, and the amount of information is enormous. Extracting critical information from Chinese news presents a significant challenge. Second, the news summary should be concise and clear, focusing on the main content and avoiding redundancy. In addition, the particularity of the Chinese language, such as polysemy, word segmentation, etc., makes it challenging to generate Chinese news summaries. Based on the above, this paper studies the information extraction method of the LCSTS dataset based on an improved BERTSum-LSTM model. We improve the BERTSum-LSTM model to make it perform better in generating Chinese news summaries. The experimental results show that the proposed method has a good effect on creating news summaries, which is of great importance to the construction of news summaries.
comment: submitted to ICMIII 2024
☆ Grammar Assistance Using Syntactic Structures (GAUSS)
Automatic grammar coaching serves an important purpose of advising on standard grammar varieties while not imposing social pressures or reinforcing established social roles. Such systems already exist but most of them are for English and few of them offer meaningful feedback. Furthermore, they typically rely completely on neural methods and require huge computational resources which most of the world cannot afford. We propose a grammar coaching system for Spanish that relies on (i) a rich linguistic formalism capable of giving informative feedback; and (ii) a faster parsing algorithm which makes using this formalism practical in a real-world application. The approach is feasible for any language for which there is a computerized grammar and is less reliant on expensive and environmentally costly neural methods. We seek to contribute to Greener AI and to address global education challenges by raising the standards of inclusivity and engagement in grammar coaching.
comment: 5 pages, 4 figures, project summary for CEDI-SEPLN Seminar of the Spanish Society for Natural Language Processing at the 7th Spanish Conference on Informatics, June 19-20, 2024, A Coru\~na, Spain
☆ PaCoST: Paired Confidence Significance Testing for Benchmark Contamination Detection in Large Language Models
Large language models (LLMs) are known to be trained on vast amounts of data, which may unintentionally or intentionally include data from commonly used benchmarks. This inclusion can lead to cheatingly high scores on model leaderboards, yet result in disappointing performance in real-world applications. To address this benchmark contamination problem, we first propose a set of requirements that practical contamination detection methods should follow. Following these proposed requirements, we introduce PaCoST, a Paired Confidence Significance Testing to effectively detect benchmark contamination in LLMs. Our method constructs a counterpart for each piece of data with the same distribution, and performs statistical analysis of the corresponding confidence to test whether the model is significantly more confident under the original benchmark. We validate the effectiveness of PaCoST and apply it on popular open-source models and benchmarks. We find that almost all models and benchmarks we tested are suspected contaminated more or less. We finally call for new LLM evaluation methods.
☆ MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data
Large language models (LLMs) have significantly advanced natural language understanding and demonstrated strong problem-solving abilities. Despite these successes, most LLMs still struggle with solving mathematical problems due to the intricate reasoning required. This paper investigates the mathematical problem-solving capabilities of LLMs using the newly developed "MathOdyssey" dataset. The dataset includes diverse mathematical problems at high school and university levels, created by experts from notable institutions to rigorously test LLMs in advanced problem-solving scenarios and cover a wider range of subject areas. By providing the MathOdyssey dataset as a resource to the AI community, we aim to contribute to the understanding and improvement of AI capabilities in complex mathematical problem-solving. We conduct benchmarking on open-source models, such as Llama-3 and DBRX-Instruct, and closed-source models from the GPT series and Gemini models. Our results indicate that while LLMs perform well on routine and moderately difficult tasks, they face significant challenges with Olympiad-level problems and complex university-level questions. Our analysis shows a narrowing performance gap between open-source and closed-source models, yet substantial challenges remain, particularly with the most demanding problems. This study highlights the ongoing need for research to enhance the mathematical reasoning of LLMs. The dataset, results, and code are publicly available.
☆ Advancing Airport Tower Command Recognition: Integrating Squeeze-and-Excitation and Broadcasted Residual Learning
Accurate recognition of aviation commands is vital for flight safety and efficiency, as pilots must follow air traffic control instructions precisely. This paper addresses challenges in speech command recognition, such as noisy environments and limited computational resources, by advancing keyword spotting technology. We create a dataset of standardized airport tower commands, including routine and emergency instructions. We enhance broadcasted residual learning with squeeze-and-excitation and time-frame frequency-wise squeeze-and-excitation techniques, resulting in our BC-SENet model. This model focuses on crucial information with fewer parameters. Our tests on five keyword spotting models, including BC-SENet, demonstrate superior accuracy and efficiency. These findings highlight the effectiveness of our model advancements in improving speech command recognition for aviation safety and efficiency in noisy, high-stakes environments. Additionally, BC-SENet shows comparable performance on the common Google Speech Command dataset.
comment: Accepted by IALP 2024
☆ AI-native Memory: A Pathway from LLMs Towards AGI
Large language models (LLMs) have demonstrated the world with the sparks of artificial general intelligence (AGI). One opinion, especially from some startups working on LLMs, argues that an LLM with nearly unlimited context length can realize AGI. However, they might be too optimistic about the long-context capability of (existing) LLMs -- (1) Recent literature has shown that their effective context length is significantly smaller than their claimed context length; and (2) Our reasoning-in-a-haystack experiments further demonstrate that simultaneously finding the relevant information from a long context and conducting (simple) reasoning is nearly impossible. In this paper, we envision a pathway from LLMs to AGI through the integration of \emph{memory}. We believe that AGI should be a system where LLMs serve as core processors. In addition to raw data, the memory in this system would store a large number of important conclusions derived from reasoning processes. Compared with retrieval-augmented generation (RAG) that merely processing raw data, this approach not only connects semantically related information closer, but also simplifies complex inferences at the time of querying. As an intermediate stage, the memory will likely be in the form of natural language descriptions, which can be directly consumed by users too. Ultimately, every agent/person should have its own large personal model, a deep neural network model (thus \emph{AI-native}) that parameterizes and compresses all types of memory, even the ones cannot be described by natural languages. Finally, we discuss the significant potential of AI-native memory as the transformative infrastructure for (proactive) engagement, personalization, distribution, and social in the AGI era, as well as the incurred privacy and security challenges with preliminary solutions.
☆ S3: A Simple Strong Sample-effective Multimodal Dialog System
In this work, we present a conceptually simple yet powerful baseline for the multimodal dialog task, an S3 model, that achieves near state-of-the-art results on two compelling leaderboards: MMMU and AI Journey Contest 2023. The system is based on a pre-trained large language model, pre-trained modality encoders for image and audio, and a trainable modality projector. The proposed effective data mixture for training such an architecture demonstrates that a multimodal model based on a strong language model and trained on a small amount of multimodal data can perform efficiently in the task of multimodal dialog.
☆ MSR-86K: An Evolving, Multilingual Corpus with 86,300 Hours of Transcribed Audio for Speech Recognition Research
Recently, multilingual artificial intelligence assistants, exemplified by ChatGPT, have gained immense popularity. As a crucial gateway to human-computer interaction, multilingual automatic speech recognition (ASR) has also garnered significant attention, as evidenced by systems like Whisper. However, the proprietary nature of the training data has impeded researchers' efforts to study multilingual ASR. This paper introduces MSR-86K, an evolving, large-scale multilingual corpus for speech recognition research. The corpus is derived from publicly accessible videos on YouTube, comprising 15 languages and a total of 86,300 hours of transcribed ASR data. We also introduce how to use the MSR-86K corpus and other open-source corpora to train a robust multilingual ASR model that is competitive with Whisper. MSR-86K will be publicly released on HuggingFace, and we believe that such a large corpus will pave new avenues for research in multilingual ASR.
comment: Accepted by InterSpeech 2024
☆ FactFinders at CheckThat! 2024: Refining Check-worthy Statement Detection with LLMs through Data Pruning
The rapid dissemination of information through social media and the Internet has posed a significant challenge for fact-checking, among others in identifying check-worthy claims that fact-checkers should pay attention to, i.e. filtering claims needing fact-checking from a large pool of sentences. This challenge has stressed the need to focus on determining the priority of claims, specifically which claims are worth to be fact-checked. Despite advancements in this area in recent years, the application of large language models (LLMs), such as GPT, has only recently drawn attention in studies. However, many open-source LLMs remain underexplored. Therefore, this study investigates the application of eight prominent open-source LLMs with fine-tuning and prompt engineering to identify check-worthy statements from political transcriptions. Further, we propose a two-step data pruning approach to automatically identify high-quality training data instances for effective learning. The efficiency of our approach is demonstrated through evaluations on the English language dataset as part of the check-worthiness estimation task of CheckThat! 2024. Further, the experiments conducted with data pruning demonstrate that competitive performance can be achieved with only about 44\% of the training data. Our team ranked first in the check-worthiness estimation task in the English language.
☆ Hierarchical Context Pruning: Optimizing Real-World Code Completion with Repository-Level Pretrained Code LLMs
Some recently developed code large language models (Code LLMs) have been pre-trained on repository-level code data (Repo-Code LLMs), enabling these models to recognize repository structures and utilize cross-file information for code completion. However, in real-world development scenarios, simply concatenating the entire code repository often exceeds the context window limits of these Repo-Code LLMs, leading to significant performance degradation. In this study, we conducted extensive preliminary experiments and analyses on six Repo-Code LLMs. The results indicate that maintaining the topological dependencies of files and increasing the code file content in the completion prompts can improve completion accuracy; pruning the specific implementations of functions in all dependent files does not significantly reduce the accuracy of completions. Based on these findings, we proposed a strategy named Hierarchical Context Pruning (HCP) to construct completion prompts with high informational code content. The HCP models the code repository at the function level, maintaining the topological dependencies between code files while removing a large amount of irrelevant code content, significantly reduces the input length for repository-level code completion. We applied the HCP strategy in experiments with six Repo-Code LLMs, and the results demonstrate that our proposed method can significantly enhance completion accuracy while substantially reducing the length of input. Our code and data are available at https://github.com/Hambaobao/HCP-Coder.
☆ Sanskrit Knowledge-based Systems: Annotation and Computational Tools
We address the challenges and opportunities in the development of knowledge systems for Sanskrit, with a focus on question answering. By proposing a framework for the automated construction of knowledge graphs, introducing annotation tools for ontology-driven and general-purpose tasks, and offering a diverse collection of web-interfaces, tools, and software libraries, we have made significant contributions to the field of computational Sanskrit. These contributions not only enhance the accessibility and accuracy of Sanskrit text analysis but also pave the way for further advancements in knowledge representation and language processing. Ultimately, this research contributes to the preservation, understanding, and utilization of the rich linguistic information embodied in Sanskrit texts.
comment: PhD Thesis. 204 pages, 6 publications
☆ "Vorbeşti Româneşte?" A Recipe to Train Powerful Romanian LLMs with English Instructions
In recent years, Large Language Models (LLMs) have achieved almost human-like performance on various tasks. While some LLMs have been trained on multilingual data, most of the training data is in English; hence, their performance in English greatly exceeds other languages. To our knowledge, we are the first to collect and translate a large collection of texts, instructions, and benchmarks and train, evaluate, and release open-source LLMs tailored for Romanian. We evaluate our methods on four different categories, including academic benchmarks, MT-Bench (manually translated), and a professionally built historical, cultural, and social benchmark adapted to Romanian. We argue for the usefulness and high performance of RoLLMs by obtaining state-of-the-art results across the board. We publicly release all resources (i.e., data, training and evaluation code, models) to support and encourage research on Romanian LLMs while concurrently creating a generalizable recipe, adequate for other low or less-resourced languages.
comment: arXiv admin note: text overlap with arXiv:2405.07703
☆ Detecting Machine-Generated Texts: Not Just "AI vs Humans" and Explainability is Complicated
As LLMs rapidly advance, increasing concerns arise regarding risks about actual authorship of texts we see online and in real world. The task of distinguishing LLM-authored texts is complicated by the nuanced and overlapping behaviors of both machines and humans. In this paper, we challenge the current practice of considering LLM-generated text detection a binary classification task of differentiating human from AI. Instead, we introduce a novel ternary text classification scheme, adding an "undecided" category for texts that could be attributed to either source, and we show that this new category is crucial to understand how to make the detection result more explainable to lay users. This research shifts the paradigm from merely classifying to explaining machine-generated texts, emphasizing need for detectors to provide clear and understandable explanations to users. Our study involves creating four new datasets comprised of texts from various LLMs and human authors. Based on new datasets, we performed binary classification tests to ascertain the most effective SOTA detection methods and identified SOTA LLMs capable of producing harder-to-detect texts. We constructed a new dataset of texts generated by two top-performing LLMs and human authors, and asked three human annotators to produce ternary labels with explanation notes. This dataset was used to investigate how three top-performing SOTA detectors behave in new ternary classification context. Our results highlight why "undecided" category is much needed from the viewpoint of explainability. Additionally, we conducted an analysis of explainability of the three best-performing detectors and the explanation notes of the human annotators, revealing insights about the complexity of explainable detection of machine-generated texts. Finally, we propose guidelines for developing future detection systems with improved explanatory power.
comment: 19 pages, 2 figures
☆ LLaMIPa: An Incremental Discourse Parser
This paper provides the first discourse parsing experiments with a large language model (LLM) finetuned on corpora annotated in the style of SDRT (Asher, 1993; Asher and Lascarides, 2003). The result is a discourse parser, LLaMIPa (LLaMA Incremental Parser), which is able to more fully exploit discourse context, leading to substantial performance gains over approaches that use encoder-only models to provide local, context-sensitive representations of discourse units. Furthermore, it is able to process discourse data incrementally, which is essential for the eventual use of discourse information in downstream tasks.
comment: 12 pages, 2 figures
☆ Weak Reward Model Transforms Generative Models into Robust Causal Event Extraction Systems
The inherent ambiguity of cause and effect boundaries poses a challenge in evaluating causal event extraction tasks. Traditional metrics like Exact Match and BertScore poorly reflect model performance, so we trained evaluation models to approximate human evaluation, achieving high agreement. We used them to perform Reinforcement Learning with extraction models to align them with human preference, prioritising semantic understanding. We successfully explored our approach through multiple datasets, including transferring an evaluator trained on one dataset to another as a way to decrease the reliance on human-annotated data. In that vein, we also propose a weak-to-strong supervision method that uses a fraction of the annotated data to train an evaluation model while still achieving high performance in training an RL model. Our code is available at \url{https://github.com/oyarsa/event_extraction/tree/causal-event-extraction}.
comment: 13 pages, 6 figures, 6 tables
☆ Zero-shot prompt-based classification: topic labeling in times of foundation models in German Tweets
Filtering and annotating textual data are routine tasks in many areas, like social media or news analytics. Automating these tasks allows to scale the analyses wrt. speed and breadth of content covered and decreases the manual effort required. Due to technical advancements in Natural Language Processing, specifically the success of large foundation models, a new tool for automating such annotation processes by using a text-to-text interface given written guidelines without providing training samples has become available. In this work, we assess these advancements in-the-wild by empirically testing them in an annotation task on German Twitter data about social and political European crises. We compare the prompt-based results with our human annotation and preceding classification approaches, including Naive Bayes and a BERT-based fine-tuning/domain adaptation pipeline. Our results show that the prompt-based approach - despite being limited by local computation resources during the model selection - is comparable with the fine-tuned BERT but without any annotated training data. Our findings emphasize the ongoing paradigm shift in the NLP landscape, i.e., the unification of downstream tasks and elimination of the need for pre-labeled training data.
comment: 10 pages, 2 tables, 1 figure
☆ GUIDE: A Guideline-Guided Dataset for Instructional Video Comprehension IJCAI 2024
There are substantial instructional videos on the Internet, which provide us tutorials for completing various tasks. Existing instructional video datasets only focus on specific steps at the video level, lacking experiential guidelines at the task level, which can lead to beginners struggling to learn new tasks due to the lack of relevant experience. Moreover, the specific steps without guidelines are trivial and unsystematic, making it difficult to provide a clear tutorial. To address these problems, we present the GUIDE (Guideline-Guided) dataset, which contains 3.5K videos of 560 instructional tasks in 8 domains related to our daily life. Specifically, we annotate each instructional task with a guideline, representing a common pattern shared by all task-related videos. On this basis, we annotate systematic specific steps, including their associated guideline steps, specific step descriptions and timestamps. Our proposed benchmark consists of three sub-tasks to evaluate comprehension ability of models: (1) Step Captioning: models have to generate captions for specific steps from videos. (2) Guideline Summarization: models have to mine the common pattern in task-related videos and summarize a guideline from them. (3) Guideline-Guided Captioning: models have to generate captions for specific steps under the guide of guideline. We evaluate plenty of foundation models with GUIDE and perform in-depth analysis. Given the diversity and practicality of GUIDE, we believe that it can be used as a better benchmark for instructional video comprehension.
comment: IJCAI 2024
☆ Enhancing Data Privacy in Large Language Models through Private Association Editing
Large Language Models (LLMs) are powerful tools with extensive applications, but their tendency to memorize private information raises significant concerns as private data leakage can easily happen. In this paper, we introduce Private Association Editing (PAE), a novel defense approach for private data leakage. PAE is designed to effectively remove Personally Identifiable Information (PII) without retraining the model. Our approach consists of a four-step procedure: detecting memorized PII, applying PAE cards to mitigate memorization of private data, verifying resilience to targeted data extraction (TDE) attacks, and ensuring consistency in the post-edit LLMs. The versatility and efficiency of PAE, which allows for batch modifications, significantly enhance data privacy in LLMs. Experimental results demonstrate the effectiveness of PAE in mitigating private data leakage. We believe PAE will serve as a critical tool in the ongoing effort to protect data privacy in LLMs, encouraging the development of safer models for real-world applications.
☆ A Closer Look into Mixture-of-Experts in Large Language Models
Mixture-of-experts (MoE) is gaining increasing attention due to its unique properties and remarkable performance, especially for language tasks. By sparsely activating a subset of parameters for each token, MoE architecture could increase the model size without sacrificing computational efficiency, achieving a better trade-off between performance and training costs. However, the underlying mechanism of MoE still lacks further exploration, and its modularization degree remains questionable. In this paper, we make an initial attempt to understand the inner workings of MoE-based large language models. Concretely, we comprehensively study the parametric and behavioral features of three recent MoE-based models and reveal some intriguing observations, including (1) Neurons act like fine-grained experts. (2) The router of MoE usually selects experts with larger output norms. (3) The expert diversity increases as the layer increases, while the last layer is an outlier. Based on the observations, we also provide suggestions for a broad spectrum of MoE practitioners, such as router design and expert allocation. We hope this work could shed light on future research on the MoE framework and other modular architectures. Code is available at https://github.com/kamanphoebe/Look-into-MoEs.
☆ SEED: Accelerating Reasoning Tree Construction via Scheduled Speculative Decoding
Large Language Models (LLMs) demonstrate remarkable emergent abilities across various tasks, yet fall short of complex reasoning and planning tasks. The tree-search-based reasoning methods address this by surpassing the capabilities of chain-of-thought prompting, encouraging exploration of intermediate steps. However, such methods introduce significant inference latency due to the systematic exploration and evaluation of multiple thought paths. This paper introduces SeeD, a novel and efficient inference framework to optimize runtime speed and GPU memory management concurrently. By employing a scheduled speculative execution, SeeD efficiently handles multiple iterations for the thought generation and the state evaluation, leveraging a rounds-scheduled strategy to manage draft model dispatching. Extensive experimental evaluations on three reasoning datasets demonstrate superior speedup performance of SeeD, providing a viable path for batched inference in training-free speculative decoding.
☆ Methodology of Adapting Large English Language Models for Specific Cultural Contexts
The rapid growth of large language models(LLMs) has emerged as a prominent trend in the field of artificial intelligence. However, current state-of-the-art LLMs are predominantly based on English. They encounter limitations when directly applied to tasks in specific cultural domains, due to deficiencies in domain-specific knowledge and misunderstandings caused by differences in cultural values. To address this challenge, our paper proposes a rapid adaptation method for large models in specific cultural contexts, which leverages instruction-tuning based on specific cultural knowledge and safety values data. Taking Chinese as the specific cultural context and utilizing the LLaMA3-8B as the experimental English LLM, the evaluation results demonstrate that the adapted LLM significantly enhances its capabilities in domain-specific knowledge and adaptability to safety values, while maintaining its original expertise advantages.
comment: 11 pages, 2 figures
☆ Selective Prompting Tuning for Personalized Conversations with LLMs ACL 2024
In conversational AI, personalizing dialogues with persona profiles and contextual understanding is essential. Despite large language models' (LLMs) improved response coherence, effective persona integration remains a challenge. In this work, we first study two common approaches for personalizing LLMs: textual prompting and direct fine-tuning. We observed that textual prompting often struggles to yield responses that are similar to the ground truths in datasets, while direct fine-tuning tends to produce repetitive or overly generic replies. To alleviate those issues, we propose \textbf{S}elective \textbf{P}rompt \textbf{T}uning (SPT), which softly prompts LLMs for personalized conversations in a selective way. Concretely, SPT initializes a set of soft prompts and uses a trainable dense retriever to adaptively select suitable soft prompts for LLMs according to different input contexts, where the prompt retriever is dynamically updated through feedback from the LLMs. Additionally, we propose context-prompt contrastive learning and prompt fusion learning to encourage the SPT to enhance the diversity of personalized conversations. Experiments on the CONVAI2 dataset demonstrate that SPT significantly enhances response diversity by up to 90\%, along with improvements in other critical performance indicators. Those results highlight the efficacy of SPT in fostering engaging and personalized dialogue generation. The SPT model code (https://github.com/hqsiswiliam/SPT) is publicly available for further exploration.
comment: Accepted to ACL 2024 findings
☆ UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs
Managing long texts is challenging for large language models (LLMs) due to limited context window sizes. This study introduces UIO-LLMs, an unbiased incremental optimization approach for memory-enhanced transformers under long-context settings. We initially conceptualize the process as a streamlined encoder-decoder framework where the weights-shared encoder and decoder respectively encapsulate a context segment into memories and leverage these memories to predict outputs of the subsequent segment. Subsequently, by treating our memory-enhanced transformers as fully-connected recurrent neural networks (RNNs), we refine the training process using the Truncated Backpropagation Through Time (TBPTT) algorithm, which incorporates innovative incremental optimization techniques. These techniques not only diminish time complexity but also address the bias in gradient computation through an unbiased optimization process. UIO-LLMs successfully handle long context, such as extending the context window of Llama2-7b-chat from 4K to 100K tokens with minimal 2% additional parameters, while keeping the inference cost nearly linear as context length increases.
☆ NeBuLa: A discourse aware Minecraft Builder
When engaging in collaborative tasks, humans efficiently exploit the semantic structure of a conversation to optimize verbal and nonverbal interactions. But in recent "language to code" or "language to action" models, this information is lacking. We show how incorporating the prior discourse and nonlinguistic context of a conversation situated in a nonlinguistic environment can improve the "language to action" component of such interactions. We fine tune an LLM to predict actions based on prior context; our model, NeBuLa, doubles the net-action F1 score over the baseline on this task of Jayannavar et al.(2020). We also investigate our model's ability to construct shapes and understand location descriptions using a synthetic dataset.
comment: 10 pages, 3 figures
☆ LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference
Long-context Multimodal Large Language Models (MLLMs) demand substantial computational resources for inference as the growth of their multimodal Key-Value (KV) cache, in response to increasing input lengths, challenges memory and time efficiency. Unlike single-modality LLMs that manage only textual contexts, the KV cache of long-context MLLMs includes representations from multiple images with temporal and spatial relationships and related textual contexts. The predominance of image tokens means traditional optimizations for LLMs' KV caches are unsuitable for multimodal long-context settings, and no prior works have addressed this challenge. In this work, we introduce LOOK-M, a pioneering, fine-tuning-free approach that efficiently reduces the multimodal KV cache size while maintaining performance comparable to a full cache. We observe that during prompt prefill, the model prioritizes more textual attention over image features, and based on the multimodal interaction observation, a new proposed text-prior method is explored to compress the KV cache. Furthermore, to mitigate the degradation of image contextual information, we propose several compensatory strategies using KV pairs merging. LOOK-M demonstrates that with a significant reduction in KV Cache memory usage, such as reducing it by 80% in some cases, it not only achieves up to 1.5x faster decoding but also maintains or even enhances performance across a variety of long context multimodal tasks.
☆ Automatic Speech Recognition for Hindi
Automatic speech recognition (ASR) is a key area in computational linguistics, focusing on developing technologies that enable computers to convert spoken language into text. This field combines linguistics and machine learning. ASR models, which map speech audio to transcripts through supervised learning, require handling real and unrestricted text. Text-to-speech systems directly work with real text, while ASR systems rely on language models trained on large text corpora. High-quality transcribed data is essential for training predictive models. The research involved two main components: developing a web application and designing a web interface for speech recognition. The web application, created with JavaScript and Node.js, manages large volumes of audio files and their transcriptions, facilitating collaborative human correction of ASR transcripts. It operates in real-time using a client-server architecture. The web interface for speech recognition records 16 kHz mono audio from any device running the web app, performs voice activity detection (VAD), and sends the audio to the recognition engine. VAD detects human speech presence, aiding efficient speech processing and reducing unnecessary processing during non-speech intervals, thus saving computation and network bandwidth in VoIP applications. The final phase of the research tested a neural network for accurately aligning the speech signal to hidden Markov model (HMM) states. This included implementing a novel backpropagation method that utilizes prior statistics of node co-activations.
☆ Assessing "Implicit" Retrieval Robustness of Large Language Models
Retrieval-augmented generation has gained popularity as a framework to enhance large language models with external knowledge. However, its effectiveness hinges on the retrieval robustness of the model. If the model lacks retrieval robustness, its performance is constrained by the accuracy of the retriever, resulting in significant compromises when the retrieved context is irrelevant. In this paper, we evaluate the "implicit" retrieval robustness of various large language models, instructing them to directly output the final answer without explicitly judging the relevance of the retrieved context. Our findings reveal that fine-tuning on a mix of gold and distracting context significantly enhances the model's robustness to retrieval inaccuracies, while still maintaining its ability to extract correct answers when retrieval is accurate. This suggests that large language models can implicitly handle relevant or irrelevant retrieved context by learning solely from the supervision of the final answer in an end-to-end manner. Introducing an additional process for explicit relevance judgment can be unnecessary and disrupts the end-to-end approach.
☆ ConvoCache: Smart Re-Use of Chatbot Responses
We present ConvoCache, a conversational caching system that solves the problem of slow and expensive generative AI models in spoken chatbots. ConvoCache finds a semantically similar prompt in the past and reuses the response. In this paper we evaluate ConvoCache on the DailyDialog dataset. We find that ConvoCache can apply a UniEval coherence threshold of 90% and respond to 89% of prompts using the cache with an average latency of 214ms, replacing LLM and voice synthesis that can take over 1s. To further reduce latency we test prefetching and find limited usefulness. Prefetching with 80% of a request leads to a 63% hit rate, and a drop in overall coherence. ConvoCache can be used with any chatbot to reduce costs by reducing usage of generative AI by up to 89%.
comment: Accepted to appear at Interspeech 2024
☆ ResumeAtlas: Revisiting Resume Classification with Large-Scale Datasets and Large Language Models
The increasing reliance on online recruitment platforms coupled with the adoption of AI technologies has highlighted the critical need for efficient resume classification methods. However, challenges such as small datasets, lack of standardized resume templates, and privacy concerns hinder the accuracy and effectiveness of existing classification models. In this work, we address these challenges by presenting a comprehensive approach to resume classification. We curated a large-scale dataset of 13,389 resumes from diverse sources and employed Large Language Models (LLMs) such as BERT and Gemma1.1 2B for classification. Our results demonstrate significant improvements over traditional machine learning approaches, with our best model achieving a top-1 accuracy of 92\% and a top-5 accuracy of 97.5\%. These findings underscore the importance of dataset quality and advanced model architectures in enhancing the accuracy and robustness of resume classification systems, thus advancing the field of online recruitment practices.
comment: 8 pages, 6 figures, 1 table, 6th International Conference on AI in Computational Linguistics
☆ Poisoned LangChain: Jailbreak LLMs by LangChain
With the development of natural language processing (NLP), large language models (LLMs) are becoming increasingly popular. LLMs are integrating more into everyday life, raising public concerns about their security vulnerabilities. Consequently, the security of large language models is becoming critically important. Currently, the techniques for attacking and defending against LLMs are continuously evolving. One significant method type of attack is the jailbreak attack, which designed to evade model safety mechanisms and induce the generation of inappropriate content. Existing jailbreak attacks primarily rely on crafting inducement prompts for direct jailbreaks, which are less effective against large models with robust filtering and high comprehension abilities. Given the increasing demand for real-time capabilities in large language models, real-time updates and iterations of new knowledge have become essential. Retrieval-Augmented Generation (RAG), an advanced technique to compensate for the model's lack of new knowledge, is gradually becoming mainstream. As RAG enables the model to utilize external knowledge bases, it provides a new avenue for jailbreak attacks. In this paper, we conduct the first work to propose the concept of indirect jailbreak and achieve Retrieval-Augmented Generation via LangChain. Building on this, we further design a novel method of indirect jailbreak attack, termed Poisoned-LangChain (PLC), which leverages a poisoned external knowledge base to interact with large language models, thereby causing the large models to generate malicious non-compliant dialogues.We tested this method on six different large language models across three major categories of jailbreak issues. The experiments demonstrate that PLC successfully implemented indirect jailbreak attacks under three different scenarios, achieving success rates of 88.56%, 79.04%, and 82.69% respectively.
comment: 6 pages,2 figures,This paper is a submission to ACM TURC. It has been accepted by the editor of the organizer
☆ ArzEn-LLM: Code-Switched Egyptian Arabic-English Translation and Speech Recognition Using LLMs
Motivated by the widespread increase in the phenomenon of code-switching between Egyptian Arabic and English in recent times, this paper explores the intricacies of machine translation (MT) and automatic speech recognition (ASR) systems, focusing on translating code-switched Egyptian Arabic-English to either English or Egyptian Arabic. Our goal is to present the methodologies employed in developing these systems, utilizing large language models such as LLama and Gemma. In the field of ASR, we explore the utilization of the Whisper model for code-switched Egyptian Arabic recognition, detailing our experimental procedures including data preprocessing and training techniques. Through the implementation of a consecutive speech-to-text translation system that integrates ASR with MT, we aim to overcome challenges posed by limited resources and the unique characteristics of the Egyptian Arabic dialect. Evaluation against established metrics showcases promising results, with our methodologies yielding a significant improvement of $56\%$ in English translation over the state-of-the-art and $9.3\%$ in Arabic translation. Since code-switching is deeply inherent in spoken languages, it is crucial that ASR systems can effectively handle this phenomenon. This capability is crucial for enabling seamless interaction in various domains, including business negotiations, cultural exchanges, and academic discourse. Our models and code are available as open-source resources. Code: \url{http://github.com/ahmedheakl/arazn-llm}}, Models: \url{http://huggingface.co/collections/ahmedheakl/arazn-llm-662ceaf12777656607b9524e}.
comment: 9 pages, 4 figures, 5 tables, 6th International Conference on AI in Computational Linguistics
☆ SafeAligner: Safety Alignment against Jailbreak Attacks via Response Disparity Guidance
As the development of large language models (LLMs) rapidly advances, securing these models effectively without compromising their utility has become a pivotal area of research. However, current defense strategies against jailbreak attacks (i.e., efforts to bypass security protocols) often suffer from limited adaptability, restricted general capability, and high cost. To address these challenges, we introduce SafeAligner, a methodology implemented at the decoding stage to fortify defenses against jailbreak attacks. We begin by developing two specialized models: the Sentinel Model, which is trained to foster safety, and the Intruder Model, designed to generate riskier responses. SafeAligner leverages the disparity in security levels between the responses from these models to differentiate between harmful and beneficial tokens, effectively guiding the safety alignment by altering the output token distribution of the target model. Extensive experiments show that SafeAligner can increase the likelihood of beneficial tokens, while reducing the occurrence of harmful ones, thereby ensuring secure alignment with minimal loss to generality.
☆ BADGE: BADminton report Generation and Evaluation with LLM IJCAI 2024
Badminton enjoys widespread popularity, and reports on matches generally include details such as player names, game scores, and ball types, providing audiences with a comprehensive view of the games. However, writing these reports can be a time-consuming task. This challenge led us to explore whether a Large Language Model (LLM) could automate the generation and evaluation of badminton reports. We introduce a novel framework named BADGE, designed for this purpose using LLM. Our method consists of two main phases: Report Generation and Report Evaluation. Initially, badminton-related data is processed by the LLM, which then generates a detailed report of the match. We tested different Input Data Types, In-Context Learning (ICL), and LLM, finding that GPT-4 performs best when using CSV data type and the Chain of Thought prompting. Following report generation, the LLM evaluates and scores the reports to assess their quality. Our comparisons between the scores evaluated by GPT-4 and human judges show a tendency to prefer GPT-4 generated reports. Since the application of LLM in badminton reporting remains largely unexplored, our research serves as a foundational step for future advancements in this area. Moreover, our method can be extended to other sports games, thereby enhancing sports promotion. For more details, please refer to https://github.com/AndyChiangSH/BADGE.
comment: Accepted by IJCAI 2024 Workshop: The 2nd International Workshop on Intelligent Technologies for Precision Sports Science (IT4PSS)
☆ Token-Weighted RNN-T for Learning from Flawed Data
ASR models are commonly trained with the cross-entropy criterion to increase the probability of a target token sequence. While optimizing the probability of all tokens in the target sequence is sensible, one may want to de-emphasize tokens that reflect transcription errors. In this work, we propose a novel token-weighted RNN-T criterion that augments the RNN-T objective with token-specific weights. The new objective is used for mitigating accuracy loss from transcriptions errors in the training data, which naturally appear in two settings: pseudo-labeling and human annotation errors. Experiments results show that using our method for semi-supervised learning with pseudo-labels leads to a consistent accuracy improvement, up to 38% relative. We also analyze the accuracy degradation resulting from different levels of WER in the reference transcription, and show that token-weighted RNN-T is suitable for overcoming this degradation, recovering 64%-99% of the accuracy loss.
☆ Shimo Lab at "Discharge Me!": Discharge Summarization by Prompt-Driven Concatenation of Electronic Health Record Sections ACL2024
In this paper, we present our approach to the shared task "Discharge Me!" at the BioNLP Workshop 2024. The primary goal of this task is to reduce the time and effort clinicians spend on writing detailed notes in the electronic health record (EHR). Participants develop a pipeline to generate the "Brief Hospital Course" and "Discharge Instructions" sections from the EHR. Our approach involves a first step of extracting the relevant sections from the EHR. We then add explanatory prompts to these sections and concatenate them with separate tokens to create the input text. To train a text generation model, we perform LoRA fine-tuning on the ClinicalT5-large model. On the final test data, our approach achieved a ROUGE-1 score of $0.394$, which is comparable to the top solutions.
comment: BioNLP @ ACL2024
☆ LLM-Driven Multimodal Opinion Expression Identification
Opinion Expression Identification (OEI) is essential in NLP for applications ranging from voice assistants to depression diagnosis. This study extends OEI to encompass multimodal inputs, underlining the significance of auditory cues in delivering emotional subtleties beyond the capabilities of text. We introduce a novel multimodal OEI (MOEI) task, integrating text and speech to mirror real-world scenarios. Utilizing CMU MOSEI and IEMOCAP datasets, we construct the CI-MOEI dataset. Additionally, Text-to-Speech (TTS) technology is applied to the MPQA dataset to obtain the CIM-OEI dataset. We design a template for the OEI task to take full advantage of the generative power of large language models (LLMs). Advancing further, we propose an LLM-driven method STOEI, which combines speech and text modal to identify opinion expressions. Our experiments demonstrate that MOEI significantly improves the performance while our method outperforms existing methods by 9.20\% and obtains SOTA results.
comment: 6 pages, 3 Figures
☆ EHR-Based Mobile and Web Platform for Chronic Disease Risk Prediction Using Large Language Multimodal Models
Traditional diagnosis of chronic diseases involves in-person consultations with physicians to identify the disease. However, there is a lack of research focused on predicting and developing application systems using clinical notes and blood test values. We collected five years of Electronic Health Records (EHRs) from Taiwan's hospital database between 2017 and 2021 as an AI database. Furthermore, we developed an EHR-based chronic disease prediction platform utilizing Large Language Multimodal Models (LLMMs), successfully integrating with frontend web and mobile applications for prediction. This prediction platform can also connect to the hospital's backend database, providing physicians with real-time risk assessment diagnostics. The demonstration link can be found at https://www.youtube.com/watch?v=oqmL9DEDFgA.
☆ Multilingual Knowledge Graph Completion from Pretrained Language Models with Knowledge Constraints ACL 2023
Multilingual Knowledge Graph Completion (mKGC) aim at solving queries like (h, r, ?) in different languages by reasoning a tail entity t thus improving multilingual knowledge graphs. Previous studies leverage multilingual pretrained language models (PLMs) and the generative paradigm to achieve mKGC. Although multilingual pretrained language models contain extensive knowledge of different languages, its pretraining tasks cannot be directly aligned with the mKGC tasks. Moreover, the majority of KGs and PLMs currently available exhibit a pronounced English-centric bias. This makes it difficult for mKGC to achieve good results, particularly in the context of low-resource languages. To overcome previous problems, this paper introduces global and local knowledge constraints for mKGC. The former is used to constrain the reasoning of answer entities, while the latter is used to enhance the representation of query contexts. The proposed method makes the pretrained model better adapt to the mKGC task. Experimental results on public datasets demonstrate that our method outperforms the previous SOTA on Hits@1 and Hits@10 by an average of 12.32% and 16.03%, which indicates that our proposed method has significant enhancement on mKGC.
comment: 11 pages, ACL 2023
☆ Octo-planner: On-device Language Model for Planner-Action Agents
AI agents have become increasingly significant in various domains, enabling autonomous decision-making and problem-solving. To function effectively, these agents require a planning process that determines the best course of action and then executes the planned actions. In this paper, we present an efficient on-device Planner-Action framework that separates planning and action execution into two distinct components: a planner agent based on Phi-3 Mini, a 3.8 billion parameter LLM optimized for edge devices, and an action agent using the Octopus model for function execution. The planner agent first responds to user queries by decomposing tasks into a sequence of sub-steps, which are then executed by the action agent. To optimize performance on resource-constrained devices, we employ model fine-tuning instead of in-context learning, reducing computational costs and energy consumption while improving response times. Our approach involves using GPT-4 to generate diverse planning queries and responses based on available functions, with subsequent validations to ensure data quality. We fine-tune the Phi-3 Mini model on this curated dataset, achieving a 97\% success rate in our in-domain test environment. To address multi-domain planning challenges, we developed a multi-LoRA training method that merges weights from LoRAs trained on distinct function subsets. This approach enables flexible handling of complex, multi-domain queries while maintaining computational efficiency on resource-constrained devices. To support further research, we have open-sourced our model weights at \url{https://huggingface.co/NexaAIDev/octopus-planning}. For the demo, please refer to \url{https://www.nexa4ai.com/octo-planner}.
☆ Self-Training with Pseudo-Label Scorer for Aspect Sentiment Quad Prediction ACL 2024
Aspect Sentiment Quad Prediction (ASQP) aims to predict all quads (aspect term, aspect category, opinion term, sentiment polarity) for a given review, which is the most representative and challenging task in aspect-based sentiment analysis. A key challenge in the ASQP task is the scarcity of labeled data, which limits the performance of existing methods. To tackle this issue, we propose a self-training framework with a pseudo-label scorer, wherein a scorer assesses the match between reviews and their pseudo-labels, aiming to filter out mismatches and thereby enhance the effectiveness of self-training. We highlight two critical aspects to ensure the scorer's effectiveness and reliability: the quality of the training dataset and its model architecture. To this end, we create a human-annotated comparison dataset and train a generative model on it using ranking-based objectives. Extensive experiments on public ASQP datasets reveal that using our scorer can greatly and consistently improve the effectiveness of self-training. Moreover, we explore the possibility of replacing humans with large language models for comparison dataset annotation, and experiments demonstrate its feasibility. We release our code and data at https://github.com/HITSZ-HLT/ST-w-Scorer-ABSA .
comment: Accepted to ACL 2024 Main Conference
☆ Large Language Models for Cuffless Blood Pressure Measurement From Wearable Biosignals
Large language models (LLMs) have captured significant interest from both academia and industry due to their impressive performance across various textual tasks. However, the potential of LLMs to analyze physiological time-series data remains an emerging research field. Particularly, there is a notable gap in the utilization of LLMs for analyzing wearable biosignals to achieve cuffless blood pressure (BP) measurement, which is critical for the management of cardiovascular diseases. This paper presents the first work to explore the capacity of LLMs to perform cuffless BP estimation based on wearable biosignals. We extracted physiological features from electrocardiogram (ECG) and photoplethysmogram (PPG) signals and designed context-enhanced prompts by combining these features with BP domain knowledge and user information. Subsequently, we adapted LLMs to BP estimation tasks through instruction tuning. To evaluate the proposed approach, we conducted assessments of ten advanced LLMs using a comprehensive public dataset of wearable biosignals from 1,272 participants. The experimental results demonstrate that the optimally fine-tuned LLM significantly surpasses conventional task-specific baselines, achieving an estimation error of 0.00 $\pm$ 9.25 mmHg for systolic BP and 1.29 $\pm$ 6.37 mmHg for diastolic BP. Notably, the ablation studies highlight the benefits of our context enhancement strategy, leading to an 8.9% reduction in mean absolute error for systolic BP estimation. This paper pioneers the exploration of LLMs for cuffless BP measurement, providing a potential solution to enhance the accuracy of cuffless BP measurement.
☆ Exploring Energy-Based Models for Out-of-Distribution Detection in Dialect Identification
The diverse nature of dialects presents challenges for models trained on specific linguistic patterns, rendering them susceptible to errors when confronted with unseen or out-of-distribution (OOD) data. This study introduces a novel margin-enhanced joint energy model (MEJEM) tailored specifically for OOD detection in dialects. By integrating a generative model and the energy margin loss, our approach aims to enhance the robustness of dialect identification systems. Furthermore, we explore two OOD scores for OOD dialect detection, and our findings conclusively demonstrate that the energy score outperforms the softmax score. Leveraging Sharpness-Aware Minimization to optimize the training process of the joint model, we enhance model generalization by minimizing both loss and sharpness. Experiments conducted on dialect identification tasks validate the efficacy of Energy-Based Models and provide valuable insights into their performance.
☆ Evaluating Quality of Answers for Retrieval-Augmented Generation: A Strong LLM Is All You Need
We present a comprehensive evaluation of answer quality in Retrieval-Augmented Generation (RAG) applications using vRAG-Eval, a novel grading system that is designed to assess correctness, completeness, and honesty. We further map the grading of quality aspects aforementioned into a binary score, indicating an accept or reject decision, mirroring the intuitive "thumbs-up" or "thumbs-down" gesture commonly used in chat applications. This approach suits factual business settings where a clear decision opinion is essential. Our assessment applies vRAG-Eval to two Large Language Models (LLMs), evaluating the quality of answers generated by a vanilla RAG application. We compare these evaluations with human expert judgments and find a substantial alignment between GPT-4's assessments and those of human experts, reaching 83% agreement on accept or reject decisions. This study highlights the potential of LLMs as reliable evaluators in closed-domain, closed-ended settings, particularly when human evaluations require significant resources.
comment: 12 pages, 6 figures, 12 tables
☆ AdaZeta: Adaptive Zeroth-Order Tensor-Train Adaption for Memory-Efficient Large Language Models Fine-Tuning
Fine-tuning large language models (LLMs) has achieved remarkable performance across various natural language processing tasks, yet it demands more and more memory as model sizes keep growing. To address this issue, the recently proposed Memory-efficient Zeroth-order (MeZO) methods attempt to fine-tune LLMs using only forward passes, thereby avoiding the need for a backpropagation graph. However, significant performance drops and a high risk of divergence have limited their widespread adoption. In this paper, we propose the Adaptive Zeroth-order Tensor-Train Adaption (AdaZeta) framework, specifically designed to improve the performance and convergence of the ZO methods. To enhance dimension-dependent ZO estimation accuracy, we introduce a fast-forward, low-parameter tensorized adapter. To tackle the frequently observed divergence issue in large-scale ZO fine-tuning tasks, we propose an adaptive query number schedule that guarantees convergence. Detailed theoretical analysis and extensive experimental results on Roberta-Large and Llama-2-7B models substantiate the efficacy of our AdaZeta framework in terms of accuracy, memory efficiency, and convergence speed.
☆ Improving Entity Recognition Using Ensembles of Deep Learning and Fine-tuned Large Language Models: A Case Study on Adverse Event Extraction from Multiple Sources
Adverse event (AE) extraction following COVID-19 vaccines from text data is crucial for monitoring and analyzing the safety profiles of immunizations. Traditional deep learning models are adept at learning intricate feature representations and dependencies in sequential data, but often require extensive labeled data. In contrast, large language models (LLMs) excel in understanding contextual information, but exhibit unstable performance on named entity recognition tasks, possibly due to their broad but unspecific training. This study aims to evaluate the effectiveness of LLMs and traditional deep learning models in AE extraction, and to assess the impact of ensembling these models on performance. In this study, we utilized reports and posts from the VAERS (n=621), Twitter (n=9,133), and Reddit (n=131) as our corpora. Our goal was to extract three types of entities: "vaccine", "shot", and "ae". We explored and fine-tuned (except GPT-4) multiple LLMs, including GPT-2, GPT-3.5, GPT-4, and Llama-2, as well as traditional deep learning models like RNN and BioBERT. To enhance performance, we created ensembles of the three models with the best performance. For evaluation, we used strict and relaxed F1 scores to evaluate the performance for each entity type, and micro-average F1 was used to assess the overall performance. The ensemble model achieved the highest performance in "vaccine", "shot", and "ae" with strict F1-scores of 0.878, 0.930, and 0.925, respectively, along with a micro-average score of 0.903. In conclusion, this study demonstrates the effectiveness and robustness of ensembling fine-tuned traditional deep learning models and LLMs, for extracting AE-related information. This study contributes to the advancement of biomedical natural language processing, providing valuable insights into improving AE extraction from text data for pharmacovigilance and public health surveillance.
☆ PharmGPT: Domain-Specific Large Language Models for Bio-Pharmaceutical and Chemistry
Large language models (LLMs) have revolutionized Natural Language Processing (NLP) by by minimizing the need for complex feature engineering. However, the application of LLMs in specialized domains like biopharmaceuticals and chemistry remains largely unexplored. These fields are characterized by intricate terminologies, specialized knowledge, and a high demand for precision areas where general purpose LLMs often fall short. In this study, we introduce PharmGPT, a suite of multilingual LLMs with 13 billion and 70 billion parameters, specifically trained on a comprehensive corpus of hundreds of billions of tokens tailored to the Bio-Pharmaceutical and Chemical sectors. Our evaluation shows that PharmGPT matches or surpasses existing general models on key benchmarks, such as NAPLEX, demonstrating its exceptional capability in domain-specific tasks. This advancement establishes a new benchmark for LLMs in the Bio-Pharmaceutical and Chemical fields, addressing the existing gap in specialized language modeling. Furthermore, this suggests a promising path for enhanced research and development in these specialized areas, paving the way for more precise and effective applications of NLP in specialized domains.
☆ LLMs for Doctors: Leveraging Medical LLMs to Assist Doctors, Not Replace Them
The recent success of Large Language Models (LLMs) has had a significant impact on the healthcare field, providing patients with medical advice, diagnostic information, and more. However, due to a lack of professional medical knowledge, patients are easily misled by generated erroneous information from LLMs, which may result in serious medical problems. To address this issue, we focus on tuning the LLMs to be medical assistants who collaborate with more experienced doctors. We first conduct a two-stage survey by inspiration-feedback to gain a broad understanding of the real needs of doctors for medical assistants. Based on this, we construct a Chinese medical dataset called DoctorFLAN to support the entire workflow of doctors, which includes 92K Q\&A samples from 22 tasks and 27 specialists. Moreover, we evaluate LLMs in doctor-oriented scenarios by constructing the DoctorFLAN-\textit{test} containing 550 single-turn Q\&A and DotaBench containing 74 multi-turn conversations. The evaluation results indicate that being a medical assistant still poses challenges for existing open-source models, but DoctorFLAN can help them significantly. It demonstrates that the doctor-oriented dataset and benchmarks we construct can complement existing patient-oriented work and better promote medical LLMs research.
☆ Automated Clinical Data Extraction with Knowledge Conditioned LLMs
The extraction of lung lesion information from clinical and medical imaging reports is crucial for research on and clinical care of lung-related diseases. Large language models (LLMs) can be effective at interpreting unstructured text in reports, but they often hallucinate due to a lack of domain-specific knowledge, leading to reduced accuracy and posing challenges for use in clinical settings. To address this, we propose a novel framework that aligns generated internal knowledge with external knowledge through in-context learning (ICL). Our framework employs a retriever to identify relevant units of internal or external knowledge and a grader to evaluate the truthfulness and helpfulness of the retrieved internal-knowledge rules, to align and update the knowledge bases. Our knowledge-conditioned approach also improves the accuracy and reliability of LLM outputs by addressing the extraction task in two stages: (i) lung lesion finding detection and primary structured field parsing, followed by (ii) further parsing of lesion description text into additional structured fields. Experiments with expert-curated test datasets demonstrate that this ICL approach can increase the F1 score for key fields (lesion size, margin and solidity) by an average of 12.9% over existing ICL methods.
☆ Decoding with Limited Teacher Supervision Requires Understanding When to Trust the Teacher
How can sLLMs efficiently utilize the supervision of LLMs to improve their generative quality? This question has been well studied in scenarios where there is no restriction on the number of LLM supervisions one can use, giving birth to many decoding algorithms that utilize supervision without further training. However, it is still unclear what is an effective strategy under the limited supervision scenario, where we assume that no more than a few tokens can be generated by LLMs. To this end, we develop an algorithm to effectively aggregate the sLLM and LLM predictions on initial tokens so that the generated tokens can more accurately condition the subsequent token generation by sLLM only. Critically, we find that it is essential to adaptively overtrust or disregard the LLM prediction based on the confidence of the sLLM. Through our experiments on a wide range of models and datasets, we demonstrate that our method provides a consistent improvement over conventional decoding strategies.
comment: preprint
☆ Catching Chameleons: Detecting Evolving Disinformation Generated using Large Language Models
Despite recent advancements in detecting disinformation generated by large language models (LLMs), current efforts overlook the ever-evolving nature of this disinformation. In this work, we investigate a challenging yet practical research problem of detecting evolving LLM-generated disinformation. Disinformation evolves constantly through the rapid development of LLMs and their variants. As a consequence, the detection model faces significant challenges. First, it is inefficient to train separate models for each disinformation generator. Second, the performance decreases in scenarios when evolving LLM-generated disinformation is encountered in sequential order. To address this problem, we propose DELD (Detecting Evolving LLM-generated Disinformation), a parameter-efficient approach that jointly leverages the general fact-checking capabilities of pre-trained language models (PLM) and the independent disinformation generation characteristics of various LLMs. In particular, the learned characteristics are concatenated sequentially to facilitate knowledge accumulation and transformation. DELD addresses the issue of label scarcity by integrating the semantic embeddings of disinformation with trainable soft prompts to elicit model-specific knowledge. Our experiments show that \textit{DELD} significantly outperforms state-of-the-art methods. Moreover, our method provides critical insights into the unique patterns of disinformation generation across different LLMs, offering valuable perspectives in this line of research.
comment: 10 pages, 5 figures
☆ Explicit Diversity Conditions for Effective Question Answer Generation with Large Language Models COLING 2024
Question Answer Generation (QAG) is an effective data augmentation technique to improve the accuracy of question answering systems, especially in low-resource domains. While recent pretrained and large language model-based QAG methods have made substantial progress, they face the critical issue of redundant QA pair generation, affecting downstream QA systems. Implicit diversity techniques such as sampling and diverse beam search are proven effective solutions but often yield smaller diversity. We present explicit diversity conditions for QAG, focusing on spatial aspects, question types, and entities, substantially increasing diversity in QA generation. Our work emphasizes the need of explicit diversity conditions for generating diverse question-answer synthetic data by showing significant improvements in downstream QA task over existing widely adopted implicit diversity techniques. In particular, generated QA pairs from explicit diversity conditions when used to train the downstream QA model results in an average 4.1% exact match and 4.5% F1 improvement over QAG from implicit sampling techniques on SQuADDU. Our work emphasizes the need for explicit diversity conditions even more in low-resource datasets (SubjQA), where average downstream QA performance improvements are around 12% EM.
comment: Published at COLING 2024
☆ Multi-step Knowledge Retrieval and Inference over Unstructured Data
The advent of Large Language Models (LLMs) and Generative AI has revolutionized natural language applications across various domains. However, high-stakes decision-making tasks in fields such as medical, legal and finance require a level of precision, comprehensiveness, and logical consistency that pure LLM or Retrieval-Augmented-Generation (RAG) approaches often fail to deliver. At Elemental Cognition (EC), we have developed a neuro-symbolic AI platform to tackle these problems. The platform integrates fine-tuned LLMs for knowledge extraction and alignment with a robust symbolic reasoning engine for logical inference, planning and interactive constraint solving. We describe Cora, a Collaborative Research Assistant built on this platform, that is designed to perform complex research and discovery tasks in high-stakes domains. This paper discusses the multi-step inference challenges inherent in such domains, critiques the limitations of existing LLM-based methods, and demonstrates how Cora's neuro-symbolic approach effectively addresses these issues. We provide an overview of the system architecture, key algorithms for knowledge extraction and formal reasoning, and present preliminary evaluation results that highlight Cora's superior performance compared to well-known LLM and RAG baselines.
☆ Psychological Profiling in Cybersecurity: A Look at LLMs and Psycholinguistic Features
The increasing sophistication of cyber threats necessitates innovative approaches to cybersecurity. In this paper, we explore the potential of psychological profiling techniques, particularly focusing on the utilization of Large Language Models (LLMs) and psycholinguistic features. We investigate the intersection of psychology and cybersecurity, discussing how LLMs can be employed to analyze textual data for identifying psychological traits of threat actors. We explore the incorporation of psycholinguistic features, such as linguistic patterns and emotional cues, into cybersecurity frameworks. \iffalse Through case studies and experiments, we discuss the effectiveness of these methods in enhancing threat detection and mitigation strategies.\fi Our research underscores the importance of integrating psychological perspectives into cybersecurity practices to bolster defense mechanisms against evolving threats.
☆ Implicit Discourse Relation Classification For Nigerian Pidgin
Despite attempts to make Large Language Models multi-lingual, many of the world's languages are still severely under-resourced. This widens the performance gap between NLP and AI applications aimed at well-financed, and those aimed at less-resourced languages. In this paper, we focus on Nigerian Pidgin (NP), which is spoken by nearly 100 million people, but has comparatively very few NLP resources and corpora. We address the task of Implicit Discourse Relation Classification (IDRC) and systematically compare an approach translating NP data to English and then using a well-resourced IDRC tool and back-projecting the labels versus creating a synthetic discourse corpus for NP, in which we translate PDTB and project PDTB labels, and then train an NP IDR classifier. The latter approach of learning a "native" NP classifier outperforms our baseline by 13.27\% and 33.98\% in f$_{1}$ score for 4-way and 11-way classification, respectively.
☆ Categorical Syllogisms Revisited: A Review of the Logical Reasoning Abilities of LLMs for Analyzing Categorical Syllogism
There have been a huge number of benchmarks proposed to evaluate how large language models (LLMs) behave for logic inference tasks. However, it remains an open question how to properly evaluate this ability. In this paper, we provide a systematic overview of prior works on the logical reasoning ability of LLMs for analyzing categorical syllogisms. We first investigate all the possible variations for the categorical syllogisms from a purely logical perspective and then examine the underlying configurations (i.e., mood and figure) tested by the existing datasets. Our results indicate that compared to template-based synthetic datasets, crowdsourcing approaches normally sacrifice the coverage of configurations (i.e., mood and figure) of categorical syllogisms for more language variations, thus bringing challenges to fully testing LLMs under different situations. We then proceed to summarize the findings and observations for the performances of LLMs to infer the validity of syllogisms from the current literature. The error rate breakdown analyses suggest that the interpretation of the quantifiers seems to be the current bottleneck that limits the performances of the LLMs and is thus worth more attention. Finally, we discuss several points that might be worth considering when researchers plan on the future release of categorical syllogism datasets. We hope our work will not only provide a timely review of the current literature regarding categorical syllogisms, but also motivate more interdisciplinary research between communities, specifically computational linguists and logicians.
☆ Re-Ranking Step by Step: Investigating Pre-Filtering for Re-Ranking with Large Language Models
Large Language Models (LLMs) have been revolutionizing a myriad of natural language processing tasks with their diverse zero-shot capabilities. Indeed, existing work has shown that LLMs can be used to great effect for many tasks, such as information retrieval (IR), and passage ranking. However, current state-of-the-art results heavily lean on the capabilities of the LLM being used. Currently, proprietary, and very large LLMs such as GPT-4 are the highest performing passage re-rankers. Hence, users without the resources to leverage top of the line LLMs, or ones that are closed source, are at a disadvantage. In this paper, we investigate the use of a pre-filtering step before passage re-ranking in IR. Our experiments show that by using a small number of human generated relevance scores, coupled with LLM relevance scoring, it is effectively possible to filter out irrelevant passages before re-ranking. Our experiments also show that this pre-filtering then allows the LLM to perform significantly better at the re-ranking task. Indeed, our results show that smaller models such as Mixtral can become competitive with much larger proprietary models (e.g., ChatGPT and GPT-4).
☆ WavRx: a Disease-Agnostic, Generalizable, and Privacy-Preserving Speech Health Diagnostic Model
Speech is known to carry health-related attributes, which has emerged as a novel venue for remote and long-term health monitoring. However, existing models are usually tailored for a specific type of disease, and have been shown to lack generalizability across datasets. Furthermore, concerns have been raised recently towards the leakage of speaker identity from health embeddings. To mitigate these limitations, we propose WavRx, a speech health diagnostics model that captures the respiration and articulation related dynamics from a universal speech representation. Our in-domain and cross-domain experiments on six pathological speech datasets demonstrate WavRx as a new state-of-the-art health diagnostic model. Furthermore, we show that the amount of speaker identity entailed in the WavRx health embeddings is significantly reduced without extra guidance during training. An in-depth analysis of the model was performed, thus providing physiological interpretation of its improved generalizability and privacy-preserving ability.
comment: Under review; Model script available at https://github.com/zhu00121/WavRx
☆ Jailbreaking LLMs with Arabic Transliteration and Arabizi
This study identifies the potential vulnerabilities of Large Language Models (LLMs) to 'jailbreak' attacks, specifically focusing on the Arabic language and its various forms. While most research has concentrated on English-based prompt manipulation, our investigation broadens the scope to investigate the Arabic language. We initially tested the AdvBench benchmark in Standardized Arabic, finding that even with prompt manipulation techniques like prefix injection, it was insufficient to provoke LLMs into generating unsafe content. However, when using Arabic transliteration and chatspeak (or arabizi), we found that unsafe content could be produced on platforms like OpenAI GPT-4 and Anthropic Claude 3 Sonnet. Our findings suggest that using Arabic and its various forms could expose information that might remain hidden, potentially increasing the risk of jailbreak attacks. We hypothesize that this exposure could be due to the model's learned connection to specific words, highlighting the need for more comprehensive safety training across all language forms.
comment: 14 pages, 4 figures
☆ Learn it or Leave it: Module Composition and Pruning for Continual Learning
In real-world environments, continual learning is essential for machine learning models, as they need to acquire new knowledge incrementally without forgetting what they have already learned. While pretrained language models have shown impressive capabilities on various static tasks, applying them to continual learning poses significant challenges, including avoiding catastrophic forgetting, facilitating knowledge transfer, and maintaining parameter efficiency. In this paper, we introduce MoCL-P, a novel lightweight continual learning method that addresses these challenges simultaneously. Unlike traditional approaches that continuously expand parameters for newly arriving tasks, MoCL-P integrates task representation-guided module composition with adaptive pruning, effectively balancing knowledge integration and computational overhead. Our evaluation across three continual learning benchmarks with up to 176 tasks shows that MoCL-P achieves state-of-the-art performance and improves parameter efficiency by up to three times, demonstrating its potential for practical applications where resource requirements are constrained.
☆ Simulating The U.S. Senate: An LLM-Driven Agent Approach to Modeling Legislative Behavior and Bipartisanship
This study introduces a novel approach to simulating legislative processes using LLM-driven virtual agents, focusing on the U.S. Senate Intelligence Committee. We developed agents representing individual senators and placed them in simulated committee discussions. The agents demonstrated the ability to engage in realistic debate, provide thoughtful reflections, and find bipartisan solutions under certain conditions. Notably, the simulation also showed promise in modeling shifts towards bipartisanship in response to external perturbations. Our results indicate that this LLM-driven approach could become a valuable tool for understanding and potentially improving legislative processes, supporting a broader pattern of findings highlighting how LLM-based agents can usefully model real-world phenomena. Future works will focus on enhancing agent complexity, expanding the simulation scope, and exploring applications in policy testing and negotiation.
☆ Sequence Graph Network for Online Debate Analysis
Online debates involve a dynamic exchange of ideas over time, where participants need to actively consider their opponents' arguments, respond with counterarguments, reinforce their own points, and introduce more compelling arguments as the discussion unfolds. Modeling such a complex process is not a simple task, as it necessitates the incorporation of both sequential characteristics and the capability to capture interactions effectively. To address this challenge, we employ a sequence-graph approach. Building the conversation as a graph allows us to effectively model interactions between participants through directed edges. Simultaneously, the propagation of information along these edges in a sequential manner enables us to capture a more comprehensive representation of context. We also introduce a Sequence Graph Attention layer to illustrate the proposed information update scheme. The experimental results show that sequence graph networks achieve superior results to existing methods in online debates.
comment: 8 pages, 4 figures
☆ Learning to Correct for QA Reasoning with Black-box LLMs
An open challenge in recent machine learning is about how to improve the reasoning capability of large language models (LLMs) in a black-box setting, i.e., without access to detailed information such as output token probabilities. Existing approaches either rely on accessibility (which is often unrealistic) or involve significantly increased train- and inference-time costs. This paper addresses those limitations or shortcomings by proposing a novel approach, namely CoBB (Correct for improving QA reasoning of Black-Box LLMs). It uses a trained adaptation model to perform a seq2seq mapping from the often-imperfect reasonings of the original black-box LLM to the correct or improved reasonings. Specifically, the adaptation model is initialized with a relatively small open-source LLM and adapted over a collection of sub-sampled training pairs. To select the representative pairs of correct and incorrect reasonings, we formulated the dataset construction as an optimization problem that minimizes the statistical divergence between the sampled subset and the entire collection, and solved it via a genetic algorithm. We then train the adaptation model over the sampled pairs by contrasting the likelihoods of correct and incorrect reasonings. Our experimental results demonstrate that CoBB significantly improves reasoning accuracy across various QA benchmarks, compared to the best-performing adaptation baselines.
comment: preprint, 18 pages
☆ The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm
A key concern with the concept of "alignment" is the implicit question of "alignment to what?". AI systems are increasingly used across the world, yet safety alignment is often focused on homogeneous monolingual settings. Additionally, preference training and safety measures often overfit to harms common in Western-centric datasets. Here, we explore the viability of different alignment approaches when balancing dual objectives: addressing and optimizing for a non-homogeneous set of languages and cultural preferences while minimizing both global and local harms. We collect the first set of human annotated red-teaming prompts in different languages distinguishing between global and local harm, which serve as a laboratory for understanding the reliability of alignment techniques when faced with preference distributions that are non-stationary across geographies and languages. While this setting is seldom covered by the literature to date, which primarily centers on English harm mitigation, it captures real-world interactions with AI systems around the world. We establish a new precedent for state-of-the-art alignment techniques across 6 languages with minimal degradation in general performance. Our work provides important insights into cross-lingual transfer and novel optimization approaches to safeguard AI systems designed to serve global populations.
☆ Speakers Unembedded: Embedding-free Approach to Long-form Neural Diarization INTERSPEECH 2024
End-to-end neural diarization (EEND) models offer significant improvements over traditional embedding-based Speaker Diarization (SD) approaches but falls short on generalizing to long-form audio with large number of speakers. EEND-vector-clustering method mitigates this by combining local EEND with global clustering of speaker embeddings from local windows, but this requires an additional speaker embedding framework alongside the EEND module. In this paper, we propose a novel framework applying EEND both locally and globally for long-form audio without separate speaker embeddings. This approach achieves significant relative DER reduction of 13% and 10% over the conventional 1-pass EEND on Callhome American English and RT03-CTS datasets respectively and marginal improvements over EEND-vector-clustering without the need for additional speaker embeddings. Furthermore, we discuss the computational complexity of our proposed framework and explore strategies for reducing processing times.
comment: Accepted at INTERSPEECH 2024
☆ Few-shot Personalization of LLMs with Mis-aligned Responses
As the diversity of users increases, the capability of providing personalized responses by large language models (LLMs) has become increasingly important. Existing approaches have only limited successes in LLM personalization, due to the absence of personalized learning or the reliance on shared personal data. This paper proposes a new approach for a few-shot personalization of LLMs with their mis-aligned responses (Fermi). Our key idea is to learn a set of personalized prompts for each user by progressively improving the prompts using LLMs, based on user profile (e.g., demographic information) and a few examples of previous opinions. During an iterative process of prompt improvement, we incorporate the contexts of mis-aligned responses by LLMs, which are especially crucial for the effective personalization of LLMs. In addition, we develop an effective inference method to further leverage the context of the test query and the personalized prompts. Our experimental results demonstrate that Fermi significantly improves performance across various benchmarks, compared to the best-performing baselines.
comment: preprint, 30 pages
☆ Understand What LLM Needs: Dual Preference Alignment for Retrieval-Augmented Generation
Retrieval-augmented generation (RAG) has demonstrated effectiveness in mitigating the hallucination problem of large language models (LLMs). However, the difficulty of aligning the retriever with the diverse LLMs' knowledge preferences inevitably poses an inevitable challenge in developing a reliable RAG system. To address this issue, we propose DPA-RAG, a universal framework designed to align diverse knowledge preferences within RAG systems. Specifically, we initially introduce a preference knowledge construction pipline and incorporate five novel query augmentation strategies to alleviate preference data scarcity. Based on preference data, DPA-RAG accomplishes both external and internal preference alignment: 1) It jointly integrate pair-wise, point-wise, and contrastive preference alignment abilities into the reranker, achieving external preference alignment among RAG components. 2) It further introduces a pre-aligned stage before vanilla Supervised Fine-tuning (SFT), enabling LLMs to implicitly capture knowledge aligned with their reasoning preferences, achieving LLMs' internal alignment. Experimental results across four knowledge-intensive QA datasets demonstrate that DPA-RAG outperforms all baselines and seamlessly integrates both black-box and open-sourced LLM readers. Further qualitative analysis and discussions also provide empirical guidance for achieving reliable RAG systems. Our code is publicly available at https://github.com/dongguanting/DPA-RAG.
comment: Work in progress
☆ Human-AI Collaborative Taxonomy Construction: A Case Study in Profession-Specific Writing Assistants
Large Language Models (LLMs) have assisted humans in several writing tasks, including text revision and story generation. However, their effectiveness in supporting domain-specific writing, particularly in business contexts, is relatively less explored. Our formative study with industry professionals revealed the limitations in current LLMs' understanding of the nuances in such domain-specific writing. To address this gap, we propose an approach of human-AI collaborative taxonomy development to perform as a guideline for domain-specific writing assistants. This method integrates iterative feedback from domain experts and multiple interactions between these experts and LLMs to refine the taxonomy. Through larger-scale experiments, we aim to validate this methodology and thus improve LLM-powered writing assistance, tailoring it to meet the unique requirements of different stakeholder needs.
comment: Accepted to CHI 2024 In2Writing Workshop
☆ RouteLLM: Learning to Route LLMs with Preference Data
Large language models (LLMs) exhibit impressive capabilities across a wide range of tasks, yet the choice of which model to use often involves a trade-off between performance and cost. More powerful models, though effective, come with higher expenses, while less capable models are more cost-effective. To address this dilemma, we propose several efficient router models that dynamically select between a stronger and a weaker LLM during inference, aiming to optimize the balance between cost and response quality. We develop a training framework for these routers leveraging human preference data and data augmentation techniques to enhance performance. Our evaluation on widely-recognized benchmarks shows that our approach significantly reduces costs-by over 2 times in certain cases-without compromising the quality of responses. Interestingly, our router models also demonstrate significant transfer learning capabilities, maintaining their performance even when the strong and weak models are changed at test time. This highlights the potential of these routers to provide a cost-effective yet high-performance solution for deploying LLMs.
☆ Evaluating Copyright Takedown Methods for Language Models
Language models (LMs) derive their capabilities from extensive training on diverse data, including potentially copyrighted material. These models can memorize and generate content similar to their training data, posing potential concerns. Therefore, model creators are motivated to develop mitigation methods that prevent generating protected content. We term this procedure as copyright takedowns for LMs, noting the conceptual similarity to (but legal distinction from) the DMCA takedown This paper introduces the first evaluation of the feasibility and side effects of copyright takedowns for LMs. We propose CoTaEval, an evaluation framework to assess the effectiveness of copyright takedown methods, the impact on the model's ability to retain uncopyrightable factual knowledge from the training data whose recitation is embargoed, and how well the model maintains its general utility and efficiency. We examine several strategies, including adding system prompts, decoding-time filtering interventions, and unlearning approaches. Our findings indicate that no tested method excels across all metrics, showing significant room for research in this unique problem setting and indicating potential unresolved challenges for live policy proposals.
comment: 31 pages, 9 figures, 14 tables
☆ Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
Mathematical reasoning presents a significant challenge for Large Language Models (LLMs) due to the extensive and precise chain of reasoning required for accuracy. Ensuring the correctness of each reasoning step is critical. To address this, we aim to enhance the robustness and factuality of LLMs by learning from human feedback. However, Direct Preference Optimization (DPO) has shown limited benefits for long-chain mathematical reasoning, as models employing DPO struggle to identify detailed errors in incorrect answers. This limitation stems from a lack of fine-grained process supervision. We propose a simple, effective, and data-efficient method called Step-DPO, which treats individual reasoning steps as units for preference optimization rather than evaluating answers holistically. Additionally, we have developed a data construction pipeline for Step-DPO, enabling the creation of a high-quality dataset containing 10K step-wise preference pairs. We also observe that in DPO, self-generated data is more effective than data generated by humans or GPT-4, due to the latter's out-of-distribution nature. Our findings demonstrate that as few as 10K preference data pairs and fewer than 500 Step-DPO training steps can yield a nearly 3% gain in accuracy on MATH for models with over 70B parameters. Notably, Step-DPO, when applied to Qwen2-72B-Instruct, achieves scores of 70.8% and 94.0% on the test sets of MATH and GSM8K, respectively, surpassing a series of closed-source models, including GPT-4-1106, Claude-3-Opus, and Gemini-1.5-Pro. Our code, data, and models are available at https://github.com/dvlab-research/Step-DPO.
comment: Code, data, and models are available at https://github.com/dvlab-research/Step-DPO
♻ ☆ Situational Awareness Matters in 3D Vision Language Reasoning CVPR 2024
Being able to carry out complicated vision language reasoning tasks in 3D space represents a significant milestone in developing household robots and human-centered embodied AI. In this work, we demonstrate that a critical and distinct challenge in 3D vision language reasoning is situational awareness, which incorporates two key components: (1) The autonomous agent grounds its self-location based on a language prompt. (2) The agent answers open-ended questions from the perspective of its calculated position. To address this challenge, we introduce SIG3D, an end-to-end Situation-Grounded model for 3D vision language reasoning. We tokenize the 3D scene into sparse voxel representation and propose a language-grounded situation estimator, followed by a situated question answering module. Experiments on the SQA3D and ScanQA datasets show that SIG3D outperforms state-of-the-art models in situation estimation and question answering by a large margin (e.g., an enhancement of over 30% on situation estimation accuracy). Subsequent analysis corroborates our architectural design choices, explores the distinct functions of visual and textual tokens, and highlights the importance of situational awareness in the domain of 3D question answering.
comment: CVPR 2024. Project Page: https://yunzeman.github.io/situation3d
♻ ☆ On the Impact of Voice Anonymization on Speech Diagnostic Applications: a Case Study on COVID-19 Detection
With advances seen in deep learning, voice-based applications are burgeoning, ranging from personal assistants, affective computing, to remote disease diagnostics. As the voice contains both linguistic and para-linguistic information (e.g., vocal pitch, intonation, speech rate, loudness), there is growing interest in voice anonymization to preserve speaker privacy and identity. Voice privacy challenges have emerged over the last few years and focus has been placed on removing speaker identity while keeping linguistic content intact. For affective computing and disease monitoring applications, however, the para-linguistic content may be more critical. Unfortunately, the effects that anonymization may have on these systems are still largely unknown. In this paper, we fill this gap and focus on one particular health monitoring application: speech-based COVID-19 diagnosis. We test three anonymization methods and their impact on five different state-of-the-art COVID-19 diagnostic systems using three public datasets. We validate the effectiveness of the anonymization methods, compare their computational complexity, and quantify the impact across different testing scenarios for both within- and across-dataset conditions. Additionally, we provided a comprehensive evaluation of the importance of different speech aspects for diagnostics and showed how they are affected by different types of anonymizers. Lastly, we show the benefits of using anonymized external data as a data augmentation tool to help recover some of the COVID-19 diagnostic accuracy loss seen with anonymization.
comment: Updated version; Published at IEEE Transactions on Information Forensics and Security
♻ ☆ Large Language Models in the Clinic: A Comprehensive Benchmark
The adoption of large language models (LLMs) to assist clinicians has attracted remarkable attention. Existing works mainly adopt the close-ended question-answering (QA) task with answer options for evaluation. However, many clinical decisions involve answering open-ended questions without pre-set options. To better understand LLMs in the clinic, we construct a benchmark ClinicBench. We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks. Furthermore, we construct six novel datasets and complex clinical tasks that are close to real-world practice, i.e., referral QA, treatment recommendation, hospitalization (long document) summarization, patient education, pharmacology QA and drug interaction for emerging drugs. We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings. Finally, we invite medical experts to evaluate the clinical usefulness of LLMs.
♻ ☆ Large Language Model Enhanced Clustering for News Event Detection
The news landscape is continuously evolving, with an ever-increasing volume of information from around the world. Automated event detection within this vast data repository is essential for monitoring, identifying, and categorizing significant news occurrences across diverse platforms. This paper presents an event detection framework that leverages Large Language Models (LLMs) combined with clustering analysis to detect news events from the Global Database of Events, Language, and Tone (GDELT). The framework enhances event clustering through both pre-event detection tasks (keyword extraction and text embedding) and post-event detection tasks (event summarization and topic labeling). We also evaluate the impact of various textual embeddings on the quality of clustering outcomes, ensuring robust news categorization. Additionally, we introduce a novel Cluster Stability Assessment Index (CSAI) to assess the validity and robustness of clustering results. CSAI utilizes latent feature vectors to provide a new way of measuring clustering quality. Our experiments indicate that combining LLM embeddings with clustering algorithms yields the best results, demonstrating greater robustness in terms of CSAI scores. Moreover, post-event detection tasks generate meaningful insights, facilitating effective interpretation of event clustering results. Overall, our experimental results indicate that the proposed framework offers valuable insights and could enhance the accuracy and depth of news reporting.
♻ ☆ BASS: Batched Attention-optimized Speculative Sampling
Speculative decoding has emerged as a powerful method to improve latency and throughput in hosting large language models. However, most existing implementations focus on generating a single sequence. Real-world generative AI applications often require multiple responses and how to perform speculative decoding in a batched setting while preserving its latency benefits poses non-trivial challenges. This paper describes a system of batched speculative decoding that sets a new state of the art in multi-sequence generation latency and that demonstrates superior GPU utilization as well as quality of generations within a time budget. For example, for a 7.8B-size model on a single A100 GPU and with a batch size of 8, each sequence is generated at an average speed of 5.8ms per token, the overall throughput being 1.1K tokens per second. These results represent state-of-the-art latency and a 2.15X speed-up over optimized regular decoding. Within a time budget that regular decoding does not finish, our system is able to generate sequences with HumanEval Pass@First of 43% and Pass@All of 61%, far exceeding what's feasible with single-sequence speculative decoding. Our peak GPU utilization during decoding reaches as high as 15.8%, more than 3X the highest of that of regular decoding and around 10X of single-sequence speculative decoding.
♻ ☆ BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
Automated software engineering has been greatly empowered by the recent advances in Large Language Models (LLMs) for programming. While current benchmarks have shown that LLMs can perform various software engineering tasks like human developers, the majority of their evaluations are limited to short and self-contained algorithmic tasks. Solving challenging and practical programming tasks requires the capability of utilizing diverse function calls as tools to efficiently implement functionalities like data analysis and web development. In addition, using multiple tools to solve a task needs compositional reasoning by accurately understanding complex instructions. Fulfilling both of these characteristics can pose a great challenge for LLMs. To assess how well LLMs can solve challenging and practical programming tasks, we introduce Bench, a benchmark that challenges LLMs to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained programming tasks. To evaluate LLMs rigorously, each programming task encompasses 5.6 test cases with an average branch coverage of 99%. In addition, we propose a natural-language-oriented variant of Bench, Benchi, that automatically transforms the original docstrings into short instructions only with essential information. Our extensive evaluation of 60 LLMs shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores up to 60%, significantly lower than the human performance of 97%. The results underscore the need for further advancements in this area.
comment: 44 pages, 14 figures, 7 tables, built with love by the BigCode community :)
♻ ☆ Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models
Large language models (LLMs) have demonstrated impressive reasoning capabilities, particularly in textual mathematical problem-solving. However, existing open-source image instruction fine-tuning datasets, containing limited question-answer pairs per image, do not fully exploit visual information to enhance the multimodal mathematical reasoning capabilities of Multimodal LLMs (MLLMs). To bridge this gap, we address the lack of high-quality, diverse multimodal mathematical datasets by collecting 40K high-quality images with question-answer pairs from 24 existing datasets and synthesizing 320K new pairs, creating the MathV360K dataset, which enhances both the breadth and depth of multimodal mathematical questions. We introduce Math-LLaVA, a LLaVA-1.5-based model fine-tuned with MathV360K. This novel approach significantly improves the multimodal mathematical reasoning capabilities of LLaVA-1.5, achieving a 19-point increase and comparable performance to GPT-4V on MathVista's minitest split. Furthermore, Math-LLaVA demonstrates enhanced generalizability, showing substantial improvements on the MMMU benchmark. Our research highlights the importance of dataset diversity and synthesis in advancing MLLMs' mathematical reasoning abilities. The code and data are available at: \url{https://github.com/HZQ950419/Math-LLaVA}.
comment: 8 pages
♻ ☆ Large Knowledge Model: Perspectives and Challenges
Humankind's understanding of the world is fundamentally linked to our perception and cognition, with \emph{human languages} serving as one of the major carriers of \emph{world knowledge}. In this vein, \emph{Large Language Models} (LLMs) like ChatGPT epitomize the pre-training of extensive, sequence-based world knowledge into neural networks, facilitating the processing and manipulation of this knowledge in a parametric space. This article explores large models through the lens of "knowledge". We initially investigate the role of symbolic knowledge such as Knowledge Graphs (KGs) in enhancing LLMs, covering aspects like knowledge-augmented language model, structure-inducing pre-training, knowledgeable prompts, structured CoT, knowledge editing, semantic tools for LLM and knowledgeable AI agents. Subsequently, we examine how LLMs can boost traditional symbolic knowledge bases, encompassing aspects like using LLM as KG builder and controller, structured knowledge pretraining, and LLM-enhanced symbolic reasoning. Considering the intricate nature of human knowledge, we advocate for the creation of \emph{Large Knowledge Models} (LKM), specifically engineered to manage diversified spectrum of knowledge structures. This promising undertaking would entail several key challenges, such as disentangling knowledge base from language models, cognitive alignment with human knowledge, integration of perception and cognition, and building large commonsense models for interacting with physical world, among others. We finally propose a five-"A" principle to distinguish the concept of LKM.
comment: Data Intelligence, Published: Jun 18, 2024
♻ ☆ MultiAgent Collaboration Attack: Investigating Adversarial Attacks in Large Language Model Collaborations via Debate
Large Language Models (LLMs) have shown exceptional results on current benchmarks when working individually. The advancement in their capabilities, along with a reduction in parameter size and inference times, has facilitated the use of these models as agents, enabling interactions among multiple models to execute complex tasks. Such collaborations offer several advantages, including the use of specialized models (e.g. coding), improved confidence through multiple computations, and enhanced divergent thinking, leading to more diverse outputs. Thus, the collaborative use of language models is expected to grow significantly in the coming years. In this work, we evaluate the behavior of a network of models collaborating through debate under the influence of an adversary. We introduce pertinent metrics to assess the adversary's effectiveness, focusing on system accuracy and model agreement. Our findings highlight the importance of a model's persuasive ability in influencing others. Additionally, we explore inference-time methods to generate more compelling arguments and evaluate the potential of prompt-based mitigation as a defensive strategy.
♻ ☆ ToM-LM: Delegating Theory of Mind Reasoning to External Symbolic Executors in Large Language Models
Theory of Mind (ToM) refers to the ability of individuals to attribute mental states to others. While Large Language Models (LLMs) have shown some promise with ToM ability, they still struggle with complex ToM reasoning. Our approach leverages an external symbolic executor, specifically the SMCDEL model checker, and fine-tuning to improve the ToM reasoning ability of LLMs. In our approach, an LLM is first fine-tuned through pairs of natural language and symbolic formulation representation of ToM problems and is then instructed to generate the symbolic formulation with a one-shot in-context example. The generated symbolic formulation is then executed by the SMCDEL model checker to perform transparent and verifiable ToM reasoning and give the final result. We demonstrate that our approach, ToM-LM, shows a significant improvement over all the constructed baselines. Our study proposes a novel view about externalizing a particular component of ToM reasoning, mainly reasoning about beliefs, and suggests generalizing it to other aspects of ToM reasoning.
comment: Accepted at NeSy 2024
♻ ☆ Cultural Bias and Cultural Alignment of Large Language Models
Culture fundamentally shapes people's reasoning, behavior, and communication. As people increasingly use generative artificial intelligence (AI) to expedite and automate personal and professional tasks, cultural values embedded in AI models may bias people's authentic expression and contribute to the dominance of certain cultures. We conduct a disaggregated evaluation of cultural bias for five widely used large language models (OpenAI's GPT-4o/4-turbo/4/3.5-turbo/3) by comparing the models' responses to nationally representative survey data. All models exhibit cultural values resembling English-speaking and Protestant European countries. We test cultural prompting as a control strategy to increase cultural alignment for each country/territory. For recent models (GPT-4, 4-turbo, 4o), this improves the cultural alignment of the models' output for 71-81% of countries and territories. We suggest using cultural prompting and ongoing evaluation to reduce cultural bias in the output of generative AI.
♻ ☆ VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation
As large language models achieve impressive scores on traditional benchmarks, an increasing number of researchers are becoming concerned about benchmark data leakage during pre-training, commonly known as the data contamination problem. To ensure fair evaluation, recent benchmarks release only the training and validation sets, keeping the test set labels closed-source. They require anyone wishing to evaluate his language model to submit the model's predictions for centralized processing and then publish the model's result on their leaderboard. However, this submission process is inefficient and prevents effective error analysis. To address this issue, we propose to variabilize benchmarks and evaluate language models dynamically. Specifically, we extract variables from each test case and define a value range for each variable. For each evaluation, we sample new values from these value ranges to create unique test cases, thus ensuring a fresh evaluation each time. We applied this variable perturbation method to four datasets: GSM8K, ARC, CommonsenseQA, and TruthfulQA, which cover mathematical generation and multiple-choice tasks. Our experimental results demonstrate that this approach provides a more accurate assessment of the true capabilities of language models, effectively mitigating the contamination problem.
♻ ☆ Active Preference Inference using Language Models and Probabilistic Reasoning
Actively inferring user preferences, for example by asking good questions, is important for any human-facing decision-making system. Active inference allows such systems to adapt and personalize themselves to nuanced individual preferences. To enable this ability for instruction-tuned large language models (LLMs), one may prompt them to ask users questions to infer their preferences, transforming the language models into more robust, interactive systems. However, out of the box, these models are not efficient at extracting preferences: the questions they generate are not informative, requiring a high number of user interactions and impeding the usability of the downstream system. In this work, we introduce an inference-time algorithm that helps LLMs quickly infer preferences by using more informative questions. Our algorithm uses a probabilistic model whose conditional distributions are defined by prompting an LLM, and returns questions that optimize expected entropy and expected model change. Results in a simplified interactive web shopping setting with real product items show that an LLM equipped with our entropy reduction algorithm outperforms baselines with the same underlying LLM on task performance while using fewer user interactions.
♻ ☆ OlympicArena Medal Ranks: Who Is the Most Intelligent AI So Far?
In this report, we pose the following question: Who is the most intelligent AI model to date, as measured by the OlympicArena (an Olympic-level, multi-discipline, multi-modal benchmark for superintelligent AI)? We specifically focus on the most recently released models: Claude-3.5-Sonnet, Gemini-1.5-Pro, and GPT-4o. For the first time, we propose using an Olympic medal Table approach to rank AI models based on their comprehensive performance across various disciplines. Empirical results reveal: (1) Claude-3.5-Sonnet shows highly competitive overall performance over GPT-4o, even surpassing GPT-4o on a few subjects (i.e., Physics, Chemistry, and Biology). (2) Gemini-1.5-Pro and GPT-4V are ranked consecutively just behind GPT-4o and Claude-3.5-Sonnet, but with a clear performance gap between them. (3) The performance of AI models from the open-source community significantly lags behind these proprietary models. (4) The performance of these models on this benchmark has been less than satisfactory, indicating that we still have a long way to go before achieving superintelligence. We remain committed to continuously tracking and evaluating the performance of the latest powerful models on this benchmark (available at https://github.com/GAIR-NLP/OlympicArena).
comment: 10 pages
♻ ☆ MM-MATH: Advancing Multimodal Math Evaluation with Process Evaluation and Fine-grained Classification
To advance the evaluation of multimodal math reasoning in large multimodal models (LMMs), this paper introduces a novel benchmark, MM-MATH. MM-MATH consists of 5,929 open-ended middle school math problems with visual contexts, with fine-grained classification across difficulty, grade level, and knowledge points. Unlike existing benchmarks relying on binary answer comparison, MM-MATH incorporates both outcome and process evaluations. Process evaluation employs LMM-as-a-judge to automatically analyze solution steps, identifying and categorizing errors into specific error types. Extensive evaluation of ten models on MM-MATH reveals significant challenges for existing LMMs, highlighting their limited utilization of visual information and struggles with higher-difficulty problems. The best-performing model achieves only 31% accuracy on MM-MATH, compared to 82% for humans. This highlights the challenging nature of our benchmark for existing models and the significant gap between the multimodal reasoning capabilities of current models and humans. Our process evaluation reveals that diagram misinterpretation is the most common error, accounting for more than half of the total error cases, underscoring the need for improved image comprehension in multimodal reasoning.
♻ ☆ DoRA: Enhancing Parameter-Efficient Fine-Tuning with Dynamic Rank Distribution ACL 2024
Fine-tuning large-scale pre-trained models is inherently a resource-intensive task. While it can enhance the capabilities of the model, it also incurs substantial computational costs, posing challenges to the practical application of downstream tasks. Existing parameter-efficient fine-tuning (PEFT) methods such as Low-Rank Adaptation (LoRA) rely on a bypass framework that ignores the differential parameter budget requirements across weight matrices, which may lead to suboptimal fine-tuning outcomes. To address this issue, we introduce the Dynamic Low-Rank Adaptation (DoRA) method. DoRA decomposes high-rank LoRA layers into structured single-rank components, allowing for dynamic pruning of parameter budget based on their importance to specific tasks during training, which makes the most of the limited parameter budget. Experimental results demonstrate that DoRA can achieve competitive performance compared with LoRA and full model fine-tuning, and outperform various strong baselines with the same storage parameter budget. Our code is available at https://github.com/MIkumikumi0116/DoRA
comment: Accepted by the main conference of ACL 2024
♻ ☆ SetBERT: Enhancing Retrieval Performance for Boolean Logic and Set Operation Queries
We introduce SetBERT, a fine-tuned BERT-based model designed to enhance query embeddings for set operations and Boolean logic queries, such as Intersection (AND), Difference (NOT), and Union (OR). SetBERT significantly improves retrieval performance for logic-structured queries, an area where both traditional and neural retrieval methods typically underperform. We propose an innovative use of inversed-contrastive loss, focusing on identifying the negative sentence, and fine-tuning BERT with a dataset generated via prompt GPT. Furthermore, we demonstrate that, unlike other BERT-based models, fine-tuning with triplet loss actually degrades performance for this specific task. Our experiments reveal that SetBERT-base not only significantly outperforms BERT-base (up to a 63% improvement in Recall) but also achieves performance comparable to the much larger BERT-large model, despite being only one-third the size.
comment: 10 pages, 1 figure
♻ ☆ Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task ICLR 2023
Language models show a surprising range of capabilities, but the source of their apparent competence is unclear. Do these networks just memorize a collection of surface statistics, or do they rely on internal representations of the process that generates the sequences they see? We investigate this question by applying a variant of the GPT model to the task of predicting legal moves in a simple board game, Othello. Although the network has no a priori knowledge of the game or its rules, we uncover evidence of an emergent nonlinear internal representation of the board state. Interventional experiments indicate this representation can be used to control the output of the network and create "latent saliency maps" that can help explain predictions in human terms.
comment: ICLR 2023 oral (notable-top-5%): https://openreview.net/forum?id=DeG07_TcZvT ; code: https://github.com/likenneth/othello_world
♻ ☆ CHIRON: Rich Character Representations in Long-Form Narratives
Characters are integral to long-form narratives, but are poorly understood by existing story analysis and generation systems. While prior work has simplified characters via graph-based methods and brief character descriptions, we aim to better tackle the problem of representing complex characters by taking inspiration from advice given to professional writers. We propose CHIRON, a new `character sheet' based representation that organizes and filters textual information about characters. We construct CHIRON sheets in two steps: a Generation Module that prompts an LLM for character information via question-answering and a Validation Module that uses automated reasoning and a domain-specific entailment model to eliminate false facts about a character. We validate CHIRON via the downstream task of masked-character prediction, where our experiments show CHIRON is better and more flexible than comparable summary-based baselines. We also show that metrics derived from CHIRON can be used to automatically infer character-centricity in stories, and that these metrics align with human judgments.
♻ ☆ Inference-Time Intervention: Eliciting Truthful Answers from a Language Model NeurIPS 2023
We introduce Inference-Time Intervention (ITI), a technique designed to enhance the "truthfulness" of large language models (LLMs). ITI operates by shifting model activations during inference, following a set of directions across a limited number of attention heads. This intervention significantly improves the performance of LLaMA models on the TruthfulQA benchmark. On an instruction-finetuned LLaMA called Alpaca, ITI improves its truthfulness from 32.5% to 65.1%. We identify a tradeoff between truthfulness and helpfulness and demonstrate how to balance it by tuning the intervention strength. ITI is minimally invasive and computationally inexpensive. Moreover, the technique is data efficient: while approaches like RLHF require extensive annotations, ITI locates truthful directions using only few hundred examples. Our findings suggest that LLMs may have an internal representation of the likelihood of something being true, even as they produce falsehoods on the surface.
comment: NeurIPS 2023 spotlight; code: https://github.com/likenneth/honest_llama
♻ ☆ Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis
Large language models (LLMs) are susceptible to a type of attack known as jailbreaking, which misleads LLMs to output harmful contents. Although there are diverse jailbreak attack strategies, there is no unified understanding on why some methods succeed and others fail. This paper explores the behavior of harmful and harmless prompts in the LLM's representation space to investigate the intrinsic properties of successful jailbreak attacks. We hypothesize that successful attacks share some similar properties: They are effective in moving the representation of the harmful prompt towards the direction to the harmless prompts. We leverage hidden representations into the objective of existing jailbreak attacks to move the attacks along the acceptance direction, and conduct experiments to validate the above hypothesis using the proposed objective. We hope this study provides new insights into understanding how LLMs understand harmfulness information.
♻ ☆ LongIns: A Challenging Long-context Instruction-based Exam for LLMs
The long-context capabilities of large language models (LLMs) have been a hot topic in recent years. To evaluate the performance of LLMs in different scenarios, various assessment benchmarks have emerged. However, as most of these benchmarks focus on identifying key information to answer questions, which mainly requires the retrieval ability of LLMs, these benchmarks can partially represent the reasoning performance of LLMs from large amounts of information. Meanwhile, although LLMs often claim to have context windows of 32k, 128k, 200k, or even longer, these benchmarks fail to reveal the actual supported length of these LLMs. To address these issues, we propose the LongIns benchmark dataset, a challenging long-context instruction-based exam for LLMs, which is built based on the existing instruction datasets. Specifically, in our LongIns, we introduce three evaluation settings: Global Instruction & Single Task (GIST), Local Instruction & Single Task (LIST), and Local Instruction & Multiple Tasks (LIMT). Based on LongIns, we perform comprehensive evaluations on existing LLMs and have the following important findings: (1). The top-performing GPT-4 with 128k context length performs poorly on the evaluation context window of 16k in our LongIns. (2). For the multi-hop reasoning ability of many existing LLMs, significant efforts are still needed under short context windows (less than 4k).
♻ ☆ Are AI-Generated Text Detectors Robust to Adversarial Perturbations? ACL 2024
The widespread use of large language models (LLMs) has sparked concerns about the potential misuse of AI-generated text, as these models can produce content that closely resembles human-generated text. Current detectors for AI-generated text (AIGT) lack robustness against adversarial perturbations, with even minor changes in characters or words causing a reversal in distinguishing between human-created and AI-generated text. This paper investigates the robustness of existing AIGT detection methods and introduces a novel detector, the Siamese Calibrated Reconstruction Network (SCRN). The SCRN employs a reconstruction network to add and remove noise from text, extracting a semantic representation that is robust to local perturbations. We also propose a siamese calibration technique to train the model to make equally confidence predictions under different noise, which improves the model's robustness against adversarial perturbations. Experiments on four publicly available datasets show that the SCRN outperforms all baseline methods, achieving 6.5\%-18.25\% absolute accuracy improvement over the best baseline method under adversarial attacks. Moreover, it exhibits superior generalizability in cross-domain, cross-genre, and mixed-source scenarios. The code is available at \url{https://github.com/CarlanLark/Robust-AIGC-Detector}.
comment: Accepted to ACL 2024 main conference
♻ ☆ Jina CLIP: Your CLIP Model Is Also Your Text Retriever ICML2024
Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-sized vectors. These models are key to multimodal information retrieval and related tasks. However, CLIP models generally underperform in text-only tasks compared to specialized text models. This creates inefficiencies for information retrieval systems that keep separate embeddings and models for text-only and multimodal tasks. We propose a novel, multi-task contrastive training method to address this issue, which we use to train the jina-clip-v1 model to achieve the state-of-the-art performance on both text-image and text-text retrieval tasks.
comment: 4 pages, MFM-EAI@ICML2024
♻ ☆ Caught in the Quicksand of Reasoning, Far from AGI Summit: Evaluating LLMs' Mathematical and Coding Competency through Ontology-guided Interventions
Recent advancements in Large Language Models (LLMs) have showcased striking results on existing logical reasoning benchmarks, with some models even surpassing human performance. However, the true depth of their competencies and robustness in reasoning tasks remains an open question. To this end, in this paper, we focus on two popular reasoning tasks: arithmetic reasoning and code generation. Particularly, we introduce: (i) a general ontology of perturbations for maths and coding questions, (ii) a semi-automatic method to apply these perturbations, and (iii) two datasets, MORE and CORE, respectively, of perturbed maths and coding problems to probe the limits of LLM capabilities in numeric reasoning and coding tasks. Through comprehensive evaluations of both closed-source and open-source LLMs, we show a significant performance drop across all the models against the perturbed questions, suggesting that the current LLMs lack robust problem solving skills and structured reasoning abilities in many areas, as defined by our ontology. We open source the datasets and source codes at: https://github.com/declare-lab/llm_robustness.
♻ ☆ 360$^\circ$REA: Towards A Reusable Experience Accumulation with 360° Assessment for Multi-Agent System
Large language model agents have demonstrated remarkable advancements across various complex tasks. Recent works focus on optimizing the agent team or employing self-reflection to iteratively solve complex tasks. Since these agents are all based on the same LLM, only conducting self-evaluation or removing underperforming agents does not substantively enhance the capability of the agents. We argue that a comprehensive evaluation and accumulating experience from evaluation feedback is an effective approach to improving system performance. In this paper, we propose Reusable Experience Accumulation with 360$^\circ$ Assessment (360$^\circ$REA), a hierarchical multi-agent framework inspired by corporate organizational practices. The framework employs a novel 360$^\circ$ performance assessment method for multi-perspective performance evaluation with fine-grained assessment. To enhance the capability of agents in addressing complex tasks, we introduce dual-level experience pool for agents to accumulate experience through fine-grained assessment. Extensive experiments on complex task datasets demonstrate the effectiveness of 360$^\circ$REA.
♻ ☆ Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP
Contrastive Language--Image Pre-training (CLIP) has manifested remarkable improvements in zero-shot classification and cross-modal vision-language tasks. Yet, from a geometrical point of view, the CLIP embedding space has been found to have a pronounced modality gap. This gap renders the embedding space overly sparse and disconnected, with different modalities being densely distributed in distinct subregions of the hypersphere. In this work, we aim at answering two main questions: 1. Does sharing the parameter space between the multi-modal encoders reduce the modality gap? 2. Can the gap be mitigated by pushing apart the uni-modal embeddings via intra-modality separation? We design AlignCLIP, in order to answer these questions and show that answers to both questions are positive. Through extensive experiments, we show that AlignCLIP achieves noticeable enhancements in the cross-modal alignment of the embeddings, and thereby, reduces the modality gap, while maintaining the performance across several downstream evaluations, such as zero-shot image classification, zero-shot multi-modal retrieval and zero-shot semantic text similarity.
♻ ☆ Compact Speech Translation Models via Discrete Speech Units Pretraining
We propose a pretraining method to use Self-Supervised Speech (SSS) model to creating more compact Speech-to-text Translation. In contrast to using the SSS model for initialization, our method is more suitable to memory constrained scenario such as on-device deployment. Our method is based on Discrete Speech Units (DSU) extracted from the SSS model. In the first step, our method pretrains two smaller encoder-decoder models on 1) Filterbank-to-DSU (Fbk-to-DSU) and 2) DSU-to-Translation (DSU-to-Trl) data respectively. The DSU thus become the distillation inputs of the smaller models. Subsequently, the encoder from the Fbk-to-DSU model and the decoder from the DSU-to-Trl model are taken to initialise the compact model. Finally, the compact model is finetuned on the paired Fbk-Trl data. In addition to being compact, our method requires no transcripts, making it applicable to low-resource settings. It also avoids speech discretization in inference and is more robust to the DSU tokenization. Evaluation on CoVoST-2 (X-En) shows that our method has consistent improvement over the baseline in three metrics while being compact i.e., only half the SSS model size.
comment: 11 pages, accepted at IWSLT 2024
♻ ☆ Two Tales of Persona in LLMs: A Survey of Role-Playing and Personalization
The concept of persona, originally adopted in dialogue literature, has re-surged as a promising framework for tailoring large language models (LLMs) to specific context (e.g., personalized search, LLM-as-a-judge). However, the growing research on leveraging persona in LLMs is relatively disorganized and lacks a systematic taxonomy. To close the gap, we present a comprehensive survey to categorize the current state of the field. We identify two lines of research, namely (1) LLM Role-Playing, where personas are assigned to LLMs, and (2) LLM Personalization, where LLMs take care of user personas. Additionally, we introduce existing methods for LLM personality evaluation. To the best of our knowledge, we present the first survey for role-playing and personalization in LLMs under the unified view of persona. We continuously maintain a paper collection to foster future endeavors: https://github.com/MiuLab/PersonaLLM-Survey
comment: 8-page version
♻ ☆ Case-Based or Rule-Based: How Do Transformers Do the Math?
Despite the impressive performance in a variety of complex tasks, modern large language models (LLMs) still have trouble dealing with some math problems that are simple and intuitive for humans, such as addition. While we can easily learn basic rules of addition and apply them to new problems of any length, LLMs struggle to do the same. Instead, they may rely on similar cases seen in the training corpus for help. We define these two different reasoning mechanisms as "rule-based reasoning" and "case-based reasoning". Since rule-based reasoning is essential for acquiring systematic generalization ability, we aim to explore exactly whether transformers use rule-based or case-based reasoning for math problems. Through carefully designed intervention experiments on five math tasks, we confirm that transformers are performing case-based reasoning, no matter whether scratchpad is used, which aligns with the previous observations that transformers use subgraph matching/shortcut learning to reason. To mitigate such problems, we propose a Rule-Following Fine-Tuning (RFFT) technique to teach transformers to perform rule-based reasoning. Specifically, we provide explicit rules in the input and then instruct transformers to recite and follow the rules step by step. Through RFFT, we successfully enable LLMs fine-tuned on 1-5 digit addition to generalize to up to 12-digit addition with over 95% accuracy, which is over 40% higher than scratchpad. The significant improvement demonstrates that teaching LLMs to use rules explicitly helps them learn rule-based reasoning and generalize better in length.
♻ ☆ Super Tiny Language Models
The rapid advancement of large language models (LLMs) has led to significant improvements in natural language processing but also poses challenges due to their high computational and energy demands. This paper introduces a series of research efforts focused on Super Tiny Language Models (STLMs), which aim to deliver high performance with significantly reduced parameter counts. We explore innovative techniques such as byte-level tokenization with a pooling mechanism, weight tying, and efficient training strategies. These methods aim to significantly reduce reduce the parameter count compared to traditional models -- in future works, we aim to build on these in a way that maintains and improves upon the performance of base transformer models. This series of papers will explore into various subproblems, including tokenizer-free models, self-play based training, and alternative training objectives. We will target models with 10M, 50M, and 100M parameters. Our ultimate goal is to make high-performance language models more accessible and practical for a wide range of applications.
comment: 11 pages, 4 figures
♻ ☆ Efficient Data Learning for Open Information Extraction with Pre-trained Language Models
Open Information Extraction (OpenIE) is a fundamental yet challenging task in Natural Language Processing, which involves extracting all triples (subject, predicate, object) from a given sentence. While labeling-based methods have their merits, generation-based techniques offer unique advantages, such as the ability to generate tokens not present in the original sentence. However, these generation-based methods often require a significant amount of training data to learn the task form of OpenIE and substantial training time to overcome slow model convergence due to the order penalty. In this paper, we introduce a novel framework, OK-IE, that ingeniously transforms the task form of OpenIE into the pre-training task form of the T5 model, thereby reducing the need for extensive training data. Furthermore, we introduce an innovative concept of Anchor to control the sequence of model outputs, effectively eliminating the impact of order penalty on model convergence and significantly reducing training time. Experimental results indicate that, compared to previous SOTA methods, OK-IE requires only 1/100 of the training data (900 instances) and 1/120 of the training time (3 minutes) to achieve comparable results.
♻ ☆ Layer-Wise Quantization: A Pragmatic and Effective Method for Quantizing LLMs Beyond Integer Bit-Levels EMNLP
We present a simple variable quantization approach that quantizes different layers of a large language model (LLM) at different bit levels. Specifically, we quantize the most important layers to higher bit precision and less important layers to lower bits to achieve floating point quantization levels. We propose two effective strategies to measure the importance of layers within LLMs: the first measures the importance of a layer based on how different its output embeddings are from the input embeddings (the higher the better); the second estimates the importance of a layer using the number of layer weights that are much larger than average (the smaller the better). We show that quantizing different layers at varying bits according to our importance scores results in minimal performance drop with a far more compressed model size. Finally, we present several practical key takeaways from our variable layer-wise quantization experiments: (a) LLM performance under variable quantization remains close to the original model until 25-50% of layers are moved in lower quantization using our proposed ordering but only until 5-10% if moved using no specific ordering; (b) Quantizing LLMs to lower bits performs substantially better than pruning unless extreme quantization (2-bit) is used; and (c) Layer-wise quantization to lower bits works better in the case of larger LLMs with more layers compared to smaller LLMs with fewer layers. The code used to run the experiments is available at: https://github.com/RazvanDu/LayerwiseQuant.
comment: submitted to EMNLP, 15 pages, 10 figures, 4 tables
♻ ☆ SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models ICML 2024
We propose SPHINX-X, an extensive Multimodality Large Language Model (MLLM) series developed upon SPHINX. To improve the architecture and training efficiency, we modify the SPHINX framework by removing redundant visual encoders, bypassing fully-padded sub-images with skip tokens, and simplifying multi-stage training into a one-stage all-in-one paradigm. To fully unleash the potential of MLLMs, we assemble a comprehensive multi-domain and multimodal dataset covering publicly available resources in language, vision, and vision-language tasks. We further enrich this collection with our curated OCR intensive and Set-of-Mark datasets, extending the diversity and generality. By training over different base LLMs including TinyLlama1.1B, InternLM2-7B, LLaMA2-13B, and Mixtral8x7B, we obtain a spectrum of MLLMs that vary in parameter size and multilingual capabilities. Comprehensive benchmarking reveals a strong correlation between the multi-modal performance with the data and parameter scales. Code and models are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory
comment: Accepted by ICML 2024. Code and models are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory
♻ ☆ CDQuant: Accurate Post-training Weight Quantization of Large Pre-trained Models using Greedy Coordinate Descent
Large language models (LLMs) have recently demonstrated remarkable performance across diverse language tasks. But their deployment is often constrained by their substantial computational and storage requirements. Quantization has emerged as a key technique for addressing this challenge, enabling the compression of large models with minimal impact on performance. The recent GPTQ algorithm, a post-training quantization (PTQ) method, has proven highly effective for compressing LLMs, sparking a wave of research that leverages GPTQ as a core component. Recognizing the pivotal role of GPTQ in the PTQ landscape, we introduce CDQuant, a simple and scalable alternative to GPTQ with improved performance. CDQuant uses coordinate descent to minimize the layer-wise reconstruction loss to achieve high-quality quantized weights. Our algorithm is easy to implement and scales efficiently to models with hundreds of billions of parameters. Through extensive evaluation on the PaLM2 model family, we demonstrate that CDQuant consistently outperforms GPTQ across diverse model sizes and quantization levels. In particular, for INT2 quantization of PaLM2-Otter, CDQuant achieves a 10% reduction in perplexity compared to GPTQ.
♻ ☆ CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation ACL 2024
Since the natural language processing (NLP) community started to make large language models (LLMs) act as a critic to evaluate the quality of generated texts, most of the existing works train a critique generation model on the evaluation data labeled by GPT-4's direct prompting. We observe that these models lack the ability to generate informative critiques in both pointwise grading and pairwise comparison especially without references. As a result, their generated critiques cannot provide fine-grained distinguishability on generated texts, causing unsatisfactory evaluation performance. In this paper, we propose a simple yet effective method called Eval-Instruct, which can first acquire pointwise grading critiques with pseudo references and then revise these critiques via multi-path prompting to obtain informative evaluation data in different tasks and settings, including pointwise grading and pairwise comparison with / without references. After fine-tuning on these data, the resulting model CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines and even achieve comparable evaluation performance to GPT-4 in system-level correlations of pointwise grading. We also demonstrate that our generated critiques can act as scalable feedback to further improve the generation quality of strong LLMs like ChatGPT.
comment: Accepted by ACL 2024 (Main Conference)
♻ ☆ FaithLM: Towards Faithful Explanations for Large Language Models
Large Language Models (LLMs) have become proficient in addressing complex tasks by leveraging their extensive internal knowledge and reasoning capabilities. However, the black-box nature of these models complicates the task of explaining their decision-making processes. While recent advancements demonstrate the potential of leveraging LLMs to self-explain their predictions through natural language (NL) explanations, their explanations may not accurately reflect the LLMs' decision-making process due to a lack of fidelity optimization on the derived explanations. Measuring the fidelity of NL explanations is a challenging issue, as it is difficult to manipulate the input context to mask the semantics of these explanations. To this end, we introduce FaithLM to explain the decision of LLMs with NL explanations. Specifically, FaithLM designs a method for evaluating the fidelity of NL explanations by incorporating the contrary explanations to the query process. Moreover, FaithLM conducts an iterative process to improve the fidelity of derived explanations. Experiment results on three datasets from multiple domains demonstrate that FaithLM can significantly improve the fidelity of derived explanations, which also provides a better alignment with the ground-truth explanations.
♻ ☆ Multiple-Choice Questions are Efficient and Robust LLM Evaluators
We present GSM-MC, a multiple-choice (MC) dataset constructed by collecting answers and incorrect predictions on GSM8K from 60 open-source models. Through extensive experiments, we show that LLMs' performance on the MC version of this popular benchmark is strongly correlated with their performance on the original version and is quite robust to distractor choices and option orders, while the evaluation time is reduced by a factor of up to 30. Following similar procedures, we introduce MATH-MC, constructed from MATH, and PythonIO, a new program reasoning MC dataset constructed from HumanEval and MBPP. Experimental results indicate that LLMs' performance on these MC benchmarks leaves much room for improvement. Our data and code are available at https://github.com/Geralt-Targaryen/MC-Evaluation.
comment: data at https://github.com/Geralt-Targaryen/MC-Evaluation
♻ ☆ Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code
In this work we systematically review the recent advancements in software engineering with language models, covering 70+ models, 40+ evaluation tasks, 180+ datasets, and 900 related works. Unlike previous works, we integrate software engineering (SE) with natural language processing (NLP) by discussing the perspectives of both sides: SE applies language models for development automation, while NLP adopts SE tasks for language model evaluation. We break down code processing models into general language models represented by the GPT family and specialized models that are specifically pretrained on code, often with tailored objectives. We discuss the relations and differences between these models, and highlight the historical transition of code modeling from statistical models and RNNs to pretrained Transformers and LLMs, which is exactly the same course that had been taken by NLP. We also go beyond programming and review LLMs' application in other software engineering activities including requirement engineering, testing, deployment, and operations in an endeavor to provide a global view of NLP in SE, and identify key challenges and potential future directions in this domain. We keep the survey open and updated on GitHub at https://github.com/codefuse-ai/Awesome-Code-LLM.
comment: Repo: https://github.com/codefuse-ai/Awesome-Code-LLM. 9 figures, 18 tables, and 902 references. Under review
♻ ☆ Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment ACL 2024
Evaluating and Rethinking the current landscape of Large Multimodal Models (LMMs), we observe that widely-used visual-language projection approaches (e.g., Q-former or MLP) focus on the alignment of image-text descriptions yet ignore the visual knowledge-dimension alignment, i.e., connecting visuals to their relevant knowledge. Visual knowledge plays a significant role in analyzing, inferring, and interpreting information from visuals, helping improve the accuracy of answers to knowledge-based visual questions. In this paper, we mainly explore improving LMMs with visual-language knowledge alignment, especially aimed at challenging knowledge-based visual question answering (VQA). To this end, we present a Cognitive Visual-Language Mapper (CVLM), which contains a pretrained Visual Knowledge Aligner (VKA) and a Fine-grained Knowledge Adapter (FKA) used in the multimodal instruction tuning stage. Specifically, we design the VKA based on the interaction between a small language model and a visual encoder, training it on collected image-knowledge pairs to achieve visual knowledge acquisition and projection. FKA is employed to distill the fine-grained visual knowledge of an image and inject it into Large Language Models (LLMs). We conduct extensive experiments on knowledge-based VQA benchmarks and experimental results show that CVLM significantly improves the performance of LMMs on knowledge-based VQA (average gain by 5.0%). Ablation studies also verify the effectiveness of VKA and FKA, respectively. The codes are available at https://github.com/HITsz-TMG/Cognitive-Visual-Language-Mapper
comment: 12 pages,4 figures; Accepted by ACL 2024 Main Conference
♻ ☆ Improving Demonstration Diversity by Human-Free Fusing for Text-to-SQL
Currently, the in-context learning method based on large language models (LLMs) has become the mainstream of text-to-SQL research. Previous works have discussed how to select demonstrations related to the user question from a human-labeled demonstration pool. However, human labeling suffers from the limitations of insufficient diversity and high labeling overhead. Therefore, in this paper, we discuss how to measure and improve the diversity of the demonstrations for text-to-SQL. We present a metric to measure the diversity of the demonstrations and analyze the insufficient of the existing labeled data by experiments. Based on the above discovery, we propose fusing iteratively for demonstrations (Fused) to build a high-diversity demonstration pool through human-free multiple-iteration synthesis, improving diversity and lowering label cost. Our method achieves an average improvement of 3.2% and 5.0% with and without human labeling on several mainstream datasets, which proves the effectiveness of Fused.
♻ ☆ Xmodel-LM Technical Report
We introduce Xmodel-LM, a compact and efficient 1.1B language model pre-trained on around 2 trillion tokens. Trained on our self-built dataset (Xdata), which balances Chinese and English corpora based on downstream task optimization, Xmodel-LM exhibits remarkable performance despite its smaller size. It notably surpasses existing open-source language models of similar scale. Our model checkpoints and code are publicly accessible on GitHub at https://github.com/XiaoduoAILab/XmodelLM.
♻ ☆ Med-MoE: Mixture of Domain-Specific Experts for Lightweight Medical Vision-Language Models
Recent advancements in general-purpose or domain-specific multimodal large language models (LLMs) have witnessed remarkable progress for medical decision-making. However, they are designated for specific classification or generative tasks, and require model training or finetuning on large-scale datasets with sizeable parameters and tremendous computing, hindering their clinical utility across diverse resource-constrained scenarios in practice. In this paper, we propose a novel and lightweight framework Med-MoE (Mixture-of-Experts) that tackles both discriminative and generative multimodal medical tasks. The learning of Med-MoE consists of three steps: multimodal medical alignment, instruction tuning and routing, and domain-specific MoE tuning. After aligning multimodal medical images with LLM tokens, we then enable the model for different multimodal medical tasks with instruction tuning, together with a trainable router tailored for expert selection across input modalities. Finally, the model is tuned by integrating the router with multiple domain-specific experts, which are selectively activated and further empowered by meta expert. Comprehensive experiments on both open- and close-end medical question answering (Med-VQA) and image classification tasks across datasets such as VQA-RAD, SLAKE and Path-VQA demonstrate that our model can achieve performance superior to or on par with state-of-the-art baselines, while only requiring approximately 30\%-50\% of activated model parameters. Extensive analysis and ablations corroborate the effectiveness and practical utility of our method.
♻ ☆ Generating Chain-of-Thoughts with a Pairwise-Comparison Approach to Searching for the Most Promising Intermediate Thought ICML 2024
To improve the ability of the large language model (LLMs) to tackle complex reasoning problems, chain-of-thoughts (CoT) methods were proposed to guide LLMs to reason step-by-step, enabling problem solving from simple to complex. State-of-the-art methods for generating such a chain involve interactive collaboration, where the learner generates candidate intermediate thoughts, evaluated by the LLM, guiding the generation of subsequent thoughts. However, a widespread yet understudied problem is that the evaluation from the LLM is typically noisy and unreliable, potentially misleading the generation process in selecting promising intermediate thoughts. In this paper, motivated by Vapnik's principle, we use pairwise-comparison evaluation instead of point-wise scoring to search for promising intermediate thoughts with the noisy feedback from the LLM. In each round, we randomly pair intermediate thoughts and directly prompt the LLM to select the more promising one from each pair, allowing us to identify the most promising thoughts through an iterative process. To further alleviate the noise in the comparison, we incorporate techniques from ensemble learning and dueling bandits, proposing two variants of the algorithm. Experiments on three real-world tasks demonstrate the effectiveness of our proposed algorithm and verify the rationale of the pairwise comparison mechanism.
comment: ICML 2024
♻ ☆ InterCLIP-MEP: Interactive CLIP and Memory-Enhanced Predictor for Multi-modal Sarcasm Detection
The prevalence of sarcasm in social media, conveyed through text-image combinations, presents significant challenges for sentiment analysis and intention mining. Current multi-modal sarcasm detection methods have been proven to struggle with biases from spurious cues, leading to a superficial understanding of the complex interactions between text and image. To address these issues, we propose InterCLIP-MEP, a robust framework for multi-modal sarcasm detection. InterCLIP-MEP introduces a refined variant of CLIP, Interactive CLIP (InterCLIP), as the backbone, enhancing sample representations by embedding cross-modality information in each encoder. Furthermore, a novel training strategy is designed to adapt InterCLIP for a Memory-Enhanced Predictor (MEP). MEP uses dynamic dual-channel memory to store valuable historical knowledge of test samples and then leverages this memory as a non-parametric classifier to derive the final prediction. By using InterCLIP to encode text-image interactions more effectively and incorporating MEP, InterCLIP-MEP offers a more robust recognition of multi-modal sarcasm. Experiments demonstrate that InterCLIP-MEP achieves state-of-the-art performance on the MMSD2.0 benchmark. Code and data are available at https://github.com/CoderChen01/InterCLIP-MEP.
comment: 8 pages, 6 figures, 6 tables
♻ ☆ MFSN: Multi-perspective Fusion Search Network For Pre-training Knowledge in Speech Emotion Recognition
Speech Emotion Recognition (SER) is an important research topic in human-computer interaction. Many recent works focus on directly extracting emotional cues through pre-trained knowledge, frequently overlooking considerations of appropriateness and comprehensiveness. Therefore, we propose a novel framework for pre-training knowledge in SER, called Multi-perspective Fusion Search Network (MFSN). Considering comprehensiveness, we partition speech knowledge into Textual-related Emotional Content (TEC) and Speech-related Emotional Content (SEC), capturing cues from both semantic and acoustic perspectives, and we design a new architecture search space to fully leverage them. Considering appropriateness, we verify the efficacy of different modeling approaches in capturing SEC and fills the gap in current research. Experimental results on multiple datasets demonstrate the superiority of MFSN.
♻ ☆ Ouroboros: Generating Longer Drafts Phrase by Phrase for Faster Speculative Decoding
Speculative decoding is a widely used method that accelerates the generation process of large language models (LLMs) with no compromise in model performance. It achieves this goal by using an existing smaller model for drafting and then employing the target LLM to verify the draft in a low-cost parallel manner. Under such a drafting-verification framework, drafting efficiency has become a bottleneck in the final speedup of speculative decoding. Therefore, generating longer drafts at less cost can lead to better decoding speedup. To achieve this, we introduce Ouroboros, which can generate draft phrases to parallelize the drafting process and meanwhile lengthen drafts in a training-free manner. The experimental results on various typical text generation tasks show that Ouroboros can achieve speedups of up to $2.4\times$ over speculative decoding and $3.9\times$ over vanilla decoding, without fine-tuning draft and target models.
♻ ☆ Marathon: A Race Through the Realm of Long Context with Large Language Models
With the advancement of large language models (LLMs) and the expansion of their context windows, existing long-context benchmarks fall short in effectively evaluating the models' comprehension and reasoning abilities in extended texts. Moreover, conventional benchmarks relying on F1 metrics often inaccurately score responses: they may undervalue correct answers that differ from the reference responses and overvalue incorrect ones that resemble the reference texts. In response to these limitations, we introduce Marathon, a novel evaluation benchmark that adopts a multiple-choice question format. It is specifically designed to overcome the constraints of previous benchmarks and provide a rapid, precise, and unbiased appraisal of the long-context comprehension skills of large language models. We conducted comprehensive evaluations on the Marathon benchmark with a range of state-of-the-art LLMs and assessed the effectiveness of various optimization strategies tailored for long-context generation. We anticipate that the Marathon benchmark and its associated leaderboard will enable a more precise and equitable evaluation of LLMs' capabilities in understanding and reasoning over extended contexts. Marathon is available at https://github.com/Hambaobao/Marathon.
♻ ☆ Safely Learning with Private Data: A Federated Learning Framework for Large Language Model
Private data, being larger and quality-higher than public data, can greatly improve large language models (LLM). However, due to privacy concerns, this data is often dispersed in multiple silos, making its secure utilization for LLM training a challenge. Federated learning (FL) is an ideal solution for training models with distributed private data, but traditional frameworks like FedAvg are unsuitable for LLM due to their high computational demands on clients. An alternative, split learning, offloads most training parameters to the server while training embedding and output layers locally, making it more suitable for LLM. Nonetheless, it faces significant challenges in security and efficiency. Firstly, the gradients of embeddings are prone to attacks, leading to potential reverse engineering of private data. Furthermore, the server's limitation of handle only one client's training request at a time hinders parallel training, severely impacting training efficiency. In this paper, we propose a Federated Learning framework for LLM, named FL-GLM, which prevents data leakage caused by both server-side and peer-client attacks while improving training efficiency. Specifically, we first place the input block and output block on local client to prevent embedding gradient attacks from server. Secondly, we employ key-encryption during client-server communication to prevent reverse engineering attacks from peer-clients. Lastly, we employ optimization methods like client-batching or server-hierarchical, adopting different acceleration methods based on the actual computational capabilities of the server. Experimental results on NLU and generation tasks demonstrate that FL-GLM achieves comparable metrics to centralized chatGLM model, validating the effectiveness of our federated learning framework.
♻ ☆ Pre-Calc: Learning to Use the Calculator Improves Numeracy in Language Models ICML 2024
Quantitative and numerical comprehension in language is an important task in many fields like education and finance, but still remains a challenging task for language models. While tool and calculator usage has shown to be helpful to improve mathematical reasoning in large pretrained decoder-only language models, this remains unexplored for smaller language models with encoders. In this paper, we propose Pre-Calc, a simple pre-finetuning objective of learning to use the calculator for both encoder-only and encoder-decoder architectures, formulated as a discriminative and generative task respectively. We pre-train BERT and RoBERTa for discriminative calculator use and Flan-T5 for generative calculator use on the MAWPS, SVAMP, and AsDiv-A datasets, which improves performance on downstream tasks that require numerical understanding. Our code and data are available at https://github.com/calc-cmu/pre-calc.
comment: AI4Math workshop, ICML 2024
♻ ☆ Key-Element-Informed sLLM Tuning for Document Summarization
Remarkable advances in large language models (LLMs) have enabled high-quality text summarization. However, this capability is currently accessible only through LLMs of substantial size or proprietary LLMs with usage fees. In response, smaller-scale LLMs (sLLMs) of easy accessibility and low costs have been extensively studied, yet they often suffer from missing key information and entities, i.e., low relevance, in particular, when input documents are long. We hence propose a key-element-informed instruction tuning for summarization, so-called KEITSum, which identifies key elements in documents and instructs sLLM to generate summaries capturing these key elements. Experimental results on dialogue and news datasets demonstrate that sLLM with KEITSum indeed provides high-quality summarization with higher relevance and less hallucinations, competitive to proprietary LLM.
comment: Interspeech 2024
♻ ☆ The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources
Foundation model development attracts a rapidly expanding body of contributors, scientists, and applications. To help shape responsible development practices, we introduce the Foundation Model Development Cheatsheet: a growing collection of 250+ tools and resources spanning text, vision, and speech modalities. We draw on a large body of prior work to survey resources (e.g. software, documentation, frameworks, guides, and practical tools) that support informed data selection, processing, and understanding, precise and limitation-aware artifact documentation, efficient model training, advance awareness of the environmental impact from training, careful model evaluation of capabilities, risks, and claims, as well as responsible model release, licensing and deployment practices. We hope this curated collection of resources helps guide more responsible development. The process of curating this list, enabled us to review the AI development ecosystem, revealing what tools are critically missing, misused, or over-used in existing practices. We find that (i) tools for data sourcing, model evaluation, and monitoring are critically under-serving ethical and real-world needs, (ii) evaluations for model safety, capabilities, and environmental impact all lack reproducibility and transparency, (iii) text and particularly English-centric analyses continue to dominate over multilingual and multi-modal analyses, and (iv) evaluation of systems, rather than just models, is needed so that capabilities and impact are assessed in context.
♻ ☆ How Good are LLMs at Relation Extraction under Low-Resource Scenario? Comprehensive Evaluation
Relation Extraction (RE) serves as a crucial technology for transforming unstructured text into structured information, especially within the framework of Knowledge Graph development. Its importance is emphasized by its essential role in various downstream tasks. Besides the conventional RE methods which are based on neural networks and pre-trained language models, large language models (LLMs) are also utilized in the research field of RE. However, on low-resource languages (LRLs), both conventional RE methods and LLM-based methods perform poorly on RE due to the data scarcity issues. To this end, this paper constructs low-resource relation extraction datasets in 10 LRLs in three regions (Central Asia, Southeast Asia and Middle East). The corpora are constructed by translating the original publicly available English RE datasets (NYT10, FewRel and CrossRE) using an effective multilingual machine translation. Then, we use the language perplexity (PPL) to filter out the low-quality data from the translated datasets. Finally, we conduct an empirical study and validate the performance of several open-source LLMs on these generated LRL RE datasets.
♻ ☆ Iterative Reasoning Preference Optimization
Iterative preference optimization methods have recently been shown to perform well for general instruction tuning tasks, but typically make little improvement on reasoning tasks (Yuan et al., 2024, Chen et al., 2024). In this work we develop an iterative approach that optimizes the preference between competing generated Chain-of-Thought (CoT) candidates by optimizing for winning vs. losing reasoning steps that lead to the correct answer. We train using a modified DPO loss (Rafailov et al., 2023) with an additional negative log-likelihood term, which we find to be crucial. We show reasoning improves across repeated iterations of this scheme. While only relying on examples in the training set, our approach results in increasing accuracy on GSM8K, MATH, and ARC-Challenge for Llama-2-70B-Chat, outperforming other Llama-2-based models not relying on additionally sourced datasets. For example, we see a large improvement from 55.6% to 81.6% on GSM8K and an accuracy of 88.7% with majority voting out of 32 samples.
♻ ☆ Can Public LLMs be used for Self-Diagnosis of Medical Conditions ?
Advancements in deep learning have generated a large-scale interest in the development of foundational deep learning models. The development of Large Language Models (LLM) has evolved as a transformative paradigm in conversational tasks, which has led to its integration and extension even in the critical domain of healthcare. With LLMs becoming widely popular and their public access through open-source models and integration with other applications, there is a need to investigate their potential and limitations. One such crucial task where LLMs are applied but require a deeper understanding is that of self-diagnosis of medical conditions based on bias-validating symptoms in the interest of public health. The widespread integration of Gemini with Google search and GPT-4.0 with Bing search has led to a shift in the trend of self-diagnosis using search engines to conversational LLM models. Owing to the critical nature of the task, it is prudent to investigate and understand the potential and limitations of public LLMs in the task of self-diagnosis. In this study, we prepare a prompt engineered dataset of 10000 samples and test the performance on the general task of self-diagnosis. We compared the performance of both the state-of-the-art GPT-4.0 and the fee Gemini model on the task of self-diagnosis and recorded contrasting accuracies of 63.07% and 6.01%, respectively. We also discuss the challenges, limitations, and potential of both Gemini and GPT-4.0 for the task of self-diagnosis to facilitate future research and towards the broader impact of general public knowledge. Furthermore, we demonstrate the potential and improvement in performance for the task of self-diagnosis using Retrieval Augmented Generation.
comment: 11 Pages, 4 figures, Submitted to ACM Transactions on Computing for Healthcare
♻ ☆ A Survey on Human-AI Teaming with Large Pre-Trained Models
In the rapidly evolving landscape of artificial intelligence (AI), the collaboration between human intelligence and AI systems, known as Human-AI (HAI) Teaming, has emerged as a cornerstone for advancing problem-solving and decision-making processes. The advent of Large Pre-trained Models (LPtM) has significantly transformed this landscape, offering unprecedented capabilities by leveraging vast amounts of data to understand and predict complex patterns. This paper surveys the pivotal integration of LPtMs with HAI, emphasizing how these models enhance collaborative intelligence beyond traditional approaches. It examines the potential of LPtMs in augmenting human capabilities, discussing this collaboration for AI model improvements, effective teaming, ethical considerations, and their broad applied implications in various sectors. Through this exploration, the study sheds light on the transformative impact of LPtM-enhanced HAI Teaming, providing insights for future research, policy development, and strategic implementations aimed at harnessing the full potential of this collaboration for research and societal benefit.
♻ ☆ Deception Detection from Linguistic and Physiological Data Streams Using Bimodal Convolutional Neural Networks
Deception detection is gaining increasing interest due to ethical and security concerns. This paper explores the application of convolutional neural networks for the purpose of multimodal deception detection. We use a dataset built by interviewing 104 subjects about two topics, with one truthful and one falsified response from each subject about each topic. In particular, we make three main contributions. First, we extract linguistic and physiological features from this data to train and construct the neural network models. Second, we propose a fused convolutional neural network model using both modalities in order to achieve an improved overall performance. Third, we compare our new approach with earlier methods designed for multimodal deception detection. We find that our system outperforms regular classification methods; our results indicate the feasibility of using neural networks for deception detection even in the presence of limited amounts of data.
comment: Accepted by 2024 5th International Conference on Information Science, Parallel and Distributed Systems
♻ ☆ Decoding the AI Pen: Techniques and Challenges in Detecting AI-Generated Text
Large Language Models (LLMs) have revolutionized the field of Natural Language Generation (NLG) by demonstrating an impressive ability to generate human-like text. However, their widespread usage introduces challenges that necessitate thoughtful examination, ethical scrutiny, and responsible practices. In this study, we delve into these challenges, explore existing strategies for mitigating them, with a particular emphasis on identifying AI-generated text as the ultimate solution. Additionally, we assess the feasibility of detection from a theoretical perspective and propose novel research directions to address the current limitations in this domain.
♻ ☆ CodeHalu: Code Hallucinations in LLMs Driven by Execution-based Verification
Large Language Models (LLMs) have made significant progress in code generation, providing developers with unprecedented automated programming support. However, LLMs often generate code that is syntactically correct and even semantically plausible but may not execute as expected or meet specified requirements. This phenomenon of hallucinations in the code domain has not been systematically explored. To enhance the community's understanding and research on this issue, we introduce the concept of code hallucinations and propose a classification method for code hallucination based on execution verification. We classify code hallucinations into four main types: mapping, naming, resource, and logic hallucinations, with each category further divided into different subcategories to understand and address the unique challenges faced by LLMs in code generation with finer granularity. Additionally, we develop a dynamic detection algorithm named CodeHalu to quantify code hallucinations and establish the CodeHaluEval benchmark, which includes 8,883 samples from 699 tasks to systematically and quantitatively evaluate code hallucinations. By evaluating 17 popular LLMs on this benchmark, we reveal significant differences in their accuracy and reliability in code generation and provide detailed insights for further improving the code generation capabilities of LLMs. The CodeHalu benchmark and code are publicly available at https://github.com/yuchen814/CodeHalu.
♻ ☆ Simultaneous Masking, Not Prompting Optimization: A Paradigm Shift in Fine-tuning LLMs for Simultaneous Translation
Large language models (LLMs) have achieved state-of-the-art performance in various language processing tasks, motivating their adoption in simultaneous translation. Current fine-tuning methods to adapt LLMs for simultaneous translation focus on prompting optimization strategies using either data augmentation or prompt structure modifications. However, these methods suffer from several issues, such as unnecessarily expanded training sets, computational inefficiency from dumping the key and value cache, increased prompt sizes, or restriction to a single decision policy. To eliminate these issues, in this work, we propose SimulMask, a new paradigm for fine-tuning LLMs for simultaneous translation. It utilizes a novel attention mask approach that models simultaneous translation during fine-tuning by masking attention for a desired decision policy. Applying the proposed SimulMask on a Falcon LLM for the IWSLT 2017 dataset, we have observed a significant translation quality improvement compared to state-of-the-art prompting optimization strategies on five language pairs while reducing the computational cost.
♻ ☆ Theory of Mind for Multi-Agent Collaboration via Large Language Models EMNLP 2023
While Large Language Models (LLMs) have demonstrated impressive accomplishments in both reasoning and planning, their abilities in multi-agent collaborations remains largely unexplored. This study evaluates LLM-based agents in a multi-agent cooperative text game with Theory of Mind (ToM) inference tasks, comparing their performance with Multi-Agent Reinforcement Learning (MARL) and planning-based baselines. We observed evidence of emergent collaborative behaviors and high-order Theory of Mind capabilities among LLM-based agents. Our results reveal limitations in LLM-based agents' planning optimization due to systematic failures in managing long-horizon contexts and hallucination about the task state. We explore the use of explicit belief state representations to mitigate these issues, finding that it enhances task performance and the accuracy of ToM inferences for LLM-based agents.
comment: Accepted to EMNLP 2023 (Main Conference). Code available at https://github.com/romanlee6/multi_LLM_comm
♻ ☆ Deductive Closure Training of Language Models for Coherence, Accuracy, and Updatability ACL
While language models (LMs) can sometimes generate factually correct text and estimate truth values of individual claims, these generally do not reflect a globally coherent, manipulable model of the world. As a consequence, current LMs also generate incorrect or nonsensical content, and are difficult to edit and bring up to date. We present a method called Deductive Closure Training (DCT) that uses LMs themselves to identify implications of (and contradictions within) the text that they generate, yielding an efficient self-supervised procedure for improving LM factuality. Given a collection of seed documents, DCT prompts LMs to generate additional text implied by these documents, reason globally about the correctness of this generated text, and finally fine-tune on text inferred to be correct. Given seed documents from a trusted source, DCT provides a tool for supervised model updating; if seed documents are sampled from the LM itself, DCT enables fully unsupervised fine-tuning for improved coherence and accuracy. Across the CREAK, MQUaKE, and Reversal Curse datasets, supervised DCT improves LM fact verification and text generation accuracy by 3-26%; on CREAK fully unsupervised DCT improves verification accuracy by 12%. These results show that LMs' reasoning capabilities during inference can be leveraged during training to improve their reliability.
comment: ACL Findings
Machine Learning 155
☆ Towards Compositionality in Concept Learning ICML 2024
Concept-based interpretability methods offer a lens into the internals of foundation models by decomposing their embeddings into high-level concepts. These concept representations are most useful when they are compositional, meaning that the individual concepts compose to explain the full sample. We show that existing unsupervised concept extraction methods find concepts which are not compositional. To automatically discover compositional concept representations, we identify two salient properties of such representations, and propose Compositional Concept Extraction (CCE) for finding concepts which obey these properties. We evaluate CCE on five different datasets over image and text data. Our evaluation shows that CCE finds more compositional concept representations than baselines and yields better accuracy on four downstream classification tasks. Code and data are available at https://github.com/adaminsky/compositional_concepts .
comment: Accepted at ICML 2024. 26 pages, 10 figures
☆ Symbolic Learning Enables Self-Evolving Agents
The AI community has been exploring a pathway to artificial general intelligence (AGI) by developing "language agents", which are complex large language models (LLMs) pipelines involving both prompting techniques and tool usage methods. While language agents have demonstrated impressive capabilities for many real-world tasks, a fundamental limitation of current language agents research is that they are model-centric, or engineering-centric. That's to say, the progress on prompts, tools, and pipelines of language agents requires substantial manual engineering efforts from human experts rather than automatically learning from data. We believe the transition from model-centric, or engineering-centric, to data-centric, i.e., the ability of language agents to autonomously learn and evolve in environments, is the key for them to possibly achieve AGI. In this work, we introduce agent symbolic learning, a systematic framework that enables language agents to optimize themselves on their own in a data-centric way using symbolic optimizers. Specifically, we consider agents as symbolic networks where learnable weights are defined by prompts, tools, and the way they are stacked together. Agent symbolic learning is designed to optimize the symbolic network within language agents by mimicking two fundamental algorithms in connectionist learning: back-propagation and gradient descent. Instead of dealing with numeric weights, agent symbolic learning works with natural language simulacrums of weights, loss, and gradients. We conduct proof-of-concept experiments on both standard benchmarks and complex real-world tasks and show that agent symbolic learning enables language agents to update themselves after being created and deployed in the wild, resulting in "self-evolving agents".
comment: Code available at https://github.com/aiwaves-cn/agents
☆ Confident Natural Policy Gradient for Local Planning in $q_π$-realizable Constrained MDPs
The constrained Markov decision process (CMDP) framework emerges as an important reinforcement learning approach for imposing safety or other critical objectives while maximizing cumulative reward. However, the current understanding of how to learn efficiently in a CMDP environment with a potentially infinite number of states remains under investigation, particularly when function approximation is applied to the value functions. In this paper, we address the learning problem given linear function approximation with $q_{\pi}$-realizability, where the value functions of all policies are linearly representable with a known feature map, a setting known to be more general and challenging than other linear settings. Utilizing a local-access model, we propose a novel primal-dual algorithm that, after $\tilde{O}(\text{poly}(d) \epsilon^{-3})$ queries, outputs with high probability a policy that strictly satisfies the constraints while nearly optimizing the value with respect to a reward function. Here, $d$ is the feature dimension and $\epsilon > 0$ is a given error. The algorithm relies on a carefully crafted off-policy evaluation procedure to evaluate the policy using historical data, which informs policy updates through policy gradients and conserves samples. To our knowledge, this is the first result achieving polynomial sample complexity for CMDP in the $q_{\pi}$-realizable setting.
☆ APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets
The advancement of function-calling agent models requires diverse, reliable, and high-quality datasets. This paper presents APIGen, an automated data generation pipeline designed to synthesize verifiable high-quality datasets for function-calling applications. We leverage APIGen and collect 3,673 executable APIs across 21 different categories to generate diverse function-calling datasets in a scalable and structured manner. Each data in our dataset is verified through three hierarchical stages: format checking, actual function executions, and semantic verification, ensuring its reliability and correctness. We demonstrate that models trained with our curated datasets, even with only 7B parameters, can achieve state-of-the-art performance on the Berkeley Function-Calling Benchmark, outperforming multiple GPT-4 models. Moreover, our 1B model achieves exceptional performance, surpassing GPT-3.5-Turbo and Claude-3 Haiku. We release a dataset containing 60,000 high-quality entries, aiming to advance the field of function-calling agent domains. The dataset is available on Huggingface: https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k and the project homepage: https://apigen-pipeline.github.io/
☆ Mental Modeling of Reinforcement Learning Agents by Language Models
Can emergent language models faithfully model the intelligence of decision-making agents? Though modern language models exhibit already some reasoning ability, and theoretically can potentially express any probable distribution over tokens, it remains underexplored how the world knowledge these pretrained models have memorized can be utilized to comprehend an agent's behaviour in the physical world. This study empirically examines, for the first time, how well large language models (LLMs) can build a mental model of agents, termed agent mental modelling, by reasoning about an agent's behaviour and its effect on states from agent interaction history. This research may unveil the potential of leveraging LLMs for elucidating RL agent behaviour, addressing a key challenge in eXplainable reinforcement learning (XRL). To this end, we propose specific evaluation metrics and test them on selected RL task datasets of varying complexity, reporting findings on agent mental model establishment. Our results disclose that LLMs are not yet capable of fully mental modelling agents through inference alone without further innovations. This work thus provides new insights into the capabilities and limitations of modern LLMs.
comment: https://lukaswill.github.io/
☆ Enhancing Federated Learning with Adaptive Differential Privacy and Priority-Based Aggregation
Federated learning (FL), a novel branch of distributed machine learning (ML), develops global models through a private procedure without direct access to local datasets. However, it is still possible to access the model updates (gradient updates of deep neural networks) transferred between clients and servers, potentially revealing sensitive local information to adversaries using model inversion attacks. Differential privacy (DP) offers a promising approach to addressing this issue by adding noise to the parameters. On the other hand, heterogeneities in data structure, storage, communication, and computational capabilities of devices can cause convergence problems and delays in developing the global model. A personalized weighted averaging of local parameters based on the resources of each device can yield a better aggregated model in each round. In this paper, to efficiently preserve privacy, we propose a personalized DP framework that injects noise based on clients' relative impact factors and aggregates parameters while considering heterogeneities and adjusting properties. To fulfill the DP requirements, we first analyze the convergence boundary of the FL algorithm when impact factors are personalized and fixed throughout the learning process. We then further study the convergence property considering time-varying (adaptive) impact factors.
☆ UniRec: A Dual Enhancement of Uniformity and Frequency in Sequential Recommendations
Representation learning in sequential recommendation is critical for accurately modeling user interaction patterns and improving recommendation precision. However, existing approaches predominantly emphasize item-to-item transitions, often neglecting the time intervals between interactions, which are closely related to behavior pattern changes. Additionally, broader interaction attributes, such as item frequency, are frequently overlooked. We found that both sequences with more uniform time intervals and items with higher frequency yield better prediction performance. Conversely, non-uniform sequences exacerbate user interest drift and less-frequent items are difficult to model due to sparse sampling, presenting unique challenges inadequately addressed by current methods. In this paper, we propose UniRec, a novel bidirectional enhancement sequential recommendation method. UniRec leverages sequence uniformity and item frequency to enhance performance, particularly improving the representation of non-uniform sequences and less-frequent items. These two branches mutually reinforce each other, driving comprehensive performance optimization in complex sequential recommendation scenarios. Additionally, we present a multidimensional time module to further enhance adaptability. To the best of our knowledge, UniRec is the first method to utilize the characteristics of uniformity and frequency for feature augmentation. Comparing with eleven advanced models across four datasets, we demonstrate that UniRec outperforms SOTA models significantly. The code is available at https://github.com/Linxi000/UniRec.
comment: 15 pages, 8 figures, for source code, see https://github.com/Linxi000/UniRec
☆ Bayesian inverse Navier-Stokes problems: joint flow field reconstruction and parameter learning
We formulate and solve a Bayesian inverse Navier-Stokes (N-S) problem that assimilates velocimetry data in order to jointly reconstruct a 3D flow field and learn the unknown N-S parameters, including the boundary position. By hardwiring a generalised N-S problem, and regularising its unknown parameters using Gaussian prior distributions, we learn the most likely parameters in a collapsed search space. The most likely flow field reconstruction is then the N-S solution that corresponds to the learned parameters. We develop the method in the variational setting and use a stabilised Nitsche weak form of the N-S problem that permits the control of all N-S parameters. To regularise the inferred the geometry, we use a viscous signed distance field (vSDF) as an auxiliary variable, which is given as the solution of a viscous Eikonal boundary value problem. We devise an algorithm that solves this inverse problem, and numerically implement it using an adjoint-consistent stabilised cut-cell finite element method. We then use this method to reconstruct magnetic resonance velocimetry (flow-MRI) data of a 3D steady laminar flow through a physical model of an aortic arch for two different Reynolds numbers and signal-to-noise ratio (SNR) levels (low/high). We find that the method can accurately i) reconstruct the low SNR data by filtering out the noise/artefacts and recovering flow features that are obscured by noise, and ii) reproduce the high SNR data without overfitting. Although the framework that we develop applies to 3D steady laminar flows in complex geometries, it readily extends to time-dependent laminar and Reynolds-averaged turbulent flows, as well as non-Newtonian (e.g. viscoelastic) fluids.
☆ Detecting Brittle Decisions for Free: Leveraging Margin Consistency in Deep Robust Classifiers
Despite extensive research on adversarial training strategies to improve robustness, the decisions of even the most robust deep learning models can still be quite sensitive to imperceptible perturbations, creating serious risks when deploying them for high-stakes real-world applications. While detecting such cases may be critical, evaluating a model's vulnerability at a per-instance level using adversarial attacks is computationally too intensive and unsuitable for real-time deployment scenarios. The input space margin is the exact score to detect non-robust samples and is intractable for deep neural networks. This paper introduces the concept of margin consistency -- a property that links the input space margins and the logit margins in robust models -- for efficient detection of vulnerable samples. First, we establish that margin consistency is a necessary and sufficient condition to use a model's logit margin as a score for identifying non-robust samples. Next, through comprehensive empirical analysis of various robustly trained models on CIFAR10 and CIFAR100 datasets, we show that they indicate strong margin consistency with a strong correlation between their input space margins and the logit margins. Then, we show that we can effectively use the logit margin to confidently detect brittle decisions with such models and accurately estimate robust accuracy on an arbitrarily large test set by estimating the input margins only on a small subset. Finally, we address cases where the model is not sufficiently margin-consistent by learning a pseudo-margin from the feature representation. Our findings highlight the potential of leveraging deep representations to efficiently assess adversarial vulnerability in deployment scenarios.
comment: 11 pages, 7 figures, 2 tables, 1 algorithm
☆ Preference Elicitation for Offline Reinforcement Learning
Applying reinforcement learning (RL) to real-world problems is often made challenging by the inability to interact with the environment and the difficulty of designing reward functions. Offline RL addresses the first challenge by considering access to an offline dataset of environment interactions labeled by the reward function. In contrast, Preference-based RL does not assume access to the reward function and learns it from preferences, but typically requires an online interaction with the environment. We bridge the gap between these frameworks by exploring efficient methods for acquiring preference feedback in a fully offline setup. We propose Sim-OPRL, an offline preference-based reinforcement learning algorithm, which leverages a learned environment model to elicit preference feedback on simulated rollouts. Drawing on insights from both the offline RL and the preference-based RL literature, our algorithm employs a pessimistic approach for out-of-distribution data, and an optimistic approach for acquiring informative preferences about the optimal policy. We provide theoretical guarantees regarding the sample complexity of our approach, dependent on how well the offline data covers the optimal policy. Finally, we demonstrate the empirical performance of Sim-OPRL in different environments.
☆ An Autotuning-based Optimization Framework for Mixed-kernel SVM Classifications in Smart Pixel Datasets and Heterojunction Transistors
Support Vector Machine (SVM) is a state-of-the-art classification method widely used in science and engineering due to its high accuracy, its ability to deal with high dimensional data, and its flexibility in modeling diverse sources of data. In this paper, we propose an autotuning-based optimization framework to quantify the ranges of hyperparameters in SVMs to identify their optimal choices, and apply the framework to two SVMs with the mixed-kernel between Sigmoid and Gaussian kernels for smart pixel datasets in high energy physics (HEP) and mixed-kernel heterojunction transistors (MKH). Our experimental results show that the optimal selection of hyperparameters in the SVMs and the kernels greatly varies for different applications and datasets, and choosing their optimal choices is critical for a high classification accuracy of the mixed kernel SVMs. Uninformed choices of hyperparameters C and coef0 in the mixed-kernel SVMs result in severely low accuracy, and the proposed framework effectively quantifies the proper ranges for the hyperparameters in the SVMs to identify their optimal choices to achieve the highest accuracy 94.6\% for the HEP application and the highest average accuracy 97.2\% with far less tuning time for the MKH application.
☆ Graph Neural Networks for Emulation of Finite-Element Ice Dynamics in Greenland and Antarctic Ice Sheets ICML 2024
Although numerical models provide accurate solutions for ice sheet dynamics based on physics laws, they accompany intensified computational demands to solve partial differential equations. In recent years, convolutional neural networks (CNNs) have been widely used as statistical emulators for those numerical models. However, since CNNs operate on regular grids, they cannot represent the refined meshes and computational efficiency of finite-element numerical models. Therefore, instead of CNNs, this study adopts an equivariant graph convolutional network (EGCN) as an emulator for the ice sheet dynamics modeling. EGCN reproduces ice thickness and velocity changes in the Helheim Glacier, Greenland, and Pine Island Glacier, Antarctica, with 260 times and 44 times faster computation time, respectively. Compared to the traditional CNN and graph convolutional network, EGCN shows outstanding accuracy in thickness prediction near fast ice streams by preserving the equivariance to the translation and rotation of graphs.
comment: 6 pages, 2 figures, submitted to the ICML 2024 Workshop on Machine Learning for Earth System Modeling
☆ Mixture of Experts in a Mixture of RL settings
Mixtures of Experts (MoEs) have gained prominence in (self-)supervised learning due to their enhanced inference efficiency, adaptability to distributed training, and modularity. Previous research has illustrated that MoEs can significantly boost Deep Reinforcement Learning (DRL) performance by expanding the network's parameter count while reducing dormant neurons, thereby enhancing the model's learning capacity and ability to deal with non-stationarity. In this work, we shed more light on MoEs' ability to deal with non-stationarity and investigate MoEs in DRL settings with "amplified" non-stationarity via multi-task training, providing further evidence that MoEs improve learning capacity. In contrast to previous work, our multi-task results allow us to better understand the underlying causes for the beneficial effect of MoE in DRL training, the impact of the various MoE components, and insights into how best to incorporate them in actor-critic-based DRL networks. Finally, we also confirm results from previous work.
☆ Differential error feedback for communication-efficient decentralized learning
Communication-constrained algorithms for decentralized learning and optimization rely on local updates coupled with the exchange of compressed signals. In this context, differential quantization is an effective technique to mitigate the negative impact of compression by leveraging correlations between successive iterates. In addition, the use of error feedback, which consists of incorporating the compression error into subsequent steps, is a powerful mechanism to compensate for the bias caused by the compression. Under error feedback, performance guarantees in the literature have so far focused on algorithms employing a fusion center or a special class of contractive compressors that cannot be implemented with a finite number of bits. In this work, we propose a new decentralized communication-efficient learning approach that blends differential quantization with error feedback. The approach is specifically tailored for decentralized learning problems where agents have individual risk functions to minimize subject to subspace constraints that require the minimizers across the network to lie in low-dimensional subspaces. This constrained formulation includes consensus or single-task optimization as special cases, and allows for more general task relatedness models such as multitask smoothness and coupled optimization. We show that, under some general conditions on the compression noise, and for sufficiently small step-sizes $\mu$, the resulting communication-efficient strategy is stable both in terms of mean-square error and average bit rate: by reducing $\mu$, it is possible to keep the estimation errors small (on the order of $\mu$) without increasing indefinitely the bit rate as $\mu\rightarrow 0$. The results establish that, in the small step-size regime and with a finite number of bits, it is possible to attain the performance achievable in the absence of compression.
comment: arXiv admin note: text overlap with arXiv:2209.07821
☆ Towards diffusion models for large-scale sea-ice modelling ICML 2024
We make the first steps towards diffusion models for unconditional generation of multivariate and Arctic-wide sea-ice states. While targeting to reduce the computational costs by diffusion in latent space, latent diffusion models also offer the possibility to integrate physical knowledge into the generation process. We tailor latent diffusion models to sea-ice physics with a censored Gaussian distribution in data space to generate data that follows the physical bounds of the modelled variables. Our latent diffusion models reach similar scores as the diffusion model trained in data space, but they smooth the generated fields as caused by the latent mapping. While enforcing physical bounds cannot reduce the smoothing, it improves the representation of the marginal ice zone. Therefore, for large-scale Earth system modelling, latent diffusion models can have many advantages compared to diffusion in data space if the significant barrier of smoothing can be resolved.
comment: 21 pages, 5 figure, Accepted at the ICML 2024 Machine Learning for Earth System Modeling workshop
☆ Do LLMs dream of elephants (when told not to)? Latent concept association and associative memory in transformers
Large Language Models (LLMs) have the capacity to store and recall facts. Through experimentation with open-source models, we observe that this ability to retrieve facts can be easily manipulated by changing contexts, even without altering their factual meanings. These findings highlight that LLMs might behave like an associative memory model where certain tokens in the contexts serve as clues to retrieving facts. We mathematically explore this property by studying how transformers, the building blocks of LLMs, can complete such memory tasks. We study a simple latent concept association problem with a one-layer transformer and we show theoretically and empirically that the transformer gathers information using self-attention and uses the value matrix for associative memory.
☆ Second Maximum of a Gaussian Random Field and Exact (t-)Spacing test
In this article, we introduce the novel concept of the second maximum of a Gaussian random field on a Riemannian submanifold. This second maximum serves as a powerful tool for characterizing the distribution of the maximum. By utilizing an ad-hoc Kac Rice formula, we derive the explicit form of the maximum's distribution, conditioned on the second maximum and some regressed component of the Riemannian Hessian. This approach results in an exact test, based on the evaluation of spacing between these maxima, which we refer to as the spacing test. We investigate the applicability of this test in detecting sparse alternatives within Gaussian symmetric tensors, continuous sparse deconvolution, and two-layered neural networks with smooth rectifiers. Our theoretical results are supported by numerical experiments, which illustrate the calibration and power of the proposed tests. More generally, this test can be applied to any Gaussian random field on a Riemannian manifold, and we provide a general framework for the application of the spacing test in continuous sparse kernel regression. Furthermore, when the variance-covariance function of the Gaussian random field is known up to a scaling factor, we derive an exact Studentized version of our test, coined the $t$-spacing test. This test is perfectly calibrated under the null hypothesis and has high power for detecting sparse alternatives.
comment: 5 figures, 22 pages main document, 2 pages supplements
☆ DoubleTake: Geometry Guided Depth Estimation
Estimating depth from a sequence of posed RGB images is a fundamental computer vision task, with applications in augmented reality, path planning etc. Prior work typically makes use of previous frames in a multi view stereo framework, relying on matching textures in a local neighborhood. In contrast, our model leverages historical predictions by giving the latest 3D geometry data as an extra input to our network. This self-generated geometric hint can encode information from areas of the scene not covered by the keyframes and it is more regularized when compared to individual predicted depth maps for previous frames. We introduce a Hint MLP which combines cost volume features with a hint of the prior geometry, rendered as a depth map from the current camera location, together with a measure of the confidence in the prior geometry. We demonstrate that our method, which can run at interactive speeds, achieves state-of-the-art estimates of depth and 3D scene reconstruction in both offline and incremental evaluation scenarios.
☆ Adversarial Search Engine Optimization for Large Language Models
Large Language Models (LLMs) are increasingly used in applications where the model selects from competing third-party content, such as in LLM-powered search engines or chatbot plugins. In this paper, we introduce Preference Manipulation Attacks, a new class of attacks that manipulate an LLM's selections to favor the attacker. We demonstrate that carefully crafted website content or plugin documentations can trick an LLM to promote the attacker products and discredit competitors, thereby increasing user traffic and monetization. We show this leads to a prisoner's dilemma, where all parties are incentivized to launch attacks, but the collective effect degrades the LLM's outputs for everyone. We demonstrate our attacks on production LLM search engines (Bing and Perplexity) and plugin APIs (for GPT-4 and Claude). As LLMs are increasingly used to rank third-party content, we expect Preference Manipulation Attacks to emerge as a significant threat.
☆ KAGNNs: Kolmogorov-Arnold Networks meet Graph Learning
In recent years, Graph Neural Networks (GNNs) have become the de facto tool for learning node and graph representations. Most GNNs typically consist of a sequence of neighborhood aggregation (a.k.a., message passing) layers. Within each of these layers, the representation of each node is updated from an aggregation and transformation of its neighbours representations at the previous layer. The upper bound for the expressive power of message passing GNNs was reached through the use of MLPs as a transformation, due to their universal approximation capabilities. However, MLPs suffer from well-known limitations, which recently motivated the introduction of Kolmogorov-Arnold Networks (KANs). KANs rely on the Kolmogorov-Arnold representation theorem, rendering them a promising alternative to MLPs. In this work, we compare the performance of KANs against that of MLPs in graph learning tasks. We perform extensive experiments on node classification, graph classification and graph regression datasets. Our preliminary results indicate that while KANs are on-par with MLPs in classification tasks, they seem to have a clear advantage in the graph regression tasks.
☆ Learning pure quantum states (almost) without regret
We initiate the study of quantum state tomography with minimal regret. A learner has sequential oracle access to an unknown pure quantum state, and in each round selects a pure probe state. Regret is incurred if the unknown state is measured orthogonal to this probe, and the learner's goal is to minimise the expected cumulative regret over $T$ rounds. The challenge is to find a balance between the most informative measurements and measurements incurring minimal regret. We show that the cumulative regret scales as $\Theta(\operatorname{polylog} T)$ using a new tomography algorithm based on a median of means least squares estimator. This algorithm employs measurements biased towards the unknown state and produces online estimates that are optimal (up to logarithmic terms) in the number of observed samples.
comment: 24 pages, 2 figures
☆ Kolmogorov-Arnold Graph Neural Networks
Graph neural networks (GNNs) excel in learning from network-like data but often lack interpretability, making their application challenging in domains requiring transparent decision-making. We propose the Graph Kolmogorov-Arnold Network (GKAN), a novel GNN model leveraging spline-based activation functions on edges to enhance both accuracy and interpretability. Our experiments on five benchmark datasets demonstrate that GKAN outperforms state-of-the-art GNN models in node classification, link prediction, and graph classification tasks. In addition to the improved accuracy, GKAN's design inherently provides clear insights into the model's decision-making process, eliminating the need for post-hoc explainability techniques. This paper discusses the methodology, performance, and interpretability of GKAN, highlighting its potential for applications in domains where interpretability is crucial.
comment: 7 pages, 4 figures, under review
☆ Reinforcement Learning with Intrinsically Motivated Feedback Graph for Lost-sales Inventory Control
Reinforcement learning (RL) has proven to be well-performed and general-purpose in the inventory control (IC). However, further improvement of RL algorithms in the IC domain is impeded due to two limitations of online experience. First, online experience is expensive to acquire in real-world applications. With the low sample efficiency nature of RL algorithms, it would take extensive time to train the RL policy to convergence. Second, online experience may not reflect the true demand due to the lost sales phenomenon typical in IC, which makes the learning process more challenging. To address the above challenges, we propose a decision framework that combines reinforcement learning with feedback graph (RLFG) and intrinsically motivated exploration (IME) to boost sample efficiency. In particular, we first take advantage of the inherent properties of lost-sales IC problems and design the feedback graph (FG) specially for lost-sales IC problems to generate abundant side experiences aid RL updates. Then we conduct a rigorous theoretical analysis of how the designed FG reduces the sample complexity of RL methods. Based on the theoretical insights, we design an intrinsic reward to direct the RL agent to explore to the state-action space with more side experiences, further exploiting FG's power. Experimental results demonstrate that our method greatly improves the sample efficiency of applying RL in IC. Our code is available at https://anonymous.4open.science/r/RLIMFG4IC-811D/
☆ EmT: A Novel Transformer for Generalized Cross-subject EEG Emotion Recognition
Integrating prior knowledge of neurophysiology into neural network architecture enhances the performance of emotion decoding. While numerous techniques emphasize learning spatial and short-term temporal patterns, there has been limited emphasis on capturing the vital long-term contextual information associated with emotional cognitive processes. In order to address this discrepancy, we introduce a novel transformer model called emotion transformer (EmT). EmT is designed to excel in both generalized cross-subject EEG emotion classification and regression tasks. In EmT, EEG signals are transformed into a temporal graph format, creating a sequence of EEG feature graphs using a temporal graph construction module (TGC). A novel residual multi-view pyramid GCN module (RMPG) is then proposed to learn dynamic graph representations for each EEG feature graph within the series, and the learned representations of each graph are fused into one token. Furthermore, we design a temporal contextual transformer module (TCT) with two types of token mixers to learn the temporal contextual information. Finally, the task-specific output module (TSO) generates the desired outputs. Experiments on four publicly available datasets show that EmT achieves higher results than the baseline methods for both EEG emotion classification and regression tasks. The code is available at https://github.com/yi-ding-cs/EmT.
comment: 11 pages, 5 figures. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
☆ Efficient and Accurate Explanation Estimation with Distribution Compression ICML 2024
Exact computation of various machine learning explanations requires numerous model evaluations and in extreme cases becomes impractical. The computational cost of approximation increases with an ever-increasing size of data and model parameters. Many heuristics have been proposed to approximate post-hoc explanations efficiently. This paper shows that the standard i.i.d. sampling used in a broad spectrum of algorithms for explanation estimation leads to an approximation error worthy of improvement. To this end, we introduce Compress Then Explain (CTE), a new paradigm for more efficient and accurate explanation estimation. CTE uses distribution compression through kernel thinning to obtain a data sample that best approximates the marginal distribution. We show that CTE improves the estimation of removal-based local and global explanations with negligible computational overhead. It often achieves an on-par explanation approximation error using 2-3x less samples, i.e. requiring 2-3x less model evaluations. CTE is a simple, yet powerful, plug-in for any explanation method that now relies on i.i.d. sampling.
comment: To be presented at the ICML 2024 Workshop on DMLR
☆ Early Classification of Time Series: Taxonomy and Benchmark
In many situations, the measurements of a studied phenomenon are provided sequentially, and the prediction of its class needs to be made as early as possible so as not to incur too high a time penalty, but not too early and risk paying the cost of misclassification. This problem has been particularly studied in the case of time series, and is known as Early Classification of Time Series (ECTS). Although it has been the subject of a growing body of literature, there is still a lack of a systematic, shared evaluation protocol to compare the relative merits of the various existing methods. This document begins by situating these methods within a principle-based taxonomy. It defines dimensions for organizing their evaluation, and then reports the results of a very extensive set of experiments along these dimensions involving nine state-of-the art ECTS algorithms. In addition, these and other experiments can be carried out using an open-source library in which most of the existing ECTS algorithms have been implemented (see \url{https://github.com/ML-EDM/ml_edm}).
☆ Molecular Diffusion Models with Virtual Receptors
Machine learning approaches to Structure-Based Drug Design (SBDD) have proven quite fertile over the last few years. In particular, diffusion-based approaches to SBDD have shown great promise. We present a technique which expands on this diffusion approach in two crucial ways. First, we address the size disparity between the drug molecule and the target/receptor, which makes learning more challenging and inference slower. We do so through the notion of a Virtual Receptor, which is a compressed version of the receptor; it is learned so as to preserve key aspects of the structural information of the original receptor, while respecting the relevant group equivariance. Second, we incorporate a protein language embedding used originally in the context of protein folding. We experimentally demonstrate the contributions of both the virtual receptors and the protein embeddings: in practice, they lead to both better performance, as well as significantly faster computations.
☆ PDFA Distillation via String Probability Queries {PDFA Distillation via String Probability Queries}
Probabilistic deterministic finite automata (PDFA) are discrete event systems modeling conditional probabilities over languages: Given an already seen sequence of tokens they return the probability of tokens of interest to appear next. These types of models have gained interest in the domain of explainable machine learning, where they are used as surrogate models for neural networks trained as language models. In this work we present an algorithm to distill PDFA from neural networks. Our algorithm is a derivative of the L# algorithm and capable of learning PDFA from a new type of query, in which the algorithm infers conditional probabilities from the probability of the queried string to occur. We show its effectiveness on a recent public dataset by distilling PDFA from a set of trained neural networks.
comment: LearnAUT 2024
☆ Multi-modal Evidential Fusion Network for Trusted PET/CT Tumor Segmentation
Accurate segmentation of tumors in PET/CT images is important in computer-aided diagnosis and treatment of cancer. The key issue of such a segmentation problem lies in the effective integration of complementary information from PET and CT images. However, the quality of PET and CT images varies widely in clinical settings, which leads to uncertainty in the modality information extracted by networks. To take the uncertainty into account in multi-modal information fusion, this paper proposes a novel Multi-modal Evidential Fusion Network (MEFN) comprising a Cross-Modal Feature Learning (CFL) module and a Multi-modal Trusted Fusion (MTF) module. The CFL module reduces the domain gap upon modality conversion and highlights common tumor features, thereby alleviating the needs of the segmentation module to handle modality specificity. The MTF module utilizes mutual attention mechanisms and an uncertainty calibrator to fuse modality features based on modality uncertainty and then fuse the segmentation results under the guidance of Dempster-Shafer Theory. Besides, a new uncertainty perceptual loss is introduced to force the model focusing on uncertain features and hence improve its ability to extract trusted modality information. Extensive comparative experiments are conducted on two publicly available PET/CT datasets to evaluate the performance of our proposed method whose results demonstrate that our MEFN significantly outperforms state-of-the-art methods with improvements of 2.15% and 3.23% in DSC scores on the AutoPET dataset and the Hecktor dataset, respectively. More importantly, our model can provide radiologists with credible uncertainty of the segmentation results for their decision in accepting or rejecting the automatic segmentation results, which is particularly important for clinical applications. Our code will be available at https://github.com/QPaws/MEFN.
☆ Trade-off between Gradient Measurement Efficiency and Expressivity in Deep Quantum Neural Networks
Quantum neural networks (QNNs) require an efficient training algorithm to achieve practical quantum advantages. A promising approach is the use of gradient-based optimization algorithms, where gradients are estimated through quantum measurements. However, it is generally difficult to efficiently measure gradients in QNNs because the quantum state collapses upon measurement. In this work, we prove a general trade-off between gradient measurement efficiency and expressivity in a wide class of deep QNNs, elucidating the theoretical limits and possibilities of efficient gradient estimation. This trade-off implies that a more expressive QNN requires a higher measurement cost in gradient estimation, whereas we can increase gradient measurement efficiency by reducing the QNN expressivity to suit a given task. We further propose a general QNN ansatz called the stabilizer-logical product ansatz (SLPA), which can reach the upper limit of the trade-off inequality by leveraging the symmetric structure of the quantum circuit. In learning an unknown symmetric function, the SLPA drastically reduces the quantum resources required for training while maintaining accuracy and trainability compared to a well-designed symmetric circuit based on the parameter-shift method. Our results not only reveal a theoretical understanding of efficient training in QNNs but also provide a standard and broadly applicable efficient QNN design.
comment: 32 pages, 11 figures
☆ ContactNet: Geometric-Based Deep Learning Model for Predicting Protein-Protein Interactions
Deep learning approaches achieved significant progress in predicting protein structures. These methods are often applied to protein-protein interactions (PPIs) yet require Multiple Sequence Alignment (MSA) which is unavailable for various interactions, such as antibody-antigen. Computational docking methods are capable of sampling accurate complex models, but also produce thousands of invalid configurations. The design of scoring functions for identifying accurate models is a long-standing challenge. We develop a novel attention-based Graph Neural Network (GNN), ContactNet, for classifying PPI models obtained from docking algorithms into accurate and incorrect ones. When trained on docked antigen and modeled antibody structures, ContactNet doubles the accuracy of current state-of-the-art scoring functions, achieving accurate models among its Top-10 at 43% of the test cases. When applied to unbound antibodies, its Top-10 accuracy increases to 65%. This performance is achieved without MSA and the approach is applicable to other types of interactions, such as host-pathogens or general PPIs.
☆ Online Learning of Multiple Tasks and Their Relationships : Testing on Spam Email Data and EEG Signals Recorded in Construction Fields
This paper examines an online multi-task learning (OMTL) method, which processes data sequentially to predict labels across related tasks. The framework learns task weights and their relatedness concurrently. Unlike previous models that assumed static task relatedness, our approach treats tasks as initially independent, updating their relatedness iteratively using newly calculated weight vectors. We introduced three rules to update the task relatedness matrix: OMTLCOV, OMTLLOG, and OMTLVON, and compared them against a conventional method (CMTL) that uses a fixed relatedness value. Performance evaluations on three datasets a spam dataset and two EEG datasets from construction workers under varying conditions demonstrated that our OMTL methods outperform CMTL, improving accuracy by 1\% to 3\% on EEG data, and maintaining low error rates around 12\% on the spam dataset.
☆ Spatial-temporal Hierarchical Reinforcement Learning for Interpretable Pathology Image Super-Resolution
Pathology image are essential for accurately interpreting lesion cells in cytopathology screening, but acquiring high-resolution digital slides requires specialized equipment and long scanning times. Though super-resolution (SR) techniques can alleviate this problem, existing deep learning models recover pathology image in a black-box manner, which can lead to untruthful biological details and misdiagnosis. Additionally, current methods allocate the same computational resources to recover each pixel of pathology image, leading to the sub-optimal recovery issue due to the large variation of pathology image. In this paper, we propose the first hierarchical reinforcement learning framework named Spatial-Temporal hierARchical Reinforcement Learning (STAR-RL), mainly for addressing the aforementioned issues in pathology image super-resolution problem. We reformulate the SR problem as a Markov decision process of interpretable operations and adopt the hierarchical recovery mechanism in patch level, to avoid sub-optimal recovery. Specifically, the higher-level spatial manager is proposed to pick out the most corrupted patch for the lower-level patch worker. Moreover, the higher-level temporal manager is advanced to evaluate the selected patch and determine whether the optimization should be stopped earlier, thereby avoiding the over-processed problem. Under the guidance of spatial-temporal managers, the lower-level patch worker processes the selected patch with pixel-wise interpretable actions at each time step. Experimental results on medical images degraded by different kernels show the effectiveness of STAR-RL. Furthermore, STAR-RL validates the promotion in tumor diagnosis with a large margin and shows generalizability under various degradations. The source code is available at https://github.com/CUHK-AIM-Group/STAR-RL.
comment: Accepted to IEEE TRANSACTIONS ON MEDICAL IMAGING (TMI)
☆ Automated Immunophenotyping Assessment for Diagnosing Childhood Acute Leukemia using Set-Transformers
Acute Leukemia is the most common hematologic malignancy in children and adolescents. A key methodology in the diagnostic evaluation of this malignancy is immunophenotyping based on Multiparameter Flow Cytometry (FCM). However, this approach is manual, and thus time-consuming and subjective. To alleviate this situation, we propose in this paper the FCM-Former, a machine learning, self-attention based FCM-diagnostic tool, automating the immunophenotyping assessment in Childhood Acute Leukemia. The FCM-Former is trained in a supervised manner, by directly using flow cytometric data. Our FCM-Former achieves an accuracy of 96.5% assigning lineage to each sample among 960 cases of either acute B-cell, T-cell lymphoblastic, and acute myeloid leukemia (B-ALL, T-ALL, AML). To the best of our knowledge, the FCM-Former is the first work that automates the immunophenotyping assessment with FCM data in diagnosing pediatric Acute Leukemia.
comment: The paper has been accepted at IEEE EMBS 2024 (46th Annual International Conference of the IEEE Engineering in Medicine and Biology Society)
☆ Evaluating and Benchmarking Foundation Models for Earth Observation and Geospatial AI
When we are primarily interested in solving several problems jointly with a given prescribed high performance accuracy for each target application, then Foundation Models should for most cases be used rather than problem-specific models. We focus on the specific Computer Vision application of Foundation Models for Earth Observation (EO) and geospatial AI. These models can solve important problems we are tackling, including for example land cover classification, crop type mapping, flood segmentation, building density estimation, and road regression segmentation. In this paper, we show that for a limited number of labelled data, Foundation Models achieve improved performance compared to problem-specific models. In this work, we also present our proposed evaluation benchmark for Foundation Models for EO. Benchmarking the generalization performance of Foundation Models is important as it has become difficult to standardize a fair comparison across the many different models that have been proposed recently. We present the results using our evaluation benchmark for EO Foundation Models and show that Foundation Models are label efficient in the downstream tasks and help us solve problems we are tackling in EO and remote sensing.
comment: 5 pages, 2 figures, Submitted
☆ Combining Automated Optimisation of Hyperparameters and Reward Shape
There has been significant progress in deep reinforcement learning (RL) in recent years. Nevertheless, finding suitable hyperparameter configurations and reward functions remains challenging even for experts, and performance heavily relies on these design choices. Also, most RL research is conducted on known benchmarks where knowledge about these choices already exists. However, novel practical applications often pose complex tasks for which no prior knowledge about good hyperparameters and reward functions is available, thus necessitating their derivation from scratch. Prior work has examined automatically tuning either hyperparameters or reward functions individually. We demonstrate empirically that an RL algorithm's hyperparameter configurations and reward function are often mutually dependent, meaning neither can be fully optimised without appropriate values for the other. We then propose a methodology for the combined optimisation of hyperparameters and the reward function. Furthermore, we include a variance penalty as an optimisation objective to improve the stability of learned policies. We conducted extensive experiments using Proximal Policy Optimisation and Soft Actor-Critic on four environments. Our results show that combined optimisation significantly improves over baseline performance in half of the environments and achieves competitive performance in the others, with only a minor increase in computational costs. This suggests that combined optimisation should be best practice.
comment: Published in the Reinforcement Learning Journal 2024
☆ CAS: Confidence Assessments of classification algorithms for Semantic segmentation of EO data
Confidence assessments of semantic segmentation algorithms in remote sensing are important. It is a desirable property of models to a priori know if they produce an incorrect output. Evaluations of the confidence assigned to the estimates of models for the task of classification in Earth Observation (EO) are crucial as they can be used to achieve improved semantic segmentation performance and prevent high error rates during inference and deployment. The model we develop, the Confidence Assessments of classification algorithms for Semantic segmentation (CAS) model, performs confidence evaluations at both the segment and pixel levels, and outputs both labels and confidence. The outcome of this work has important applications. The main application is the evaluation of EO Foundation Models on semantic segmentation downstream tasks, in particular land cover classification using satellite Copernicus Sentinel-2 data. The evaluation shows that the proposed model is effective and outperforms other alternative baseline models.
comment: 5 pages, 7 figures, 4 tables, Submitted
☆ Foundational Models for Pathology and Endoscopy Images: Application for Gastric Inflammation
The integration of artificial intelligence (AI) in medical diagnostics represents a significant advancement in managing upper gastrointestinal (GI) cancer, a major cause of global cancer mortality. Specifically for gastric cancer (GC), chronic inflammation causes changes in the mucosa such as atrophy, intestinal metaplasia (IM), dysplasia and ultimately cancer. Early detection through endoscopic regular surveillance is essential for better outcomes. Foundation models (FM), which are machine or deep learning models trained on diverse data and applicable to broad use cases, offer a promising solution to enhance the accuracy of endoscopy and its subsequent pathology image analysis. This review explores the recent advancements, applications, and challenges associated with FM in endoscopy and pathology imaging. We started by elucidating the core principles and architectures underlying these models, including their training methodologies and the pivotal role of large-scale data in developing their predictive capabilities. Moreover, this work discusses emerging trends and future research directions, emphasizing the integration of multimodal data, the development of more robust and equitable models, and the potential for real-time diagnostic support. This review aims to provide a roadmap for researchers and practitioners in navigating the complexities of incorporating FM into clinical practice for prevention/management of GC cases, thereby improving patient outcomes.
☆ Generative artificial intelligence in ophthalmology: multimodal retinal images for the diagnosis of Alzheimer's disease with convolutional neural networks
Background/Aim. This study aims to predict Amyloid Positron Emission Tomography (AmyloidPET) status with multimodal retinal imaging and convolutional neural networks (CNNs) and to improve the performance through pretraining with synthetic data. Methods. Fundus autofluorescence, optical coherence tomography (OCT), and OCT angiography images from 328 eyes of 59 AmyloidPET positive subjects and 108 AmyloidPET negative subjects were used for classification. Denoising Diffusion Probabilistic Models (DDPMs) were trained to generate synthetic images and unimodal CNNs were pretrained on synthetic data and finetuned on real data or trained solely on real data. Multimodal classifiers were developed to combine predictions of the four unimodal CNNs with patient metadata. Class activation maps of the unimodal classifiers provided insight into the network's attention to inputs. Results. DDPMs generated diverse, realistic images without memorization. Pretraining unimodal CNNs with synthetic data improved AUPR at most from 0.350 to 0.579. Integration of metadata in multimodal CNNs improved AUPR from 0.486 to 0.634, which was the best overall best classifier. Class activation maps highlighted relevant retinal regions which correlated with AD. Conclusion. Our method for generating and leveraging synthetic data has the potential to improve AmyloidPET prediction from multimodal retinal imaging. A DDPM can generate realistic and unique multimodal synthetic retinal images. Our best performing unimodal and multimodal classifiers were not pretrained on synthetic data, however pretraining with synthetic data slightly improved classification performance for two out of the four modalities.
☆ Guiding Video Prediction with Explicit Procedural Knowledge ICCV
We propose a general way to integrate procedural knowledge of a domain into deep learning models. We apply it to the case of video prediction, building on top of object-centric deep models and show that this leads to a better performance than using data-driven models alone. We develop an architecture that facilitates latent space disentanglement in order to use the integrated procedural knowledge, and establish a setup that allows the model to learn the procedural interface in the latent space using the downstream task of video prediction. We contrast the performance to a state-of-the-art data-driven approach and show that problems where purely data-driven approaches struggle can be handled by using knowledge about the domain, providing an alternative to simply collecting more data.
comment: Published in 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)
☆ A Closer Look into Mixture-of-Experts in Large Language Models
Mixture-of-experts (MoE) is gaining increasing attention due to its unique properties and remarkable performance, especially for language tasks. By sparsely activating a subset of parameters for each token, MoE architecture could increase the model size without sacrificing computational efficiency, achieving a better trade-off between performance and training costs. However, the underlying mechanism of MoE still lacks further exploration, and its modularization degree remains questionable. In this paper, we make an initial attempt to understand the inner workings of MoE-based large language models. Concretely, we comprehensively study the parametric and behavioral features of three recent MoE-based models and reveal some intriguing observations, including (1) Neurons act like fine-grained experts. (2) The router of MoE usually selects experts with larger output norms. (3) The expert diversity increases as the layer increases, while the last layer is an outlier. Based on the observations, we also provide suggestions for a broad spectrum of MoE practitioners, such as router design and expert allocation. We hope this work could shed light on future research on the MoE framework and other modular architectures. Code is available at https://github.com/kamanphoebe/Look-into-MoEs.
☆ Selective Prompting Tuning for Personalized Conversations with LLMs ACL 2024
In conversational AI, personalizing dialogues with persona profiles and contextual understanding is essential. Despite large language models' (LLMs) improved response coherence, effective persona integration remains a challenge. In this work, we first study two common approaches for personalizing LLMs: textual prompting and direct fine-tuning. We observed that textual prompting often struggles to yield responses that are similar to the ground truths in datasets, while direct fine-tuning tends to produce repetitive or overly generic replies. To alleviate those issues, we propose \textbf{S}elective \textbf{P}rompt \textbf{T}uning (SPT), which softly prompts LLMs for personalized conversations in a selective way. Concretely, SPT initializes a set of soft prompts and uses a trainable dense retriever to adaptively select suitable soft prompts for LLMs according to different input contexts, where the prompt retriever is dynamically updated through feedback from the LLMs. Additionally, we propose context-prompt contrastive learning and prompt fusion learning to encourage the SPT to enhance the diversity of personalized conversations. Experiments on the CONVAI2 dataset demonstrate that SPT significantly enhances response diversity by up to 90\%, along with improvements in other critical performance indicators. Those results highlight the efficacy of SPT in fostering engaging and personalized dialogue generation. The SPT model code (https://github.com/hqsiswiliam/SPT) is publicly available for further exploration.
comment: Accepted to ACL 2024 findings
☆ DeepExtremeCubes: Integrating Earth system spatio-temporal data for impact assessment of climate extremes
With climate extremes' rising frequency and intensity, robust analytical tools are crucial to predict their impacts on terrestrial ecosystems. Machine learning techniques show promise but require well-structured, high-quality, and curated analysis-ready datasets. Earth observation datasets comprehensively monitor ecosystem dynamics and responses to climatic extremes, yet the data complexity can challenge the effectiveness of machine learning models. Despite recent progress in deep learning to ecosystem monitoring, there is a need for datasets specifically designed to analyse compound heatwave and drought extreme impact. Here, we introduce the DeepExtremeCubes database, tailored to map around these extremes, focusing on persistent natural vegetation. It comprises over 40,000 spatially sampled small data cubes (i.e. minicubes) globally, with a spatial coverage of 2.5 by 2.5 km. Each minicube includes (i) Sentinel-2 L2A images, (ii) ERA5-Land variables and generated extreme event cube covering 2016 to 2022, and (iii) ancillary land cover and topography maps. The paper aims to (1) streamline data accessibility, structuring, pre-processing, and enhance scientific reproducibility, and (2) facilitate biosphere dynamics forecasting in response to compound extremes.
☆ NeBuLa: A discourse aware Minecraft Builder
When engaging in collaborative tasks, humans efficiently exploit the semantic structure of a conversation to optimize verbal and nonverbal interactions. But in recent "language to code" or "language to action" models, this information is lacking. We show how incorporating the prior discourse and nonlinguistic context of a conversation situated in a nonlinguistic environment can improve the "language to action" component of such interactions. We fine tune an LLM to predict actions based on prior context; our model, NeBuLa, doubles the net-action F1 score over the baseline on this task of Jayannavar et al.(2020). We also investigate our model's ability to construct shapes and understand location descriptions using a synthetic dataset.
comment: 10 pages, 3 figures
☆ FedAQ: Communication-Efficient Federated Edge Learning via Joint Uplink and Downlink Adaptive Quantization
Federated learning (FL) is a powerful machine learning paradigm which leverages the data as well as the computational resources of clients, while protecting clients' data privacy. However, the substantial model size and frequent aggregation between the server and clients result in significant communication overhead, making it challenging to deploy FL in resource-limited wireless networks. In this work, we aim to mitigate the communication overhead by using quantization. Previous research on quantization has primarily focused on the uplink communication, employing either fixed-bit quantization or adaptive quantization methods. In this work, we introduce a holistic approach by joint uplink and downlink adaptive quantization to reduce the communication overhead. In particular, we optimize the learning convergence by determining the optimal uplink and downlink quantization bit-length, with a communication energy constraint. Theoretical analysis shows that the optimal quantization levels depend on the range of model gradients or weights. Based on this insight, we propose a decreasing-trend quantization for the uplink and an increasing-trend quantization for the downlink, which aligns with the change of the model parameters during the training process. Experimental results show that, the proposed joint uplink and downlink adaptive quantization strategy can save up to 66.7% energy compared with the existing schemes.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
☆ Beyond Statistical Estimation: Differentially Private Individual Computation in the Shuffle Model
The shuffle model of differential privacy (DP) has recently emerged as a powerful one for decentralized computation without fully trustable parties. Since it anonymizes and permutes messages from clients through a shuffler, the privacy can be amplified and utility can be improved. However, the shuffling procedure in turn restricts its applications only to statistical tasks that are permutation-invariant. This work explores the feasibility of shuffle privacy amplification for prevalent non-statistical computations: spatial crowdsourcing, combinatorial optimization, location-based social systems, and federated learning with incentives, which suffer either computationally intractability or intolerable utility loss in existing approaches (e.g., secure MPC and local DP). We proposes a new paradigm of shuffle model that can provide critical security functionalities like message authorization and result access control, meanwhile maintaining the most of privacy amplification effects. It incurs almost the same computation/communication costs as the non-private setting, and permits the server to run arbitrary algorithms on (noisy) client information in plaintext. Our novel technique is introducing statistically random identity into DP and force identical random distribution on all clients, so as to support secure functionalities even after message shuffling and to maintain privacy amplification simultaneously. Given that existing DP randomizers fails in the new shuffle model, we also propose a new mechanism and prove its optimality therein. Experimental results on spatial crowdsourcing, location-based social system, and federated learning with incentives, show that our paradigm and mechanism is fast as non-private settings, while reducing up to 90% error and increasing utility performance indicates by 100%-300% relatively, and can be practical under reasonable privacy budget.
☆ Sparse deep neural networks for nonparametric estimation in high-dimensional sparse regression
Generalization theory has been established for sparse deep neural networks under high-dimensional regime. Beyond generalization, parameter estimation is also important since it is crucial for variable selection and interpretability of deep neural networks. Current theoretical studies concerning parameter estimation mainly focus on two-layer neural networks, which is due to the fact that the convergence of parameter estimation heavily relies on the regularity of the Hessian matrix, while the Hessian matrix of deep neural networks is highly singular. To avoid the unidentifiability of deep neural networks in parameter estimation, we propose to conduct nonparametric estimation of partial derivatives with respect to inputs. We first show that model convergence of sparse deep neural networks is guaranteed in that the sample complexity only grows with the logarithm of the number of parameters or the input dimension when the $\ell_{1}$-norm of parameters is well constrained. Then by bounding the norm and the divergence of partial derivatives, we establish that the convergence rate of nonparametric estimation of partial derivatives scales as $\mathcal{O}(n^{-1/4})$, a rate which is slower than the model convergence rate $\mathcal{O}(n^{-1/2})$. To the best of our knowledge, this study combines nonparametric estimation and parametric sparse deep neural networks for the first time. As nonparametric estimation of partial derivatives is of great significance for nonlinear variable selection, the current results show the promising future for the interpretability of deep neural networks.
☆ Sequential Disentanglement by Extracting Static Information From A Single Sequence Element ICML 2024
One of the fundamental representation learning tasks is unsupervised sequential disentanglement, where latent codes of inputs are decomposed to a single static factor and a sequence of dynamic factors. To extract this latent information, existing methods condition the static and dynamic codes on the entire input sequence. Unfortunately, these models often suffer from information leakage, i.e., the dynamic vectors encode both static and dynamic information, or vice versa, leading to a non-disentangled representation. Attempts to alleviate this problem via reducing the dynamic dimension and auxiliary loss terms gain only partial success. Instead, we propose a novel and simple architecture that mitigates information leakage by offering a simple and effective subtraction inductive bias while conditioning on a single sample. Remarkably, the resulting variational framework is simpler in terms of required loss terms, hyperparameters, and data augmentation. We evaluate our method on multiple data-modality benchmarks including general time series, video, and audio, and we show beyond state-of-the-art results on generation and prediction tasks in comparison to several strong baselines.
comment: Accepted to ICML 2024; The first four authors contributed equally
☆ CTS: Sim-to-Real Unsupervised Domain Adaptation on 3D Detection
Simulation data can be accurately labeled and have been expected to improve the performance of data-driven algorithms, including object detection. However, due to the various domain inconsistencies from simulation to reality (sim-to-real), cross-domain object detection algorithms usually suffer from dramatic performance drops. While numerous unsupervised domain adaptation (UDA) methods have been developed to address cross-domain tasks between real-world datasets, progress in sim-to-real remains limited. This paper presents a novel Complex-to-Simple (CTS) framework to transfer models from labeled simulation (source) to unlabeled reality (target) domains. Based on a two-stage detector, the novelty of this work is threefold: 1) developing fixed-size anchor heads and RoI augmentation to address size bias and feature diversity between two domains, thereby improving the quality of pseudo-label; 2) developing a novel corner-format representation of aleatoric uncertainty (AU) for the bounding box, to uniformly quantify pseudo-label quality; 3) developing a noise-aware mean teacher domain adaptation method based on AU, as well as object-level and frame-level sampling strategies, to migrate the impact of noisy labels. Experimental results demonstrate that our proposed approach significantly enhances the sim-to-real domain adaptation capability of 3D object detection models, outperforming state-of-the-art cross-domain algorithms, which are usually developed for real-to-real UDA tasks.
☆ ResumeAtlas: Revisiting Resume Classification with Large-Scale Datasets and Large Language Models
The increasing reliance on online recruitment platforms coupled with the adoption of AI technologies has highlighted the critical need for efficient resume classification methods. However, challenges such as small datasets, lack of standardized resume templates, and privacy concerns hinder the accuracy and effectiveness of existing classification models. In this work, we address these challenges by presenting a comprehensive approach to resume classification. We curated a large-scale dataset of 13,389 resumes from diverse sources and employed Large Language Models (LLMs) such as BERT and Gemma1.1 2B for classification. Our results demonstrate significant improvements over traditional machine learning approaches, with our best model achieving a top-1 accuracy of 92\% and a top-5 accuracy of 97.5\%. These findings underscore the importance of dataset quality and advanced model architectures in enhancing the accuracy and robustness of resume classification systems, thus advancing the field of online recruitment practices.
comment: 8 pages, 6 figures, 1 table, 6th International Conference on AI in Computational Linguistics
☆ ArzEn-LLM: Code-Switched Egyptian Arabic-English Translation and Speech Recognition Using LLMs
Motivated by the widespread increase in the phenomenon of code-switching between Egyptian Arabic and English in recent times, this paper explores the intricacies of machine translation (MT) and automatic speech recognition (ASR) systems, focusing on translating code-switched Egyptian Arabic-English to either English or Egyptian Arabic. Our goal is to present the methodologies employed in developing these systems, utilizing large language models such as LLama and Gemma. In the field of ASR, we explore the utilization of the Whisper model for code-switched Egyptian Arabic recognition, detailing our experimental procedures including data preprocessing and training techniques. Through the implementation of a consecutive speech-to-text translation system that integrates ASR with MT, we aim to overcome challenges posed by limited resources and the unique characteristics of the Egyptian Arabic dialect. Evaluation against established metrics showcases promising results, with our methodologies yielding a significant improvement of $56\%$ in English translation over the state-of-the-art and $9.3\%$ in Arabic translation. Since code-switching is deeply inherent in spoken languages, it is crucial that ASR systems can effectively handle this phenomenon. This capability is crucial for enabling seamless interaction in various domains, including business negotiations, cultural exchanges, and academic discourse. Our models and code are available as open-source resources. Code: \url{http://github.com/ahmedheakl/arazn-llm}}, Models: \url{http://huggingface.co/collections/ahmedheakl/arazn-llm-662ceaf12777656607b9524e}.
comment: 9 pages, 4 figures, 5 tables, 6th International Conference on AI in Computational Linguistics
☆ Robust personnel rostering: how accurate should absenteeism predictions be?
Disruptions to personnel rosters caused by absenteeism often necessitate last-minute adjustments to the employees' working hours. A common strategy to mitigate the impact of such changes is to assign employees to reserve shifts: special on-call duties during which an employee can be called in to cover for an absent employee. To maximize roster robustness, we assume a predict-then-optimize approach that uses absence predictions from a machine learning model to schedule an adequate number of reserve shifts. In this paper we propose a methodology to evaluate the robustness of rosters generated by the predict-then-optimize approach, assuming the machine learning model will make predictions at a predetermined prediction performance level. Instead of training and testing machine learning models, our methodology simulates the predictions based on a characterization of model performance. We show how this methodology can be applied to identify the minimum performance level needed for the model to outperform simple non-data-driven robust rostering policies. In a computational study on a nurse rostering problem, we demonstrate how the predict-then-optimize approach outperforms non-data-driven policies under reasonable performance requirements, particularly when employees possess interchangeable skills.
☆ Token-Weighted RNN-T for Learning from Flawed Data
ASR models are commonly trained with the cross-entropy criterion to increase the probability of a target token sequence. While optimizing the probability of all tokens in the target sequence is sensible, one may want to de-emphasize tokens that reflect transcription errors. In this work, we propose a novel token-weighted RNN-T criterion that augments the RNN-T objective with token-specific weights. The new objective is used for mitigating accuracy loss from transcriptions errors in the training data, which naturally appear in two settings: pseudo-labeling and human annotation errors. Experiments results show that using our method for semi-supervised learning with pseudo-labels leads to a consistent accuracy improvement, up to 38% relative. We also analyze the accuracy degradation resulting from different levels of WER in the reference transcription, and show that token-weighted RNN-T is suitable for overcoming this degradation, recovering 64%-99% of the accuracy loss.
☆ Learning for Bandits under Action Erasures
We consider a novel multi-arm bandit (MAB) setup, where a learner needs to communicate the actions to distributed agents over erasure channels, while the rewards for the actions are directly available to the learner through external sensors. In our model, while the distributed agents know if an action is erased, the central learner does not (there is no feedback), and thus does not know whether the observed reward resulted from the desired action or not. We propose a scheme that can work on top of any (existing or future) MAB algorithm and make it robust to action erasures. Our scheme results in a worst-case regret over action-erasure channels that is at most a factor of $O(1/\sqrt{1-\epsilon})$ away from the no-erasure worst-case regret of the underlying MAB algorithm, where $\epsilon$ is the erasure probability. We also propose a modification of the successive arm elimination algorithm and prove that its worst-case regret is $\Tilde{O}(\sqrt{KT}+K/(1-\epsilon))$, which we prove is optimal by providing a matching lower bound.
☆ Learning Optimal Filters Using Variational Inference
Filtering-the task of estimating the conditional distribution of states of a dynamical system given partial, noisy, observations-is important in many areas of science and engineering, including weather and climate prediction. However, the filtering distribution is generally intractable to obtain for high-dimensional, nonlinear systems. Filters used in practice, such as the ensemble Kalman filter (EnKF), are biased for nonlinear systems and have numerous tuning parameters. Here, we present a framework for learning a parameterized analysis map-the map that takes a forecast distribution and observations to the filtering distribution-using variational inference. We show that this methodology can be used to learn gain matrices for filtering linear and nonlinear dynamical systems, as well as inflation and localization parameters for an EnKF. Future work will apply this framework to learn new filtering algorithms.
☆ Breaking the Barrier: Enhanced Utility and Robustness in Smoothed DRL Agents ICML 2024
Robustness remains a paramount concern in deep reinforcement learning (DRL), with randomized smoothing emerging as a key technique for enhancing this attribute. However, a notable gap exists in the performance of current smoothed DRL agents, often characterized by significantly low clean rewards and weak robustness. In response to this challenge, our study introduces innovative algorithms aimed at training effective smoothed robust DRL agents. We propose S-DQN and S-PPO, novel approaches that demonstrate remarkable improvements in clean rewards, empirical robustness, and robustness guarantee across standard RL benchmarks. Notably, our S-DQN and S-PPO agents not only significantly outperform existing smoothed agents by an average factor of $2.16\times$ under the strongest attack, but also surpass previous robustly-trained agents by an average factor of $2.13\times$. This represents a significant leap forward in the field. Furthermore, we introduce Smoothed Attack, which is $1.89\times$ more effective in decreasing the rewards of smoothed agents than existing adversarial attacks.
comment: Published in ICML 2024
☆ AdaZeta: Adaptive Zeroth-Order Tensor-Train Adaption for Memory-Efficient Large Language Models Fine-Tuning
Fine-tuning large language models (LLMs) has achieved remarkable performance across various natural language processing tasks, yet it demands more and more memory as model sizes keep growing. To address this issue, the recently proposed Memory-efficient Zeroth-order (MeZO) methods attempt to fine-tune LLMs using only forward passes, thereby avoiding the need for a backpropagation graph. However, significant performance drops and a high risk of divergence have limited their widespread adoption. In this paper, we propose the Adaptive Zeroth-order Tensor-Train Adaption (AdaZeta) framework, specifically designed to improve the performance and convergence of the ZO methods. To enhance dimension-dependent ZO estimation accuracy, we introduce a fast-forward, low-parameter tensorized adapter. To tackle the frequently observed divergence issue in large-scale ZO fine-tuning tasks, we propose an adaptive query number schedule that guarantees convergence. Detailed theoretical analysis and extensive experimental results on Roberta-Large and Llama-2-7B models substantiate the efficacy of our AdaZeta framework in terms of accuracy, memory efficiency, and convergence speed.
☆ Bidirectional-Reachable Hierarchical Reinforcement Learning with Mutually Responsive Policies
Hierarchical reinforcement learning (HRL) addresses complex long-horizon tasks by skillfully decomposing them into subgoals. Therefore, the effectiveness of HRL is greatly influenced by subgoal reachability. Typical HRL methods only consider subgoal reachability from the unilateral level, where a dominant level enforces compliance to the subordinate level. However, we observe that when the dominant level becomes trapped in local exploration or generates unattainable subgoals, the subordinate level is negatively affected and cannot follow the dominant level's actions. This can potentially make both levels stuck in local optima, ultimately hindering subsequent subgoal reachability. Allowing real-time bilateral information sharing and error correction would be a natural cure for this issue, which motivates us to propose a mutual response mechanism. Based on this, we propose the Bidirectional-reachable Hierarchical Policy Optimization (BrHPO)--a simple yet effective algorithm that also enjoys computation efficiency. Experiment results on a variety of long-horizon tasks showcase that BrHPO outperforms other state-of-the-art HRL baselines, coupled with a significantly higher exploration efficiency and robustness.
☆ Multimodal foundation world models for generalist embodied agents
Learning generalist embodied agents, able to solve multitudes of tasks in different domains is a long-standing problem. Reinforcement learning (RL) is hard to scale up as it requires a complex reward design for each task. In contrast, language can specify tasks in a more natural way. Current foundation vision-language models (VLMs) generally require fine-tuning or other adaptations to be functional, due to the significant domain gap. However, the lack of multimodal data in such domains represents an obstacle toward developing foundation models for embodied applications. In this work, we overcome these problems by presenting multimodal foundation world models, able to connect and align the representation of foundation VLMs with the latent space of generative world models for RL, without any language annotations. The resulting agent learning framework, GenRL, allows one to specify tasks through vision and/or language prompts, ground them in the embodied domain's dynamics, and learns the corresponding behaviors in imagination. As assessed through large-scale multi-task benchmarking, GenRL exhibits strong multi-task generalization performance in several locomotion and manipulation domains. Furthermore, by introducing a data-free RL strategy, it lays the groundwork for foundation model-based RL for generalist embodied agents.
☆ MT2ST: Adaptive Multi-Task to Single-Task Learning
The conventional training approaches often face challenges in balancing the breadth of multi-task learning (MTL) with the depth of single-task learning (STL). To address this issue, we introduce the Multi-Task to Single-Task (MT2ST) framework, a groundbreaking approach that can combine the generalizability of MTL with the precision of STL. Our work include two strategies: 'Diminish' and 'Switch'. 'Diminish' Strategy will gradually reduce the influence of auxiliary tasks, while the 'Switch' strategy involves a shift from multi-tasking to single-tasking at a specific timepoint at the training process. In this paper, we propose the Multi-Task to Single-Task (MT2ST) framework, a novel approach that significantly enhances the efficiency and accuracy of word embedding training while concurrently addressing prevalent issues such as overfitting. Our empirical studies demonstrate that MT2ST can reduce training time by 67\% when contrasted with single-task learning approaches, and by 13\% compared to traditional multi-task learning methods. These findings underscore MT2ST's potential to be a powerful tools for word embedding training acceleration.
☆ Local Linear Recovery Guarantee of Deep Neural Networks at Overparameterization
Determining whether deep neural network (DNN) models can reliably recover target functions at overparameterization is a critical yet complex issue in the theory of deep learning. To advance understanding in this area, we introduce a concept we term "local linear recovery" (LLR), a weaker form of target function recovery that renders the problem more amenable to theoretical analysis. In the sense of LLR, we prove that functions expressible by narrower DNNs are guaranteed to be recoverable from fewer samples than model parameters. Specifically, we establish upper limits on the optimistic sample sizes, defined as the smallest sample size necessary to guarantee LLR, for functions in the space of a given DNN. Furthermore, we prove that these upper bounds are achieved in the case of two-layer tanh neural networks. Our research lays a solid groundwork for future investigations into the recovery capabilities of DNNs in overparameterized scenarios.
comment: arXiv admin note: text overlap with arXiv:2211.11623
☆ Boosting Soft Q-Learning by Bounding
An agent's ability to leverage past experience is critical for efficiently solving new tasks. Prior work has focused on using value function estimates to obtain zero-shot approximations for solutions to a new task. In soft Q-learning, we show how any value function estimate can also be used to derive double-sided bounds on the optimal value function. The derived bounds lead to new approaches for boosting training performance which we validate experimentally. Notably, we find that the proposed framework suggests an alternative method for updating the Q-function, leading to boosted performance.
comment: To appear in the 1st Reinforcement Learning Conference
☆ AutoOPE: Automated Off-Policy Estimator Selection
The Off-Policy Evaluation (OPE) problem consists of evaluating the performance of counterfactual policies with data collected by another one. This problem is of utmost importance for various application domains, e.g., recommendation systems, medical treatments, and many others. To solve the OPE problem, we resort to estimators, which aim to estimate in the most accurate way possible the performance that the counterfactual policies would have had if they were deployed in place of the logging policy. In the literature, several estimators have been developed, all with different characteristics and theoretical guarantees. Therefore, there is no dominant estimator, and each estimator may be the best one for different OPE problems, depending on the characteristics of the dataset at hand. While the selection of the estimator is a crucial choice for an accurate OPE, this problem has been widely overlooked in the literature. We propose an automated data-driven OPE estimator selection method based on machine learning. In particular, the core idea we propose in this paper is to create several synthetic OPE tasks and use a machine learning model trained to predict the best estimator for those synthetic tasks. We empirically show how our method is able to generalize to unseen tasks and make a better estimator selection compared to a baseline method on several real-world datasets, with a computational cost significantly lower than the one of the baseline.
☆ SC-MoE: Switch Conformer Mixture of Experts for Unified Streaming and Non-streaming Code-Switching ASR
In this work, we propose a Switch-Conformer-based MoE system named SC-MoE for unified streaming and non-streaming code-switching (CS) automatic speech recognition (ASR), where we design a streaming MoE layer consisting of three language experts, which correspond to Mandarin, English, and blank, respectively, and equipped with a language identification (LID) network with a Connectionist Temporal Classification (CTC) loss as a router in the encoder of SC-MoE to achieve a real-time streaming CS ASR system. To further utilize the language information embedded in text, we also incorporate MoE layers into the decoder of SC-MoE. In addition, we introduce routers into every MoE layer of the encoder and the decoder and achieve better recognition performance. Experimental results show that the SC-MoE significantly improves CS ASR performances over baseline with comparable computational efficiency.
comment: Accepted by InterSpeech 2024; 5 pages, 2 figures
☆ MolFusion: Multimodal Fusion Learning for Molecular Representations via Multi-granularity Views
Artificial Intelligence predicts drug properties by encoding drug molecules, aiding in the rapid screening of candidates. Different molecular representations, such as SMILES and molecule graphs, contain complementary information for molecular encoding. Thus exploiting complementary information from different molecular representations is one of the research priorities in molecular encoding. Most existing methods for combining molecular multi-modalities only use molecular-level information, making it hard to encode intra-molecular alignment information between different modalities. To address this issue, we propose a multi-granularity fusion method that is MolFusion. The proposed MolFusion consists of two key components: (1) MolSim, a molecular-level encoding component that achieves molecular-level alignment between different molecular representations. and (2) AtomAlign, an atomic-level encoding component that achieves atomic-level alignment between different molecular representations. Experimental results show that MolFusion effectively utilizes complementary multimodal information, leading to significant improvements in performance across various classification and regression tasks.
comment: 8 pages, 5 figures
☆ Explicit Diversity Conditions for Effective Question Answer Generation with Large Language Models COLING 2024
Question Answer Generation (QAG) is an effective data augmentation technique to improve the accuracy of question answering systems, especially in low-resource domains. While recent pretrained and large language model-based QAG methods have made substantial progress, they face the critical issue of redundant QA pair generation, affecting downstream QA systems. Implicit diversity techniques such as sampling and diverse beam search are proven effective solutions but often yield smaller diversity. We present explicit diversity conditions for QAG, focusing on spatial aspects, question types, and entities, substantially increasing diversity in QA generation. Our work emphasizes the need of explicit diversity conditions for generating diverse question-answer synthetic data by showing significant improvements in downstream QA task over existing widely adopted implicit diversity techniques. In particular, generated QA pairs from explicit diversity conditions when used to train the downstream QA model results in an average 4.1% exact match and 4.5% F1 improvement over QAG from implicit sampling techniques on SQuADDU. Our work emphasizes the need for explicit diversity conditions even more in low-resource datasets (SubjQA), where average downstream QA performance improvements are around 12% EM.
comment: Published at COLING 2024
☆ Learning Neural Networks with Sparse Activations COLT 2024
A core component present in many successful neural network architectures, is an MLP block of two fully connected layers with a non-linear activation in between. An intriguing phenomenon observed empirically, including in transformer architectures, is that, after training, the activations in the hidden layer of this MLP block tend to be extremely sparse on any given input. Unlike traditional forms of sparsity, where there are neurons/weights which can be deleted from the network, this form of {\em dynamic} activation sparsity appears to be harder to exploit to get more efficient networks. Motivated by this we initiate a formal study of PAC learnability of MLP layers that exhibit activation sparsity. We present a variety of results showing that such classes of functions do lead to provable computational and statistical advantages over their non-sparse counterparts. Our hope is that a better theoretical understanding of {\em sparsely activated} networks would lead to methods that can exploit activation sparsity in practice.
comment: Proceedings of the 37th Conference on Learning Theory (COLT 2024), 20 pages
☆ Operator Learning of Lipschitz Operators: An Information-Theoretic Perspective
Operator learning based on neural operators has emerged as a promising paradigm for the data-driven approximation of operators, mapping between infinite-dimensional Banach spaces. Despite significant empirical progress, our theoretical understanding regarding the efficiency of these approximations remains incomplete. This work addresses the parametric complexity of neural operator approximations for the general class of Lipschitz continuous operators. Motivated by recent findings on the limitations of specific architectures, termed curse of parametric complexity, we here adopt an information-theoretic perspective. Our main contribution establishes lower bounds on the metric entropy of Lipschitz operators in two approximation settings; uniform approximation over a compact set of input functions, and approximation in expectation, with input functions drawn from a probability measure. It is shown that these entropy bounds imply that, regardless of the activation function used, neural operator architectures attaining an approximation accuracy $\epsilon$ must have a size that is exponentially large in $\epsilon^{-1}$. The size of architectures is here measured by counting the number of encoded bits necessary to store the given model in computational memory. The results of this work elucidate fundamental trade-offs and limitations in
☆ Unified Uncertainties: Combining Input, Data and Model Uncertainty into a Single Formulation ICML 2024
Modelling uncertainty in Machine Learning models is essential for achieving safe and reliable predictions. Most research on uncertainty focuses on output uncertainty (predictions), but minimal attention is paid to uncertainty at inputs. We propose a method for propagating uncertainty in the inputs through a Neural Network that is simultaneously able to estimate input, data, and model uncertainty. Our results show that this propagation of input uncertainty results in a more stable decision boundary even under large amounts of input noise than comparatively simple Monte Carlo sampling. Additionally, we discuss and demonstrate that input uncertainty, when propagated through the model, results in model uncertainty at the outputs. The explicit incorporation of input uncertainty may be beneficial in situations where the amount of input uncertainty is known, though good datasets for this are still needed.
comment: 4 pages, 3 figures, with appendix. LatinX in AI Research Workshop @ ICML 2024 Camera Ready
☆ Psychological Profiling in Cybersecurity: A Look at LLMs and Psycholinguistic Features
The increasing sophistication of cyber threats necessitates innovative approaches to cybersecurity. In this paper, we explore the potential of psychological profiling techniques, particularly focusing on the utilization of Large Language Models (LLMs) and psycholinguistic features. We investigate the intersection of psychology and cybersecurity, discussing how LLMs can be employed to analyze textual data for identifying psychological traits of threat actors. We explore the incorporation of psycholinguistic features, such as linguistic patterns and emotional cues, into cybersecurity frameworks. \iffalse Through case studies and experiments, we discuss the effectiveness of these methods in enhancing threat detection and mitigation strategies.\fi Our research underscores the importance of integrating psychological perspectives into cybersecurity practices to bolster defense mechanisms against evolving threats.
☆ Learning to Remove Cuts in Integer Linear Programming
Cutting plane methods are a fundamental approach for solving integer linear programs (ILPs). In each iteration of such methods, additional linear constraints (cuts) are introduced to the constraint set with the aim of excluding the previous fractional optimal solution while not affecting the optimal integer solution. In this work, we explore a novel approach within cutting plane methods: instead of only adding new cuts, we also consider the removal of previous cuts introduced at any of the preceding iterations of the method under a learnable parametric criteria. We demonstrate that in fundamental combinatorial optimization settings such cut removal policies can lead to significant improvements over both human-based and machine learning-guided cut addition policies even when implemented with simple models.
comment: International Conference on Machine Learning
☆ Aligning Model Properties via Conformal Risk Control
AI model alignment is crucial due to inadvertent biases in training data and the underspecified pipeline in modern machine learning, where numerous models with excellent test set metrics can be produced, yet they may not meet end-user requirements. Recent advances demonstrate that post-training model alignment via human feedback can address some of these challenges. However, these methods are often confined to settings (such as generative AI) where humans can interpret model outputs and provide feedback. In traditional non-generative settings, where model outputs are numerical values or classes, detecting misalignment through single-sample outputs is highly challenging. In this paper we consider an alternative strategy. We propose interpreting model alignment through property testing, defining an aligned model $f$ as one belonging to a subset $\mathcal{P}$ of functions that exhibit specific desired behaviors. We focus on post-processing a pre-trained model $f$ to better align with $\mathcal{P}$ using conformal risk control. Specifically, we develop a general procedure for converting queries for a given property $\mathcal{P}$ to a collection of loss functions suitable for use in a conformal risk control algorithm. We prove a probabilistic guarantee that the resulting conformal interval around $f$ contains a function approximately satisfying $\mathcal{P}$. Given the capabilities of modern AI models with extensive parameters and training data, one might assume alignment issues will resolve naturally. However, increasing training data or parameters in a random feature model doesn't eliminate the need for alignment techniques when pre-training data is biased. We demonstrate our alignment methodology on supervised learning datasets for properties like monotonicity and concavity. Our flexible procedure can be applied to various desired properties.
☆ ADO-LLM: Analog Design Bayesian Optimization with In-Context Learning of Large Language Models
Analog circuit design requires substantial human expertise and involvement, which is a significant roadblock to design productivity. Bayesian Optimization (BO), a popular machine learning based optimization strategy, has been leveraged to automate analog design given its applicability across various circuit topologies and technologies. Traditional BO methods employ black box Gaussian Process surrogate models and optimized labeled data queries to find optimization solutions by trading off between exploration and exploitation. However, the search for the optimal design solution in BO can be expensive from both a computational and data usage point of view, particularly for high dimensional optimization problems. This paper presents ADO-LLM, the first work integrating large language models (LLMs) with Bayesian Optimization for analog design optimization. ADO-LLM leverages the LLM's ability to infuse domain knowledge to rapidly generate viable design points to remedy BO's inefficiency in finding high value design areas specifically under the limited design space coverage of the BO's probabilistic surrogate model. In the meantime, sampling of design points evaluated in the iterative BO process provides quality demonstrations for the LLM to generate high quality design points while leveraging infused broad design knowledge. Furthermore, the diversity brought by BO's exploration enriches the contextual understanding of the LLM and allows it to more broadly search in the design space and prevent repetitive and redundant suggestions. We evaluate the proposed framework on two different types of analog circuits and demonstrate notable improvements in design efficiency and effectiveness.
comment: 8 pages, 3 figures
♻ ☆ Situational Awareness Matters in 3D Vision Language Reasoning CVPR 2024
Being able to carry out complicated vision language reasoning tasks in 3D space represents a significant milestone in developing household robots and human-centered embodied AI. In this work, we demonstrate that a critical and distinct challenge in 3D vision language reasoning is situational awareness, which incorporates two key components: (1) The autonomous agent grounds its self-location based on a language prompt. (2) The agent answers open-ended questions from the perspective of its calculated position. To address this challenge, we introduce SIG3D, an end-to-end Situation-Grounded model for 3D vision language reasoning. We tokenize the 3D scene into sparse voxel representation and propose a language-grounded situation estimator, followed by a situated question answering module. Experiments on the SQA3D and ScanQA datasets show that SIG3D outperforms state-of-the-art models in situation estimation and question answering by a large margin (e.g., an enhancement of over 30% on situation estimation accuracy). Subsequent analysis corroborates our architectural design choices, explores the distinct functions of visual and textual tokens, and highlights the importance of situational awareness in the domain of 3D question answering.
comment: CVPR 2024. Project Page: https://yunzeman.github.io/situation3d
♻ ☆ On Convex Data-Driven Inverse Optimal Control for Nonlinear, Non-stationary and Stochastic Systems
This paper is concerned with a finite-horizon inverse control problem, which has the goal of reconstructing, from observations, the possibly non-convex and non-stationary cost driving the actions of an agent. In this context, we present a result enabling cost reconstruction by solving an optimization problem that is convex even when the agent cost is not and when the underlying dynamics is nonlinear, non-stationary and stochastic. To obtain this result, we also study a finite-horizon forward control problem that has randomized policies as decision variables. We turn our findings into algorithmic procedures and show the effectiveness of our approach via in-silico and hardware validations. All experiments confirm the effectiveness of our approach.
comment: 17 pages, 5 figures. An early version of this paper with only a sketch of the proof for one of the results and without the hardware validation was presentation at the 62nd IEEE Conference on Decision and Control. arXiv admin note: text overlap with arXiv:2303.17957
♻ ☆ Large Language Model Enhanced Clustering for News Event Detection
The news landscape is continuously evolving, with an ever-increasing volume of information from around the world. Automated event detection within this vast data repository is essential for monitoring, identifying, and categorizing significant news occurrences across diverse platforms. This paper presents an event detection framework that leverages Large Language Models (LLMs) combined with clustering analysis to detect news events from the Global Database of Events, Language, and Tone (GDELT). The framework enhances event clustering through both pre-event detection tasks (keyword extraction and text embedding) and post-event detection tasks (event summarization and topic labeling). We also evaluate the impact of various textual embeddings on the quality of clustering outcomes, ensuring robust news categorization. Additionally, we introduce a novel Cluster Stability Assessment Index (CSAI) to assess the validity and robustness of clustering results. CSAI utilizes latent feature vectors to provide a new way of measuring clustering quality. Our experiments indicate that combining LLM embeddings with clustering algorithms yields the best results, demonstrating greater robustness in terms of CSAI scores. Moreover, post-event detection tasks generate meaningful insights, facilitating effective interpretation of event clustering results. Overall, our experimental results indicate that the proposed framework offers valuable insights and could enhance the accuracy and depth of news reporting.
♻ ☆ Multimodal and Force-Matched Imitation Learning with a See-Through Visuotactile Sensor
Contact-rich tasks continue to present a variety of challenges for robotic manipulation. In this work, we leverage a multimodal visuotactile sensor within the framework of imitation learning (IL) to perform contact rich tasks that involve relative motion (slipping/sliding) between the end-effector and object. We introduce two algorithmic contributions, tactile force matching and learned mode switching, as complimentary methods for improving IL. Tactile force matching enhances kinesthetic teaching by reading approximate forces during the demonstration and generating an adapted robot trajectory that recreates the recorded forces. Learned mode switching uses IL to couple visual and tactile sensor modes with the learned motion policy, simplifying the transition from reaching to contacting. We perform robotic manipulation experiments on four door opening tasks with a variety of observation and method configurations to study the utility of our proposed improvements and multimodal visuotactile sensing. Our results show that the inclusion of force matching raises average policy success rates by 62.5%, visuotactile mode switching by 30.3%, and visuotactile data as a policy input by 42.5%, emphasizing the value of see-through tactile sensing for IL, both for data collection to allow force matching, and for policy execution to allow accurate task feedback.
comment: Submitted to IEEE Transactions on Robotics (T-RO): Special Section on Tactile Robotics
♻ ☆ BASS: Batched Attention-optimized Speculative Sampling
Speculative decoding has emerged as a powerful method to improve latency and throughput in hosting large language models. However, most existing implementations focus on generating a single sequence. Real-world generative AI applications often require multiple responses and how to perform speculative decoding in a batched setting while preserving its latency benefits poses non-trivial challenges. This paper describes a system of batched speculative decoding that sets a new state of the art in multi-sequence generation latency and that demonstrates superior GPU utilization as well as quality of generations within a time budget. For example, for a 7.8B-size model on a single A100 GPU and with a batch size of 8, each sequence is generated at an average speed of 5.8ms per token, the overall throughput being 1.1K tokens per second. These results represent state-of-the-art latency and a 2.15X speed-up over optimized regular decoding. Within a time budget that regular decoding does not finish, our system is able to generate sequences with HumanEval Pass@First of 43% and Pass@All of 61%, far exceeding what's feasible with single-sequence speculative decoding. Our peak GPU utilization during decoding reaches as high as 15.8%, more than 3X the highest of that of regular decoding and around 10X of single-sequence speculative decoding.
♻ ☆ Learning Generalizable Program and Architecture Representations for Performance Modeling SC 2024
Performance modeling is an essential tool in many areas, including performance characterization/optimization, design space exploration, and resource allocation problems, to name a few. However, existing performance modeling approaches have limitations, such as high computational cost for discrete-event simulators, narrow flexibility of hardware emulators, or restricted accuracy/generality of analytical/data-driven models. To address these limitations, this paper proposes PerfVec, a novel deep learning-based performance modeling framework that learns high-dimensional and independent/orthogonal program and microarchitecture representations. Once learned, a program representation can be used to predict its performance on any microarchitecture, and likewise, a microarchitecture representation can be applied in the performance prediction of any program. Additionally, PerfVec yields a foundation model that captures the performance essence of instructions, which can be directly used by developers in numerous performance modeling related tasks without incurring its training cost. The evaluation demonstrates that PerfVec is more general and efficient than previous approaches.
comment: To be published at SC 2024
♻ ☆ Scaling and renormalization in high-dimensional regression
This paper presents a succinct derivation of the training and generalization performance of a variety of high-dimensional ridge regression models using the basic tools of random matrix theory and free probability. We provide an introduction and review of recent results on these topics, aimed at readers with backgrounds in physics and deep learning. Analytic formulas for the training and generalization errors are obtained in a few lines of algebra directly from the properties of the $S$-transform of free probability. This allows for a straightforward identification of the sources of power-law scaling in model performance. We compute the generalization error of a broad class of random feature models. We find that in all models, the $S$-transform corresponds to the train-test generalization gap, and yields an analogue of the generalized-cross-validation estimator. Using these techniques, we derive fine-grained bias-variance decompositions for a very general class of random feature models with structured covariates. These novel results allow us to discover a scaling regime for random feature models where the variance due to the features limits performance in the overparameterized setting. We also demonstrate how anisotropic weight structure in random feature models can limit performance and lead to nontrivial exponents for finite-width corrections in the overparameterized setting. Our results extend and provide a unifying perspective on earlier models of neural scaling laws.
comment: 68 pages, 17 figures
♻ ☆ Mixtures of Experts Unlock Parameter Scaling for Deep RL
The recent rapid progress in (self) supervised learning models is in large part predicted by empirical scaling laws: a model's performance scales proportionally to its size. Analogous scaling laws remain elusive for reinforcement learning domains, however, where increasing the parameter count of a model often hurts its final performance. In this paper, we demonstrate that incorporating Mixture-of-Expert (MoE) modules, and in particular Soft MoEs (Puigcerver et al., 2023), into value-based networks results in more parameter-scalable models, evidenced by substantial performance increases across a variety of training regimes and model sizes. This work thus provides strong empirical evidence towards developing scaling laws for reinforcement learning.
♻ ☆ VADA: a Data-Driven Simulator for Nanopore Sequencing
Nanopore sequencing offers the ability for real-time analysis of long DNA sequences at a low cost, enabling new applications such as early detection of cancer. Due to the complex nature of nanopore measurements and the high cost of obtaining ground truth datasets, there is a need for nanopore simulators. Existing simulators rely on handcrafted rules and parameters and do not learn an internal representation that would allow for analysing underlying biological factors of interest. Instead, we propose VADA, a purely data-driven method for simulating nanopores based on an autoregressive latent variable model. We embed subsequences of DNA and introduce a conditional prior to address the challenge of a collapsing conditioning. We introduce an auxiliary regressor on the latent variable to encourage our model to learn an informative latent representation. We empirically demonstrate that our model achieves competitive simulation performance on experimental nanopore data. Moreover, we show we have learned an informative latent representation that is predictive of the DNA labels. We hypothesize that other biological factors of interest, beyond the DNA labels, can potentially be extracted from such a learned latent representation.
♻ ☆ Robustness to Subpopulation Shift with Domain Label Noise via Regularized Annotation of Domains
Existing methods for last layer retraining that aim to optimize worst-group accuracy (WGA) rely heavily on well-annotated groups in the training data. We show, both in theory and practice, that annotation-based data augmentations using either downsampling or upweighting for WGA are susceptible to domain annotation noise, and in high-noise regimes approach the WGA of a model trained with vanilla empirical risk minimization. We introduce Regularized Annotation of Domains (RAD) in order to train robust last layer classifiers without the need for explicit domain annotations. Our results show that RAD is competitive with other recently proposed domain annotation-free techniques. Most importantly, RAD outperforms state-of-the-art annotation-reliant methods even with only 5% noise in the training data for several publicly available datasets.
comment: Generalized Gaussian assumption
♻ ☆ ReLU Neural Networks with Linear Layers are Biased Towards Single- and Multi-Index Models
Neural networks often operate in the overparameterized regime, in which there are far more parameters than training samples, allowing the training data to be fit perfectly. That is, training the network effectively learns an interpolating function, and properties of the interpolant affect predictions the network will make on new samples. This manuscript explores how properties of such functions learned by neural networks of depth greater than two layers. Our framework considers a family of networks of varying depths that all have the same capacity but different representation costs. The representation cost of a function induced by a neural network architecture is the minimum sum of squared weights needed for the network to represent the function; it reflects the function space bias associated with the architecture. Our results show that adding additional linear layers to the input side of a shallow ReLU network yields a representation cost favoring functions with low mixed variation - that is, it has limited variation in directions orthogonal to a low-dimensional subspace and can be well approximated by a single- or multi-index model. Such functions may be represented by the composition of a function with low two-layer representation cost and a low-rank linear operator. Our experiments confirm this behavior in standard network training regimes. They additionally show that linear layers can improve generalization and the learned network is well-aligned with the true latent low-dimensional linear subspace when data is generated using a multi-index model.
♻ ☆ ProFLingo: A Fingerprinting-based Intellectual Property Protection Scheme for Large Language Models
Large language models (LLMs) have attracted significant attention in recent years. Due to their "Large" nature, training LLMs from scratch consumes immense computational resources. Since several major players in the artificial intelligence (AI) field have open-sourced their original LLMs, an increasing number of individual researchers and smaller companies are able to build derivative LLMs based on these open-sourced models at much lower costs. However, this practice opens up possibilities for unauthorized use or reproduction that may not comply with licensing agreements, and fine-tuning can change the model's behavior, thus complicating the determination of model ownership. Current intellectual property (IP) protection schemes for LLMs are either designed for white-box settings or require additional modifications to the original model, which restricts their use in real-world settings. In this paper, we propose ProFLingo, a black-box fingerprinting-based IP protection scheme for LLMs. ProFLingo generates queries that elicit specific responses from an original model, thereby establishing unique fingerprints. Our scheme assesses the effectiveness of these queries on a suspect model to determine whether it has been derived from the original model. ProFLingo offers a non-invasive approach, which neither requires knowledge of the suspect model nor modifications to the base model or its training process. To the best of our knowledge, our method represents the first black-box fingerprinting technique for IP protection for LLMs. Our source code and generated queries are available at: https://github.com/hengvt/ProFLingo.
comment: This is the author's pre-print version of the work. It is posted here for your personal use. Not for redistribution
♻ ☆ Normalizing Flows for Conformal Regression UAI 2024
Conformal Prediction (CP) algorithms estimate the uncertainty of a prediction model by calibrating its outputs on labeled data. The same calibration scheme usually applies to any model and data without modifications. The obtained prediction intervals are valid by construction but could be inefficient, i.e. unnecessarily big, if the prediction errors are not uniformly distributed over the input space. We present a general scheme to localize the intervals by training the calibration process. The standard prediction error is replaced by an optimized distance metric that depends explicitly on the object attributes. Learning the optimal metric is equivalent to training a Normalizing Flow that acts on the joint distribution of the errors and the inputs. Unlike the Error Reweighting CP algorithm of Papadopoulos et al. (2008), the framework allows estimating the gap between nominal and empirical conditional validity. The approach is compatible with existing locally-adaptive CP strategies based on re-weighting the calibration samples and applies to any point-prediction model without retraining.
comment: To be presented at the 40th Conference on Uncertainty in Artificial Intelligence (UAI 2024). Changes from v1: improved Section 1.2, two figures replaced, minor typos fixed
♻ ☆ ICTSurF: Implicit Continuous-Time Survival Functions with Neural Networks
Survival analysis is a widely known method for predicting the likelihood of an event over time. The challenge of dealing with censored samples still remains. Traditional methods, such as the Cox Proportional Hazards (CPH) model, hinge on the limitations due to the strong assumptions of proportional hazards and the predetermined relationships between covariates. The rise of models based on deep neural networks (DNNs) has demonstrated enhanced effectiveness in survival analysis. This research introduces the Implicit Continuous-Time Survival Function (ICTSurF), built on a continuous-time survival model, and constructs survival distribution through implicit representation. As a result, our method is capable of accepting inputs in continuous-time space and producing survival probabilities in continuous-time space, independent of neural network architecture. Comparative assessments with existing methods underscore the high competitiveness of our proposed approach. Our implementation of ICTSurF is available at https://github.com/44REAM/ICTSurF.
♻ ☆ DiarizationLM: Speaker Diarization Post-Processing with Large Language Models
In this paper, we introduce DiarizationLM, a framework to leverage large language models (LLM) to post-process the outputs from a speaker diarization system. Various goals can be achieved with the proposed framework, such as improving the readability of the diarized transcript, or reducing the word diarization error rate (WDER). In this framework, the outputs of the automatic speech recognition (ASR) and speaker diarization systems are represented as a compact textual format, which is included in the prompt to an optionally finetuned LLM. The outputs of the LLM can be used as the refined diarization results with the desired enhancement. As a post-processing step, this framework can be easily applied to any off-the-shelf ASR and speaker diarization systems without retraining existing components. Our experiments show that a finetuned PaLM 2-S model can reduce the WDER by rel. 55.5% on the Fisher telephone conversation dataset, and rel. 44.9% on the Callhome English dataset.
♻ ☆ RL in Latent MDPs is Tractable: Online Guarantees via Off-Policy Evaluation
In many real-world decision problems there is partially observed, hidden or latent information that remains fixed throughout an interaction. Such decision problems can be modeled as Latent Markov Decision Processes (LMDPs), where a latent variable is selected at the beginning of an interaction and is not disclosed to the agent. In the last decade, there has been significant progress in solving LMDPs under different structural assumptions. However, for general LMDPs, there is no known learning algorithm that provably matches the existing lower bound (Kwon et al., 2021). We introduce the first sample-efficient algorithm for LMDPs without any additional structural assumptions. Our result builds off a new perspective on the role of off-policy evaluation guarantees and coverage coefficients in LMDPs, a perspective, that has been overlooked in the context of exploration in partially observed environments. Specifically, we establish a novel off-policy evaluation lemma and introduce a new coverage coefficient for LMDPs. Then, we show how these can be used to derive near-optimal guarantees of an optimistic exploration algorithm. These results, we believe, can be valuable for a wide range of interactive learning problems beyond LMDPs, and especially, for partially observed environments.
comment: Fixed typos + alpha
♻ ☆ Benchmarking mortality risk prediction from electrocardiograms
Several recent high-impact studies leverage large hospital-owned electrocardiographic (ECG) databases to model and predict patient mortality. MIMIC-IV, released September 2023, is the first comparable public dataset and includes 800,000 ECGs from a U.S. hospital system. Previously, the largest public ECG dataset was Code-15, containing 345,000 ECGs collected during routine care in Brazil. These datasets now provide an excellent resource for a broader audience to explore ECG survival modeling. Here, we benchmark survival model performance on Code-15 and MIMIC-IV with two neural network architectures, compare four deep survival modeling approaches to Cox regressions trained on classifier outputs, and evaluate performance at one to ten years. Our results yield AUROC and concordance scores comparable to past work (circa 0.8) and reasonable AUPRC scores (MIMIC-IV: 0.4-0.5, Code-15: 0.05-0.13) considering the fraction of ECG samples linked to a mortality (MIMIC-IV: 27\%, Code-15: 4\%). When evaluating models on the opposite dataset, AUROC and concordance values drop by 0.1-0.15, which may be due to cohort differences. All code and results are made public.
comment: 9 pages plus appendix, 2 figures
♻ ☆ Active Preference Inference using Language Models and Probabilistic Reasoning
Actively inferring user preferences, for example by asking good questions, is important for any human-facing decision-making system. Active inference allows such systems to adapt and personalize themselves to nuanced individual preferences. To enable this ability for instruction-tuned large language models (LLMs), one may prompt them to ask users questions to infer their preferences, transforming the language models into more robust, interactive systems. However, out of the box, these models are not efficient at extracting preferences: the questions they generate are not informative, requiring a high number of user interactions and impeding the usability of the downstream system. In this work, we introduce an inference-time algorithm that helps LLMs quickly infer preferences by using more informative questions. Our algorithm uses a probabilistic model whose conditional distributions are defined by prompting an LLM, and returns questions that optimize expected entropy and expected model change. Results in a simplified interactive web shopping setting with real product items show that an LLM equipped with our entropy reduction algorithm outperforms baselines with the same underlying LLM on task performance while using fewer user interactions.
♻ ☆ Efficient Low-rank Identification via Accelerated Iteratively Reweighted Nuclear Norm Minimization
This paper considers the problem of minimizing the sum of a smooth function and the Schatten-$p$ norm of the matrix. Our contribution involves proposing accelerated iteratively reweighted nuclear norm methods designed for solving the nonconvex low-rank minimization problem. Two major novelties characterize our approach. Firstly, the proposed method possesses a rank identification property, enabling the provable identification of the "correct" rank of the stationary point within a finite number of iterations. Secondly, we introduce an adaptive updating strategy for smoothing parameters. This strategy automatically fixes parameters associated with zero singular values as constants upon detecting the "correct" rank while quickly driving the rest of the parameters to zero. This adaptive behavior transforms the algorithm into one that effectively solves smooth problems after a few iterations, setting our work apart from existing iteratively reweighted methods for low-rank optimization. We prove the global convergence of the proposed algorithm, guaranteeing that every limit point of the iterates is a critical point. Furthermore, a local convergence rate analysis is provided under the Kurdyka-{\L}ojasiewicz property. We conduct numerical experiments using both synthetic and real data to showcase our algorithm's efficiency and superiority over existing methods.
comment: Copyright may be transferred without notice, after which this version may no longer be accessible
♻ ☆ An Information Theoretic Perspective on Conformal Prediction
Conformal Prediction (CP) is a distribution-free uncertainty estimation framework that constructs prediction sets guaranteed to contain the true answer with a user-specified probability. Intuitively, the size of the prediction set encodes a general notion of uncertainty, with larger sets associated with higher degrees of uncertainty. In this work, we leverage information theory to connect conformal prediction to other notions of uncertainty. More precisely, we prove three different ways to upper bound the intrinsic uncertainty, as described by the conditional entropy of the target variable given the inputs, by combining CP with information theoretical inequalities. Moreover, we demonstrate two direct and useful applications of such connection between conformal prediction and information theory: (i) more principled and effective conformal training objectives that generalize previous approaches and enable end-to-end training of machine learning models from scratch, and (ii) a natural mechanism to incorporate side information into conformal prediction. We empirically validate both applications in centralized and federated learning settings, showing our theoretical results translate to lower inefficiency (average prediction set size) for popular CP methods.
♻ ☆ WhaleNet: a Novel Deep Learning Architecture for Marine Mammals Vocalizations on Watkins Marine Mammal Sound Database
Marine mammal communication is a complex field, hindered by the diversity of vocalizations and environmental factors. The Watkins Marine Mammal Sound Database (WMMD) constitutes a comprehensive labeled dataset employed in machine learning applications. Nevertheless, the methodologies for data preparation, preprocessing, and classification documented in the literature exhibit considerable variability and are typically not applied to the dataset in its entirety. This study initially undertakes a concise review of the state-of-the-art benchmarks pertaining to the dataset, with a particular focus on clarifying data preparation and preprocessing techniques. Subsequently, we explore the utilization of the Wavelet Scattering Transform (WST) and Mel spectrogram as preprocessing mechanisms for feature extraction. In this paper, we introduce \textbf{WhaleNet} (Wavelet Highly Adaptive Learning Ensemble Network), a sophisticated deep ensemble architecture for the classification of marine mammal vocalizations, leveraging both WST and Mel spectrogram for enhanced feature discrimination. By integrating the insights derived from WST and Mel representations, we achieved an improvement in classification accuracy by $8-10\%$ over existing architectures, corresponding to a classification accuracy of $97.61\%$.
♻ ☆ Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task ICLR 2023
Language models show a surprising range of capabilities, but the source of their apparent competence is unclear. Do these networks just memorize a collection of surface statistics, or do they rely on internal representations of the process that generates the sequences they see? We investigate this question by applying a variant of the GPT model to the task of predicting legal moves in a simple board game, Othello. Although the network has no a priori knowledge of the game or its rules, we uncover evidence of an emergent nonlinear internal representation of the board state. Interventional experiments indicate this representation can be used to control the output of the network and create "latent saliency maps" that can help explain predictions in human terms.
comment: ICLR 2023 oral (notable-top-5%): https://openreview.net/forum?id=DeG07_TcZvT ; code: https://github.com/likenneth/othello_world
♻ ☆ The Fundamental Limits of Least-Privilege Learning
The promise of least-privilege learning -- to find feature representations that are useful for a learning task but prevent inference of any sensitive information unrelated to this task -- is highly appealing. However, so far this concept has only been stated informally. It thus remains an open question whether and how we can achieve this goal. In this work, we provide the first formalisation of the least-privilege principle for machine learning and characterise its feasibility. We prove that there is a fundamental trade-off between a representation's utility for a given task and its leakage beyond the intended task: it is not possible to learn representations that have high utility for the intended task but, at the same time prevent inference of any attribute other than the task label itself. This trade-off holds under realistic assumptions on the data distribution and regardless of the technique used to learn the feature mappings that produce these representations. We empirically validate this result for a wide range of learning techniques, model architectures, and datasets.
♻ ☆ Inference-Time Intervention: Eliciting Truthful Answers from a Language Model NeurIPS 2023
We introduce Inference-Time Intervention (ITI), a technique designed to enhance the "truthfulness" of large language models (LLMs). ITI operates by shifting model activations during inference, following a set of directions across a limited number of attention heads. This intervention significantly improves the performance of LLaMA models on the TruthfulQA benchmark. On an instruction-finetuned LLaMA called Alpaca, ITI improves its truthfulness from 32.5% to 65.1%. We identify a tradeoff between truthfulness and helpfulness and demonstrate how to balance it by tuning the intervention strength. ITI is minimally invasive and computationally inexpensive. Moreover, the technique is data efficient: while approaches like RLHF require extensive annotations, ITI locates truthful directions using only few hundred examples. Our findings suggest that LLMs may have an internal representation of the likelihood of something being true, even as they produce falsehoods on the surface.
comment: NeurIPS 2023 spotlight; code: https://github.com/likenneth/honest_llama
♻ ☆ MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time
Although Large Language Models (LLMs) achieve remarkable performance across various tasks, they often struggle with complex reasoning tasks, such as answering mathematical questions. Recent efforts to address this issue have primarily focused on leveraging mathematical datasets through supervised fine-tuning or self-improvement techniques. However, these methods often depend on high-quality datasets that are difficult to prepare, or they require substantial computational resources for fine-tuning. Inspired by findings that LLMs know how to produce the right answer but struggle to select the correct reasoning path, we propose a purely inference-based searching method -- MindStar (M*). This method formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths. We evaluate the M* framework on both the GSM8K and MATH datasets, comparing its performance with existing open and closed-source LLMs. Our results demonstrate that M* significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1, but with substantially reduced model size and computational costs.
♻ ☆ Topological data quality via 0-dimensional persistence matching
Data quality is crucial for the successful training, generalization and performance of artificial intelligence models. We propose to measure data quality for supervised learning using topological data analysis techniques. Specifically, we provide a novel topological invariant based on persistence matchings induced by inclusions and using $0$-dimensional persistent homology. We show that such an invariant is stable. We provide an algorithm and relate it to images, kernels, and cokernels of the induced morphisms. Also, we show that the invariant allows us to understand whether the subset "represents well" the clusters from the larger dataset or not, and we also use it to estimate bounds for the Hausdorff distance between the subset and the complete dataset. This approach enables us to explain why the chosen dataset will lead to poor performance.
♻ ☆ Introducing 3DCNN ResNets for ASD full-body kinematic assessment: a comparison with hand-crafted features
Autism Spectrum Disorder (ASD) is characterized by challenges in social communication and restricted patterns, with motor abnormalities gaining traction for early detection. However, kinematic analysis in ASD is limited, often lacking robust validation and relying on hand-crafted features for single tasks, leading to inconsistencies across studies. End-to-end models have emerged as promising methods to overcome the need for feature engineering. Our aim is to propose a newly adapted 3DCNN ResNet from and compare it to widely used hand-crafted features for motor ASD assessment. Specifically, we developed a virtual reality environment with multiple motor tasks and trained models using both approaches. We prioritized a reliable validation framework with repeated cross-validation. Results show the proposed model achieves a maximum accuracy of 85$\pm$3%, outperforming state-of-the-art end-to-end models with short 1-to-3 minute samples. Our comparative analysis with hand-crafted features shows feature-engineered models outperformed our end-to-end model in certain tasks. However, our end-to-end model achieved a higher mean AUC of 0.80$\pm$0.03. Additionally, statistical differences were found in model variance, with our end-to-end model providing more consistent results with less variability across all VR tasks, demonstrating domain generalization and reliability. These findings show that end-to-end models enable less variable and context-independent ASD classification without requiring domain knowledge or task specificity. However, they also recognize the effectiveness of hand-crafted features in specific task scenarios.
comment: This work has been submitted to Expert Systems with Applications for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
♻ ☆ Combining Reconstruction and Contrastive Methods for Multimodal Representations in RL
Learning self-supervised representations using reconstruction or contrastive losses improves performance and sample complexity of image-based and multimodal reinforcement learning (RL). Here, different self-supervised loss functions have distinct advantages and limitations depending on the information density of the underlying sensor modality. Reconstruction provides strong learning signals but is susceptible to distractions and spurious information. While contrastive approaches can ignore those, they may fail to capture all relevant details and can lead to representation collapse. For multimodal RL, this suggests that different modalities should be treated differently based on the amount of distractions in the signal. We propose Contrastive Reconstructive Aggregated representation Learning (CoRAL), a unified framework enabling us to choose the most appropriate self-supervised loss for each sensor modality and allowing the representation to better focus on relevant aspects. We evaluate CoRAL's benefits on a wide range of tasks with images containing distractions or occlusions, a new locomotion suite, and a challenging manipulation suite with visually realistic distractions. Our results show that learning a multimodal representation by combining contrastive and reconstruction-based losses can significantly improve performance and solve tasks that are out of reach for more naive representation learning approaches and other recent baselines.
comment: Published in "Reinforcement Learning Conference (RLC)", August 2024
♻ ☆ The Challenges of the Nonlinear Regime for Physics-Informed Neural Networks
The Neural Tangent Kernel (NTK) viewpoint is widely employed to analyze the training dynamics of overparameterized Physics-Informed Neural Networks (PINNs). However, unlike the case of linear Partial Differential Equations (PDEs), we show how the NTK perspective falls short in the nonlinear scenario. Specifically, we establish that the NTK yields a random matrix at initialization that is not constant during training, contrary to conventional belief. Another significant difference from the linear regime is that, even in the idealistic infinite-width limit, the Hessian does not vanish and hence it cannot be disregarded during training. This motivates the adoption of second-order optimization methods. We explore the convergence guarantees of such methods in both linear and nonlinear cases, addressing challenges such as spectral bias and slow convergence. Every theoretical result is supported by numerical examples with both linear and nonlinear PDEs, and we highlight the benefits of second-order methods in benchmark test cases.
comment: 10 pages, 4 figures, appendix of 12 additional pages
♻ ☆ Adam-mini: Use Fewer Learning Rates To Gain More
We propose Adam-mini, an optimizer that achieves on-par or better performance than AdamW with 45% to 50% less memory footprint. Adam-mini reduces memory by cutting down the learning rate resources in Adam (i.e., $1/\sqrt{v}$). We find that $\geq$ 90% of these learning rates in $v$ could be harmlessly removed if we (1) carefully partition the parameters into blocks following our proposed principle on Hessian structure; (2) assign a single but good learning rate to each parameter block. We further find that, for each of these parameter blocks, there exists a single high-quality learning rate that can outperform Adam, provided that sufficient resources are available to search it out. We then provide one cost-effective way to find good learning rates and propose Adam-mini. Empirically, we verify that Adam-mini performs on par or better than AdamW on various language models sized from 125M to 7B for pre-training, supervised fine-tuning, and RLHF. The reduced memory footprint of Adam-mini also alleviates communication overheads among GPUs and CPUs, thereby increasing throughput. For instance, Adam-mini achieves 49.6% higher throughput than AdamW when pre-training Llama2-7B on $2\times$ A800-80GB GPUs, which saves 33% wall-clock time for pre-training.
♻ ☆ Autoencoder-based Anomaly Detection System for Online Data Quality Monitoring of the CMS Electromagnetic Calorimeter
The CMS detector is a general-purpose apparatus that detects high-energy collisions produced at the LHC. Online Data Quality Monitoring of the CMS electromagnetic calorimeter is a vital operational tool that allows detector experts to quickly identify, localize, and diagnose a broad range of detector issues that could affect the quality of physics data. A real-time autoencoder-based anomaly detection system using semi-supervised machine learning is presented enabling the detection of anomalies in the CMS electromagnetic calorimeter data. A novel method is introduced which maximizes the anomaly detection performance by exploiting the time-dependent evolution of anomalies as well as spatial variations in the detector response. The autoencoder-based system is able to efficiently detect anomalies, while maintaining a very low false discovery rate. The performance of the system is validated with anomalies found in 2018 and 2022 LHC collision data. Additionally, the first results from deploying the autoencoder-based system in the CMS online Data Quality Monitoring workflow during the beginning of Run 3 of the LHC are presented, showing its ability to detect issues missed by the existing system.
comment: Replaced with the published version. Added the journal reference and the DOI
♻ ☆ SUB-PLAY: Adversarial Policies against Partially Observed Multi-Agent Reinforcement Learning Systems CCS'24
Recent advancements in multi-agent reinforcement learning (MARL) have opened up vast application prospects, such as swarm control of drones, collaborative manipulation by robotic arms, and multi-target encirclement. However, potential security threats during the MARL deployment need more attention and thorough investigation. Recent research reveals that attackers can rapidly exploit the victim's vulnerabilities, generating adversarial policies that result in the failure of specific tasks. For instance, reducing the winning rate of a superhuman-level Go AI to around 20%. Existing studies predominantly focus on two-player competitive environments, assuming attackers possess complete global state observation. In this study, we unveil, for the first time, the capability of attackers to generate adversarial policies even when restricted to partial observations of the victims in multi-agent competitive environments. Specifically, we propose a novel black-box attack (SUB-PLAY) that incorporates the concept of constructing multiple subgames to mitigate the impact of partial observability and suggests sharing transitions among subpolicies to improve attackers' exploitative ability. Extensive evaluations demonstrate the effectiveness of SUB-PLAY under three typical partial observability limitations. Visualization results indicate that adversarial policies induce significantly different activations of the victims' policy networks. Furthermore, we evaluate three potential defenses aimed at exploring ways to mitigate security threats posed by adversarial policies, providing constructive recommendations for deploying MARL in competitive environments.
comment: To appear in the ACM Conference on Computer and Communications Security (CCS'24), October 14-18, 2024, Salt Lake City, UT, USA
♻ ☆ General Distribution Learning: A theoretical framework for Deep Learning
There remain numerous unanswered research questions on deep learning (DL) within the classical learning theory framework. These include the remarkable generalization capabilities of overparametrized neural networks (NNs), the efficient optimization performance despite non-convexity of objectives, the mechanism of flat minima for generalization, and the exceptional performance of deep architectures in solving physical problems. This paper introduces General Distribution Learning (GD Learning), a novel theoretical learning framework designed to address a comprehensive range of machine learning and statistical tasks, including classification, regression and parameter estimation. Departing from traditional statistical machine learning, GD Learning focuses on the true underlying distribution. In GD Learning, learning error, corresponding to the expected error in classical statistical learning framework, is divided into fitting errors due to models and algorithms, as well as sampling errors introduced by limited sampling data. The framework significantly incorporates prior knowledge, especially in scenarios characterized by data scarcity, thereby enhancing performance. Within the GD Learning framework, we demonstrate that the global optimal solutions in non-convex optimization can be approached by minimizing the gradient norm and the non-uniformity of the eigenvalues of the model's Jacobian matrix. This insight leads to the development of the gradient structure control algorithm. GD Learning also offers fresh insights into the questions on deep learning, including overparameterization and non-convex optimization, bias-variance trade-off, and the mechanism of flat minima.
comment: arXiv admin note: text overlap with arXiv:2105.04026 by other authors. arXiv admin note: text overlap with arXiv:2105.04026 by other authors
♻ ☆ Single-Model Attribution of Generative Models Through Final-Layer Inversion ICML2024
Recent breakthroughs in generative modeling have sparked interest in practical single-model attribution. Such methods predict whether a sample was generated by a specific generator or not, for instance, to prove intellectual property theft. However, previous works are either limited to the closed-world setting or require undesirable changes to the generative model. We address these shortcomings by, first, viewing single-model attribution through the lens of anomaly detection. Arising from this change of perspective, we propose FLIPAD, a new approach for single-model attribution in the open-world setting based on final-layer inversion and anomaly detection. We show that the utilized final-layer inversion can be reduced to a convex lasso optimization problem, making our approach theoretically sound and computationally efficient. The theoretical findings are accompanied by an experimental study demonstrating the effectiveness of our approach and its flexibility to various domains.
comment: Accepted at the Forty-first International Conference on Machine Learning [ICML2024]
♻ ☆ Deep Fusion: Efficient Network Training via Pre-trained Initializations
In recent years, deep learning has made remarkable progress in a wide range of domains, with a particularly notable impact on natural language processing tasks. One of the challenges associated with training deep neural networks in the context of LLMs is the need for large amounts of computational resources and time. To mitigate this, network growing algorithms offer potential cost savings, but their underlying mechanisms are poorly understood. We present two notable contributions in this paper. First, we present Deep Fusion, an efficient approach to network training that leverages pre-trained initializations of smaller networks. Second, we propose a theoretical framework using backward error analysis to illustrate the dynamics of mid-training network growth. Our experiments show how Deep Fusion is a practical and effective approach that not only accelerates the training process but also reduces computational requirements, maintaining or surpassing traditional training methods' performance in various NLP tasks and T5 model sizes. Finally, we validate our theoretical framework, which guides the optimal use of Deep Fusion, showing that with carefully optimized training dynamics, it significantly reduces both training time and resource consumption.
♻ ☆ Weisfeiler Leman for Euclidean Equivariant Machine Learning
The $k$-Weisfeiler-Leman ($k$-WL) graph isomorphism test hierarchy is a common method for assessing the expressive power of graph neural networks (GNNs). Recently, GNNs whose expressive power is equivalent to the $2$-WL test were proven to be universal on weighted graphs which encode $3\mathrm{D}$ point cloud data, yet this result is limited to invariant continuous functions on point clouds. In this paper, we extend this result in three ways: Firstly, we show that PPGN can simulate $2$-WL uniformly on all point clouds with low complexity. Secondly, we show that $2$-WL tests can be extended to point clouds which include both positions and velocities, a scenario often encountered in applications. Finally, we provide a general framework for proving equivariant universality and leverage it to prove that a simple modification of this invariant PPGN architecture can be used to obtain a universal equivariant architecture that can approximate all continuous equivariant functions uniformly. Building on our results, we develop our WeLNet architecture, which sets new state-of-the-art results on the N-Body dynamics task and the GEOM-QM9 molecular conformation generation task.
♻ ☆ A Survey of Generative AI for de novo Drug Design: New Frontiers in Molecule and Protein Generation
Artificial intelligence (AI)-driven methods can vastly improve the historically costly drug design process, with various generative models already in widespread use. Generative models for de novo drug design, in particular, focus on the creation of novel biological compounds entirely from scratch, representing a promising future direction. Rapid development in the field, combined with the inherent complexity of the drug design process, creates a difficult landscape for new researchers to enter. In this survey, we organize de novo drug design into two overarching themes: small molecule and protein generation. Within each theme, we identify a variety of subtasks and applications, highlighting important datasets, benchmarks, and model architectures and comparing the performance of top models. We take a broad approach to AI-driven drug design, allowing for both micro-level comparisons of various methods within each subtask and macro-level observations across different fields. We discuss parallel challenges and approaches between the two applications and highlight future directions for AI-driven de novo drug design as a whole. An organized repository of all covered sources is available at https://github.com/gersteinlab/GenAI4Drug.
♻ ☆ Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP
Contrastive Language--Image Pre-training (CLIP) has manifested remarkable improvements in zero-shot classification and cross-modal vision-language tasks. Yet, from a geometrical point of view, the CLIP embedding space has been found to have a pronounced modality gap. This gap renders the embedding space overly sparse and disconnected, with different modalities being densely distributed in distinct subregions of the hypersphere. In this work, we aim at answering two main questions: 1. Does sharing the parameter space between the multi-modal encoders reduce the modality gap? 2. Can the gap be mitigated by pushing apart the uni-modal embeddings via intra-modality separation? We design AlignCLIP, in order to answer these questions and show that answers to both questions are positive. Through extensive experiments, we show that AlignCLIP achieves noticeable enhancements in the cross-modal alignment of the embeddings, and thereby, reduces the modality gap, while maintaining the performance across several downstream evaluations, such as zero-shot image classification, zero-shot multi-modal retrieval and zero-shot semantic text similarity.
♻ ☆ Adversarial Multi-dueling Bandits
We introduce the problem of regret minimization in adversarial multi-dueling bandits. While adversarial preferences have been studied in dueling bandits, they have not been explored in multi-dueling bandits. In this setting, the learner is required to select $m \geq 2$ arms at each round and observes as feedback the identity of the most preferred arm which is based on an arbitrary preference matrix chosen obliviously. We introduce a novel algorithm, MiDEX (Multi Dueling EXP3), to learn from such preference feedback that is assumed to be generated from a pairwise-subset choice model. We prove that the expected cumulative $T$-round regret of MiDEX compared to a Borda-winner from a set of $K$ arms is upper bounded by $O((K \log K)^{1/3} T^{2/3})$. Moreover, we prove a lower bound of $\Omega(K^{1/3} T^{2/3})$ for the expected regret in this setting which demonstrates that our proposed algorithm is near-optimal.
♻ ☆ Improving Local Training in Federated Learning via Temperature Scaling
Federated learning is inherently hampered by data heterogeneity: non-i.i.d. training data over local clients. We propose a novel model training approach for federated learning, FLex&Chill, which exploits the Logit Chilling method. Through extensive evaluations, we demonstrate that, in the presence of non-i.i.d. data characteristics inherent in federated learning systems, this approach can expedite model convergence and improve inference accuracy. Quantitatively, from our experiments, we observe up to 6X improvement in the global federated learning model convergence time, and up to 3.37% improvement in inference accuracy.
comment: 24 pages
♻ ☆ MedMNIST-C: Comprehensive benchmark and improved classifier robustness by simulating realistic image corruptions
The integration of neural-network-based systems into clinical practice is limited by challenges related to domain generalization and robustness. The computer vision community established benchmarks such as ImageNet-C as a fundamental prerequisite to measure progress towards those challenges. Similar datasets are largely absent in the medical imaging community which lacks a comprehensive benchmark that spans across imaging modalities and applications. To address this gap, we create and open-source MedMNIST-C, a benchmark dataset based on the MedMNIST+ collection covering 12 datasets and 9 imaging modalities. We simulate task and modality-specific image corruptions of varying severity to comprehensively evaluate the robustness of established algorithms against real-world artifacts and distribution shifts. We further provide quantitative evidence that our simple-to-use artificial corruptions allow for highly performant, lightweight data augmentation to enhance model robustness. Unlike traditional, generic augmentation strategies, our approach leverages domain knowledge, exhibiting significantly higher robustness when compared to widely adopted methods. By introducing MedMNIST-C and open-sourcing the corresponding library allowing for targeted data augmentations, we contribute to the development of increasingly robust methods tailored to the challenges of medical imaging. The code is available at https://github.com/francescodisalvo05/medmnistc-api}{github.com/francescodisalvo05/medmnistc-api .
♻ ☆ Learning Antenna Pointing Correction in Operations: Efficient Calibration of a Black Box
We propose an efficient offline pointing calibration method for operational antenna systems which does not require any downtime. Our approach minimizes the calibration effort and exploits technical signal information which is typically used for monitoring and control purposes in ground station operations. Using a standard antenna interface and data from an operational satellite contact, we come up with a robust strategy for training data set generation. On top of this, we learn the parameters of a suitable coordinate transform by means of linear regression. In our experiments, we show the usefulness of the method in a real-world setup.
comment: 5 pages, to be published in the conference proceedings of the European Signal Processing Conference (EUSIPCO) 2024, camera-ready fixing typos and extending motivation, test description, calibration strategy description, as well as result discussion
♻ ☆ ECGrecover: a Deep Learning Approach for Electrocardiogram Signal Completion
In this work, we address the challenge of reconstructing the complete 12-lead ECG signal from incomplete parts of it. We focus on two main scenarii: (i) reconstructing missing signal segments within an ECG lead and (ii) recovering missing leads from a single-lead. We propose a model with a U-Net architecture trained on a novel objective function to address the reconstruction problem. This function incorporates both spatial and temporal aspects of the ECG by combining the distance in amplitude between the reconstructed and real signals with the signal trend. Through comprehensive assessments using both a real-life dataset and a publicly accessible one, we demonstrate that the proposed approach consistently outperforms state-of-the-art methods based on generative adversarial networks and a CopyPaste strategy. Our proposed model demonstrates superior performance in standard distortion metrics and preserves critical ECG characteristics, particularly the P, Q, R, S, and T wave coordinates. Two emerging clinical applications emphasize the relevance of our work. The first is the increasing need to digitize paper-stored ECGs for utilization in AI-based applications (automatic annotation and risk-quantification), often limited to digital ECG complete 10s recordings. The second is the widespread use of wearable devices that record ECGs but typically capture only a small subset of the 12 standard leads. In both cases, a non-negligible amount of information is lost or not recorded, which our approach aims to recover to overcome these limitations.
♻ ☆ Boosting the Cross-Architecture Generalization of Dataset Distillation through an Empirical Study
The poor cross-architecture generalization of dataset distillation greatly weakens its practical significance. This paper attempts to mitigate this issue through an empirical study, which suggests that the synthetic datasets undergo an inductive bias towards the distillation model. Therefore, the evaluation model is strictly confined to having similar architectures of the distillation model. We propose a novel method of EvaLuation with distillation Feature (ELF), which utilizes features from intermediate layers of the distillation model for the cross-architecture evaluation. In this manner, the evaluation model learns from bias-free knowledge therefore its architecture becomes unfettered while retaining performance. By performing extensive experiments, we successfully prove that ELF can well enhance the cross-architecture generalization of current DD methods. Code of this project is at \url{https://github.com/Lirui-Zhao/ELF}.
♻ ☆ Fast Learnings of Coupled Nonnegative Tensor Decomposition Using Optimal Gradient and Low-rank Approximation
Tensor decomposition is a fundamental technique widely applied in signal processing, machine learning, and various other fields. However, traditional tensor decomposition methods encounter limitations when jointly analyzing multi-block tensors, as they often struggle to effectively explore shared information among tensors. In this study, we first introduce a novel coupled nonnegative CANDECOMP/PARAFAC decomposition algorithm optimized by the alternating proximal gradient method (CoNCPD-APG). This algorithm is specially designed to address the challenges of jointly decomposing different tensors that are partially or fully linked, while simultaneously extracting common components, individual components and, core tensors. Recognizing the computational challenges inherent in optimizing nonnegative constraints over high-dimensional tensor data, we further propose the lraCoNCPD-APG algorithm. By integrating low-rank approximation with the proposed CoNCPD-APG method, the proposed algorithm can significantly decrease the computational burden without compromising decomposition quality, particularly for multi-block large-scale tensors. Simulation experiments conducted on synthetic data, real-world face image data, and two kinds of electroencephalography (EEG) data demonstrate the practicality and superiority of the proposed algorithms for coupled nonnegative tensor decomposition problems. Our results underscore the efficacy of our methods in uncovering meaningful patterns and structures from complex multi-block tensor data, thereby offering valuable insights for future applications.
comment: 15 pages, 6 figures
♻ ☆ Transferable Reward Learning by Dynamics-Agnostic Discriminator Ensemble
Recovering reward function from expert demonstrations is a fundamental problem in reinforcement learning. The recovered reward function captures the motivation of the expert. Agents can imitate experts by following these reward functions in their environment, which is known as apprentice learning. However, the agents may face environments different from the demonstrations, and therefore, desire transferable reward functions. Classical reward learning methods such as inverse reinforcement learning (IRL) or, equivalently, adversarial imitation learning (AIL), recover reward functions coupled with training dynamics, which are hard to be transferable. Previous dynamics-agnostic reward learning methods rely on assumptions such as that the reward function has to be state-only, restricting their applicability. In this work, we present a dynamics-agnostic discriminator-ensemble reward learning method (DARL) within the AIL framework, capable of learning both state-action and state-only reward functions. DARL achieves this by decoupling the reward function from training dynamics, employing a dynamics-agnostic discriminator on a latent space derived from the original state-action space. This latent space is optimized to minimize information on the dynamics. We moreover discover the policy-dependency issue of the AIL framework that reduces the transferability. DARL represents the reward function as an ensemble of discriminators during training to eliminate policy dependencies. Empirical studies on MuJoCo tasks with changed dynamics show that DARL better recovers the reward function and results in better imitation performance in transferred environments, handling both state-only and state-action reward scenarios.
♻ ☆ Layer-Wise Quantization: A Pragmatic and Effective Method for Quantizing LLMs Beyond Integer Bit-Levels EMNLP
We present a simple variable quantization approach that quantizes different layers of a large language model (LLM) at different bit levels. Specifically, we quantize the most important layers to higher bit precision and less important layers to lower bits to achieve floating point quantization levels. We propose two effective strategies to measure the importance of layers within LLMs: the first measures the importance of a layer based on how different its output embeddings are from the input embeddings (the higher the better); the second estimates the importance of a layer using the number of layer weights that are much larger than average (the smaller the better). We show that quantizing different layers at varying bits according to our importance scores results in minimal performance drop with a far more compressed model size. Finally, we present several practical key takeaways from our variable layer-wise quantization experiments: (a) LLM performance under variable quantization remains close to the original model until 25-50% of layers are moved in lower quantization using our proposed ordering but only until 5-10% if moved using no specific ordering; (b) Quantizing LLMs to lower bits performs substantially better than pruning unless extreme quantization (2-bit) is used; and (c) Layer-wise quantization to lower bits works better in the case of larger LLMs with more layers compared to smaller LLMs with fewer layers. The code used to run the experiments is available at: https://github.com/RazvanDu/LayerwiseQuant.
comment: submitted to EMNLP, 15 pages, 10 figures, 4 tables
♻ ☆ Benchmarking General-Purpose In-Context Learning
In-context learning (ICL) empowers generative models to address new tasks effectively and efficiently on the fly, without relying on any artificially crafted optimization techniques. In this paper, we study extending ICL to address a broader range of tasks with an extended learning horizon and higher improvement potential, namely General-Purpose In-Context Learning (GPICL). To this end, we introduce two lightweight benchmarks specifically crafted to train and evaluate GPICL functionalities. Each benchmark encompasses a vast number of tasks characterized by significant task variance, facilitating meta-training that minimizes inductive bias. These tasks are also crafted to promote long-horizon in-context learning through continuous generation and interaction. These characteristics necessitate the models to leverage contexts and history interactions to enhance their capabilities, across domains such as language modeling, decision-making, and world modeling. Our experiments on the baseline models demonstrate that meta-training with minimal inductive bias and ICL from the ground up is feasible across all the domains we've discussed. Additionally, our findings indicate that the scale of parameters alone may not be crucial for ICL or GPICL, suggesting alternative approaches such as increasing the scale of contexts and memory states.
♻ ☆ SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models ICML 2024
We propose SPHINX-X, an extensive Multimodality Large Language Model (MLLM) series developed upon SPHINX. To improve the architecture and training efficiency, we modify the SPHINX framework by removing redundant visual encoders, bypassing fully-padded sub-images with skip tokens, and simplifying multi-stage training into a one-stage all-in-one paradigm. To fully unleash the potential of MLLMs, we assemble a comprehensive multi-domain and multimodal dataset covering publicly available resources in language, vision, and vision-language tasks. We further enrich this collection with our curated OCR intensive and Set-of-Mark datasets, extending the diversity and generality. By training over different base LLMs including TinyLlama1.1B, InternLM2-7B, LLaMA2-13B, and Mixtral8x7B, we obtain a spectrum of MLLMs that vary in parameter size and multilingual capabilities. Comprehensive benchmarking reveals a strong correlation between the multi-modal performance with the data and parameter scales. Code and models are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory
comment: Accepted by ICML 2024. Code and models are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory
♻ ☆ CDQuant: Accurate Post-training Weight Quantization of Large Pre-trained Models using Greedy Coordinate Descent
Large language models (LLMs) have recently demonstrated remarkable performance across diverse language tasks. But their deployment is often constrained by their substantial computational and storage requirements. Quantization has emerged as a key technique for addressing this challenge, enabling the compression of large models with minimal impact on performance. The recent GPTQ algorithm, a post-training quantization (PTQ) method, has proven highly effective for compressing LLMs, sparking a wave of research that leverages GPTQ as a core component. Recognizing the pivotal role of GPTQ in the PTQ landscape, we introduce CDQuant, a simple and scalable alternative to GPTQ with improved performance. CDQuant uses coordinate descent to minimize the layer-wise reconstruction loss to achieve high-quality quantized weights. Our algorithm is easy to implement and scales efficiently to models with hundreds of billions of parameters. Through extensive evaluation on the PaLM2 model family, we demonstrate that CDQuant consistently outperforms GPTQ across diverse model sizes and quantization levels. In particular, for INT2 quantization of PaLM2-Otter, CDQuant achieves a 10% reduction in perplexity compared to GPTQ.
♻ ☆ FaithLM: Towards Faithful Explanations for Large Language Models
Large Language Models (LLMs) have become proficient in addressing complex tasks by leveraging their extensive internal knowledge and reasoning capabilities. However, the black-box nature of these models complicates the task of explaining their decision-making processes. While recent advancements demonstrate the potential of leveraging LLMs to self-explain their predictions through natural language (NL) explanations, their explanations may not accurately reflect the LLMs' decision-making process due to a lack of fidelity optimization on the derived explanations. Measuring the fidelity of NL explanations is a challenging issue, as it is difficult to manipulate the input context to mask the semantics of these explanations. To this end, we introduce FaithLM to explain the decision of LLMs with NL explanations. Specifically, FaithLM designs a method for evaluating the fidelity of NL explanations by incorporating the contrary explanations to the query process. Moreover, FaithLM conducts an iterative process to improve the fidelity of derived explanations. Experiment results on three datasets from multiple domains demonstrate that FaithLM can significantly improve the fidelity of derived explanations, which also provides a better alignment with the ground-truth explanations.
♻ ☆ A Survey of Privacy-Preserving Model Explanations: Privacy Risks, Attacks, and Countermeasures
As the adoption of explainable AI (XAI) continues to expand, the urgency to address its privacy implications intensifies. Despite a growing corpus of research in AI privacy and explainability, there is little attention on privacy-preserving model explanations. This article presents the first thorough survey about privacy attacks on model explanations and their countermeasures. Our contribution to this field comprises a thorough analysis of research papers with a connected taxonomy that facilitates the categorisation of privacy attacks and countermeasures based on the targeted explanations. This work also includes an initial investigation into the causes of privacy leaks. Finally, we discuss unresolved issues and prospective research directions uncovered in our analysis. This survey aims to be a valuable resource for the research community and offers clear insights for those new to this domain. To support ongoing research, we have established an online resource repository, which will be continuously updated with new and relevant findings. Interested readers are encouraged to access our repository at https://github.com/tamlhp/awesome-privex.
comment: Revision
♻ ☆ A GPU-Accelerated Bi-linear ADMM Algorithm for Distributed Sparse Machine Learning
This paper introduces the Bi-linear consensus Alternating Direction Method of Multipliers (Bi-cADMM), aimed at solving large-scale regularized Sparse Machine Learning (SML) problems defined over a network of computational nodes. Mathematically, these are stated as minimization problems with convex local loss functions over a global decision vector, subject to an explicit $\ell_0$ norm constraint to enforce the desired sparsity. The considered SML problem generalizes different sparse regression and classification models, such as sparse linear and logistic regression, sparse softmax regression, and sparse support vector machines. Bi-cADMM leverages a bi-linear consensus reformulation of the original non-convex SML problem and a hierarchical decomposition strategy that divides the problem into smaller sub-problems amenable to parallel computing. In Bi-cADMM, this decomposition strategy is based on a two-phase approach. Initially, it performs a sample decomposition of the data and distributes local datasets across computational nodes. Subsequently, a delayed feature decomposition of the data is conducted on Graphics Processing Units (GPUs) available to each node. This methodology allows Bi-cADMM to undertake computationally intensive data-centric computations on GPUs, while CPUs handle more cost-effective computations. The proposed algorithm is implemented within an open-source Python package called Parallel Sparse Fitting Toolbox (PsFiT), which is publicly available. Finally, computational experiments demonstrate the efficiency and scalability of our algorithm through numerical benchmarks across various SML problems featuring distributed datasets.
♻ ☆ Empathy Detection from Text, Audiovisual, Audio or Physiological Signals: Task Formulations and Machine Learning Methods
Empathy indicates an individual's ability to understand others. Over the past few years, empathy has drawn attention from various disciplines, including but not limited to Affective Computing, Cognitive Science and Psychology. Detecting empathy has potential applications in society, healthcare and education. Despite being a broad and overlapping topic, the avenue of empathy detection leveraging Machine Learning remains underexplored from a systematic literature review perspective. We collected 828 papers from 10 well-known databases, systematically screened them and analysed the final 61 papers. Our analyses reveal several prominent task formulations $-$ including empathy on localised utterances or overall expressions, unidirectional or parallel empathy, and emotional contagion $-$ in monadic, dyadic and group interactions. Empathy detection methods are summarised based on four input modalities $-$ text, audiovisual, audio and physiological signals $-$ thereby presenting modality-specific network architecture design protocols. We discuss challenges, research gaps and potential applications in the Affective Computing-based empathy domain, which can facilitate new avenues of exploration. We further enlist the public availability of datasets and codes. We believe that our work is a stepping stone to developing a robust empathy detection system that can be deployed in practice to enhance the overall well-being of human life.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice
♻ ☆ ES-GNN: Generalizing Graph Neural Networks Beyond Homophily with Edge Splitting
While Graph Neural Networks (GNNs) have achieved enormous success in multiple graph analytical tasks, modern variants mostly rely on the strong inductive bias of homophily. However, real-world networks typically exhibit both homophilic and heterophilic linking patterns, wherein adjacent nodes may share dissimilar attributes and distinct labels. Therefore, GNNs smoothing node proximity holistically may aggregate both task-relevant and irrelevant (even harmful) information, limiting their ability to generalize to heterophilic graphs and potentially causing non-robustness. In this work, we propose a novel Edge Splitting GNN (ES-GNN) framework to adaptively distinguish between graph edges either relevant or irrelevant to learning tasks. This essentially transfers the original graph into two subgraphs with the same node set but complementary edge sets dynamically. Given that, information propagation separately on these subgraphs and edge splitting are alternatively conducted, thus disentangling the task-relevant and irrelevant features. Theoretically, we show that our ES-GNN can be regarded as a solution to a disentangled graph denoising problem, which further illustrates our motivations and interprets the improved generalization beyond homophily. Extensive experiments over 11 benchmark and 1 synthetic datasets not only demonstrate the effective performance of ES-GNN but also highlight its robustness to adversarial graphs and mitigation of the over-smoothing problem.
comment: Under review
♻ ☆ Single-sample versus case-control sampling scheme for Positive Unlabeled data: the story of two scenarios
In the paper we argue that performance of the classifiers based on Empirical Risk Minimization (ERM) for positive unlabeled data, which are designed for case-control sampling scheme may significantly deteriorate when applied to a single-sample scenario. We reveal why their behavior depends, in all but very specific cases, on the scenario. Also, we introduce a single-sample case analogue of the popular non-negative risk classifier designed for case-control data and compare its performance with the original proposal. We show that the significant differences occur between them, especiall when half or more positive of observations are labeled. The opposite case when ERM minimizer designed for the case-control case is applied for single-sample data is also considered and similar conclusions are drawn. Taking into account difference of scenarios requires a sole, but crucial, change in the definition of the Empirical Risk.
♻ ☆ Generating Chain-of-Thoughts with a Pairwise-Comparison Approach to Searching for the Most Promising Intermediate Thought ICML 2024
To improve the ability of the large language model (LLMs) to tackle complex reasoning problems, chain-of-thoughts (CoT) methods were proposed to guide LLMs to reason step-by-step, enabling problem solving from simple to complex. State-of-the-art methods for generating such a chain involve interactive collaboration, where the learner generates candidate intermediate thoughts, evaluated by the LLM, guiding the generation of subsequent thoughts. However, a widespread yet understudied problem is that the evaluation from the LLM is typically noisy and unreliable, potentially misleading the generation process in selecting promising intermediate thoughts. In this paper, motivated by Vapnik's principle, we use pairwise-comparison evaluation instead of point-wise scoring to search for promising intermediate thoughts with the noisy feedback from the LLM. In each round, we randomly pair intermediate thoughts and directly prompt the LLM to select the more promising one from each pair, allowing us to identify the most promising thoughts through an iterative process. To further alleviate the noise in the comparison, we incorporate techniques from ensemble learning and dueling bandits, proposing two variants of the algorithm. Experiments on three real-world tasks demonstrate the effectiveness of our proposed algorithm and verify the rationale of the pairwise comparison mechanism.
comment: ICML 2024
♻ ☆ LayerMerge: Neural Network Depth Compression through Layer Pruning and Merging ICML 2024
Recent works show that reducing the number of layers in a convolutional neural network can enhance efficiency while maintaining the performance of the network. Existing depth compression methods remove redundant non-linear activation functions and merge the consecutive convolution layers into a single layer. However, these methods suffer from a critical drawback; the kernel size of the merged layers becomes larger, significantly undermining the latency reduction gained from reducing the depth of the network. We show that this problem can be addressed by jointly pruning convolution layers and activation functions. To this end, we propose LayerMerge, a novel depth compression method that selects which activation layers and convolution layers to remove, to achieve a desired inference speed-up while minimizing performance loss. Since the corresponding selection problem involves an exponential search space, we formulate a novel surrogate optimization problem and efficiently solve it via dynamic programming. Empirical results demonstrate that our method consistently outperforms existing depth compression and layer pruning methods on various network architectures, both on image classification and generation tasks. We release the code at https://github.com/snu-mllab/LayerMerge.
comment: ICML 2024
♻ ☆ CIMRL: Combining IMitation and Reinforcement Learning for Safe Autonomous Driving
Modern approaches to autonomous driving rely heavily on learned components trained with large amounts of human driving data via imitation learning. However, these methods require large amounts of expensive data collection and even then face challenges with safely handling long-tail scenarios and compounding errors over time. At the same time, pure Reinforcement Learning (RL) methods can fail to learn performant policies in sparse, constrained, and challenging-to-define reward settings like driving. Both of these challenges make deploying purely cloned policies in safety critical applications like autonomous vehicles challenging. In this paper we propose Combining IMitation and Reinforcement Learning (CIMRL) approach - a framework that enables training driving policies in simulation through leveraging imitative motion priors and safety constraints. CIMRL does not require extensive reward specification and improves on the closed loop behavior of pure cloning methods. By combining RL and imitation, we demonstrate that our method achieves state-of-the-art results in closed loop simulation driving benchmarks.
♻ ☆ Neural Methods for Amortised Inference
Simulation-based methods for statistical inference have evolved dramatically over the past 50 years, keeping pace with technological advancements. The field is undergoing a new revolution as it embraces the representational capacity of neural networks, optimisation libraries and graphics processing units for learning complex mappings between data and inferential targets. The resulting tools are amortised, in the sense that they allow rapid inference through fast feedforward operations. In this article we review recent progress in the context of point estimation, approximate Bayesian inference, summary-statistic construction, and likelihood approximation. We also cover software, and include a simple illustration to showcase the wide array of tools available for amortised inference and the benefits they offer over Markov chain Monte Carlo methods. The article concludes with an overview of relevant topics and an outlook on future research directions.
comment: 45 pages, 11 figures, 2 tables
Iterated Denoising Energy Matching for Sampling from Boltzmann Densities ICML 2024
Efficiently generating statistically independent samples from an unnormalized probability distribution, such as equilibrium samples of many-body systems, is a foundational problem in science. In this paper, we propose Iterated Denoising Energy Matching (iDEM), an iterative algorithm that uses a novel stochastic score matching objective leveraging solely the energy function and its gradient -- and no data samples -- to train a diffusion-based sampler. Specifically, iDEM alternates between (I) sampling regions of high model density from a diffusion-based sampler and (II) using these samples in our stochastic matching objective to further improve the sampler. iDEM is scalable to high dimensions as the inner matching objective, is simulation-free, and requires no MCMC samples. Moreover, by leveraging the fast mode mixing behavior of diffusion, iDEM smooths out the energy landscape enabling efficient exploration and learning of an amortized sampler. We evaluate iDEM on a suite of tasks ranging from standard synthetic energy functions to invariant $n$-body particle systems. We show that the proposed approach achieves state-of-the-art performance on all metrics and trains $2-5\times$ faster, which allows it to be the first method to train using energy on the challenging $55$-particle Lennard-Jones system.
comment: Published at ICML 2024. Code for iDEM is available at https://github.com/jarridrb/dem
♻ ☆ A Large-Scale Exploration of $μ$-Transfer NeurIPS
Large artificial neural networks have become a mainstay of language, vision, and audio processing and synthesis, yet their initializations and learning rates are often set in an unsophisticated fashion, due to the high cost of hyperparameter sweeps at scale. The $\mu$-Parameterization ($\mu$P) offers a potential solution to this challenge, yielding scaling rules for model initialization and learning rates while reportedly enabling zero-shot hyperparameter transfer from small to large models. Despite its evident promise, the $\mu$P method is not yet widely adopted, perhaps due to higher implementation complexity, many variations, or complex theoretical background. This work investigates $\mu$P empirically, focusing on the ubiquitous transformer architecture, and aims to answer a simple question: does $\mu$-Transfer yield optimal learning rates in practice? Studying models of up to 10B parameters and training budgets of up to 190B tokens, we find $\mu$-Transfer works as intended for the majority of important cases, yet also identify a few cases where it may not.
comment: V5: Improved exposition and formatting, expanded bib. V4: NeurIPS style, extra experiments and expanded bib. V3: Formatting, extra experiments, expanded bib. V2: Formatting, extra experiments
♻ ☆ STEEL: Singularity-aware Reinforcement Learning
Batch reinforcement learning (RL) aims at leveraging pre-collected data to find an optimal policy that maximizes the expected total rewards in a dynamic environment. The existing methods require absolutely continuous assumption (e.g., there do not exist non-overlapping regions) on the distribution induced by target policies with respect to the data distribution over either the state or action or both. We propose a new batch RL algorithm that allows for singularity for both state and action spaces (e.g., existence of non-overlapping regions between offline data distribution and the distribution induced by the target policies) in the setting of an infinite-horizon Markov decision process with continuous states and actions. We call our algorithm STEEL: SingulariTy-awarE rEinforcement Learning. Our algorithm is motivated by a new error analysis on off-policy evaluation, where we use maximum mean discrepancy, together with distributionally robust optimization, to characterize the error of off-policy evaluation caused by the possible singularity and to enable model extrapolation. By leveraging the idea of pessimism and under some technical conditions, we derive a first finite-sample regret guarantee for our proposed algorithm under singularity. Compared with existing algorithms,by requiring only minimal data-coverage assumption, STEEL improves the applicability and robustness of batch RL. In addition, a two-step adaptive STEEL, which is nearly tuning-free, is proposed. Extensive simulation studies and one (semi)-real experiment on personalized pricing demonstrate the superior performance of our methods in dealing with possible singularity in batch RL.
♻ ☆ Multi-Agent Imitation Learning: Value is Easy, Regret is Hard
We study a multi-agent imitation learning (MAIL) problem where we take the perspective of a learner attempting to coordinate a group of agents based on demonstrations of an expert doing so. Most prior work in MAIL essentially reduces the problem to matching the behavior of the expert within the support of the demonstrations. While doing so is sufficient to drive the value gap between the learner and the expert to zero under the assumption that agents are non-strategic, it does not guarantee robustness to deviations by strategic agents. Intuitively, this is because strategic deviations can depend on a counterfactual quantity: the coordinator's recommendations outside of the state distribution their recommendations induce. In response, we initiate the study of an alternative objective for MAIL in Markov Games we term the regret gap that explicitly accounts for potential deviations by agents in the group. We first perform an in-depth exploration of the relationship between the value and regret gaps. First, we show that while the value gap can be efficiently minimized via a direct extension of single-agent IL algorithms, even value equivalence can lead to an arbitrarily large regret gap. This implies that achieving regret equivalence is harder than achieving value equivalence in MAIL. We then provide a pair of efficient reductions to no-regret online convex optimization that are capable of minimizing the regret gap (a) under a coverage assumption on the expert (MALICE) or (b) with access to a queryable expert (BLADES).
♻ ☆ Antigen-Specific Antibody Design via Direct Energy-based Preference Optimization
Antibody design, a crucial task with significant implications across various disciplines such as therapeutics and biology, presents considerable challenges due to its intricate nature. In this paper, we tackle antigen-specific antibody sequence-structure co-design as an optimization problem towards specific preferences, considering both rationality and functionality. Leveraging a pre-trained conditional diffusion model that jointly models sequences and structures of antibodies with equivariant neural networks, we propose direct energy-based preference optimization to guide the generation of antibodies with both rational structures and considerable binding affinities to given antigens. Our method involves fine-tuning the pre-trained diffusion model using a residue-level decomposed energy preference. Additionally, we employ gradient surgery to address conflicts between various types of energy, such as attraction and repulsion. Experiments on RAbD benchmark show that our approach effectively optimizes the energy of generated antibodies and achieves state-of-the-art performance in designing high-quality antibodies with low total energy and high binding affinity simultaneously, demonstrating the superiority of our approach.
♻ ☆ GCondenser: Benchmarking Graph Condensation
Large-scale graphs are valuable for graph representation learning, yet the abundant data in these graphs hinders the efficiency of the training process. Graph condensation (GC) alleviates this issue by compressing the large graph into a significantly smaller one that still supports effective model training. Although recent research has introduced various approaches to improve the effectiveness of the condensed graph, comprehensive and practical evaluations across different GC methods are neglected. This paper proposes the first large-scale graph condensation benchmark, GCondenser, to holistically evaluate and compare mainstream GC methods. GCondenser includes a standardised GC paradigm, consisting of condensation, validation, and evaluation procedures, as well as enabling extensions to new GC methods and datasets. With GCondenser, a comprehensive performance study is conducted, presenting the effectiveness of existing methods. GCondenser is open-sourced and available at https://github.com/superallen13/GCondenser.
comment: GCondenser is open-sourced and available at https://github.com/superallen13/GCondenser
♻ ☆ Reinforcement Learning from Delayed Observations via World Models
In standard reinforcement learning settings, agents typically assume immediate feedback about the effects of their actions after taking them. However, in practice, this assumption may not hold true due to physical constraints and can significantly impact the performance of learning algorithms. In this paper, we address observation delays in partially observable environments. We propose leveraging world models, which have shown success in integrating past observations and learning dynamics, to handle observation delays. By reducing delayed POMDPs to delayed MDPs with world models, our methods can effectively handle partial observability, where existing approaches achieve sub-optimal performance or degrade quickly as observability decreases. Experiments suggest that one of our methods can outperform a naive model-based approach by up to 250%. Moreover, we evaluate our methods on visual delayed environments, for the first time showcasing delay-aware reinforcement learning continuous control with visual observations.
♻ ☆ Stabilizing Policy Gradients for Stochastic Differential Equations via Consistency with Perturbation Process ICML 2024
Considering generating samples with high rewards, we focus on optimizing deep neural networks parameterized stochastic differential equations (SDEs), the advanced generative models with high expressiveness, with policy gradient, the leading algorithm in reinforcement learning. Nevertheless, when applying policy gradients to SDEs, since the policy gradient is estimated on a finite set of trajectories, it can be ill-defined, and the policy behavior in data-scarce regions may be uncontrolled. This challenge compromises the stability of policy gradients and negatively impacts sample complexity. To address these issues, we propose constraining the SDE to be consistent with its associated perturbation process. Since the perturbation process covers the entire space and is easy to sample, we can mitigate the aforementioned problems. Our framework offers a general approach allowing for a versatile selection of policy gradient methods to effectively and efficiently train SDEs. We evaluate our algorithm on the task of structure-based drug design and optimize the binding affinity of generated ligand molecules. Our method achieves the best Vina score -9.07 on the CrossDocked2020 dataset.
comment: Accepted to ICML 2024
♻ ☆ The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources
Foundation model development attracts a rapidly expanding body of contributors, scientists, and applications. To help shape responsible development practices, we introduce the Foundation Model Development Cheatsheet: a growing collection of 250+ tools and resources spanning text, vision, and speech modalities. We draw on a large body of prior work to survey resources (e.g. software, documentation, frameworks, guides, and practical tools) that support informed data selection, processing, and understanding, precise and limitation-aware artifact documentation, efficient model training, advance awareness of the environmental impact from training, careful model evaluation of capabilities, risks, and claims, as well as responsible model release, licensing and deployment practices. We hope this curated collection of resources helps guide more responsible development. The process of curating this list, enabled us to review the AI development ecosystem, revealing what tools are critically missing, misused, or over-used in existing practices. We find that (i) tools for data sourcing, model evaluation, and monitoring are critically under-serving ethical and real-world needs, (ii) evaluations for model safety, capabilities, and environmental impact all lack reproducibility and transparency, (iii) text and particularly English-centric analyses continue to dominate over multilingual and multi-modal analyses, and (iv) evaluation of systems, rather than just models, is needed so that capabilities and impact are assessed in context.
♻ ☆ MgNO: Efficient Parameterization of Linear Operators via Multigrid
In this work, we propose a concise neural operator architecture for operator learning. Drawing an analogy with a conventional fully connected neural network, we define the neural operator as follows: the output of the $i$-th neuron in a nonlinear operator layer is defined by $O_i(u) = \sigma\left( \sum_j W_{ij} u + B_{ij}\right)$. Here, $ W_{ij}$ denotes the bounded linear operator connecting $j$-th input neuron to $i$-th output neuron, and the bias $ B_{ij}$ takes the form of a function rather than a scalar. Given its new universal approximation property, the efficient parameterization of the bounded linear operators between two neurons (Banach spaces) plays a critical role. As a result, we introduce MgNO, utilizing multigrid structures to parameterize these linear operators between neurons. This approach offers both mathematical rigor and practical expressivity. Additionally, MgNO obviates the need for conventional lifting and projecting operators typically required in previous neural operators. Moreover, it seamlessly accommodates diverse boundary conditions. Our empirical observations reveal that MgNO exhibits superior ease of training compared to other CNN-based models, while also displaying a reduced susceptibility to overfitting when contrasted with spectral-type neural operators. We demonstrate the efficiency and accuracy of our method with consistently state-of-the-art performance on different types of partial differential equations (PDEs).
♻ ☆ Multimodal Physiological Signals Representation Learning via Multiscale Contrasting for Depression Recognition
Depression recognition based on physiological signals such as functional near-infrared spectroscopy (fNIRS) and electroencephalogram (EEG) has made considerable progress. However, most existing studies ignore the complementarity and semantic consistency of multimodal physiological signals under the same stimulation task in complex spatio-temporal patterns. In this paper, we introduce a multimodal physiological signals representation learning framework using Siamese architecture via multiscale contrasting for depression recognition (MRLMC). First, fNIRS and EEG are transformed into different but correlated data based on a time-domain data augmentation strategy. Then, we design a spatio-temporal contrasting module to learn the representation of fNIRS and EEG through weight-sharing multiscale spatio-temporal convolution. Furthermore, to enhance the learning of semantic representation associated with stimulation tasks, a semantic consistency contrast module is proposed, aiming to maximize the semantic similarity of fNIRS and EEG. Extensive experiments on publicly available and self-collected multimodal physiological signals datasets indicate that MRLMC outperforms the state-of-the-art models. Moreover, our proposed framework is capable of transferring to multimodal time series downstream tasks.
♻ ☆ Vanilla Bayesian Optimization Performs Great in High Dimensions
High-dimensional problems have long been considered the Achilles' heel of Bayesian optimization algorithms. Spurred by the curse of dimensionality, a large collection of algorithms aim to make it more performant in this setting, commonly by imposing various simplifying assumptions on the objective. In this paper, we identify the degeneracies that make vanilla Bayesian optimization poorly suited to high-dimensional tasks, and further show how existing algorithms address these degeneracies through the lens of lowering the model complexity. Moreover, we propose an enhancement to the prior assumptions that are typical to vanilla Bayesian optimization algorithms, which reduces the complexity to manageable levels without imposing structural restrictions on the objective. Our modification - a simple scaling of the Gaussian process lengthscale prior with the dimensionality - reveals that standard Bayesian optimization works drastically better than previously thought in high dimensions, clearly outperforming existing state-of-the-art algorithms on multiple commonly considered real-world high-dimensional tasks.
♻ ☆ Pointwise convergence of Fourier series and deep neural network for the indicator function of d-dimensional ball
In this paper, we clarify the crucial difference between a deep neural network and the Fourier series. For the multiple Fourier series of periodization of some radial functions on $\mathbb{R}^d$, Kuratsubo (2010) investigated the behavior of the spherical partial sum and discovered the third phenomenon other than the well-known Gibbs-Wilbraham and Pinsky phenomena. In particular, the third one exhibits prevention of pointwise convergence. In contrast to it, we give a specific deep neural network and prove pointwise convergence.
comment: When the version 2 was rejected (where I submitted it to an AI journal), I realized I needed to further clarify the key point, and also realized the field is rather Fourier analysis
♻ ☆ Learning in RKHM: a $C^*$-Algebraic Twist for Kernel Machines
Supervised learning in reproducing kernel Hilbert space (RKHS) and vector-valued RKHS (vvRKHS) has been investigated for more than 30 years. In this paper, we provide a new twist to this rich literature by generalizing supervised learning in RKHS and vvRKHS to reproducing kernel Hilbert $C^*$-module (RKHM), and show how to construct effective positive-definite kernels by considering the perspective of $C^*$-algebra. Unlike the cases of RKHS and vvRKHS, we can use $C^*$-algebras to enlarge representation spaces. This enables us to construct RKHMs whose representation power goes beyond RKHSs, vvRKHSs, and existing methods such as convolutional neural networks. Our framework is suitable, for example, for effectively analyzing image data by allowing the interaction of Fourier components.
comment: We corrected errors in the experiments in Section 6.2
♻ ☆ Unleashing the Expressive Power of Pulse-Based Quantum Neural Networks
Quantum machine learning (QML) based on Noisy Intermediate-Scale Quantum (NISQ) devices hinges on the optimal utilization of limited quantum resources. While gate-based QML models are user-friendly for software engineers, their expressivity is restricted by the permissible circuit depth within a finite coherence time. In contrast, pulse-based models enable the construction of "infinitely" deep quantum neural networks within the same time, which may unleash greater expressive power for complex learning tasks. In this paper, this potential is investigated from the perspective of quantum control theory. We first indicate that the nonlinearity of pulse-based models comes from the encoding process that can be viewed as the continuous limit of data-reuploading in gate-based models. Subsequently, we prove that the pulse-based model can approximate arbitrary nonlinear functions when the underlying physical system is ensemble controllable. Under this condition, numerical simulations demonstrate the enhanced expressivity by either increasing the pulse length or the number of qubits. As anticipated, we show through numerical examples that the pulse-based model can unleash more expressive power compared to the gate-based model. These findings lay a theoretical foundation for understanding and designing expressive QML models using NISQ devices.
comment: 12 pages; 6 figures
♻ ☆ Bayesian identification of nonseparable Hamiltonians with multiplicative noise using deep learning and reduced-order modeling
This paper presents a structure-preserving Bayesian approach for learning nonseparable Hamiltonian systems using stochastic dynamic models allowing for statistically-dependent, vector-valued additive and multiplicative measurement noise. The approach is comprised of three main facets. First, we derive a Gaussian filter for a statistically-dependent, vector-valued, additive and multiplicative noise model that is needed to evaluate the likelihood within the Bayesian posterior. Second, we develop a novel algorithm for cost-effective application of Bayesian system identification to high-dimensional systems. Third, we demonstrate how structure-preserving methods can be incorporated into the proposed framework, using nonseparable Hamiltonians as an illustrative system class. We assess the method's performance based on the forecasting accuracy of a model estimated from-single trajectory data. We compare the Bayesian method to a state-of-the-art machine learning method on a canonical nonseparable Hamiltonian model and a chaotic double pendulum model with small, noisy training datasets. The results show that using the Bayesian posterior as a training objective can yield upwards of 724 times improvement in Hamiltonian mean squared error using training data with up to 10% multiplicative noise compared to a standard training objective. Lastly, we demonstrate the utility of the novel algorithm for parameter estimation of a 64-dimensional model of the spatially-discretized nonlinear Schr\"odinger equation with data corrupted by up to 20% multiplicative noise.
♻ ☆ To smooth a cloud or to pin it down: Guarantees and Insights on Score Matching in Denoising Diffusion Models
Denoising diffusion models are a class of generative models which have recently achieved state-of-the-art results across many domains. Gradual noise is added to the data using a diffusion process, which transforms the data distribution into a Gaussian. Samples from the generative model are then obtained by simulating an approximation of the time reversal of this diffusion initialized by Gaussian samples. Recent research has explored adapting diffusion models for sampling and inference tasks. In this paper, we leverage known connections to stochastic control akin to the F\"ollmer drift to extend established neural network approximation results for the F\"ollmer drift to denoising diffusion models and samplers.
comment: arXiv admin note: text overlap with arXiv:1903.01608 by other authors
♻ ☆ Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity
Large language models (LLMs) are increasingly integrated into many online services, yet they remain cost-prohibitive to deploy due to the requirement of expensive GPU instances. Prior work has addressed the high cost of LLM serving by improving the inference engine, but less attention has been given to selecting the most cost-efficient GPU type(s) for a specific LLM service. There is a large and growing landscape of GPU types and, within these options, higher cost does not always lead to increased performance. Instead, through a comprehensive investigation, we find that three key LLM service characteristics (request size, request rate, SLO) strongly influence GPU cost efficiency, and differing GPU types are most cost efficient for differing LLM service settings. As a result, the most cost-efficient allocation for a given service is typically a mix of heterogeneous GPU types. Based on this analysis, we introduce M\'elange, a GPU allocation framework that navigates these diverse LLM service characteristics and heterogeneous GPU option space to automatically and efficiently derive the minimal-cost GPU allocation for a given LLM service. We formulate the GPU allocation task as a cost-aware bin packing problem where GPUs are bins and items are slices of the service workload. Our formulation's constraints account for a service's unique characteristics, allowing M\'elange to be flexible to support diverse service settings and heterogeneity-aware to adapt the GPU allocation to a specific service. Compared to using only a single GPU type, M\'elange reduces deployment costs by up to 77\% in conversational settings, 33\% in document-based settings, and 51\% in a mixed setting.
♻ ☆ PGODE: Towards High-quality System Dynamics Modeling ICML 2024
This paper studies the problem of modeling multi-agent dynamical systems, where agents could interact mutually to influence their behaviors. Recent research predominantly uses geometric graphs to depict these mutual interactions, which are then captured by powerful graph neural networks (GNNs). However, predicting interacting dynamics in challenging scenarios such as out-of-distribution shift and complicated underlying rules remains unsolved. In this paper, we propose a new approach named Prototypical Graph ODE (PGODE) to address the problem. The core of PGODE is to incorporate prototype decomposition from contextual knowledge into a continuous graph ODE framework. Specifically, PGODE employs representation disentanglement and system parameters to extract both object-level and system-level contexts from historical trajectories, which allows us to explicitly model their independent influence and thus enhances the generalization capability under system changes. Then, we integrate these disentangled latent representations into a graph ODE model, which determines a combination of various interacting prototypes for enhanced model expressivity. The entire model is optimized using an end-to-end variational inference framework to maximize the likelihood. Extensive experiments in both in-distribution and out-of-distribution settings validate the superiority of PGODE compared to various baselines.
comment: Accepted by ICML 2024
♻ ☆ Latent diffusion models for parameterization and data assimilation of facies-based geomodels
Geological parameterization entails the representation of a geomodel using a small set of latent variables and a mapping from these variables to grid-block properties such as porosity and permeability. Parameterization is useful for data assimilation (history matching), as it maintains geological realism while reducing the number of variables to be determined. Diffusion models are a new class of generative deep-learning procedures that have been shown to outperform previous methods, such as generative adversarial networks, for image generation tasks. Diffusion models are trained to "denoise", which enables them to generate new geological realizations from input fields characterized by random noise. Latent diffusion models, which are the specific variant considered in this study, provide dimension reduction through use of a low-dimensional latent variable. The model developed in this work includes a variational autoencoder for dimension reduction and a U-net for the denoising process. Our application involves conditional 2D three-facies (channel-levee-mud) systems. The latent diffusion model is shown to provide realizations that are visually consistent with samples from geomodeling software. Quantitative metrics involving spatial and flow-response statistics are evaluated, and general agreement between the diffusion-generated models and reference realizations is observed. Stability tests are performed to assess the smoothness of the parameterization method. The latent diffusion model is then used for ensemble-based data assimilation. Two synthetic "true" models are considered. Significant uncertainty reduction, posterior P$_{10}$-P$_{90}$ forecasts that generally bracket observed data, and consistent posterior geomodels, are achieved in both cases.
comment: - Moved Table 1 from before to after Section 4.2 heading - Renamed output pdf file with paper title
♻ ☆ Machine Learning-Enabled Software and System Architecture Frameworks
Various architecture frameworks for software, systems, and enterprises have been proposed in the literature. They identified several stakeholders and defined modeling perspectives, architecture viewpoints, and views to frame and address stakeholder concerns. However, the stakeholders with data science and Machine Learning (ML) related concerns, such as data scientists and data engineers, are yet to be included in existing architecture frameworks. Only this way can we envision a holistic system architecture description of an ML-enabled system. Note that the ML component behavior and functionalities are special and should be distinguished from traditional software system behavior and functionalities. The main reason is that the actual functionality should be inferred from data instead of being specified at design time. Additionally, the structural models of ML components, such as ML model architectures, are typically specified using different notations and formalisms from what the Software Engineering (SE) community uses for software structural models. Yet, these two aspects, namely ML and non-ML, are becoming so intertwined that it necessitates an extension of software architecture frameworks and modeling practices toward supporting ML-enabled system architectures. In this paper, we address this gap through an empirical study using an online survey instrument. We surveyed 61 subject matter experts from over 25 organizations in 10 countries.
comment: Journal manuscript
♻ ☆ Towards understanding neural collapse in supervised contrastive learning with the information bottleneck method
Neural collapse describes the geometry of activation in the final layer of a deep neural network when it is trained beyond performance plateaus. Open questions include whether neural collapse leads to better generalization and, if so, why and how training beyond the plateau helps. We model neural collapse as an information bottleneck (IB) problem in order to investigate whether such a compact representation exists and discover its connection to generalization. We demonstrate that neural collapse leads to good generalization specifically when it approaches an optimal IB solution of the classification problem. Recent research has shown that two deep neural networks independently trained with the same contrastive loss objective are linearly identifiable, meaning that the resulting representations are equivalent up to a matrix transformation. We leverage linear identifiability to approximate an analytical solution of the IB problem. This approximation demonstrates that when class means exhibit $K$-simplex Equiangular Tight Frame (ETF) behavior (e.g., $K$=10 for CIFAR10 and $K$=100 for CIFAR100), they coincide with the critical phase transitions of the corresponding IB problem. The performance plateau occurs once the optimal solution for the IB problem includes all of these phase transitions. We also show that the resulting $K$-simplex ETF can be packed into a $K$-dimensional Gaussian distribution using supervised contrastive learning with a ResNet50 backbone. This geometry suggests that the $K$-simplex ETF learned by supervised contrastive learning approximates the optimal features for source coding. Hence, there is a direct correspondence between optimal IB solutions and generalization in contrastive learning.
Systems and Control 19
☆ Addressing intra-area oscillations and frequency stability after DC segmentation of a large AC power system
In the last decades, various events have shown that electromechanical oscillations are a major concern for large interconnected Alternating Current (AC) power systems. Segmentation of AC power systems with High Voltage Direct Current (HVDC) systems (DC segmentation, for short) is a method that consists in turning large AC grids into a set of asynchronous AC clusters linked by HVDC links. It is a promising solution to mitigate electromechanical oscillations and other issues. In particular, an appropriately placed DC segmentation can stop a selected inter-area electromechanical oscillation mode. However, without supplementary controllers, DC segmentation will not contribute to the damping of the intra-area oscillation modes in the remaining AC clusters and will deteriorate the frequency stability of the power system. This paper aims at filling this gap and proposes the use of DC segmentation with HVDC systems based on Voltage Source Converters (VSC-HVDC) with supplementary controllers in the converter stations: (a) active-power supplementary controllers for frequency support among the asynchronous AC clusters and (b) a reactive-power supplementary controllers for Power Oscillation Damping (POD-Q), in order to damp the intra-area oscillation modes. The proposed supplementary controllers and their design will be presented, and their efficiency will be demonstrated on the Nordic 44 test system with DC segmentation by means of non-linear time-domain simulation and small-signal stability analysis.
comment: arXiv admin note: text overlap with arXiv:2302.09931
☆ Algebraic Connectivity Control and Maintenance in Multi-Agent Networks under Attack
This paper studies the problem of increasing the connectivity of an ad-hoc peer-to-peer network subject to cyber-attacks targeting the agents in the network. The adopted strategy involves the design of local interaction rules for the agents to locally modify the graph topology by adding and removing links with neighbors. Two distributed protocols are presented to boost the algebraic connectivity of the network graph beyond $k-2\sqrt{k-1}$ where $k\in \mathbb{N}$ is a free design parameter; these two protocols are achieved through the distributed construction of random (approximate) regular graphs. One protocol leverages coordinated actions between pairs of neighboring agents and is mathematically proven to converge to the desired graph topology. The other protocol relies solely on the uncoordinated actions of individual agents and it is validated by a spectral analysis through Monte-Carlo simulations. Numerical simulations offer a comparative analysis with other state-of-the-art algorithms, showing the ability of both proposed protocols to maintain high levels of connectivity despite attacks carried out with full knowledge of the network structure, and highlighting their superior performance.
☆ System for Measurement of Electric Energy Using Beacons with Optical Sensors and LoRaWAN Transmission
In this article, we present the results of experiments with finding an efficient radio transmission method for an electric energy measurement system called OneMeter 2.0. This system offers a way of collecting energy usage data from beacons attached to regular, non-smart meters. In our study, we compared several low power wide area network (LPWAN) protocols, out of which we chose the LoRaWAN protocol. We verified the energy consumption of a LoRa-based transmission unit, as well as the transmission range between network nodes in urban conditions. We discovered that LoRaWAN-based transmission was highly energy-efficient and offered decent coverage, even in a difficult, dense urban environment.
comment: 7 pages, 8 figures
☆ Optimal Multi-Robot Communication-Aware Trajectory Planning by Constraining the Fiedler Value
The paper present a novel approach for the solution of the Multi-Robot Communication-Aware Trajectory Planning, which builds on a general optimisation framework where the changes in robots positions are used as decision variable, and linear constraints on the trajectories of the robots are introduced to ensure communication performance and collision avoidance. The Fiedler value is adopted as communication performance metric. The validity of the method in computing both feasible and optimal trajectories for the robots is demonstrated both in simulation and experimentally. Results show that the constraint on the Fiedler value ensures that the robot network fulfils its objective while maintaining communication connectivity at all times. Further, the paper shows that the introduction of approximations for the constraints enables a significant improvement in the computational time of the solution, which remain very close to the optimal solution.
☆ Exploiting Data Significance in Remote Estimation of Discrete-State Markov Sources
We consider the semantics-aware remote estimation of a discrete-state Markov source with normal (low-priority) and alarm (high-priority) states. Erroneously announcing a normal state at the destination when the source is actually in an alarm state (i.e., missed alarm error) incurs a significantly higher cost than falsely announcing an alarm state when the source is in a normal state (i.e., false alarm error). Moreover, successive reception of an estimation error may cause significant lasting impact, e.g., maintenance cost and misoperations. Motivated by this, we assign different costs to different estimation errors and introduce two new age metrics, namely the Age of Missed Alarm (AoMA) and the Age of False Alarm (AoFA), to account for the lasting impact incurred by different estimation errors. Notably, the two age processes evolve dependently and can distinguish between different types of estimation errors and different synced states. The aim is to achieve an optimal trade-off between the cost of estimation error, lasting impact, and communication utilization. The problem is formulated as an average-cost, countably infinite state-space Markov decision process (MDP). We show that the optimal policy exhibits a switching-type structure, making it amenable to policy storage and algorithm design. Notably, when the source is symmetric and states are equally important, the optimal policy has identical thresholds, i.e., threshold-type. Theoretical and numerical results underscore that our approach extends the current understanding of the Age of Incorrect Information (AoII) and the cost of actuation error (CAE), showing that they are specific instances within our broader framework.
☆ USLC: Universal Self-Learning Control via Physical Performance Policy-Optimization Neural Network
This study addresses the challenge of achieving real-time Universal Self-Learning Control (USLC) in nonlinear dynamic systems with uncertain models. The proposed control method incorporates a Universal Self-Learning module, which introduces a model-free online executor-evaluator framework to enable controller adaptation in the presence of unknown disturbances. By leveraging a neural network model trained on historical system performance data, the controller can autonomously learn to approximate optimal performance during each learning cycle. Consequently, the controller's structural parameters are incrementally adjusted to achieve a performance threshold comparable to human-level performance. Utilizing nonlinear system stability theory, specifically in the context of three-dimensional manifold space, we demonstrate the stability of USLC in Lipschitz continuous systems. We illustrate the USLC framework numerically with two case studies: a low-order circuit system and a high-order morphing fixed-wing attitude control system. The simulation results verify the effectiveness and universality of the proposed method.
comment: 15 Pages
☆ Coordination of Transmission and Distribution Systems in Load Restoration
The power distribution system is evolving in the form of an intelligent grid. The proliferation of distributed energy resources (DERs) makes the previously passive system active and more complicated. With the adoption of de-carbonization principles, large-scale coal and nuclear power plants are being gradually replaced by renewables and carbon-free DERs. With this rapid transformation, the power system operates with less inertia and minimal margins. In recent years, power systems have been facing apocalyptic weather events more frequently, and large-scale blackouts have become regular. After the complete or partial blackouts, the power system goes through different stages before it reaches the normal operating condition. The load restoration is the stage where the power system is fully established after the blackouts; however, due to the limiting ramping rates of centralized generation, the energization of large amounts of loads is delayed by some time. To mitigate the negative impact of ramping rates of centralized generation, DERs in distribution systems are proposed to serve the loads in both transmission and distribution systems in coordination with limited centralized generation in the transmission system. The problem is formulated as a centralized or integrated transmission and distribution (T$\&$D) coordination model. The modified IEEE 14 bus test case and IEEE 13 node test feeders are used to validate the proposed strategy; the results indicate the validity of the proposed model.
☆ Tiered Service Architecture for Remote Patient Monitoring
We develop a remote patient monitoring (RPM) service architecture, which has two tiers of monitoring: ordinary and intensive. The patient's health state improves or worsens in each time period according to certain probabilities, which depend on the monitoring tier. The patient incurs a "loss of quality of life" cost or an "invasiveness" cost, which is higher under intensive monitoring than under ordinary. On the other hand, their health improves faster under intensive monitoring than under ordinary. In each period, the service decides which monitoring tier to use, based on the health of the patient. We investigate the optimal policy for making that choice by formulating the problem using dynamic programming. We first provide analytic conditions for selecting ordinary vs intensive monitoring in the asymptotic regime where the number of health states is large. In the general case, we investigate the optimal policy numerically. We observe a threshold behavior, that is, when the patient's health drops below a certain threshold the service switches them to intensive monitoring, while ordinary monitoring is used during adequately good health states of the patient. The modeling and analysis provides a general framework for managing RPM services for various health conditions with medically/clinically defined system parameters.
comment: Submitted to IEEE Healthcom 2024. 7 pages
☆ Invited: Human-Inspired Distributed Wearable AI
The explosive surge in Human-AI interactions, fused with a soaring fascination in wearable technology, has ignited a frenzy of innovation and the emergence of a myriad of Wearable AI devices, each wielding diverse form factors, tackling tasks from health surveillance to turbocharging productivity. This paper delves into the vision for wearable AI technology, addressing the technical bottlenecks that stand in the way of its promised advancements. Embracing a paradigm shift, we introduce a Human-Inspired Distributed Network for Wearable AI, enabled by high-speed ultra-low-power secure connectivity via the emerging 'Body as a Wire' (Wi-R) technology. This breakthrough acts as the missing link: the artificial nervous system, seamlessly interconnecting all wearables and implantables, ushering in a new era of interconnected intelligence, where featherweight, perpetually operating wearable AI nodes redefine the boundaries of possibility.
comment: 5 pages, 3 figures, DAC 2024
☆ An open-source Autonomous Surface Vehicle for Acoustic Tracking, Bathymetric and Photogrammetric Surveys
Autonomous Surface Vehicles (ASV) are becoming more affordable and include a wide variety of sensors and capacities with applications from ocean physics such as the Saildrone project to ecology with the tracking of marine species in the wild. Here, we present a multi-modal, affordable, open source, and reproducible ASV to track marine animal in shallow waters, collect information on bathymetry, and carry out photogrammetry surveys. The current specification enables scientists to track an animal equipped with an acoustic tag for 5~h and a spatial accuracy of 1~m. For bathymetric or photogrammetry surveys, the ASV can cover 100 x 100~m areas in 2~h with a distance of 1-m between transects. Depending on the sensors included in the ASV, it has a price ranging from \$2,434 to \$11,072. We illustrate these developments with a case study and a field survey for each of the different application proposed.
☆ Data-driven identification of port-Hamiltonian DAE systems by Gaussian processes
Port-Hamiltonian systems (pHS) allow for a structure-preserving modeling of dynamical systems. Coupling pHS via linear relations between input and output defines an overall pHS, which is structure preserving. However, in multiphysics applications, some subsystems do not allow for a physical pHS description, as (a) this is not available or (b) too expensive. Here, data-driven approaches can be used to deliver a pHS for such subsystems, which can then be coupled to the other subsystems in a structure-preserving way. In this work, we derive a data-driven identification approach for port-Hamiltonian differential algebraic equation (DAE) systems. The approach uses input and state space data to estimate nonlinear effort functions of pH-DAEs. As underlying technique, we us (multi-task) Gaussian processes. This work thereby extends over the current state of the art, in which only port-Hamiltonian ordinary differential equation systems could be identified via Gaussian processes. We apply this approach successfully to two applications from network design and constrained multibody system dynamics, based on pH-DAE system of index one and three, respectively.
☆ State-Based Automation for Time-Restricted Eating Adherence
Developing and enforcing study protocols is a foundational component of medical research. As study complexity for participant interactions increases, translating study protocols to supporting application code becomes challenging. A collaboration exists between the University of Kentucky and Arizona State University to determine the efficacy of time-restricted eating in improving metabolic risk among postmenopausal women. This study utilizes a graph-based approach to monitor and support adherence to a designated schedule, enabling the validation and step-wise audit of participants' statuses to derive dependable conclusions. A texting service, driven by a participant graph, automatically manages interactions and collects data. Participant data is then accessible to the research study team via a website, which enables viewing, management, and exportation. This paper presents a system for automatically managing participants in a time-restricted eating study that eliminates time-consuming interactions with participants.
comment: 8 pages, 4 figures, submitted to AMIA 2024 Annual Symposium
☆ Battery Operations in Electricity Markets: Strategic Behavior and Distortions
Electric power systems are undergoing a major transformation as they integrate intermittent renewable energy sources, and batteries to smooth out variations in renewable energy production. As privately-owned batteries grow from their role as marginal "price-takers" to significant players in the market, a natural question arises: How do batteries operate in electricity markets, and how does the strategic behavior of decentralized batteries distort decisions compared to centralized batteries? We propose an analytically tractable model that captures salient features of the highly complex electricity market. We derive in closed form the resulting battery behavior and generation cost in three operating regimes: (i) no battery, (ii) centralized battery, and (ii) decentralized profit-maximizing battery. We establish that a decentralized battery distorts its discharge decisions in three ways. First, there is quantity withholding, i.e., discharging less than centrally optimal. Second, there is a shift in participation from day-ahead to real-time, i.e., postponing some of its discharge from day-ahead to real-time. Third, there is reduction in real-time responsiveness, or discharging less in response to smoothing real-time demand than centrally optimal. We quantify each of the three forms of distortions in terms of market fundamentals. To illustrate our results, we calibrate our model to Los Angeles and Houston and show that the loss from incentive misalignment could be consequential.
☆ Markov Decision Process and Approximate Dynamic Programming for a Patient Assignment Scheduling problem
We study the Patient Assignment Scheduling (PAS) problem in a random environment that arises in the management of patient flow in the hospital systems, due to the stochastic nature of the arrivals as well as the Length of Stay distribution. We develop a Markov Decision Process (MDP) which aims to assign the newly arrived patients in an optimal way so as to minimise the total expected long-run cost per unit time over an infinite horizon. We assume Poisson arrival rates that depend on patient types, and Length of Stay distributions that depend on whether patients stay in their primary wards or not. Since the instances of realistic size of this problem are not easy to solve, we develop numerical methods based on Approximate Dynamic Programming. We illustrate the theory with numerical examples with parameters obtained by fitting to data from a tertiary referral hospital in Australia, and demonstrate the application potential of our methodology under practical considerations.
comment: Submitted to Annals of Operations Research
♻ ☆ Algorithm to find new identifiable reparametrizations of parametric rational ODE models
Structural identifiability concerns the question of which unknown parameters of a model can be recovered from (perfect) input-output data. If all of the parameters of a model can be recovered from data, the model is said to be identifiable. However, in many models, there are parameters that can take on an infinite number of values but yield the same input-output data. In this case, those parameters and the model are called unidentifiable. The question is then what to do with an unidentifiable model. One can either adjust the model, if experimentally feasible, or try to find a reparametrization to make the model identifiable. In this paper, we take the latter approach. While existing approaches to find identifiable reparametrizations were limited to scaling reparametrizations or were not guaranteed to find a globally identifiable reparametrization even if it exists, we significantly broaden the class of models for which we can find a globally identifiable model with the same input-output behavior as the original one. We also prove that, for linear models, a globally identifiable reparametrization always exists and show that, for a certain class of linear compartmental models, an explicit reparametrization formula exists. We illustrate our method on several examples and provide detailed analysis in supplementary material on github.
♻ ☆ RL in Latent MDPs is Tractable: Online Guarantees via Off-Policy Evaluation
In many real-world decision problems there is partially observed, hidden or latent information that remains fixed throughout an interaction. Such decision problems can be modeled as Latent Markov Decision Processes (LMDPs), where a latent variable is selected at the beginning of an interaction and is not disclosed to the agent. In the last decade, there has been significant progress in solving LMDPs under different structural assumptions. However, for general LMDPs, there is no known learning algorithm that provably matches the existing lower bound (Kwon et al., 2021). We introduce the first sample-efficient algorithm for LMDPs without any additional structural assumptions. Our result builds off a new perspective on the role of off-policy evaluation guarantees and coverage coefficients in LMDPs, a perspective, that has been overlooked in the context of exploration in partially observed environments. Specifically, we establish a novel off-policy evaluation lemma and introduce a new coverage coefficient for LMDPs. Then, we show how these can be used to derive near-optimal guarantees of an optimistic exploration algorithm. These results, we believe, can be valuable for a wide range of interactive learning problems beyond LMDPs, and especially, for partially observed environments.
comment: Fixed typos + alpha
♻ ☆ Enabling Large Language Models to Perform Power System Simulations with Previously Unseen Tools: A Case of Daline
The integration of experiment technologies with large language models (LLMs) is transforming scientific research, offering AI capabilities beyond specialized problem-solving to becoming research assistants for human scientists. In power systems, simulations are essential for research. However, LLMs face significant challenges in power system simulations due to limited pre-existing knowledge and the complexity of power grids. To address this issue, this work proposes a modular framework that integrates expertise from both the power system and LLM domains. This framework enhances LLMs' ability to perform power system simulations on previously unseen tools. Validated using 34 simulation tasks in Daline, a (optimal) power flow simulation and linearization toolbox not yet exposed to LLMs, the proposed framework improved GPT-4o's simulation coding accuracy from 0% to 96.07%, also outperforming the ChatGPT-4o web interface's 33.8% accuracy (with the entire knowledge base uploaded). These results highlight the potential of LLMs as research assistants in power systems.
comment: All the supplementary files mentioned in the manuscript will be open-source upon acceptance
♻ ☆ Estimating True Beliefs in Opinion Dynamics with Social Pressure
Social networks often exert social pressure, causing individuals to adapt their expressed opinions to conform to their peers. An agent in such systems can be modeled as having a (true and unchanging) inherent belief while broadcasting a declared opinion at each time step based on her inherent belief and the past declared opinions of her neighbors. An important question in this setting is parameter estimation: how to disentangle the effects of social pressure to estimate inherent beliefs from declared opinions. This is useful for forecasting when agents' declared opinions are influenced by social pressure while real-world behavior only depends on their inherent beliefs. To address this, Jadbabaie et al. formulated the Interacting P\'olya Urn model of opinion dynamics under social pressure and studied it on complete-graph social networks using an aggregate estimator, and found that their estimator converges to the inherent beliefs unless majority pressure pushes the network to consensus. In this work, we studythis model on arbitrary networks, providing an estimator which converges to the inherent beliefs even in consensus situations. Finally, we bound the convergence rate of our estimator in both consensus and non-consensus scenarios; to get the bound for consensus scenarios (which converge slower than non-consensus) we additionally found how quickly the system converges to consensus.
♻ ☆ AeroTraj: Trajectory Planning for Fast, and Accurate 3D Reconstruction Using a Drone-based LiDAR
This paper presents AeroTraj, a system that enables fast, accurate, and automated reconstruction of 3D models of large buildings using a drone-mounted LiDAR. LiDAR point clouds can be used directly to assemble 3D models if their positions are accurately determined. AeroTraj uses SLAM for this, but must ensure complete and accurate reconstruction while minimizing drone battery usage. Doing this requires balancing competing constraints: drone speed, height, and orientation. AeroTraj exploits building geometry in designing an optimal trajectory that incorporates these constraints. Even with an optimal trajectory, SLAM's position error can drift over time, so AeroTraj tracks drift in-flight by offloading computations to the cloud and invokes a re-calibration procedure to minimize error. AeroTraj can reconstruct large structures with centimeter-level accuracy and with an average end-to-end latency below 250 ms, significantly outperforming the state of the art.
Multiagent Systems 8
☆ Differential error feedback for communication-efficient decentralized learning
Communication-constrained algorithms for decentralized learning and optimization rely on local updates coupled with the exchange of compressed signals. In this context, differential quantization is an effective technique to mitigate the negative impact of compression by leveraging correlations between successive iterates. In addition, the use of error feedback, which consists of incorporating the compression error into subsequent steps, is a powerful mechanism to compensate for the bias caused by the compression. Under error feedback, performance guarantees in the literature have so far focused on algorithms employing a fusion center or a special class of contractive compressors that cannot be implemented with a finite number of bits. In this work, we propose a new decentralized communication-efficient learning approach that blends differential quantization with error feedback. The approach is specifically tailored for decentralized learning problems where agents have individual risk functions to minimize subject to subspace constraints that require the minimizers across the network to lie in low-dimensional subspaces. This constrained formulation includes consensus or single-task optimization as special cases, and allows for more general task relatedness models such as multitask smoothness and coupled optimization. We show that, under some general conditions on the compression noise, and for sufficiently small step-sizes $\mu$, the resulting communication-efficient strategy is stable both in terms of mean-square error and average bit rate: by reducing $\mu$, it is possible to keep the estimation errors small (on the order of $\mu$) without increasing indefinitely the bit rate as $\mu\rightarrow 0$. The results establish that, in the small step-size regime and with a finite number of bits, it is possible to attain the performance achievable in the absence of compression.
comment: arXiv admin note: text overlap with arXiv:2209.07821
☆ Building multiscale models with PhysiBoSS, an agent-based modeling tool
Multiscale models provide a unique tool for studying complex processes that study events occurring at different scales across space and time. In the context of biological systems, such models can simulate mechanisms happening at the intracellular level such as signaling, and at the extracellular level where cells communicate and coordinate with other cells. They aim to understand the impact of genetic or environmental deregulation observed in complex diseases, describe the interplay between a pathological tissue and the immune system, and suggest strategies to revert the diseased phenotypes. The construction of these multiscale models remains a very complex task, including the choice of the components to consider, the level of details of the processes to simulate, or the fitting of the parameters to the data. One additional difficulty is the expert knowledge needed to program these models in languages such as C++ or Python, which may discourage the participation of non-experts. Simplifying this process through structured description formalisms -- coupled with a graphical interface -- is crucial in making modeling more accessible to the broader scientific community, as well as streamlining the process for advanced users. This article introduces three examples of multiscale models which rely on the framework PhysiBoSS, an add-on of PhysiCell that includes intracellular descriptions as continuous time Boolean models to the agent-based approach. The article demonstrates how to easily construct such models, relying on PhysiCell Studio, the PhysiCell Graphical User Interface. A step-by-step tutorial is provided as a Supplementary Material and all models are provided at: https://physiboss.github.io/tutorial/.
☆ Emergence of social hierarchies in a society with two competitive classes
Agent-based models describing social interactions among individuals can help to better understand emerging macroscopic patterns in societies. One of the topics which is worth tackling is the formation of different kinds of hierarchies that emerge in social spaces such as cities. Here we propose a Bonabeau-like model by adding a second class of agents. The fundamental particularity of our model is that only a pairwise interaction between agents of the opposite class is allowed. Agent fitness can thus only change by competition among the two classes, while the total fitness in the society remains constant. The main result is that for a broad range of values of the model parameters, the fitness of the agents of each class show a decay in time except for one or very few agents which capture almost all the fitness in the society. Numerical simulations also reveal a singular shift from egalitarian to hierarchical society for each class. This behaviour depends on the control parameter $\eta$, playing the role of the inverse of the temperature of the system. Results are invariant with regard to the system size, contingent solely on the quantity of agents within each class. Finally, a couple of scaling laws are provided thus showing a data collapse from different model parameters and they follow a shape which can be related to the presence of a phase transition in the model.
comment: 19 pages with 15 figures
☆ Intrinsic Action Tendency Consistency for Cooperative Multi-Agent Reinforcement Learning AAAI-2024
Efficient collaboration in the centralized training with decentralized execution (CTDE) paradigm remains a challenge in cooperative multi-agent systems. We identify divergent action tendencies among agents as a significant obstacle to CTDE's training efficiency, requiring a large number of training samples to achieve a unified consensus on agents' policies. This divergence stems from the lack of adequate team consensus-related guidance signals during credit assignments in CTDE. To address this, we propose Intrinsic Action Tendency Consistency, a novel approach for cooperative multi-agent reinforcement learning. It integrates intrinsic rewards, obtained through an action model, into a reward-additive CTDE (RA-CTDE) framework. We formulate an action model that enables surrounding agents to predict the central agent's action tendency. Leveraging these predictions, we compute a cooperative intrinsic reward that encourages agents to match their actions with their neighbors' predictions. We establish the equivalence between RA-CTDE and CTDE through theoretical analyses, demonstrating that CTDE's training process can be achieved using agents' individual targets. Building on this insight, we introduce a novel method to combine intrinsic rewards and CTDE. Extensive experiments on challenging tasks in SMAC and GRF benchmarks showcase the improved performance of our method.
comment: The AAAI-2024 paper with the appendix
♻ ☆ MultiAgent Collaboration Attack: Investigating Adversarial Attacks in Large Language Model Collaborations via Debate
Large Language Models (LLMs) have shown exceptional results on current benchmarks when working individually. The advancement in their capabilities, along with a reduction in parameter size and inference times, has facilitated the use of these models as agents, enabling interactions among multiple models to execute complex tasks. Such collaborations offer several advantages, including the use of specialized models (e.g. coding), improved confidence through multiple computations, and enhanced divergent thinking, leading to more diverse outputs. Thus, the collaborative use of language models is expected to grow significantly in the coming years. In this work, we evaluate the behavior of a network of models collaborating through debate under the influence of an adversary. We introduce pertinent metrics to assess the adversary's effectiveness, focusing on system accuracy and model agreement. Our findings highlight the importance of a model's persuasive ability in influencing others. Additionally, we explore inference-time methods to generate more compelling arguments and evaluate the potential of prompt-based mitigation as a defensive strategy.
♻ ☆ 360$^\circ$REA: Towards A Reusable Experience Accumulation with 360° Assessment for Multi-Agent System
Large language model agents have demonstrated remarkable advancements across various complex tasks. Recent works focus on optimizing the agent team or employing self-reflection to iteratively solve complex tasks. Since these agents are all based on the same LLM, only conducting self-evaluation or removing underperforming agents does not substantively enhance the capability of the agents. We argue that a comprehensive evaluation and accumulating experience from evaluation feedback is an effective approach to improving system performance. In this paper, we propose Reusable Experience Accumulation with 360$^\circ$ Assessment (360$^\circ$REA), a hierarchical multi-agent framework inspired by corporate organizational practices. The framework employs a novel 360$^\circ$ performance assessment method for multi-perspective performance evaluation with fine-grained assessment. To enhance the capability of agents in addressing complex tasks, we introduce dual-level experience pool for agents to accumulate experience through fine-grained assessment. Extensive experiments on complex task datasets demonstrate the effectiveness of 360$^\circ$REA.
♻ ☆ Nonparametric Strategy Test
We present a nonparametric statistical test for determining whether an agent is following a given mixed strategy in a repeated strategic-form game given samples of the agent's play. This involves two components: determining whether the agent's frequencies of pure strategies are sufficiently close to the target frequencies, and determining whether the pure strategies selected are independent between different game iterations. Our integrated test involves applying a chi-squared goodness of fit test for the first component and a generalized Wald-Wolfowitz runs test for the second component. The results from both tests are combined using Bonferroni correction to produce a complete test for a given significance level $\alpha.$ We applied the test to publicly available data of human rock-paper-scissors play. The data consists of 50 iterations of play for 500 human players. We test with a null hypothesis that the players are following a uniform random strategy independently at each game iteration. Using a significance level of $\alpha = 0.05$, we conclude that 305 (61%) of the subjects are following the target strategy.
♻ ☆ Empirical Analysis of Fictitious Play for Nash Equilibrium Computation in Multiplayer Games
While fictitious play is guaranteed to converge to Nash equilibrium in certain game classes, such as two-player zero-sum games, it is not guaranteed to converge in non-zero-sum and multiplayer games. We show that fictitious play in fact leads to improved Nash equilibrium approximation over a variety of game classes and sizes than (counterfactual) regret minimization, which has recently produced superhuman play for multiplayer poker. We also show that when fictitious play is run several times using random initializations it is able to solve several known challenge problems in which the standard version is known to not converge, including Shapley's classic counterexample. These provide some of the first positive results for fictitious play in these settings, despite the fact that worst-case theoretical results are negative.
comment: Subsumes 2020 article "Fictitious Play Outperforms Counterfactual Regret Minimization" and 2022 article "Random Initialization Solves Shapley's Fictitious Play Counterexample."
Artificial Intelligence 158
☆ Symbolic Learning Enables Self-Evolving Agents
The AI community has been exploring a pathway to artificial general intelligence (AGI) by developing "language agents", which are complex large language models (LLMs) pipelines involving both prompting techniques and tool usage methods. While language agents have demonstrated impressive capabilities for many real-world tasks, a fundamental limitation of current language agents research is that they are model-centric, or engineering-centric. That's to say, the progress on prompts, tools, and pipelines of language agents requires substantial manual engineering efforts from human experts rather than automatically learning from data. We believe the transition from model-centric, or engineering-centric, to data-centric, i.e., the ability of language agents to autonomously learn and evolve in environments, is the key for them to possibly achieve AGI. In this work, we introduce agent symbolic learning, a systematic framework that enables language agents to optimize themselves on their own in a data-centric way using symbolic optimizers. Specifically, we consider agents as symbolic networks where learnable weights are defined by prompts, tools, and the way they are stacked together. Agent symbolic learning is designed to optimize the symbolic network within language agents by mimicking two fundamental algorithms in connectionist learning: back-propagation and gradient descent. Instead of dealing with numeric weights, agent symbolic learning works with natural language simulacrums of weights, loss, and gradients. We conduct proof-of-concept experiments on both standard benchmarks and complex real-world tasks and show that agent symbolic learning enables language agents to update themselves after being created and deployed in the wild, resulting in "self-evolving agents".
comment: Code available at https://github.com/aiwaves-cn/agents
☆ APIGen: Automated Pipeline for Generating Verifiable and Diverse Function-Calling Datasets
The advancement of function-calling agent models requires diverse, reliable, and high-quality datasets. This paper presents APIGen, an automated data generation pipeline designed to synthesize verifiable high-quality datasets for function-calling applications. We leverage APIGen and collect 3,673 executable APIs across 21 different categories to generate diverse function-calling datasets in a scalable and structured manner. Each data in our dataset is verified through three hierarchical stages: format checking, actual function executions, and semantic verification, ensuring its reliability and correctness. We demonstrate that models trained with our curated datasets, even with only 7B parameters, can achieve state-of-the-art performance on the Berkeley Function-Calling Benchmark, outperforming multiple GPT-4 models. Moreover, our 1B model achieves exceptional performance, surpassing GPT-3.5-Turbo and Claude-3 Haiku. We release a dataset containing 60,000 high-quality entries, aiming to advance the field of function-calling agent domains. The dataset is available on Huggingface: https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k and the project homepage: https://apigen-pipeline.github.io/
☆ Mental Modeling of Reinforcement Learning Agents by Language Models
Can emergent language models faithfully model the intelligence of decision-making agents? Though modern language models exhibit already some reasoning ability, and theoretically can potentially express any probable distribution over tokens, it remains underexplored how the world knowledge these pretrained models have memorized can be utilized to comprehend an agent's behaviour in the physical world. This study empirically examines, for the first time, how well large language models (LLMs) can build a mental model of agents, termed agent mental modelling, by reasoning about an agent's behaviour and its effect on states from agent interaction history. This research may unveil the potential of leveraging LLMs for elucidating RL agent behaviour, addressing a key challenge in eXplainable reinforcement learning (XRL). To this end, we propose specific evaluation metrics and test them on selected RL task datasets of varying complexity, reporting findings on agent mental model establishment. Our results disclose that LLMs are not yet capable of fully mental modelling agents through inference alone without further innovations. This work thus provides new insights into the capabilities and limitations of modern LLMs.
comment: https://lukaswill.github.io/
☆ Role-Play Zero-Shot Prompting with Large Language Models for Open-Domain Human-Machine Conversation SIGDIAL 2023
Recently, various methods have been proposed to create open-domain conversational agents with Large Language Models (LLMs). These models are able to answer user queries, but in a one-way Q&A format rather than a true conversation. Fine-tuning on particular datasets is the usual way to modify their style to increase conversational ability, but this is expensive and usually only available in a few languages. In this study, we explore role-play zero-shot prompting as an efficient and cost-effective solution for open-domain conversation, using capable multilingual LLMs (Beeching et al., 2023) trained to obey instructions. We design a prompting system that, when combined with an instruction-following model - here Vicuna (Chiang et al., 2023) - produces conversational agents that match and even surpass fine-tuned models in human evaluation in French in two different tasks.
comment: Updated version of a paper originally submitted at SIGDIAL 2023
☆ Detecting Brittle Decisions for Free: Leveraging Margin Consistency in Deep Robust Classifiers
Despite extensive research on adversarial training strategies to improve robustness, the decisions of even the most robust deep learning models can still be quite sensitive to imperceptible perturbations, creating serious risks when deploying them for high-stakes real-world applications. While detecting such cases may be critical, evaluating a model's vulnerability at a per-instance level using adversarial attacks is computationally too intensive and unsuitable for real-time deployment scenarios. The input space margin is the exact score to detect non-robust samples and is intractable for deep neural networks. This paper introduces the concept of margin consistency -- a property that links the input space margins and the logit margins in robust models -- for efficient detection of vulnerable samples. First, we establish that margin consistency is a necessary and sufficient condition to use a model's logit margin as a score for identifying non-robust samples. Next, through comprehensive empirical analysis of various robustly trained models on CIFAR10 and CIFAR100 datasets, we show that they indicate strong margin consistency with a strong correlation between their input space margins and the logit margins. Then, we show that we can effectively use the logit margin to confidently detect brittle decisions with such models and accurately estimate robust accuracy on an arbitrarily large test set by estimating the input margins only on a small subset. Finally, we address cases where the model is not sufficiently margin-consistent by learning a pseudo-margin from the feature representation. Our findings highlight the potential of leveraging deep representations to efficiently assess adversarial vulnerability in deployment scenarios.
comment: 11 pages, 7 figures, 2 tables, 1 algorithm
☆ Preference Elicitation for Offline Reinforcement Learning
Applying reinforcement learning (RL) to real-world problems is often made challenging by the inability to interact with the environment and the difficulty of designing reward functions. Offline RL addresses the first challenge by considering access to an offline dataset of environment interactions labeled by the reward function. In contrast, Preference-based RL does not assume access to the reward function and learns it from preferences, but typically requires an online interaction with the environment. We bridge the gap between these frameworks by exploring efficient methods for acquiring preference feedback in a fully offline setup. We propose Sim-OPRL, an offline preference-based reinforcement learning algorithm, which leverages a learned environment model to elicit preference feedback on simulated rollouts. Drawing on insights from both the offline RL and the preference-based RL literature, our algorithm employs a pessimistic approach for out-of-distribution data, and an optimistic approach for acquiring informative preferences about the optimal policy. We provide theoretical guarantees regarding the sample complexity of our approach, dependent on how well the offline data covers the optimal policy. Finally, we demonstrate the empirical performance of Sim-OPRL in different environments.
☆ Cascading Large Language Models for Salient Event Graph Generation
Generating event graphs from long documents is challenging due to the inherent complexity of multiple tasks involved such as detecting events, identifying their relationships, and reconciling unstructured input with structured graphs. Recent studies typically consider all events with equal importance, failing to distinguish salient events crucial for understanding narratives. This paper presents CALLMSAE, a CAscading Large Language Model framework for SAlient Event graph generation, which leverages the capabilities of LLMs and eliminates the need for costly human annotations. We first identify salient events by prompting LLMs to generate summaries, from which salient events are identified. Next, we develop an iterative code refinement prompting strategy to generate event relation graphs, removing hallucinated relations and recovering missing edges. Fine-tuning contextualised graph generation models on the LLM-generated graphs outperforms the models trained on CAEVO-generated data. Experimental results on a human-annotated test set show that the proposed method generates salient and more accurate graphs, outperforming competitive baselines.
comment: 9 + 12 pages
☆ Graph Neural Networks for Emulation of Finite-Element Ice Dynamics in Greenland and Antarctic Ice Sheets ICML 2024
Although numerical models provide accurate solutions for ice sheet dynamics based on physics laws, they accompany intensified computational demands to solve partial differential equations. In recent years, convolutional neural networks (CNNs) have been widely used as statistical emulators for those numerical models. However, since CNNs operate on regular grids, they cannot represent the refined meshes and computational efficiency of finite-element numerical models. Therefore, instead of CNNs, this study adopts an equivariant graph convolutional network (EGCN) as an emulator for the ice sheet dynamics modeling. EGCN reproduces ice thickness and velocity changes in the Helheim Glacier, Greenland, and Pine Island Glacier, Antarctica, with 260 times and 44 times faster computation time, respectively. Compared to the traditional CNN and graph convolutional network, EGCN shows outstanding accuracy in thickness prediction near fast ice streams by preserving the equivariance to the translation and rotation of graphs.
comment: 6 pages, 2 figures, submitted to the ICML 2024 Workshop on Machine Learning for Earth System Modeling
☆ Mixture of Experts in a Mixture of RL settings
Mixtures of Experts (MoEs) have gained prominence in (self-)supervised learning due to their enhanced inference efficiency, adaptability to distributed training, and modularity. Previous research has illustrated that MoEs can significantly boost Deep Reinforcement Learning (DRL) performance by expanding the network's parameter count while reducing dormant neurons, thereby enhancing the model's learning capacity and ability to deal with non-stationarity. In this work, we shed more light on MoEs' ability to deal with non-stationarity and investigate MoEs in DRL settings with "amplified" non-stationarity via multi-task training, providing further evidence that MoEs improve learning capacity. In contrast to previous work, our multi-task results allow us to better understand the underlying causes for the beneficial effect of MoE in DRL training, the impact of the various MoE components, and insights into how best to incorporate them in actor-critic-based DRL networks. Finally, we also confirm results from previous work.
☆ BiTrack: Bidirectional Offline 3D Multi-Object Tracking Using Camera-LiDAR Data
Compared with real-time multi-object tracking (MOT), offline multi-object tracking (OMOT) has the advantages to perform 2D-3D detection fusion, erroneous link correction, and full track optimization but has to deal with the challenges from bounding box misalignment and track evaluation, editing, and refinement. This paper proposes "BiTrack", a 3D OMOT framework that includes modules of 2D-3D detection fusion, initial trajectory generation, and bidirectional trajectory re-optimization to achieve optimal tracking results from camera-LiDAR data. The novelty of this paper includes threefold: (1) development of a point-level object registration technique that employs a density-based similarity metric to achieve accurate fusion of 2D-3D detection results; (2) development of a set of data association and track management skills that utilizes a vertex-based similarity metric as well as false alarm rejection and track recovery mechanisms to generate reliable bidirectional object trajectories; (3) development of a trajectory re-optimization scheme that re-organizes track fragments of different fidelities in a greedy fashion, as well as refines each trajectory with completion and smoothing techniques. The experiment results on the KITTI dataset demonstrate that BiTrack achieves the state-of-the-art performance for 3D OMOT tasks in terms of accuracy and efficiency.
☆ IRCAN: Mitigating Knowledge Conflicts in LLM Generation via Identifying and Reweighting Context-Aware Neurons
It is widely acknowledged that large language models (LLMs) encode a vast reservoir of knowledge after being trained on mass data. Recent studies disclose knowledge conflicts in LLM generation, wherein outdated or incorrect parametric knowledge (i.e., encoded knowledge) contradicts new knowledge provided in the context. To mitigate such knowledge conflicts, we propose a novel framework, IRCAN (Identifying and Reweighting Context-Aware Neurons) to capitalize on neurons that are crucial in processing contextual cues. Specifically, IRCAN first identifies neurons that significantly contribute to context processing, utilizing a context-aware attribution score derived from integrated gradients. Subsequently, the identified context-aware neurons are strengthened via reweighting. In doing so, we steer LLMs to generate context-sensitive outputs with respect to the new knowledge provided in the context. Extensive experiments conducted across a variety of models and tasks demonstrate that IRCAN not only achieves remarkable improvements in handling knowledge conflicts but also offers a scalable, plug-andplay solution that can be integrated seamlessly with existing models.
comment: 19 pages, 13 figures, 5 tables
☆ AlphaForge: A Framework to Mine and Dynamically Combine Formulaic Alpha Factors
The variability and low signal-to-noise ratio in financial data, combined with the necessity for interpretability, make the alpha factor mining workflow a crucial component of quantitative investment. Transitioning from early manual extraction to genetic programming, the most advanced approach in this domain currently employs reinforcement learning to mine a set of combination factors with fixed weights. However, the performance of resultant alpha factors exhibits inconsistency, and the inflexibility of fixed factor weights proves insufficient in adapting to the dynamic nature of financial markets. To address this issue, this paper proposes a two-stage formulaic alpha generating framework AlphaForge, for alpha factor mining and factor combination. This framework employs a generative-predictive neural network to generate factors, leveraging the robust spatial exploration capabilities inherent in deep learning while concurrently preserving diversity. The combination model within the framework incorporates the temporal performance of factors for selection and dynamically adjusts the weights assigned to each component alpha factor. Experiments conducted on real-world datasets demonstrate that our proposed model outperforms contemporary benchmarks in formulaic alpha factor mining. Furthermore, our model exhibits a notable enhancement in portfolio returns within the realm of quantitative investment.
☆ SAM: Semi-Active Mechanism for Extensible Continuum Manipulator and Real-time Hysteresis Compensation Control Algorithm
Cable-Driven Continuum Manipulators (CDCMs) enable scar-free procedures via natural orifices and improve target lesion accessibility through curved paths. However, CDCMs face limitations in workspace and control accuracy due to non-linear cable effects causing hysteresis. This paper introduces an extensible CDCM with a Semi-active Mechanism (SAM) to expand the workspace via translational motion without additional mechanical elements or actuation. We collect a hysteresis dataset using 8 fiducial markers and RGBD sensing. Based on this dataset, we develop a real-time hysteresis compensation control algorithm using the trained Temporal Convolutional Network (TCN) with a 1ms time latency, effectively estimating the manipulator's hysteresis behavior. Performance validation through random trajectory tracking tests and box pointing tasks shows the proposed controller significantly reduces hysteresis by up to 69.5% in joint space and approximately 26% in the box pointing task.
comment: 12 pages, 14 figures, 6 tables
☆ MALSIGHT: Exploring Malicious Source Code and Benign Pseudocode for Iterative Binary Malware Summarization
Binary malware summarization aims to automatically generate human-readable descriptions of malware behaviors from executable files, facilitating tasks like malware cracking and detection. Previous methods based on Large Language Models (LLMs) have shown great promise. However, they still face significant issues, including poor usability, inaccurate explanations, and incomplete summaries, primarily due to the obscure pseudocode structure and the lack of malware training summaries. Further, calling relationships between functions, which involve the rich interactions within a binary malware, remain largely underexplored. To this end, we propose MALSIGHT, a novel code summarization framework that can iteratively generate descriptions of binary malware by exploring malicious source code and benign pseudocode. Specifically, we construct the first malware summaries, MalS and MalP, using an LLM and manually refine this dataset with human effort. At the training stage, we tune our proposed MalT5, a novel LLM-based code model, on the MalS dataset and a benign pseudocode dataset. Then, at the test stage, we iteratively feed the pseudocode functions into MalT5 to obtain the summary. Such a procedure facilitates the understanding of pseudocode structure and captures the intricate interactions between functions, thereby benefiting the usability, accuracy, and completeness of summaries. Additionally, we propose a novel evaluation benchmark, BLEURT-sum, to measure the quality of summaries. Experiments on three datasets show the effectiveness of the proposed MALSIGHT. Notably, our proposed MalT5, with only 0.77B parameters, delivers comparable performance to much larger ChatGPT3.5.
comment: 17 pages, 14 figures
☆ Learning pure quantum states (almost) without regret
We initiate the study of quantum state tomography with minimal regret. A learner has sequential oracle access to an unknown pure quantum state, and in each round selects a pure probe state. Regret is incurred if the unknown state is measured orthogonal to this probe, and the learner's goal is to minimise the expected cumulative regret over $T$ rounds. The challenge is to find a balance between the most informative measurements and measurements incurring minimal regret. We show that the cumulative regret scales as $\Theta(\operatorname{polylog} T)$ using a new tomography algorithm based on a median of means least squares estimator. This algorithm employs measurements biased towards the unknown state and produces online estimates that are optimal (up to logarithmic terms) in the number of observed samples.
comment: 24 pages, 2 figures
☆ Research on Information Extraction of LCSTS Dataset Based on an Improved BERTSum-LSTM Model
With the continuous advancement of artificial intelligence, natural language processing technology has become widely utilized in various fields. At the same time, there are many challenges in creating Chinese news summaries. First of all, the semantics of Chinese news is complex, and the amount of information is enormous. Extracting critical information from Chinese news presents a significant challenge. Second, the news summary should be concise and clear, focusing on the main content and avoiding redundancy. In addition, the particularity of the Chinese language, such as polysemy, word segmentation, etc., makes it challenging to generate Chinese news summaries. Based on the above, this paper studies the information extraction method of the LCSTS dataset based on an improved BERTSum-LSTM model. We improve the BERTSum-LSTM model to make it perform better in generating Chinese news summaries. The experimental results show that the proposed method has a good effect on creating news summaries, which is of great importance to the construction of news summaries.
comment: submitted to ICMIII 2024
☆ Stable Diffusion Segmentation for Biomedical Images with Single-step Reverse Process MICCAI 2024
Diffusion models have demonstrated their effectiveness across various generative tasks. However, when applied to medical image segmentation, these models encounter several challenges, including significant resource and time requirements. They also necessitate a multi-step reverse process and multiple samples to produce reliable predictions. To address these challenges, we introduce the first latent diffusion segmentation model, named SDSeg, built upon stable diffusion (SD). SDSeg incorporates a straightforward latent estimation strategy to facilitate a single-step reverse process and utilizes latent fusion concatenation to remove the necessity for multiple samples. Extensive experiments indicate that SDSeg surpasses existing state-of-the-art methods on five benchmark datasets featuring diverse imaging modalities. Remarkably, SDSeg is capable of generating stable predictions with a solitary reverse step and sample, epitomizing the model's stability as implied by its name. The code is available at https://github.com/lin-tianyu/Stable-Diffusion-Seg
comment: Accepted at MICCAI 2024. Code and citation info see https://github.com/lin-tianyu/Stable-Diffusion-Seg
☆ Kolmogorov-Arnold Graph Neural Networks
Graph neural networks (GNNs) excel in learning from network-like data but often lack interpretability, making their application challenging in domains requiring transparent decision-making. We propose the Graph Kolmogorov-Arnold Network (GKAN), a novel GNN model leveraging spline-based activation functions on edges to enhance both accuracy and interpretability. Our experiments on five benchmark datasets demonstrate that GKAN outperforms state-of-the-art GNN models in node classification, link prediction, and graph classification tasks. In addition to the improved accuracy, GKAN's design inherently provides clear insights into the model's decision-making process, eliminating the need for post-hoc explainability techniques. This paper discusses the methodology, performance, and interpretability of GKAN, highlighting its potential for applications in domains where interpretability is crucial.
comment: 7 pages, 4 figures, under review
☆ Reinforcement Learning with Intrinsically Motivated Feedback Graph for Lost-sales Inventory Control
Reinforcement learning (RL) has proven to be well-performed and general-purpose in the inventory control (IC). However, further improvement of RL algorithms in the IC domain is impeded due to two limitations of online experience. First, online experience is expensive to acquire in real-world applications. With the low sample efficiency nature of RL algorithms, it would take extensive time to train the RL policy to convergence. Second, online experience may not reflect the true demand due to the lost sales phenomenon typical in IC, which makes the learning process more challenging. To address the above challenges, we propose a decision framework that combines reinforcement learning with feedback graph (RLFG) and intrinsically motivated exploration (IME) to boost sample efficiency. In particular, we first take advantage of the inherent properties of lost-sales IC problems and design the feedback graph (FG) specially for lost-sales IC problems to generate abundant side experiences aid RL updates. Then we conduct a rigorous theoretical analysis of how the designed FG reduces the sample complexity of RL methods. Based on the theoretical insights, we design an intrinsic reward to direct the RL agent to explore to the state-action space with more side experiences, further exploiting FG's power. Experimental results demonstrate that our method greatly improves the sample efficiency of applying RL in IC. Our code is available at https://anonymous.4open.science/r/RLIMFG4IC-811D/
☆ AI Alignment through Reinforcement Learning from Human Feedback? Contradictions and Limitations
This paper critically evaluates the attempts to align Artificial Intelligence (AI) systems, especially Large Language Models (LLMs), with human values and intentions through Reinforcement Learning from Feedback (RLxF) methods, involving either human feedback (RLHF) or AI feedback (RLAIF). Specifically, we show the shortcomings of the broadly pursued alignment goals of honesty, harmlessness, and helpfulness. Through a multidisciplinary sociotechnical critique, we examine both the theoretical underpinnings and practical implementations of RLxF techniques, revealing significant limitations in their approach to capturing the complexities of human ethics and contributing to AI safety. We highlight tensions and contradictions inherent in the goals of RLxF. In addition, we discuss ethically-relevant issues that tend to be neglected in discussions about alignment and RLxF, among which the trade-offs between user-friendliness and deception, flexibility and interpretability, and system safety. We conclude by urging researchers and practitioners alike to critically assess the sociotechnical ramifications of RLxF, advocating for a more nuanced and reflective approach to its application in AI development.
comment: 12 pages, 1 table, to be submitted
☆ Continuous Sign Language Recognition Using Intra-inter Gloss Attention
Many continuous sign language recognition (CSLR) studies adopt transformer-based architectures for sequence modeling due to their powerful capacity for capturing global contexts. Nevertheless, vanilla self-attention, which serves as the core module of the transformer, calculates a weighted average over all time steps; therefore, the local temporal semantics of sign videos may not be fully exploited. In this study, we introduce a novel module in sign language recognition studies, called intra-inter gloss attention module, to leverage the relationships among frames within glosses and the semantic and grammatical dependencies between glosses in the video. In the intra-gloss attention module, the video is divided into equally sized chunks and a self-attention mechanism is applied within each chunk. This localized self-attention significantly reduces complexity and eliminates noise introduced by considering non-relative frames. In the inter-gloss attention module, we first aggregate the chunk-level features within each gloss chunk by average pooling along the temporal dimension. Subsequently, multi-head self-attention is applied to all chunk-level features. Given the non-significance of the signer-environment interaction, we utilize segmentation to remove the background of the videos. This enables the proposed model to direct its focus toward the signer. Experimental results on the PHOENIX-2014 benchmark dataset demonstrate that our method can effectively extract sign language features in an end-to-end manner without any prior knowledge, improve the accuracy of CSLR, and achieve the word error rate (WER) of 20.4 on the test set which is a competitive result compare to the state-of-the-art which uses additional supervisions.
☆ PaCoST: Paired Confidence Significance Testing for Benchmark Contamination Detection in Large Language Models
Large language models (LLMs) are known to be trained on vast amounts of data, which may unintentionally or intentionally include data from commonly used benchmarks. This inclusion can lead to cheatingly high scores on model leaderboards, yet result in disappointing performance in real-world applications. To address this benchmark contamination problem, we first propose a set of requirements that practical contamination detection methods should follow. Following these proposed requirements, we introduce PaCoST, a Paired Confidence Significance Testing to effectively detect benchmark contamination in LLMs. Our method constructs a counterpart for each piece of data with the same distribution, and performs statistical analysis of the corresponding confidence to test whether the model is significantly more confident under the original benchmark. We validate the effectiveness of PaCoST and apply it on popular open-source models and benchmarks. We find that almost all models and benchmarks we tested are suspected contaminated more or less. We finally call for new LLM evaluation methods.
☆ MathOdyssey: Benchmarking Mathematical Problem-Solving Skills in Large Language Models Using Odyssey Math Data
Large language models (LLMs) have significantly advanced natural language understanding and demonstrated strong problem-solving abilities. Despite these successes, most LLMs still struggle with solving mathematical problems due to the intricate reasoning required. This paper investigates the mathematical problem-solving capabilities of LLMs using the newly developed "MathOdyssey" dataset. The dataset includes diverse mathematical problems at high school and university levels, created by experts from notable institutions to rigorously test LLMs in advanced problem-solving scenarios and cover a wider range of subject areas. By providing the MathOdyssey dataset as a resource to the AI community, we aim to contribute to the understanding and improvement of AI capabilities in complex mathematical problem-solving. We conduct benchmarking on open-source models, such as Llama-3 and DBRX-Instruct, and closed-source models from the GPT series and Gemini models. Our results indicate that while LLMs perform well on routine and moderately difficult tasks, they face significant challenges with Olympiad-level problems and complex university-level questions. Our analysis shows a narrowing performance gap between open-source and closed-source models, yet substantial challenges remain, particularly with the most demanding problems. This study highlights the ongoing need for research to enhance the mathematical reasoning of LLMs. The dataset, results, and code are publicly available.
☆ AI-native Memory: A Pathway from LLMs Towards AGI
Large language models (LLMs) have demonstrated the world with the sparks of artificial general intelligence (AGI). One opinion, especially from some startups working on LLMs, argues that an LLM with nearly unlimited context length can realize AGI. However, they might be too optimistic about the long-context capability of (existing) LLMs -- (1) Recent literature has shown that their effective context length is significantly smaller than their claimed context length; and (2) Our reasoning-in-a-haystack experiments further demonstrate that simultaneously finding the relevant information from a long context and conducting (simple) reasoning is nearly impossible. In this paper, we envision a pathway from LLMs to AGI through the integration of \emph{memory}. We believe that AGI should be a system where LLMs serve as core processors. In addition to raw data, the memory in this system would store a large number of important conclusions derived from reasoning processes. Compared with retrieval-augmented generation (RAG) that merely processing raw data, this approach not only connects semantically related information closer, but also simplifies complex inferences at the time of querying. As an intermediate stage, the memory will likely be in the form of natural language descriptions, which can be directly consumed by users too. Ultimately, every agent/person should have its own large personal model, a deep neural network model (thus \emph{AI-native}) that parameterizes and compresses all types of memory, even the ones cannot be described by natural languages. Finally, we discuss the significant potential of AI-native memory as the transformative infrastructure for (proactive) engagement, personalization, distribution, and social in the AGI era, as well as the incurred privacy and security challenges with preliminary solutions.
☆ S3: A Simple Strong Sample-effective Multimodal Dialog System
In this work, we present a conceptually simple yet powerful baseline for the multimodal dialog task, an S3 model, that achieves near state-of-the-art results on two compelling leaderboards: MMMU and AI Journey Contest 2023. The system is based on a pre-trained large language model, pre-trained modality encoders for image and audio, and a trainable modality projector. The proposed effective data mixture for training such an architecture demonstrates that a multimodal model based on a strong language model and trained on a small amount of multimodal data can perform efficiently in the task of multimodal dialog.
☆ Combining Automated Optimisation of Hyperparameters and Reward Shape
There has been significant progress in deep reinforcement learning (RL) in recent years. Nevertheless, finding suitable hyperparameter configurations and reward functions remains challenging even for experts, and performance heavily relies on these design choices. Also, most RL research is conducted on known benchmarks where knowledge about these choices already exists. However, novel practical applications often pose complex tasks for which no prior knowledge about good hyperparameters and reward functions is available, thus necessitating their derivation from scratch. Prior work has examined automatically tuning either hyperparameters or reward functions individually. We demonstrate empirically that an RL algorithm's hyperparameter configurations and reward function are often mutually dependent, meaning neither can be fully optimised without appropriate values for the other. We then propose a methodology for the combined optimisation of hyperparameters and the reward function. Furthermore, we include a variance penalty as an optimisation objective to improve the stability of learned policies. We conducted extensive experiments using Proximal Policy Optimisation and Soft Actor-Critic on four environments. Our results show that combined optimisation significantly improves over baseline performance in half of the environments and achieves competitive performance in the others, with only a minor increase in computational costs. This suggests that combined optimisation should be best practice.
comment: Published in the Reinforcement Learning Journal 2024
☆ Detecting Machine-Generated Texts: Not Just "AI vs Humans" and Explainability is Complicated
As LLMs rapidly advance, increasing concerns arise regarding risks about actual authorship of texts we see online and in real world. The task of distinguishing LLM-authored texts is complicated by the nuanced and overlapping behaviors of both machines and humans. In this paper, we challenge the current practice of considering LLM-generated text detection a binary classification task of differentiating human from AI. Instead, we introduce a novel ternary text classification scheme, adding an "undecided" category for texts that could be attributed to either source, and we show that this new category is crucial to understand how to make the detection result more explainable to lay users. This research shifts the paradigm from merely classifying to explaining machine-generated texts, emphasizing need for detectors to provide clear and understandable explanations to users. Our study involves creating four new datasets comprised of texts from various LLMs and human authors. Based on new datasets, we performed binary classification tests to ascertain the most effective SOTA detection methods and identified SOTA LLMs capable of producing harder-to-detect texts. We constructed a new dataset of texts generated by two top-performing LLMs and human authors, and asked three human annotators to produce ternary labels with explanation notes. This dataset was used to investigate how three top-performing SOTA detectors behave in new ternary classification context. Our results highlight why "undecided" category is much needed from the viewpoint of explainability. Additionally, we conducted an analysis of explainability of the three best-performing detectors and the explanation notes of the human annotators, revealing insights about the complexity of explainable detection of machine-generated texts. Finally, we propose guidelines for developing future detection systems with improved explanatory power.
comment: 19 pages, 2 figures
☆ Improving the Consistency in Cross-Lingual Cross-Modal Retrieval with 1-to-K Contrastive Learning KDD 2024
Cross-lingual Cross-modal Retrieval (CCR) is an essential task in web search, which aims to break the barriers between modality and language simultaneously and achieves image-text retrieval in the multi-lingual scenario with a single model. In recent years, excellent progress has been made based on cross-lingual cross-modal pre-training; particularly, the methods based on contrastive learning on large-scale data have significantly improved retrieval tasks. However, these methods directly follow the existing pre-training methods in the cross-lingual or cross-modal domain, leading to two problems of inconsistency in CCR: The methods with cross-lingual style suffer from the intra-modal error propagation, resulting in inconsistent recall performance across languages in the whole dataset. The methods with cross-modal style suffer from the inter-modal optimization direction bias, resulting in inconsistent rank across languages within each instance, which cannot be reflected by Recall@K. To solve these problems, we propose a simple but effective 1-to-K contrastive learning method, which treats each language equally and eliminates error propagation and optimization bias. In addition, we propose a new evaluation metric, Mean Rank Variance (MRV), to reflect the rank inconsistency across languages within each instance. Extensive experiments on four CCR datasets show that our method improves both recall rates and MRV with smaller-scale pre-trained data, achieving the new state-of-art.
comment: Accepted by KDD 2024 Research Track
☆ Zero-shot prompt-based classification: topic labeling in times of foundation models in German Tweets
Filtering and annotating textual data are routine tasks in many areas, like social media or news analytics. Automating these tasks allows to scale the analyses wrt. speed and breadth of content covered and decreases the manual effort required. Due to technical advancements in Natural Language Processing, specifically the success of large foundation models, a new tool for automating such annotation processes by using a text-to-text interface given written guidelines without providing training samples has become available. In this work, we assess these advancements in-the-wild by empirically testing them in an annotation task on German Twitter data about social and political European crises. We compare the prompt-based results with our human annotation and preceding classification approaches, including Naive Bayes and a BERT-based fine-tuning/domain adaptation pipeline. Our results show that the prompt-based approach - despite being limited by local computation resources during the model selection - is comparable with the fine-tuned BERT but without any annotated training data. Our findings emphasize the ongoing paradigm shift in the NLP landscape, i.e., the unification of downstream tasks and elimination of the need for pre-labeled training data.
comment: 10 pages, 2 tables, 1 figure
☆ PlaMo: Plan and Move in Rich 3D Physical Environments
Controlling humanoids in complex physically simulated worlds is a long-standing challenge with numerous applications in gaming, simulation, and visual content creation. In our setup, given a rich and complex 3D scene, the user provides a list of instructions composed of target locations and locomotion types. To solve this task we present PlaMo, a scene-aware path planner and a robust physics-based controller. The path planner produces a sequence of motion paths, considering the various limitations the scene imposes on the motion, such as location, height, and speed. Complementing the planner, our control policy generates rich and realistic physical motion adhering to the plan. We demonstrate how the combination of both modules enables traversing complex landscapes in diverse forms while responding to real-time changes in the environment. Video: https://youtu.be/wWlqSQlRZ9M .
☆ Enhancing Data Privacy in Large Language Models through Private Association Editing
Large Language Models (LLMs) are powerful tools with extensive applications, but their tendency to memorize private information raises significant concerns as private data leakage can easily happen. In this paper, we introduce Private Association Editing (PAE), a novel defense approach for private data leakage. PAE is designed to effectively remove Personally Identifiable Information (PII) without retraining the model. Our approach consists of a four-step procedure: detecting memorized PII, applying PAE cards to mitigate memorization of private data, verifying resilience to targeted data extraction (TDE) attacks, and ensuring consistency in the post-edit LLMs. The versatility and efficiency of PAE, which allows for batch modifications, significantly enhance data privacy in LLMs. Experimental results demonstrate the effectiveness of PAE in mitigating private data leakage. We believe PAE will serve as a critical tool in the ongoing effort to protect data privacy in LLMs, encouraging the development of safer models for real-world applications.
☆ Guiding Video Prediction with Explicit Procedural Knowledge ICCV
We propose a general way to integrate procedural knowledge of a domain into deep learning models. We apply it to the case of video prediction, building on top of object-centric deep models and show that this leads to a better performance than using data-driven models alone. We develop an architecture that facilitates latent space disentanglement in order to use the integrated procedural knowledge, and establish a setup that allows the model to learn the procedural interface in the latent space using the downstream task of video prediction. We contrast the performance to a state-of-the-art data-driven approach and show that problems where purely data-driven approaches struggle can be handled by using knowledge about the domain, providing an alternative to simply collecting more data.
comment: Published in 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW)
☆ AI Cards: Towards an Applied Framework for Machine-Readable AI and Risk Documentation Inspired by the EU AI Act
With the upcoming enforcement of the EU AI Act, documentation of high-risk AI systems and their risk management information will become a legal requirement playing a pivotal role in demonstration of compliance. Despite its importance, there is a lack of standards and guidelines to assist with drawing up AI and risk documentation aligned with the AI Act. This paper aims to address this gap by providing an in-depth analysis of the AI Act's provisions regarding technical documentation, wherein we particularly focus on AI risk management. On the basis of this analysis, we propose AI Cards as a novel holistic framework for representing a given intended use of an AI system by encompassing information regarding technical specifications, context of use, and risk management, both in human- and machine-readable formats. While the human-readable representation of AI Cards provides AI stakeholders with a transparent and comprehensible overview of the AI use case, its machine-readable specification leverages on state of the art Semantic Web technologies to embody the interoperability needed for exchanging documentation within the AI value chain. This brings the flexibility required for reflecting changes applied to the AI system and its context, provides the scalability needed to accommodate potential amendments to legal requirements, and enables development of automated tools to assist with legal compliance and conformity assessment tasks. To solidify the benefits, we provide an exemplar AI Card for an AI-based student proctoring system and further discuss its potential applications within and beyond the context of the AI Act.
☆ MammothModa: Multi-Modal Large Language Model
In this report, we introduce MammothModa, yet another multi-modal large language model (MLLM) designed to achieve state-of-the-art performance starting from an elementary baseline. We focus on three key design insights: (i) Integrating Visual Capabilities while Maintaining Complex Language Understanding: In addition to the vision encoder, we incorporated the Visual Attention Experts into the LLM to enhance its visual capabilities. (ii) Extending Context Window for High-Resolution and Long-Duration Visual Feature: We explore the Visual Merger Module to effectively reduce the token number of high-resolution images and incorporated frame position ids to avoid position interpolation. (iii) High-Quality Bilingual Datasets: We meticulously curated and filtered a high-quality bilingual multimodal dataset to reduce visual hallucinations. With above recipe we build MammothModa that consistently outperforms the state-of-the-art models, e.g., LLaVA-series, across main real-world visual language benchmarks without bells and whistles.
comment: Technical report
☆ Methodology of Adapting Large English Language Models for Specific Cultural Contexts
The rapid growth of large language models(LLMs) has emerged as a prominent trend in the field of artificial intelligence. However, current state-of-the-art LLMs are predominantly based on English. They encounter limitations when directly applied to tasks in specific cultural domains, due to deficiencies in domain-specific knowledge and misunderstandings caused by differences in cultural values. To address this challenge, our paper proposes a rapid adaptation method for large models in specific cultural contexts, which leverages instruction-tuning based on specific cultural knowledge and safety values data. Taking Chinese as the specific cultural context and utilizing the LLaMA3-8B as the experimental English LLM, the evaluation results demonstrate that the adapted LLM significantly enhances its capabilities in domain-specific knowledge and adaptability to safety values, while maintaining its original expertise advantages.
comment: 11 pages, 2 figures
☆ Selective Prompting Tuning for Personalized Conversations with LLMs ACL 2024
In conversational AI, personalizing dialogues with persona profiles and contextual understanding is essential. Despite large language models' (LLMs) improved response coherence, effective persona integration remains a challenge. In this work, we first study two common approaches for personalizing LLMs: textual prompting and direct fine-tuning. We observed that textual prompting often struggles to yield responses that are similar to the ground truths in datasets, while direct fine-tuning tends to produce repetitive or overly generic replies. To alleviate those issues, we propose \textbf{S}elective \textbf{P}rompt \textbf{T}uning (SPT), which softly prompts LLMs for personalized conversations in a selective way. Concretely, SPT initializes a set of soft prompts and uses a trainable dense retriever to adaptively select suitable soft prompts for LLMs according to different input contexts, where the prompt retriever is dynamically updated through feedback from the LLMs. Additionally, we propose context-prompt contrastive learning and prompt fusion learning to encourage the SPT to enhance the diversity of personalized conversations. Experiments on the CONVAI2 dataset demonstrate that SPT significantly enhances response diversity by up to 90\%, along with improvements in other critical performance indicators. Those results highlight the efficacy of SPT in fostering engaging and personalized dialogue generation. The SPT model code (https://github.com/hqsiswiliam/SPT) is publicly available for further exploration.
comment: Accepted to ACL 2024 findings
☆ Games of Knightian Uncertainty
Arguably, for the latter part of the late 20th and early 21st centuries, games have been seen as the drosophila of AI. Games are a set of exciting testbeds, whose solutions (in terms of identifying optimal players) would lead to machines that would possess some form of general intelligence, or at the very least help us gain insights toward building intelligent machines. Following impressive successes in traditional board games like Go, Chess, and Poker, but also video games like the Atari 2600 collection, it is clear that this is not the case. Games have been attacked successfully, but we are nowhere near AGI developments (or, as harsher critics might say, useful AI developments!). In this short vision paper, we argue that for game research to become again relevant to the AGI pathway, we need to be able to address \textit{Knightian uncertainty} in the context of games, i.e. agents need to be able to adapt to rapid changes in game rules on the fly with no warning, no previous data, and no model access.
☆ Galaxy spectroscopy without spectra: Galaxy properties from photometric images with conditional diffusion models
Modern spectroscopic surveys can only target a small fraction of the vast amount of photometrically cataloged sources in wide-field surveys. Here, we report the development of a generative AI method capable of predicting optical galaxy spectra from photometric broad-band images alone. This method draws from the latest advances in diffusion models in combination with contrastive networks. We pass multi-band galaxy images into the architecture to obtain optical spectra. From these, robust values for galaxy properties can be derived with any methods in the spectroscopic toolbox, such as standard population synthesis techniques and Lick indices. When trained and tested on 64x64-pixel images from the Sloan Digital Sky Survey, the global bimodality of star-forming and quiescent galaxies in photometric space is recovered, as well as a mass-metallicity relation of star-forming galaxies. The comparison between the observed and the artificially created spectra shows good agreement in overall metallicity, age, Dn4000, stellar velocity dispersion, and E(B-V) values. Photometric redshift estimates of our generative algorithm can compete with other current, specialized deep-learning techniques. Moreover, this work is the first attempt in the literature to infer velocity dispersion from photometric images. Additionally, we can predict the presence of an active galactic nucleus up to an accuracy of 82%. With our method, scientifically interesting galaxy properties, normally requiring spectroscopic inputs, can be obtained in future data sets from large-scale photometric surveys alone. The spectra prediction via AI can further assist in creating realistic mock catalogs.
comment: Code is available here: https://github.com/LarsDoorenbos/generate-spectra
☆ Start from Zero: Triple Set Prediction for Automatic Knowledge Graph Completion
Knowledge graph (KG) completion aims to find out missing triples in a KG. Some tasks, such as link prediction and instance completion, have been proposed for KG completion. They are triple-level tasks with some elements in a missing triple given to predict the missing element of the triple. However, knowing some elements of the missing triple in advance is not always a realistic setting. In this paper, we propose a novel graph-level automatic KG completion task called Triple Set Prediction (TSP) which assumes none of the elements in the missing triples is given. TSP is to predict a set of missing triples given a set of known triples. To properly and accurately evaluate this new task, we propose 4 evaluation metrics including 3 classification metrics and 1 ranking metric, considering both the partial-open-world and the closed-world assumptions. Furthermore, to tackle the huge candidate triples for prediction, we propose a novel and efficient subgraph-based method GPHT that can predict the triple set fast. To fairly compare the TSP results, we also propose two types of methods RuleTensor-TSP and KGE-TSP applying the existing rule- and embedding-based methods for TSP as baselines. During experiments, we evaluate the proposed methods on two datasets extracted from Wikidata following the relation-similarity partial-open-world assumption proposed by us, and also create a complete family data set to evaluate TSP results following the closed-world assumption. Results prove that the methods can successfully generate a set of missing triples and achieve reasonable scores on the new task, and GPHT performs better than the baselines with significantly shorter prediction time. The datasets and code for experiments are available at https://github.com/zjukg/GPHT-for-TSP.
comment: Paper accepted by TKDE in 2024
☆ Innovating for Tomorrow: The Convergence of SE and Green AI
The latest advancements in machine learning, specifically in foundation models, are revolutionizing the frontiers of existing software engineering (SE) processes. This is a bi-directional phenomona, where 1) software systems are now challenged to provide AI-enabled features to their users, and 2) AI is used to automate tasks within the software development lifecycle. In an era where sustainability is a pressing societal concern, our community needs to adopt a long-term plan enabling a conscious transformation that aligns with environmental sustainability values. In this paper, we reflect on the impact of adopting environmentally friendly practices to create AI-enabled software systems and make considerations on the environmental impact of using foundation models for software development.
comment: Accepted in SE 2030 - International Workshop on Software Engineering in 2030
☆ Exclusive Style Removal for Cross Domain Novel Class Discovery
As a promising field in open-world learning, \textit{Novel Class Discovery} (NCD) is usually a task to cluster unseen novel classes in an unlabeled set based on the prior knowledge of labeled data within the same domain. However, the performance of existing NCD methods could be severely compromised when novel classes are sampled from a different distribution with the labeled ones. In this paper, we explore and establish the solvability of NCD in cross domain setting with the necessary condition that style information must be removed. Based on the theoretical analysis, we introduce an exclusive style removal module for extracting style information that is distinctive from the baseline features, thereby facilitating inference. Moreover, this module is easy to integrate with other NCD methods, acting as a plug-in to improve performance on novel classes with different distributions compared to the seen labeled set. Additionally, recognizing the non-negligible influence of different backbones and pre-training strategies on the performance of the NCD methods, we build a fair benchmark for future NCD research. Extensive experiments on three common datasets demonstrate the effectiveness of our proposed module.
☆ ResumeAtlas: Revisiting Resume Classification with Large-Scale Datasets and Large Language Models
The increasing reliance on online recruitment platforms coupled with the adoption of AI technologies has highlighted the critical need for efficient resume classification methods. However, challenges such as small datasets, lack of standardized resume templates, and privacy concerns hinder the accuracy and effectiveness of existing classification models. In this work, we address these challenges by presenting a comprehensive approach to resume classification. We curated a large-scale dataset of 13,389 resumes from diverse sources and employed Large Language Models (LLMs) such as BERT and Gemma1.1 2B for classification. Our results demonstrate significant improvements over traditional machine learning approaches, with our best model achieving a top-1 accuracy of 92\% and a top-5 accuracy of 97.5\%. These findings underscore the importance of dataset quality and advanced model architectures in enhancing the accuracy and robustness of resume classification systems, thus advancing the field of online recruitment practices.
comment: 8 pages, 6 figures, 1 table, 6th International Conference on AI in Computational Linguistics
☆ Poisoned LangChain: Jailbreak LLMs by LangChain
With the development of natural language processing (NLP), large language models (LLMs) are becoming increasingly popular. LLMs are integrating more into everyday life, raising public concerns about their security vulnerabilities. Consequently, the security of large language models is becoming critically important. Currently, the techniques for attacking and defending against LLMs are continuously evolving. One significant method type of attack is the jailbreak attack, which designed to evade model safety mechanisms and induce the generation of inappropriate content. Existing jailbreak attacks primarily rely on crafting inducement prompts for direct jailbreaks, which are less effective against large models with robust filtering and high comprehension abilities. Given the increasing demand for real-time capabilities in large language models, real-time updates and iterations of new knowledge have become essential. Retrieval-Augmented Generation (RAG), an advanced technique to compensate for the model's lack of new knowledge, is gradually becoming mainstream. As RAG enables the model to utilize external knowledge bases, it provides a new avenue for jailbreak attacks. In this paper, we conduct the first work to propose the concept of indirect jailbreak and achieve Retrieval-Augmented Generation via LangChain. Building on this, we further design a novel method of indirect jailbreak attack, termed Poisoned-LangChain (PLC), which leverages a poisoned external knowledge base to interact with large language models, thereby causing the large models to generate malicious non-compliant dialogues.We tested this method on six different large language models across three major categories of jailbreak issues. The experiments demonstrate that PLC successfully implemented indirect jailbreak attacks under three different scenarios, achieving success rates of 88.56%, 79.04%, and 82.69% respectively.
comment: 6 pages,2 figures,This paper is a submission to ACM TURC. It has been accepted by the editor of the organizer
☆ ArzEn-LLM: Code-Switched Egyptian Arabic-English Translation and Speech Recognition Using LLMs
Motivated by the widespread increase in the phenomenon of code-switching between Egyptian Arabic and English in recent times, this paper explores the intricacies of machine translation (MT) and automatic speech recognition (ASR) systems, focusing on translating code-switched Egyptian Arabic-English to either English or Egyptian Arabic. Our goal is to present the methodologies employed in developing these systems, utilizing large language models such as LLama and Gemma. In the field of ASR, we explore the utilization of the Whisper model for code-switched Egyptian Arabic recognition, detailing our experimental procedures including data preprocessing and training techniques. Through the implementation of a consecutive speech-to-text translation system that integrates ASR with MT, we aim to overcome challenges posed by limited resources and the unique characteristics of the Egyptian Arabic dialect. Evaluation against established metrics showcases promising results, with our methodologies yielding a significant improvement of $56\%$ in English translation over the state-of-the-art and $9.3\%$ in Arabic translation. Since code-switching is deeply inherent in spoken languages, it is crucial that ASR systems can effectively handle this phenomenon. This capability is crucial for enabling seamless interaction in various domains, including business negotiations, cultural exchanges, and academic discourse. Our models and code are available as open-source resources. Code: \url{http://github.com/ahmedheakl/arazn-llm}}, Models: \url{http://huggingface.co/collections/ahmedheakl/arazn-llm-662ceaf12777656607b9524e}.
comment: 9 pages, 4 figures, 5 tables, 6th International Conference on AI in Computational Linguistics
☆ BADGE: BADminton report Generation and Evaluation with LLM IJCAI 2024
Badminton enjoys widespread popularity, and reports on matches generally include details such as player names, game scores, and ball types, providing audiences with a comprehensive view of the games. However, writing these reports can be a time-consuming task. This challenge led us to explore whether a Large Language Model (LLM) could automate the generation and evaluation of badminton reports. We introduce a novel framework named BADGE, designed for this purpose using LLM. Our method consists of two main phases: Report Generation and Report Evaluation. Initially, badminton-related data is processed by the LLM, which then generates a detailed report of the match. We tested different Input Data Types, In-Context Learning (ICL), and LLM, finding that GPT-4 performs best when using CSV data type and the Chain of Thought prompting. Following report generation, the LLM evaluates and scores the reports to assess their quality. Our comparisons between the scores evaluated by GPT-4 and human judges show a tendency to prefer GPT-4 generated reports. Since the application of LLM in badminton reporting remains largely unexplored, our research serves as a foundational step for future advancements in this area. Moreover, our method can be extended to other sports games, thereby enhancing sports promotion. For more details, please refer to https://github.com/AndyChiangSH/BADGE.
comment: Accepted by IJCAI 2024 Workshop: The 2nd International Workshop on Intelligent Technologies for Precision Sports Science (IT4PSS)
☆ Open-vocabulary Mobile Manipulation in Unseen Dynamic Environments with 3D Semantic Maps
Open-Vocabulary Mobile Manipulation (OVMM) is a crucial capability for autonomous robots, especially when faced with the challenges posed by unknown and dynamic environments. This task requires robots to explore and build a semantic understanding of their surroundings, generate feasible plans to achieve manipulation goals, adapt to environmental changes, and comprehend natural language instructions from humans. To address these challenges, we propose a novel framework that leverages the zero-shot detection and grounded recognition capabilities of pretraining visual-language models (VLMs) combined with dense 3D entity reconstruction to build 3D semantic maps. Additionally, we utilize large language models (LLMs) for spatial region abstraction and online planning, incorporating human instructions and spatial semantic context. We have built a 10-DoF mobile manipulation robotic platform JSR-1 and demonstrated in real-world robot experiments that our proposed framework can effectively capture spatial semantics and process natural language user instructions for zero-shot OVMM tasks under dynamic environment settings, with an overall navigation and task success rate of 80.95% and 73.33% over 105 episodes, and better SFT and SPL by 157.18% and 19.53% respectively compared to the baseline. Furthermore, the framework is capable of replanning towards the next most probable candidate location based on the spatial semantic context derived from the 3D semantic map when initial plans fail, keeping an average success rate of 76.67%.
comment: Open-vocabulary, Mobile Manipulation, Dynamic Environments, 3D Semantic Maps, Zero-shot, LLMs, VLMs, 18 pages, 2 figures
☆ LLM-Driven Multimodal Opinion Expression Identification
Opinion Expression Identification (OEI) is essential in NLP for applications ranging from voice assistants to depression diagnosis. This study extends OEI to encompass multimodal inputs, underlining the significance of auditory cues in delivering emotional subtleties beyond the capabilities of text. We introduce a novel multimodal OEI (MOEI) task, integrating text and speech to mirror real-world scenarios. Utilizing CMU MOSEI and IEMOCAP datasets, we construct the CI-MOEI dataset. Additionally, Text-to-Speech (TTS) technology is applied to the MPQA dataset to obtain the CIM-OEI dataset. We design a template for the OEI task to take full advantage of the generative power of large language models (LLMs). Advancing further, we propose an LLM-driven method STOEI, which combines speech and text modal to identify opinion expressions. Our experiments demonstrate that MOEI significantly improves the performance while our method outperforms existing methods by 9.20\% and obtains SOTA results.
comment: 6 pages, 3 Figures
☆ EHR-Based Mobile and Web Platform for Chronic Disease Risk Prediction Using Large Language Multimodal Models
Traditional diagnosis of chronic diseases involves in-person consultations with physicians to identify the disease. However, there is a lack of research focused on predicting and developing application systems using clinical notes and blood test values. We collected five years of Electronic Health Records (EHRs) from Taiwan's hospital database between 2017 and 2021 as an AI database. Furthermore, we developed an EHR-based chronic disease prediction platform utilizing Large Language Multimodal Models (LLMMs), successfully integrating with frontend web and mobile applications for prediction. This prediction platform can also connect to the hospital's backend database, providing physicians with real-time risk assessment diagnostics. The demonstration link can be found at https://www.youtube.com/watch?v=oqmL9DEDFgA.
☆ Self-Training with Pseudo-Label Scorer for Aspect Sentiment Quad Prediction ACL 2024
Aspect Sentiment Quad Prediction (ASQP) aims to predict all quads (aspect term, aspect category, opinion term, sentiment polarity) for a given review, which is the most representative and challenging task in aspect-based sentiment analysis. A key challenge in the ASQP task is the scarcity of labeled data, which limits the performance of existing methods. To tackle this issue, we propose a self-training framework with a pseudo-label scorer, wherein a scorer assesses the match between reviews and their pseudo-labels, aiming to filter out mismatches and thereby enhance the effectiveness of self-training. We highlight two critical aspects to ensure the scorer's effectiveness and reliability: the quality of the training dataset and its model architecture. To this end, we create a human-annotated comparison dataset and train a generative model on it using ranking-based objectives. Extensive experiments on public ASQP datasets reveal that using our scorer can greatly and consistently improve the effectiveness of self-training. Moreover, we explore the possibility of replacing humans with large language models for comparison dataset annotation, and experiments demonstrate its feasibility. We release our code and data at https://github.com/HITSZ-HLT/ST-w-Scorer-ABSA .
comment: Accepted to ACL 2024 Main Conference
☆ Few-Shot Medical Image Segmentation with High-Fidelity Prototypes
Few-shot Semantic Segmentation (FSS) aims to adapt a pretrained model to new classes with as few as a single labelled training sample per class. Despite the prototype based approaches have achieved substantial success, existing models are limited to the imaging scenarios with considerably distinct objects and not highly complex background, e.g., natural images. This makes such models suboptimal for medical imaging with both conditions invalid. To address this problem, we propose a novel Detail Self-refined Prototype Network (DSPNet) to constructing high-fidelity prototypes representing the object foreground and the background more comprehensively. Specifically, to construct global semantics while maintaining the captured detail semantics, we learn the foreground prototypes by modelling the multi-modal structures with clustering and then fusing each in a channel-wise manner. Considering that the background often has no apparent semantic relation in the spatial dimensions, we integrate channel-specific structural information under sparse channel-aware regulation. Extensive experiments on three challenging medical image benchmarks show the superiority of DSPNet over previous state-of-the-art methods.
☆ Large Language Models for Cuffless Blood Pressure Measurement From Wearable Biosignals
Large language models (LLMs) have captured significant interest from both academia and industry due to their impressive performance across various textual tasks. However, the potential of LLMs to analyze physiological time-series data remains an emerging research field. Particularly, there is a notable gap in the utilization of LLMs for analyzing wearable biosignals to achieve cuffless blood pressure (BP) measurement, which is critical for the management of cardiovascular diseases. This paper presents the first work to explore the capacity of LLMs to perform cuffless BP estimation based on wearable biosignals. We extracted physiological features from electrocardiogram (ECG) and photoplethysmogram (PPG) signals and designed context-enhanced prompts by combining these features with BP domain knowledge and user information. Subsequently, we adapted LLMs to BP estimation tasks through instruction tuning. To evaluate the proposed approach, we conducted assessments of ten advanced LLMs using a comprehensive public dataset of wearable biosignals from 1,272 participants. The experimental results demonstrate that the optimally fine-tuned LLM significantly surpasses conventional task-specific baselines, achieving an estimation error of 0.00 $\pm$ 9.25 mmHg for systolic BP and 1.29 $\pm$ 6.37 mmHg for diastolic BP. Notably, the ablation studies highlight the benefits of our context enhancement strategy, leading to an 8.9% reduction in mean absolute error for systolic BP estimation. This paper pioneers the exploration of LLMs for cuffless BP measurement, providing a potential solution to enhance the accuracy of cuffless BP measurement.
☆ Breaking the Barrier: Enhanced Utility and Robustness in Smoothed DRL Agents ICML 2024
Robustness remains a paramount concern in deep reinforcement learning (DRL), with randomized smoothing emerging as a key technique for enhancing this attribute. However, a notable gap exists in the performance of current smoothed DRL agents, often characterized by significantly low clean rewards and weak robustness. In response to this challenge, our study introduces innovative algorithms aimed at training effective smoothed robust DRL agents. We propose S-DQN and S-PPO, novel approaches that demonstrate remarkable improvements in clean rewards, empirical robustness, and robustness guarantee across standard RL benchmarks. Notably, our S-DQN and S-PPO agents not only significantly outperform existing smoothed agents by an average factor of $2.16\times$ under the strongest attack, but also surpass previous robustly-trained agents by an average factor of $2.13\times$. This represents a significant leap forward in the field. Furthermore, we introduce Smoothed Attack, which is $1.89\times$ more effective in decreasing the rewards of smoothed agents than existing adversarial attacks.
comment: Published in ICML 2024
☆ AdaZeta: Adaptive Zeroth-Order Tensor-Train Adaption for Memory-Efficient Large Language Models Fine-Tuning
Fine-tuning large language models (LLMs) has achieved remarkable performance across various natural language processing tasks, yet it demands more and more memory as model sizes keep growing. To address this issue, the recently proposed Memory-efficient Zeroth-order (MeZO) methods attempt to fine-tune LLMs using only forward passes, thereby avoiding the need for a backpropagation graph. However, significant performance drops and a high risk of divergence have limited their widespread adoption. In this paper, we propose the Adaptive Zeroth-order Tensor-Train Adaption (AdaZeta) framework, specifically designed to improve the performance and convergence of the ZO methods. To enhance dimension-dependent ZO estimation accuracy, we introduce a fast-forward, low-parameter tensorized adapter. To tackle the frequently observed divergence issue in large-scale ZO fine-tuning tasks, we propose an adaptive query number schedule that guarantees convergence. Detailed theoretical analysis and extensive experimental results on Roberta-Large and Llama-2-7B models substantiate the efficacy of our AdaZeta framework in terms of accuracy, memory efficiency, and convergence speed.
☆ Bidirectional-Reachable Hierarchical Reinforcement Learning with Mutually Responsive Policies
Hierarchical reinforcement learning (HRL) addresses complex long-horizon tasks by skillfully decomposing them into subgoals. Therefore, the effectiveness of HRL is greatly influenced by subgoal reachability. Typical HRL methods only consider subgoal reachability from the unilateral level, where a dominant level enforces compliance to the subordinate level. However, we observe that when the dominant level becomes trapped in local exploration or generates unattainable subgoals, the subordinate level is negatively affected and cannot follow the dominant level's actions. This can potentially make both levels stuck in local optima, ultimately hindering subsequent subgoal reachability. Allowing real-time bilateral information sharing and error correction would be a natural cure for this issue, which motivates us to propose a mutual response mechanism. Based on this, we propose the Bidirectional-reachable Hierarchical Policy Optimization (BrHPO)--a simple yet effective algorithm that also enjoys computation efficiency. Experiment results on a variety of long-horizon tasks showcase that BrHPO outperforms other state-of-the-art HRL baselines, coupled with a significantly higher exploration efficiency and robustness.
☆ Improving Entity Recognition Using Ensembles of Deep Learning and Fine-tuned Large Language Models: A Case Study on Adverse Event Extraction from Multiple Sources
Adverse event (AE) extraction following COVID-19 vaccines from text data is crucial for monitoring and analyzing the safety profiles of immunizations. Traditional deep learning models are adept at learning intricate feature representations and dependencies in sequential data, but often require extensive labeled data. In contrast, large language models (LLMs) excel in understanding contextual information, but exhibit unstable performance on named entity recognition tasks, possibly due to their broad but unspecific training. This study aims to evaluate the effectiveness of LLMs and traditional deep learning models in AE extraction, and to assess the impact of ensembling these models on performance. In this study, we utilized reports and posts from the VAERS (n=621), Twitter (n=9,133), and Reddit (n=131) as our corpora. Our goal was to extract three types of entities: "vaccine", "shot", and "ae". We explored and fine-tuned (except GPT-4) multiple LLMs, including GPT-2, GPT-3.5, GPT-4, and Llama-2, as well as traditional deep learning models like RNN and BioBERT. To enhance performance, we created ensembles of the three models with the best performance. For evaluation, we used strict and relaxed F1 scores to evaluate the performance for each entity type, and micro-average F1 was used to assess the overall performance. The ensemble model achieved the highest performance in "vaccine", "shot", and "ae" with strict F1-scores of 0.878, 0.930, and 0.925, respectively, along with a micro-average score of 0.903. In conclusion, this study demonstrates the effectiveness and robustness of ensembling fine-tuned traditional deep learning models and LLMs, for extracting AE-related information. This study contributes to the advancement of biomedical natural language processing, providing valuable insights into improving AE extraction from text data for pharmacovigilance and public health surveillance.
☆ PharmGPT: Domain-Specific Large Language Models for Bio-Pharmaceutical and Chemistry
Large language models (LLMs) have revolutionized Natural Language Processing (NLP) by by minimizing the need for complex feature engineering. However, the application of LLMs in specialized domains like biopharmaceuticals and chemistry remains largely unexplored. These fields are characterized by intricate terminologies, specialized knowledge, and a high demand for precision areas where general purpose LLMs often fall short. In this study, we introduce PharmGPT, a suite of multilingual LLMs with 13 billion and 70 billion parameters, specifically trained on a comprehensive corpus of hundreds of billions of tokens tailored to the Bio-Pharmaceutical and Chemical sectors. Our evaluation shows that PharmGPT matches or surpasses existing general models on key benchmarks, such as NAPLEX, demonstrating its exceptional capability in domain-specific tasks. This advancement establishes a new benchmark for LLMs in the Bio-Pharmaceutical and Chemical fields, addressing the existing gap in specialized language modeling. Furthermore, this suggests a promising path for enhanced research and development in these specialized areas, paving the way for more precise and effective applications of NLP in specialized domains.
☆ Multimodal foundation world models for generalist embodied agents
Learning generalist embodied agents, able to solve multitudes of tasks in different domains is a long-standing problem. Reinforcement learning (RL) is hard to scale up as it requires a complex reward design for each task. In contrast, language can specify tasks in a more natural way. Current foundation vision-language models (VLMs) generally require fine-tuning or other adaptations to be functional, due to the significant domain gap. However, the lack of multimodal data in such domains represents an obstacle toward developing foundation models for embodied applications. In this work, we overcome these problems by presenting multimodal foundation world models, able to connect and align the representation of foundation VLMs with the latent space of generative world models for RL, without any language annotations. The resulting agent learning framework, GenRL, allows one to specify tasks through vision and/or language prompts, ground them in the embodied domain's dynamics, and learns the corresponding behaviors in imagination. As assessed through large-scale multi-task benchmarking, GenRL exhibits strong multi-task generalization performance in several locomotion and manipulation domains. Furthermore, by introducing a data-free RL strategy, it lays the groundwork for foundation model-based RL for generalist embodied agents.
☆ Boosting Soft Q-Learning by Bounding
An agent's ability to leverage past experience is critical for efficiently solving new tasks. Prior work has focused on using value function estimates to obtain zero-shot approximations for solutions to a new task. In soft Q-learning, we show how any value function estimate can also be used to derive double-sided bounds on the optimal value function. The derived bounds lead to new approaches for boosting training performance which we validate experimentally. Notably, we find that the proposed framework suggests an alternative method for updating the Q-function, leading to boosted performance.
comment: To appear in the 1st Reinforcement Learning Conference
☆ Automated Clinical Data Extraction with Knowledge Conditioned LLMs
The extraction of lung lesion information from clinical and medical imaging reports is crucial for research on and clinical care of lung-related diseases. Large language models (LLMs) can be effective at interpreting unstructured text in reports, but they often hallucinate due to a lack of domain-specific knowledge, leading to reduced accuracy and posing challenges for use in clinical settings. To address this, we propose a novel framework that aligns generated internal knowledge with external knowledge through in-context learning (ICL). Our framework employs a retriever to identify relevant units of internal or external knowledge and a grader to evaluate the truthfulness and helpfulness of the retrieved internal-knowledge rules, to align and update the knowledge bases. Our knowledge-conditioned approach also improves the accuracy and reliability of LLM outputs by addressing the extraction task in two stages: (i) lung lesion finding detection and primary structured field parsing, followed by (ii) further parsing of lesion description text into additional structured fields. Experiments with expert-curated test datasets demonstrate that this ICL approach can increase the F1 score for key fields (lesion size, margin and solidity) by an average of 12.9% over existing ICL methods.
☆ AutoOPE: Automated Off-Policy Estimator Selection
The Off-Policy Evaluation (OPE) problem consists of evaluating the performance of counterfactual policies with data collected by another one. This problem is of utmost importance for various application domains, e.g., recommendation systems, medical treatments, and many others. To solve the OPE problem, we resort to estimators, which aim to estimate in the most accurate way possible the performance that the counterfactual policies would have had if they were deployed in place of the logging policy. In the literature, several estimators have been developed, all with different characteristics and theoretical guarantees. Therefore, there is no dominant estimator, and each estimator may be the best one for different OPE problems, depending on the characteristics of the dataset at hand. While the selection of the estimator is a crucial choice for an accurate OPE, this problem has been widely overlooked in the literature. We propose an automated data-driven OPE estimator selection method based on machine learning. In particular, the core idea we propose in this paper is to create several synthetic OPE tasks and use a machine learning model trained to predict the best estimator for those synthetic tasks. We empirically show how our method is able to generalize to unseen tasks and make a better estimator selection compared to a baseline method on several real-world datasets, with a computational cost significantly lower than the one of the baseline.
☆ MolFusion: Multimodal Fusion Learning for Molecular Representations via Multi-granularity Views
Artificial Intelligence predicts drug properties by encoding drug molecules, aiding in the rapid screening of candidates. Different molecular representations, such as SMILES and molecule graphs, contain complementary information for molecular encoding. Thus exploiting complementary information from different molecular representations is one of the research priorities in molecular encoding. Most existing methods for combining molecular multi-modalities only use molecular-level information, making it hard to encode intra-molecular alignment information between different modalities. To address this issue, we propose a multi-granularity fusion method that is MolFusion. The proposed MolFusion consists of two key components: (1) MolSim, a molecular-level encoding component that achieves molecular-level alignment between different molecular representations. and (2) AtomAlign, an atomic-level encoding component that achieves atomic-level alignment between different molecular representations. Experimental results show that MolFusion effectively utilizes complementary multimodal information, leading to significant improvements in performance across various classification and regression tasks.
comment: 8 pages, 5 figures
☆ View-Invariant Pixelwise Anomaly Detection in Multi-object Scenes with Adaptive View Synthesis
The inspection and monitoring of infrastructure assets typically requires identifying visual anomalies in scenes periodically photographed over time. Images collected manually or with robots such as unmanned aerial vehicles from the same scene at different instances in time are typically not perfectly aligned. Supervised segmentation methods can be applied to identify known problems, but unsupervised anomaly detection approaches are required when unknown anomalies occur. Current unsupervised pixel-level anomaly detection methods have mainly been developed for industrial settings where the camera position is known and constant. However, we find that these methods fail to generalize to the case when images are not perfectly aligned. We term the problem of unsupervised anomaly detection between two such imperfectly aligned sets of images as Scene Anomaly Detection (Scene AD). We present a novel network termed OmniAD to address the Scene AD problem posed. Specifically, we refine the anomaly detection method reverse distillation to achieve a 40% increase in pixel-level anomaly detection performance. The network's performance is further demonstrated to improve with two new data augmentation strategies proposed that leverage novel view synthesis and camera localization to improve generalization. We validate our approach with qualitative and quantitative results on a new dataset, ToyCity, the first Scene AD dataset with multiple objects, as well as on the established single object-centric dataset, MAD. https://drags99.github.io/OmniAD/
☆ Decoding with Limited Teacher Supervision Requires Understanding When to Trust the Teacher
How can sLLMs efficiently utilize the supervision of LLMs to improve their generative quality? This question has been well studied in scenarios where there is no restriction on the number of LLM supervisions one can use, giving birth to many decoding algorithms that utilize supervision without further training. However, it is still unclear what is an effective strategy under the limited supervision scenario, where we assume that no more than a few tokens can be generated by LLMs. To this end, we develop an algorithm to effectively aggregate the sLLM and LLM predictions on initial tokens so that the generated tokens can more accurately condition the subsequent token generation by sLLM only. Critically, we find that it is essential to adaptively overtrust or disregard the LLM prediction based on the confidence of the sLLM. Through our experiments on a wide range of models and datasets, we demonstrate that our method provides a consistent improvement over conventional decoding strategies.
comment: preprint
☆ Catching Chameleons: Detecting Evolving Disinformation Generated using Large Language Models
Despite recent advancements in detecting disinformation generated by large language models (LLMs), current efforts overlook the ever-evolving nature of this disinformation. In this work, we investigate a challenging yet practical research problem of detecting evolving LLM-generated disinformation. Disinformation evolves constantly through the rapid development of LLMs and their variants. As a consequence, the detection model faces significant challenges. First, it is inefficient to train separate models for each disinformation generator. Second, the performance decreases in scenarios when evolving LLM-generated disinformation is encountered in sequential order. To address this problem, we propose DELD (Detecting Evolving LLM-generated Disinformation), a parameter-efficient approach that jointly leverages the general fact-checking capabilities of pre-trained language models (PLM) and the independent disinformation generation characteristics of various LLMs. In particular, the learned characteristics are concatenated sequentially to facilitate knowledge accumulation and transformation. DELD addresses the issue of label scarcity by integrating the semantic embeddings of disinformation with trainable soft prompts to elicit model-specific knowledge. Our experiments show that \textit{DELD} significantly outperforms state-of-the-art methods. Moreover, our method provides critical insights into the unique patterns of disinformation generation across different LLMs, offering valuable perspectives in this line of research.
comment: 10 pages, 5 figures
☆ Explicit Diversity Conditions for Effective Question Answer Generation with Large Language Models COLING 2024
Question Answer Generation (QAG) is an effective data augmentation technique to improve the accuracy of question answering systems, especially in low-resource domains. While recent pretrained and large language model-based QAG methods have made substantial progress, they face the critical issue of redundant QA pair generation, affecting downstream QA systems. Implicit diversity techniques such as sampling and diverse beam search are proven effective solutions but often yield smaller diversity. We present explicit diversity conditions for QAG, focusing on spatial aspects, question types, and entities, substantially increasing diversity in QA generation. Our work emphasizes the need of explicit diversity conditions for generating diverse question-answer synthetic data by showing significant improvements in downstream QA task over existing widely adopted implicit diversity techniques. In particular, generated QA pairs from explicit diversity conditions when used to train the downstream QA model results in an average 4.1% exact match and 4.5% F1 improvement over QAG from implicit sampling techniques on SQuADDU. Our work emphasizes the need for explicit diversity conditions even more in low-resource datasets (SubjQA), where average downstream QA performance improvements are around 12% EM.
comment: Published at COLING 2024
☆ Multi-step Knowledge Retrieval and Inference over Unstructured Data
The advent of Large Language Models (LLMs) and Generative AI has revolutionized natural language applications across various domains. However, high-stakes decision-making tasks in fields such as medical, legal and finance require a level of precision, comprehensiveness, and logical consistency that pure LLM or Retrieval-Augmented-Generation (RAG) approaches often fail to deliver. At Elemental Cognition (EC), we have developed a neuro-symbolic AI platform to tackle these problems. The platform integrates fine-tuned LLMs for knowledge extraction and alignment with a robust symbolic reasoning engine for logical inference, planning and interactive constraint solving. We describe Cora, a Collaborative Research Assistant built on this platform, that is designed to perform complex research and discovery tasks in high-stakes domains. This paper discusses the multi-step inference challenges inherent in such domains, critiques the limitations of existing LLM-based methods, and demonstrates how Cora's neuro-symbolic approach effectively addresses these issues. We provide an overview of the system architecture, key algorithms for knowledge extraction and formal reasoning, and present preliminary evaluation results that highlight Cora's superior performance compared to well-known LLM and RAG baselines.
☆ MUMU: Bootstrapping Multimodal Image Generation from Text-to-Image Data
We train a model to generate images from multimodal prompts of interleaved text and images such as "a man and his dog in an animated style." We bootstrap a multimodal dataset by extracting semantically meaningful image crops corresponding to words in the image captions of synthetically generated and publicly available text-image data. Our model, MUMU, is composed of a vision-language model encoder with a diffusion decoder and is trained on a single 8xH100 GPU node. Despite being only trained on crops from the same image, MUMU learns to compose inputs from different images into a coherent output. For example, an input of a realistic person and a cartoon will output the same person in the cartoon style, and an input of a standing subject and a scooter will output the subject riding the scooter. As a result, our model generalizes to tasks such as style transfer and character consistency. Our results show the promise of using multimodal models as general purpose controllers for image generation.
☆ WV-Net: A foundation model for SAR WV-mode satellite imagery trained using contrastive self-supervised learning on 10 million images NeurIPS 2024
The European Space Agency's Copernicus Sentinel-1 (S-1) mission is a constellation of C-band synthetic aperture radar (SAR) satellites that provide unprecedented monitoring of the world's oceans. S-1's wave mode (WV) captures 20x20 km image patches at 5 m pixel resolution and is unaffected by cloud cover or time-of-day. The mission's open data policy has made SAR data easily accessible for a range of applications, but the need for manual image annotations is a bottleneck that hinders the use of machine learning methods. This study uses nearly 10 million WV-mode images and contrastive self-supervised learning to train a semantic embedding model called WV-Net. In multiple downstream tasks, WV-Net outperforms a comparable model that was pre-trained on natural images (ImageNet) with supervised learning. Experiments show improvements for estimating wave height (0.50 vs 0.60 RMSE using linear probing), estimating near-surface air temperature (0.90 vs 0.97 RMSE), and performing multilabel-classification of geophysical and atmospheric phenomena (0.96 vs 0.95 micro-averaged AUROC). WV-Net embeddings are also superior in an unsupervised image-retrieval task and scale better in data-sparse settings. Together, these results demonstrate that WV-Net embeddings can support geophysical research by providing a convenient foundation model for a variety of data analysis and exploration tasks.
comment: 20 pages, 9 figures, submitted to NeurIPS 2024
☆ Conformalized Link Prediction on Graph Neural Networks
Graph Neural Networks (GNNs) excel in diverse tasks, yet their applications in high-stakes domains are often hampered by unreliable predictions. Although numerous uncertainty quantification methods have been proposed to address this limitation, they often lack \textit{rigorous} uncertainty estimates. This work makes the first attempt to introduce a distribution-free and model-agnostic uncertainty quantification approach to construct a predictive interval with a statistical guarantee for GNN-based link prediction. We term it as \textit{conformalized link prediction.} Our approach builds upon conformal prediction (CP), a framework that promises to construct statistically robust prediction sets or intervals. We first theoretically and empirically establish a permutation invariance condition for the application of CP in link prediction tasks, along with an exact test-time coverage. Leveraging the important structural information in graphs, we then identify a novel and crucial connection between a graph's adherence to the power law distribution and the efficiency of CP. This insight leads to the development of a simple yet effective sampling-based method to align the graph structure with a power law distribution prior to the standard CP procedure. Extensive experiments demonstrate that for conformalized link prediction, our approach achieves the desired marginal coverage while significantly improving the efficiency of CP compared to baseline methods.
☆ The Impact of Feature Representation on the Accuracy of Photonic Neural Networks
Photonic Neural Networks (PNNs) are gaining significant interest in the research community due to their potential for high parallelization, low latency, and energy efficiency. PNNs compute using light, which leads to several differences in implementation when compared to electronics, such as the need to represent input features in the photonic domain before feeding them into the network. In this encoding process, it is common to combine multiple features into a single input to reduce the number of inputs and associated devices, leading to smaller and more energy-efficient PNNs. Although this alters the network's handling of input data, its impact on PNNs remains understudied. This paper addresses this open question, investigating the effect of commonly used encoding strategies that combine features on the performance and learning capabilities of PNNs. Here, using the concept of feature importance, we develop a mathematical framework for analyzing feature combination. Through this framework, we demonstrate that encoding multiple features together in a single input determines their relative importance, thus limiting the network's ability to learn from the data. Given some prior knowledge of the data, however, this can also be leveraged for higher accuracy. By selecting an optimal encoding method, we achieve up to a 12.3\% improvement in accuracy of PNNs trained on the Iris dataset compared to other encoding techniques, surpassing the performance of networks where features are not combined. These findings highlight the importance of carefully choosing the encoding to the accuracy and decision-making strategies of PNNs, particularly in size or power constrained applications.
☆ A Stem-Agnostic Single-Decoder System for Music Source Separation Beyond Four Stems
Despite significant recent progress across multiple subtasks of audio source separation, few music source separation systems support separation beyond the four-stem vocals, drums, bass, and other (VDBO) setup. Of the very few current systems that support source separation beyond this setup, most continue to rely on an inflexible decoder setup that can only support a fixed pre-defined set of stems. Increasing stem support in these inflexible systems correspondingly requires increasing computational complexity, rendering extensions of these systems computationally infeasible for long-tail instruments. In this work, we propose Banquet, a system that allows source separation of multiple stems using just one decoder. A bandsplit source separation model is extended to work in a query-based setup in tandem with a music instrument recognition PaSST model. On the MoisesDB dataset, Banquet, at only 24.9 M trainable parameters, approached the performance level of the significantly more complex 6-stem Hybrid Transformer Demucs on VDBO stems and outperformed it on guitar and piano. The query-based setup allows for the separation of narrow instrument classes such as clean acoustic guitars, and can be successfully applied to the extraction of less common stems such as reeds and organs. Implementation is available at https://github.com/kwatcharasupat/query-bandit.
comment: Submitted to the 25th International Society for Music Information Retrieval Conference (ISMIR 2024)
☆ Decentralized Semantic Traffic Control in AVs Using RL and DQN for Dynamic Roadblocks
Autonomous Vehicles (AVs), furnished with sensors capable of capturing essential vehicle dynamics such as speed, acceleration, and precise location, possess the capacity to execute intelligent maneuvers, including lane changes, in anticipation of approaching roadblocks. Nevertheless, the sheer volume of sensory data and the processing necessary to derive informed decisions can often overwhelm the vehicles, rendering them unable to handle the task independently. Consequently, a common approach in traffic scenarios involves transmitting the data to servers for processing, a practice that introduces challenges, particularly in situations demanding real-time processing. In response to this challenge, we present a novel DL-based semantic traffic control system that entrusts semantic encoding responsibilities to the vehicles themselves. This system processes driving decisions obtained from a Reinforcement Learning (RL) agent, streamlining the decision-making process. Specifically, our framework envisions scenarios where abrupt roadblocks materialize due to factors such as road maintenance, accidents, or vehicle repairs, necessitating vehicles to make determinations concerning lane-keeping or lane-changing actions to navigate past these obstacles. To formulate this scenario mathematically, we employ a Markov Decision Process (MDP) and harness the Deep Q Learning (DQN) algorithm to unearth viable solutions.
☆ WavRx: a Disease-Agnostic, Generalizable, and Privacy-Preserving Speech Health Diagnostic Model
Speech is known to carry health-related attributes, which has emerged as a novel venue for remote and long-term health monitoring. However, existing models are usually tailored for a specific type of disease, and have been shown to lack generalizability across datasets. Furthermore, concerns have been raised recently towards the leakage of speaker identity from health embeddings. To mitigate these limitations, we propose WavRx, a speech health diagnostics model that captures the respiration and articulation related dynamics from a universal speech representation. Our in-domain and cross-domain experiments on six pathological speech datasets demonstrate WavRx as a new state-of-the-art health diagnostic model. Furthermore, we show that the amount of speaker identity entailed in the WavRx health embeddings is significantly reduced without extra guidance during training. An in-depth analysis of the model was performed, thus providing physiological interpretation of its improved generalizability and privacy-preserving ability.
comment: Under review; Model script available at https://github.com/zhu00121/WavRx
☆ Fast Optimizer Benchmark
In this paper, we present the Fast Optimizer Benchmark (FOB), a tool designed for evaluating deep learning optimizers during their development. The benchmark supports tasks from multiple domains such as computer vision, natural language processing, and graph learning. The focus is on convenient usage, featuring human-readable YAML configurations, SLURM integration, and plotting utilities. FOB can be used together with existing hyperparameter optimization (HPO) tools as it handles training and resuming of runs. The modular design enables integration into custom pipelines, using it simply as a collection of tasks. We showcase an optimizer comparison as a usage example of our tool. FOB can be found on GitHub: https://github.com/automl/FOB.
comment: 5 pages + 12 appendix pages, submitted to AutoML Conf 2024 Workshop Track
☆ Learning to Correct for QA Reasoning with Black-box LLMs
An open challenge in recent machine learning is about how to improve the reasoning capability of large language models (LLMs) in a black-box setting, i.e., without access to detailed information such as output token probabilities. Existing approaches either rely on accessibility (which is often unrealistic) or involve significantly increased train- and inference-time costs. This paper addresses those limitations or shortcomings by proposing a novel approach, namely CoBB (Correct for improving QA reasoning of Black-Box LLMs). It uses a trained adaptation model to perform a seq2seq mapping from the often-imperfect reasonings of the original black-box LLM to the correct or improved reasonings. Specifically, the adaptation model is initialized with a relatively small open-source LLM and adapted over a collection of sub-sampled training pairs. To select the representative pairs of correct and incorrect reasonings, we formulated the dataset construction as an optimization problem that minimizes the statistical divergence between the sampled subset and the entire collection, and solved it via a genetic algorithm. We then train the adaptation model over the sampled pairs by contrasting the likelihoods of correct and incorrect reasonings. Our experimental results demonstrate that CoBB significantly improves reasoning accuracy across various QA benchmarks, compared to the best-performing adaptation baselines.
comment: preprint, 18 pages
☆ Petal-X: Human-Centered Visual Explanations to Improve Cardiovascular Risk Communication
Cardiovascular diseases (CVDs), the leading cause of death worldwide, can be prevented in most cases through behavioral interventions. Therefore, effective communication of CVD risk and projected risk reduction by risk factor modification plays a crucial role in reducing CVD risk at the individual level. However, despite interest in refining risk estimation with improved prediction models such as SCORE2, the guidelines for presenting these risk estimations in clinical practice remained essentially unchanged in the last few years, with graphical score charts (GSCs) continuing to be one of the prevalent systems. This work describes the design and implementation of Petal-X, a novel tool to support clinician-patient shared decision-making by explaining the CVD risk contributions of different factors and facilitating what-if analysis. Petal-X relies on a novel visualization, Petal Product Plots, and a tailor-made global surrogate model of SCORE2, whose fidelity is comparable to that of the GSCs used in clinical practice. We evaluated Petal-X compared to GSCs in a controlled experiment with 88 healthcare students, all but one with experience with chronic patients. The results show that Petal-X outperforms GSC in critical tasks, such as comparing the contribution to the patient's 10-year CVD risk of each modifiable risk factor, without a significant loss of perceived transparency, trust, or intent to use. Our study provides an innovative approach to the visualization and explanation of risk in clinical practice that, due to its model-agnostic nature, could continue to support next-generation artificial intelligence risk assessment models.
☆ The Multilingual Alignment Prism: Aligning Global and Local Preferences to Reduce Harm
A key concern with the concept of "alignment" is the implicit question of "alignment to what?". AI systems are increasingly used across the world, yet safety alignment is often focused on homogeneous monolingual settings. Additionally, preference training and safety measures often overfit to harms common in Western-centric datasets. Here, we explore the viability of different alignment approaches when balancing dual objectives: addressing and optimizing for a non-homogeneous set of languages and cultural preferences while minimizing both global and local harms. We collect the first set of human annotated red-teaming prompts in different languages distinguishing between global and local harm, which serve as a laboratory for understanding the reliability of alignment techniques when faced with preference distributions that are non-stationary across geographies and languages. While this setting is seldom covered by the literature to date, which primarily centers on English harm mitigation, it captures real-world interactions with AI systems around the world. We establish a new precedent for state-of-the-art alignment techniques across 6 languages with minimal degradation in general performance. Our work provides important insights into cross-lingual transfer and novel optimization approaches to safeguard AI systems designed to serve global populations.
☆ Speakers Unembedded: Embedding-free Approach to Long-form Neural Diarization INTERSPEECH 2024
End-to-end neural diarization (EEND) models offer significant improvements over traditional embedding-based Speaker Diarization (SD) approaches but falls short on generalizing to long-form audio with large number of speakers. EEND-vector-clustering method mitigates this by combining local EEND with global clustering of speaker embeddings from local windows, but this requires an additional speaker embedding framework alongside the EEND module. In this paper, we propose a novel framework applying EEND both locally and globally for long-form audio without separate speaker embeddings. This approach achieves significant relative DER reduction of 13% and 10% over the conventional 1-pass EEND on Callhome American English and RT03-CTS datasets respectively and marginal improvements over EEND-vector-clustering without the need for additional speaker embeddings. Furthermore, we discuss the computational complexity of our proposed framework and explore strategies for reducing processing times.
comment: Accepted at INTERSPEECH 2024
☆ Few-shot Personalization of LLMs with Mis-aligned Responses
As the diversity of users increases, the capability of providing personalized responses by large language models (LLMs) has become increasingly important. Existing approaches have only limited successes in LLM personalization, due to the absence of personalized learning or the reliance on shared personal data. This paper proposes a new approach for a few-shot personalization of LLMs with their mis-aligned responses (Fermi). Our key idea is to learn a set of personalized prompts for each user by progressively improving the prompts using LLMs, based on user profile (e.g., demographic information) and a few examples of previous opinions. During an iterative process of prompt improvement, we incorporate the contexts of mis-aligned responses by LLMs, which are especially crucial for the effective personalization of LLMs. In addition, we develop an effective inference method to further leverage the context of the test query and the personalized prompts. Our experimental results demonstrate that Fermi significantly improves performance across various benchmarks, compared to the best-performing baselines.
comment: preprint, 30 pages
☆ Understand What LLM Needs: Dual Preference Alignment for Retrieval-Augmented Generation
Retrieval-augmented generation (RAG) has demonstrated effectiveness in mitigating the hallucination problem of large language models (LLMs). However, the difficulty of aligning the retriever with the diverse LLMs' knowledge preferences inevitably poses an inevitable challenge in developing a reliable RAG system. To address this issue, we propose DPA-RAG, a universal framework designed to align diverse knowledge preferences within RAG systems. Specifically, we initially introduce a preference knowledge construction pipline and incorporate five novel query augmentation strategies to alleviate preference data scarcity. Based on preference data, DPA-RAG accomplishes both external and internal preference alignment: 1) It jointly integrate pair-wise, point-wise, and contrastive preference alignment abilities into the reranker, achieving external preference alignment among RAG components. 2) It further introduces a pre-aligned stage before vanilla Supervised Fine-tuning (SFT), enabling LLMs to implicitly capture knowledge aligned with their reasoning preferences, achieving LLMs' internal alignment. Experimental results across four knowledge-intensive QA datasets demonstrate that DPA-RAG outperforms all baselines and seamlessly integrates both black-box and open-sourced LLM readers. Further qualitative analysis and discussions also provide empirical guidance for achieving reliable RAG systems. Our code is publicly available at https://github.com/dongguanting/DPA-RAG.
comment: Work in progress
☆ Human-AI Collaborative Taxonomy Construction: A Case Study in Profession-Specific Writing Assistants
Large Language Models (LLMs) have assisted humans in several writing tasks, including text revision and story generation. However, their effectiveness in supporting domain-specific writing, particularly in business contexts, is relatively less explored. Our formative study with industry professionals revealed the limitations in current LLMs' understanding of the nuances in such domain-specific writing. To address this gap, we propose an approach of human-AI collaborative taxonomy development to perform as a guideline for domain-specific writing assistants. This method integrates iterative feedback from domain experts and multiple interactions between these experts and LLMs to refine the taxonomy. Through larger-scale experiments, we aim to validate this methodology and thus improve LLM-powered writing assistance, tailoring it to meet the unique requirements of different stakeholder needs.
comment: Accepted to CHI 2024 In2Writing Workshop
☆ RouteLLM: Learning to Route LLMs with Preference Data
Large language models (LLMs) exhibit impressive capabilities across a wide range of tasks, yet the choice of which model to use often involves a trade-off between performance and cost. More powerful models, though effective, come with higher expenses, while less capable models are more cost-effective. To address this dilemma, we propose several efficient router models that dynamically select between a stronger and a weaker LLM during inference, aiming to optimize the balance between cost and response quality. We develop a training framework for these routers leveraging human preference data and data augmentation techniques to enhance performance. Our evaluation on widely-recognized benchmarks shows that our approach significantly reduces costs-by over 2 times in certain cases-without compromising the quality of responses. Interestingly, our router models also demonstrate significant transfer learning capabilities, maintaining their performance even when the strong and weak models are changed at test time. This highlights the potential of these routers to provide a cost-effective yet high-performance solution for deploying LLMs.
☆ Improving Hyperparameter Optimization with Checkpointed Model Weights
When training deep learning models, the performance depends largely on the selected hyperparameters. However, hyperparameter optimization (HPO) is often one of the most expensive parts of model design. Classical HPO methods treat this as a black-box optimization problem. However, gray-box HPO methods, which incorporate more information about the setup, have emerged as a promising direction for more efficient optimization. For example, using intermediate loss evaluations to terminate bad selections. In this work, we propose an HPO method for neural networks using logged checkpoints of the trained weights to guide future hyperparameter selections. Our method, Forecasting Model Search (FMS), embeds weights into a Gaussian process deep kernel surrogate model, using a permutation-invariant graph metanetwork to be data-efficient with the logged network weights. To facilitate reproducibility and further research, we open-source our code at https://github.com/NVlabs/forecasting-model-search.
comment: See the project website at https://research.nvidia.com/labs/toronto-ai/FMS/
☆ Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs
Mathematical reasoning presents a significant challenge for Large Language Models (LLMs) due to the extensive and precise chain of reasoning required for accuracy. Ensuring the correctness of each reasoning step is critical. To address this, we aim to enhance the robustness and factuality of LLMs by learning from human feedback. However, Direct Preference Optimization (DPO) has shown limited benefits for long-chain mathematical reasoning, as models employing DPO struggle to identify detailed errors in incorrect answers. This limitation stems from a lack of fine-grained process supervision. We propose a simple, effective, and data-efficient method called Step-DPO, which treats individual reasoning steps as units for preference optimization rather than evaluating answers holistically. Additionally, we have developed a data construction pipeline for Step-DPO, enabling the creation of a high-quality dataset containing 10K step-wise preference pairs. We also observe that in DPO, self-generated data is more effective than data generated by humans or GPT-4, due to the latter's out-of-distribution nature. Our findings demonstrate that as few as 10K preference data pairs and fewer than 500 Step-DPO training steps can yield a nearly 3% gain in accuracy on MATH for models with over 70B parameters. Notably, Step-DPO, when applied to Qwen2-72B-Instruct, achieves scores of 70.8% and 94.0% on the test sets of MATH and GSM8K, respectively, surpassing a series of closed-source models, including GPT-4-1106, Claude-3-Opus, and Gemini-1.5-Pro. Our code, data, and models are available at https://github.com/dvlab-research/Step-DPO.
comment: Code, data, and models are available at https://github.com/dvlab-research/Step-DPO
♻ ☆ Situational Awareness Matters in 3D Vision Language Reasoning CVPR 2024
Being able to carry out complicated vision language reasoning tasks in 3D space represents a significant milestone in developing household robots and human-centered embodied AI. In this work, we demonstrate that a critical and distinct challenge in 3D vision language reasoning is situational awareness, which incorporates two key components: (1) The autonomous agent grounds its self-location based on a language prompt. (2) The agent answers open-ended questions from the perspective of its calculated position. To address this challenge, we introduce SIG3D, an end-to-end Situation-Grounded model for 3D vision language reasoning. We tokenize the 3D scene into sparse voxel representation and propose a language-grounded situation estimator, followed by a situated question answering module. Experiments on the SQA3D and ScanQA datasets show that SIG3D outperforms state-of-the-art models in situation estimation and question answering by a large margin (e.g., an enhancement of over 30% on situation estimation accuracy). Subsequent analysis corroborates our architectural design choices, explores the distinct functions of visual and textual tokens, and highlights the importance of situational awareness in the domain of 3D question answering.
comment: CVPR 2024. Project Page: https://yunzeman.github.io/situation3d
♻ ☆ Large Language Models in the Clinic: A Comprehensive Benchmark
The adoption of large language models (LLMs) to assist clinicians has attracted remarkable attention. Existing works mainly adopt the close-ended question-answering (QA) task with answer options for evaluation. However, many clinical decisions involve answering open-ended questions without pre-set options. To better understand LLMs in the clinic, we construct a benchmark ClinicBench. We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks. Furthermore, we construct six novel datasets and complex clinical tasks that are close to real-world practice, i.e., referral QA, treatment recommendation, hospitalization (long document) summarization, patient education, pharmacology QA and drug interaction for emerging drugs. We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings. Finally, we invite medical experts to evaluate the clinical usefulness of LLMs.
♻ ☆ Large Language Model Enhanced Clustering for News Event Detection
The news landscape is continuously evolving, with an ever-increasing volume of information from around the world. Automated event detection within this vast data repository is essential for monitoring, identifying, and categorizing significant news occurrences across diverse platforms. This paper presents an event detection framework that leverages Large Language Models (LLMs) combined with clustering analysis to detect news events from the Global Database of Events, Language, and Tone (GDELT). The framework enhances event clustering through both pre-event detection tasks (keyword extraction and text embedding) and post-event detection tasks (event summarization and topic labeling). We also evaluate the impact of various textual embeddings on the quality of clustering outcomes, ensuring robust news categorization. Additionally, we introduce a novel Cluster Stability Assessment Index (CSAI) to assess the validity and robustness of clustering results. CSAI utilizes latent feature vectors to provide a new way of measuring clustering quality. Our experiments indicate that combining LLM embeddings with clustering algorithms yields the best results, demonstrating greater robustness in terms of CSAI scores. Moreover, post-event detection tasks generate meaningful insights, facilitating effective interpretation of event clustering results. Overall, our experimental results indicate that the proposed framework offers valuable insights and could enhance the accuracy and depth of news reporting.
♻ ☆ Multimodal and Force-Matched Imitation Learning with a See-Through Visuotactile Sensor
Contact-rich tasks continue to present a variety of challenges for robotic manipulation. In this work, we leverage a multimodal visuotactile sensor within the framework of imitation learning (IL) to perform contact rich tasks that involve relative motion (slipping/sliding) between the end-effector and object. We introduce two algorithmic contributions, tactile force matching and learned mode switching, as complimentary methods for improving IL. Tactile force matching enhances kinesthetic teaching by reading approximate forces during the demonstration and generating an adapted robot trajectory that recreates the recorded forces. Learned mode switching uses IL to couple visual and tactile sensor modes with the learned motion policy, simplifying the transition from reaching to contacting. We perform robotic manipulation experiments on four door opening tasks with a variety of observation and method configurations to study the utility of our proposed improvements and multimodal visuotactile sensing. Our results show that the inclusion of force matching raises average policy success rates by 62.5%, visuotactile mode switching by 30.3%, and visuotactile data as a policy input by 42.5%, emphasizing the value of see-through tactile sensing for IL, both for data collection to allow force matching, and for policy execution to allow accurate task feedback.
comment: Submitted to IEEE Transactions on Robotics (T-RO): Special Section on Tactile Robotics
♻ ☆ AND: Audio Network Dissection for Interpreting Deep Acoustic Models ICML'24
Neuron-level interpretations aim to explain network behaviors and properties by investigating neurons responsive to specific perceptual or structural input patterns. Although there is emerging work in the vision and language domains, none is explored for acoustic models. To bridge the gap, we introduce $\textit{AND}$, the first $\textbf{A}$udio $\textbf{N}$etwork $\textbf{D}$issection framework that automatically establishes natural language explanations of acoustic neurons based on highly-responsive audio. $\textit{AND}$ features the use of LLMs to summarize mutual acoustic features and identities among audio. Extensive experiments are conducted to verify $\textit{AND}$'s precise and informative descriptions. In addition, we demonstrate a potential use of $\textit{AND}$ for audio machine unlearning by conducting concept-specific pruning based on the generated descriptions. Finally, we highlight two acoustic model behaviors with analysis by $\textit{AND}$: (i) models discriminate audio with a combination of basic acoustic features rather than high-level abstract concepts; (ii) training strategies affect model behaviors and neuron interpretability -- supervised training guides neurons to gradually narrow their attention, while self-supervised learning encourages neurons to be polysemantic for exploring high-level features.
comment: Accepted by ICML'24
♻ ☆ BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
Automated software engineering has been greatly empowered by the recent advances in Large Language Models (LLMs) for programming. While current benchmarks have shown that LLMs can perform various software engineering tasks like human developers, the majority of their evaluations are limited to short and self-contained algorithmic tasks. Solving challenging and practical programming tasks requires the capability of utilizing diverse function calls as tools to efficiently implement functionalities like data analysis and web development. In addition, using multiple tools to solve a task needs compositional reasoning by accurately understanding complex instructions. Fulfilling both of these characteristics can pose a great challenge for LLMs. To assess how well LLMs can solve challenging and practical programming tasks, we introduce Bench, a benchmark that challenges LLMs to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained programming tasks. To evaluate LLMs rigorously, each programming task encompasses 5.6 test cases with an average branch coverage of 99%. In addition, we propose a natural-language-oriented variant of Bench, Benchi, that automatically transforms the original docstrings into short instructions only with essential information. Our extensive evaluation of 60 LLMs shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores up to 60%, significantly lower than the human performance of 97%. The results underscore the need for further advancements in this area.
comment: 44 pages, 14 figures, 7 tables, built with love by the BigCode community :)
♻ ☆ Mixtures of Experts Unlock Parameter Scaling for Deep RL
The recent rapid progress in (self) supervised learning models is in large part predicted by empirical scaling laws: a model's performance scales proportionally to its size. Analogous scaling laws remain elusive for reinforcement learning domains, however, where increasing the parameter count of a model often hurts its final performance. In this paper, we demonstrate that incorporating Mixture-of-Expert (MoE) modules, and in particular Soft MoEs (Puigcerver et al., 2023), into value-based networks results in more parameter-scalable models, evidenced by substantial performance increases across a variety of training regimes and model sizes. This work thus provides strong empirical evidence towards developing scaling laws for reinforcement learning.
♻ ☆ Fair, Manipulation-Robust, and Transparent Sortition
Sortition, the random selection of political representatives, is increasingly being used around the world to choose participants of deliberative processes like Citizens' Assemblies. Motivated by sortition's practical importance, there has been a recent flurry of research on sortition algorithms, whose task it is to select a panel from among a pool of volunteers. This panel must satisfy quotas enforcing representation of key population subgroups. Past work has contributed an algorithmic approach for fulfilling this task while ensuring that volunteers' chances of selection are maximally equal, as measured by any convex equality objective. The question, then, is: which equality objective is the right one? Past work has mainly studied the objectives Minimax and Leximin, which respectively minimize the maximum and maximize the minimum chance of selection given to any volunteer. Recent work showed that both of these objectives have key weaknesses: Minimax is highly robust to manipulation but is arbitrarily unfair; oppositely, Leximin is highly fair but arbitrarily manipulable. In light of this gap, we propose a new equality objective, Goldilocks, that aims to achieve these ideals simultaneously by ensuring that no volunteer receives too little or too much chance of selection. We theoretically bound the extent to which Goldilocks achieves these ideals, finding that in an important sense, Goldilocks recovers among the best available solutions in a given instance. We then extend our bounds to the case where the output of Goldilocks is transformed to achieve a third goal, Transparency. Our empirical analysis of Goldilocks in real data is even more promising: we find that this objective achieves nearly instance-optimal minimum and maximum selection probabilities simultaneously in most real instances -- an outcome not even guaranteed to be possible for any algorithm.
♻ ☆ Large Knowledge Model: Perspectives and Challenges
Humankind's understanding of the world is fundamentally linked to our perception and cognition, with \emph{human languages} serving as one of the major carriers of \emph{world knowledge}. In this vein, \emph{Large Language Models} (LLMs) like ChatGPT epitomize the pre-training of extensive, sequence-based world knowledge into neural networks, facilitating the processing and manipulation of this knowledge in a parametric space. This article explores large models through the lens of "knowledge". We initially investigate the role of symbolic knowledge such as Knowledge Graphs (KGs) in enhancing LLMs, covering aspects like knowledge-augmented language model, structure-inducing pre-training, knowledgeable prompts, structured CoT, knowledge editing, semantic tools for LLM and knowledgeable AI agents. Subsequently, we examine how LLMs can boost traditional symbolic knowledge bases, encompassing aspects like using LLM as KG builder and controller, structured knowledge pretraining, and LLM-enhanced symbolic reasoning. Considering the intricate nature of human knowledge, we advocate for the creation of \emph{Large Knowledge Models} (LKM), specifically engineered to manage diversified spectrum of knowledge structures. This promising undertaking would entail several key challenges, such as disentangling knowledge base from language models, cognitive alignment with human knowledge, integration of perception and cognition, and building large commonsense models for interacting with physical world, among others. We finally propose a five-"A" principle to distinguish the concept of LKM.
comment: Data Intelligence, Published: Jun 18, 2024
♻ ☆ MultiAgent Collaboration Attack: Investigating Adversarial Attacks in Large Language Model Collaborations via Debate
Large Language Models (LLMs) have shown exceptional results on current benchmarks when working individually. The advancement in their capabilities, along with a reduction in parameter size and inference times, has facilitated the use of these models as agents, enabling interactions among multiple models to execute complex tasks. Such collaborations offer several advantages, including the use of specialized models (e.g. coding), improved confidence through multiple computations, and enhanced divergent thinking, leading to more diverse outputs. Thus, the collaborative use of language models is expected to grow significantly in the coming years. In this work, we evaluate the behavior of a network of models collaborating through debate under the influence of an adversary. We introduce pertinent metrics to assess the adversary's effectiveness, focusing on system accuracy and model agreement. Our findings highlight the importance of a model's persuasive ability in influencing others. Additionally, we explore inference-time methods to generate more compelling arguments and evaluate the potential of prompt-based mitigation as a defensive strategy.
♻ ☆ ToM-LM: Delegating Theory of Mind Reasoning to External Symbolic Executors in Large Language Models
Theory of Mind (ToM) refers to the ability of individuals to attribute mental states to others. While Large Language Models (LLMs) have shown some promise with ToM ability, they still struggle with complex ToM reasoning. Our approach leverages an external symbolic executor, specifically the SMCDEL model checker, and fine-tuning to improve the ToM reasoning ability of LLMs. In our approach, an LLM is first fine-tuned through pairs of natural language and symbolic formulation representation of ToM problems and is then instructed to generate the symbolic formulation with a one-shot in-context example. The generated symbolic formulation is then executed by the SMCDEL model checker to perform transparent and verifiable ToM reasoning and give the final result. We demonstrate that our approach, ToM-LM, shows a significant improvement over all the constructed baselines. Our study proposes a novel view about externalizing a particular component of ToM reasoning, mainly reasoning about beliefs, and suggests generalizing it to other aspects of ToM reasoning.
comment: Accepted at NeSy 2024
♻ ☆ RL in Latent MDPs is Tractable: Online Guarantees via Off-Policy Evaluation
In many real-world decision problems there is partially observed, hidden or latent information that remains fixed throughout an interaction. Such decision problems can be modeled as Latent Markov Decision Processes (LMDPs), where a latent variable is selected at the beginning of an interaction and is not disclosed to the agent. In the last decade, there has been significant progress in solving LMDPs under different structural assumptions. However, for general LMDPs, there is no known learning algorithm that provably matches the existing lower bound (Kwon et al., 2021). We introduce the first sample-efficient algorithm for LMDPs without any additional structural assumptions. Our result builds off a new perspective on the role of off-policy evaluation guarantees and coverage coefficients in LMDPs, a perspective, that has been overlooked in the context of exploration in partially observed environments. Specifically, we establish a novel off-policy evaluation lemma and introduce a new coverage coefficient for LMDPs. Then, we show how these can be used to derive near-optimal guarantees of an optimistic exploration algorithm. These results, we believe, can be valuable for a wide range of interactive learning problems beyond LMDPs, and especially, for partially observed environments.
comment: Fixed typos + alpha
♻ ☆ Cultural Bias and Cultural Alignment of Large Language Models
Culture fundamentally shapes people's reasoning, behavior, and communication. As people increasingly use generative artificial intelligence (AI) to expedite and automate personal and professional tasks, cultural values embedded in AI models may bias people's authentic expression and contribute to the dominance of certain cultures. We conduct a disaggregated evaluation of cultural bias for five widely used large language models (OpenAI's GPT-4o/4-turbo/4/3.5-turbo/3) by comparing the models' responses to nationally representative survey data. All models exhibit cultural values resembling English-speaking and Protestant European countries. We test cultural prompting as a control strategy to increase cultural alignment for each country/territory. For recent models (GPT-4, 4-turbo, 4o), this improves the cultural alignment of the models' output for 71-81% of countries and territories. We suggest using cultural prompting and ongoing evaluation to reduce cultural bias in the output of generative AI.
♻ ☆ Active Preference Inference using Language Models and Probabilistic Reasoning
Actively inferring user preferences, for example by asking good questions, is important for any human-facing decision-making system. Active inference allows such systems to adapt and personalize themselves to nuanced individual preferences. To enable this ability for instruction-tuned large language models (LLMs), one may prompt them to ask users questions to infer their preferences, transforming the language models into more robust, interactive systems. However, out of the box, these models are not efficient at extracting preferences: the questions they generate are not informative, requiring a high number of user interactions and impeding the usability of the downstream system. In this work, we introduce an inference-time algorithm that helps LLMs quickly infer preferences by using more informative questions. Our algorithm uses a probabilistic model whose conditional distributions are defined by prompting an LLM, and returns questions that optimize expected entropy and expected model change. Results in a simplified interactive web shopping setting with real product items show that an LLM equipped with our entropy reduction algorithm outperforms baselines with the same underlying LLM on task performance while using fewer user interactions.
♻ ☆ OlympicArena Medal Ranks: Who Is the Most Intelligent AI So Far?
In this report, we pose the following question: Who is the most intelligent AI model to date, as measured by the OlympicArena (an Olympic-level, multi-discipline, multi-modal benchmark for superintelligent AI)? We specifically focus on the most recently released models: Claude-3.5-Sonnet, Gemini-1.5-Pro, and GPT-4o. For the first time, we propose using an Olympic medal Table approach to rank AI models based on their comprehensive performance across various disciplines. Empirical results reveal: (1) Claude-3.5-Sonnet shows highly competitive overall performance over GPT-4o, even surpassing GPT-4o on a few subjects (i.e., Physics, Chemistry, and Biology). (2) Gemini-1.5-Pro and GPT-4V are ranked consecutively just behind GPT-4o and Claude-3.5-Sonnet, but with a clear performance gap between them. (3) The performance of AI models from the open-source community significantly lags behind these proprietary models. (4) The performance of these models on this benchmark has been less than satisfactory, indicating that we still have a long way to go before achieving superintelligence. We remain committed to continuously tracking and evaluating the performance of the latest powerful models on this benchmark (available at https://github.com/GAIR-NLP/OlympicArena).
comment: 10 pages
♻ ☆ WhaleNet: a Novel Deep Learning Architecture for Marine Mammals Vocalizations on Watkins Marine Mammal Sound Database
Marine mammal communication is a complex field, hindered by the diversity of vocalizations and environmental factors. The Watkins Marine Mammal Sound Database (WMMD) constitutes a comprehensive labeled dataset employed in machine learning applications. Nevertheless, the methodologies for data preparation, preprocessing, and classification documented in the literature exhibit considerable variability and are typically not applied to the dataset in its entirety. This study initially undertakes a concise review of the state-of-the-art benchmarks pertaining to the dataset, with a particular focus on clarifying data preparation and preprocessing techniques. Subsequently, we explore the utilization of the Wavelet Scattering Transform (WST) and Mel spectrogram as preprocessing mechanisms for feature extraction. In this paper, we introduce \textbf{WhaleNet} (Wavelet Highly Adaptive Learning Ensemble Network), a sophisticated deep ensemble architecture for the classification of marine mammal vocalizations, leveraging both WST and Mel spectrogram for enhanced feature discrimination. By integrating the insights derived from WST and Mel representations, we achieved an improvement in classification accuracy by $8-10\%$ over existing architectures, corresponding to a classification accuracy of $97.61\%$.
♻ ☆ Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task ICLR 2023
Language models show a surprising range of capabilities, but the source of their apparent competence is unclear. Do these networks just memorize a collection of surface statistics, or do they rely on internal representations of the process that generates the sequences they see? We investigate this question by applying a variant of the GPT model to the task of predicting legal moves in a simple board game, Othello. Although the network has no a priori knowledge of the game or its rules, we uncover evidence of an emergent nonlinear internal representation of the board state. Interventional experiments indicate this representation can be used to control the output of the network and create "latent saliency maps" that can help explain predictions in human terms.
comment: ICLR 2023 oral (notable-top-5%): https://openreview.net/forum?id=DeG07_TcZvT ; code: https://github.com/likenneth/othello_world
♻ ☆ Inference-Time Intervention: Eliciting Truthful Answers from a Language Model NeurIPS 2023
We introduce Inference-Time Intervention (ITI), a technique designed to enhance the "truthfulness" of large language models (LLMs). ITI operates by shifting model activations during inference, following a set of directions across a limited number of attention heads. This intervention significantly improves the performance of LLaMA models on the TruthfulQA benchmark. On an instruction-finetuned LLaMA called Alpaca, ITI improves its truthfulness from 32.5% to 65.1%. We identify a tradeoff between truthfulness and helpfulness and demonstrate how to balance it by tuning the intervention strength. ITI is minimally invasive and computationally inexpensive. Moreover, the technique is data efficient: while approaches like RLHF require extensive annotations, ITI locates truthful directions using only few hundred examples. Our findings suggest that LLMs may have an internal representation of the likelihood of something being true, even as they produce falsehoods on the surface.
comment: NeurIPS 2023 spotlight; code: https://github.com/likenneth/honest_llama
♻ ☆ Topological data quality via 0-dimensional persistence matching
Data quality is crucial for the successful training, generalization and performance of artificial intelligence models. We propose to measure data quality for supervised learning using topological data analysis techniques. Specifically, we provide a novel topological invariant based on persistence matchings induced by inclusions and using $0$-dimensional persistent homology. We show that such an invariant is stable. We provide an algorithm and relate it to images, kernels, and cokernels of the induced morphisms. Also, we show that the invariant allows us to understand whether the subset "represents well" the clusters from the larger dataset or not, and we also use it to estimate bounds for the Hausdorff distance between the subset and the complete dataset. This approach enables us to explain why the chosen dataset will lead to poor performance.
♻ ☆ Adam-mini: Use Fewer Learning Rates To Gain More
We propose Adam-mini, an optimizer that achieves on-par or better performance than AdamW with 45% to 50% less memory footprint. Adam-mini reduces memory by cutting down the learning rate resources in Adam (i.e., $1/\sqrt{v}$). We find that $\geq$ 90% of these learning rates in $v$ could be harmlessly removed if we (1) carefully partition the parameters into blocks following our proposed principle on Hessian structure; (2) assign a single but good learning rate to each parameter block. We further find that, for each of these parameter blocks, there exists a single high-quality learning rate that can outperform Adam, provided that sufficient resources are available to search it out. We then provide one cost-effective way to find good learning rates and propose Adam-mini. Empirically, we verify that Adam-mini performs on par or better than AdamW on various language models sized from 125M to 7B for pre-training, supervised fine-tuning, and RLHF. The reduced memory footprint of Adam-mini also alleviates communication overheads among GPUs and CPUs, thereby increasing throughput. For instance, Adam-mini achieves 49.6% higher throughput than AdamW when pre-training Llama2-7B on $2\times$ A800-80GB GPUs, which saves 33% wall-clock time for pre-training.
♻ ☆ Are AI-Generated Text Detectors Robust to Adversarial Perturbations? ACL 2024
The widespread use of large language models (LLMs) has sparked concerns about the potential misuse of AI-generated text, as these models can produce content that closely resembles human-generated text. Current detectors for AI-generated text (AIGT) lack robustness against adversarial perturbations, with even minor changes in characters or words causing a reversal in distinguishing between human-created and AI-generated text. This paper investigates the robustness of existing AIGT detection methods and introduces a novel detector, the Siamese Calibrated Reconstruction Network (SCRN). The SCRN employs a reconstruction network to add and remove noise from text, extracting a semantic representation that is robust to local perturbations. We also propose a siamese calibration technique to train the model to make equally confidence predictions under different noise, which improves the model's robustness against adversarial perturbations. Experiments on four publicly available datasets show that the SCRN outperforms all baseline methods, achieving 6.5\%-18.25\% absolute accuracy improvement over the best baseline method under adversarial attacks. Moreover, it exhibits superior generalizability in cross-domain, cross-genre, and mixed-source scenarios. The code is available at \url{https://github.com/CarlanLark/Robust-AIGC-Detector}.
comment: Accepted to ACL 2024 main conference
♻ ☆ SUB-PLAY: Adversarial Policies against Partially Observed Multi-Agent Reinforcement Learning Systems CCS'24
Recent advancements in multi-agent reinforcement learning (MARL) have opened up vast application prospects, such as swarm control of drones, collaborative manipulation by robotic arms, and multi-target encirclement. However, potential security threats during the MARL deployment need more attention and thorough investigation. Recent research reveals that attackers can rapidly exploit the victim's vulnerabilities, generating adversarial policies that result in the failure of specific tasks. For instance, reducing the winning rate of a superhuman-level Go AI to around 20%. Existing studies predominantly focus on two-player competitive environments, assuming attackers possess complete global state observation. In this study, we unveil, for the first time, the capability of attackers to generate adversarial policies even when restricted to partial observations of the victims in multi-agent competitive environments. Specifically, we propose a novel black-box attack (SUB-PLAY) that incorporates the concept of constructing multiple subgames to mitigate the impact of partial observability and suggests sharing transitions among subpolicies to improve attackers' exploitative ability. Extensive evaluations demonstrate the effectiveness of SUB-PLAY under three typical partial observability limitations. Visualization results indicate that adversarial policies induce significantly different activations of the victims' policy networks. Furthermore, we evaluate three potential defenses aimed at exploring ways to mitigate security threats posed by adversarial policies, providing constructive recommendations for deploying MARL in competitive environments.
comment: To appear in the ACM Conference on Computer and Communications Security (CCS'24), October 14-18, 2024, Salt Lake City, UT, USA
♻ ☆ CLARA: Classifying and Disambiguating User Commands for Reliable Interactive Robotic Agents
In this paper, we focus on inferring whether the given user command is clear, ambiguous, or infeasible in the context of interactive robotic agents utilizing large language models (LLMs). To tackle this problem, we first present an uncertainty estimation method for LLMs to classify whether the command is certain (i.e., clear) or not (i.e., ambiguous or infeasible). Once the command is classified as uncertain, we further distinguish it between ambiguous or infeasible commands leveraging LLMs with situational aware context in a zero-shot manner. For ambiguous commands, we disambiguate the command by interacting with users via question generation with LLMs. We believe that proper recognition of the given commands could lead to a decrease in malfunction and undesired actions of the robot, enhancing the reliability of interactive robot agents. We present a dataset for robotic situational awareness, consisting pair of high-level commands, scene descriptions, and labels of command type (i.e., clear, ambiguous, or infeasible). We validate the proposed method on the collected dataset, pick-and-place tabletop simulation. Finally, we demonstrate the proposed approach in real-world human-robot interaction experiments, i.e., handover scenarios.
comment: Project Page: https://clararobot.github.io/
♻ ☆ Jina CLIP: Your CLIP Model Is Also Your Text Retriever ICML2024
Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-sized vectors. These models are key to multimodal information retrieval and related tasks. However, CLIP models generally underperform in text-only tasks compared to specialized text models. This creates inefficiencies for information retrieval systems that keep separate embeddings and models for text-only and multimodal tasks. We propose a novel, multi-task contrastive training method to address this issue, which we use to train the jina-clip-v1 model to achieve the state-of-the-art performance on both text-image and text-text retrieval tasks.
comment: 4 pages, MFM-EAI@ICML2024
♻ ☆ 360$^\circ$REA: Towards A Reusable Experience Accumulation with 360° Assessment for Multi-Agent System
Large language model agents have demonstrated remarkable advancements across various complex tasks. Recent works focus on optimizing the agent team or employing self-reflection to iteratively solve complex tasks. Since these agents are all based on the same LLM, only conducting self-evaluation or removing underperforming agents does not substantively enhance the capability of the agents. We argue that a comprehensive evaluation and accumulating experience from evaluation feedback is an effective approach to improving system performance. In this paper, we propose Reusable Experience Accumulation with 360$^\circ$ Assessment (360$^\circ$REA), a hierarchical multi-agent framework inspired by corporate organizational practices. The framework employs a novel 360$^\circ$ performance assessment method for multi-perspective performance evaluation with fine-grained assessment. To enhance the capability of agents in addressing complex tasks, we introduce dual-level experience pool for agents to accumulate experience through fine-grained assessment. Extensive experiments on complex task datasets demonstrate the effectiveness of 360$^\circ$REA.
♻ ☆ Enhancing Geometric Ontology Embeddings for $\mathcal{EL}^{++}$ with Negative Sampling and Deductive Closure Filtering
Ontology embeddings map classes, relations, and individuals in ontologies into $\mathbb{R}^n$, and within $\mathbb{R}^n$ similarity between entities can be computed or new axioms inferred. For ontologies in the Description Logic $\mathcal{EL}^{++}$, several embedding methods have been developed that explicitly generate models of an ontology. However, these methods suffer from some limitations; they do not distinguish between statements that are unprovable and provably false, and therefore they may use entailed statements as negatives. Furthermore, they do not utilize the deductive closure of an ontology to identify statements that are inferred but not asserted. We evaluated a set of embedding methods for $\mathcal{EL}^{++}$ ontologies based on high-dimensional ball representation of concept descriptions, incorporating several modifications that aim to make use of the ontology deductive closure. In particular, we designed novel negative losses that account both for the deductive closure and different types of negatives. We demonstrate that our embedding methods improve over the baseline ontology embedding in the task of knowledge base or ontology completion.
comment: Revised version
♻ ☆ A Survey of Generative AI for de novo Drug Design: New Frontiers in Molecule and Protein Generation
Artificial intelligence (AI)-driven methods can vastly improve the historically costly drug design process, with various generative models already in widespread use. Generative models for de novo drug design, in particular, focus on the creation of novel biological compounds entirely from scratch, representing a promising future direction. Rapid development in the field, combined with the inherent complexity of the drug design process, creates a difficult landscape for new researchers to enter. In this survey, we organize de novo drug design into two overarching themes: small molecule and protein generation. Within each theme, we identify a variety of subtasks and applications, highlighting important datasets, benchmarks, and model architectures and comparing the performance of top models. We take a broad approach to AI-driven drug design, allowing for both micro-level comparisons of various methods within each subtask and macro-level observations across different fields. We discuss parallel challenges and approaches between the two applications and highlight future directions for AI-driven de novo drug design as a whole. An organized repository of all covered sources is available at https://github.com/gersteinlab/GenAI4Drug.
♻ ☆ Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP
Contrastive Language--Image Pre-training (CLIP) has manifested remarkable improvements in zero-shot classification and cross-modal vision-language tasks. Yet, from a geometrical point of view, the CLIP embedding space has been found to have a pronounced modality gap. This gap renders the embedding space overly sparse and disconnected, with different modalities being densely distributed in distinct subregions of the hypersphere. In this work, we aim at answering two main questions: 1. Does sharing the parameter space between the multi-modal encoders reduce the modality gap? 2. Can the gap be mitigated by pushing apart the uni-modal embeddings via intra-modality separation? We design AlignCLIP, in order to answer these questions and show that answers to both questions are positive. Through extensive experiments, we show that AlignCLIP achieves noticeable enhancements in the cross-modal alignment of the embeddings, and thereby, reduces the modality gap, while maintaining the performance across several downstream evaluations, such as zero-shot image classification, zero-shot multi-modal retrieval and zero-shot semantic text similarity.
♻ ☆ ChangeMamba: Remote Sensing Change Detection with Spatio-Temporal State Space Model
Convolutional neural networks (CNN) and Transformers have made impressive progress in the field of remote sensing change detection (CD). However, both architectures have inherent shortcomings: CNN are constrained by a limited receptive field that may hinder their ability to capture broader spatial contexts, while Transformers are computationally intensive, making them costly to train and deploy on large datasets. Recently, the Mamba architecture, based on state space models, has shown remarkable performance in a series of natural language processing tasks, which can effectively compensate for the shortcomings of the above two architectures. In this paper, we explore for the first time the potential of the Mamba architecture for remote sensing CD tasks. We tailor the corresponding frameworks, called MambaBCD, MambaSCD, and MambaBDA, for binary change detection (BCD), semantic change detection (SCD), and building damage assessment (BDA), respectively. All three frameworks adopt the cutting-edge Visual Mamba architecture as the encoder, which allows full learning of global spatial contextual information from the input images. For the change decoder, which is available in all three architectures, we propose three spatio-temporal relationship modeling mechanisms, which can be naturally combined with the Mamba architecture and fully utilize its attribute to achieve spatio-temporal interaction of multi-temporal features, thereby obtaining accurate change information. On five benchmark datasets, our proposed frameworks outperform current CNN- and Transformer-based approaches without using any complex training strategies or tricks, fully demonstrating the potential of the Mamba architecture in CD tasks. Further experiments show that our architecture is quite robust to degraded data. The source code will be available in https://github.com/ChenHongruixuan/MambaCD
comment: Accepted by IEEE TGRS
♻ ☆ ObjFormer: Learning Land-Cover Changes From Paired OSM Data and Optical High-Resolution Imagery via Object-Guided Transformer
Optical high-resolution imagery and OSM data are two important data sources of change detection (CD). Previous related studies focus on utilizing the information in OSM data to aid the CD on optical high-resolution images. This paper pioneers the direct detection of land-cover changes utilizing paired OSM data and optical imagery, thereby expanding the scope of CD tasks. To this end, we propose an object-guided Transformer (ObjFormer) by naturally combining the object-based image analysis (OBIA) technique with the advanced vision Transformer architecture. This combination can significantly reduce the computational overhead in the self-attention module without adding extra parameters or layers. ObjFormer has a hierarchical pseudo-siamese encoder consisting of object-guided self-attention modules that extracts multi-level heterogeneous features from OSM data and optical images; a decoder consisting of object-guided cross-attention modules can recover land-cover changes from the extracted heterogeneous features. Beyond basic binary change detection, this paper raises a new semi-supervised semantic change detection task that does not require any manually annotated land-cover labels to train semantic change detectors. Two lightweight semantic decoders are added to ObjFormer to accomplish this task efficiently. A converse cross-entropy loss is designed to fully utilize negative samples, contributing to the great performance improvement in this task. A large-scale benchmark dataset called OpenMapCD containing 1,287 samples covering 40 regions on six continents is constructed to conduct detailed experiments. The results show the effectiveness of our methods in this new kind of CD task. Additionally, case studies in Japanese cities demonstrate the framework's generalizability and practical potential. The OpenMapCD and source code are available in https://github.com/ChenHongruixuan/ObjFormer
comment: Accepted by IEEE TGRS
♻ ☆ Visual Odometry with Neuromorphic Resonator Networks
Visual Odometry (VO) is a method to estimate self-motion of a mobile robot using visual sensors. Unlike odometry based on integrating differential measurements that can accumulate errors, such as inertial sensors or wheel encoders, visual odometry is not compromised by drift. However, image-based VO is computationally demanding, limiting its application in use cases with low-latency, -memory, and -energy requirements. Neuromorphic hardware offers low-power solutions to many vision and AI problems, but designing such solutions is complicated and often has to be assembled from scratch. Here we propose to use Vector Symbolic Architecture (VSA) as an abstraction layer to design algorithms compatible with neuromorphic hardware. Building from a VSA model for scene analysis, described in our companion paper, we present a modular neuromorphic algorithm that achieves state-of-the-art performance on two-dimensional VO tasks. Specifically, the proposed algorithm stores and updates a working memory of the presented visual environment. Based on this working memory, a resonator network estimates the changing location and orientation of the camera. We experimentally validate the neuromorphic VSA-based approach to VO with two benchmarks: one based on an event camera dataset and the other in a dynamic scene with a robotic task.
comment: 19 pages, 5 figures, minor revisions, added results for shapes_translation dataset
♻ ☆ Improving Local Training in Federated Learning via Temperature Scaling
Federated learning is inherently hampered by data heterogeneity: non-i.i.d. training data over local clients. We propose a novel model training approach for federated learning, FLex&Chill, which exploits the Logit Chilling method. Through extensive evaluations, we demonstrate that, in the presence of non-i.i.d. data characteristics inherent in federated learning systems, this approach can expedite model convergence and improve inference accuracy. Quantitatively, from our experiments, we observe up to 6X improvement in the global federated learning model convergence time, and up to 3.37% improvement in inference accuracy.
comment: 24 pages
♻ ☆ Neuromorphic Visual Scene Understanding with Resonator Networks
Analyzing a visual scene by inferring the configuration of a generative model is widely considered the most flexible and generalizable approach to scene understanding. Yet, one major problem is the computational challenge of the inference procedure, involving a combinatorial search across object identities and poses. Here we propose a neuromorphic solution exploiting three key concepts: (1) a computational framework based on Vector Symbolic Architectures (VSA) with complex-valued vectors; (2) the design of Hierarchical Resonator Networks (HRN) to factorize the non-commutative transforms translation and rotation in visual scenes; (3) the design of a multi-compartment spiking phasor neuron model for implementing complex-valued resonator networks on neuromorphic hardware. The VSA framework uses vector binding operations to form a generative image model in which binding acts as the equivariant operation for geometric transformations. A scene can, therefore, be described as a sum of vector products, which can then be efficiently factorized by a resonator network to infer objects and their poses. The HRN features a partitioned architecture in which vector binding is equivariant for horizontal and vertical translation within one partition and for rotation and scaling within the other partition. The spiking neuron model allows mapping the resonator network onto efficient and low-power neuromorphic hardware. Our approach is demonstrated on synthetic scenes composed of simple 2D shapes undergoing rigid geometric transformations and color changes. A companion paper demonstrates the same approach in real-world application scenarios for machine vision and robotics.
comment: 23 pages, 8 figures, minor revisions and extended supplementary material
♻ ☆ Case-Based or Rule-Based: How Do Transformers Do the Math?
Despite the impressive performance in a variety of complex tasks, modern large language models (LLMs) still have trouble dealing with some math problems that are simple and intuitive for humans, such as addition. While we can easily learn basic rules of addition and apply them to new problems of any length, LLMs struggle to do the same. Instead, they may rely on similar cases seen in the training corpus for help. We define these two different reasoning mechanisms as "rule-based reasoning" and "case-based reasoning". Since rule-based reasoning is essential for acquiring systematic generalization ability, we aim to explore exactly whether transformers use rule-based or case-based reasoning for math problems. Through carefully designed intervention experiments on five math tasks, we confirm that transformers are performing case-based reasoning, no matter whether scratchpad is used, which aligns with the previous observations that transformers use subgraph matching/shortcut learning to reason. To mitigate such problems, we propose a Rule-Following Fine-Tuning (RFFT) technique to teach transformers to perform rule-based reasoning. Specifically, we provide explicit rules in the input and then instruct transformers to recite and follow the rules step by step. Through RFFT, we successfully enable LLMs fine-tuned on 1-5 digit addition to generalize to up to 12-digit addition with over 95% accuracy, which is over 40% higher than scratchpad. The significant improvement demonstrates that teaching LLMs to use rules explicitly helps them learn rule-based reasoning and generalize better in length.
♻ ☆ ClickPrompt: CTR Models are Strong Prompt Generators for Adapting Language Models to CTR Prediction WWW 2024
Click-through rate (CTR) prediction has become increasingly indispensable for various Internet applications. Traditional CTR models convert the multi-field categorical data into ID features via one-hot encoding, and extract the collaborative signals among features. Such a paradigm suffers from the problem of semantic information loss. Another line of research explores the potential of pretrained language models (PLMs) for CTR prediction by converting input data into textual sentences through hard prompt templates. Although semantic signals are preserved, they generally fail to capture the collaborative information (e.g., feature interactions, pure ID features), not to mention the unacceptable inference overhead brought by the huge model size. In this paper, we aim to model both the semantic knowledge and collaborative knowledge for accurate CTR estimation, and meanwhile address the inference inefficiency issue. To benefit from both worlds and close their gaps, we propose a novel model-agnostic framework (i.e., ClickPrompt), where we incorporate CTR models to generate interaction-aware soft prompts for PLMs. We design a prompt-augmented masked language modeling (PA-MLM) pretraining task, where PLM has to recover the masked tokens based on the language context, as well as the soft prompts generated by CTR model. The collaborative and semantic knowledge from ID and textual features would be explicitly aligned and interacted via the prompt interface. Then, we can either tune the CTR model with PLM for superior performance, or solely tune the CTR model without PLM for inference efficiency. Experiments on four real-world datasets validate the effectiveness of ClickPrompt compared with existing baselines.
comment: Accepted by WWW 2024
♻ ☆ Super Tiny Language Models
The rapid advancement of large language models (LLMs) has led to significant improvements in natural language processing but also poses challenges due to their high computational and energy demands. This paper introduces a series of research efforts focused on Super Tiny Language Models (STLMs), which aim to deliver high performance with significantly reduced parameter counts. We explore innovative techniques such as byte-level tokenization with a pooling mechanism, weight tying, and efficient training strategies. These methods aim to significantly reduce reduce the parameter count compared to traditional models -- in future works, we aim to build on these in a way that maintains and improves upon the performance of base transformer models. This series of papers will explore into various subproblems, including tokenizer-free models, self-play based training, and alternative training objectives. We will target models with 10M, 50M, and 100M parameters. Our ultimate goal is to make high-performance language models more accessible and practical for a wide range of applications.
comment: 11 pages, 4 figures
♻ ☆ Efficient Data Learning for Open Information Extraction with Pre-trained Language Models
Open Information Extraction (OpenIE) is a fundamental yet challenging task in Natural Language Processing, which involves extracting all triples (subject, predicate, object) from a given sentence. While labeling-based methods have their merits, generation-based techniques offer unique advantages, such as the ability to generate tokens not present in the original sentence. However, these generation-based methods often require a significant amount of training data to learn the task form of OpenIE and substantial training time to overcome slow model convergence due to the order penalty. In this paper, we introduce a novel framework, OK-IE, that ingeniously transforms the task form of OpenIE into the pre-training task form of the T5 model, thereby reducing the need for extensive training data. Furthermore, we introduce an innovative concept of Anchor to control the sequence of model outputs, effectively eliminating the impact of order penalty on model convergence and significantly reducing training time. Experimental results indicate that, compared to previous SOTA methods, OK-IE requires only 1/100 of the training data (900 instances) and 1/120 of the training time (3 minutes) to achieve comparable results.
♻ ☆ Layer-Wise Quantization: A Pragmatic and Effective Method for Quantizing LLMs Beyond Integer Bit-Levels EMNLP
We present a simple variable quantization approach that quantizes different layers of a large language model (LLM) at different bit levels. Specifically, we quantize the most important layers to higher bit precision and less important layers to lower bits to achieve floating point quantization levels. We propose two effective strategies to measure the importance of layers within LLMs: the first measures the importance of a layer based on how different its output embeddings are from the input embeddings (the higher the better); the second estimates the importance of a layer using the number of layer weights that are much larger than average (the smaller the better). We show that quantizing different layers at varying bits according to our importance scores results in minimal performance drop with a far more compressed model size. Finally, we present several practical key takeaways from our variable layer-wise quantization experiments: (a) LLM performance under variable quantization remains close to the original model until 25-50% of layers are moved in lower quantization using our proposed ordering but only until 5-10% if moved using no specific ordering; (b) Quantizing LLMs to lower bits performs substantially better than pruning unless extreme quantization (2-bit) is used; and (c) Layer-wise quantization to lower bits works better in the case of larger LLMs with more layers compared to smaller LLMs with fewer layers. The code used to run the experiments is available at: https://github.com/RazvanDu/LayerwiseQuant.
comment: submitted to EMNLP, 15 pages, 10 figures, 4 tables
♻ ☆ Benchmarking General-Purpose In-Context Learning
In-context learning (ICL) empowers generative models to address new tasks effectively and efficiently on the fly, without relying on any artificially crafted optimization techniques. In this paper, we study extending ICL to address a broader range of tasks with an extended learning horizon and higher improvement potential, namely General-Purpose In-Context Learning (GPICL). To this end, we introduce two lightweight benchmarks specifically crafted to train and evaluate GPICL functionalities. Each benchmark encompasses a vast number of tasks characterized by significant task variance, facilitating meta-training that minimizes inductive bias. These tasks are also crafted to promote long-horizon in-context learning through continuous generation and interaction. These characteristics necessitate the models to leverage contexts and history interactions to enhance their capabilities, across domains such as language modeling, decision-making, and world modeling. Our experiments on the baseline models demonstrate that meta-training with minimal inductive bias and ICL from the ground up is feasible across all the domains we've discussed. Additionally, our findings indicate that the scale of parameters alone may not be crucial for ICL or GPICL, suggesting alternative approaches such as increasing the scale of contexts and memory states.
♻ ☆ SPHINX-X: Scaling Data and Parameters for a Family of Multi-modal Large Language Models ICML 2024
We propose SPHINX-X, an extensive Multimodality Large Language Model (MLLM) series developed upon SPHINX. To improve the architecture and training efficiency, we modify the SPHINX framework by removing redundant visual encoders, bypassing fully-padded sub-images with skip tokens, and simplifying multi-stage training into a one-stage all-in-one paradigm. To fully unleash the potential of MLLMs, we assemble a comprehensive multi-domain and multimodal dataset covering publicly available resources in language, vision, and vision-language tasks. We further enrich this collection with our curated OCR intensive and Set-of-Mark datasets, extending the diversity and generality. By training over different base LLMs including TinyLlama1.1B, InternLM2-7B, LLaMA2-13B, and Mixtral8x7B, we obtain a spectrum of MLLMs that vary in parameter size and multilingual capabilities. Comprehensive benchmarking reveals a strong correlation between the multi-modal performance with the data and parameter scales. Code and models are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory
comment: Accepted by ICML 2024. Code and models are released at https://github.com/Alpha-VLLM/LLaMA2-Accessory
♻ ☆ Transformer-based de novo peptide sequencing for data-independent acquisition mass spectrometry
Tandem mass spectrometry (MS/MS) stands as the predominant high-throughput technique for comprehensively analyzing protein content within biological samples. This methodology is a cornerstone driving the advancement of proteomics. In recent years, substantial strides have been made in Data-Independent Acquisition (DIA) strategies, facilitating impartial and non-targeted fragmentation of precursor ions. The DIA-generated MS/MS spectra present a formidable obstacle due to their inherent high multiplexing nature. Each spectrum encapsulates fragmented product ions originating from multiple precursor peptides. This intricacy poses a particularly acute challenge in de novo peptide/protein sequencing, where current methods are ill-equipped to address the multiplexing conundrum. In this paper, we introduce DiaTrans, a deep-learning model based on transformer architecture. It deciphers peptide sequences from DIA mass spectrometry data. Our results show significant improvements over existing STOA methods, including DeepNovo-DIA and PepNet. Casanovo-DIA enhances precision by 15.14% to 34.8%, recall by 11.62% to 31.94% at the amino acid level, and boosts precision by 59% to 81.36% at the peptide level. Integrating DIA data and our DiaTrans model holds considerable promise to uncover novel peptides and more comprehensive profiling of biological samples. Casanovo-DIA is freely available under the GNU GPL license at https://github.com/Biocomputing-Research-Group/DiaTrans.
comment: Ebrahimi S., Guo X. Transformer-based de novo peptide sequencing for data-independent acquisition mass spectrometry. In 2023 IEEE 23rd International Conference on Bioinformatics and Bioengineering (BIBE) 2022 Dec 6 (pp. 17-22). IEEE
♻ ☆ CDQuant: Accurate Post-training Weight Quantization of Large Pre-trained Models using Greedy Coordinate Descent
Large language models (LLMs) have recently demonstrated remarkable performance across diverse language tasks. But their deployment is often constrained by their substantial computational and storage requirements. Quantization has emerged as a key technique for addressing this challenge, enabling the compression of large models with minimal impact on performance. The recent GPTQ algorithm, a post-training quantization (PTQ) method, has proven highly effective for compressing LLMs, sparking a wave of research that leverages GPTQ as a core component. Recognizing the pivotal role of GPTQ in the PTQ landscape, we introduce CDQuant, a simple and scalable alternative to GPTQ with improved performance. CDQuant uses coordinate descent to minimize the layer-wise reconstruction loss to achieve high-quality quantized weights. Our algorithm is easy to implement and scales efficiently to models with hundreds of billions of parameters. Through extensive evaluation on the PaLM2 model family, we demonstrate that CDQuant consistently outperforms GPTQ across diverse model sizes and quantization levels. In particular, for INT2 quantization of PaLM2-Otter, CDQuant achieves a 10% reduction in perplexity compared to GPTQ.
♻ ☆ CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation ACL 2024
Since the natural language processing (NLP) community started to make large language models (LLMs) act as a critic to evaluate the quality of generated texts, most of the existing works train a critique generation model on the evaluation data labeled by GPT-4's direct prompting. We observe that these models lack the ability to generate informative critiques in both pointwise grading and pairwise comparison especially without references. As a result, their generated critiques cannot provide fine-grained distinguishability on generated texts, causing unsatisfactory evaluation performance. In this paper, we propose a simple yet effective method called Eval-Instruct, which can first acquire pointwise grading critiques with pseudo references and then revise these critiques via multi-path prompting to obtain informative evaluation data in different tasks and settings, including pointwise grading and pairwise comparison with / without references. After fine-tuning on these data, the resulting model CritiqueLLM is empirically shown to outperform ChatGPT and all the open-source baselines and even achieve comparable evaluation performance to GPT-4 in system-level correlations of pointwise grading. We also demonstrate that our generated critiques can act as scalable feedback to further improve the generation quality of strong LLMs like ChatGPT.
comment: Accepted by ACL 2024 (Main Conference)
♻ ☆ FaithLM: Towards Faithful Explanations for Large Language Models
Large Language Models (LLMs) have become proficient in addressing complex tasks by leveraging their extensive internal knowledge and reasoning capabilities. However, the black-box nature of these models complicates the task of explaining their decision-making processes. While recent advancements demonstrate the potential of leveraging LLMs to self-explain their predictions through natural language (NL) explanations, their explanations may not accurately reflect the LLMs' decision-making process due to a lack of fidelity optimization on the derived explanations. Measuring the fidelity of NL explanations is a challenging issue, as it is difficult to manipulate the input context to mask the semantics of these explanations. To this end, we introduce FaithLM to explain the decision of LLMs with NL explanations. Specifically, FaithLM designs a method for evaluating the fidelity of NL explanations by incorporating the contrary explanations to the query process. Moreover, FaithLM conducts an iterative process to improve the fidelity of derived explanations. Experiment results on three datasets from multiple domains demonstrate that FaithLM can significantly improve the fidelity of derived explanations, which also provides a better alignment with the ground-truth explanations.
♻ ☆ $\text{Alpha}^2$: Discovering Logical Formulaic Alphas using Deep Reinforcement Learning
Alphas are pivotal in providing signals for quantitative trading. The industry highly values the discovery of formulaic alphas for their interpretability and ease of analysis, compared with the expressive yet overfitting-prone black-box alphas. In this work, we focus on discovering formulaic alphas. Prior studies on automatically generating a collection of formulaic alphas were mostly based on genetic programming (GP), which is known to suffer from the problems of being sensitive to the initial population, converting to local optima, and slow computation speed. Recent efforts employing deep reinforcement learning (DRL) for alpha discovery have not fully addressed key practical considerations such as alpha correlations and validity, which are crucial for their effectiveness. In this work, we propose a novel framework for alpha discovery using DRL by formulating the alpha discovery process as program construction. Our agent, $\text{Alpha}^2$, assembles an alpha program optimized for an evaluation metric. A search algorithm guided by DRL navigates through the search space based on value estimates for potential alpha outcomes. The evaluation metric encourages both the performance and the diversity of alphas for a better final trading strategy. Our formulation of searching alphas also brings the advantage of pre-calculation dimensional analysis, ensuring the logical soundness of alphas, and pruning the vast search space to a large extent. Empirical experiments on real-world stock markets demonstrates $\text{Alpha}^2$'s capability to identify a diverse set of logical and effective alphas, which significantly improves the performance of the final trading strategy. The code of our method is available at https://github.com/x35f/alpha2.
♻ ☆ Implications of the AI Act for Non-Discrimination Law and Algorithmic Fairness
The topic of fairness in AI, as debated in the FATE (Fairness, Accountability, Transparency, and Ethics in AI) communities, has sparked meaningful discussions in the past years. However, from a legal perspective, particularly from the perspective of European Union law, many open questions remain. Whereas algorithmic fairness aims to mitigate structural inequalities at design-level, European non-discrimination law is tailored to individual cases of discrimination after an AI model has been deployed. The AI Act might present a tremendous step towards bridging these two approaches by shifting non-discrimination responsibilities into the design stage of AI models. Based on an integrative reading of the AI Act, we comment on legal as well as technical enforcement problems and propose practical implications on bias detection and bias correction in order to specify and comply with specific technical requirements.
♻ ☆ A Survey of Privacy-Preserving Model Explanations: Privacy Risks, Attacks, and Countermeasures
As the adoption of explainable AI (XAI) continues to expand, the urgency to address its privacy implications intensifies. Despite a growing corpus of research in AI privacy and explainability, there is little attention on privacy-preserving model explanations. This article presents the first thorough survey about privacy attacks on model explanations and their countermeasures. Our contribution to this field comprises a thorough analysis of research papers with a connected taxonomy that facilitates the categorisation of privacy attacks and countermeasures based on the targeted explanations. This work also includes an initial investigation into the causes of privacy leaks. Finally, we discuss unresolved issues and prospective research directions uncovered in our analysis. This survey aims to be a valuable resource for the research community and offers clear insights for those new to this domain. To support ongoing research, we have established an online resource repository, which will be continuously updated with new and relevant findings. Interested readers are encouraged to access our repository at https://github.com/tamlhp/awesome-privex.
comment: Revision
♻ ☆ Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code
In this work we systematically review the recent advancements in software engineering with language models, covering 70+ models, 40+ evaluation tasks, 180+ datasets, and 900 related works. Unlike previous works, we integrate software engineering (SE) with natural language processing (NLP) by discussing the perspectives of both sides: SE applies language models for development automation, while NLP adopts SE tasks for language model evaluation. We break down code processing models into general language models represented by the GPT family and specialized models that are specifically pretrained on code, often with tailored objectives. We discuss the relations and differences between these models, and highlight the historical transition of code modeling from statistical models and RNNs to pretrained Transformers and LLMs, which is exactly the same course that had been taken by NLP. We also go beyond programming and review LLMs' application in other software engineering activities including requirement engineering, testing, deployment, and operations in an endeavor to provide a global view of NLP in SE, and identify key challenges and potential future directions in this domain. We keep the survey open and updated on GitHub at https://github.com/codefuse-ai/Awesome-Code-LLM.
comment: Repo: https://github.com/codefuse-ai/Awesome-Code-LLM. 9 figures, 18 tables, and 902 references. Under review
♻ ☆ Truthful Aggregation of LLMs with an Application to Online Advertising
Online platforms generate hundreds of billions of dollars in revenue per year by showing advertisements alongside their own content. Currently, these platforms are integrating Large Language Models (LLMs) into their services. This makes revenue generation from LLM-generated content the next major challenge in online advertising. We consider a scenario where advertisers aim to influence the responses of an LLM to align with their interests, while platforms seek to maximize advertiser value and ensure user satisfaction. We introduce an auction mechanism for this problem that operates without LLM fine-tuning or access to model weights and provably converges to the output of the optimally fine-tuned LLM for the platform's objective as computational resources increase. Our mechanism ensures that truthful reporting is a dominant strategy for advertisers and it aligns each advertiser's utility with their contribution to social welfare - an essential feature for long-term viability. Additionally, it can incorporate contextual information about the advertisers, significantly accelerating convergence. Via experiments with a publicly available LLM, we show that our mechanism significantly boosts advertiser value and platform revenue, with low computational overhead. While our motivating application is online advertising, our mechanism can be applied in any setting with monetary transfers, making it a general-purpose solution for truthfully aggregating the preferences of self-interested agents over LLM-generated replies.
♻ ☆ Xmodel-LM Technical Report
We introduce Xmodel-LM, a compact and efficient 1.1B language model pre-trained on around 2 trillion tokens. Trained on our self-built dataset (Xdata), which balances Chinese and English corpora based on downstream task optimization, Xmodel-LM exhibits remarkable performance despite its smaller size. It notably surpasses existing open-source language models of similar scale. Our model checkpoints and code are publicly accessible on GitHub at https://github.com/XiaoduoAILab/XmodelLM.
♻ ☆ Generating Chain-of-Thoughts with a Pairwise-Comparison Approach to Searching for the Most Promising Intermediate Thought ICML 2024
To improve the ability of the large language model (LLMs) to tackle complex reasoning problems, chain-of-thoughts (CoT) methods were proposed to guide LLMs to reason step-by-step, enabling problem solving from simple to complex. State-of-the-art methods for generating such a chain involve interactive collaboration, where the learner generates candidate intermediate thoughts, evaluated by the LLM, guiding the generation of subsequent thoughts. However, a widespread yet understudied problem is that the evaluation from the LLM is typically noisy and unreliable, potentially misleading the generation process in selecting promising intermediate thoughts. In this paper, motivated by Vapnik's principle, we use pairwise-comparison evaluation instead of point-wise scoring to search for promising intermediate thoughts with the noisy feedback from the LLM. In each round, we randomly pair intermediate thoughts and directly prompt the LLM to select the more promising one from each pair, allowing us to identify the most promising thoughts through an iterative process. To further alleviate the noise in the comparison, we incorporate techniques from ensemble learning and dueling bandits, proposing two variants of the algorithm. Experiments on three real-world tasks demonstrate the effectiveness of our proposed algorithm and verify the rationale of the pairwise comparison mechanism.
comment: ICML 2024
♻ ☆ Enabling Large Language Models to Perform Power System Simulations with Previously Unseen Tools: A Case of Daline
The integration of experiment technologies with large language models (LLMs) is transforming scientific research, offering AI capabilities beyond specialized problem-solving to becoming research assistants for human scientists. In power systems, simulations are essential for research. However, LLMs face significant challenges in power system simulations due to limited pre-existing knowledge and the complexity of power grids. To address this issue, this work proposes a modular framework that integrates expertise from both the power system and LLM domains. This framework enhances LLMs' ability to perform power system simulations on previously unseen tools. Validated using 34 simulation tasks in Daline, a (optimal) power flow simulation and linearization toolbox not yet exposed to LLMs, the proposed framework improved GPT-4o's simulation coding accuracy from 0% to 96.07%, also outperforming the ChatGPT-4o web interface's 33.8% accuracy (with the entire knowledge base uploaded). These results highlight the potential of LLMs as research assistants in power systems.
comment: All the supplementary files mentioned in the manuscript will be open-source upon acceptance
♻ ☆ InterCLIP-MEP: Interactive CLIP and Memory-Enhanced Predictor for Multi-modal Sarcasm Detection
The prevalence of sarcasm in social media, conveyed through text-image combinations, presents significant challenges for sentiment analysis and intention mining. Current multi-modal sarcasm detection methods have been proven to struggle with biases from spurious cues, leading to a superficial understanding of the complex interactions between text and image. To address these issues, we propose InterCLIP-MEP, a robust framework for multi-modal sarcasm detection. InterCLIP-MEP introduces a refined variant of CLIP, Interactive CLIP (InterCLIP), as the backbone, enhancing sample representations by embedding cross-modality information in each encoder. Furthermore, a novel training strategy is designed to adapt InterCLIP for a Memory-Enhanced Predictor (MEP). MEP uses dynamic dual-channel memory to store valuable historical knowledge of test samples and then leverages this memory as a non-parametric classifier to derive the final prediction. By using InterCLIP to encode text-image interactions more effectively and incorporating MEP, InterCLIP-MEP offers a more robust recognition of multi-modal sarcasm. Experiments demonstrate that InterCLIP-MEP achieves state-of-the-art performance on the MMSD2.0 benchmark. Code and data are available at https://github.com/CoderChen01/InterCLIP-MEP.
comment: 8 pages, 6 figures, 6 tables
♻ ☆ Are Protein Language Models Compute Optimal? ICML 2024
While protein language models (pLMs) have transformed biological research, the scaling laws governing their improvement remain underexplored. By adapting methodologies from NLP scaling laws, we investigated the optimal ratio between model parameters and training tokens within a fixed compute budget. Our study reveals that pLM sizes scale sublinearly with compute budget, showing diminishing returns in performance as model size increases, and we identify a performance plateau in training loss comparable to the one found in relevant works in the field. Our findings suggest that widely-used pLMs might not be compute-optimal, indicating that larger models could achieve convergence more efficiently. Training a 35M model on a reduced token set, we attained perplexity results comparable to larger models like ESM-2 (15B) and xTrimoPGLM (100B) with a single dataset pass. This work paves the way towards more compute-efficient pLMs, democratizing their training and practical application in computational biology.
comment: Proceedings of the ICML 2024 Workshop on Accessible and Efficient Foundation Models for Biological Discovery, Vienna, Austria. 2024
♻ ☆ Pre-Calc: Learning to Use the Calculator Improves Numeracy in Language Models ICML 2024
Quantitative and numerical comprehension in language is an important task in many fields like education and finance, but still remains a challenging task for language models. While tool and calculator usage has shown to be helpful to improve mathematical reasoning in large pretrained decoder-only language models, this remains unexplored for smaller language models with encoders. In this paper, we propose Pre-Calc, a simple pre-finetuning objective of learning to use the calculator for both encoder-only and encoder-decoder architectures, formulated as a discriminative and generative task respectively. We pre-train BERT and RoBERTa for discriminative calculator use and Flan-T5 for generative calculator use on the MAWPS, SVAMP, and AsDiv-A datasets, which improves performance on downstream tasks that require numerical understanding. Our code and data are available at https://github.com/calc-cmu/pre-calc.
comment: AI4Math workshop, ICML 2024
♻ ☆ DeCoF: Generated Video Detection via Frame Consistency: The First Benchmark Dataset
The escalating quality of video generated by advanced video generation methods results in new security challenges, while there have been few relevant research efforts: 1) There is no open-source dataset for generated video detection, 2) No generated video detection method has been proposed so far. To this end, we propose an open-source dataset and a detection method for generated video for the first time. First, we propose a scalable dataset consisting of 964 prompts, covering various forgery targets, scenes, behaviors, and actions, as well as various generation models with different architectures and generation methods, including the most popular commercial models like OpenAI's Sora and Google's Veo. Second, we found via probing experiments that spatial artifact-based detectors lack generalizability. Hence, we propose a simple yet effective \textbf{de}tection model based on \textbf{f}rame \textbf{co}nsistency (\textbf{DeCoF}), which focuses on temporal artifacts by eliminating the impact of spatial artifacts during feature learning. Extensive experiments demonstrate the efficacy of DeCoF in detecting videos generated by unseen video generation models and confirm its powerful generalizability across several commercially proprietary models. Our code and dataset will be released at \url{https://github.com/wuwuwuyue/DeCoF}.
♻ ☆ Slot State Space Models
Recent State Space Models (SSMs) such as S4, S5, and Mamba have shown remarkable computational benefits in long-range temporal dependency modeling. However, in many sequence modeling problems, the underlying process is inherently modular and it is of interest to have inductive biases that mimic this modular structure. In this paper, we introduce SlotSSMs, a novel framework for incorporating independent mechanisms into SSMs to preserve or encourage separation of information. Unlike conventional SSMs that maintain a monolithic state vector, SlotSSMs maintains the state as a collection of multiple vectors called slots. Crucially, the state transitions are performed independently per slot with sparse interactions across slots implemented via the bottleneck of self-attention. In experiments, we evaluate our model in object-centric video understanding, 3D visual reasoning, and video prediction tasks, which involve modeling multiple objects and their long-range temporal dependencies. We find that our proposed design offers substantial performance gains over existing sequence modeling methods.
♻ ☆ Reinforcement Learning from Delayed Observations via World Models
In standard reinforcement learning settings, agents typically assume immediate feedback about the effects of their actions after taking them. However, in practice, this assumption may not hold true due to physical constraints and can significantly impact the performance of learning algorithms. In this paper, we address observation delays in partially observable environments. We propose leveraging world models, which have shown success in integrating past observations and learning dynamics, to handle observation delays. By reducing delayed POMDPs to delayed MDPs with world models, our methods can effectively handle partial observability, where existing approaches achieve sub-optimal performance or degrade quickly as observability decreases. Experiments suggest that one of our methods can outperform a naive model-based approach by up to 250%. Moreover, we evaluate our methods on visual delayed environments, for the first time showcasing delay-aware reinforcement learning continuous control with visual observations.
♻ ☆ Key-Element-Informed sLLM Tuning for Document Summarization
Remarkable advances in large language models (LLMs) have enabled high-quality text summarization. However, this capability is currently accessible only through LLMs of substantial size or proprietary LLMs with usage fees. In response, smaller-scale LLMs (sLLMs) of easy accessibility and low costs have been extensively studied, yet they often suffer from missing key information and entities, i.e., low relevance, in particular, when input documents are long. We hence propose a key-element-informed instruction tuning for summarization, so-called KEITSum, which identifies key elements in documents and instructs sLLM to generate summaries capturing these key elements. Experimental results on dialogue and news datasets demonstrate that sLLM with KEITSum indeed provides high-quality summarization with higher relevance and less hallucinations, competitive to proprietary LLM.
comment: Interspeech 2024
♻ ☆ The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources
Foundation model development attracts a rapidly expanding body of contributors, scientists, and applications. To help shape responsible development practices, we introduce the Foundation Model Development Cheatsheet: a growing collection of 250+ tools and resources spanning text, vision, and speech modalities. We draw on a large body of prior work to survey resources (e.g. software, documentation, frameworks, guides, and practical tools) that support informed data selection, processing, and understanding, precise and limitation-aware artifact documentation, efficient model training, advance awareness of the environmental impact from training, careful model evaluation of capabilities, risks, and claims, as well as responsible model release, licensing and deployment practices. We hope this curated collection of resources helps guide more responsible development. The process of curating this list, enabled us to review the AI development ecosystem, revealing what tools are critically missing, misused, or over-used in existing practices. We find that (i) tools for data sourcing, model evaluation, and monitoring are critically under-serving ethical and real-world needs, (ii) evaluations for model safety, capabilities, and environmental impact all lack reproducibility and transparency, (iii) text and particularly English-centric analyses continue to dominate over multilingual and multi-modal analyses, and (iv) evaluation of systems, rather than just models, is needed so that capabilities and impact are assessed in context.
♻ ☆ Multimodal Physiological Signals Representation Learning via Multiscale Contrasting for Depression Recognition
Depression recognition based on physiological signals such as functional near-infrared spectroscopy (fNIRS) and electroencephalogram (EEG) has made considerable progress. However, most existing studies ignore the complementarity and semantic consistency of multimodal physiological signals under the same stimulation task in complex spatio-temporal patterns. In this paper, we introduce a multimodal physiological signals representation learning framework using Siamese architecture via multiscale contrasting for depression recognition (MRLMC). First, fNIRS and EEG are transformed into different but correlated data based on a time-domain data augmentation strategy. Then, we design a spatio-temporal contrasting module to learn the representation of fNIRS and EEG through weight-sharing multiscale spatio-temporal convolution. Furthermore, to enhance the learning of semantic representation associated with stimulation tasks, a semantic consistency contrast module is proposed, aiming to maximize the semantic similarity of fNIRS and EEG. Extensive experiments on publicly available and self-collected multimodal physiological signals datasets indicate that MRLMC outperforms the state-of-the-art models. Moreover, our proposed framework is capable of transferring to multimodal time series downstream tasks.
♻ ☆ Iterative Reasoning Preference Optimization
Iterative preference optimization methods have recently been shown to perform well for general instruction tuning tasks, but typically make little improvement on reasoning tasks (Yuan et al., 2024, Chen et al., 2024). In this work we develop an iterative approach that optimizes the preference between competing generated Chain-of-Thought (CoT) candidates by optimizing for winning vs. losing reasoning steps that lead to the correct answer. We train using a modified DPO loss (Rafailov et al., 2023) with an additional negative log-likelihood term, which we find to be crucial. We show reasoning improves across repeated iterations of this scheme. While only relying on examples in the training set, our approach results in increasing accuracy on GSM8K, MATH, and ARC-Challenge for Llama-2-70B-Chat, outperforming other Llama-2-based models not relying on additionally sourced datasets. For example, we see a large improvement from 55.6% to 81.6% on GSM8K and an accuracy of 88.7% with majority voting out of 32 samples.
♻ ☆ Enhancing Explainability of Knowledge Learning Paths: Causal Knowledge Networks
A reliable knowledge structure is a prerequisite for building effective adaptive learning systems and intelligent tutoring systems. Pursuing an explainable and trustworthy knowledge structure, we propose a method for constructing causal knowledge networks. This approach leverages Bayesian networks as a foundation and incorporates causal relationship analysis to derive a causal network. Additionally, we introduce a dependable knowledge-learning path recommendation technique built upon this framework, improving teaching and learning quality while maintaining transparency in the decision-making process.
comment: 8 pages, 3 figures, Educational Data Mining 2024, Human-Centric eXplainable AI in Education
♻ ☆ A Survey on Human-AI Teaming with Large Pre-Trained Models
In the rapidly evolving landscape of artificial intelligence (AI), the collaboration between human intelligence and AI systems, known as Human-AI (HAI) Teaming, has emerged as a cornerstone for advancing problem-solving and decision-making processes. The advent of Large Pre-trained Models (LPtM) has significantly transformed this landscape, offering unprecedented capabilities by leveraging vast amounts of data to understand and predict complex patterns. This paper surveys the pivotal integration of LPtMs with HAI, emphasizing how these models enhance collaborative intelligence beyond traditional approaches. It examines the potential of LPtMs in augmenting human capabilities, discussing this collaboration for AI model improvements, effective teaming, ethical considerations, and their broad applied implications in various sectors. Through this exploration, the study sheds light on the transformative impact of LPtM-enhanced HAI Teaming, providing insights for future research, policy development, and strategic implementations aimed at harnessing the full potential of this collaboration for research and societal benefit.
♻ ☆ Latent diffusion models for parameterization and data assimilation of facies-based geomodels
Geological parameterization entails the representation of a geomodel using a small set of latent variables and a mapping from these variables to grid-block properties such as porosity and permeability. Parameterization is useful for data assimilation (history matching), as it maintains geological realism while reducing the number of variables to be determined. Diffusion models are a new class of generative deep-learning procedures that have been shown to outperform previous methods, such as generative adversarial networks, for image generation tasks. Diffusion models are trained to "denoise", which enables them to generate new geological realizations from input fields characterized by random noise. Latent diffusion models, which are the specific variant considered in this study, provide dimension reduction through use of a low-dimensional latent variable. The model developed in this work includes a variational autoencoder for dimension reduction and a U-net for the denoising process. Our application involves conditional 2D three-facies (channel-levee-mud) systems. The latent diffusion model is shown to provide realizations that are visually consistent with samples from geomodeling software. Quantitative metrics involving spatial and flow-response statistics are evaluated, and general agreement between the diffusion-generated models and reference realizations is observed. Stability tests are performed to assess the smoothness of the parameterization method. The latent diffusion model is then used for ensemble-based data assimilation. Two synthetic "true" models are considered. Significant uncertainty reduction, posterior P$_{10}$-P$_{90}$ forecasts that generally bracket observed data, and consistent posterior geomodels, are achieved in both cases.
comment: - Moved Table 1 from before to after Section 4.2 heading - Renamed output pdf file with paper title
♻ ☆ Nonparametric Strategy Test
We present a nonparametric statistical test for determining whether an agent is following a given mixed strategy in a repeated strategic-form game given samples of the agent's play. This involves two components: determining whether the agent's frequencies of pure strategies are sufficiently close to the target frequencies, and determining whether the pure strategies selected are independent between different game iterations. Our integrated test involves applying a chi-squared goodness of fit test for the first component and a generalized Wald-Wolfowitz runs test for the second component. The results from both tests are combined using Bonferroni correction to produce a complete test for a given significance level $\alpha.$ We applied the test to publicly available data of human rock-paper-scissors play. The data consists of 50 iterations of play for 500 human players. We test with a null hypothesis that the players are following a uniform random strategy independently at each game iteration. Using a significance level of $\alpha = 0.05$, we conclude that 305 (61%) of the subjects are following the target strategy.
♻ ☆ Source-Free Domain Adaptation with Diffusion-Guided Source Data Generation
This paper introduces a novel approach to leverage the generalizability of Diffusion Models for Source-Free Domain Adaptation (DM-SFDA). Our proposed DMSFDA method involves fine-tuning a pre-trained text-to-image diffusion model to generate source domain images using features from the target images to guide the diffusion process. Specifically, the pre-trained diffusion model is fine-tuned to generate source samples that minimize entropy and maximize confidence for the pre-trained source model. We then use a diffusion model-based image mixup strategy to bridge the domain gap between the source and target domains. We validate our approach through comprehensive experiments across a range of datasets, including Office-31, Office-Home, and VisDA. The results demonstrate significant improvements in SFDA performance, highlighting the potential of diffusion models in generating contextually relevant, domain-specific images.
comment: arXiv admin note: substantial text overlap with arXiv:2310.01701
♻ ☆ Outlier-Efficient Hopfield Layers for Large Transformer-Based Models ICML 2024
We introduce an Outlier-Efficient Modern Hopfield Model (termed $\mathrm{OutEffHop}$) and use it to address the outlier inefficiency problem of {training} gigantic transformer-based models. Our main contribution is a novel associative memory model facilitating \textit{outlier-efficient} associative memory retrievals. Interestingly, this memory model manifests a model-based interpretation of an outlier-efficient attention mechanism (${\rm Softmax}_1$): it is an approximation of the memory retrieval process of $\mathrm{OutEffHop}$. Methodologically, this allows us to introduce novel outlier-efficient Hopfield layers as powerful alternatives to traditional attention mechanisms, with superior post-quantization performance. Theoretically, the Outlier-Efficient Modern Hopfield Model retains and improves the desirable properties of standard modern Hopfield models, including fixed point convergence and exponential storage capacity. Empirically, we demonstrate the efficacy of the proposed model across large-scale transformer-based and Hopfield-based models (including BERT, OPT, ViT, and STanHop-Net), benchmarking against state-of-the-art methods like $\mathtt{Clipped\_Softmax}$ and $\mathtt{Gated\_Attention}$. Notably, $\mathrm{OutEffHop}$ achieves an average reduction of 22+\% in average kurtosis and 26+\% in the maximum infinity norm of model outputs across four models. Code is available at \href{https://github.com/MAGICS-LAB/OutEffHop}{GitHub}; models are on \href{https://huggingface.co/collections/magicslabnu/outeffhop-6610fcede8d2cda23009a98f}{Hugging Face Hub}; future updates are on \href{https://arxiv.org/abs/2404.03828}{arXiv}.
comment: Accepted at ICML 2024; v2 updated to camera-ready version; Code available at https://github.com/MAGICS-LAB/OutEffHop; Models are on Hugging Face: https://huggingface.co/collections/magicslabnu/outeffhop-6610fcede8d2cda23009a98f
♻ ☆ Decoding the AI Pen: Techniques and Challenges in Detecting AI-Generated Text
Large Language Models (LLMs) have revolutionized the field of Natural Language Generation (NLG) by demonstrating an impressive ability to generate human-like text. However, their widespread usage introduces challenges that necessitate thoughtful examination, ethical scrutiny, and responsible practices. In this study, we delve into these challenges, explore existing strategies for mitigating them, with a particular emphasis on identifying AI-generated text as the ultimate solution. Additionally, we assess the feasibility of detection from a theoretical perspective and propose novel research directions to address the current limitations in this domain.
♻ ☆ AutoProSAM: Automated Prompting SAM for 3D Multi-Organ Segmentation
Segment Anything Model (SAM) is one of the pioneering prompt-based foundation models for image segmentation and has been rapidly adopted for various medical imaging applications. However, in clinical settings, creating effective prompts is notably challenging and time-consuming, requiring the expertise of domain specialists such as physicians. This requirement significantly diminishes SAM's primary advantage - its interactive capability with end users - in medical applications. Moreover, recent studies have indicated that SAM, originally designed for 2D natural images, performs sub optimally on 3D medical image segmentation tasks. This subpar performance is attributed to the domain gaps between natural and medical images and the disparities in spatial arrangements between 2D and 3D images, particularly in multi-organ segmentation applications. To overcome these challenges, we present a novel technique termed AutoProSAM. This method automates 3D multi-organ CT-based segmentation by leveraging SAM's foundational model capabilities without relying on domain experts for prompts. The approach utilizes parameter-efficient adaptation techniques to adapt SAM for 3D medical imagery and incorporates an effective automatic prompt learning paradigm specific to this domain. By eliminating the need for manual prompts, it enhances SAM's capabilities for 3D medical image segmentation and achieves state-of-the-art (SOTA) performance in CT-based multi-organ segmentation tasks.
♻ ☆ Theory of Mind for Multi-Agent Collaboration via Large Language Models EMNLP 2023
While Large Language Models (LLMs) have demonstrated impressive accomplishments in both reasoning and planning, their abilities in multi-agent collaborations remains largely unexplored. This study evaluates LLM-based agents in a multi-agent cooperative text game with Theory of Mind (ToM) inference tasks, comparing their performance with Multi-Agent Reinforcement Learning (MARL) and planning-based baselines. We observed evidence of emergent collaborative behaviors and high-order Theory of Mind capabilities among LLM-based agents. Our results reveal limitations in LLM-based agents' planning optimization due to systematic failures in managing long-horizon contexts and hallucination about the task state. We explore the use of explicit belief state representations to mitigate these issues, finding that it enhances task performance and the accuracy of ToM inferences for LLM-based agents.
comment: Accepted to EMNLP 2023 (Main Conference). Code available at https://github.com/romanlee6/multi_LLM_comm
♻ ☆ Disentangled Representation Learning
Disentangled Representation Learning (DRL) aims to learn a model capable of identifying and disentangling the underlying factors hidden in the observable data in representation form. The process of separating underlying factors of variation into variables with semantic meaning benefits in learning explainable representations of data, which imitates the meaningful understanding process of humans when observing an object or relation. As a general learning strategy, DRL has demonstrated its power in improving the model explainability, controlability, robustness, as well as generalization capacity in a wide range of scenarios such as computer vision, natural language processing, and data mining. In this article, we comprehensively investigate DRL from various aspects including motivations, definitions, methodologies, evaluations, applications, and model designs. We first present two well-recognized definitions, i.e., Intuitive Definition and Group Theory Definition for disentangled representation learning. We further categorize the methodologies for DRL into four groups from the following perspectives, the model type, representation structure, supervision signal, and independence assumption. We also analyze principles to design different DRL models that may benefit different tasks in practical applications. Finally, we point out challenges in DRL as well as potential research directions deserving future investigations. We believe this work may provide insights for promoting the DRL research in the community.
comment: Accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence
♻ ☆ Leveraging Large Language Models for Software Model Completion: Results from Industrial and Public Datasets
Modeling structure and behavior of software systems plays a crucial role in the industrial practice of software engineering. As with other software engineering artifacts, software models are subject to evolution. Supporting modelers in evolving software models with recommendations for model completions is still an open problem, though. In this paper, we explore the potential of large language models for this task. In particular, we propose an approach, retrieval-augmented generation, leveraging large language models, model histories, and retrieval-augmented generation for model completion. Through experiments on three datasets, including an industrial application, one public open-source community dataset, and one controlled collection of simulated model repositories, we evaluate the potential of large language models for model completion with retrieval-augmented generation. We found that large language models are indeed a promising technology for supporting software model evolution (62.30% semantically correct completions on real-world industrial data and up to 86.19% type-correct completions). The general inference capabilities of large language models are particularly useful when dealing with concepts for which there are few, noisy, or no examples at all.
♻ ☆ Empirical Analysis of Fictitious Play for Nash Equilibrium Computation in Multiplayer Games
While fictitious play is guaranteed to converge to Nash equilibrium in certain game classes, such as two-player zero-sum games, it is not guaranteed to converge in non-zero-sum and multiplayer games. We show that fictitious play in fact leads to improved Nash equilibrium approximation over a variety of game classes and sizes than (counterfactual) regret minimization, which has recently produced superhuman play for multiplayer poker. We also show that when fictitious play is run several times using random initializations it is able to solve several known challenge problems in which the standard version is known to not converge, including Shapley's classic counterexample. These provide some of the first positive results for fictitious play in these settings, despite the fact that worst-case theoretical results are negative.
comment: Subsumes 2020 article "Fictitious Play Outperforms Counterfactual Regret Minimization" and 2022 article "Random Initialization Solves Shapley's Fictitious Play Counterexample."
Information Retrieval 14
☆ UniRec: A Dual Enhancement of Uniformity and Frequency in Sequential Recommendations
Representation learning in sequential recommendation is critical for accurately modeling user interaction patterns and improving recommendation precision. However, existing approaches predominantly emphasize item-to-item transitions, often neglecting the time intervals between interactions, which are closely related to behavior pattern changes. Additionally, broader interaction attributes, such as item frequency, are frequently overlooked. We found that both sequences with more uniform time intervals and items with higher frequency yield better prediction performance. Conversely, non-uniform sequences exacerbate user interest drift and less-frequent items are difficult to model due to sparse sampling, presenting unique challenges inadequately addressed by current methods. In this paper, we propose UniRec, a novel bidirectional enhancement sequential recommendation method. UniRec leverages sequence uniformity and item frequency to enhance performance, particularly improving the representation of non-uniform sequences and less-frequent items. These two branches mutually reinforce each other, driving comprehensive performance optimization in complex sequential recommendation scenarios. Additionally, we present a multidimensional time module to further enhance adaptability. To the best of our knowledge, UniRec is the first method to utilize the characteristics of uniformity and frequency for feature augmentation. Comparing with eleven advanced models across four datasets, we demonstrate that UniRec outperforms SOTA models significantly. The code is available at https://github.com/Linxi000/UniRec.
comment: 15 pages, 8 figures, for source code, see https://github.com/Linxi000/UniRec
☆ The Effects of Data Split Strategies on the Offline Experiments for CTR Prediction
Click-through rate (CTR) prediction is a crucial task in online advertising to recommend products that users are likely to be interested in. To identify the best-performing models, rigorous model evaluation is necessary. Offline experimentation plays a significant role in selecting models for live user-item interactions, despite the value of online experimentation like A/B testing, which has its own limitations and risks. Often, the correlation between offline performance metrics and actual online model performance is inadequate. One main reason for this discrepancy is the common practice of using random splits to create training, validation, and test datasets in CTR prediction. In contrast, real-world CTR prediction follows a temporal order. Therefore, the methodology used in offline evaluation, particularly the data splitting strategy, is crucial. This study aims to address the inconsistency between current offline evaluation methods and real-world use cases, by focusing on data splitting strategies. To examine the impact of different data split strategies on offline performance, we conduct extensive experiments using both random and temporal splits on a large open benchmark dataset, Criteo.
☆ Effects of Using Synthetic Data on Deep Recommender Models' Performance
Recommender systems are essential for enhancing user experiences by suggesting items based on individual preferences. However, these systems frequently face the challenge of data imbalance, characterized by a predominance of negative interactions over positive ones. This imbalance can result in biased recommendations favoring popular items. This study investigates the effectiveness of synthetic data generation in addressing data imbalances within recommender systems. Six different methods were used to generate synthetic data. Our experimental approach involved generating synthetic data using these methods and integrating the generated samples into the original dataset. Our results show that the inclusion of generated negative samples consistently improves the Area Under the Curve (AUC) scores. The significant impact of synthetic negative samples highlights the potential of data augmentation strategies to address issues of data sparsity and imbalance, ultimately leading to improved performance of recommender systems.
☆ Improving the Consistency in Cross-Lingual Cross-Modal Retrieval with 1-to-K Contrastive Learning KDD 2024
Cross-lingual Cross-modal Retrieval (CCR) is an essential task in web search, which aims to break the barriers between modality and language simultaneously and achieves image-text retrieval in the multi-lingual scenario with a single model. In recent years, excellent progress has been made based on cross-lingual cross-modal pre-training; particularly, the methods based on contrastive learning on large-scale data have significantly improved retrieval tasks. However, these methods directly follow the existing pre-training methods in the cross-lingual or cross-modal domain, leading to two problems of inconsistency in CCR: The methods with cross-lingual style suffer from the intra-modal error propagation, resulting in inconsistent recall performance across languages in the whole dataset. The methods with cross-modal style suffer from the inter-modal optimization direction bias, resulting in inconsistent rank across languages within each instance, which cannot be reflected by Recall@K. To solve these problems, we propose a simple but effective 1-to-K contrastive learning method, which treats each language equally and eliminates error propagation and optimization bias. In addition, we propose a new evaluation metric, Mean Rank Variance (MRV), to reflect the rank inconsistency across languages within each instance. Extensive experiments on four CCR datasets show that our method improves both recall rates and MRV with smaller-scale pre-trained data, achieving the new state-of-art.
comment: Accepted by KDD 2024 Research Track
☆ Concordance in basal cell carcinoma diagnosis. Building a proper ground truth to train Artificial Intelligence tools
Background: The existence of different basal cell carcinoma (BCC) clinical criteria cannot be objectively validated. An adequate ground-truth is needed to train an artificial intelligence (AI) tool that explains the BCC diagnosis by providing its dermoscopic features. Objectives: To determine the consensus among dermatologists on dermoscopic criteria of 204 BCC. To analyze the performance of an AI tool when the ground-truth is inferred. Methods: A single center, diagnostic and prospective study was conducted to analyze the agreement in dermoscopic criteria by four dermatologists and then derive a reference standard. 1434 dermoscopic images have been used, that were taken by a primary health physician, sent via teledermatology, and diagnosed by a dermatologist. They were randomly selected from the teledermatology platform (2019-2021). 204 of them were tested with an AI tool; the remainder trained it. The performance of the AI tool trained using the ground-truth of one dermatologist versus the ground-truth statistically inferred from the consensus of four dermatologists was analyzed using McNemar's test and Hamming distance. Results: Dermatologists achieve perfect agreement in the diagnosis of BCC (Fleiss-Kappa=0.9079), and a high correlation with the biopsy (PPV=0.9670). However, there is low agreement in detecting some dermoscopic criteria. Statistical differences were found in the performance of the AI tool trained using the ground-truth of one dermatologist versus the ground-truth statistically inferred from the consensus of four dermatologists. Conclusions: Care should be taken when training an AI tool to determine the BCC patterns present in a lesion. Ground-truth should be established from multiple dermatologists.
comment: Manuscript word count: 3000, Number of figures: 2, Number of tables: 3
☆ Knowledge Graph Enhanced Retrieval-Augmented Generation for Failure Mode and Effects Analysis
Failure mode and effects analysis (FMEA) is a critical tool for mitigating potential failures, particular during ramp-up phases of new products. However, its effectiveness is often limited by the missing reasoning capabilities of the FMEA tools, which are usually tabular structured. Meanwhile, large language models (LLMs) offer novel prospects for fine-tuning on custom datasets for reasoning within FMEA contexts. However, LLMs face challenges in tasks that require factual knowledge, a gap that retrieval-augmented generation (RAG) approaches aim to fill. RAG retrieves information from a non-parametric data store and uses a language model to generate responses. Building on this idea, we propose to advance the non-parametric data store with a knowledge graph (KG). By enhancing the RAG framework with a KG, our objective is to leverage analytical and semantic question-answering capabilities on FMEA data. This paper contributes by presenting a new ontology for FMEA observations, an algorithm for creating vector embeddings from the FMEA KG, and a KG enhanced RAG framework. Our approach is validated through a human study and we measure the performance of the context retrieval recall and precision.
☆ A Stem-Agnostic Single-Decoder System for Music Source Separation Beyond Four Stems
Despite significant recent progress across multiple subtasks of audio source separation, few music source separation systems support separation beyond the four-stem vocals, drums, bass, and other (VDBO) setup. Of the very few current systems that support source separation beyond this setup, most continue to rely on an inflexible decoder setup that can only support a fixed pre-defined set of stems. Increasing stem support in these inflexible systems correspondingly requires increasing computational complexity, rendering extensions of these systems computationally infeasible for long-tail instruments. In this work, we propose Banquet, a system that allows source separation of multiple stems using just one decoder. A bandsplit source separation model is extended to work in a query-based setup in tandem with a music instrument recognition PaSST model. On the MoisesDB dataset, Banquet, at only 24.9 M trainable parameters, approached the performance level of the significantly more complex 6-stem Hybrid Transformer Demucs on VDBO stems and outperformed it on guitar and piano. The query-based setup allows for the separation of narrow instrument classes such as clean acoustic guitars, and can be successfully applied to the extraction of less common stems such as reeds and organs. Implementation is available at https://github.com/kwatcharasupat/query-bandit.
comment: Submitted to the 25th International Society for Music Information Retrieval Conference (ISMIR 2024)
☆ Re-Ranking Step by Step: Investigating Pre-Filtering for Re-Ranking with Large Language Models
Large Language Models (LLMs) have been revolutionizing a myriad of natural language processing tasks with their diverse zero-shot capabilities. Indeed, existing work has shown that LLMs can be used to great effect for many tasks, such as information retrieval (IR), and passage ranking. However, current state-of-the-art results heavily lean on the capabilities of the LLM being used. Currently, proprietary, and very large LLMs such as GPT-4 are the highest performing passage re-rankers. Hence, users without the resources to leverage top of the line LLMs, or ones that are closed source, are at a disadvantage. In this paper, we investigate the use of a pre-filtering step before passage re-ranking in IR. Our experiments show that by using a small number of human generated relevance scores, coupled with LLM relevance scoring, it is effectively possible to filter out irrelevant passages before re-ranking. Our experiments also show that this pre-filtering then allows the LLM to perform significantly better at the re-ranking task. Indeed, our results show that smaller models such as Mixtral can become competitive with much larger proprietary models (e.g., ChatGPT and GPT-4).
♻ ☆ Generate then Retrieve: Conversational Response Retrieval Using LLMs as Answer and Query Generators
CIS is a prominent area in IR which focuses on developing interactive knowledge assistants. These systems must adeptly comprehend the user's information requirements within the conversational context and retrieve the relevant information. To this aim, the existing approaches model the user's information needs by generating a single query rewrite or a single representation of the query in the query space embedding. However, to answer complex questions, a single query rewrite or representation is often ineffective. To address this, a system needs to do reasoning over multiple passages. In this work, we propose using a generate-then-retrieve approach to improve the passage retrieval performance for complex user queries. In this approach, we utilize large language models (LLMs) to (i) generate an initial answer to the user's information need by doing reasoning over the context of the conversation, and (ii) ground this answer to the collection. Based on the experiments, our proposed approach significantly improves the retrieval performance on TREC iKAT 23, TREC CAsT 20 and 22 datasets, under various setups. Also, we show that grounding the LLM's answer requires more than one searchable query, where an average of 3 queries outperforms human rewrites.
♻ ☆ General Distribution Learning: A theoretical framework for Deep Learning
There remain numerous unanswered research questions on deep learning (DL) within the classical learning theory framework. These include the remarkable generalization capabilities of overparametrized neural networks (NNs), the efficient optimization performance despite non-convexity of objectives, the mechanism of flat minima for generalization, and the exceptional performance of deep architectures in solving physical problems. This paper introduces General Distribution Learning (GD Learning), a novel theoretical learning framework designed to address a comprehensive range of machine learning and statistical tasks, including classification, regression and parameter estimation. Departing from traditional statistical machine learning, GD Learning focuses on the true underlying distribution. In GD Learning, learning error, corresponding to the expected error in classical statistical learning framework, is divided into fitting errors due to models and algorithms, as well as sampling errors introduced by limited sampling data. The framework significantly incorporates prior knowledge, especially in scenarios characterized by data scarcity, thereby enhancing performance. Within the GD Learning framework, we demonstrate that the global optimal solutions in non-convex optimization can be approached by minimizing the gradient norm and the non-uniformity of the eigenvalues of the model's Jacobian matrix. This insight leads to the development of the gradient structure control algorithm. GD Learning also offers fresh insights into the questions on deep learning, including overparameterization and non-convex optimization, bias-variance trade-off, and the mechanism of flat minima.
comment: arXiv admin note: text overlap with arXiv:2105.04026 by other authors. arXiv admin note: text overlap with arXiv:2105.04026 by other authors
♻ ☆ Jina CLIP: Your CLIP Model Is Also Your Text Retriever ICML2024
Contrastive Language-Image Pretraining (CLIP) is widely used to train models to align images and texts in a common embedding space by mapping them to fixed-sized vectors. These models are key to multimodal information retrieval and related tasks. However, CLIP models generally underperform in text-only tasks compared to specialized text models. This creates inefficiencies for information retrieval systems that keep separate embeddings and models for text-only and multimodal tasks. We propose a novel, multi-task contrastive training method to address this issue, which we use to train the jina-clip-v1 model to achieve the state-of-the-art performance on both text-image and text-text retrieval tasks.
comment: 4 pages, MFM-EAI@ICML2024
♻ ☆ ClickPrompt: CTR Models are Strong Prompt Generators for Adapting Language Models to CTR Prediction WWW 2024
Click-through rate (CTR) prediction has become increasingly indispensable for various Internet applications. Traditional CTR models convert the multi-field categorical data into ID features via one-hot encoding, and extract the collaborative signals among features. Such a paradigm suffers from the problem of semantic information loss. Another line of research explores the potential of pretrained language models (PLMs) for CTR prediction by converting input data into textual sentences through hard prompt templates. Although semantic signals are preserved, they generally fail to capture the collaborative information (e.g., feature interactions, pure ID features), not to mention the unacceptable inference overhead brought by the huge model size. In this paper, we aim to model both the semantic knowledge and collaborative knowledge for accurate CTR estimation, and meanwhile address the inference inefficiency issue. To benefit from both worlds and close their gaps, we propose a novel model-agnostic framework (i.e., ClickPrompt), where we incorporate CTR models to generate interaction-aware soft prompts for PLMs. We design a prompt-augmented masked language modeling (PA-MLM) pretraining task, where PLM has to recover the masked tokens based on the language context, as well as the soft prompts generated by CTR model. The collaborative and semantic knowledge from ID and textual features would be explicitly aligned and interacted via the prompt interface. Then, we can either tune the CTR model with PLM for superior performance, or solely tune the CTR model without PLM for inference efficiency. Experiments on four real-world datasets validate the effectiveness of ClickPrompt compared with existing baselines.
comment: Accepted by WWW 2024
♻ ☆ Continual Collaborative Distillation for Recommender System KDD 2024
Knowledge distillation (KD) has emerged as a promising technique for addressing the computational challenges associated with deploying large-scale recommender systems. KD transfers the knowledge of a massive teacher system to a compact student model, to reduce the huge computational burdens for inference while retaining high accuracy. The existing KD studies primarily focus on one-time distillation in static environments, leaving a substantial gap in their applicability to real-world scenarios dealing with continuously incoming users, items, and their interactions. In this work, we delve into a systematic approach to operating the teacher-student KD in a non-stationary data stream. Our goal is to enable efficient deployment through a compact student, which preserves the high performance of the massive teacher, while effectively adapting to continuously incoming data. We propose Continual Collaborative Distillation (CCD) framework, where both the teacher and the student continually and collaboratively evolve along the data stream. CCD facilitates the student in effectively adapting to new data, while also enabling the teacher to fully leverage accumulated knowledge. We validate the effectiveness of CCD through extensive quantitative, ablative, and exploratory experiments on two real-world datasets. We expect this research direction to contribute to narrowing the gap between existing KD studies and practical applications, thereby enhancing the applicability of KD in real-world systems.
comment: Accepted by KDD 2024 research track. 9 main pages + 1 appendix page, 5 figures
♻ ☆ Retrieval Augmented Zero-Shot Text Classification SIGIR
Zero-shot text learning enables text classifiers to handle unseen classes efficiently, alleviating the need for task-specific training data. A simple approach often relies on comparing embeddings of query (text) to those of potential classes. However, the embeddings of a simple query sometimes lack rich contextual information, which hinders the classification performance. Traditionally, this has been addressed by improving the embedding model with expensive training. We introduce QZero, a novel training-free knowledge augmentation approach that reformulates queries by retrieving supporting categories from Wikipedia to improve zero-shot text classification performance. Our experiments across six diverse datasets demonstrate that QZero enhances performance for state-of-the-art static and contextual embedding models without the need for retraining. Notably, in News and medical topic classification tasks, QZero improves the performance of even the largest OpenAI embedding model by at least 5% and 3%, respectively. Acting as a knowledge amplifier, QZero enables small word embedding models to achieve performance levels comparable to those of larger contextual models, offering the potential for significant computational savings. Additionally, QZero offers meaningful insights that illuminate query context and verify topic relevance, aiding in understanding model predictions. Overall, QZero improves embedding-based zero-shot classifiers while maintaining their simplicity. This makes it particularly valuable for resource-constrained environments and domains with constantly evolving information.
comment: Proceedings of the 2024 ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR '24), July 13, 2024, Washington DC, DC, USA
Multimedia 3
☆ Improving the Consistency in Cross-Lingual Cross-Modal Retrieval with 1-to-K Contrastive Learning KDD 2024
Cross-lingual Cross-modal Retrieval (CCR) is an essential task in web search, which aims to break the barriers between modality and language simultaneously and achieves image-text retrieval in the multi-lingual scenario with a single model. In recent years, excellent progress has been made based on cross-lingual cross-modal pre-training; particularly, the methods based on contrastive learning on large-scale data have significantly improved retrieval tasks. However, these methods directly follow the existing pre-training methods in the cross-lingual or cross-modal domain, leading to two problems of inconsistency in CCR: The methods with cross-lingual style suffer from the intra-modal error propagation, resulting in inconsistent recall performance across languages in the whole dataset. The methods with cross-modal style suffer from the inter-modal optimization direction bias, resulting in inconsistent rank across languages within each instance, which cannot be reflected by Recall@K. To solve these problems, we propose a simple but effective 1-to-K contrastive learning method, which treats each language equally and eliminates error propagation and optimization bias. In addition, we propose a new evaluation metric, Mean Rank Variance (MRV), to reflect the rank inconsistency across languages within each instance. Extensive experiments on four CCR datasets show that our method improves both recall rates and MRV with smaller-scale pre-trained data, achieving the new state-of-art.
comment: Accepted by KDD 2024 Research Track
☆ A Study on Synthesizing Expressive Violin Performances: Approaches and Comparisons
Expressive music synthesis (EMS) for violin performance is a challenging task due to the disagreement among music performers in the interpretation of expressive musical terms (EMTs), scarcity of labeled recordings, and limited generalization ability of the synthesis model. These challenges create trade-offs between model effectiveness, diversity of generated results, and controllability of the synthesis system, making it essential to conduct a comparative study on EMS model design. This paper explores two violin EMS approaches. The end-to-end approach is a modification of a state-of-the-art text-to-speech generator. The parameter-controlled approach is based on a simple parameter sampling process that can render note lengths and other parameters compatible with MIDI-DDSP. We study these two approaches (in total, three model variants) through objective and subjective experiments and discuss several key issues of EMS based on the results.
comment: 15 pages, 2 figures, 3 tables
♻ ☆ ObjFormer: Learning Land-Cover Changes From Paired OSM Data and Optical High-Resolution Imagery via Object-Guided Transformer
Optical high-resolution imagery and OSM data are two important data sources of change detection (CD). Previous related studies focus on utilizing the information in OSM data to aid the CD on optical high-resolution images. This paper pioneers the direct detection of land-cover changes utilizing paired OSM data and optical imagery, thereby expanding the scope of CD tasks. To this end, we propose an object-guided Transformer (ObjFormer) by naturally combining the object-based image analysis (OBIA) technique with the advanced vision Transformer architecture. This combination can significantly reduce the computational overhead in the self-attention module without adding extra parameters or layers. ObjFormer has a hierarchical pseudo-siamese encoder consisting of object-guided self-attention modules that extracts multi-level heterogeneous features from OSM data and optical images; a decoder consisting of object-guided cross-attention modules can recover land-cover changes from the extracted heterogeneous features. Beyond basic binary change detection, this paper raises a new semi-supervised semantic change detection task that does not require any manually annotated land-cover labels to train semantic change detectors. Two lightweight semantic decoders are added to ObjFormer to accomplish this task efficiently. A converse cross-entropy loss is designed to fully utilize negative samples, contributing to the great performance improvement in this task. A large-scale benchmark dataset called OpenMapCD containing 1,287 samples covering 40 regions on six continents is constructed to conduct detailed experiments. The results show the effectiveness of our methods in this new kind of CD task. Additionally, case studies in Japanese cities demonstrate the framework's generalizability and practical potential. The OpenMapCD and source code are available in https://github.com/ChenHongruixuan/ObjFormer
comment: Accepted by IEEE TGRS
Computer Vision and Pattern Recognition 142
☆ Text-Animator: Controllable Visual Text Video Generation
Video generation is a challenging yet pivotal task in various industries, such as gaming, e-commerce, and advertising. One significant unresolved aspect within T2V is the effective visualization of text within generated videos. Despite the progress achieved in Text-to-Video~(T2V) generation, current methods still cannot effectively visualize texts in videos directly, as they mainly focus on summarizing semantic scene information, understanding, and depicting actions. While recent advances in image-level visual text generation show promise, transitioning these techniques into the video domain faces problems, notably in preserving textual fidelity and motion coherence. In this paper, we propose an innovative approach termed Text-Animator for visual text video generation. Text-Animator contains a text embedding injection module to precisely depict the structures of visual text in generated videos. Besides, we develop a camera control module and a text refinement module to improve the stability of generated visual text by controlling the camera movement as well as the motion of visualized text. Quantitative and qualitative experimental results demonstrate the superiority of our approach to the accuracy of generated visual text over state-of-the-art video generation methods. The project page can be found at https://laulampaul.github.io/text-animator.html.
comment: Project Page: https://laulampaul.github.io/text-animator.html
☆ Fast and Uncertainty-Aware SVBRDF Recovery from Multi-View Capture using Frequency Domain Analysis
Relightable object acquisition is a key challenge in simplifying digital asset creation. Complete reconstruction of an object typically requires capturing hundreds to thousands of photographs under controlled illumination, with specialized equipment. The recent progress in differentiable rendering improved the quality and accessibility of inverse rendering optimization. Nevertheless, under uncontrolled illumination and unstructured viewpoints, there is no guarantee that the observations contain enough information to reconstruct the appearance properties of the captured object. We thus propose to consider the acquisition process from a signal-processing perspective. Given an object's geometry and a lighting environment, we estimate the properties of the materials on the object's surface in seconds. We do so by leveraging frequency domain analysis, considering the recovery of material properties as a deconvolution, enabling fast error estimation. We then quantify the uncertainty of the estimation, based on the available data, highlighting the areas for which priors or additional samples would be required for improved acquisition quality. We compare our approach to previous work and quantitatively evaluate our results, showing similar quality as previous work in a fraction of the time, and providing key information about the certainty of the results.
comment: Project page: https://brdf-uncertainty.github.io
☆ MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning
Multi-modal large language models (MLLMs) have made significant strides in various visual understanding tasks. However, the majority of these models are constrained to process low-resolution images, which limits their effectiveness in perception tasks that necessitate detailed visual information. In our study, we present MG-LLaVA, an innovative MLLM that enhances the model's visual processing capabilities by incorporating a multi-granularity vision flow, which includes low-resolution, high-resolution, and object-centric features. We propose the integration of an additional high-resolution visual encoder to capture fine-grained details, which are then fused with base visual features through a Conv-Gate fusion network. To further refine the model's object recognition abilities, we incorporate object-level features derived from bounding boxes identified by offline detectors. Being trained solely on publicly available multimodal data through instruction tuning, MG-LLaVA demonstrates exceptional perception skills. We instantiate MG-LLaVA with a wide variety of language encoders, ranging from 3.8B to 34B, to evaluate the model's performance comprehensively. Extensive evaluations across multiple benchmarks demonstrate that MG-LLaVA outperforms existing MLLMs of comparable parameter sizes, showcasing its remarkable efficacy. The code will be available at https://github.com/PhoenixZ810/MG-LLaVA.
☆ DiffusionPDE: Generative PDE-Solving Under Partial Observation
We introduce a general framework for solving partial differential equations (PDEs) using generative diffusion models. In particular, we focus on the scenarios where we do not have the full knowledge of the scene necessary to apply classical solvers. Most existing forward or inverse PDE approaches perform poorly when the observations on the data or the underlying coefficients are incomplete, which is a common assumption for real-world measurements. In this work, we propose DiffusionPDE that can simultaneously fill in the missing information and solve a PDE by modeling the joint distribution of the solution and coefficient spaces. We show that the learned generative priors lead to a versatile framework for accurately solving a wide range of PDEs under partial observation, significantly outperforming the state-of-the-art methods for both forward and inverse directions.
comment: Project page: https://jhhuangchloe.github.io/Diffusion-PDE/
☆ MotionBooth: Motion-Aware Customized Text-to-Video Generation
In this work, we present MotionBooth, an innovative framework designed for animating customized subjects with precise control over both object and camera movements. By leveraging a few images of a specific object, we efficiently fine-tune a text-to-video model to capture the object's shape and attributes accurately. Our approach presents subject region loss and video preservation loss to enhance the subject's learning performance, along with a subject token cross-attention loss to integrate the customized subject with motion control signals. Additionally, we propose training-free techniques for managing subject and camera motions during inference. In particular, we utilize cross-attention map manipulation to govern subject motion and introduce a novel latent shift module for camera movement control as well. MotionBooth excels in preserving the appearance of subjects while simultaneously controlling the motions in generated videos. Extensive quantitative and qualitative evaluations demonstrate the superiority and effectiveness of our method. Our project page is at https://jianzongwu.github.io/projects/motionbooth
comment: Project page at https://jianzongwu.github.io/projects/motionbooth
☆ Benchmarking Deep Learning Models on NVIDIA Jetson Nano for Real-Time Systems: An Empirical Investigation
The proliferation of complex deep learning (DL) models has revolutionized various applications, including computer vision-based solutions, prompting their integration into real-time systems. However, the resource-intensive nature of these models poses challenges for deployment on low-computational power and low-memory devices, like embedded and edge devices. This work empirically investigates the optimization of such complex DL models to analyze their functionality on an embedded device, particularly on the NVIDIA Jetson Nano. It evaluates the effectiveness of the optimized models in terms of their inference speed for image classification and video action detection. The experimental results reveal that, on average, optimized models exhibit a 16.11% speed improvement over their non-optimized counterparts. This not only emphasizes the critical need to consider hardware constraints and environmental sustainability in model development and deployment but also underscores the pivotal role of model optimization in enabling the widespread deployment of AI-assisted technologies on resource-constrained computational systems. It also serves as proof that prioritizing hardware-specific model optimization leads to efficient and scalable solutions that substantially decrease energy consumption and carbon footprint.
comment: 7 pages, 4 figures
☆ Point-SAM: Promptable 3D Segmentation Model for Point Clouds
The development of 2D foundation models for image segmentation has been significantly advanced by the Segment Anything Model (SAM). However, achieving similar success in 3D models remains a challenge due to issues such as non-unified data formats, lightweight models, and the scarcity of labeled data with diverse masks. To this end, we propose a 3D promptable segmentation model (Point-SAM) focusing on point clouds. Our approach utilizes a transformer-based method, extending SAM to the 3D domain. We leverage part-level and object-level annotations and introduce a data engine to generate pseudo labels from SAM, thereby distilling 2D knowledge into our 3D model. Our model outperforms state-of-the-art models on several indoor and outdoor benchmarks and demonstrates a variety of applications, such as 3D annotation. Codes and demo can be found at https://github.com/zyc00/Point-SAM.
☆ Structured Unrestricted-Rank Matrices for Parameter Efficient Fine-tuning
Recent efforts to scale Transformer models have demonstrated rapid progress across a wide range of tasks (Wei et al., 2022). However, fine-tuning these models for downstream tasks is expensive due to their large parameter counts. Parameter-efficient fine-tuning (PEFT) approaches have emerged as a viable alternative by allowing us to fine-tune models by updating only a small number of parameters. In this work, we propose a general framework for parameter efficient fine-tuning (PEFT), based on structured unrestricted-rank matrices (SURM) which can serve as a drop-in replacement for popular approaches such as Adapters and LoRA. Unlike other methods like LoRA, SURMs provides more flexibility in finding the right balance between compactness and expressiveness. This is achieved by using low displacement rank matrices (LDRMs), which hasn't been used in this context before. SURMs remain competitive with baselines, often providing significant quality improvements while using a smaller parameter budget. SURMs achieve 5-7% accuracy gains on various image classification tasks while replacing low-rank matrices in LoRA. It also results in up to 12x reduction of the number of parameters in adapters (with virtually no loss in quality) on the GLUE benchmark.
comment: Work in progress
☆ Arboretum: A Large Multimodal Dataset Enabling AI for Biodiversity
We introduce Arboretum, the largest publicly accessible dataset designed to advance AI for biodiversity applications. This dataset, curated from the iNaturalist community science platform and vetted by domain experts to ensure accuracy, includes 134.6 million images, surpassing existing datasets in scale by an order of magnitude. The dataset encompasses image-language paired data for a diverse set of species from birds (Aves), spiders/ticks/mites (Arachnida), insects (Insecta), plants (Plantae), fungus/mushrooms (Fungi), snails (Mollusca), and snakes/lizards (Reptilia), making it a valuable resource for multimodal vision-language AI models for biodiversity assessment and agriculture research. Each image is annotated with scientific names, taxonomic details, and common names, enhancing the robustness of AI model training. We showcase the value of Arboretum by releasing a suite of CLIP models trained using a subset of 40 million captioned images. We introduce several new benchmarks for rigorous assessment, report accuracy for zero-shot learning, and evaluations across life stages, rare species, confounding species, and various levels of the taxonomic hierarchy. We anticipate that Arboretum will spur the development of AI models that can enable a variety of digital tools ranging from pest control strategies, crop monitoring, and worldwide biodiversity assessment and environmental conservation. These advancements are critical for ensuring food security, preserving ecosystems, and mitigating the impacts of climate change. Arboretum is publicly available, easily accessible, and ready for immediate use. Please see the \href{https://baskargroup.github.io/Arboretum/}{project website} for links to our data, models, and code.
comment: Preprint under review
☆ Mask-Guided Attention U-Net for Enhanced Neonatal Brain Extraction and Image Preprocessing
In this study, we introduce MGA-Net, a novel mask-guided attention neural network, which extends the U-net model for precision neonatal brain imaging. MGA-Net is designed to extract the brain from other structures and reconstruct high-quality brain images. The network employs a common encoder and two decoders: one for brain mask extraction and the other for brain region reconstruction. A key feature of MGA-Net is its high-level mask-guided attention module, which leverages features from the brain mask decoder to enhance image reconstruction. To enable the same encoder and decoder to process both MRI and ultrasound (US) images, MGA-Net integrates sinusoidal positional encoding. This encoding assigns distinct positional values to MRI and US images, allowing the model to effectively learn from both modalities. Consequently, features learned from a single modality can aid in learning a modality with less available data, such as US. We extensively validated the proposed MGA-Net on diverse datasets from varied clinical settings and neonatal age groups. The metrics used for assessment included the DICE similarity coefficient, recall, and accuracy for image segmentation; structural similarity for image reconstruction; and root mean squared error for total brain volume estimation from 3D ultrasound images. Our results demonstrate that MGA-Net significantly outperforms traditional methods, offering superior performance in brain extraction and segmentation while achieving high precision in image reconstruction and volumetric analysis. Thus, MGA-Net represents a robust and effective preprocessing tool for MRI and 3D ultrasound images, marking a significant advance in neuroimaging that enhances both research and clinical diagnostics in the neonatal period and beyond.
☆ SurgeMOD: Translating image-space tissue motions into vision-based surgical forces
We present a new approach for vision-based force estimation in Minimally Invasive Robotic Surgery based on frequency domain basis of motion of organs derived directly from video. Using internal movements generated by natural processes like breathing or the cardiac cycle, we infer the image-space basis of the motion on the frequency domain. As we are working with this representation, we discretize the problem to a limited amount of low-frequencies to build an image-space mechanical model of the environment. We use this pre-built model to define our force estimation problem as a dynamic constraint problem. We demonstrate that this method can estimate point contact forces reliably for silicone phantom and ex-vivo experiments, matching real readings from a force sensor. In addition, we perform qualitative experiments in which we synthesize coherent force textures from surgical videos over a certain region of interest selected by the user. Our method demonstrates good results for both quantitative and qualitative analysis, providing a good starting point for a purely vision-based method for surgical force estimation.
☆ HGTDP-DTA: Hybrid Graph-Transformer with Dynamic Prompt for Drug-Target Binding Affinity Prediction
Drug target binding affinity (DTA) is a key criterion for drug screening. Existing experimental methods are time-consuming and rely on limited structural and domain information. While learning-based methods can model sequence and structural information, they struggle to integrate contextual data and often lack comprehensive modeling of drug-target interactions. In this study, we propose a novel DTA prediction method, termed HGTDP-DTA, which utilizes dynamic prompts within a hybrid Graph-Transformer framework. Our method generates context-specific prompts for each drug-target pair, enhancing the model's ability to capture unique interactions. The introduction of prompt tuning further optimizes the prediction process by filtering out irrelevant noise and emphasizing task-relevant information, dynamically adjusting the input features of the molecular graph. The proposed hybrid Graph-Transformer architecture combines structural information from Graph Convolutional Networks (GCNs) with sequence information captured by Transformers, facilitating the interaction between global and local information. Additionally, we adopted the multi-view feature fusion method to project molecular graph views and affinity subgraph views into a common feature space, effectively combining structural and contextual information. Experiments on two widely used public datasets, Davis and KIBA, show that HGTDP-DTA outperforms state-of-the-art DTA prediction methods in both prediction performance and generalization ability.
☆ Unified Auto-Encoding with Masked Diffusion
At the core of both successful generative and self-supervised representation learning models there is a reconstruction objective that incorporates some form of image corruption. Diffusion models implement this approach through a scheduled Gaussian corruption process, while masked auto-encoder models do so by masking patches of the image. Despite their different approaches, the underlying similarity in their methodologies suggests a promising avenue for an auto-encoder capable of both de-noising tasks. We propose a unified self-supervised objective, dubbed Unified Masked Diffusion (UMD), that combines patch-based and noise-based corruption techniques within a single auto-encoding framework. Specifically, UMD modifies the diffusion transformer (DiT) training process by introducing an additional noise-free, high masking representation step in the diffusion noising schedule, and utilizes a mixed masked and noised image for subsequent timesteps. By integrating features useful for diffusion modeling and for predicting masked patch tokens, UMD achieves strong performance in downstream generative and representation learning tasks, including linear probing and class-conditional generation. This is achieved without the need for heavy data augmentations, multiple views, or additional encoders. Furthermore, UMD improves over the computational efficiency of prior diffusion based methods in total training time. We release our code at https://github.com/philippe-eecs/small-vision.
comment: 19 Pages, 8 Figures, 3Tables
☆ End-to-End Autonomous Driving without Costly Modularization and 3D Manual Annotation
We propose UAD, a method for vision-based end-to-end autonomous driving (E2EAD), achieving the best open-loop evaluation performance in nuScenes, meanwhile showing robust closed-loop driving quality in CARLA. Our motivation stems from the observation that current E2EAD models still mimic the modular architecture in typical driving stacks, with carefully designed supervised perception and prediction subtasks to provide environment information for oriented planning. Although achieving groundbreaking progress, such design has certain drawbacks: 1) preceding subtasks require massive high-quality 3D annotations as supervision, posing a significant impediment to scaling the training data; 2) each submodule entails substantial computation overhead in both training and inference. To this end, we propose UAD, an E2EAD framework with an unsupervised proxy to address all these issues. Firstly, we design a novel Angular Perception Pretext to eliminate the annotation requirement. The pretext models the driving scene by predicting the angular-wise spatial objectness and temporal dynamics, without manual annotation. Secondly, a self-supervised training strategy, which learns the consistency of the predicted trajectories under different augment views, is proposed to enhance the planning robustness in steering scenarios. Our UAD achieves 38.7% relative improvements over UniAD on the average collision rate in nuScenes and surpasses VAD for 41.32 points on the driving score in CARLA's Town05 Long benchmark. Moreover, the proposed method only consumes 44.3% training resources of UniAD and runs 3.4 times faster in inference. Our innovative design not only for the first time demonstrates unarguable performance advantages over supervised counterparts, but also enjoys unprecedented efficiency in data, training, and inference. Code and models will be released at https://github.com/KargoBot_Research/UAD.
comment: 17 pages, 10 figures and 15 tables
☆ Local-to-Global Cross-Modal Attention-Aware Fusion for HSI-X Semantic Segmentation
Hyperspectral image (HSI) classification has recently reached its performance bottleneck. Multimodal data fusion is emerging as a promising approach to overcome this bottleneck by providing rich complementary information from the supplementary modality (X-modality). However, achieving comprehensive cross-modal interaction and fusion that can be generalized across different sensing modalities is challenging due to the disparity in imaging sensors, resolution, and content of different modalities. In this study, we propose a Local-to-Global Cross-modal Attention-aware Fusion (LoGoCAF) framework for HSI-X classification that jointly considers efficiency, accuracy, and generalizability. LoGoCAF adopts a pixel-to-pixel two-branch semantic segmentation architecture to learn information from HSI and X modalities. The pipeline of LoGoCAF consists of a local-to-global encoder and a lightweight multilayer perceptron (MLP) decoder. In the encoder, convolutions are used to encode local and high-resolution fine details in shallow layers, while transformers are used to integrate global and low-resolution coarse features in deeper layers. The MLP decoder aggregates information from the encoder for feature fusion and prediction. In particular, two cross-modality modules, the feature enhancement module (FEM) and the feature interaction and fusion module (FIFM), are introduced in each encoder stage. The FEM is used to enhance complementary information by combining the feature from the other modality across direction-aware, position-sensitive, and channel-wise dimensions. With the enhanced features, the FIFM is designed to promote cross-modality information interaction and fusion for the final semantic prediction. Extensive experiments demonstrate that our LoGoCAF achieves superior performance and generalizes well. The code will be made publicly available.
☆ Brain Tumor Classification using Vision Transformer with Selective Cross-Attention Mechanism and Feature Calibration
Brain tumor classification is a challenging task in medical image analysis. In this paper, we propose a novel approach to brain tumor classification using a vision transformer with a novel cross-attention mechanism. Our approach leverages the strengths of transformers in modeling long-range dependencies and multi-scale feature fusion. We introduce two new mechanisms to improve the performance of the cross-attention fusion module: Feature Calibration Mechanism (FCM) and Selective Cross-Attention (SCA). FCM calibrates the features from different branches to make them more compatible, while SCA selectively attends to the most informative features. Our experiments demonstrate that the proposed approach outperforms other state-of-the-art methods in brain tumor classification, achieving improved accuracy and efficiency. The proposed FCM and SCA mechanisms can be easily integrated into other vision transformer architectures, making them a promising direction for future research in medical image analysis. Experimental results confirm that our approach surpasses existing methods, achieving state-of-the-art performance in brain tumor classification tasks.
☆ Time-varying Extremum Graphs
We introduce time-varying extremum graph (TVEG), a topological structure to support visualization and analysis of a time-varying scalar field. The extremum graph is a substructure of the Morse-Smale complex. It captures the adjacency relationship between cells in the Morse decomposition of a scalar field. We define the TVEG as a time-varying extension of the extremum graph and demonstrate how it captures salient feature tracks within a dynamic scalar field. We formulate the construction of the TVEG as an optimization problem and describe an algorithm for computing the graph. We also demonstrate the capabilities of \TVEG towards identification and exploration of topological events such as deletion, generation, split, and merge within a dynamic scalar field via comprehensive case studies including a viscous fingers and a 3D von K\'arm\'an vortex street dataset.
☆ BayTTA: Uncertainty-aware medical image classification with optimized test-time augmentation using Bayesian model averaging
Test-time augmentation (TTA) is a well-known technique employed during the testing phase of computer vision tasks. It involves aggregating multiple augmented versions of input data. Combining predictions using a simple average formulation is a common and straightforward approach after performing TTA. This paper introduces a novel framework for optimizing TTA, called BayTTA (Bayesian-based TTA), which is based on Bayesian Model Averaging (BMA). First, we generate a model list associated with different variations of the input data created through TTA. Then, we use BMA to combine model predictions weighted by their respective posterior probabilities. Such an approach allows one to take into account model uncertainty, and thus to enhance the predictive performance of the related machine learning or deep learning model. We evaluate the performance of BayTTA on various public data, including three medical image datasets comprising skin cancer, breast cancer, and chest X-ray images and two well-known gene editing datasets, CRISPOR and GUIDE-seq. Our experimental results indicate that BayTTA can be effectively integrated into state-of-the-art deep learning models used in medical image analysis as well as into some popular pre-trained CNN models such as VGG-16, MobileNetV2, DenseNet201, ResNet152V2, and InceptionRes-NetV2, leading to the enhancement in their accuracy and robustness performance.
☆ Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP
Contrastive Language--Image Pre-training (CLIP) has manifested remarkable improvements in zero-shot classification and cross-modal vision-language tasks. Yet, from a geometrical point of view, the CLIP embedding space has been found to have a pronounced modality gap. This gap renders the embedding space overly sparse and disconnected, with different modalities being densely distributed in distinct subregions of the hypersphere. In this work, we aim at answering two main questions: 1. Does sharing the parameter space between the multi-modal encoders reduce the modality gap? 2. Can the gap be mitigated by pushing apart the uni-modal embeddings via intra-modality separation? We design AlignCLIP, in order to answer these questions and show that answers to both questions are positive. Through extensive experiments, we show that AlignCLIP achieves noticeable enhancements in the cross-modal alignment of the embeddings, and thereby, reduces the modality gap, while maintaining the performance across several downstream evaluations, such as zero-shot image classification, zero-shot multi-modal retrieval and zero-shot semantic text similarity.
☆ Aligning Diffusion Models with Noise-Conditioned Perception
Recent advancements in human preference optimization, initially developed for Language Models (LMs), have shown promise for text-to-image Diffusion Models, enhancing prompt alignment, visual appeal, and user preference. Unlike LMs, Diffusion Models typically optimize in pixel or VAE space, which does not align well with human perception, leading to slower and less efficient training during the preference alignment stage. We propose using a perceptual objective in the U-Net embedding space of the diffusion model to address these issues. Our approach involves fine-tuning Stable Diffusion 1.5 and XL using Direct Preference Optimization (DPO), Contrastive Preference Optimization (CPO), and supervised fine-tuning (SFT) within this embedding space. This method significantly outperforms standard latent-space implementations across various metrics, including quality and computational cost. For SDXL, our approach provides 60.8\% general preference, 62.2\% visual appeal, and 52.1\% prompt following against original open-sourced SDXL-DPO on the PartiPrompts dataset, while significantly reducing compute. Our approach not only improves the efficiency and quality of human preference alignment for diffusion models but is also easily integrable with other optimization techniques. The training code and LoRA weights will be available here: https://huggingface.co/alexgambashidze/SDXL\_NCP-DPO\_v0.1
☆ Video Inpainting Localization with Contrastive Learning
Deep video inpainting is typically used as malicious manipulation to remove important objects for creating fake videos. It is significant to identify the inpainted regions blindly. This letter proposes a simple yet effective forensic scheme for Video Inpainting LOcalization with ContrAstive Learning (ViLocal). Specifically, a 3D Uniformer encoder is applied to the video noise residual for learning effective spatiotemporal forensic features. To enhance the discriminative power, supervised contrastive learning is adopted to capture the local inconsistency of inpainted videos through attracting/repelling the positive/negative pristine and forged pixel pairs. A pixel-wise inpainting localization map is yielded by a lightweight convolution decoder with a specialized two-stage training strategy. To prepare enough training samples, we build a video object segmentation dataset of 2500 videos with pixel-level annotations per frame. Extensive experimental results validate the superiority of ViLocal over state-of-the-arts. Code and dataset will be available at https://github.com/multimediaFor/ViLocal.
comment: arXiv admin note: substantial text overlap with arXiv:2406.13576
☆ Embedded event based object detection with spiking neural network
The complexity of event-based object detection (OD) poses considerable challenges. Spiking Neural Networks (SNNs) show promising results and pave the way for efficient event-based OD. Despite this success, the path to efficient SNNs on embedded devices remains a challenge. This is due to the size of the networks required to accomplish the task and the ability of devices to take advantage of SNNs benefits. Even when "edge" devices are considered, they typically use embedded GPUs that consume tens of watts. In response to these challenges, our research introduces an embedded neuromorphic testbench that utilizes the SPiking Low-power Event-based ArchiTecture (SPLEAT) accelerator. Using an extended version of the Qualia framework, we can train, evaluate, quantize, and deploy spiking neural networks on an FPGA implementation of SPLEAT. We used this testbench to load a state-of-the-art SNN solution, estimate the performance loss associated with deploying the network on dedicated hardware, and run real-world event-based OD on neuromorphic hardware specifically designed for low-power spiking neural networks. Remarkably, our embedded spiking solution, which includes a model with 1.08 million parameters, operates efficiently with 490 mJ per prediction.
comment: Result link: https://youtu.be/TsolUDaMY7Y
☆ MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization
Pre-trained models have been a foundational approach in speech recognition, albeit with associated additional costs. In this study, we propose a regularization technique that facilitates the training of visual and audio-visual speech recognition models (VSR and AVSR) from scratch. This approach, abbreviated as \textbf{MSRS} (Multimodal Speech Recognition from Scratch), introduces a sparse regularization that rapidly learns sparse structures within the dense model at the very beginning of training, which receives healthier gradient flow than the dense equivalent. Once the sparse mask stabilizes, our method allows transitioning to a dense model or keeping a sparse model by updating non-zero values. MSRS achieves competitive results in VSR and AVSR with 21.1% and 0.9% WER on the LRS3 benchmark, while reducing training time by at least 2x. We explore other sparse approaches and show that only MSRS enables training from scratch by implicitly masking the weights affected by vanishing gradients.
comment: Accepted at Interspeech 2024
☆ Test-Time Generative Augmentation for Medical Image Segmentation
In this paper, we propose a novel approach to enhance medical image segmentation during test time. Instead of employing hand-crafted transforms or functions on the input test image to create multiple views for test-time augmentation, we advocate for the utilization of an advanced domain-fine-tuned generative model (GM), e.g., stable diffusion (SD), for test-time augmentation. Given that the GM has been trained to comprehend and encapsulate comprehensive domain data knowledge, it is superior than segmentation models in terms of representing the data characteristics and distribution. Hence, by integrating the GM into test-time augmentation, we can effectively generate multiple views of a given test sample, aligning with the content and appearance characteristics of the sample and the related local data distribution. This approach renders the augmentation process more adaptable and resilient compared to conventional handcrafted transforms. Comprehensive experiments conducted across three medical image segmentation tasks (nine datasets) demonstrate the efficacy and versatility of the proposed TTGA in enhancing segmentation outcomes. Moreover, TTGA significantly improves pixel-wise error estimation, thereby facilitating the deployment of a more reliable segmentation system. Code will be released at: https://github.com/maxiao0234/TTGA.
comment: 12pages, 2figures
☆ Director3D: Real-world Camera Trajectory and 3D Scene Generation from Text
Recent advancements in 3D generation have leveraged synthetic datasets with ground truth 3D assets and predefined cameras. However, the potential of adopting real-world datasets, which can produce significantly more realistic 3D scenes, remains largely unexplored. In this work, we delve into the key challenge of the complex and scene-specific camera trajectories found in real-world captures. We introduce Director3D, a robust open-world text-to-3D generation framework, designed to generate both real-world 3D scenes and adaptive camera trajectories. To achieve this, (1) we first utilize a Trajectory Diffusion Transformer, acting as the Cinematographer, to model the distribution of camera trajectories based on textual descriptions. (2) Next, a Gaussian-driven Multi-view Latent Diffusion Model serves as the Decorator, modeling the image sequence distribution given the camera trajectories and texts. This model, fine-tuned from a 2D diffusion model, directly generates pixel-aligned 3D Gaussians as an immediate 3D scene representation for consistent denoising. (3) Lastly, the 3D Gaussians are refined by a novel SDS++ loss as the Detailer, which incorporates the prior of the 2D diffusion model. Extensive experiments demonstrate that Director3D outperforms existing methods, offering superior performance in real-world 3D generation.
comment: Code: https://github.com/imlixinyang/director3d
☆ DocParseNet: Advanced Semantic Segmentation and OCR Embeddings for Efficient Scanned Document Annotation
Automating the annotation of scanned documents is challenging, requiring a balance between computational efficiency and accuracy. DocParseNet addresses this by combining deep learning and multi-modal learning to process both text and visual data. This model goes beyond traditional OCR and semantic segmentation, capturing the interplay between text and images to preserve contextual nuances in complex document structures. Our evaluations show that DocParseNet significantly outperforms conventional models, achieving mIoU scores of 49.12 on validation and 49.78 on the test set. This reflects a 58% accuracy improvement over state-of-the-art baseline models and an 18% gain compared to the UNext baseline. Remarkably, DocParseNet achieves these results with only 2.8 million parameters, reducing the model size by approximately 25 times and speeding up training by 5 times compared to other models. These metrics, coupled with a computational efficiency of 0.034 TFLOPs (BS=1), highlight DocParseNet's high performance in document annotation. The model's adaptability and scalability make it well-suited for real-world corporate document processing applications. The code is available at https://github.com/ahmad-shirazi/DocParseNet
☆ Advancing Cell Detection in Anterior Segment Optical Coherence Tomography Images
Anterior uveitis, a common form of eye inflammation, can lead to permanent vision loss if not promptly diagnosed. Monitoring this condition involves quantifying inflammatory cells in the anterior chamber (AC) of the eye, which can be captured using Anterior Segment Optical Coherence Tomography (AS-OCT). However, manually identifying cells in AS-OCT images is time-consuming and subjective. Moreover, existing automated approaches may have limitations in both the effectiveness of detecting cells and the reliability of their detection results. To address these challenges, we propose an automated framework to detect cells in the AS-OCT images. This framework consists of a zero-shot chamber segmentation module and a cell detection module. The first module segments the AC area in the image without requiring human-annotated training data. Subsequently, the second module identifies individual cells within the segmented AC region. Through experiments, our framework demonstrates superior performance compared to current state-of-the-art methods for both AC segmentation and cell detection tasks. Notably, we find that previous cell detection approaches could suffer from low recall, potentially overlooking a significant number of cells. In contrast, our framework offers an improved solution, which could benefit the diagnosis and study of anterior uveitis. Our code for cell detection is publicly available at: https://github.com/joeybyc/cell_detection.
☆ Toward Universal Medical Image Registration via Sharpness-Aware Meta-Continual Learning MICCAI 2024
Current deep learning approaches in medical image registration usually face the challenges of distribution shift and data collection, hindering real-world deployment. In contrast, universal medical image registration aims to perform registration on a wide range of clinically relevant tasks simultaneously, thus having tremendous potential for clinical applications. In this paper, we present the first attempt to achieve the goal of universal 3D medical image registration in sequential learning scenarios by proposing a continual learning method. Specifically, we utilize meta-learning with experience replay to mitigating the problem of catastrophic forgetting. To promote the generalizability of meta-continual learning, we further propose sharpness-aware meta-continual learning (SAMCL). We validate the effectiveness of our method on four datasets in a continual learning setup, including brain MR, abdomen CT, lung CT, and abdomen MR-CT image pairs. Results have shown the potential of SAMCL in realizing universal image registration, which performs better than or on par with vanilla sequential or centralized multi-task training strategies.The source code will be available from https://github.com/xzluo97/Continual-Reg.
comment: Accepted by MICCAI 2024
☆ Minimal Interaction Edge Tuning: A New Paradigm for Visual Adaptation
The rapid scaling of large vision pretrained models makes fine-tuning tasks more and more difficult on edge devices with low computational resources. We explore a new visual adaptation paradigm called edge tuning, which treats large pretrained models as standalone feature extractors that run on powerful cloud servers. The fine-tuning carries out on edge devices with small networks which require low computational resources. Existing methods that are potentially suitable for our edge tuning paradigm are discussed. But, three major drawbacks hinder their application in edge tuning: low adaptation capability, large adapter network, and high information transfer overhead. To address these issues, we propose Minimal Interaction Edge Tuning, or MIET, which reveals that the sum of intermediate features from pretrained models not only has minimal information transfer but also has high adaptation capability. With a lightweight attention-based adaptor network, MIET achieves information transfer efficiency, parameter efficiency, computational and memory efficiency, and at the same time demonstrates competitive results on various visual adaptation benchmarks.
comment: 9 pages
☆ Detection of Synthetic Face Images: Accuracy, Robustness, Generalization
An experimental study on detecting synthetic face images is presented. We collected a dataset, called FF5, of five fake face image generators, including recent diffusion models. We find that a simple model trained on a specific image generator can achieve near-perfect accuracy in separating synthetic and real images. The model handles common image distortions (reduced resolution, compression) by using data augmentation. Moreover, partial manipulations, where synthetic images are blended into real ones by inpainting, are identified and the area of the manipulation is localized by a simple model of YOLO architecture. However, the model turned out to be vulnerable to adversarial attacks and does not generalize to unseen generators. Failure to generalize to detect images produced by a newer generator also occurs for recent state-of-the-art methods, which we tested on Realistic Vision, a fine-tuned version of StabilityAI's Stable Diffusion image generator.
☆ Principal Component Clustering for Semantic Segmentation in Synthetic Data Generation CVPR
This technical report outlines our method for generating a synthetic dataset for semantic segmentation using a latent diffusion model. Our approach eliminates the need for additional models specifically trained on segmentation data and is part of our submission to the CVPR 2024 workshop challenge, entitled CVPR 2024 workshop challenge "SyntaGen Harnessing Generative Models for Synthetic Visual Datasets". Our methodology uses self-attentions to facilitate a novel head-wise semantic information condensation, thereby enabling the direct acquisition of class-agnostic image segmentation from the Stable Diffusion latents. Furthermore, we employ non-prompt-influencing cross-attentions from text to pixel, thus facilitating the classification of the previously generated masks. Finally, we propose a mask refinement step by using only the output image by Stable Diffusion.
comment: This is a technical report for a submission to the CVPR "SyntaGen - Harnessing Generative Models for Synthetic Visual Datasets" workshop challenge. The report is already uploaded to the workshop's homepage https://syntagen.github.io/
☆ SKD-TSTSAN: Three-Stream Temporal-Shift Attention Network Based on Self-Knowledge Distillation for Micro-Expression Recognition
Micro-expressions (MEs) are subtle facial movements that occur spontaneously when people try to conceal the real emotions. Micro-expression recognition (MER) is crucial in many fields, including criminal analysis and psychotherapy. However, MER is challenging since MEs have low intensity and ME datasets are small in size. To this end, a three-stream temporal-shift attention network based on self-knowledge distillation (SKD-TSTSAN) is proposed in this paper. Firstly, to address the low intensity of ME muscle movements, we utilize learning-based motion magnification modules to enhance the intensity of ME muscle movements. Secondly, we employ efficient channel attention (ECA) modules in the local-spatial stream to make the network focus on facial regions that are highly relevant to MEs. In addition, temporal shift modules (TSMs) are used in the dynamic-temporal stream, which enables temporal modeling with no additional parameters by mixing ME motion information from two different temporal domains. Furthermore, we introduce self-knowledge distillation (SKD) into the MER task by introducing auxiliary classifiers and using the deepest section of the network for supervision, encouraging all blocks to fully explore the features of the training set. Finally, extensive experiments are conducted on four ME datasets: CASME II, SAMM, MMEW, and CAS(ME)3. The experimental results demonstrate that our SKD-TSTSAN outperforms other existing methods and achieves new state-of-the-art performance. Our code will be available at https://github.com/GuanghaoZhu663/SKD-TSTSAN.
☆ MedMNIST-C: Comprehensive benchmark and improved classifier robustness by simulating realistic image corruptions
The integration of neural-network-based systems into clinical practice is limited by challenges related to domain generalization and robustness. The computer vision community established benchmarks such as ImageNet-C as a fundamental prerequisite to measure progress towards those challenges. Similar datasets are largely absent in the medical imaging community which lacks a comprehensive benchmark that spans across imaging modalities and applications. To address this gap, we create and open-source MedMNIST-C, a benchmark dataset based on the MedMNIST+ collection covering 12 datasets and 9 imaging modalities. We simulate task and modality-specific image corruptions of varying severity to comprehensively evaluate the robustness of established algorithms against real-world artifacts and distribution shifts. We further provide quantitative evidence that our simple-to-use artificial corruptions allow for highly performant, lightweight data augmentation to enhance model robustness. Unlike traditional, generic augmentation strategies, our approach leverages domain knowledge, exhibiting significantly higher robustness when compared to widely adopted methods. By introducing MedMNIST-C and open-sourcing the corresponding library allowing for targeted data augmentations, we contribute to the development of increasingly robust methods tailored to the challenges of medical imaging. The code is available at https://github.com/francescodisalvo05/medmnistc-api}{github.com/francescodisalvo05/medmnistc-api.
☆ Point Tree Transformer for Point Cloud Registration
Point cloud registration is a fundamental task in the fields of computer vision and robotics. Recent developments in transformer-based methods have demonstrated enhanced performance in this domain. However, the standard attention mechanism utilized in these methods often integrates many low-relevance points, thereby struggling to prioritize its attention weights on sparse yet meaningful points. This inefficiency leads to limited local structure modeling capabilities and quadratic computational complexity. To overcome these limitations, we propose the Point Tree Transformer (PTT), a novel transformer-based approach for point cloud registration that efficiently extracts comprehensive local and global features while maintaining linear computational complexity. The PTT constructs hierarchical feature trees from point clouds in a coarse-to-dense manner, and introduces a novel Point Tree Attention (PTA) mechanism, which follows the tree structure to facilitate the progressive convergence of attended regions towards salient points. Specifically, each tree layer selectively identifies a subset of key points with the highest attention scores. Subsequent layers focus attention on areas of significant relevance, derived from the child points of the selected point set. The feature extraction process additionally incorporates coarse point features that capture high-level semantic information, thus facilitating local structure modeling and the progressive integration of multiscale information. Consequently, PTA empowers the model to concentrate on crucial local structures and derive detailed local information while maintaining linear computational complexity. Extensive experiments conducted on the 3DMatch, ModelNet40, and KITTI datasets demonstrate that our method achieves superior performance over the state-of-the-art methods.
☆ Tell Me Where You Are: Multimodal LLMs Meet Place Recognition
Large language models (LLMs) exhibit a variety of promising capabilities in robotics, including long-horizon planning and commonsense reasoning. However, their performance in place recognition is still underexplored. In this work, we introduce multimodal LLMs (MLLMs) to visual place recognition (VPR), where a robot must localize itself using visual observations. Our key design is to use vision-based retrieval to propose several candidates and then leverage language-based reasoning to carefully inspect each candidate for a final decision. Specifically, we leverage the robust visual features produced by off-the-shelf vision foundation models (VFMs) to obtain several candidate locations. We then prompt an MLLM to describe the differences between the current observation and each candidate in a pairwise manner, and reason about the best candidate based on these descriptions. Our results on three datasets demonstrate that integrating the general-purpose visual features from VFMs with the reasoning capabilities of MLLMs already provides an effective place recognition solution, without any VPR-specific supervised training. We believe our work can inspire new possibilities for applying and designing foundation models, i.e., VFMs, LLMs, and MLLMs, to enhance the localization and navigation of mobile robots.
☆ TRIP: Trainable Region-of-Interest Prediction for Hardware-Efficient Neuromorphic Processing on Event-based Vision
Neuromorphic processors are well-suited for efficiently handling sparse events from event-based cameras. However, they face significant challenges in the growth of computing demand and hardware costs as the input resolution increases. This paper proposes the Trainable Region-of-Interest Prediction (TRIP), the first hardware-efficient hard attention framework for event-based vision processing on a neuromorphic processor. Our TRIP framework actively produces low-resolution Region-of-Interest (ROIs) for efficient and accurate classification. The framework exploits sparse events' inherent low information density to reduce the overhead of ROI prediction. We introduced extensive hardware-aware optimizations for TRIP and implemented the hardware-optimized algorithm on the SENECA neuromorphic processor. We utilized multiple event-based classification datasets for evaluation. Our approach achieves state-of-the-art accuracies in all datasets and produces reasonable ROIs with varying locations and sizes. On the DvsGesture dataset, our solution requires 46x less computation than the state-of-the-art while achieving higher accuracy. Furthermore, TRIP enables more than 2x latency and energy improvements on the SENECA neuromorphic processor compared to the conventional solution.
comment: Accepted in ICONS 2024
☆ TSynD: Targeted Synthetic Data Generation for Enhanced Medical Image Classification
The usage of medical image data for the training of large-scale machine learning approaches is particularly challenging due to its scarce availability and the costly generation of data annotations, typically requiring the engagement of medical professionals. The rapid development of generative models allows towards tackling this problem by leveraging large amounts of realistic synthetically generated data for the training process. However, randomly choosing synthetic samples, might not be an optimal strategy. In this work, we investigate the targeted generation of synthetic training data, in order to improve the accuracy and robustness of image classification. Therefore, our approach aims to guide the generative model to synthesize data with high epistemic uncertainty, since large measures of epistemic uncertainty indicate underrepresented data points in the training set. During the image generation we feed images reconstructed by an auto encoder into the classifier and compute the mutual information over the class-probability distribution as a measure for uncertainty.We alter the feature space of the autoencoder through an optimization process with the objective of maximizing the classifier uncertainty on the decoded image. By training on such data we improve the performance and robustness against test time data augmentations and adversarial attacks on several classifications tasks.
☆ UHD-IQA Benchmark Database: Pushing the Boundaries of Blind Photo Quality Assessment
We introduce a novel Image Quality Assessment (IQA) dataset comprising 6073 UHD-1 (4K) images, annotated at a fixed width of 3840 pixels. Contrary to existing No-Reference (NR) IQA datasets, ours focuses on highly aesthetic photos of high technical quality, filling a gap in the literature. The images, carefully curated to exclude synthetic content, are sufficiently diverse to train general NR-IQA models. The dataset is annotated with perceptual quality ratings obtained through a crowdsourcing study. Ten expert raters, comprising photographers and graphics artists, assessed each image at least twice in multiple sessions spanning several days, resulting in highly reliable labels. Annotators were rigorously selected based on several metrics, including self-consistency, to ensure their reliability. The dataset includes rich metadata with user and machine-generated tags from over 5,000 categories and popularity indicators such as favorites, likes, downloads, and views. With its unique characteristics, such as its focus on high-quality images, reliable crowdsourced annotations, and high annotation resolution, our dataset opens up new opportunities for advancing perceptual image quality assessment research and developing practical NR-IQA models that apply to modern photos. Our dataset is available at https://database.mmsp-kn.de/uhd-iqa-benchmark-database.html
☆ Medical Image Segmentation Using Directional Window Attention
Accurate segmentation of medical images is crucial for diagnostic purposes, including cell segmentation, tumor identification, and organ localization. Traditional convolutional neural network (CNN)-based approaches struggled to achieve precise segmentation results due to their limited receptive fields, particularly in cases involving multi-organ segmentation with varying shapes and sizes. The transformer-based approaches address this limitation by leveraging the global receptive field, but they often face challenges in capturing local information required for pixel-precise segmentation. In this work, we introduce DwinFormer, a hierarchical encoder-decoder architecture for medical image segmentation comprising a directional window (Dwin) attention and global self-attention (GSA) for feature encoding. The focus of our design is the introduction of Dwin block within DwinFormer that effectively captures local and global information along the horizontal, vertical, and depthwise directions of the input feature map by separately performing attention in each of these directional volumes. To this end, our Dwin block introduces a nested Dwin attention (NDA) that progressively increases the receptive field in horizontal, vertical, and depthwise directions and a convolutional Dwin attention (CDA) that captures local contextual information for the attention computation. While the proposed Dwin block captures local and global dependencies at the first two high-resolution stages of DwinFormer, the GSA block encodes global dependencies at the last two lower-resolution stages. Experiments over the challenging 3D Synapse Multi-organ dataset and Cell HMS dataset demonstrate the benefits of our DwinFormer over the state-of-the-art approaches. Our source code will be publicly available at \url{https://github.com/Daniyanaj/DWINFORMER}.
comment: 5 pages
☆ Cross-Modal Spherical Aggregation for Weakly Supervised Remote Sensing Shadow Removal
Remote sensing shadow removal, which aims to recover contaminated surface information, is tricky since shadows typically display overwhelmingly low illumination intensities. In contrast, the infrared image is robust toward significant light changes, providing visual clues complementary to the visible image. Nevertheless, the existing methods ignore the collaboration between heterogeneous modalities, leading to undesired quality degradation. To fill this gap, we propose a weakly supervised shadow removal network with a spherical feature space, dubbed S2-ShadowNet, to explore the best of both worlds for visible and infrared modalities. Specifically, we employ a modal translation (visible-to-infrared) model to learn the cross-domain mapping, thus generating realistic infrared samples. Then, Swin Transformer is utilized to extract strong representational visible/infrared features. Simultaneously, the extracted features are mapped to the smooth spherical manifold, which alleviates the domain shift through regularization. Well-designed similarity loss and orthogonality loss are embedded into the spherical space, prompting the separation of private visible/infrared features and the alignment of shared visible/infrared features through constraints on both representation content and orientation. Such a manner encourages implicit reciprocity between modalities, thus providing a novel insight into shadow removal. Notably, ground truth is not available in practice, thus S2-ShadowNet is trained by cropping shadow and shadow-free patches from the shadow image itself, avoiding stereotypical and strict pair data acquisition. More importantly, we contribute a large-scale weakly supervised shadow removal benchmark, including 4000 shadow images with corresponding shadow masks.
comment: 9pages, 11 figures
☆ The Tree of Diffusion Life: Evolutionary Embeddings to Understand the Generation Process of Diffusion Models
Diffusion models generate high-quality samples by corrupting data with Gaussian noise and iteratively reconstructing it with deep learning, slowly transforming noisy images into refined outputs. Understanding this data evolution is important for interpretability but is complex due to its high-dimensional evolutionary nature. While traditional dimensionality reduction methods like t-distributed stochastic neighborhood embedding (t-SNE) aid in understanding high-dimensional spaces, they neglect evolutionary structure preservation. Hence, we propose Tree of Diffusion Life (TDL), a method to understand data evolution in the generative process of diffusion models. TDL samples a diffusion model's generative space via instances with varying prompts and employs image encoders to extract semantic meaning from these samples, projecting them to an intermediate space. It employs a novel evolutionary embedding algorithm that explicitly encodes the iterations while preserving the high-dimensional relations, facilitating the visualization of data evolution. This embedding leverages three metrics: a standard t-SNE loss to group semantically similar elements, a displacement loss to group elements from the same iteration step, and an instance alignment loss to align elements of the same instance across iterations. We present rectilinear and radial layouts to represent iterations, enabling comprehensive exploration. We assess various feature extractors and highlight TDL's potential with prominent diffusion models like GLIDE and Stable Diffusion with different prompt sets. TDL simplifies understanding data evolution within diffusion models, offering valuable insights into their functioning.
☆ Investigating Self-Supervised Methods for Label-Efficient Learning
Vision transformers combined with self-supervised learning have enabled the development of models which scale across large datasets for several downstream tasks like classification, segmentation and detection. The low-shot learning capability of these models, across several low-shot downstream tasks, has been largely under explored. We perform a system level study of different self supervised pretext tasks, namely contrastive learning, clustering, and masked image modelling for their low-shot capabilities by comparing the pretrained models. In addition we also study the effects of collapse avoidance methods, namely centring, ME-MAX, sinkhorn, on these downstream tasks. Based on our detailed analysis, we introduce a framework involving both mask image modelling and clustering as pretext tasks, which performs better across all low-shot downstream tasks, including multi-class classification, multi-label classification and semantic segmentation. Furthermore, when testing the model on full scale datasets, we show performance gains in multi-class classification, multi-label classification and semantic segmentation.
☆ Continuous Urban Change Detection from Satellite Image Time Series with Temporal Feature Refinement and Multi-Task Integration
Urbanization advances at unprecedented rates, resulting in negative effects on the environment and human well-being. Remote sensing has the potential to mitigate these effects by supporting sustainable development strategies with accurate information on urban growth. Deep learning-based methods have achieved promising urban change detection results from optical satellite image pairs using convolutional neural networks (ConvNets), transformers, and a multi-task learning setup. However, transformers have not been leveraged for urban change detection with multi-temporal data, i.e., >2 images, and multi-task learning methods lack integration approaches that combine change and segmentation outputs. To fill this research gap, we propose a continuous urban change detection method that identifies changes in each consecutive image pair of a satellite image time series. Specifically, we propose a temporal feature refinement (TFR) module that utilizes self-attention to improve ConvNet-based multi-temporal building representations. Furthermore, we propose a multi-task integration (MTI) module that utilizes Markov networks to find an optimal building map time series based on segmentation and dense change outputs. The proposed method effectively identifies urban changes based on high-resolution satellite image time series acquired by the PlanetScope constellation (F1 score 0.551) and Gaofen-2 (F1 score 0.440). Moreover, our experiments on two challenging datasets demonstrate the effectiveness of the proposed method compared to bi-temporal and multi-temporal urban change detection and segmentation methods.
comment: Submitted to IEEE Transactions on Geoscience and Remote Sensing, Code will be available at https://github.com/SebastianHafner/ContUrbanCD.git
☆ Pseudo Labelling for Enhanced Masked Autoencoders
Masked Image Modeling (MIM)-based models, such as SdAE, CAE, GreenMIM, and MixAE, have explored different strategies to enhance the performance of Masked Autoencoders (MAE) by modifying prediction, loss functions, or incorporating additional architectural components. In this paper, we propose an enhanced approach that boosts MAE performance by integrating pseudo labelling for both class and data tokens, alongside replacing the traditional pixel-level reconstruction with token-level reconstruction. This strategy uses cluster assignments as pseudo labels to promote instance-level discrimination within the network, while token reconstruction requires generation of discrete tokens encapturing local context. The targets for pseudo labelling and reconstruction needs to be generated by a teacher network. To disentangle the generation of target pseudo labels and the reconstruction of the token features, we decouple the teacher into two distinct models, where one serves as a labelling teacher and the other as a reconstruction teacher. This separation proves empirically superior to a single teacher, while having negligible impact on throughput and memory consumption. Incorporating pseudo-labelling as an auxiliary task has demonstrated notable improvements in ImageNet-1K and other downstream tasks, including classification, semantic segmentation, and detection.
☆ Using joint angles based on the international biomechanical standards for human action recognition and related tasks
Keypoint data has received a considerable amount of attention in machine learning for tasks like action detection and recognition. However, human experts in movement such as doctors, physiotherapists, sports scientists and coaches use a notion of joint angles standardised by the International Society of Biomechanics to precisely and efficiently communicate static body poses and movements. In this paper, we introduce the basic biomechanical notions and show how they can be used to convert common keypoint data into joint angles that uniquely describe the given pose and have various desirable mathematical properties, such as independence of both the camera viewpoint and the person performing the action. We experimentally demonstrate that the joint angle representation of keypoint data is suitable for machine learning applications and can in some cases bring an immediate performance gain. The use of joint angles as a human meaningful representation of kinematic data is in particular promising for applications where interpretability and dialog with human experts is important, such as many sports and medical applications. To facilitate further research in this direction, we will release a python package to convert keypoint data into joint angles as outlined in this paper.
☆ Mamba24/8D: Enhancing Global Interaction in Point Clouds via State Space Model
Transformers have demonstrated impressive results for 3D point cloud semantic segmentation. However, the quadratic complexity of transformer makes computation cost high, limiting the number of points that can be processed simultaneously and impeding the modeling of long-range dependencies. Drawing inspiration from the great potential of recent state space models (SSM) for long sequence modeling, we introduce Mamba, a SSM-based architecture, to the point cloud domain and propose Mamba24/8D, which has strong global modeling capability under linear complexity. Specifically, to make disorderness of point clouds fit in with the causal nature of Mamba, we propose a multi-path serialization strategy applicable to point clouds. Besides, we propose the ConvMamba block to compensate for the shortcomings of Mamba in modeling local geometries and in unidirectional modeling. Mamba24/8D obtains state of the art results on several 3D point cloud segmentation tasks, including ScanNet v2, ScanNet200 and nuScenes, while its effectiveness is validated by extensive experiments.
☆ Implicit-Zoo: A Large-Scale Dataset of Neural Implicit Functions for 2D Images and 3D Scenes
Neural implicit functions have demonstrated significant importance in various areas such as computer vision, graphics. Their advantages include the ability to represent complex shapes and scenes with high fidelity, smooth interpolation capabilities, and continuous representations. Despite these benefits, the development and analysis of implicit functions have been limited by the lack of comprehensive datasets and the substantial computational resources required for their implementation and evaluation. To address these challenges, we introduce "Implicit-Zoo": a large-scale dataset requiring thousands of GPU training days designed to facilitate research and development in this field. Our dataset includes diverse 2D and 3D scenes, such as CIFAR-10, ImageNet-1K, and Cityscapes for 2D image tasks, and the OmniObject3D dataset for 3D vision tasks. We ensure high quality through strict checks, refining or filtering out low-quality data. Using Implicit-Zoo, we showcase two immediate benefits as it enables to: (1) learn token locations for transformer models; (2) directly regress 3D cameras poses of 2D images with respect to NeRF models. This in turn leads to an improved performance in all three task of image classification, semantic segmentation, and 3D pose regression, thereby unlocking new avenues for research.
☆ Advancing Question Answering on Handwritten Documents: A State-of-the-Art Recognition-Based Model for HW-SQuAD
Question-answering handwritten documents is a challenging task with numerous real-world applications. This paper proposes a novel recognition-based approach that improves upon the previous state-of-the-art on the HW-SQuAD and BenthamQA datasets. Our model incorporates transformer-based document retrieval and ensemble methods at the model level, achieving an Exact Match score of 82.02% and 92.55% in HW-SQuAD and BenthamQA datasets, respectively, surpassing the previous best recognition-based approach by 10.89% and 26%. We also enhance the document retrieval component, boosting the top-5 retrieval accuracy from 90% to 95.30%. Our results demonstrate the significance of our proposed approach in advancing question answering on handwritten documents. The code and trained models will be publicly available to facilitate future research in this critical area of natural language.
comment: 16 pages
☆ Deep learning-based brain segmentation model performance validation with clinical radiotherapy CT
Manual segmentation of medical images is labor intensive and especially challenging for images with poor contrast or resolution. The presence of disease exacerbates this further, increasing the need for an automated solution. To this extent, SynthSeg is a robust deep learning model designed for automatic brain segmentation across various contrasts and resolutions. This study validates the SynthSeg robust brain segmentation model on computed tomography (CT), using a multi-center dataset. An open access dataset of 260 paired CT and magnetic resonance imaging (MRI) from radiotherapy patients treated in 5 centers was collected. Brain segmentations from CT and MRI were obtained with SynthSeg model, a component of the Freesurfer imaging suite. These segmentations were compared and evaluated using Dice scores and Hausdorff 95 distance (HD95), treating MRI-based segmentations as the ground truth. Brain regions that failed to meet performance criteria were excluded based on automated quality control (QC) scores. Dice scores indicate a median overlap of 0.76 (IQR: 0.65-0.83). The median HD95 is 2.95 mm (IQR: 1.73-5.39). QC score based thresholding improves median dice by 0.1 and median HD95 by 0.05mm. Morphological differences related to sex and age, as detected by MRI, were also replicated with CT, with an approximate 17% difference between the CT and MRI results for sex and 10% difference between the results for age. SynthSeg can be utilized for CT-based automatic brain segmentation, but only in applications where precision is not essential. CT performance is lower than MRI based on the integrated QC scores, but low-quality segmentations can be excluded with QC-based thresholding. Additionally, performing CT-based neuroanatomical studies is encouraged, as the results show correlations in sex- and age-based analyses similar to those found with MRI.
comment: 15 pages, 9 figures, 3 supplementary data csv's, 1 supplementary file with 1 figure
☆ Real-Time Remote Control via VR over Limited Wireless Connectivity SC
This work introduces a solution to enhance human-robot interaction over limited wireless connectivity. The goal is toenable remote control of a robot through a virtual reality (VR)interface, ensuring a smooth transition to autonomous mode in the event of connectivity loss. The VR interface provides accessto a dynamic 3D virtual map that undergoes continuous updatesusing real-time sensor data collected and transmitted by therobot. Furthermore, the robot monitors wireless connectivity and automatically switches to a autonomous mode in scenarios with limited connectivity. By integrating four key functionalities: real-time mapping, remote control through glasses VR, continuous monitoring of wireless connectivity, and autonomous navigation during limited connectivity, we achieve seamless end-to-end operation.
comment: Accepted in ISCC 2024 conference
☆ Consensus Learning with Deep Sets for Essential Matrix Estimation
Robust estimation of the essential matrix, which encodes the relative position and orientation of two cameras, is a fundamental step in structure from motion pipelines. Recent deep-based methods achieved accurate estimation by using complex network architectures that involve graphs, attention layers, and hard pruning steps. Here, we propose a simpler network architecture based on Deep Sets. Given a collection of point matches extracted from two images, our method identifies outlier point matches and models the displacement noise in inlier matches. A weighted DLT module uses these predictions to regress the essential matrix. Our network achieves accurate recovery that is superior to existing networks with significantly more complex architectures.
☆ Depth-Guided Semi-Supervised Instance Segmentation
Semi-Supervised Instance Segmentation (SSIS) aims to leverage an amount of unlabeled data during training. Previous frameworks primarily utilized the RGB information of unlabeled images to generate pseudo-labels. However, such a mechanism often introduces unstable noise, as a single instance can display multiple RGB values. To overcome this limitation, we introduce a Depth-Guided (DG) SSIS framework. This framework uses depth maps extracted from input images, which represent individual instances with closely associated distance values, offering precise contours for distinct instances. Unlike RGB data, depth maps provide a unique perspective, making their integration into the SSIS process complex. To this end, we propose Depth Feature Fusion, which integrates features extracted from depth estimation. This integration allows the model to understand depth information better and ensure its effective utilization. Additionally, to manage the variability of depth images during training, we introduce the Depth Controller. This component enables adaptive adjustments of the depth map, enhancing convergence speed and dynamically balancing the loss weights between RGB and depth maps. Extensive experiments conducted on the COCO and Cityscapes datasets validate the efficacy of our proposed method. Our approach establishes a new benchmark for SSIS, outperforming previous methods. Specifically, our DG achieves 22.29%, 31.47%, and 35.14% mAP for 1%, 5%, and 10% labeled data on the COCO dataset, respectively.
comment: 12 pages, 6 figures, 4 tables
☆ Less can be more: representational vs. stereotypical gender bias in facial expression recognition
Machine learning models can inherit biases from their training data, leading to discriminatory or inaccurate predictions. This is particularly concerning with the increasing use of large, unsupervised datasets for training foundational models. Traditionally, demographic biases within these datasets have not been well-understood, limiting our ability to understand how they propagate to the models themselves. To address this issue, this paper investigates the propagation of demographic biases from datasets into machine learning models. We focus on the gender demographic component, analyzing two types of bias: representational and stereotypical. For our analysis, we consider the domain of facial expression recognition (FER), a field known to exhibit biases in most popular datasets. We use Affectnet, one of the largest FER datasets, as our baseline for carefully designing and generating subsets that incorporate varying strengths of both representational and stereotypical bias. Subsequently, we train several models on these biased subsets, evaluating their performance on a common test set to assess the propagation of bias into the models' predictions. Our results show that representational bias has a weaker impact than expected. Models exhibit a good generalization ability even in the absence of one gender in the training dataset. Conversely, stereotypical bias has a significantly stronger impact, primarily concentrated on the biased class, although it can also influence predictions for unbiased classes. These results highlight the need for a bias analysis that differentiates between types of bias, which is crucial for the development of effective bias mitigation strategies.
comment: 21 pages including appendix, 11 figures
☆ SyncNoise: Geometrically Consistent Noise Prediction for Text-based 3D Scene Editing
Text-based 2D diffusion models have demonstrated impressive capabilities in image generation and editing. Meanwhile, the 2D diffusion models also exhibit substantial potentials for 3D editing tasks. However, how to achieve consistent edits across multiple viewpoints remains a challenge. While the iterative dataset update method is capable of achieving global consistency, it suffers from slow convergence and over-smoothed textures. We propose SyncNoise, a novel geometry-guided multi-view consistent noise editing approach for high-fidelity 3D scene editing. SyncNoise synchronously edits multiple views with 2D diffusion models while enforcing multi-view noise predictions to be geometrically consistent, which ensures global consistency in both semantic structure and low-frequency appearance. To further enhance local consistency in high-frequency details, we set a group of anchor views and propagate them to their neighboring frames through cross-view reprojection. To improve the reliability of multi-view correspondences, we introduce depth supervision during training to enhance the reconstruction of precise geometries. Our method achieves high-quality 3D editing results respecting the textual instructions, especially in scenes with complex textures, by enhancing geometric consistency at the noise and pixel levels.
comment: 16 pages, 13 figures
☆ Automatic infant 2D pose estimation from videos: comparing seven deep neural network methods
Automatic markerless estimation of infant posture and motion from ordinary videos carries great potential for movement studies "in the wild", facilitating understanding of motor development and massively increasing the chances of early diagnosis of disorders. There is rapid development of human pose estimation methods in computer vision thanks to advances in deep learning and machine learning. However, these methods are trained on datasets featuring adults in different contexts. This work tests and compares seven popular methods (AlphaPose, DeepLabCut/DeeperCut, Detectron2, HRNet, MediaPipe/BlazePose, OpenPose, and ViTPose) on videos of infants in supine position. Surprisingly, all methods except DeepLabCut and MediaPipe have competitive performance without additional finetuning, with ViTPose performing best. Next to standard performance metrics (object keypoint similarity, average precision and recall), we introduce errors expressed in the neck-mid-hip ratio and additionally study missed and redundant detections and the reliability of the internal confidence ratings of the different methods, which are relevant for downstream tasks. Among the networks with competitive performance, only AlphaPose could run close to real time (27 fps) on our machine. We provide documented Docker containers or instructions for all the methods we used, our analysis scripts, and processed data at https://hub.docker.com/u/humanoidsctu and https://osf.io/x465b/.
comment: 21 pages, 3 figures, 14 tables
☆ Forget but Recall: Incremental Latent Rectification in Continual Learning
Intrinsic capability to continuously learn a changing data stream is a desideratum of deep neural networks (DNNs). However, current DNNs suffer from catastrophic forgetting, which hinders remembering past knowledge. To mitigate this issue, existing Continual Learning (CL) approaches either retain exemplars for replay, regularize learning, or allocate dedicated capacity for new tasks. This paper investigates an unexplored CL direction for incremental learning called Incremental Latent Rectification or ILR. In a nutshell, ILR learns to propagate with correction (or rectify) the representation from the current trained DNN backward to the representation space of the old task, where performing predictive decisions is easier. This rectification process only employs a chain of small representation mapping networks, called rectifier units. Empirical experiments on several continual learning benchmarks, including CIFAR10, CIFAR100, and Tiny ImageNet, demonstrate the effectiveness and potential of this novel CL direction compared to existing representative CL methods.
☆ Semantic Deep Hiding for Robust Unlearnable Examples
Ensuring data privacy and protection has become paramount in the era of deep learning. Unlearnable examples are proposed to mislead the deep learning models and prevent data from unauthorized exploration by adding small perturbations to data. However, such perturbations (e.g., noise, texture, color change) predominantly impact low-level features, making them vulnerable to common countermeasures. In contrast, semantic images with intricate shapes have a wealth of high-level features, making them more resilient to countermeasures and potential for producing robust unlearnable examples. In this paper, we propose a Deep Hiding (DH) scheme that adaptively hides semantic images enriched with high-level features. We employ an Invertible Neural Network (INN) to invisibly integrate predefined images, inherently hiding them with deceptive perturbations. To enhance data unlearnability, we introduce a Latent Feature Concentration module, designed to work with the INN, regularizing the intra-class variance of these perturbations. To further boost the robustness of unlearnable examples, we design a Semantic Images Generation module that produces hidden semantic images. By utilizing similar semantic information, this module generates similar semantic images for samples within the same classes, thereby enlarging the inter-class distance and narrowing the intra-class distance. Extensive experiments on CIFAR-10, CIFAR-100, and an ImageNet subset, against 18 countermeasures, reveal that our proposed method exhibits outstanding robustness for unlearnable examples, demonstrating its efficacy in preventing unauthorized data exploitation.
comment: Accepted by TIFS 2024
☆ NerfBaselines: Consistent and Reproducible Evaluation of Novel View Synthesis Methods
Novel view synthesis is an important problem with many applications, including AR/VR, gaming, and simulations for robotics. With the recent rapid development of Neural Radiance Fields (NeRFs) and 3D Gaussian Splatting (3DGS) methods, it is becoming difficult to keep track of the current state of the art (SoTA) due to methods using different evaluation protocols, codebases being difficult to install and use, and methods not generalizing well to novel 3D scenes. Our experiments support this claim by showing that tiny differences in evaluation protocols of various methods can lead to inconsistent reported metrics. To address these issues, we propose a framework called NerfBaselines, which simplifies the installation of various methods, provides consistent benchmarking tools, and ensures reproducibility. We validate our implementation experimentally by reproducing numbers reported in the original papers. To further improve the accessibility, we release a web platform where commonly used methods are compared on standard benchmarks. Web: https://jkulhanek.com/nerfbaselines
comment: Web: https://jkulhanek.com/nerfbaselines
☆ Q-DiT: Accurate Post-Training Quantization for Diffusion Transformers
Recent advancements in diffusion models, particularly the trend of architectural transformation from UNet-based Diffusion to Diffusion Transformer (DiT), have significantly improved the quality and scalability of image synthesis. Despite the incredible generative quality, the large computational requirements of these large-scale models significantly hinder the deployments in real-world scenarios. Post-training Quantization (PTQ) offers a promising solution by compressing model sizes and speeding up inference for the pretrained models while eliminating model retraining. However, we have observed the existing PTQ frameworks exclusively designed for both ViT and conventional Diffusion models fall into biased quantization and result in remarkable performance degradation. In this paper, we find that the DiTs typically exhibit considerable variance in terms of both weight and activation, which easily runs out of the limited numerical representations. To address this issue, we devise Q-DiT, which seamlessly integrates three techniques: fine-grained quantization to manage substantial variance across input channels of weights and activations, an automatic search strategy to optimize the quantization granularity and mitigate redundancies, and dynamic activation quantization to capture the activation changes across timesteps. Extensive experiments on the ImageNet dataset demonstrate the effectiveness of the proposed Q-DiT. Specifically, when quantizing DiT-XL/2 to W8A8 on ImageNet 256x256, Q-DiT achieves a remarkable reduction in FID by 1.26 compared to the baseline. Under a W4A8 setting, it maintains high fidelity in image generation, showcasing only a marginal increase in FID and setting a new benchmark for efficient, high-quality quantization in diffusion transformers. Code is available at \href{https://github.com/Juanerx/Q-DiT}{https://github.com/Juanerx/Q-DiT}.
☆ Masked Generative Extractor for Synergistic Representation and 3D Generation of Point Clouds
In the field of 2D image generation modeling and representation learning, Masked Generative Encoder (MAGE) has demonstrated the synergistic potential between generative modeling and representation learning. Inspired by this, we propose Point-MAGE to extend this concept to point cloud data. Specifically, this framework first utilizes a Vector Quantized Variational Autoencoder (VQVAE) to reconstruct a neural field representation of 3D shapes, thereby learning discrete semantic features of point patches. Subsequently, by combining the masking model with variable masking ratios, we achieve synchronous training for both generation and representation learning. Furthermore, our framework seamlessly integrates with existing point cloud self-supervised learning (SSL) models, thereby enhancing their performance. We extensively evaluate the representation learning and generation capabilities of Point-MAGE. In shape classification tasks, Point-MAGE achieved an accuracy of 94.2% on the ModelNet40 dataset and 92.9% (+1.3%) on the ScanObjectNN dataset. Additionally, it achieved new state-of-the-art performance in few-shot learning and part segmentation tasks. Experimental results also confirmed that Point-MAGE can generate detailed and high-quality 3D shapes in both unconditional and conditional settings.
☆ Robustly Optimized Deep Feature Decoupling Network for Fatty Liver Diseases Detection MICCAI 2024
Current medical image classification efforts mainly aim for higher average performance, often neglecting the balance between different classes. This can lead to significant differences in recognition accuracy between classes and obvious recognition weaknesses. Without the support of massive data, deep learning faces challenges in fine-grained classification of fatty liver. In this paper, we propose an innovative deep learning framework that combines feature decoupling and adaptive adversarial training. Firstly, we employ two iteratively compressed decouplers to supervised decouple common features and specific features related to fatty liver in abdominal ultrasound images. Subsequently, the decoupled features are concatenated with the original image after transforming the color space and are fed into the classifier. During adversarial training, we adaptively adjust the perturbation and balance the adversarial strength by the accuracy of each class. The model will eliminate recognition weaknesses by correctly classifying adversarial samples, thus improving recognition robustness. Finally, the accuracy of our method improved by 4.16%, achieving 82.95%. As demonstrated by extensive experiments, our method is a generalized learning framework that can be directly used to eliminate the recognition weaknesses of any classifier while improving its average performance. Code is available at https://github.com/HP-ML/MICCAI2024.
comment: MICCAI 2024
☆ XAMI -- A Benchmark Dataset for Artefact Detection in XMM-Newton Optical Images SP
Reflected or scattered light produce artefacts in astronomical observations that can negatively impact the scientific study. Hence, automated detection of these artefacts is highly beneficial, especially with the increasing amounts of data gathered. Machine learning methods are well-suited to this problem, but currently there is a lack of annotated data to train such approaches to detect artefacts in astronomical observations. In this work, we present a dataset of images from the XMM-Newton space telescope Optical Monitoring camera showing different types of artefacts. We hand-annotated a sample of 1000 images with artefacts which we use to train automated ML methods. We further demonstrate techniques tailored for accurate detection and masking of artefacts using instance segmentation. We adopt a hybrid approach, combining knowledge from both convolutional neural networks (CNNs) and transformer-based models and use their advantages in segmentation. The presented method and dataset will advance artefact detection in astronomical observations by providing a reproducible baseline. All code and data are made available (https://github.com/ESA-Datalabs/XAMI-model and https://github.com/ESA-Datalabs/XAMI-dataset).
comment: submitted to SPAICE 2024
☆ DMF-Net: Image-Guided Point Cloud Completion with Dual-Channel Modality Fusion and Shape-Aware Upsampling Transformer
In this paper we study the task of a single-view image-guided point cloud completion. Existing methods have got promising results by fusing the information of image into point cloud explicitly or implicitly. However, given that the image has global shape information and the partial point cloud has rich local details, We believe that both modalities need to be given equal attention when performing modality fusion. To this end, we propose a novel dual-channel modality fusion network for image-guided point cloud completion(named DMF-Net), in a coarse-to-fine manner. In the first stage, DMF-Net takes a partial point cloud and corresponding image as input to recover a coarse point cloud. In the second stage, the coarse point cloud will be upsampled twice with shape-aware upsampling transformer to get the dense and complete point cloud. Extensive quantitative and qualitative experimental results show that DMF-Net outperforms the state-of-the-art unimodal and multimodal point cloud completion works on ShapeNet-ViPC dataset.
☆ Zero-Shot Long-Form Video Understanding through Screenplay CVPR'2024
The Long-form Video Question-Answering task requires the comprehension and analysis of extended video content to respond accurately to questions by utilizing both temporal and contextual information. In this paper, we present MM-Screenplayer, an advanced video understanding system with multi-modal perception capabilities that can convert any video into textual screenplay representations. Unlike previous storytelling methods, we organize video content into scenes as the basic unit, rather than just visually continuous shots. Additionally, we developed a ``Look Back'' strategy to reassess and validate uncertain information, particularly targeting breakpoint mode. MM-Screenplayer achieved highest score in the CVPR'2024 LOng-form VidEo Understanding (LOVEU) Track 1 Challenge, with a global accuracy of 87.5% and a breakpoint accuracy of 68.8%.
comment: Highest Score Award to the CVPR'2024 LOVEU Track 1 Challenge
☆ Towards Open-set Camera 3D Object Detection
Traditional camera 3D object detectors are typically trained to recognize a predefined set of known object classes. In real-world scenarios, these detectors may encounter unknown objects outside the training categories and fail to identify them correctly. To address this gap, we present OS-Det3D (Open-set Camera 3D Object Detection), a two-stage training framework enhancing the ability of camera 3D detectors to identify both known and unknown objects. The framework involves our proposed 3D Object Discovery Network (ODN3D), which is specifically trained using geometric cues such as the location and scale of 3D boxes to discover general 3D objects. ODN3D is trained in a class-agnostic manner, and the provided 3D object region proposals inherently come with data noise. To boost accuracy in identifying unknown objects, we introduce a Joint Objectness Selection (JOS) module. JOS selects the pseudo ground truth for unknown objects from the 3D object region proposals of ODN3D by combining the ODN3D objectness and camera feature attention objectness. Experiments on the nuScenes and KITTI datasets demonstrate the effectiveness of our framework in enabling camera 3D detectors to successfully identify unknown objects while also improving their performance on known objects.
☆ Image-Guided Outdoor LiDAR Perception Quality Assessment for Autonomous Driving
LiDAR is one of the most crucial sensors for autonomous vehicle perception. However, current LiDAR-based point cloud perception algorithms lack comprehensive and rigorous LiDAR quality assessment methods, leading to uncertainty in detection performance. Additionally, existing point cloud quality assessment algorithms are predominantly designed for indoor environments or single-object scenarios. In this paper, we introduce a novel image-guided point cloud quality assessment algorithm for outdoor autonomous driving environments, named the Image-Guided Outdoor Point Cloud Quality Assessment (IGO-PQA) algorithm. Our proposed algorithm comprises two main components. The first component is the IGO-PQA generation algorithm, which leverages point cloud data, corresponding RGB surrounding view images, and agent objects' ground truth annotations to generate an overall quality score for a single-frame LiDAR-based point cloud. The second component is a transformer-based IGO-PQA regression algorithm for no-reference outdoor point cloud quality assessment. This regression algorithm allows for the direct prediction of IGO-PQA scores in an online manner, without requiring image data and object ground truth annotations. We evaluate our proposed algorithm using the nuScenes and Waymo open datasets. The IGO-PQA generation algorithm provides consistent and reasonable perception quality indices. Furthermore, our proposed IGO-PQA regression algorithm achieves a Pearson Linear Correlation Coefficient (PLCC) of 0.86 on the nuScenes dataset and 0.97 on the Waymo dataset.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
☆ Disentangled Motion Modeling for Video Frame Interpolation
Video frame interpolation (VFI) aims to synthesize intermediate frames in between existing frames to enhance visual smoothness and quality. Beyond the conventional methods based on the reconstruction loss, recent works employ the high quality generative models for perceptual quality. However, they require complex training and large computational cost for modeling on the pixel space. In this paper, we introduce disentangled Motion Modeling (MoMo), a diffusion-based approach for VFI that enhances visual quality by focusing on intermediate motion modeling. We propose disentangled two-stage training process, initially training a frame synthesis model to generate frames from input pairs and their optical flows. Subsequently, we propose a motion diffusion model, equipped with our novel diffusion U-Net architecture designed for optical flow, to produce bi-directional flows between frames. This method, by leveraging the simpler low-frequency representation of motions, achieves superior perceptual quality with reduced computational demands compared to generative modeling methods on the pixel space. Our method surpasses state-of-the-art methods in perceptual metrics across various benchmarks, demonstrating its efficacy and efficiency in VFI. Our code is available at: https://github.com/JHLew/MoMo
☆ Scalp Diagnostic System With Label-Free Segmentation and Training-Free Image Translation
Scalp diseases and alopecia affect millions of people around the world, underscoring the urgent need for early diagnosis and management of the disease.However, the development of a comprehensive AI-based diagnosis system encompassing these conditions remains an underexplored domain due to the challenges associated with data imbalance and the costly nature of labeling. To address these issues, we propose ``ScalpVision", an AI-driven system for the holistic diagnosis of scalp diseases and alopecia.In ScalpVision, effective hair segmentation is achieved using pseudo image-label pairs and an innovative prompting method in the absence of traditional hair masking labels. This approach is crucial for extracting key features such as hair thickness and count, which are then used to assess alopecia severity. Additionally, ScalpVision introduces DiffuseIT-M, a generative model adept at dataset augmentation while maintaining hair information, facilitating improved predictions of scalp disease severity. Our experimental results affirm ScalpVision's efficiency in diagnosing a variety of scalp conditions and alopecia, showcasing its potential as a valuable tool in dermatological care.
comment: IEEE Transactions on Medical Imaging (Under Review)
☆ A benchmark for 2D foetal brain ultrasound analysis
Brain development involves a sequence of structural changes from early stages of the embryo until several months after birth. Currently, ultrasound is the established technique for screening due to its ability to acquire dynamic images in real-time without radiation and to its cost-efficiency. However, identifying abnormalities remains challenging due to the difficulty in interpreting foetal brain images. In this work we present a set of 104 2D foetal brain ultrasound images acquired during the 20th week of gestation that have been co-registered to a common space from a rough skull segmentation. The images are provided both on the original space and template space centred on the ellipses of all the subjects. Furthermore, the images have been annotated to highlight landmark points from structures of interest to analyse brain development. Both the final atlas template with probabilistic maps and the original images can be used to develop new segmentation techniques, test registration approaches for foetal brain ultrasound, extend our work to longitudinal datasets and to detect anomalies in new images.
☆ Expansive Synthesis: Generating Large-Scale Datasets from Minimal Samples
The challenge of limited availability of data for training in machine learning arises in many applications and the impact on performance and generalization is serious. Traditional data augmentation methods aim to enhance training with a moderately sufficient data set. Generative models like Generative Adversarial Networks (GANs) often face problematic convergence when generating significant and diverse data samples. Diffusion models, though effective, still struggle with high computational cost and long training times. This paper introduces an innovative Expansive Synthesis model that generates large-scale, high-fidelity datasets from minimal samples. The proposed approach exploits expander graph mappings and feature interpolation to synthesize expanded datasets while preserving the intrinsic data distribution and feature structural relationships. The rationale of the model is rooted in the non-linear property of neural networks' latent space and in its capture by a Koopman operator to yield a linear space of features to facilitate the construction of larger and enriched consistent datasets starting with a much smaller dataset. This process is optimized by an autoencoder architecture enhanced with self-attention layers and further refined for distributional consistency by optimal transport. We validate our Expansive Synthesis by training classifiers on the generated datasets and comparing their performance to classifiers trained on larger, original datasets. Experimental results demonstrate that classifiers trained on synthesized data achieve performance metrics on par with those trained on full-scale datasets, showcasing the model's potential to effectively augment training data. This work represents a significant advancement in data generation, offering a robust solution to data scarcity and paving the way for enhanced data availability in machine learning applications.
comment: 14 pages. arXiv admin note: text overlap with arXiv:2405.13866
☆ LIPE: Learning Personalized Identity Prior for Non-rigid Image Editing
Although recent years have witnessed significant advancements in image editing thanks to the remarkable progress of text-to-image diffusion models, the problem of non-rigid image editing still presents its complexities and challenges. Existing methods often fail to achieve consistent results due to the absence of unique identity characteristics. Thus, learning a personalized identity prior might help with consistency in the edited results. In this paper, we explore a novel task: learning the personalized identity prior for text-based non-rigid image editing. To address the problems in jointly learning prior and editing the image, we present LIPE, a two-stage framework designed to customize the generative model utilizing a limited set of images of the same subject, and subsequently employ the model with learned prior for non-rigid image editing. Experimental results demonstrate the advantages of our approach in various editing scenarios over past related leading methods in qualitative and quantitative ways.
☆ Task-Agnostic Federated Learning
In the realm of medical imaging, leveraging large-scale datasets from various institutions is crucial for developing precise deep learning models, yet privacy concerns frequently impede data sharing. federated learning (FL) emerges as a prominent solution for preserving privacy while facilitating collaborative learning. However, its application in real-world scenarios faces several obstacles, such as task & data heterogeneity, label scarcity, non-identically distributed (non-IID) data, computational vaiation, etc. In real-world, medical institutions may not want to disclose their tasks to FL server and generalization challenge of out-of-network institutions with un-seen task want to join the on-going federated system. This study address task-agnostic and generalization problem on un-seen tasks by adapting self-supervised FL framework. Utilizing Vision Transformer (ViT) as consensus feature encoder for self-supervised pre-training, no initial labels required, the framework enabling effective representation learning across diverse datasets and tasks. Our extensive evaluations, using various real-world non-IID medical imaging datasets, validate our approach's efficacy, retaining 90\% of F1 accuracy with only 5\% of the training data typically required for centralized approaches and exhibiting superior adaptability to out-of-distribution task. The result indicate that federated learning architecture can be a potential approach toward multi-task foundation modeling.
☆ Multimodal Cross-Task Interaction for Survival Analysis in Whole Slide Pathological Images
Survival prediction, utilizing pathological images and genomic profiles, is increasingly important in cancer analysis and prognosis. Despite significant progress, precise survival analysis still faces two main challenges: (1) The massive pixels contained in whole slide images (WSIs) complicate the process of pathological images, making it difficult to generate an effective representation of the tumor microenvironment (TME). (2) Existing multimodal methods often rely on alignment strategies to integrate complementary information, which may lead to information loss due to the inherent heterogeneity between pathology and genes. In this paper, we propose a Multimodal Cross-Task Interaction (MCTI) framework to explore the intrinsic correlations between subtype classification and survival analysis tasks. Specifically, to capture TME-related features in WSIs, we leverage the subtype classification task to mine tumor regions. Simultaneously, multi-head attention mechanisms are applied in genomic feature extraction, adaptively performing genes grouping to obtain task-related genomic embedding. With the joint representation of pathological images and genomic data, we further introduce a Transport-Guided Attention (TGA) module that uses optimal transport theory to model the correlation between subtype classification and survival analysis tasks, effectively transferring potential information. Extensive experiments demonstrate the superiority of our approaches, with MCTI outperforming state-of-the-art frameworks on three public benchmarks. \href{https://github.com/jsh0792/MCTI}{https://github.com/jsh0792/MCTI}.
☆ Large Language Models are Interpretable Learners
The trade-off between expressiveness and interpretability remains a core challenge when building human-centric predictive models for classification and decision-making. While symbolic rules offer interpretability, they often lack expressiveness, whereas neural networks excel in performance but are known for being black boxes. In this paper, we show a combination of Large Language Models (LLMs) and symbolic programs can bridge this gap. In the proposed LLM-based Symbolic Programs (LSPs), the pretrained LLM with natural language prompts provides a massive set of interpretable modules that can transform raw input into natural language concepts. Symbolic programs then integrate these modules into an interpretable decision rule. To train LSPs, we develop a divide-and-conquer approach to incrementally build the program from scratch, where the learning process of each step is guided by LLMs. To evaluate the effectiveness of LSPs in extracting interpretable and accurate knowledge from data, we introduce IL-Bench, a collection of diverse tasks, including both synthetic and real-world scenarios across different modalities. Empirical results demonstrate LSP's superior performance compared to traditional neurosymbolic programs and vanilla automatic prompt tuning methods. Moreover, as the knowledge learned by LSP is a combination of natural language descriptions and symbolic rules, it is easily transferable to humans (interpretable), and other LLMs, and generalizes well to out-of-distribution samples.
comment: Preliminary Version, Code at [this url](https://github.com/ruocwang/llm-symbolic-program)
☆ Facial Identity Anonymization via Intrinsic and Extrinsic Attention Distraction CVPR
The unprecedented capture and application of face images raise increasing concerns on anonymization to fight against privacy disclosure. Most existing methods may suffer from the problem of excessive change of the identity-independent information or insufficient identity protection. In this paper, we present a new face anonymization approach by distracting the intrinsic and extrinsic identity attentions. On the one hand, we anonymize the identity information in the feature space by distracting the intrinsic identity attention. On the other, we anonymize the visual clues (i.e. appearance and geometry structure) by distracting the extrinsic identity attention. Our approach allows for flexible and intuitive manipulation of face appearance and geometry structure to produce diverse results, and it can also be used to instruct users to perform personalized anonymization. We conduct extensive experiments on multiple datasets and demonstrate that our approach outperforms state-of-the-art methods.
comment: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024: 12406-12415
☆ Evaluating Fairness in Large Vision-Language Models Across Diverse Demographic Attributes and Prompts
Large vision-language models (LVLMs) have recently achieved significant progress, demonstrating strong capabilities in open-world visual understanding. However, it is not yet clear how LVLMs address demographic biases in real life, especially the disparities across attributes such as gender, skin tone, and age. In this paper, we empirically investigate \emph{visual fairness} in several mainstream LVLMs and audit their performance disparities across sensitive demographic attributes, based on public fairness benchmark datasets (e.g., FACET). To disclose the visual bias in LVLMs, we design a fairness evaluation framework with direct questions and single-choice question-instructed prompts on visual question-answering/classification tasks. The zero-shot prompting results indicate that, despite enhancements in visual understanding, both open-source and closed-source LVLMs exhibit prevalent fairness issues across different instruct prompts and demographic attributes.
☆ Highly Constrained Coded Aperture Imaging Systems Design Via a Knowledge Distillation Approach ICIP 2024
Computational optical imaging (COI) systems have enabled the acquisition of high-dimensional signals through optical coding elements (OCEs). OCEs encode the high-dimensional signal in one or more snapshots, which are subsequently decoded using computational algorithms. Currently, COI systems are optimized through an end-to-end (E2E) approach, where the OCEs are modeled as a layer of a neural network and the remaining layers perform a specific imaging task. However, the performance of COI systems optimized through E2E is limited by the physical constraints imposed by these systems. This paper proposes a knowledge distillation (KD) framework for the design of highly physically constrained COI systems. This approach employs the KD methodology, which consists of a teacher-student relationship, where a high-performance, unconstrained COI system (the teacher), guides the optimization of a physically constrained system (the student) characterized by a limited number of snapshots. We validate the proposed approach, using a binary coded apertures single pixel camera for monochromatic and multispectral image reconstruction. Simulation results demonstrate the superiority of the KD scheme over traditional E2E optimization for the designing of highly physically constrained COI systems.
comment: 7 pages, 3 figures. Accepted at ICIP 2024
☆ MAGIC: Meta-Ability Guided Interactive Chain-of-Distillation for Effective-and-Efficient Vision-and-Language Navigation
Despite the remarkable developments of recent large models in Embodied Artificial Intelligence (E-AI), their integration into robotics is hampered by their excessive parameter sizes and computational demands. Towards the Vision-and-Language Navigation (VLN) task, a core task in E-AI, this paper reveals the great potential of using knowledge distillation for obtaining lightweight student models by proposing a Meta-Ability Guided Interactive Chain-of-distillation (MAGIC) method. Specifically, a Meta-Ability Knowledge Distillation (MAKD) framework is proposed for decoupling and refining the necessary meta-abilities of VLN agents. A Meta-Knowledge Randomization Weighting (MKRW) and a Meta-Knowledge Transferable Determination (MKTD) module are incorporated to dynamically adjust aggregation weights at the meta-ability and sample levels, respectively. Move beyond the traditional one-step unidirectional distillation, an Interactive Chain-of-Distillation (ICoD) learning strategy is proposed to allow students to give feedback to teachers, forming a new multi-step teacher-student co-evolution pipeline. Remarkably, on the R2R test unseen public leaderboard, our smallest model, MAGIC-S, with only 5% (11M) of the teacher's size, outperforms all previous methods under the same training data. Additionally, our largest model, MAGIC-L, surpasses the previous state-of-the-art by 5.84% in SPL and 3.18% in SR. Furthermore, a new dataset was collected and annotated from our living environments, where MAGIC-S demonstrated superior performance and real-time efficiency. Our code is publicly available on https://github.com/CrystalSixone/VLN-MAGIC.
☆ Hot-Distance: Combining One-Hot and Signed Distance Embeddings for Segmentation
Machine learning models are only as good as the data to which they are fit. As such, it is always preferable to use as much data as possible in training models. What data can be used for fitting a model depends a lot on the formulation of the task. We introduce Hot-Distance, a novel segmentation target that incorporates the strength of signed boundary distance prediction with the flexibility of one-hot encoding, to increase the amount of usable training data for segmentation of subcellular structures in focused ion beam scanning electron microscopy (FIB-SEM).
comment: 3 pages, 1 figure, in progress
☆ Semi-supervised classification of dental conditions in panoramic radiographs using large language model and instance segmentation: A real-world dataset evaluation
Dental panoramic radiographs offer vast diagnostic opportunities, but training supervised deep learning networks for automatic analysis of those radiology images is hampered by a shortage of labeled data. Here, a different perspective on this problem is introduced. A semi-supervised learning framework is proposed to classify thirteen dental conditions on panoramic radiographs, with a particular emphasis on teeth. Large language models were explored to annotate the most common dental conditions based on dental reports. Additionally, a masked autoencoder was employed to pre-train the classification neural network, and a Vision Transformer was used to leverage the unlabeled data. The analyses were validated using two of the most extensive datasets in the literature, comprising 8,795 panoramic radiographs and 8,029 paired reports and images. Encouragingly, the results consistently met or surpassed the baseline metrics for the Matthews correlation coefficient. A comparison of the proposed solution with human practitioners, supported by statistical analysis, highlighted its effectiveness and performance limitations; based on the degree of agreement among specialists, the solution demonstrated an accuracy level comparable to that of a junior specialist.
comment: 43 pages, 12 figures, 9 tables
☆ DeepSense-V2V: A Vehicle-to-Vehicle Multi-Modal Sensing, Localization, and Communications Dataset
High data rate and low-latency vehicle-to-vehicle (V2V) communication are essential for future intelligent transport systems to enable coordination, enhance safety, and support distributed computing and intelligence requirements. Developing effective communication strategies, however, demands realistic test scenarios and datasets. This is important at the high-frequency bands where more spectrum is available, yet harvesting this bandwidth is challenged by the need for direction transmission and the sensitivity of signal propagation to blockages. This work presents the first large-scale multi-modal dataset for studying mmWave vehicle-to-vehicle communications. It presents a two-vehicle testbed that comprises data from a 360-degree camera, four radars, four 60 GHz phased arrays, a 3D lidar, and two precise GPSs. The dataset contains vehicles driving during the day and night for 120 km in intercity and rural settings, with speeds up to 100 km per hour. More than one million objects were detected across all images, from trucks to bicycles. This work further includes detailed dataset statistics that prove the coverage of various situations and highlights how this dataset can enable novel machine-learning applications.
comment: 14 pages, 15 figures, 2 tables. The dataset is available on the DeepSense6G website: https://deepsense6g.net/
☆ Domain Adaptation of Echocardiography Segmentation Via Reinforcement Learning
Performance of deep learning segmentation models is significantly challenged in its transferability across different medical imaging domains, particularly when aiming to adapt these models to a target domain with insufficient annotated data for effective fine-tuning. While existing domain adaptation (DA) methods propose strategies to alleviate this problem, these methods do not explicitly incorporate human-verified segmentation priors, compromising the potential of a model to produce anatomically plausible segmentations. We introduce RL4Seg, an innovative reinforcement learning framework that reduces the need to otherwise incorporate large expertly annotated datasets in the target domain, and eliminates the need for lengthy manual human review. Using a target dataset of 10,000 unannotated 2D echocardiographic images, RL4Seg not only outperforms existing state-of-the-art DA methods in accuracy but also achieves 99% anatomical validity on a subset of 220 expert-validated subjects from the target domain. Furthermore, our framework's reward network offers uncertainty estimates comparable with dedicated state-of-the-art uncertainty methods, demonstrating the utility and effectiveness of RL4Seg in overcoming domain adaptation challenges in medical image segmentation.
comment: 9 pages
☆ Entity Augmentation for Efficient Classification of Vertically Partitioned Data with Limited Overlap IJCAI 2024
Vertical Federated Learning (VFL) is a machine learning paradigm for learning from vertically partitioned data (i.e. features for each input are distributed across multiple "guest" clients and an aggregating "host" server owns labels) without communicating raw data. Traditionally, VFL involves an "entity resolution" phase where the host identifies and serializes the unique entities known to all guests. This is followed by private set intersection to find common entities, and an "entity alignment" step to ensure all guests are always processing the same entity's data. However, using only data of entities from the intersection means guests discard potentially useful data. Besides, the effect on privacy is dubious and these operations are computationally expensive. We propose a novel approach that eliminates the need for set intersection and entity alignment in categorical tasks. Our Entity Augmentation technique generates meaningful labels for activations sent to the host, regardless of their originating entity, enabling efficient VFL without explicit entity alignment. With limited overlap between training data, this approach performs substantially better (e.g. with 5% overlap, 48.1% vs 69.48% test accuracy on CIFAR-10). In fact, thanks to the regularizing effect, our model performs marginally better even with 100% overlap.
comment: GLOW @ IJCAI 2024 (12 pages + 2 page bibliography. 15 figures.)
☆ MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query. Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches with a fraction of the prominent video content in the foreground with limited wording diversity. This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text. It confines the cross-modal alignment knowledge within the scope of a limited text corpus, thereby leading to sub-optimal visual-textual modeling and poor generalizability. By leveraging the visual-textual understanding capability of multi-modal large language models (MLLM), in this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization. To effectively maintain temporal sensibility for localization, we design to get text narratives for each certain video timestamp and construct a structured text paragraph with time information, which is temporally aligned with the visual content. Then we perform cross-modal feature merging between the temporal-aware narratives and corresponding video temporal features to produce semantic-enhanced video representation sequences for query localization. Subsequently, we introduce a uni-modal narrative-query matching mechanism, which encourages the model to extract complementary information from contextual cohesive descriptions for improved retrieval. Extensive experiments on two benchmarks show the effectiveness and generalizability of our proposed method.
comment: Under review
☆ ET tu, CLIP? Addressing Common Object Errors for Unseen Environments
We introduce a simple method that employs pre-trained CLIP encoders to enhance model generalization in the ALFRED task. In contrast to previous literature where CLIP replaces the visual encoder, we suggest using CLIP as an additional module through an auxiliary object detection objective. We validate our method on the recently proposed Episodic Transformer architecture and demonstrate that incorporating CLIP improves task performance on the unseen validation set. Additionally, our analysis results support that CLIP especially helps with leveraging object descriptions, detecting small objects, and interpreting rare words.
☆ Burst Image Super-Resolution with Base Frame Selection CVPR2024
Burst image super-resolution has been a topic of active research in recent years due to its ability to obtain a high-resolution image by using complementary information between multiple frames in the burst. In this work, we explore using burst shots with non-uniform exposures to confront real-world practical scenarios by introducing a new benchmark dataset, dubbed Non-uniformly Exposed Burst Image (NEBI), that includes the burst frames at varying exposure times to obtain a broader range of irradiance and motion characteristics within a scene. As burst shots with non-uniform exposures exhibit varying levels of degradation, fusing information of the burst shots into the first frame as a base frame may not result in optimal image quality. To address this limitation, we propose a Frame Selection Network (FSN) for non-uniform scenarios. This network seamlessly integrates into existing super-resolution methods in a plug-and-play manner with low computational costs. The comparative analysis reveals the effectiveness of the nonuniform setting for the practical scenario and our FSN on synthetic-/real- NEBI datasets.
comment: CVPR2024W NTIRE accepted
☆ Depth-Driven Geometric Prompt Learning for Laparoscopic Liver Landmark Detection MICCAI 2024
Laparoscopic liver surgery poses a complex intraoperative dynamic environment for surgeons, where remains a significant challenge to distinguish critical or even hidden structures inside the liver. Liver anatomical landmarks, e.g., ridge and ligament, serve as important markers for 2D-3D alignment, which can significantly enhance the spatial perception of surgeons for precise surgery. To facilitate the detection of laparoscopic liver landmarks, we collect a novel dataset called L3D, which comprises 1,152 frames with elaborated landmark annotations from surgical videos of 39 patients across two medical sites. For benchmarking purposes, 12 mainstream detection methods are selected and comprehensively evaluated on L3D. Further, we propose a depth-driven geometric prompt learning network, namely D2GPLand. Specifically, we design a Depth-aware Prompt Embedding (DPE) module that is guided by self-supervised prompts and generates semantically relevant geometric information with the benefit of global depth cues extracted from SAM-based features. Additionally, a Semantic-specific Geometric Augmentation (SGA) scheme is introduced to efficiently merge RGB-D spatial and geometric information through reverse anatomic perception. The experimental results indicate that D2GPLand obtains state-of-the-art performance on L3D, with 63.52% DICE and 48.68% IoU scores. Together with 2D-3D fusion technology, our method can directly provide the surgeon with intuitive guidance information in laparoscopic scenarios.
comment: This paper has been accepted by MICCAI 2024
☆ Human-Object Interaction from Human-Level Instructions
Intelligent agents need to autonomously navigate and interact within contextual environments to perform a wide range of daily tasks based on human-level instructions. These agents require a foundational understanding of the world, incorporating common sense and knowledge, to interpret such instructions. Moreover, they must possess precise low-level skills for movement and interaction to execute the detailed task plans derived from these instructions. In this work, we address the task of synthesizing continuous human-object interactions for manipulating large objects within contextual environments, guided by human-level instructions. Our goal is to generate synchronized object motion, full-body human motion, and detailed finger motion, all essential for realistic interactions. Our framework consists of a large language model (LLM) planning module and a low-level motion generator. We use LLMs to deduce spatial object relationships and devise a method for accurately determining their positions and orientations in target scene layouts. Additionally, the LLM planner outlines a detailed task plan specifying a sequence of sub-tasks. This task plan, along with the target object poses, serves as input for our low-level motion generator, which seamlessly alternates between navigation and interaction modules. We present the first complete system that can synthesize object motion, full-body motion, and finger motion simultaneously from human-level instructions. Our experiments demonstrate the effectiveness of our high-level planner in generating plausible target layouts and our low-level motion generator in synthesizing realistic interactions for diverse objects. Please refer to our project page for more results: https://hoifhli.github.io/.
comment: 10 pages
♻ ☆ CT-Bound: Robust Boundary Detection From Noisy Images Via Hybrid Convolution and Transformer Neural Networks
We present CT-Bound, a robust and fast boundary detection method for very noisy images using a hybrid Convolution and Transformer neural network. The proposed architecture decomposes boundary estimation into two tasks: local detection and global regularization. During the local detection, the model uses a convolutional architecture to predict the boundary structure of each image patch in the form of a pre-defined local boundary representation, the field-of-junctions (FoJ). Then, it uses a feed-forward transformer architecture to globally refine the boundary structures of each patch to generate an edge map and a smoothed color map simultaneously. Our quantitative analysis shows that CT-Bound outperforms the previous best algorithms in edge detection on very noisy images. It also increases the edge detection accuracy of FoJ-based methods while having a 3-time speed improvement. Finally, we demonstrate that CT-Bound can produce boundary and color maps on real captured images without extra fine-tuning and real-time boundary map and color map videos at ten frames per second.
♻ ☆ DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
A primary hurdle of autonomous driving in urban environments is understanding complex and long-tail scenarios, such as challenging road conditions and delicate human behaviors. We introduce DriveVLM, an autonomous driving system leveraging Vision-Language Models (VLMs) for enhanced scene understanding and planning capabilities. DriveVLM integrates a unique combination of reasoning modules for scene description, scene analysis, and hierarchical planning. Furthermore, recognizing the limitations of VLMs in spatial reasoning and heavy computational requirements, we propose DriveVLM-Dual, a hybrid system that synergizes the strengths of DriveVLM with the traditional autonomous driving pipeline. Experiments on both the nuScenes dataset and our SUP-AD dataset demonstrate the efficacy of DriveVLM and DriveVLM-Dual in handling complex and unpredictable driving conditions. Finally, we deploy the DriveVLM-Dual on a production vehicle, verifying it is effective in real-world autonomous driving environments.
comment: Project Page: https://tsinghua-mars-lab.github.io/DriveVLM/
♻ ☆ Enhancing Active Learning for Sentinel 2 Imagery through Contrastive Learning and Uncertainty Estimation
In this paper, we introduce a novel method designed to enhance label efficiency in satellite imagery analysis by integrating semi-supervised learning (SSL) with active learning strategies. Our approach utilizes contrastive learning together with uncertainty estimations via Monte Carlo Dropout (MC Dropout), with a particular focus on Sentinel-2 imagery analyzed using the Eurosat dataset. We explore the effectiveness of our method in scenarios featuring both balanced and unbalanced class distributions. Our results show that the proposed method performs better than several other popular methods in this field, enabling significant savings in labeling effort while maintaining high classification accuracy. These findings highlight the potential of our approach to facilitate scalable and cost-effective satellite image analysis, particularly advantageous for extensive environmental monitoring and land use classification tasks.
♻ ☆ Diverse Part Synthesis for 3D Shape Creation
Methods that use neural networks for synthesizing 3D shapes in the form of a part-based representation have been introduced over the last few years. These methods represent shapes as a graph or hierarchy of parts and enable a variety of applications such as shape sampling and reconstruction. However, current methods do not allow easily regenerating individual shape parts according to user preferences. In this paper, we investigate techniques that allow the user to generate multiple, diverse suggestions for individual parts. Specifically, we experiment with multimodal deep generative models that allow sampling diverse suggestions for shape parts and focus on models which have not been considered in previous work on shape synthesis. To provide a comparative study of these techniques, we introduce a method for synthesizing 3D shapes in a part-based representation and evaluate all the part suggestion techniques within this synthesis method. In our method, which is inspired by previous work, shapes are represented as a set of parts in the form of implicit functions which are then positioned in space to form the final shape. Synthesis in this representation is enabled by a neural network architecture based on an implicit decoder and a spatial transformer. We compare the various multimodal generative models by evaluating their performance in generating part suggestions. Our contribution is to show with qualitative and quantitative evaluations which of the new techniques for multimodal part generation perform the best and that a synthesis method based on the top-performing techniques allows the user to more finely control the parts that are generated in the 3D shapes while maintaining high shape fidelity when reconstructing shapes.
♻ ☆ XCube: Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies CVPR 2024
We present XCube (abbreviated as $\mathcal{X}^3$), a novel generative model for high-resolution sparse 3D voxel grids with arbitrary attributes. Our model can generate millions of voxels with a finest effective resolution of up to $1024^3$ in a feed-forward fashion without time-consuming test-time optimization. To achieve this, we employ a hierarchical voxel latent diffusion model which generates progressively higher resolution grids in a coarse-to-fine manner using a custom framework built on the highly efficient VDB data structure. Apart from generating high-resolution objects, we demonstrate the effectiveness of XCube on large outdoor scenes at scales of 100m$\times$100m with a voxel size as small as 10cm. We observe clear qualitative and quantitative improvements over past approaches. In addition to unconditional generation, we show that our model can be used to solve a variety of tasks such as user-guided editing, scene completion from a single scan, and text-to-3D. The source code and more results can be found at https://research.nvidia.com/labs/toronto-ai/xcube/.
comment: CVPR 2024 Highlight. Code: https://github.com/nv-tlabs/XCube/ Website: https://research.nvidia.com/labs/toronto-ai/xcube/
♻ ☆ ID-Animator: Zero-Shot Identity-Preserving Human Video Generation
Generating high-fidelity human video with specified identities has attracted significant attention in the content generation community. However, existing techniques struggle to strike a balance between training efficiency and identity preservation, either requiring tedious case-by-case fine-tuning or usually missing identity details in the video generation process. In this study, we present \textbf{ID-Animator}, a zero-shot human-video generation approach that can perform personalized video generation given a single reference facial image without further training. ID-Animator inherits existing diffusion-based video generation backbones with a face adapter to encode the ID-relevant embeddings from learnable facial latent queries. To facilitate the extraction of identity information in video generation, we introduce an ID-oriented dataset construction pipeline that incorporates unified human attributes and action captioning techniques from a constructed facial image pool. Based on this pipeline, a random reference training strategy is further devised to precisely capture the ID-relevant embeddings with an ID-preserving loss, thus improving the fidelity and generalization capacity of our model for ID-specific video generation. Extensive experiments demonstrate the superiority of ID-Animator to generate personalized human videos over previous models. Moreover, our method is highly compatible with popular pre-trained T2V models like animatediff and various community backbone models, showing high extendability in real-world applications for video generation where identity preservation is highly desired. Our codes and checkpoints are released at https://github.com/ID-Animator/ID-Animator.
comment: Project Page: https://id-animator.github.io/
♻ ☆ Deep Pulse-Signal Magnification for remote Heart Rate Estimation in Compressed Videos
Recent advancements in data-driven approaches for remote photoplethysmography (rPPG) have significantly improved the accuracy of remote heart rate estimation. However, the performance of such approaches worsens considerably under video compression, which is nevertheless necessary to store and transmit video data efficiently. In this paper, we present a novel approach to address the impact of video compression on rPPG estimation, which leverages a pulse-signal magnification transformation to adapt compressed videos to an uncompressed data domain in which the rPPG signal is magnified. We validate the effectiveness of our model by exhaustive evaluations on two publicly available datasets, UCLA-rPPG and UBFC-rPPG, employing both intra- and cross-database performance at several compression rates. Additionally, we assess the robustness of our approach on two additional highly compressed and widely-used datasets, MAHNOB-HCI and COHFACE, which reveal outstanding heart rate estimation results.
♻ ☆ Image Distillation for Safe Data Sharing in Histopathology MICCAI 2024
Histopathology can help clinicians make accurate diagnoses, determine disease prognosis, and plan appropriate treatment strategies. As deep learning techniques prove successful in the medical domain, the primary challenges become limited data availability and concerns about data sharing and privacy. Federated learning has addressed this challenge by training models locally and updating parameters on a server. However, issues, such as domain shift and bias, persist and impact overall performance. Dataset distillation presents an alternative approach to overcoming these challenges. It involves creating a small synthetic dataset that encapsulates essential information, which can be shared without constraints. At present, this paradigm is not practicable as current distillation approaches only generate non human readable representations and exhibit insufficient performance for downstream learning tasks. We train a latent diffusion model and construct a new distilled synthetic dataset with a small number of human readable synthetic images. Selection of maximally informative synthetic images is done via graph community analysis of the representation space. We compare downstream classification models trained on our synthetic distillation data to models trained on real data and reach performances suitable for practical application.
comment: accepted at MICCAI 2024
♻ ☆ Fine-grained Prompt Tuning: A Parameter and Memory Efficient Transfer Learning Method for High-resolution Medical Image Classification MICCAI 2024
Parameter-efficient transfer learning (PETL) is proposed as a cost-effective way to transfer pre-trained models to downstream tasks, avoiding the high cost of updating entire large-scale pre-trained models (LPMs). In this work, we present Fine-grained Prompt Tuning (FPT), a novel PETL method for medical image classification. FPT significantly reduces memory consumption compared to other PETL methods, especially in high-resolution input contexts. To achieve this, we first freeze the weights of the LPM and construct a learnable lightweight side network. The frozen LPM takes high-resolution images as input to extract fine-grained features, while the side network is fed low-resolution images to reduce memory usage. To allow the side network to access pre-trained knowledge, we introduce fine-grained prompts that summarize information from the LPM through a fusion module. Important tokens selection and preloading techniques are employed to further reduce training cost and memory requirements. We evaluate FPT on four medical datasets with varying sizes, modalities, and complexities. Experimental results demonstrate that FPT achieves comparable performance to fine-tuning the entire LPM while using only 1.8% of the learnable parameters and 13% of the memory costs of an encoder ViT-B model with a 512 x 512 input resolution.
comment: MICCAI 2024
♻ ☆ Metrics for Dataset Demographic Bias: A Case Study on Facial Expression Recognition
Demographic biases in source datasets have been shown as one of the causes of unfairness and discrimination in the predictions of Machine Learning models. One of the most prominent types of demographic bias are statistical imbalances in the representation of demographic groups in the datasets. In this paper, we study the measurement of these biases by reviewing the existing metrics, including those that can be borrowed from other disciplines. We develop a taxonomy for the classification of these metrics, providing a practical guide for the selection of appropriate metrics. To illustrate the utility of our framework, and to further understand the practical characteristics of the metrics, we conduct a case study of 20 datasets used in Facial Emotion Recognition (FER), analyzing the biases present in them. Our experimental results show that many metrics are redundant and that a reduced subset of metrics may be sufficient to measure the amount of demographic bias. The paper provides valuable insights for researchers in AI and related fields to mitigate dataset bias and improve the fairness and accuracy of AI models. The code is available at https://github.com/irisdominguez/dataset_bias_metrics.
comment: 16 pages, 8 figures. Including appendix: 45 pages, 32 figures. Updated from previous version with an additional appendix, addressing concerns about the interest of studying bias at the dataset level
♻ ☆ MaPa: Text-driven Photorealistic Material Painting for 3D Shapes SIGGRAPH 2024
This paper aims to generate materials for 3D meshes from text descriptions. Unlike existing methods that synthesize texture maps, we propose to generate segment-wise procedural material graphs as the appearance representation, which supports high-quality rendering and provides substantial flexibility in editing. Instead of relying on extensive paired data, i.e., 3D meshes with material graphs and corresponding text descriptions, to train a material graph generative model, we propose to leverage the pre-trained 2D diffusion model as a bridge to connect the text and material graphs. Specifically, our approach decomposes a shape into a set of segments and designs a segment-controlled diffusion model to synthesize 2D images that are aligned with mesh parts. Based on generated images, we initialize parameters of material graphs and fine-tune them through the differentiable rendering module to produce materials in accordance with the textual description. Extensive experiments demonstrate the superior performance of our framework in photorealism, resolution, and editability over existing methods. Project page: https://zju3dv.github.io/MaPa
comment: SIGGRAPH 2024. Project page: https://zju3dv.github.io/MaPa
♻ ☆ Instance-level quantitative saliency in multiple sclerosis lesion segmentation
In recent years, explainable methods for artificial intelligence (XAI) have tried to reveal and describe models' decision mechanisms in the case of classification tasks. However, XAI for semantic segmentation and in particular for single instances has been little studied to date. Understanding the process underlying automatic segmentation of single instances is crucial to reveal what information was used to detect and segment a given object of interest. In this study, we proposed two instance-level explanation maps for semantic segmentation based on SmoothGrad and Grad-CAM++ methods. Then, we investigated their relevance for the detection and segmentation of white matter lesions (WML), a magnetic resonance imaging (MRI) biomarker in multiple sclerosis (MS). 687 patients diagnosed with MS for a total of 4043 FLAIR and MPRAGE MRI scans were collected at the University Hospital of Basel, Switzerland. Data were randomly split into training, validation and test sets to train a 3D U-Net for MS lesion segmentation. We observed 3050 true positive (TP), 1818 false positive (FP), and 789 false negative (FN) cases. We generated instance-level explanation maps for semantic segmentation, by developing two XAI methods based on SmoothGrad and Grad-CAM++. We investigated: 1) the distribution of gradients in saliency maps with respect to both input MRI sequences; 2) the model's response in the case of synthetic lesions; 3) the amount of perilesional tissue needed by the model to segment a lesion. Saliency maps (based on SmoothGrad) in FLAIR showed positive values inside a lesion and negative in its neighborhood. Peak values of saliency maps generated for these four groups of volumes presented distributions that differ significantly from one another, suggesting a quantitative nature of the proposed saliency. Contextual information of 7mm around the lesion border was required for their segmentation.
♻ ☆ Soundify: Matching Sound Effects to Video
In the art of video editing, sound helps add character to an object and immerse the viewer within a space. Through formative interviews with professional editors (N=10), we found that the task of adding sounds to video can be challenging. This paper presents Soundify, a system that assists editors in matching sounds to video. Given a video, Soundify identifies matching sounds, synchronizes the sounds to the video, and dynamically adjusts panning and volume to create spatial audio. In a human evaluation study (N=889), we show that Soundify is capable of matching sounds to video out-of-the-box for a diverse range of audio categories. In a within-subjects expert study (N=12), we demonstrate the usefulness of Soundify in helping video editors match sounds to video with lighter workload, reduced task completion time, and improved usability.
comment: https://soundify.cc
♻ ☆ VideoMap: Supporting Video Editing Exploration, Brainstorming, and Prototyping in the Latent Space
Video editing is a creative and complex endeavor and we believe that there is potential for reimagining a new video editing interface to better support the creative and exploratory nature of video editing. We take inspiration from latent space exploration tools that help users find patterns and connections within complex datasets. We present VideoMap, a proof-of-concept video editing interface that operates on video frames projected onto a latent space. We support intuitive navigation through map-inspired navigational elements and facilitate transitioning between different latent spaces through swappable lenses. We built three VideoMap components to support editors in three common video tasks. In a user study with both professionals and non-professionals, editors found that VideoMap helps reduce grunt work, offers a user-friendly experience, provides an inspirational way of editing, and effectively supports the exploratory nature of video editing. We further demonstrate the versatility of VideoMap by implementing three extended applications. For interactive examples, we invite you to visit our project page: https://humanvideointeraction.github.io/videomap.
comment: https://humanvideointeraction.github.io/videomap
♻ ☆ Videogenic: Identifying Highlight Moments in Videos with Professional Photographs as a Prior
This paper investigates the challenge of extracting highlight moments from videos. To perform this task, we need to understand what constitutes a highlight for arbitrary video domains while at the same time being able to scale across different domains. Our key insight is that photographs taken by photographers tend to capture the most remarkable or photogenic moments of an activity. Drawing on this insight, we present Videogenic, a technique capable of creating domain-specific highlight videos for a diverse range of domains. In a human evaluation study (N=50), we show that a high-quality photograph collection combined with CLIP-based retrieval (which uses a neural network with semantic knowledge of images) can serve as an excellent prior for finding video highlights. In a within-subjects expert study (N=12), we demonstrate the usefulness of Videogenic in helping video editors create highlight videos with lighter workload, shorter task completion time, and better usability.
comment: https://humanvideointeraction.github.io/videogenic
♻ ☆ FSBI: Deepfakes Detection with Frequency Enhanced Self-Blended Images
Advances in deepfake research have led to the creation of almost perfect manipulations undetectable by human eyes and some deepfakes detection tools. Recently, several techniques have been proposed to differentiate deepfakes from realistic images and videos. This paper introduces a Frequency Enhanced Self-Blended Images (FSBI) approach for deepfakes detection. This proposed approach utilizes Discrete Wavelet Transforms (DWT) to extract discriminative features from the self-blended images (SBI) to be used for training a convolutional network architecture model. The SBIs blend the image with itself by introducing several forgery artifacts in a copy of the image before blending it. This prevents the classifier from overfitting specific artifacts by learning more generic representations. These blended images are then fed into the frequency features extractor to detect artifacts that can not be detected easily in the time domain. The proposed approach has been evaluated on FF++ and Celeb-DF datasets and the obtained results outperformed the state-of-the-art techniques with the cross-dataset evaluation protocol.
comment: The paper is under review
♻ ☆ The Ninth NTIRE 2024 Efficient Super-Resolution Challenge Report CVPR
This paper provides a comprehensive review of the NTIRE 2024 challenge, focusing on efficient single-image super-resolution (ESR) solutions and their outcomes. The task of this challenge is to super-resolve an input image with a magnification factor of x4 based on pairs of low and corresponding high-resolution images. The primary objective is to develop networks that optimize various aspects such as runtime, parameters, and FLOPs, while still maintaining a peak signal-to-noise ratio (PSNR) of approximately 26.90 dB on the DIV2K_LSDIR_valid dataset and 26.99 dB on the DIV2K_LSDIR_test dataset. In addition, this challenge has 4 tracks including the main track (overall performance), sub-track 1 (runtime), sub-track 2 (FLOPs), and sub-track 3 (parameters). In the main track, all three metrics (ie runtime, FLOPs, and parameter count) were considered. The ranking of the main track is calculated based on a weighted sum-up of the scores of all other sub-tracks. In sub-track 1, the practical runtime performance of the submissions was evaluated, and the corresponding score was used to determine the ranking. In sub-track 2, the number of FLOPs was considered. The score calculated based on the corresponding FLOPs was used to determine the ranking. In sub-track 3, the number of parameters was considered. The score calculated based on the corresponding parameters was used to determine the ranking. RLFN is set as the baseline for efficiency measurement. The challenge had 262 registered participants, and 34 teams made valid submissions. They gauge the state-of-the-art in efficient single-image super-resolution. To facilitate the reproducibility of the challenge and enable other researchers to build upon these findings, the code and the pre-trained model of validated solutions are made publicly available at https://github.com/Amazingren/NTIRE2024_ESR/.
comment: The report paper of NTIRE2024 Efficient Super-resolution, accepted by CVPRW2024
♻ ☆ Advancing dermatological diagnosis: Development of a hyperspectral dermatoscope for enhanced skin imaging
Clinical dermatology necessitates precision and innovation for efficient diagnosis and treatment of various skin conditions. This paper introduces the development of a cutting-edge hyperspectral dermatoscope (the Hyperscope) tailored for human skin analysis. We detail the requirements to such a device and the design considerations, from optical configurations to sensor selection, necessary to capture a wide spectral range with high fidelity. Preliminary results from 15 individuals and 160 recorded skin images demonstrate the potential of the Hyperscope in identifying and characterizing various skin conditions, offering a promising avenue for non-invasive skin evaluation and a platform for future research in dermatology-related hyperspectral imaging.
comment: 12 pages, 11 Figures
♻ ☆ A Probabilistic Fluctuation based Membership Inference Attack for Diffusion Models
Membership Inference Attack (MIA) identifies whether a record exists in a machine learning model's training set by querying the model. MIAs on the classic classification models have been well-studied, and recent works have started to explore how to transplant MIA onto generative models. Our investigation indicates that existing MIAs designed for generative models mainly depend on the overfitting in target models. However, overfitting can be avoided by employing various regularization techniques, whereas existing MIAs demonstrate poor performance in practice. Unlike overfitting, memorization is essential for deep learning models to attain optimal performance, making it a more prevalent phenomenon. Memorization in generative models leads to an increasing trend in the probability distribution of generating records around the member record. Therefore, we propose a Probabilistic Fluctuation Assessing Membership Inference Attack (PFAMI), a black-box MIA that infers memberships by detecting these trends via analyzing the overall probabilistic fluctuations around given records. We conduct extensive experiments across multiple generative models and datasets, which demonstrate PFAMI can improve the attack success rate (ASR) by about 27.9% when compared with the best baseline.
comment: Repo: https://github.com/wjfu99/MIA-Gen
♻ ☆ Deep Frequency-Aware Functional Maps for Robust Shape Matching
Deep functional map frameworks are widely employed for 3D shape matching. However, most existing deep functional map methods cannot adaptively capture important frequency information for functional map estimation in specific matching scenarios, i.e., lacking \textit{frequency awareness}, resulting in poor performance when dealing with large deformable shape matching. To this end, we propose a novel unsupervised learning-based framework called Deep Frequency-Aware Functional Maps, which can gracefully cope with various shape matching scenarios. We first introduce a general constraint called Spectral Filter Operator Preservation to compute desirable functional maps, where the spectral filter operator encodes informative frequency information and can promote frequency awareness for deep functional map frameworks by learning a set of filter functions. Then, we directly utilize the proposed constraint as a loss function to supervise functional maps, pointwise maps, and filter functions simultaneously, where the filter functions are derived from the orthonormal Jacobi basis, and the coefficients of the basis are learnable parameters. Finally, we develop an effective refinement strategy to improve the final pointwise map, which incorporates our constraint and learned filter functions, leading to more robust and accurate correspondences during the inference process. Extensive experimental results on various datasets demonstrate that our approach outperforms the existing state-of-the-art methods, especially in challenging settings like datasets with non-isometric deformation and inconsistent topology.
♻ ☆ Probabilistic Approach for Detection of High-Frequency Periodic Signals using an Event Camera
Being inspired by the biological eye, event camera is a novel asynchronous technology that pose a paradigm shift in acquisition of visual information. This paradigm enables event cameras to capture pixel-size fast motions much more naturally compared to classical cameras. In this paper we present a new asynchronous event-driven algorithm for detection of high-frequency pixel-size periodic signals using an event camera. Development of such new algorithms, to efficiently process the asynchronous information of event cameras, is essential and being a main challenge in the research community, in order to utilize its special properties and potential. It turns out that this algorithm, that was developed in order to satisfy the new paradigm, is related to an untreated theoretical problem in probability: let $0\leq\tau_{1}\leq\tau_{2}\leq\cdots\leq\tau_{m}\leq1$, originated from an unknown distribution. Let also $\epsilon,\delta\in\mathbb{R}$, and $d\in\mathbb{N}$. What can be said about the probability $\Phi(m,d)$ of having more than $d$ adjacent $\tau_{i}$-s pairs that the distance between them is $\delta$, up to an error $\epsilon$ ? This problem, that reminds the area of order statistic, shows how the new visualization paradigm is also an opportunity to develop new areas and problems in mathematics.
comment: 12 pages
♻ ☆ Sampling Strategies in Bayesian Inversion: A Study of RTO and Langevin Methods
This paper studies two classes of sampling methods for the solution of inverse problems, namely Randomize-Then-Optimize (RTO), which is rooted in sensitivity analysis, and Langevin methods, which are rooted in the Bayesian framework. The two classes of methods correspond to different assumptions and yield samples from different target distributions. We highlight the main conceptual and theoretical differences between the two approaches and compare them from a practical point of view by tackling two classical inverse problems in imaging: deblurring and inpainting. We show that the choice of the sampling method has a significant impact on the quality of the reconstruction and that the RTO method is more robust to the choice of the parameters.
♻ ☆ SoK: Facial Deepfake Detectors
Deepfakes have rapidly emerged as a profound and serious threat to society, primarily due to their ease of creation and dissemination. This situation has triggered an accelerated development of deepfake detection technologies. However, many existing detectors rely heavily on lab-generated datasets for validation, which may not effectively prepare them for novel, emerging, and real-world deepfake techniques. In this paper, we conduct an extensive and comprehensive review and analysis of the latest state-of-the-art deepfake detectors, evaluating them against several critical criteria. These criteria facilitate the categorization of these detectors into 4 high-level groups and 13 fine-grained sub-groups, all aligned with a unified standard conceptual framework. This classification and framework offer deep and practical insights into the factors that affect detector efficacy. We assess the generalizability of 16 leading detectors across various standard attack scenarios, including black-box, white-box, and gray-box settings. Our systematized analysis and experimentation lay the groundwork for a deeper understanding of deepfake detectors and their generalizability, paving the way for future research focused on creating detectors adept at countering various attack scenarios. Additionally, this work offers insights for developing more proactive defenses against deepfakes.
comment: 18 pages, 6 figures, 5 table, under peer-review
♻ ☆ Graph Image Prior for Unsupervised Dynamic Cardiac Cine MRI Reconstruction
The inductive bias of the convolutional neural network (CNN) can be a strong prior for image restoration, which is known as the Deep Image Prior (DIP). Recently, DIP is utilized in unsupervised dynamic MRI reconstruction, which adopts a generative model from the latent space to the image space. However, existing methods usually use a pyramid-shaped CNN generator shared by all frames, embedding the temporal modeling within the latent space, which may hamper the model expression capability. In this work, we propose a novel scheme for dynamic MRI representation, named ``Graph Image Prior'' (GIP). GIP adopts a two-stage generative network in a new modeling methodology, which first employs independent CNNs to recover the image structure for each frame, and then exploits the spatio-temporal correlations within the feature space parameterized by a graph model. A graph convolutional network is utilized for feature fusion and dynamic image generation. In addition, we devise an ADMM algorithm to alternately optimize the images and the network parameters to improve the reconstruction performance. Experiments were conducted on cardiac cine MRI reconstruction, which demonstrate that GIP outperforms compressed sensing methods and other DIP-based unsupervised methods, significantly reducing the performance gap with state-of-the-art supervised algorithms. Moreover, GIP displays superior generalization ability when transferred to a different reconstruction setting, without the need for any additional data.
♻ ☆ SegHist: A General Segmentation-based Framework for Chinese Historical Document Text Line Detection ICDAR2024
Text line detection is a key task in historical document analysis facing many challenges of arbitrary-shaped text lines, dense texts, and text lines with high aspect ratios, etc. In this paper, we propose a general framework for historical document text detection (SegHist), enabling existing segmentation-based text detection methods to effectively address the challenges, especially text lines with high aspect ratios. Integrating the SegHist framework with the commonly used method DB++, we develop DB-SegHist. This approach achieves SOTA on the CHDAC, MTHv2, and competitive results on HDRC datasets, with a significant improvement of 1.19% on the most challenging CHDAC dataset which features more text lines with high aspect ratios. Moreover, our method attains SOTA on rotated MTHv2 and rotated HDRC, demonstrating its rotational robustness. The code is available at https://github.com/LumionHXJ/SegHist.
comment: Accepted by ICDAR2024
♻ ☆ X-ray2CTPA: Generating 3D CTPA scans from 2D X-ray conditioning
Chest X-rays or chest radiography (CXR), commonly used for medical diagnostics, typically enables limited imaging compared to computed tomography (CT) scans, which offer more detailed and accurate three-dimensional data, particularly contrast-enhanced scans like CT Pulmonary Angiography (CTPA). However, CT scans entail higher costs, greater radiation exposure, and are less accessible than CXRs. In this work we explore cross-modal translation from a 2D low contrast-resolution X-ray input to a 3D high contrast and spatial-resolution CTPA scan. Driven by recent advances in generative AI, we introduce a novel diffusion-based approach to this task. We evaluate the models performance using both quantitative metrics and qualitative feedback from radiologists, ensuring diagnostic relevance of the generated images. Furthermore, we employ the synthesized 3D images in a classification framework and show improved AUC in a PE categorization task, using the initial CXR input. The proposed method is generalizable and capable of performing additional cross-modality translations in medical imaging. It may pave the way for more accessible and cost-effective advanced diagnostic tools. The code for this project is available: https://github.com/NoaCahan/X-ray2CTPA .
comment: preprint, project code: https://github.com/NoaCahan/X-ray2CTPA
♻ ☆ Open-vocabulary object 6D pose estimation CVPR 2024
We introduce the new setting of open-vocabulary object 6D pose estimation, in which a textual prompt is used to specify the object of interest. In contrast to existing approaches, in our setting (i) the object of interest is specified solely through the textual prompt, (ii) no object model (e.g., CAD or video sequence) is required at inference, and (iii) the object is imaged from two RGBD viewpoints of different scenes. To operate in this setting, we introduce a novel approach that leverages a Vision-Language Model to segment the object of interest from the scenes and to estimate its relative 6D pose. The key of our approach is a carefully devised strategy to fuse object-level information provided by the prompt with local image features, resulting in a feature space that can generalize to novel concepts. We validate our approach on a new benchmark based on two popular datasets, REAL275 and Toyota-Light, which collectively encompass 34 object instances appearing in four thousand image pairs. The results demonstrate that our approach outperforms both a well-established hand-crafted method and a recent deep learning-based baseline in estimating the relative 6D pose of objects in different scenes. Code and dataset are available at https://jcorsetti.github.io/oryon.
comment: Camera ready version (CVPR 2024, poster highlight). New Oryon version: arXiv:2406.16384
♻ ☆ Universal Prompt Optimizer for Safe Text-to-Image Generation NAACL 2024
Text-to-Image (T2I) models have shown great performance in generating images based on textual prompts. However, these models are vulnerable to unsafe input to generate unsafe content like sexual, harassment and illegal-activity images. Existing studies based on image checker, model fine-tuning and embedding blocking are impractical in real-world applications. Hence, we propose the first universal prompt optimizer for safe T2I (POSI) generation in black-box scenario. We first construct a dataset consisting of toxic-clean prompt pairs by GPT-3.5 Turbo. To guide the optimizer to have the ability of converting toxic prompt to clean prompt while preserving semantic information, we design a novel reward function measuring toxicity and text alignment of generated images and train the optimizer through Proximal Policy Optimization. Experiments show that our approach can effectively reduce the likelihood of various T2I models in generating inappropriate images, with no significant impact on text alignment. It is also flexible to be combined with methods to achieve better performance. Our code is available at https://github.com/wzongyu/POSI.
comment: NAACL 2024
♻ ☆ Bilateral Reference for High-Resolution Dichotomous Image Segmentation
We introduce a novel bilateral reference framework (BiRefNet) for high-resolution dichotomous image segmentation (DIS). It comprises two essential components: the localization module (LM) and the reconstruction module (RM) with our proposed bilateral reference (BiRef). The LM aids in object localization using global semantic information. Within the RM, we utilize BiRef for the reconstruction process, where hierarchical patches of images provide the source reference and gradient maps serve as the target reference. These components collaborate to generate the final predicted maps. We also introduce auxiliary gradient supervision to enhance focus on regions with finer details. Furthermore, we outline practical training strategies tailored for DIS to improve map quality and training process. To validate the general applicability of our approach, we conduct extensive experiments on four tasks to evince that BiRefNet exhibits remarkable performance, outperforming task-specific cutting-edge methods across all benchmarks. Our codes are available at https://github.com/ZhengPeng7/BiRefNet.
comment: Version 5, with updated DIS performance, accuracy-efficiency comparison, and 3rd-party applications
♻ ☆ DITTO: Dual and Integrated Latent Topologies for Implicit 3D Reconstruction CVPR 2024
We propose a novel concept of dual and integrated latent topologies (DITTO in short) for implicit 3D reconstruction from noisy and sparse point clouds. Most existing methods predominantly focus on single latent type, such as point or grid latents. In contrast, the proposed DITTO leverages both point and grid latents (i.e., dual latent) to enhance their strengths, the stability of grid latents and the detail-rich capability of point latents. Concretely, DITTO consists of dual latent encoder and integrated implicit decoder. In the dual latent encoder, a dual latent layer, which is the key module block composing the encoder, refines both latents in parallel, maintaining their distinct shapes and enabling recursive interaction. Notably, a newly proposed dynamic sparse point transformer within the dual latent layer effectively refines point latents. Then, the integrated implicit decoder systematically combines these refined latents, achieving high-fidelity 3D reconstruction and surpassing previous state-of-the-art methods on object- and scene-level datasets, especially in thin and detailed structures.
comment: Accepted by CVPR 2024
♻ ☆ DK-SLAM: Monocular Visual SLAM with Deep Keypoint Learning, Tracking and Loop-Closing
The performance of visual SLAM in complex, real-world scenarios is often compromised by unreliable feature extraction and matching when using handcrafted features. Although deep learning-based local features excel at capturing high-level information and perform well on matching benchmarks, they struggle with generalization in continuous motion scenes, adversely affecting loop detection accuracy. Our system employs a Model-Agnostic Meta-Learning (MAML) strategy to optimize the training of keypoint extraction networks, enhancing their adaptability to diverse environments. Additionally, we introduce a coarse-to-fine feature tracking mechanism for learned keypoints. It begins with a direct method to approximate the relative pose between consecutive frames, followed by a feature matching method for refined pose estimation. To mitigate cumulative positioning errors, DK-SLAM incorporates a novel online learning module that utilizes binary features for loop closure detection. This module dynamically identifies loop nodes within a sequence, ensuring accurate and efficient localization. Experimental evaluations on publicly available datasets demonstrate that DK-SLAM outperforms leading traditional and learning based SLAM systems, such as ORB-SLAM3 and LIFT-SLAM. These results underscore the efficacy and robustness of our DK-SLAM in varied and challenging real-world environments.
comment: In submission
♻ ☆ AAformer: Auto-Aligned Transformer for Person Re-Identification
In person re-identification (re-ID), extracting part-level features from person images has been verified to be crucial to offer fine-grained information. Most of the existing CNN-based methods only locate the human parts coarsely, or rely on pretrained human parsing models and fail in locating the identifiable nonhuman parts (e.g., knapsack). In this article, we introduce an alignment scheme in transformer architecture for the first time and propose the auto-aligned transformer (AAformer) to automatically locate both the human parts and nonhuman ones at patch level. We introduce the "Part tokens ([PART]s)", which are learnable vectors, to extract part features in the transformer. A [PART] only interacts with a local subset of patches in self-attention and learns to be the part representation. To adaptively group the image patches into different subsets, we design the auto-alignment. Auto-alignment employs a fast variant of optimal transport (OT) algorithm to online cluster the patch embeddings into several groups with the [PART]s as their prototypes. AAformer integrates the part alignment into the self-attention and the output [PART]s can be directly used as part features for retrieval. Extensive experiments validate the effectiveness of [PART]s and the superiority of AAformer over various state-of-the-art methods.
comment: Accepted by TNNLS. IEEE Transactions on Neural Networks and Learning Systems (2023)
♻ ☆ LKM-UNet: Large Kernel Vision Mamba UNet for Medical Image Segmentation MICCAI 2024
In clinical practice, medical image segmentation provides useful information on the contours and dimensions of target organs or tissues, facilitating improved diagnosis, analysis, and treatment. In the past few years, convolutional neural networks (CNNs) and Transformers have dominated this area, but they still suffer from either limited receptive fields or costly long-range modeling. Mamba, a State Space Sequence Model (SSM), recently emerged as a promising paradigm for long-range dependency modeling with linear complexity. In this paper, we introduce a Large Kernel Vision Mamba U-shape Network, or LKM-UNet, for medical image segmentation. A distinguishing feature of our LKM-UNet is its utilization of large Mamba kernels, excelling in locally spatial modeling compared to small kernel-based CNNs and Transformers, while maintaining superior efficiency in global modeling compared to self-attention with quadratic complexity. Additionally, we design a novel hierarchical and bidirectional Mamba block to further enhance Mamba's global and neighborhood spatial modeling capability for vision inputs. Comprehensive experiments demonstrate the feasibility and the effectiveness of using large-size Mamba kernels to achieve large receptive fields. Codes are available at https://github.com/wjh892521292/LKM-UNet.
comment: Accepted by MICCAI 2024
♻ ☆ Rapid and Accurate Diagnosis of Acute Aortic Syndrome using Non-contrast CT: A Large-scale, Retrospective, Multi-center and AI-based Study
Chest pain symptoms are highly prevalent in emergency departments (EDs), where acute aortic syndrome (AAS) is a catastrophic cardiovascular emergency with a high fatality rate, especially when timely and accurate treatment is not administered. However, current triage practices in the ED can cause up to approximately half of patients with AAS to have an initially missed diagnosis or be misdiagnosed as having other acute chest pain conditions. Subsequently, these AAS patients will undergo clinically inaccurate or suboptimal differential diagnosis. Fortunately, even under these suboptimal protocols, nearly all these patients underwent non-contrast CT covering the aorta anatomy at the early stage of differential diagnosis. In this study, we developed an artificial intelligence model (DeepAAS) using non-contrast CT, which is highly accurate for identifying AAS and provides interpretable results to assist in clinical decision-making. Performance was assessed in two major phases: a multi-center retrospective study (n = 20,750) and an exploration in real-world emergency scenarios (n = 137,525). In the multi-center cohort, DeepAAS achieved a mean area under the receiver operating characteristic curve of 0.958 (95% CI 0.950-0.967). In the real-world cohort, DeepAAS detected 109 AAS patients with misguided initial suspicion, achieving 92.6% (95% CI 76.2%-97.5%) in mean sensitivity and 99.2% (95% CI 99.1%-99.3%) in mean specificity. Our AI model performed well on non-contrast CT at all applicable early stages of differential diagnosis workflows, effectively reduced the overall missed diagnosis and misdiagnosis rate from 48.8% to 4.8% and shortened the diagnosis time for patients with misguided initial suspicion from an average of 681.8 (74-11,820) mins to 68.5 (23-195) mins. DeepAAS could effectively fill the gap in the current clinical workflow without requiring additional tests.
comment: under peer review
♻ ☆ OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer
Recent advancements in Large Language Models (LLMs) have expanded their capabilities to multimodal contexts, including comprehensive video understanding. However, processing extensive videos such as 24-hour CCTV footage or full-length films presents significant challenges due to the vast data and processing demands. Traditional methods, like extracting key frames or converting frames to text, often result in substantial information loss. To address these shortcomings, we develop OmAgent, efficiently stores and retrieves relevant video frames for specific queries, preserving the detailed content of videos. Additionally, it features an Divide-and-Conquer Loop capable of autonomous reasoning, dynamically invoking APIs and tools to enhance query processing and accuracy. This approach ensures robust video understanding, significantly reducing information loss. Experimental results affirm OmAgent's efficacy in handling various types of videos and complex tasks. Moreover, we have endowed it with greater autonomy and a robust tool-calling system, enabling it to accomplish even more intricate tasks.
♻ ☆ Revisiting Active Learning in the Era of Vision Foundation Models
Foundation vision or vision-language models are trained on large unlabeled or noisy data and learn robust representations that can achieve impressive zero- or few-shot performance on diverse tasks. Given these properties, they are a natural fit for active learning (AL), which aims to maximize labeling efficiency. However, the full potential of foundation models has not been explored in the context of AL, specifically in the low-budget regime. In this work, we evaluate how foundation models influence three critical components of effective AL, namely, 1) initial labeled pool selection, 2) ensuring diverse sampling, and 3) the trade-off between representative and uncertainty sampling. We systematically study how the robust representations of foundation models (DINOv2, OpenCLIP) challenge existing findings in active learning. Our observations inform the principled construction of a new simple and elegant AL strategy that balances uncertainty estimated via dropout with sample diversity. We extensively test our strategy on many challenging image classification benchmarks, including natural images as well as out-of-domain biomedical images that are relatively understudied in the AL literature. We also provide a highly performant and efficient implementation of modern AL strategies (including our method) at https://github.com/sanketx/AL-foundation-models.
comment: Accepted to TMLR
♻ ☆ Accurately Classifying Out-Of-Distribution Data in Facial Recognition
Standard classification theory assumes that the distribution of images in the test and training sets are identical. Unfortunately, real-life scenarios typically feature unseen data ("out-of-distribution data") which is different from data in the training distribution("in-distribution"). This issue is most prevalent in social justice problems where data from under-represented groups may appear in the test data without representing an equal proportion of the training data. This may result in a model returning confidently wrong decisions and predictions. We are interested in the following question: Can the performance of a neural network improve on facial images of out-of-distribution data when it is trained simultaneously on multiple datasets of in-distribution data? We approach this problem by incorporating the Outlier Exposure model and investigate how the model's performance changes when other datasets of facial images were implemented. We observe that the accuracy and other metrics of the model can be increased by applying Outlier Exposure, incorporating a trainable weight parameter to increase the machine's emphasis on outlier images, and by re-weighting the importance of different class labels. We also experimented with whether sorting the images and determining outliers via image features would have more of an effect on the metrics than sorting by average pixel value. Our goal was to make models not only more accurate but also more fair by scanning a more expanded range of images. We also tested the datasets in reverse order to see whether a more fair dataset with balanced features has an effect on the model's accuracy.
comment: 18 pages, 6 tables, 6 figures
♻ ☆ Exploring Test-Time Adaptation for Object Detection in Continually Changing Environments
For real-world applications, neural network models are commonly deployed in dynamic environments, where the distribution of the target domain undergoes temporal changes. Continual Test-Time Adaptation (CTTA) has recently emerged as a promising technique to gradually adapt a source-trained model to test data drawn from a continually changing target domain. Despite recent advancements in addressing CTTA, two critical issues remain: 1) The use of a fixed threshold for pseudo-labeling in existing methodologies leads to the generation of low-quality pseudo-labels, as model confidence varies across categories and domains; 2) While current solutions utilize stochastic parameter restoration to mitigate catastrophic forgetting, their capacity to preserve critical information is undermined by its intrinsic randomness. To tackle these challenges, we present CTAOD, aiming to enhance the performance of detection models in CTTA scenarios. Inspired by prior CTTA works for effective adaptation, CTAOD is founded on the mean-teacher framework, characterized by three core components. Firstly, the object-level contrastive learning module tailored for object detection extracts object-level features using the teacher's region of interest features and optimizes them through contrastive learning. Secondly, the dynamic threshold strategy updates the category-specific threshold based on predicted confidence scores to improve the quality of pseudo-labels. Lastly, we design a data-driven stochastic restoration mechanism to selectively reset inactive parameters using the gradients as weights for a random mask matrix, thereby ensuring the retention of essential knowledge. We demonstrate the effectiveness of our approach on four CTTA tasks for object detection, where CTAOD outperforms existing methods, especially achieving a 3.0 mAP improvement on the Cityscapes-to-Cityscapes-C CTTA task.
♻ ☆ Domain Adaptation based Object Detection for Autonomous Driving in Foggy and Rainy Weather
Typically, object detection methods for autonomous driving that rely on supervised learning make the assumption of a consistent feature distribution between the training and testing data, this such assumption may fail in different weather conditions. Due to the domain gap, a detection model trained under clear weather may not perform well in foggy and rainy conditions. Overcoming detection bottlenecks in foggy and rainy weather is a real challenge for autonomous vehicles deployed in the wild. To bridge the domain gap and improve the performance of object detection in foggy and rainy weather, this paper presents a novel framework for domain-adaptive object detection. The adaptations at both the image-level and object-level are intended to minimize the differences in image style and object appearance between domains. Furthermore, in order to improve the model's performance on challenging examples, we introduce a novel adversarial gradient reversal layer that conducts adversarial mining on difficult instances in addition to domain adaptation. Additionally, we suggest generating an auxiliary domain through data augmentation to enforce a new domain-level metric regularization. Experimental findings on public V2V benchmark exhibit a substantial enhancement in object detection specifically for foggy and rainy driving scenarios.
comment: the final version
♻ ☆ Text-to-Image Rectified Flow as Plug-and-Play Priors
Large-scale diffusion models have achieved remarkable performance in generative tasks. Beyond their initial training applications, these models have proven their ability to function as versatile plug-and-play priors. For instance, 2D diffusion models can serve as loss functions to optimize 3D implicit models. Rectified flow, a novel class of generative models, enforces a linear progression from the source to the target distribution and has demonstrated superior performance across various domains. Compared to diffusion-based methods, rectified flow approaches surpass in terms of generation quality and efficiency, requiring fewer inference steps. In this work, we present theoretical and experimental evidence demonstrating that rectified flow based methods offer similar functionalities to diffusion models - they can also serve as effective priors. Besides the generative capabilities of diffusion priors, motivated by the unique time-symmetry properties of rectified flow models, a variant of our method can additionally perform image inversion. Experimentally, our rectified flow-based priors outperform their diffusion counterparts - the SDS and VSD losses - in text-to-3D generation. Our method also displays competitive performance in image inversion and editing.
comment: Added results on Stable Diffusion 3. Code: https://github.com/yangxiaofeng/rectified_flow_prior
♻ ☆ MIRReS: Multi-bounce Inverse Rendering using Reservoir Sampling
We present MIRReS, a novel two-stage inverse rendering framework that jointly reconstructs and optimizes the explicit geometry, material, and lighting from multi-view images. Unlike previous methods that rely on implicit irradiance fields or simplified path tracing algorithms, our method extracts an explicit geometry (triangular mesh) in stage one, and introduces a more realistic physically-based inverse rendering model that utilizes multi-bounce path tracing and Monte Carlo integration. By leveraging multi-bounce path tracing, our method effectively estimates indirect illumination, including self-shadowing and internal reflections, which improves the intrinsic decomposition of shape, material, and lighting. Moreover, we incorporate reservoir sampling into our framework to address the noise in Monte Carlo integration, enhancing convergence and facilitating gradient-based optimization with low sample counts. Through qualitative and quantitative evaluation of several scenarios, especially in challenging scenarios with complex shadows, we demonstrate that our method achieves state-of-the-art performance on decomposition results. Additionally, our optimized explicit geometry enables applications such as scene editing, relighting, and material editing with modern graphics engines or CAD software. The source code is available at https://brabbitdousha.github.io/MIRReS/
comment: 16 pages, 14 figures
♻ ☆ ECLIPSE: Expunging Clean-label Indiscriminate Poisons via Sparse Diffusion Purification ESORICS 2024
Clean-label indiscriminate poisoning attacks add invisible perturbations to correctly labeled training images, thus dramatically reducing the generalization capability of the victim models. Recently, some defense mechanisms have been proposed such as adversarial training, image transformation techniques, and image purification. However, these schemes are either susceptible to adaptive attacks, built on unrealistic assumptions, or only effective against specific poison types, limiting their universal applicability. In this research, we propose a more universally effective, practical, and robust defense scheme called ECLIPSE. We first investigate the impact of Gaussian noise on the poisons and theoretically prove that any kind of poison will be largely assimilated when imposing sufficient random noise. In light of this, we assume the victim has access to an extremely limited number of clean images (a more practical scene) and subsequently enlarge this sparse set for training a denoising probabilistic model (a universal denoising tool). We then begin by introducing Gaussian noise to absorb the poisons and then apply the model for denoising, resulting in a roughly purified dataset. Finally, to address the trade-off of the inconsistency in the assimilation sensitivity of different poisons by Gaussian noise, we propose a lightweight corruption compensation module to effectively eliminate residual poisons, providing a more universal defense approach. Extensive experiments demonstrate that our defense approach outperforms 10 state-of-the-art defenses. We also propose an adaptive attack against ECLIPSE and verify the robustness of our defense scheme. Our code is available at https://github.com/CGCL-codes/ECLIPSE.
comment: Accepted by ESORICS 2024
♻ ☆ CUTS: A Deep Learning and Topological Framework for Multigranular Unsupervised Medical Image Segmentation MICCAI 2024
Segmenting medical images is critical to facilitating both patient diagnoses and quantitative research. A major limiting factor is the lack of labeled data, as obtaining expert annotations for each new set of imaging data and task can be labor intensive and inconsistent among annotators. We present CUTS, an unsupervised deep learning framework for medical image segmentation. CUTS operates in two stages. For each image, it produces an embedding map via intra-image contrastive learning and local patch reconstruction. Then, these embeddings are partitioned at dynamic granularity levels that correspond to the data topology. CUTS yields a series of coarse-to-fine-grained segmentations that highlight features at various granularities. We applied CUTS to retinal fundus images and two types of brain MRI images to delineate structures and patterns at different scales. When evaluated against predefined anatomical masks, CUTS improved the dice coefficient and Hausdorff distance by at least 10% compared to existing unsupervised methods. Finally, CUTS showed performance on par with Segment Anything Models (SAM, MedSAM, SAM-Med2D) pre-trained on gigantic labeled datasets.
comment: Accepted to the 27th International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI 2024)
♻ ☆ Towards Task-Compatible Compressible Representations ICME
We identify an issue in multi-task learnable compression, in which a representation learned for one task does not positively contribute to the rate-distortion performance of a different task as much as expected, given the estimated amount of information available in it. We interpret this issue using the predictive $\mathcal{V}$-information framework. In learnable scalable coding, previous work increased the utilization of side-information for input reconstruction by also rewarding input reconstruction when learning this shared representation. We evaluate the impact of this idea in the context of input reconstruction more rigorously and extended it to other computer vision tasks. We perform experiments using representations trained for object detection on COCO 2017 and depth estimation on the Cityscapes dataset, and use them to assist in image reconstruction and semantic segmentation tasks. The results show considerable improvements in the rate-distortion performance of the assisted tasks. Moreover, using the proposed representations, the performance of the base tasks are also improved. Results suggest that the proposed method induces simpler representations that are more compatible with downstream processes.
comment: To be published in ICME Workshops 2024
♻ ☆ Listening to the Noise: Blind Denoising with Gibbs Diffusion
In recent years, denoising problems have become intertwined with the development of deep generative models. In particular, diffusion models are trained like denoisers, and the distribution they model coincide with denoising priors in the Bayesian picture. However, denoising through diffusion-based posterior sampling requires the noise level and covariance to be known, preventing blind denoising. We overcome this limitation by introducing Gibbs Diffusion (GDiff), a general methodology addressing posterior sampling of both the signal and the noise parameters. Assuming arbitrary parametric Gaussian noise, we develop a Gibbs algorithm that alternates sampling steps from a conditional diffusion model trained to map the signal prior to the family of noise distributions, and a Monte Carlo sampler to infer the noise parameters. Our theoretical analysis highlights potential pitfalls, guides diagnostic usage, and quantifies errors in the Gibbs stationary distribution caused by the diffusion model. We showcase our method for 1) blind denoising of natural images involving colored noises with unknown amplitude and spectral index, and 2) a cosmology problem, namely the analysis of cosmic microwave background data, where Bayesian inference of "noise" parameters means constraining models of the evolution of the Universe.
comment: 12+9 pages, 7+5 figures, 1+1 tables; accepted to 2024 International Conference on Machine Learning; code: https://github.com/rubenohana/Gibbs-Diffusion
♻ ☆ LEDITS++: Limitless Image Editing using Text-to-Image Models CVPR
Text-to-image diffusion models have recently received increasing interest for their astonishing ability to produce high-fidelity images from solely text inputs. Subsequent research efforts aim to exploit and apply their capabilities to real image editing. However, existing image-to-image methods are often inefficient, imprecise, and of limited versatility. They either require time-consuming finetuning, deviate unnecessarily strongly from the input image, and/or lack support for multiple, simultaneous edits. To address these issues, we introduce LEDITS++, an efficient yet versatile and precise textual image manipulation technique. LEDITS++'s novel inversion approach requires no tuning nor optimization and produces high-fidelity results with a few diffusion steps. Second, our methodology supports multiple simultaneous edits and is architecture-agnostic. Third, we use a novel implicit masking technique that limits changes to relevant image regions. We propose the novel TEdBench++ benchmark as part of our exhaustive evaluation. Our results demonstrate the capabilities of LEDITS++ and its improvements over previous methods.
comment: Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) The project page is available at https://leditsplusplus-project.static.hf.space
♻ ☆ ODIN: A Single Model for 2D and 3D Segmentation CVPR 2024
State-of-the-art models on contemporary 3D segmentation benchmarks like ScanNet consume and label dataset-provided 3D point clouds, obtained through post processing of sensed multiview RGB-D images. They are typically trained in-domain, forego large-scale 2D pre-training and outperform alternatives that featurize the posed RGB-D multiview images instead. The gap in performance between methods that consume posed images versus post-processed 3D point clouds has fueled the belief that 2D and 3D perception require distinct model architectures. In this paper, we challenge this view and propose ODIN (Omni-Dimensional INstance segmentation), a model that can segment and label both 2D RGB images and 3D point clouds, using a transformer architecture that alternates between 2D within-view and 3D cross-view information fusion. Our model differentiates 2D and 3D feature operations through the positional encodings of the tokens involved, which capture pixel coordinates for 2D patch tokens and 3D coordinates for 3D feature tokens. ODIN achieves state-of-the-art performance on ScanNet200, Matterport3D and AI2THOR 3D instance segmentation benchmarks, and competitive performance on ScanNet, S3DIS and COCO. It outperforms all previous works by a wide margin when the sensed 3D point cloud is used in place of the point cloud sampled from 3D mesh. When used as the 3D perception engine in an instructable embodied agent architecture, it sets a new state-of-the-art on the TEACh action-from-dialogue benchmark. Our code and checkpoints can be found at the project website (https://odin-seg.github.io).
comment: Camera Ready (CVPR 2024, Highlight)
♻ ☆ Learning Point Spread Function Invertibility Assessment for Image Deconvolution
Deep-learning (DL)-based image deconvolution (ID) has exhibited remarkable recovery performance, surpassing traditional linear methods. However, unlike traditional ID approaches that rely on analytical properties of the point spread function (PSF) to achieve high recovery performance - such as specific spectrum properties or small conditional numbers in the convolution matrix - DL techniques lack quantifiable metrics for evaluating PSF suitability for DL-assisted recovery. Aiming to enhance deconvolution quality, we propose a metric that employs a non-linear approach to learn the invertibility of an arbitrary PSF using a neural network by mapping it to a unit impulse. A lower discrepancy between the mapped PSF and a unit impulse indicates a higher likelihood of successful inversion by a DL network. Our findings reveal that this metric correlates with high recovery performance in DL and traditional methods, thereby serving as an effective regularizer in deconvolution tasks. This approach reduces the computational complexity over conventional condition number assessments and is a differentiable process. These useful properties allow its application in designing diffractive optical elements through end-to-end (E2E) optimization, achieving invertible PSFs, and outperforming the E2E baseline framework.
comment: Accepted at EUSIPCO 2024
♻ ☆ Visualize and Paint GAN Activations
We investigate how generated structures of GANs correlate with their activations in hidden layers, with the purpose of better understanding the inner workings of those models and being able to paint structures with unconditionally trained GANs. This gives us more control over the generated images, allowing to generate them from a semantic segmentation map while not requiring such a segmentation in the training data. To this end we introduce the concept of tileable features, allowing us to identify activations that work well for painting.
♻ ☆ MambaMIR: An Arbitrary-Masked Mamba for Joint Medical Image Reconstruction and Uncertainty Estimation
The recent Mamba model has shown remarkable adaptability for visual representation learning, including in medical imaging tasks. This study introduces MambaMIR, a Mamba-based model for medical image reconstruction, as well as its Generative Adversarial Network-based variant, MambaMIR-GAN. Our proposed MambaMIR inherits several advantages, such as linear complexity, global receptive fields, and dynamic weights, from the original Mamba model. The innovated arbitrary-mask mechanism effectively adapt Mamba to our image reconstruction task, providing randomness for subsequent Monte Carlo-based uncertainty estimation. Experiments conducted on various medical image reconstruction tasks, including fast MRI and SVCT, which cover anatomical regions such as the knee, chest, and abdomen, have demonstrated that MambaMIR and MambaMIR-GAN achieve comparable or superior reconstruction results relative to state-of-the-art methods. Additionally, the estimated uncertainty maps offer further insights into the reliability of the reconstruction quality. The code is publicly available at https://github.com/ayanglab/MambaMIR.
♻ ☆ Enhancing Global Sensitivity and Uncertainty Quantification in Medical Image Reconstruction with Monte Carlo Arbitrary-Masked Mamba
Deep learning has been extensively applied in medical image reconstruction, where Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) represent the predominant paradigms, each possessing distinct advantages and inherent limitations: CNNs exhibit linear complexity with local sensitivity, whereas ViTs demonstrate quadratic complexity with global sensitivity. The emerging Mamba has shown superiority in learning visual representation, which combines the advantages of linear scalability and global sensitivity. In this study, we introduce MambaMIR, an Arbitrary-Masked Mamba-based model with wavelet decomposition for joint medical image reconstruction and uncertainty estimation. A novel Arbitrary Scan Masking (ASM) mechanism "masks out" redundant information to introduce randomness for further uncertainty estimation. Compared to the commonly used Monte Carlo (MC) dropout, our proposed MC-ASM provides an uncertainty map without the need for hyperparameter tuning and mitigates the performance drop typically observed when applying dropout to low-level tasks. For further texture preservation and better perceptual quality, we employ the wavelet transformation into MambaMIR and explore its variant based on the Generative Adversarial Network, namely MambaMIR-GAN. Comprehensive experiments have been conducted for multiple representative medical image reconstruction tasks, demonstrating that the proposed MambaMIR and MambaMIR-GAN outperform other baseline and state-of-the-art methods in different reconstruction tasks, where MambaMIR achieves the best reconstruction fidelity and MambaMIR-GAN has the best perceptual quality. In addition, our MC-ASM provides uncertainty maps as an additional tool for clinicians, while mitigating the typical performance drop caused by the commonly used dropout.
♻ ☆ Deciphering the Definition of Adversarial Robustness for post-hoc OOD Detectors
Detecting out-of-distribution (OOD) inputs is critical for safely deploying deep learning models in real-world scenarios. In recent years, many OOD detectors have been developed, and even the benchmarking has been standardized, i.e. OpenOOD. The number of post-hoc detectors is growing fast and showing an option to protect a pre-trained classifier against natural distribution shifts, claiming to be ready for real-world scenarios. However, its efficacy in handling adversarial examples has been neglected in the majority of studies. This paper investigates the adversarial robustness of the 16 post-hoc detectors on several evasion attacks and discuss a roadmap towards adversarial defense in OOD detectors.
♻ ☆ Deep Learning for Multi-Label Learning: A Comprehensive Survey
Multi-label learning is a rapidly growing research area that aims to predict multiple labels from a single input data point. In the era of big data, tasks involving multi-label classification (MLC) or ranking present significant and intricate challenges, capturing considerable attention in diverse domains. Inherent difficulties in MLC include dealing with high-dimensional data, addressing label correlations, and handling partial labels, for which conventional methods prove ineffective. Recent years have witnessed a notable increase in adopting deep learning (DL) techniques to address these challenges more effectively in MLC. Notably, there is a burgeoning effort to harness the robust learning capabilities of DL for improved modelling of label dependencies and other challenges in MLC. However, it is noteworthy that comprehensive studies specifically dedicated to DL for multi-label learning are limited. Thus, this survey aims to thoroughly review recent progress in DL for multi-label learning, along with a summary of open research problems in MLC. The review consolidates existing research efforts in DL for MLC,including deep neural networks, transformers, autoencoders, and convolutional and recurrent architectures. Finally, the study presents a comparative analysis of the existing methods to provide insightful observations and stimulate future research directions in this domain.
comment: 20 pages, 5 tables
♻ ☆ Expected Grad-CAM: Towards gradient faithfulness
Although input-gradients techniques have evolved to mitigate and tackle the challenges associated with gradients, modern gradient-weighted CAM approaches still rely on vanilla gradients, which are inherently susceptible to the saturation phenomena. Despite recent enhancements have incorporated counterfactual gradient strategies as a mitigating measure, these local explanation techniques still exhibit a lack of sensitivity to their baseline parameter. Our work proposes a gradient-weighted CAM augmentation that tackles both the saturation and sensitivity problem by reshaping the gradient computation, incorporating two well-established and provably approaches: Expected Gradients and kernel smoothing. By revisiting the original formulation as the smoothed expectation of the perturbed integrated gradients, one can concurrently construct more faithful, localized and robust explanations which minimize infidelity. Through fine modulation of the perturbation distribution it is possible to regulate the complexity characteristic of the explanation, selectively discriminating stable features. Our technique, Expected Grad-CAM, differently from recent works, exclusively optimizes the gradient computation, purposefully designed as an enhanced substitute of the foundational Grad-CAM algorithm and any method built therefrom. Quantitative and qualitative evaluations have been conducted to assess the effectiveness of our method.
comment: Updated appendix figures to vector format for improved clarity
Computation and Language 146
☆ BMIKE-53: Investigating Cross-Lingual Knowledge Editing with In-Context Learning
Large language models (LLMs) possess extensive parametric knowledge, but this knowledge is difficult to update with new information because retraining is very expensive and infeasible for closed-source models. Knowledge editing (KE) has emerged as a viable solution for updating the knowledge of LLMs without compromising their overall performance. On-the-fly KE methods, inspired by in-context learning (ICL), have shown great promise and allow LLMs to be treated as black boxes. In the past, KE was primarily employed in English contexts, whereas the potential for cross-lingual KE in current English-centric LLMs has not been fully explored. To foster more research in this direction, we introduce the BMIKE-53 benchmark for evaluating cross-lingual KE on 53 diverse languages across three KE task types. We also propose a gradient-free KE method called Multilingual In-context Knowledge Editing (MIKE) and evaluate it on BMIKE-53. Our evaluation focuses on cross-lingual knowledge transfer in terms of reliability, generality, locality, and portability, offering valuable insights and a framework for future research in cross-lingual KE. Our code and data are publicly accessible via the anonymous repository at https://anonymous.4open.science/r/MIKE.
comment: 12 pages, 4 figures
☆ CaLMQA: Exploring culturally specific long-form question answering across 23 languages
Large language models (LLMs) are commonly used for long-form question answering, which requires them to generate paragraph-length answers to complex questions. While long-form QA has been well-studied in English via many different datasets and evaluation metrics, this research has not been extended to cover most other languages. To bridge this gap, we introduce CaLMQA, a collection of 2.6K complex questions spanning 23 languages, including under-resourced, rarely-studied languages such as Fijian and Kirundi. Our dataset includes both naturally-occurring questions collected from community web forums as well as questions written by native speakers, whom we hire for this purpose. Our process yields diverse, complex questions that reflect cultural topics (e.g. traditions, laws, news) and the language usage of native speakers. We conduct automatic evaluation across a suite of open- and closed-source models using our novel metric CaLMScore, which detects incorrect language and token repetitions in answers, and observe that the quality of LLM-generated answers degrades significantly for some low-resource languages. We perform human evaluation on a subset of models and see that model performance is significantly worse for culturally specific questions than for culturally agnostic questions. Our findings highlight the need for further research in LLM multilingual capabilities and non-English LFQA evaluation.
comment: 39 pages, 16 figures. Code and data available at https://github.com/2015aroras/CaLMQA
☆ Accelerating Clinical Evidence Synthesis with Large Language Models
Automatic medical discovery by AI is a dream of many. One step toward that goal is to create an AI model to understand clinical studies and synthesize clinical evidence from the literature. Clinical evidence synthesis currently relies on systematic reviews of clinical trials and retrospective analyses from medical literature. However, the rapid expansion of publications presents challenges in efficiently identifying, summarizing, and updating evidence. We introduce TrialMind, a generative AI-based pipeline for conducting medical systematic reviews, encompassing study search, screening, and data extraction phases. We utilize large language models (LLMs) to drive each pipeline component while incorporating human expert oversight to minimize errors. To facilitate evaluation, we also create a benchmark dataset TrialReviewBench, a custom dataset with 870 annotated clinical studies from 25 meta-analysis papers across various medical treatments. Our results demonstrate that TrialMind significantly improves the literature review process, achieving high recall rates (0.897-1.000) in study searching from over 20 million PubMed studies and outperforming traditional language model embeddings-based methods in screening (Recall@20 of 0.227-0.246 vs. 0.000-0.102). Furthermore, our approach surpasses direct GPT-4 performance in result extraction, with accuracy ranging from 0.65 to 0.84. We also support clinical evidence synthesis in forest plots, as validated by eight human annotators who preferred TrialMind over the GPT-4 baseline with a winning rate of 62.5%-100% across the involved reviews. Our findings suggest that an LLM-based clinical evidence synthesis approach, such as TrialMind, can enable reliable and high-quality clinical evidence synthesis to improve clinical research efficiency.
☆ Measuring and Benchmarking Large Language Models' Capabilities to Generate Persuasive Language
We are exposed to much information trying to influence us, such as teaser messages, debates, politically framed news, and propaganda - all of which use persuasive language. With the recent interest in Large Language Models (LLMs), we study the ability of LLMs to produce persuasive text. As opposed to prior work which focuses on particular domains or types of persuasion, we conduct a general study across various domains to measure and benchmark to what degree LLMs produce persuasive text - both when explicitly instructed to rewrite text to be more or less persuasive and when only instructed to paraphrase. To this end, we construct a new dataset, Persuasive-Pairs, of pairs each consisting of a short text and of a text rewritten by an LLM to amplify or diminish persuasive language. We multi-annotate the pairs on a relative scale for persuasive language. This data is not only a valuable resource in itself, but we also show that it can be used to train a regression model to predict a score of persuasive language between text pairs. This model can score and benchmark new LLMs across domains, thereby facilitating the comparison of different LLMs. Finally, we discuss effects observed for different system prompts. Notably, we find that different 'personas' in the system prompt of LLaMA3 change the persuasive language in the text substantially, even when only instructed to paraphrase. These findings underscore the importance of investigating persuasive language in LLM generated text.
☆ Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon
Memorization in language models is typically treated as a homogenous phenomenon, neglecting the specifics of the memorized data. We instead model memorization as the effect of a set of complex factors that describe each sample and relate it to the model and corpus. To build intuition around these factors, we break memorization down into a taxonomy: recitation of highly duplicated sequences, reconstruction of inherently predictable sequences, and recollection of sequences that are neither. We demonstrate the usefulness of our taxonomy by using it to construct a predictive model for memorization. By analyzing dependencies and inspecting the weights of the predictive model, we find that different factors influence the likelihood of memorization differently depending on the taxonomic category.
☆ Following Length Constraints in Instructions
Aligned instruction following models can better fulfill user requests than their unaligned counterparts. However, it has been shown that there is a length bias in evaluation of such models, and that training algorithms tend to exploit this bias by learning longer responses. In this work we show how to train models that can be controlled at inference time with instructions containing desired length constraints. Such models are superior in length instructed evaluations, outperforming standard instruction following models such as GPT4, Llama 3 and Mixtral.
comment: 13 pages
☆ Find Parent then Label Children: A Two-stage Taxonomy Completion Method with Pre-trained Language Model
Taxonomies, which organize domain concepts into hierarchical structures, are crucial for building knowledge systems and downstream applications. As domain knowledge evolves, taxonomies need to be continuously updated to include new concepts. Previous approaches have mainly focused on adding concepts to the leaf nodes of the existing hierarchical tree, which does not fully utilize the taxonomy's knowledge and is unable to update the original taxonomy structure (usually involving non-leaf nodes). In this paper, we propose a two-stage method called ATTEMPT for taxonomy completion. Our method inserts new concepts into the correct position by finding a parent node and labeling child nodes. Specifically, by combining local nodes with prompts to generate natural sentences, we take advantage of pre-trained language models for hypernym/hyponymy recognition. Experimental results on two public datasets (including six domains) show that ATTEMPT performs best on both taxonomy completion and extension tasks, surpassing existing methods.
☆ LLM Targeted Underperformance Disproportionately Impacts Vulnerable Users
While state-of-the-art Large Language Models (LLMs) have shown impressive performance on many tasks, there has been extensive research on undesirable model behavior such as hallucinations and bias. In this work, we investigate how the quality of LLM responses changes in terms of information accuracy, truthfulness, and refusals depending on three user traits: English proficiency, education level, and country of origin. We present extensive experimentation on three state-of-the-art LLMs and two different datasets targeting truthfulness and factuality. Our findings suggest that undesirable behaviors in state-of-the-art LLMs occur disproportionately more for users with lower English proficiency, of lower education status, and originating from outside the US, rendering these models unreliable sources of information towards their most vulnerable users.
☆ ViANLI: Adversarial Natural Language Inference for Vietnamese
The development of Natural Language Processing (NLI) datasets and models has been inspired by innovations in annotation design. With the rapid development of machine learning models today, the performance of existing machine learning models has quickly reached state-of-the-art results on a variety of tasks related to natural language processing, including natural language inference tasks. By using a pre-trained model during the annotation process, it is possible to challenge current NLI models by having humans produce premise-hypothesis combinations that the machine model cannot correctly predict. To remain attractive and challenging in the research of natural language inference for Vietnamese, in this paper, we introduce the adversarial NLI dataset to the NLP research community with the name ViANLI. This data set contains more than 10K premise-hypothesis pairs and is built by a continuously adjusting process to obtain the most out of the patterns generated by the annotators. ViANLI dataset has brought many difficulties to many current SOTA models when the accuracy of the most powerful model on the test set only reached 48.4%. Additionally, the experimental results show that the models trained on our dataset have significantly improved the results on other Vietnamese NLI datasets.
☆ FedBiOT: LLM Local Fine-tuning in Federated Learning without Full Model KDD 2024
Large language models (LLMs) show amazing performance on many domain-specific tasks after fine-tuning with some appropriate data. However, many domain-specific data are privately distributed across multiple owners. Thus, this dilemma raises the interest in how to perform LLM fine-tuning in federated learning (FL). However, confronted with limited computation and communication capacities, FL clients struggle to fine-tune an LLM effectively. To this end, we introduce FedBiOT, a resource-efficient LLM fine-tuning approach to FL. Specifically, our method involves the server generating a compressed LLM and aligning its performance with the full model. Subsequently, the clients fine-tune a lightweight yet important part of the compressed model, referred to as an adapter. Notice that as the server has no access to the private data owned by the clients, the data used for alignment by the server has a different distribution from the one used for fine-tuning by clients. We formulate the problem into a bi-level optimization problem to minimize the negative effect of data discrepancy and derive the updating rules for the server and clients. We conduct extensive experiments on LLaMA-2, empirically showing that the adapter has exceptional performance when reintegrated into the global LLM. The results also indicate that the proposed FedBiOT significantly reduces resource consumption compared to existing benchmarks, all while achieving comparable performance levels.
comment: KDD 2024
☆ From Distributional to Overton Pluralism: Investigating Large Language Model Alignment
The alignment process changes several properties of a large language model's (LLM's) output distribution. We analyze two aspects of post-alignment distributional shift of LLM responses. First, we re-examine previously reported reductions in response diversity post-alignment. Our analysis suggests that an apparent drop in the diversity of responses is largely explained by quality control and information aggregation. Alignment suppresses irrelevant and unhelpful content while shifting the output distribution toward longer responses that cover information spanning several responses from the base LLM, essentially presenting diverse information in a single response. Finding little evidence that alignment suppresses useful information, it is natural to ask the opposite question: do aligned models surface information that cannot be recovered from base models? Our second investigation shows this is not the case and the behavior of aligned models is recoverable from base models without fine-tuning. A combination of in-context examples and lower-resolution semantic hints about response content can elicit responses from base LLMs that are as similar to alignment-tuned LLM responses as alignment-tuned LLM responses are to each other. Taken together, these results indicate that current alignment techniques capture but do not extend the useful subset of assistant-like base LLM behavior, providing further evidence for the Superficial Alignment Hypothesis. They also show that in-context alignment can go surprisingly far as a strategy for imitating aligned LLMs without fine-tuning. Our code and data is available at https://github.com/thomlake/investigating-alignment.
☆ VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation
As large language models achieve impressive scores on traditional benchmarks, an increasing number of researchers are becoming concerned about benchmark data leakage during pre-training, commonly known as the data contamination problem. To ensure fair evaluation, recent benchmarks release only the training and validation sets, keeping the test set labels closed-source. They require anyone wishing to evaluate his language model to submit the model's predictions for centralized processing and then publish the model's result on their leaderboard. However, this submission process is inefficient and prevents effective error analysis. To address this issue, we propose to variabilize benchmarks and evaluate language models dynamically. Specifically, we extract variables from each test case and define a value range for each variable. For each evaluation, we sample new values from these value ranges to create unique test cases, thus ensuring a fresh evaluation each time. We applied this variable perturbation method to four datasets: GSM8K, ARC, CommonsenseQA, and TruthfulQA, which cover mathematical generation and multiple-choice tasks. Our experimental results demonstrate that this approach provides a more accurate assessment of the true capabilities of language models, effectively mitigating the contamination problem.
☆ Quantifying AI Psychology: A Psychometrics Benchmark for Large Language Models
Large Language Models (LLMs) have demonstrated exceptional task-solving capabilities, increasingly adopting roles akin to human-like assistants. The broader integration of LLMs into society has sparked interest in whether they manifest psychological attributes, and whether these attributes are stable-inquiries that could deepen the understanding of their behaviors. Inspired by psychometrics, this paper presents a framework for investigating psychology in LLMs, including psychological dimension identification, assessment dataset curation, and assessment with results validation. Following this framework, we introduce a comprehensive psychometrics benchmark for LLMs that covers six psychological dimensions: personality, values, emotion, theory of mind, motivation, and intelligence. This benchmark includes thirteen datasets featuring diverse scenarios and item types. Our findings indicate that LLMs manifest a broad spectrum of psychological attributes. We also uncover discrepancies between LLMs' self-reported traits and their behaviors in real-world scenarios. This paper demonstrates a thorough psychometric assessment of LLMs, providing insights into reliable evaluation and potential applications in AI and social sciences.
☆ This Paper Had the Smartest Reviewers -- Flattery Detection Utilising an Audio-Textual Transformer-Based Approach
Flattery is an important aspect of human communication that facilitates social bonding, shapes perceptions, and influences behavior through strategic compliments and praise, leveraging the power of speech to build rapport effectively. Its automatic detection can thus enhance the naturalness of human-AI interactions. To meet this need, we present a novel audio textual dataset comprising 20 hours of speech and train machine learning models for automatic flattery detection. In particular, we employ pretrained AST, Wav2Vec2, and Whisper models for the speech modality, and Whisper TTS models combined with a RoBERTa text classifier for the textual modality. Subsequently, we build a multimodal classifier by combining text and audio representations. Evaluation on unseen test data demonstrates promising results, with Unweighted Average Recall scores reaching 82.46% in audio-only experiments, 85.97% in text-only experiments, and 87.16% using a multimodal approach.
comment: Interspeech 2024
☆ LLM-ARC: Enhancing LLMs with an Automated Reasoning Critic
We introduce LLM-ARC, a neuro-symbolic framework designed to enhance the logical reasoning capabilities of Large Language Models (LLMs), by combining them with an Automated Reasoning Critic (ARC). LLM-ARC employs an Actor-Critic method where the LLM Actor generates declarative logic programs along with tests for semantic correctness, while the Automated Reasoning Critic evaluates the code, runs the tests and provides feedback on test failures for iterative refinement. Implemented using Answer Set Programming (ASP), LLM-ARC achieves a new state-of-the-art accuracy of 88.32% on the FOLIO benchmark which tests complex logical reasoning capabilities. Our experiments demonstrate significant improvements over LLM-only baselines, highlighting the importance of logic test generation and iterative self-refinement. We achieve our best result using a fully automated self-supervised training loop where the Actor is trained on end-to-end dialog traces with Critic feedback. We discuss potential enhancements and provide a detailed error analysis, showcasing the robustness and efficacy of LLM-ARC for complex natural language reasoning tasks.
☆ ELIZA Reinterpreted: The world's first chatbot was not intended as a chatbot at all
ELIZA, often considered the world's first chatbot, was written by Joseph Weizenbaum in the early 1960s. Weizenbaum did not intend to invent the chatbot, but rather to build a platform for research into human-machine conversation and the important cognitive processes of interpretation and misinterpretation. His purpose was obscured by ELIZA's fame, resulting in large part from the fortuitous timing of it's creation, and it's escape into the wild. In this paper I provide a rich historical context for ELIZA's creation, demonstrating that ELIZA arose from the intersection of some of the central threads in the technical history of AI. I also briefly discuss how ELIZA escaped into the world, and how its accidental escape, along with several coincidental turns of the programming language screws, led both to the misapprehension that ELIZA was intended as a chatbot, and to the loss of the original ELIZA to history for over 50 years.
comment: In review in IEEE Annals of the History of Computing (submitted Apr 2024)
☆ Variationist: Exploring Multifaceted Variation and Bias in Written Language Data ACL 2024
Exploring and understanding language data is a fundamental stage in all areas dealing with human language. It allows NLP practitioners to uncover quality concerns and harmful biases in data before training, and helps linguists and social scientists to gain insight into language use and human behavior. Yet, there is currently a lack of a unified, customizable tool to seamlessly inspect and visualize language variation and bias across multiple variables, language units, and diverse metrics that go beyond descriptive statistics. In this paper, we introduce Variationist, a highly-modular, extensible, and task-agnostic tool that fills this gap. Variationist handles at once a potentially unlimited combination of variable types and semantics across diversity and association metrics with regards to the language unit of choice, and orchestrates the creation of up to five-dimensional interactive charts for over 30 variable type-semantics combinations. Through our case studies on computational dialectology, human label variation, and text generation, we show how Variationist enables researchers from different disciplines to effortlessly answer specific research questions or unveil undesired associations in language data. A Python library, code, documentation, and tutorials are made publicly available to the research community.
comment: ACL 2024 (System Demonstrations)
☆ Banishing LLM Hallucinations Requires Rethinking Generalization
Despite their powerful chat, coding, and reasoning abilities, Large Language Models (LLMs) frequently hallucinate. Conventional wisdom suggests that hallucinations are a consequence of a balance between creativity and factuality, which can be mitigated, but not eliminated, by grounding the LLM in external knowledge sources. Through extensive systematic experiments, we show that these traditional approaches fail to explain why LLMs hallucinate in practice. Specifically, we show that LLMs augmented with a massive Mixture of Memory Experts (MoME) can easily memorize large datasets of random numbers. We corroborate these experimental findings with a theoretical construction showing that simple neural networks trained to predict the next token hallucinate when the training loss is above a threshold as it usually does in practice when training on internet scale data. We interpret our findings by comparing against traditional retrieval methods for mitigating hallucinations. We use our findings to design a first generation model for removing hallucinations -- Lamini-1 -- that stores facts in a massive mixture of millions of memory experts that are retrieved dynamically.
☆ Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP
Contrastive Language--Image Pre-training (CLIP) has manifested remarkable improvements in zero-shot classification and cross-modal vision-language tasks. Yet, from a geometrical point of view, the CLIP embedding space has been found to have a pronounced modality gap. This gap renders the embedding space overly sparse and disconnected, with different modalities being densely distributed in distinct subregions of the hypersphere. In this work, we aim at answering two main questions: 1. Does sharing the parameter space between the multi-modal encoders reduce the modality gap? 2. Can the gap be mitigated by pushing apart the uni-modal embeddings via intra-modality separation? We design AlignCLIP, in order to answer these questions and show that answers to both questions are positive. Through extensive experiments, we show that AlignCLIP achieves noticeable enhancements in the cross-modal alignment of the embeddings, and thereby, reduces the modality gap, while maintaining the performance across several downstream evaluations, such as zero-shot image classification, zero-shot multi-modal retrieval and zero-shot semantic text similarity.
☆ Knowledge Distillation in Automated Annotation: Supervised Text Classification with LLM-Generated Training Labels
Computational social science (CSS) practitioners often rely on human-labeled data to fine-tune supervised text classifiers. We assess the potential for researchers to augment or replace human-generated training data with surrogate training labels from generative large language models (LLMs). We introduce a recommended workflow and test this LLM application by replicating 14 classification tasks and measuring performance. We employ a novel corpus of English-language text classification data sets from recent CSS articles in high-impact journals. Because these data sets are stored in password-protected archives, our analyses are less prone to issues of contamination. For each task, we compare supervised classifiers fine-tuned using GPT-4 labels against classifiers fine-tuned with human annotations and against labels from GPT-4 and Mistral-7B with few-shot in-context learning. Our findings indicate that supervised classification models fine-tuned on LLM-generated labels perform comparably to models fine-tuned with labels from human annotators. Fine-tuning models using LLM-generated labels can be a fast, efficient and cost-effective method of building supervised text classifiers.
comment: In Proceedings of the Sixth Workshop on Natural Language Processing and Computational Social Science
☆ CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference EMNLP 2024
As large language models (LLMs) constantly evolve, ensuring their safety remains a critical research problem. Previous red-teaming approaches for LLM safety have primarily focused on single prompt attacks or goal hijacking. To the best of our knowledge, we are the first to study LLM safety in multi-turn dialogue coreference. We created a dataset of 1,400 questions across 14 categories, each featuring multi-turn coreference safety attacks. We then conducted detailed evaluations on five widely used open-source LLMs. The results indicated that under multi-turn coreference safety attacks, the highest attack success rate was 56% with the LLaMA2-Chat-7b model, while the lowest was 13.9% with the Mistral-7B-Instruct model. These findings highlight the safety vulnerabilities in LLMs during dialogue coreference interactions.
comment: Submitted to EMNLP 2024
☆ Self-assessment, Exhibition, and Recognition: a Review of Personality in Large Language Models
As large language models (LLMs) appear to behave increasingly human-like in text-based interactions, more and more researchers become interested in investigating personality in LLMs. However, the diversity of psychological personality research and the rapid development of LLMs have led to a broad yet fragmented landscape of studies in this interdisciplinary field. Extensive studies across different research focuses, different personality psychometrics, and different LLMs make it challenging to have a holistic overview and further pose difficulties in applying findings to real-world applications. In this paper, we present a comprehensive review by categorizing current studies into three research problems: self-assessment, exhibition, and recognition, based on the intrinsic characteristics and external manifestations of personality in LLMs. For each problem, we provide a thorough analysis and conduct in-depth comparisons of their corresponding solutions. Besides, we summarize research findings and open challenges from current studies and further discuss their underlying causes. We also collect extensive publicly available resources to facilitate interested researchers and developers. Lastly, we discuss the potential future research directions and application scenarios. Our paper is the first comprehensive survey of up-to-date literature on personality in LLMs. By presenting a clear taxonomy, in-depth analysis, promising future directions, and extensive resource collections, we aim to provide a better understanding and facilitate further advancements in this emerging field.
☆ Towards Building an End-to-End Multilingual Automatic Lyrics Transcription Model
Multilingual automatic lyrics transcription (ALT) is a challenging task due to the limited availability of labelled data and the challenges introduced by singing, compared to multilingual automatic speech recognition. Although some multilingual singing datasets have been released recently, English continues to dominate these collections. Multilingual ALT remains underexplored due to the scale of data and annotation quality. In this paper, we aim to create a multilingual ALT system with available datasets. Inspired by architectures that have been proven effective for English ALT, we adapt these techniques to the multilingual scenario by expanding the target vocabulary set. We then evaluate the performance of the multilingual model in comparison to its monolingual counterparts. Additionally, we explore various conditioning methods to incorporate language information into the model. We apply analysis by language and combine it with the language classification performance. Our findings reveal that the multilingual model performs consistently better than the monolingual models trained on the language subsets. Furthermore, we demonstrate that incorporating language information significantly enhances performance.
comment: Accepted at EUSIPCO 2024
☆ "Seeing the Big through the Small": Can LLMs Approximate Human Judgment Distributions on NLI from a Few Explanations?
Human label variation (HLV) is a valuable source of information that arises when multiple human annotators provide different labels for valid reasons. In Natural Language Inference (NLI) earlier approaches to capturing HLV involve either collecting annotations from many crowd workers to represent human judgment distribution (HJD) or use expert linguists to provide detailed explanations for their chosen labels. While the former method provides denser HJD information, obtaining it is resource-intensive. In contrast, the latter offers richer textual information but it is challenging to scale up to many human judges. Besides, large language models (LLMs) are increasingly used as evaluators (``LLM judges'') but with mixed results, and few works aim to study HJDs. This study proposes to exploit LLMs to approximate HJDs using a small number of expert labels and explanations. Our experiments show that a few explanations significantly improve LLMs' ability to approximate HJDs with and without explicit labels, thereby providing a solution to scale up annotations for HJD. However, fine-tuning smaller soft-label aware models with the LLM-generated model judgment distributions (MJDs) presents partially inconsistent results: while similar in distance, their resulting fine-tuned models and visualized distributions differ substantially. We show the importance of complementing instance-level distance measures with a global-level shape metric and visualization to more effectively evaluate MJDs against human judgment distributions.
comment: 22 pages, 9 figures
☆ LongIns: A Challenging Long-context Instruction-based Exam for LLMs
The long-context capabilities of large language models (LLMs) have been a hot topic in recent years. To evaluate the performance of LLMs in different scenarios, various assessment benchmarks have emerged. However, as most of these benchmarks focus on identifying key information to answer questions, which mainly requires the retrieval ability of LLMs, these benchmarks can partially represent the reasoning performance of LLMs from large amounts of information. Meanwhile, although LLMs often claim to have context windows of 32k, 128k, 200k, or even longer, these benchmarks fail to reveal the actual supported length of these LLMs. To address these issues, we propose the LongIns benchmark dataset, a challenging long-context instruction-based exam for LLMs, which is built based on the existing instruction datasets. Specifically, in our LongIns, we introduce three evaluation settings: Global Instruction & Single Task (GIST), Local Instruction & Single Task (LIST), and Local Instruction & Multiple Tasks (LIMT). Based on LongIns, we perform comprehensive evaluations on existing LLMs and have the following important findings: (1). The top-performing GPT-4 with 128k context length performs poorly on the evaluation context window of 16k in our LongIns. (2). For the multi-hop reasoning ability of many existing LLMs, significant efforts are still needed under short context windows (less than 4k).
☆ Beyond Text-to-SQL for IoT Defense: A Comprehensive Framework for Querying and Classifying IoT Threats
Recognizing the promise of natural language interfaces to databases, prior studies have emphasized the development of text-to-SQL systems. While substantial progress has been made in this field, existing research has concentrated on generating SQL statements from text queries. The broader challenge, however, lies in inferring new information about the returned data. Our research makes two major contributions to address this gap. First, we introduce a novel Internet-of-Things (IoT) text-to-SQL dataset comprising 10,985 text-SQL pairs and 239,398 rows of network traffic activity. The dataset contains additional query types limited in prior text-to-SQL datasets, notably temporal-related queries. Our dataset is sourced from a smart building's IoT ecosystem exploring sensor read and network traffic data. Second, our dataset allows two-stage processing, where the returned data (network traffic) from a generated SQL can be categorized as malicious or not. Our results show that joint training to query and infer information about the data can improve overall text-to-SQL performance, nearly matching substantially larger models. We also show that current large language models (e.g., GPT3.5) struggle to infer new information about returned data, thus our dataset provides a novel test bed for integrating complex domain-specific reasoning into LLMs.
☆ FrenchToxicityPrompts: a Large Benchmark for Evaluating and Mitigating Toxicity in French Texts
Large language models (LLMs) are increasingly popular but are also prone to generating bias, toxic or harmful language, which can have detrimental effects on individuals and communities. Although most efforts is put to assess and mitigate toxicity in generated content, it is primarily concentrated on English, while it's essential to consider other languages as well. For addressing this issue, we create and release FrenchToxicityPrompts, a dataset of 50K naturally occurring French prompts and their continuations, annotated with toxicity scores from a widely used toxicity classifier. We evaluate 14 different models from four prevalent open-sourced families of LLMs against our dataset to assess their potential toxicity across various dimensions. We hope that our contribution will foster future research on toxicity detection and mitigation beyond Englis
comment: TRAC-2024, Fourth Workshop on Threat, Aggression and Cyberbullying. 20 May 2024
☆ Multi-property Steering of Large Language Models with Dynamic Activation Composition
Activation steering methods were shown to be effective in conditioning language model generation by additively intervening over models' intermediate representations. However, the evaluation of these techniques has so far been limited to single conditioning properties and synthetic settings. In this work, we conduct a comprehensive evaluation of various activation steering strategies, highlighting the property-dependent nature of optimal parameters to ensure a robust effect throughout generation. To address this issue, we propose Dynamic Activation Composition, an information-theoretic approach to modulate the steering intensity of one or more properties throughout generation. Our experiments on multi-property steering show that our method successfully maintains high conditioning while minimizing the impact of conditioning on generation fluency.
☆ The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
The performance of a large language model (LLM) depends heavily on the quality and size of its pretraining dataset. However, the pretraining datasets for state-of-the-art open LLMs like Llama 3 and Mixtral are not publicly available and very little is known about how they were created. In this work, we introduce FineWeb, a 15-trillion token dataset derived from 96 Common Crawl snapshots that produces better-performing LLMs than other open pretraining datasets. To advance the understanding of how best to curate high-quality pretraining datasets, we carefully document and ablate all of the design choices used in FineWeb, including in-depth investigations of deduplication and filtering strategies. In addition, we introduce FineWeb-Edu, a 1.3-trillion token collection of educational text filtered from FineWeb. LLMs pretrained on FineWeb-Edu exhibit dramatically better performance on knowledge- and reasoning-intensive benchmarks like MMLU and ARC. Along with our datasets, we publicly release our data curation codebase and all of the models trained during our ablation experiments.
☆ Retrieval-Augmented Code Generation for Situated Action Generation: A Case Study on Minecraft
In the Minecraft Collaborative Building Task, two players collaborate: an Architect (A) provides instructions to a Builder (B) to assemble a specified structure using 3D blocks. In this work, we investigate the use of large language models (LLMs) to predict the sequence of actions taken by the Builder. Leveraging LLMs' in-context learning abilities, we use few-shot prompting techniques, that significantly improve performance over baseline methods. Additionally, we present a detailed analysis of the gaps in performance for future work
comment: under review
☆ CDQuant: Accurate Post-training Weight Quantization of Large Pre-trained Models using Greedy Coordinate Descent
Large language models (LLMs) have recently demonstrated remarkable performance across diverse language tasks. But their deployment is often constrained by their substantial computational and storage requirements. Quantization has emerged as a key technique for addressing this challenge, enabling the compression of large models with minimal impact on performance. The recent GPTQ algorithm, a post-training quantization (PTQ) method, has proven highly effective for compressing LLMs, sparking a wave of research that leverages GPTQ as a core component. Recognizing the pivotal role of GPTQ in the PTQ landscape, we introduce CDQuant, a simple and scalable alternative to GPTQ with improved performance. CDQuant uses coordinate descent to minimize the layer-wise reconstruction loss to achieve high-quality quantized weights. Our algorithm is easy to implement and scales efficiently to models with hundreds of billions of parameters. Through extensive evaluation on the PaLM2 model family, we demonstrate that CDQuant consistently outperforms GPTQ across diverse model sizes and quantization levels. In particular, for INT2 quantization of PaLM2-Otter, CDQuant achieves a 10% reduction in perplexity compared to GPTQ.
☆ Disce aut Deficere: Evaluating LLMs Proficiency on the INVALSI Italian Benchmark
Recent advancements in Large Language Models (LLMs) have significantly enhanced their ability to generate and manipulate human language, highlighting their potential across various applications. Evaluating LLMs in languages other than English is crucial for ensuring their linguistic versatility, cultural relevance, and applicability in diverse global contexts, thus broadening their usability and effectiveness. We tackle this challenge by introducing a structured benchmark using the INVALSI tests, a set of well-established assessments designed to measure educational competencies across Italy. Our study makes three primary contributions: Firstly, we adapt the INVALSI benchmark for automated LLM evaluation, which involves rigorous adaptation of the test format to suit automated processing while retaining the essence of the original tests. Secondly, we provide a detailed assessment of current LLMs, offering a crucial reference point for the academic community. Finally, we visually compare the performance of these models against human results. Additionally, researchers are invited to submit their models for ongoing evaluation, ensuring the benchmark remains a current and valuable resource.
☆ Retrieval-style In-Context Learning for Few-shot Hierarchical Text Classification
Hierarchical text classification (HTC) is an important task with broad applications, while few-shot HTC has gained increasing interest recently. While in-context learning (ICL) with large language models (LLMs) has achieved significant success in few-shot learning, it is not as effective for HTC because of the expansive hierarchical label sets and extremely-ambiguous labels. In this work, we introduce the first ICL-based framework with LLM for few-shot HTC. We exploit a retrieval database to identify relevant demonstrations, and an iterative policy to manage multi-layer hierarchical labels. Particularly, we equip the retrieval database with HTC label-aware representations for the input texts, which is achieved by continual training on a pretrained language model with masked language modeling (MLM), layer-wise classification (CLS, specifically for HTC), and a novel divergent contrastive learning (DCL, mainly for adjacent semantically-similar labels) objective. Experimental results on three benchmark datasets demonstrate superior performance of our method, and we can achieve state-of-the-art results in few-shot HTC.
comment: 17 pages
☆ Can Large Language Models Understand DL-Lite Ontologies? An Empirical Study
Large language models (LLMs) have shown significant achievements in solving a wide range of tasks. Recently, LLMs' capability to store, retrieve and infer with symbolic knowledge has drawn a great deal of attention, showing their potential to understand structured information. However, it is not yet known whether LLMs can understand Description Logic (DL) ontologies. In this work, we empirically analyze the LLMs' capability of understanding DL-Lite ontologies covering 6 representative tasks from syntactic and semantic aspects. With extensive experiments, we demonstrate both the effectiveness and limitations of LLMs in understanding DL-Lite ontologies. We find that LLMs can understand formal syntax and model-theoretic semantics of concepts and roles. However, LLMs struggle with understanding TBox NI transitivity and handling ontologies with large ABoxes. We hope that our experiments and analyses provide more insights into LLMs and inspire to build more faithful knowledge engineering solutions.
☆ LumberChunker: Long-Form Narrative Document Segmentation
Modern NLP tasks increasingly rely on dense retrieval methods to access up-to-date and relevant contextual information. We are motivated by the premise that retrieval benefits from segments that can vary in size such that a content's semantic independence is better captured. We propose LumberChunker, a method leveraging an LLM to dynamically segment documents, which iteratively prompts the LLM to identify the point within a group of sequential passages where the content begins to shift. To evaluate our method, we introduce GutenQA, a benchmark with 3000 "needle in a haystack" type of question-answer pairs derived from 100 public domain narrative books available on Project Gutenberg. Our experiments show that LumberChunker not only outperforms the most competitive baseline by 7.37% in retrieval performance (DCG@20) but also that, when integrated into a RAG pipeline, LumberChunker proves to be more effective than other chunking methods and competitive baselines, such as the Gemini 1.5M Pro. Our Code and Data are available at https://github.com/joaodsmarques/LumberChunker
☆ Entropy-Based Decoding for Retrieval-Augmented Large Language Models
Augmenting Large Language Models (LLMs) with retrieved external knowledge has proven effective for improving the factual accuracy of generated responses. Despite their success, retrieval-augmented LLMs still face the distractibility issue, where the generated responses are negatively influenced by noise from both external and internal knowledge sources. In this paper, we introduce a novel, training-free decoding method guided by entropy considerations to mitigate this issue. Our approach utilizes entropy-based document-parallel ensemble decoding to prioritize low-entropy distributions from retrieved documents, thereby enhancing the extraction of relevant information of context. Additionally, it incorporates a contrastive decoding mechanism that contrasts the obtained low-entropy ensemble distribution with the high-entropy distribution derived from the model's internal knowledge across layers, which ensures a greater emphasis on reliable external information. Extensive experiments on open-domain question answering datasets demonstrate the superiority of our method.
☆ Benchmarking Mental State Representations in Language Models ICML 2024
While numerous works have assessed the generative performance of language models (LMs) on tasks requiring Theory of Mind reasoning, research into the models' internal representation of mental states remains limited. Recent work has used probing to demonstrate that LMs can represent beliefs of themselves and others. However, these claims are accompanied by limited evaluation, making it difficult to assess how mental state representations are affected by model design and training choices. We report an extensive benchmark with various LM types with different model sizes, fine-tuning approaches, and prompt designs to study the robustness of mental state representations and memorisation issues within the probes. Our results show that the quality of models' internal representations of the beliefs of others increases with model size and, more crucially, with fine-tuning. We are the first to study how prompt variations impact probing performance on theory of mind tasks. We demonstrate that models' representations are sensitive to prompt variations, even when such variations should be beneficial. Finally, we complement previous activation editing experiments on Theory of Mind tasks and show that it is possible to improve models' reasoning performance by steering their activations without the need to train any probe.
comment: ICML 2024 Workshop on Mechanistic Interpretability
☆ MedCare: Advancing Medical LLMs through Decoupling Clinical Alignment and Knowledge Aggregation
Large language models (LLMs) have shown substantial progress in natural language understanding and generation, proving valuable especially in the medical field. Despite advancements, challenges persist due to the complexity and diversity inherent in medical tasks, which can be categorized as knowledge-intensive tasks and alignment-required tasks. Previous approaches either ignore the latter task or focus on a minority of tasks and hence lose generalization. To address these drawbacks, we propose a progressive fine-tuning pipeline. This pipeline employs a Knowledge Aggregator and a Noise aggregator to encode diverse knowledge in the first stage and filter out detrimental information. In the second stage, we drop the Noise Aggregator to avoid the interference of suboptimal representation and leverage an additional alignment module optimized towards an orthogonal direction to the knowledge space to mitigate knowledge forgetting. Based on this two-stage paradigm, we proposed a Medical LLM through decoupling Clinical Alignment and Knowledge Aggregation (MedCare), which is designed to achieve state-of-the-art (SOTA) performance on over 20 medical tasks, as well as SOTA results on specific medical alignment tasks. Various model sizes of MedCare (1.8B, 7B, 14B) all demonstrate significant improvements over existing models with similar model sizes.
comment: 19 pages, 6 figures
Transformer-based Named Entity Recognition with Combined Data Representation
This study examines transformer-based models and their effectiveness in named entity recognition tasks. The study investigates data representation strategies, including single, merged, and context, which respectively use one sentence, multiple sentences, and sentences joined with attention to context per vector. Analysis shows that training models with a single strategy may lead to poor performance on different data representations. To address this limitation, the study proposes a combined training procedure that utilizes all three strategies to improve model stability and adaptability. The results of this approach are presented and discussed for four languages (English, Polish, Czech, and German) across various datasets, demonstrating the effectiveness of the combined strategy.
comment: 14 pages, 6 figures
☆ Enhancing Tool Retrieval with Iterative Feedback from Large Language Models
Tool learning aims to enhance and expand large language models' (LLMs) capabilities with external tools, which has gained significant attention recently. Current methods have shown that LLMs can effectively handle a certain amount of tools through in-context learning or fine-tuning. However, in real-world scenarios, the number of tools is typically extensive and irregularly updated, emphasizing the necessity for a dedicated tool retrieval component. Tool retrieval is nontrivial due to the following challenges: 1) complex user instructions and tool descriptions; 2) misalignment between tool retrieval and tool usage models. To address the above issues, we propose to enhance tool retrieval with iterative feedback from the large language model. Specifically, we prompt the tool usage model, i.e., the LLM, to provide feedback for the tool retriever model in multi-round, which could progressively improve the tool retriever's understanding of instructions and tools and reduce the gap between the two standalone components. We build a unified and comprehensive benchmark to evaluate tool retrieval models. The extensive experiments indicate that our proposed approach achieves advanced performance in both in-domain evaluation and out-of-domain evaluation.
☆ Improving Grammatical Error Correction via Contextual Data Augmentation ACL 2024
Nowadays, data augmentation through synthetic data has been widely used in the field of Grammatical Error Correction (GEC) to alleviate the problem of data scarcity. However, these synthetic data are mainly used in the pre-training phase rather than the data-limited fine-tuning phase due to inconsistent error distribution and noisy labels. In this paper, we propose a synthetic data construction method based on contextual augmentation, which can ensure an efficient augmentation of the original data with a more consistent error distribution. Specifically, we combine rule-based substitution with model-based generation, using the generative model to generate a richer context for the extracted error patterns. Besides, we also propose a relabeling-based data cleaning method to mitigate the effects of noisy labels in synthetic data. Experiments on CoNLL14 and BEA19-Test show that our proposed augmentation method consistently and substantially outperforms strong baselines and achieves the state-of-the-art level with only a few synthetic data.
comment: Accepted as Findings of ACL 2024
☆ Learning to Ask Informative Questions: Enhancing LLMs with Preference Optimization and Expected Information Gain
Questions are essential tools for acquiring the necessary information to complete information-seeking tasks. However, large language models (LLMs), especially open-source models, often perform poorly in generating informative questions, as measured by expected information gain (EIG). In this paper, we propose a method to enhance the informativeness of LLM-generated questions in 20-question game dialogues. We sample multiple questions from the same model (LLAMA 2-CHAT 7B) for each game and create pairs of low-EIG and high-EIG questions to apply a Direct Preference Optimization (DPO) algorithm. Our results show that this method produces more effective questions (in terms of EIG), even in domains different from those used to train the DPO model.
☆ Towards Probing Speech-Specific Risks in Large Multimodal Models: A Taxonomy, Benchmark, and Insights
Large Multimodal Models (LMMs) have achieved great success recently, demonstrating a strong capability to understand multimodal information and to interact with human users. Despite the progress made, the challenge of detecting high-risk interactions in multimodal settings, and in particular in speech modality, remains largely unexplored. Conventional research on risk for speech modality primarily emphasises the content (e.g., what is captured as transcription). However, in speech-based interactions, paralinguistic cues in audio can significantly alter the intended meaning behind utterances. In this work, we propose a speech-specific risk taxonomy, covering 8 risk categories under hostility (malicious sarcasm and threats), malicious imitation (age, gender, ethnicity), and stereotypical biases (age, gender, ethnicity). Based on the taxonomy, we create a small-scale dataset for evaluating current LMMs capability in detecting these categories of risk. We observe even the latest models remain ineffective to detect various paralinguistic-specific risks in speech (e.g., Gemini 1.5 Pro is performing only slightly above random baseline). Warning: this paper contains biased and offensive examples.
☆ Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA
Long-context modeling capabilities have garnered widespread attention, leading to the emergence of Large Language Models (LLMs) with ultra-context windows. Meanwhile, benchmarks for evaluating long-context LLMs are gradually catching up. However, existing benchmarks employ irrelevant noise texts to artificially extend the length of test cases, diverging from the real-world scenarios of long-context applications. To bridge this gap, we propose a novel long-context benchmark, Loong, aligning with realistic scenarios through extended multi-document question answering (QA). Unlike typical document QA, in Loong's test cases, each document is relevant to the final answer, ignoring any document will lead to the failure of the answer. Furthermore, Loong introduces four types of tasks with a range of context lengths: Spotlight Locating, Comparison, Clustering, and Chain of Reasoning, to facilitate a more realistic and comprehensive evaluation of long-context understanding. Extensive experiments indicate that existing long-context language models still exhibit considerable potential for enhancement. Retrieval augmented generation (RAG) achieves poor performance, demonstrating that Loong can reliably assess the model's long-context modeling capabilities.
comment: We release our code and data publicly at https://github.com/MozerWang/Loong
☆ Variable Layer-Wise Quantization: A Simple and Effective Approach to Quantize LLMs EMNLP
We present a simple variable quantization approach that quantizes different layers of a large language model (LLM) at different bit levels. Specifically, we quantize the most important layers to higher bit precision and less important layers to lower bits to achieve floating point quantization levels. We propose two effective strategies to measure the importance of layers within LLMs: the first measures the importance of a layer based on how different its output embeddings are from the input embeddings (the higher the better); the second estimates the importance of a layer using the number of layer weights that are much larger than average (the smaller the better). We show that quantizing different layers at varying bits according to our importance scores results in minimal performance drop with a far more compressed model size. Finally, we present several practical key takeaways from our variable layer-wise quantization experiments: (a) LLM performance under variable quantization remains close to the original model until 25-50% of layers are moved in lower quantization using our proposed ordering but only until 5-10% if moved using no specific ordering; (b) Quantizing LLMs to lower bits performs substantially better than pruning unless extreme quantization (2-bit) is used; and (c) Layer-wise quantization to lower bits works better in the case of larger LLMs with more layers compared to smaller LLMs with fewer layers. The code used to run the experiments is available at: https://github.com/RazvanDu/LayerwiseQuant.
comment: submitted to EMNLP, 15 pages, 10 figures, 4 tables
☆ Make Some Noise: Unlocking Language Model Parallel Inference Capability through Noisy Training
Existing speculative decoding methods typically require additional model structure and training processes to assist the model for draft token generation. This makes the migration of acceleration methods to the new model more costly and more demanding on device memory. To address this problem, we propose the Make Some Noise (MSN) training framework as a replacement for the supervised fine-tuning stage of the large language model. The training method simply introduces some noise at the input for the model to learn the denoising task. It significantly enhances the parallel decoding capability of the model without affecting the original task capability. In addition, we propose a tree-based retrieval-augmented Jacobi (TR-Jacobi) decoding strategy to further improve the inference speed of MSN models. Experiments in both the general and code domains have shown that MSN can improve inference speed by 2.3-2.7x times without compromising model performance. The MSN model also achieves comparable acceleration ratios to the SOTA model with additional model structure on Spec-Bench.
comment: 11 pages, 6 figures
☆ Native Design Bias: Studying the Impact of English Nativeness on Language Model Performance
Large Language Models (LLMs) excel at providing information acquired during pretraining on large-scale corpora and following instructions through user prompts. This study investigates whether the quality of LLM responses varies depending on the demographic profile of users. Considering English as the global lingua franca, along with the diversity of its dialects among speakers of different native languages, we explore whether non-native English speakers receive lower-quality or even factually incorrect responses from LLMs more frequently. Our results show that performance discrepancies occur when LLMs are prompted by native versus non-native English speakers and persist when comparing native speakers from Western countries with others. Additionally, we find a strong anchoring effect when the model recognizes or is made aware of the user's nativeness, which further degrades the response quality when interacting with non-native speakers. Our analysis is based on a newly collected dataset with over 12,000 unique annotations from 124 annotators, including information on their native language and English proficiency.
☆ A Text is Worth Several Tokens: Text Embedding from LLMs Secretly Aligns Well with The Key Tokens
Text embeddings from large language models (LLMs) have achieved excellent results in tasks such as information retrieval, semantic textual similarity, etc. In this work, we show an interesting finding: when feeding a text into the embedding LLMs, the obtained text embedding will be able to be aligned with the key tokens in the input text. We first fully analyze this phenomenon on eight embedding LLMs and show that this phenomenon is universal and is not affected by model architecture, training strategy, and embedding method. With a deeper analysis, we then find that the main change in embedding space between the embedding LLMs and their original generative LLMs is in the first principal component. By adjusting the first principal component, we can align text embedding with the key tokens. Finally, we give several examples to demonstrate the vast application potential of this finding: (1) we propose a simple and practical sparse retrieval method based on the aligned tokens, which can achieve 80\% of the dense retrieval effect of the same model while reducing the computation significantly; (2) we show that our findings provide a fresh perspective to help understand fuzzy concepts (e.g., semantic relatedness vs. semantic similarity) and emerging technologies (e.g., instruction-following embedding) in this field.
comment: Work in Progress
☆ A Three-Pronged Approach to Cross-Lingual Adaptation with Multilingual LLMs
Low-resource languages, by its very definition, tend to be under represented in the pre-training corpora of Large Language Models. In this work, we investigate three low-resource cross-lingual approaches that enable an LLM adapt to tasks in previously unseen languages. Llama-2 is an LLM where Indic languages, among many other language families, contribute to less than $0.005\%$ of the total $2$ trillion token pre-training corpora. In this work, we experiment with the English-dominated Llama-2 for cross-lingual transfer to three Indic languages, Bengali, Hindi, and Tamil as target languages. We study three approaches for cross-lingual transfer, under ICL and fine-tuning. One, we find that adding additional supervisory signals via a dominant language in the LLM, leads to improvements, both under in-context learning and fine-tuning. Two, adapting the target languages to word reordering may be beneficial under ICL, but its impact diminishes with fine tuning. Finally, continued pre-training in one low-resource language can improve model performance for other related low-resource languages.
☆ An Empirical Study on the Characteristics of Bias upon Context Length Variation for Bangla ACL
Pretrained language models inherently exhibit various social biases, prompting a crucial examination of their social impact across various linguistic contexts due to their widespread usage. Previous studies have provided numerous methods for intrinsic bias measurements, predominantly focused on high-resource languages. In this work, we aim to extend these investigations to Bangla, a low-resource language. Specifically, in this study, we (1) create a dataset for intrinsic gender bias measurement in Bangla, (2) discuss necessary adaptations to apply existing bias measurement methods for Bangla, and (3) examine the impact of context length variation on bias measurement, a factor that has been overlooked in previous studies. Through our experiments, we demonstrate a clear dependency of bias metrics on context length, highlighting the need for nuanced considerations in Bangla bias analysis. We consider our work as a stepping stone for bias measurement in the Bangla Language and make all of our resources publicly available to support future research.
comment: Accepted in Findings of ACL, 2024
☆ Leveraging Synthetic Audio Data for End-to-End Low-Resource Speech Translation
This paper describes our system submission to the International Conference on Spoken Language Translation (IWSLT 2024) for Irish-to-English speech translation. We built end-to-end systems based on Whisper, and employed a number of data augmentation techniques, such as speech back-translation and noise augmentation. We investigate the effect of using synthetic audio data and discuss several methods for enriching signal diversity.
comment: IWSLT 2024
☆ Dual-Space Knowledge Distillation for Large Language Models
Knowledge distillation (KD) is known as a promising solution to compress large language models (LLMs) via transferring their knowledge to smaller models. During this process, white-box KD methods usually minimize the distance between the output distributions of the two models so that more knowledge can be transferred. However, in the current white-box KD framework, the output distributions are from the respective output spaces of the two models, using their own prediction heads. We argue that the space discrepancy will lead to low similarity between the teacher model and the student model on both representation and distribution levels. Furthermore, this discrepancy also hinders the KD process between models with different vocabularies, which is common for current LLMs. To address these issues, we propose a dual-space knowledge distillation (DSKD) framework that unifies the output spaces of the two models for KD. On the basis of DSKD, we further develop a cross-model attention mechanism, which can automatically align the representations of the two models with different vocabularies. Thus, our framework is not only compatible with various distance functions for KD (e.g., KL divergence) like the current framework, but also supports KD between any two LLMs regardless of their vocabularies. Experiments on task-agnostic instruction-following benchmarks show that DSKD significantly outperforms the current white-box KD framework with various distance functions, and also surpasses existing KD methods for LLMs with different vocabularies.
comment: 17 pages, 11 figures, code available at: https://github.com/songmzhang/DSKD
☆ Delving into the Utilisation of ChatGPT in Scientific Publications in Astronomy SP
Rapid progress in the capabilities of machine learning approaches in natural language processing has culminated in the rise of large language models over the last two years. Recent works have shown unprecedented adoption of these for academic writing, especially in some fields, but their pervasiveness in astronomy has not been studied sufficiently. To remedy this, we extract words that ChatGPT uses more often than humans when generating academic text and search a total of 1 million articles for them. This way, we assess the frequency of word occurrence in published works in astronomy tracked by the NASA Astrophysics Data System since 2000. We then perform a statistical analysis of the occurrences. We identify a list of words favoured by ChatGPT and find a statistically significant increase for these words against a control group in 2024, which matches the trend in other disciplines. These results suggest a widespread adoption of these models in the writing of astronomy papers. We encourage organisations, publishers, and researchers to work together to identify ethical and pragmatic guidelines to maximise the benefits of these systems while maintaining scientific rigour.
comment: Submitted to SPAICE
☆ Not All Preference Pairs Are Created Equal: A Recipe for Annotation-Efficient Iterative Preference Learning
Iterative preference learning, though yielding superior performances, requires online annotated preference labels. In this work, we study strategies to select worth-annotating response pairs for cost-efficient annotation while achieving competitive or even better performances compared with the random selection baseline for iterative preference learning. Built on assumptions regarding uncertainty and distribution shifts, we propose a comparative view to rank the implicit reward margins as predicted by DPO to select the response pairs that yield more benefits. Through extensive experiments, we show that annotating those response pairs with small margins is generally better than large or random, under both single- and multi-iteration scenarios. Besides, our empirical results suggest allocating more annotation budgets in the earlier iterations rather than later across multiple iterations.
☆ Retrieval Augmented Instruction Tuning for Open NER with Large Language Models
The strong capability of large language models (LLMs) has been applied to information extraction (IE) through either retrieval augmented prompting or instruction tuning (IT). However, the best way to incorporate information with LLMs for IE remains an open question. In this paper, we explore Retrieval Augmented Instruction Tuning (RA-IT) for IE, focusing on the task of open named entity recognition (NER). Specifically, for each training sample, we retrieve semantically similar examples from the training dataset as the context and prepend them to the input of the original instruction. To evaluate our RA-IT approach more thoroughly, we construct a Chinese IT dataset for open NER and evaluate RA-IT in both English and Chinese scenarios. Experimental results verify the effectiveness of RA-IT across various data sizes and in both English and Chinese scenarios. We also conduct thorough studies to explore the impacts of various retrieval strategies in the proposed RA-IT framework. Code and data are available at: https://github.com/Emma1066/Retrieval-Augmented-IT-OpenNER
☆ Leveraging LLMs for Dialogue Quality Measurement
In task-oriented conversational AI evaluation, unsupervised methods poorly correlate with human judgments, and supervised approaches lack generalization. Recent advances in large language models (LLMs) show robust zeroshot and few-shot capabilities across NLP tasks. This paper explores using LLMs for automated dialogue quality evaluation, experimenting with various configurations on public and proprietary datasets. Manipulating factors such as model size, in-context examples, and selection techniques, we examine "chain-of-thought" (CoT) reasoning and label extraction procedures. Our results show that (1) larger models yield more accurate dialogue labels; (2) algorithmic selection of in-context examples outperforms random selection; (3) CoT reasoning where an LLM is asked to provide justifications before outputting final labels improves performance; and (4) fine-tuned LLMs outperform out-of-the-box ones. Our results indicate that LLMs that are suitably fine-tuned and have sufficient reasoning capabilities can be leveraged for automated dialogue evaluation.
☆ CausalScore: An Automatic Reference-Free Metric for Assessing Response Relevance in Open-Domain Dialogue Systems
Automatically evaluating the quality of responses in open-domain dialogue systems is a challenging but crucial task. Current evaluation metrics often fail to align with human judgments, especially when assessing responses that are grammatically correct. To address this issue, we propose a novel metric, called CausalScore, which assesses the relevance of responses by measuring the causal strength between dialogue histories and responses. The causal strength is estimated by utilizing both unconditional dependence and conditional dependencies from the dialogue history to responses. We compare our metric with the existing competitive metrics in terms of their alignment with human judgements. Our experimental results demonstrate that CausalScore significantly surpasses existing state-of-the-art metrics by aligning better with human judgements. Additionally, we collect a new dialogue dataset CGDIALOG+ with human-annotated causal relations and a set of pairwise human judgements to facilitate the development of future automatic metrics.
☆ Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models
Large language models (LLMs) have demonstrated impressive reasoning capabilities, particularly in textual mathematical problem-solving. However, existing open-source image instruction fine-tuning datasets, containing limited question-answer pairs per image, do not fully exploit visual information to enhance the multimodal mathematical reasoning capabilities of Multimodal LLMs (MLLMs). To bridge this gap, we address the lack of high-quality, diverse multimodal mathematical datasets by collecting 40K high-quality images with question-answer pairs from 24 existing datasets and synthesizing 320K new pairs, creating the MathV360K dataset, which enhances both the breadth and depth of multimodal mathematical questions. We introduce Math-LLaVA, a LLaVA-1.5-based model fine-tuned with MathV360K. This novel approach significantly improves the multimodal mathematical reasoning capabilities of LLaVA-1.5, achieving a 19-point increase and comparable performance to GPT-4V on MathVista's minitest split. Furthermore, Math-LLaVA demonstrates enhanced generalizability, showing substantial improvements on the MMMU benchmark. Our research highlights the importance of dataset diversity and synthesis in advancing MLLMs' mathematical reasoning abilities. The code and data are available at: \url{https://github.com/HZQ950419/Math-LLaVA}.
comment: 8 pages
☆ Predicting the Big Five Personality Traits in Chinese Counselling Dialogues Using Large Language Models
Accurate assessment of personality traits is crucial for effective psycho-counseling, yet traditional methods like self-report questionnaires are time-consuming and biased. This study exams whether Large Language Models (LLMs) can predict the Big Five personality traits directly from counseling dialogues and introduces an innovative framework to perform the task. Our framework applies role-play and questionnaire-based prompting to condition LLMs on counseling sessions, simulating client responses to the Big Five Inventory. We evaluated our framework on 853 real-world counseling sessions, finding a significant correlation between LLM-predicted and actual Big Five traits, proving the validity of framework. Moreover, ablation studies highlight the importance of role-play simulations and task simplification via questionnaires in enhancing prediction accuracy. Meanwhile, our fine-tuned Llama3-8B model, utilizing Direct Preference Optimization with Supervised Fine-Tuning, achieves a 130.95\% improvement, surpassing the state-of-the-art Qwen1.5-110B by 36.94\% in personality prediction validity. In conclusion, LLMs can predict personality based on counseling dialogues. Our code and model are publicly available at \url{https://github.com/kuri-leo/BigFive-LLM-Predictor}, providing a valuable tool for future research in computational psychometrics.
☆ A Recursive Encoding for Cuneiform Signs
One of the most significant problems in cuneiform pedagogy is the process of looking up unknown signs, which often involves a tedious page-by-page search through a sign list. This paper proposes a new "recursive encoding" for signs, which represents the arrangement of strokes in a way a computer can process. A series of new algorithms then offers students a new way to look up signs by any distinctive component, as well as providing new ways to render signs and tablets electronically.
comment: 27 pages, 29 figures, 5 tables
BERT, Neural Information Retrieval, Boolean Retrieval, Negation Retrieval
We introduce SetBERT, a fine-tuned BERT-based model designed to enhance query embeddings for set operations and Boolean logic queries, such as Intersection (AND), Difference (NOT), and Union (OR). SetBERT significantly improves retrieval performance for logic-structured queries, an area where both traditional and neural retrieval methods typically underperform. We propose an innovative use of inversed-contrastive loss, focusing on identifying the negative sentence, and fine-tuning BERT with a dataset generated via prompt GPT. Furthermore, we demonstrate that, unlike other BERT-based models, fine-tuning with triplet loss actually degrades performance for this specific task. Our experiments reveal that SetBERT-base not only significantly outperforms BERT-base (up to a 63% improvement in Recall) but also achieves performance comparable to the much larger BERT-large model, despite being only one-third the size.
comment: 10 pages, 1 figure
☆ OPT-Tree: Speculative Decoding with Adaptive Draft Tree Structure
Autoregressive language models demonstrate excellent performance in various scenarios. However, the inference efficiency is limited by its one-step-one-word generation mode, which has become a pressing problem recently as the models become increasingly larger. Speculative decoding employs a "draft and then verify" mechanism to allow multiple tokens to be generated in one step, realizing lossless acceleration. Existing methods mainly adopt fixed heuristic draft structures, which fail to adapt to different situations to maximize the acceptance length during verification. To alleviate this dilemma, we proposed OPT-Tree, an algorithm to construct adaptive and scalable draft trees. It searches the optimal tree structure that maximizes the mathematical expectation of the acceptance length in each decoding step. Experimental results reveal that OPT-Tree outperforms the existing draft structures and achieves a speed-up ratio of up to 3.2 compared with autoregressive decoding. If the draft model is powerful enough and the node budget is sufficient, it can generate more than ten tokens in a single step. Our code is available at https://github.com/Jikai0Wang/OPT-Tree.
☆ Can We Trust the Performance Evaluation of Uncertainty Estimation Methods in Text Summarization?
Text summarization, a key natural language generation (NLG) task, is vital in various domains. However, the high cost of inaccurate summaries in risk-critical applications, particularly those involving human-in-the-loop decision-making, raises concerns about the reliability of uncertainty estimation on text summarization (UE-TS) evaluation methods. This concern stems from the dependency of uncertainty model metrics on diverse and potentially conflicting NLG metrics. To address this issue, we introduce a comprehensive UE-TS benchmark incorporating 31 NLG metrics across four dimensions. The benchmark evaluates the uncertainty estimation capabilities of two large language models and one pre-trained language model on three datasets, with human-annotation analysis incorporated where applicable. We also assess the performance of 14 common uncertainty estimation methods within this benchmark. Our findings emphasize the importance of considering multiple uncorrelated NLG metrics and diverse uncertainty estimation methods to ensure reliable and efficient evaluation of UE-TS techniques.
comment: 63 pages, 41 figures, 11 tables
☆ DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph
The current paradigm of evaluating Large Language Models (LLMs) through static benchmarks comes with significant limitations, such as vulnerability to data contamination and a lack of adaptability to the evolving capabilities of LLMs. Therefore, evaluation methods that can adapt and generate evaluation data with controlled complexity are urgently needed. In this work, we introduce Dynamic Evaluation of LLMs via Adaptive Reasoning Graph Evolvement (DARG) to dynamically extend current benchmarks with controlled complexity and diversity. Specifically, we first extract the reasoning graphs of data points in current benchmarks and then perturb the reasoning graphs to generate novel testing data. Such newly generated test samples can have different levels of complexity while maintaining linguistic diversity similar to the original benchmarks. We further use a code-augmented LLM to ensure the label correctness of newly generated data. We apply our DARG framework to diverse reasoning tasks in four domains with 15 state-of-the-art LLMs. Experimental results show that almost all LLMs experience a performance decrease with increased complexity and certain LLMs exhibit significant drops. Additionally, we find that LLMs exhibit more biases when being evaluated via the data generated by DARG with higher complexity levels. These observations provide useful insights into how to dynamically and adaptively evaluate LLMs. The code is available at https://github.com/SALT-NLP/DARG.
☆ AG-LSEC: Audio Grounded Lexical Speaker Error Correction INTERSPEECH 2024
Speaker Diarization (SD) systems are typically audio-based and operate independently of the ASR system in traditional speech transcription pipelines and can have speaker errors due to SD and/or ASR reconciliation, especially around speaker turns and regions of speech overlap. To reduce these errors, a Lexical Speaker Error Correction (LSEC), in which an external language model provides lexical information to correct the speaker errors, was recently proposed. Though the approach achieves good Word Diarization error rate (WDER) improvements, it does not use any additional acoustic information and is prone to miscorrections. In this paper, we propose to enhance and acoustically ground the LSEC system with speaker scores directly derived from the existing SD pipeline. This approach achieves significant relative WDER reductions in the range of 25-40% over the audio-based SD, ASR system and beats the LSEC system by 15-25% relative on RT03-CTS, Callhome American English and Fisher datasets.
comment: Accepted at INTERSPEECH 2024
☆ D2LLM: Decomposed and Distilled Large Language Models for Semantic Search
The key challenge in semantic search is to create models that are both accurate and efficient in pinpointing relevant sentences for queries. While BERT-style bi-encoders excel in efficiency with pre-computed embeddings, they often miss subtle nuances in search tasks. Conversely, GPT-style LLMs with cross-encoder designs capture these nuances but are computationally intensive, hindering real-time applications. In this paper, we present D2LLMs-Decomposed and Distilled LLMs for semantic search-that combines the best of both worlds. We decompose a cross-encoder into an efficient bi-encoder integrated with Pooling by Multihead Attention and an Interaction Emulation Module, achieving nuanced understanding and pre-computability. Knowledge from the LLM is distilled into this model using contrastive, rank, and feature imitation techniques. Our experiments show that D2LLM surpasses five leading baselines in terms of all metrics across three tasks, particularly improving NLI task performance by at least 6.45%. The source code is available at https://github.com/codefuse-ai/D2LLM.
☆ TRAWL: Tensor Reduced and Approximated Weights for Large Language Models EMNLP 2024
Large language models (LLMs) have fundamentally transformed artificial intelligence, catalyzing recent advancements while imposing substantial environmental and computational burdens. We introduce TRAWL (Tensor Reduced and Approximated Weights for Large Language Models), a novel methodology for optimizing LLMs through tensor decomposition. TRAWL leverages diverse strategies to exploit matrices within transformer-based architectures, realizing notable performance enhancements without necessitating retraining. The most significant improvements were observed through a layer-by-layer intervention strategy, particularly when applied to fully connected weights of the final layers, yielding up to 16% enhancement in accuracy without the need for additional data or fine-tuning. These results underscore the importance of targeted and adaptive techniques in increasing the efficiency and effectiveness of large language model optimization, thereby promoting the development of more sustainable and accessible AI systems.
comment: 8 pages, 5 figures. Submitted to EMNLP 2024 and under review
☆ Mitigating Hallucination in Fictional Character Role-Play
Role-playing has wide-ranging applications in customer support, embodied agents, computational social science, etc. The influence of parametric world knowledge of large language models (LLMs) often causes role-playing characters to act out of character and hallucinate about things outside the scope of their knowledge. In this work, we focus on the evaluation and mitigation of hallucination in fictional character role-play. We introduce a dataset with more than 2,000 characters and 72,000 interviews, including 18,000 adversarial questions. We propose RoleFact, a role-playing method that mitigates hallucination by modulating the influence of parametric knowledge using a pre-calibrated confidence threshold. Experiments show that the proposed method improves the factual precision of generated responses by 18% for adversarial questions with a 44% reduction in temporal hallucination for time-sensitive interviews. The code and the dataset will be available at https://github.com/NafisSadeq/rolefact.git.
☆ Leveraging Parameter-Efficient Transfer Learning for Multi-Lingual Text-to-Speech Adaptation
Different languages have distinct phonetic systems and vary in their prosodic features making it challenging to develop a Text-to-Speech (TTS) model that can effectively synthesise speech in multilingual settings. Furthermore, TTS architecture needs to be both efficient enough to capture nuances in multiple languages and efficient enough to be practical for deployment. The standard approach is to build transformer based model such as SpeechT5 and train it on large multilingual dataset. As the size of these models grow the conventional fine-tuning for adapting these model becomes impractical due to heavy computational cost. In this paper, we proposes to integrate parameter-efficient transfer learning (PETL) methods such as adapters and hypernetwork with TTS architecture for multilingual speech synthesis. Notably, in our experiments PETL methods able to achieve comparable or even better performance compared to full fine-tuning with only $\sim$2.5\% tunable parameters.The code and samples are available at: https://anonymous.4open.science/r/multilingualTTS-BA4C.
☆ MPCODER: Multi-user Personalized Code Generator with Explicit and Implicit Style Representation Learning ACL 2024
Large Language Models (LLMs) have demonstrated great potential for assisting developers in their daily development. However, most research focuses on generating correct code, how to use LLMs to generate personalized code has seldom been investigated. To bridge this gap, we proposed MPCoder (Multi-user Personalized Code Generator) to generate personalized code for multiple users. To better learn coding style features, we utilize explicit coding style residual learning to capture the syntax code style standards and implicit style learning to capture the semantic code style conventions. We train a multi-user style adapter to better differentiate the implicit feature representations of different users through contrastive learning, ultimately enabling personalized code generation for multiple users. We further propose a novel evaluation metric for estimating similarities between codes of different coding styles. The experimental results show the effectiveness of our approach for this novel task.
comment: Accepted by ACL 2024, Main Conference
☆ How Well Can Knowledge Edit Methods Edit Perplexing Knowledge?
As large language models (LLMs) are widely deployed, targeted editing of their knowledge has become a critical challenge. Recently, advancements in model editing techniques, such as Rank-One Model Editing (ROME), have paved the way for updating LLMs with new knowledge. However, the efficacy of these methods varies across different types of knowledge. This study investigates the capability of knowledge editing methods to incorporate new knowledge with varying degrees of "perplexingness", a term we use to describe the initial difficulty LLMs have in understanding new concepts. We begin by quantifying the "perplexingness" of target knowledge using pre-edit conditional probabilities, and assess the efficacy of edits through post-edit conditional probabilities. Utilizing the widely-used CounterFact dataset, we find significant negative correlations between the "perplexingness" of the new knowledge and the edit efficacy across all 12 scenarios. To dive deeper into this phenomenon, we introduce a novel dataset, HierarchyData, consisting of 99 hyponym-hypernym pairs across diverse categories. Our analysis reveal that more abstract concepts (hypernyms) tend to be more perplexing than their specific counterparts (hyponyms). Further exploration into the influence of knowledge hierarchy on editing outcomes indicates that knowledge positioned at higher hierarchical levels is more challenging to modify in some scenarios. Our research highlights a previously overlooked aspect of LLM editing: the variable efficacy of editing methods in handling perplexing knowledge. By revealing how hierarchical relationships can influence editing outcomes, our findings offer new insights into the challenges of updating LLMs and pave the way for more nuanced approaches to model editing in the future.
☆ Unlocking Continual Learning Abilities in Language Models
Language models (LMs) exhibit impressive performance and generalization capabilities. However, LMs struggle with the persistent challenge of catastrophic forgetting, which undermines their long-term sustainability in continual learning (CL). Existing approaches usually address the issue by incorporating old task data or task-wise inductive bias into LMs. However, old data and accurate task information are often unavailable or costly to collect, hindering the availability of current CL approaches for LMs. To address this limitation, we introduce $\textbf{MIGU}$ ($\textbf{M}$agn$\textbf{I}$tude-based $\textbf{G}$radient $\textbf{U}$pdating for continual learning), a rehearsal-free and task-label-free method that only updates the model parameters with large magnitudes of output in LMs' linear layers. MIGU is based on our observation that the L1-normalized magnitude distribution of the output in LMs' linear layers is different when the LM models deal with different task data. By imposing this simple constraint on the gradient update process, we can leverage the inherent behaviors of LMs, thereby unlocking their innate CL abilities. Our experiments demonstrate that MIGU is universally applicable to all three LM architectures (T5, RoBERTa, and Llama2), delivering state-of-the-art or on-par performance across continual finetuning and continual pre-training settings on four CL benchmarks. For example, MIGU brings a 15.2% average accuracy improvement over conventional parameter-efficient finetuning baselines in a 15-task CL benchmark. MIGU can also seamlessly integrate with all three existing CL types to further enhance performance. Code is available at \href{https://github.com/wenyudu/MIGU}{this https URL}.
comment: preprint, 19 pages
☆ What Do the Circuits Mean? A Knowledge Edit View
In the field of language model interpretability, circuit discovery is gaining popularity. Despite this, the true meaning of these circuits remain largely unanswered. We introduce a novel method to learn their meanings as a holistic object through the lens of knowledge editing. We extract circuits in the GPT2-XL model using diverse text classification datasets, and use hierarchical relations datasets to explore knowledge editing in the circuits. Our findings indicate that these circuits contain entity knowledge but resist new knowledge more than complementary circuits during knowledge editing. Additionally, we examine the impact of circuit size, discovering that an ideal "theoretical circuit" where essential knowledge is concentrated likely incorporates more than 5% but less than 50% of the model's parameters. We also assess the overlap between circuits from different datasets, finding moderate similarities. What constitutes these circuits, then? We find that up to 60% of the circuits consist of layer normalization modules rather than attention or MLP modules, adding evidence to the ongoing debates regarding knowledge localization. In summary, our findings offer new insights into the functions of the circuits, and introduce research directions for further interpretability and safety research of language models.
☆ Self-Constructed Context Decompilation with Fined-grained Alignment Enhancement
Decompilation transforms compiled code back into a high-level programming language for analysis when source code is unavailable. Previous work has primarily focused on enhancing decompilation performance by increasing the scale of model parameters or training data for pre-training. Based on the characteristics of the decompilation task, we propose two methods: (1) Without fine-tuning, the Self-Constructed Context Decompilation (sc$^2$dec) method recompiles the LLM's decompilation results to construct pairs for in-context learning, helping the model improve decompilation performance. (2) Fine-grained Alignment Enhancement (FAE), which meticulously aligns assembly code with source code at the statement level by leveraging debugging information, is employed during the fine-tuning phase to achieve further improvements in decompilation. By integrating these two methods, we achieved a Re-Executability performance improvement of approximately 7.35\% on the Decompile-Eval benchmark, establishing a new state-of-the-art performance of 55.03\%.
comment: Under Review
☆ Beyond Demographics: Aligning Role-playing LLM-based Agents Using Human Belief Networks
Creating human-like large language model (LLM) agents is crucial for faithful social simulation. Having LLMs role-play based on demographic information sometimes improves human likeness but often does not. This study assessed whether LLM alignment with human behavior can be improved by integrating information from empirically-derived human belief networks. Using data from a human survey, we estimated a belief network encompassing 18 topics loading on two non-overlapping latent factors. We then seeded LLM-based agents with an opinion on one topic, and assessed the alignment of its expressed opinions on remaining test topics with corresponding human data. Role-playing based on demographic information alone did not align LLM and human opinions, but seeding the agent with a single belief greatly improved alignment for topics related in the belief network, and not for topics outside the network. These results suggest a novel path for human-LLM belief alignment in work seeking to simulate and understand patterns of belief distributions in society.
☆ CogMG: Collaborative Augmentation Between Large Language Model and Knowledge Graph
Large language models have become integral to question-answering applications despite their propensity for generating hallucinations and factually inaccurate content. Querying knowledge graphs to reduce hallucinations in LLM meets the challenge of incomplete knowledge coverage in knowledge graphs. On the other hand, updating knowledge graphs by information extraction and knowledge graph completion faces the knowledge update misalignment issue. In this work, we introduce a collaborative augmentation framework, CogMG, leveraging knowledge graphs to address the limitations of LLMs in QA scenarios, explicitly targeting the problems of incomplete knowledge coverage and knowledge update misalignment. The LLMs identify and decompose required knowledge triples that are not present in the KG, enriching them and aligning updates with real-world demands. We demonstrate the efficacy of this approach through a supervised fine-tuned LLM within an agent framework, showing significant improvements in reducing hallucinations and enhancing factual accuracy in QA responses. Our code and video are publicly available.
☆ Large Language Models are Interpretable Learners
The trade-off between expressiveness and interpretability remains a core challenge when building human-centric predictive models for classification and decision-making. While symbolic rules offer interpretability, they often lack expressiveness, whereas neural networks excel in performance but are known for being black boxes. In this paper, we show a combination of Large Language Models (LLMs) and symbolic programs can bridge this gap. In the proposed LLM-based Symbolic Programs (LSPs), the pretrained LLM with natural language prompts provides a massive set of interpretable modules that can transform raw input into natural language concepts. Symbolic programs then integrate these modules into an interpretable decision rule. To train LSPs, we develop a divide-and-conquer approach to incrementally build the program from scratch, where the learning process of each step is guided by LLMs. To evaluate the effectiveness of LSPs in extracting interpretable and accurate knowledge from data, we introduce IL-Bench, a collection of diverse tasks, including both synthetic and real-world scenarios across different modalities. Empirical results demonstrate LSP's superior performance compared to traditional neurosymbolic programs and vanilla automatic prompt tuning methods. Moreover, as the knowledge learned by LSP is a combination of natural language descriptions and symbolic rules, it is easily transferable to humans (interpretable), and other LLMs, and generalizes well to out-of-distribution samples.
comment: Preliminary Version, Code at [this url](https://github.com/ruocwang/llm-symbolic-program)
☆ Detecting Frames in News Headlines and Lead Images in U.S. Gun Violence Coverage EMNLP 2021
News media structure their reporting of events or issues using certain perspectives. When describing an incident involving gun violence, for example, some journalists may focus on mental health or gun regulation, while others may emphasize the discussion of gun rights. Such perspectives are called \say{frames} in communication research. We study, for the first time, the value of combining lead images and their contextual information with text to identify the frame of a given news article. We observe that using multiple modes of information(article- and image-derived features) improves prediction of news frames over any single mode of information when the images are relevant to the frames of the headlines. We also observe that frame image relevance is related to the ease of conveying frames via images, which we call frame concreteness. Additionally, we release the first multimodal news framing dataset related to gun violence in the U.S., curated and annotated by communication researchers. The dataset will allow researchers to further examine the use of multiple information modalities for studying media framing.
comment: published at Findings of the Association for Computational Linguistics: EMNLP 2021
☆ EDEN: Empathetic Dialogues for English learning
Dialogue systems have been used as conversation partners in English learning, but few have studied whether these systems improve learning outcomes. Student passion and perseverance, or grit, has been associated with language learning success. Recent work establishes that as students perceive their English teachers to be more supportive, their grit improves. Hypothesizing that the same pattern applies to English-teaching chatbots, we create EDEN, a robust open-domain chatbot for spoken conversation practice that provides empathetic feedback. To construct EDEN, we first train a specialized spoken utterance grammar correction model and a high-quality social chit-chat conversation model. We then conduct a preliminary user study with a variety of strategies for empathetic feedback. Our experiment suggests that using adaptive empathetic feedback leads to higher perceived affective support, which, in turn, predicts increased student grit.
☆ Inherent Challenges of Post-Hoc Membership Inference for Large Language Models
Large Language Models (LLMs) are often trained on vast amounts of undisclosed data, motivating the development of post-hoc Membership Inference Attacks (MIAs) to gain insight into their training data composition. However, in this paper, we identify inherent challenges in post-hoc MIA evaluation due to potential distribution shifts between collected member and non-member datasets. Using a simple bag-of-words classifier, we demonstrate that datasets used in recent post-hoc MIAs suffer from significant distribution shifts, in some cases achieving near-perfect distinction between members and non-members. This implies that previously reported high MIA performance may be largely attributable to these shifts rather than model memorization. We confirm that randomized, controlled setups eliminate such shifts and thus enable the development and fair evaluation of new MIAs. However, we note that such randomized setups are rarely available for the latest LLMs, making post-hoc data collection still required to infer membership for real-world LLMs. As a potential solution, we propose a Regression Discontinuity Design (RDD) approach for post-hoc data collection, which substantially mitigates distribution shifts. Evaluating various MIA methods on this RDD setup yields performance barely above random guessing, in stark contrast to previously reported results. Overall, our findings highlight the challenges in accurately measuring LLM memorization and the need for careful experimental design in (post-hoc) membership inference tasks.
☆ Evaluating Fairness in Large Vision-Language Models Across Diverse Demographic Attributes and Prompts
Large vision-language models (LVLMs) have recently achieved significant progress, demonstrating strong capabilities in open-world visual understanding. However, it is not yet clear how LVLMs address demographic biases in real life, especially the disparities across attributes such as gender, skin tone, and age. In this paper, we empirically investigate \emph{visual fairness} in several mainstream LVLMs and audit their performance disparities across sensitive demographic attributes, based on public fairness benchmark datasets (e.g., FACET). To disclose the visual bias in LVLMs, we design a fairness evaluation framework with direct questions and single-choice question-instructed prompts on visual question-answering/classification tasks. The zero-shot prompting results indicate that, despite enhancements in visual understanding, both open-source and closed-source LVLMs exhibit prevalent fairness issues across different instruct prompts and demographic attributes.
☆ LABOR-LLM: Language-Based Occupational Representations with Large Language Models
Many empirical studies of labor market questions rely on estimating relatively simple predictive models using small, carefully constructed longitudinal survey datasets based on hand-engineered features. Large Language Models (LLMs), trained on massive datasets, encode vast quantities of world knowledge and can be used for the next job prediction problem. However, while an off-the-shelf LLM produces plausible career trajectories when prompted, the probability with which an LLM predicts a particular job transition conditional on career history will not, in general, align with the true conditional probability in a given population. Recently, Vafa et al. (2024) introduced a transformer-based "foundation model", CAREER, trained using a large, unrepresentative resume dataset, that predicts transitions between jobs; it further demonstrated how transfer learning techniques can be used to leverage the foundation model to build better predictive models of both transitions and wages that reflect conditional transition probabilities found in nationally representative survey datasets. This paper considers an alternative where the fine-tuning of the CAREER foundation model is replaced by fine-tuning LLMs. For the task of next job prediction, we demonstrate that models trained with our approach outperform several alternatives in terms of predictive performance on the survey data, including traditional econometric models, CAREER, and LLMs with in-context learning, even though the LLM can in principle predict job titles that are not allowed in the survey data. Further, we show that our fine-tuned LLM-based models' predictions are more representative of the career trajectories of various workforce subpopulations than off-the-shelf LLM models and CAREER. We conduct experiments and analyses that highlight the sources of the gains in the performance of our models for representative predictions.
☆ Encourage or Inhibit Monosemanticity? Revisit Monosemanticity from a Feature Decorrelation Perspective
To better interpret the intrinsic mechanism of large language models (LLMs), recent studies focus on monosemanticity on its basic units. A monosemantic neuron is dedicated to a single and specific concept, which forms a one-to-one correlation between neurons and concepts. Despite extensive research in monosemanticity probing, it remains unclear whether monosemanticity is beneficial or harmful to model capacity. To explore this question, we revisit monosemanticity from the feature decorrelation perspective and advocate for its encouragement. We experimentally observe that the current conclusion by wang2024learning, which suggests that decreasing monosemanticity enhances model performance, does not hold when the model changes. Instead, we demonstrate that monosemanticity consistently exhibits a positive correlation with model capacity, in the preference alignment process. Consequently, we apply feature correlation as a proxy for monosemanticity and incorporate a feature decorrelation regularizer into the dynamic preference optimization process. The experiments show that our method not only enhances representation diversity and activation sparsity but also improves preference alignment performance.
☆ Unmasking the Imposters: In-Domain Detection of Human vs. Machine-Generated Tweets
The rapid development of large language models (LLMs) has significantly improved the generation of fluent and convincing text, raising concerns about their misuse on social media platforms. We present a methodology using Twitter datasets to examine the generative capabilities of four LLMs: Llama 3, Mistral, Qwen2, and GPT4o. We evaluate 7B and 8B parameter base-instruction models of the three open-source LLMs and validate the impact of further fine-tuning and "uncensored" versions. Our findings show that "uncensored" models with additional in-domain fine-tuning dramatically reduce the effectiveness of automated detection methods. This study addresses a gap by exploring smaller open-source models and the effects of "uncensoring," providing insights into how fine-tuning and content moderation influence machine-generated text detection.
☆ SimsChat: A Customisable Persona-Driven Role-Playing Agent
Large Language Models (LLMs) possess the remarkable capability to understand human instructions and generate high-quality text, enabling them to act as agents that simulate human behaviours. This capability allows LLMs to emulate human beings in a more advanced manner, beyond merely replicating simple human behaviours. However, there is a lack of exploring into leveraging LLMs to craft characters from several aspects. In this work, we introduce the Customisable Conversation Agent Framework, which employs LLMs to simulate real-world characters that can be freely customised according to different user preferences. The customisable framework is helpful for designing customisable characters and role-playing agents according to human's preferences. We first propose the SimsConv dataset, which comprises 68 different customised characters, 1,360 multi-turn role-playing dialogues, and encompasses 13,971 interaction dialogues in total. The characters are created from several real-world elements, such as career, aspiration, trait, and skill. Building on these foundations, we present SimsChat, a freely customisable role-playing agent. It incorporates different real-world scenes and topic-specific character interaction dialogues, simulating characters' life experiences in various scenarios and topic-specific interactions with specific emotions. Experimental results show that our proposed framework achieves desirable performance and provides helpful guideline for building better simulacra of human beings in the future. Our data and code are available at https://github.com/Bernard-Yang/SimsChat.
☆ NormTab: Improving Symbolic Reasoning in LLMs Through Tabular Data Normalization
In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities in parsing textual data and generating code. However, their performance in tasks involving tabular data, especially those requiring symbolic reasoning, faces challenges due to the structural variance and inconsistency in table cell values often found in web tables. In this paper, we introduce NormTab, a novel framework aimed at enhancing the symbolic reasoning performance of LLMs by normalizing web tables. We study table normalization as a stand-alone, one-time preprocessing step using LLMs to support symbolic reasoning on tabular data. Our experimental evaluation, conducted on challenging web table datasets such as WikiTableQuestion and TabFact, demonstrates that leveraging NormTab significantly improves symbolic reasoning performance, showcasing the importance and effectiveness of web table normalization for enhancing LLM-based symbolic reasoning tasks.
comment: Work in Progress
☆ Do they mean 'us'? Interpreting Referring Expressions in Intergroup Bias
The variations between in-group and out-group speech (intergroup bias) are subtle and could underlie many social phenomena like stereotype perpetuation and implicit bias. In this paper, we model the intergroup bias as a tagging task on English sports comments from forums dedicated to fandom for NFL teams. We curate a unique dataset of over 6 million game-time comments from opposing perspectives (the teams in the game), each comment grounded in a non-linguistic description of the events that precipitated these comments (live win probabilities for each team). Expert and crowd annotations justify modeling the bias through tagging of implicit and explicit referring expressions and reveal the rich, contextual understanding of language and the world required for this task. For large-scale analysis of intergroup variation, we use LLMs for automated tagging, and discover that some LLMs perform best when prompted with linguistic descriptions of the win probability at the time of the comment, rather than numerical probability. Further, large-scale tagging of comments using LLMs uncovers linear variations in the form of referent across win probabilities that distinguish in-group and out-group utterances. Code and data are available at https://github.com/venkatasg/intergroup-nfl .
☆ Sequential Editing for Lifelong Training of Speech Recognition Models INTERSPEECH 2024
Automatic Speech Recognition (ASR) traditionally assumes known domains, but adding data from a new domain raises concerns about computational inefficiencies linked to retraining models on both existing and new domains. Fine-tuning solely on new domain risks Catastrophic Forgetting (CF). To address this, Lifelong Learning (LLL) algorithms have been proposed for ASR. Prior research has explored techniques such as Elastic Weight Consolidation, Knowledge Distillation, and Replay, all of which necessitate either additional parameters or access to prior domain data. We propose Sequential Model Editing as a novel method to continually learn new domains in ASR systems. Different than previous methods, our approach does not necessitate access to prior datasets or the introduction of extra parameters. Our study demonstrates up to 15% Word Error Rate Reduction (WERR) over fine-tuning baseline, and superior efficiency over other LLL techniques on CommonVoice English multi-accent dataset.
comment: INTERSPEECH 2024
☆ FASA: a Flexible and Automatic Speech Aligner for Extracting High-quality Aligned Children Speech Data
Automatic Speech Recognition (ASR) for adults' speeches has made significant progress by employing deep neural network (DNN) models recently, but improvement in children's speech is still unsatisfactory due to children's speech's distinct characteristics. DNN models pre-trained on adult data often struggle in generalizing children's speeches with fine tuning because of the lack of high-quality aligned children's speeches. When generating datasets, human annotations are not scalable, and existing forced-alignment tools are not usable as they make impractical assumptions about the quality of the input transcriptions. To address these challenges, we propose a new forced-alignment tool, FASA, as a flexible and automatic speech aligner to extract high-quality aligned children's speech data from many of the existing noisy children's speech data. We demonstrate its usage on the CHILDES dataset and show that FASA can improve data quality by 13.6$\times$ over human annotations.
comment: 4 pages, 1 figure
☆ PAFT: A Parallel Training Paradigm for Effective LLM Fine-Tuning
Large language models (LLMs) have shown remarkable abilities in diverse natural language processing (NLP) tasks. The LLMs generally undergo supervised fine-tuning (SFT) followed by preference alignment to be usable in downstream applications. However, this sequential training pipeline leads to alignment tax that degrades the LLM performance. This paper introduces PAFT, a new PArallel training paradigm for effective LLM Fine-Tuning, which independently performs SFT and preference alignment (e.g., DPO and ORPO, etc.) with the same pre-trained model on respective datasets. The model produced by SFT and the model from preference alignment are then merged into a final model by parameter fusing for use in downstream applications. This work reveals important findings that preference alignment like DPO naturally results in a sparse model while SFT leads to a natural dense model which needs to be sparsified for effective model merging. This paper introduces an effective interference resolution which reduces the redundancy by sparsifying the delta parameters. The LLM resulted from the new training paradigm achieved Rank #1 on the HuggingFace Open LLM Leaderboard. Comprehensive evaluation shows the effectiveness of the parallel training paradigm.
☆ X-ray Made Simple: Radiology Report Generation and Evaluation with Layman's Terms
Radiology Report Generation (RRG) has achieved significant progress with the advancements of multimodal generative models. However, the evaluation in the domain suffers from a lack of fair and robust metrics. We reveal that, high performance on RRG with existing lexical-based metrics (e.g. BLEU) might be more of a mirage - a model can get a high BLEU only by learning the template of reports. This has become an urgent problem for RRG due to the highly patternized nature of these reports. In this work, we un-intuitively approach this problem by proposing the Layman's RRG framework, a layman's terms-based dataset, evaluation and training framework that systematically improves RRG with day-to-day language. We first contribute the translated Layman's terms dataset. Building upon the dataset, we then propose a semantics-based evaluation method, which is proved to mitigate the inflated numbers of BLEU and provides fairer evaluation. Last, we show that training on the layman's terms dataset encourages models to focus on the semantics of the reports, as opposed to overfitting to learning the report templates. We reveal a promising scaling law between the number of training examples and semantics gain provided by our dataset, compared to the inverse pattern brought by the original formats. Our code is available at \url{https://github.com/hegehongcha/LaymanRRG}.
☆ Mapping the Past: Geographically Linking an Early 20th Century Swedish Encyclopedia with Wikidata
In this paper, we describe the extraction of all the location entries from a prominent Swedish encyclopedia from the early 20th century, the \textit{Nordisk Familjebok} `Nordic Family Book.' We focused on the second edition called \textit{Uggleupplagan}, which comprises 38 volumes and over 182,000 articles. This makes it one of the most extensive Swedish encyclopedias. Using a classifier, we first determined the category of the entries. We found that approximately 22 percent of them were locations. We applied a named entity recognition to these entries and we linked them to Wikidata. Wikidata enabled us to extract their precise geographic locations resulting in almost 18,000 valid coordinates. We then analyzed the distribution of these locations and the entry selection process. It showed a higher density within Sweden, Germany, and the United Kingdom. The paper sheds light on the selection and representation of geographic information in the \textit{Nordisk Familjebok}, providing insights into historical and societal perspectives. It also paves the way for future investigations into entry selection in different time periods and comparative analyses among various encyclopedias.
comment: 9 pages, 3 figures
☆ Script-Agnostic Language Identification ACL
Language identification is used as the first step in many data collection and crawling efforts because it allows us to sort online text into language-specific buckets. However, many modern languages, such as Konkani, Kashmiri, Punjabi etc., are synchronically written in several scripts. Moreover, languages with different writing systems do not share significant lexical, semantic, and syntactic properties in neural representation spaces, which is a disadvantage for closely related languages and low-resource languages, especially those from the Indian Subcontinent. To counter this, we propose learning script-agnostic representations using several different experimental strategies (upscaling, flattening, and script mixing) focusing on four major Dravidian languages (Tamil, Telugu, Kannada, and Malayalam). We find that word-level script randomization and exposure to a language written in multiple scripts is extremely valuable for downstream script-agnostic language identification, while also maintaining competitive performance on naturally occurring text.
comment: Under Review in ACL Rolling Review
☆ CTBench: A Comprehensive Benchmark for Evaluating Language Model Capabilities in Clinical Trial Design
CTBench is introduced as a benchmark to assess language models (LMs) in aiding clinical study design. Given study-specific metadata, CTBench evaluates AI models' ability to determine the baseline features of a clinical trial (CT), which include demographic and relevant features collected at the trial's start from all participants. These baseline features, typically presented in CT publications (often as Table 1), are crucial for characterizing study cohorts and validating results. Baseline features, including confounders and covariates, are also necessary for accurate treatment effect estimation in studies involving observational data. CTBench consists of two datasets: "CT-Repo," containing baseline features from 1,690 clinical trials sourced from clinicaltrials.gov, and "CT-Pub," a subset of 100 trials with more comprehensive baseline features gathered from relevant publications. Two LM-based evaluation methods are developed to compare the actual baseline feature lists against LM-generated responses. "ListMatch-LM" and "ListMatch-BERT" use GPT-4o and BERT scores (at various thresholds), respectively, for evaluation. To establish baseline results, advanced prompt engineering techniques using LLaMa3-70B-Instruct and GPT-4o in zero-shot and three-shot learning settings are applied to generate potential baseline features. The performance of GPT-4o as an evaluator is validated through human-in-the-loop evaluations on the CT-Pub dataset, where clinical experts confirm matches between actual and LM-generated features. The results highlight a promising direction with significant potential for improvement, positioning CTBench as a useful tool for advancing research on AI in CT design and potentially enhancing the efficacy and robustness of CTs.
☆ ET tu, CLIP? Addressing Common Object Errors for Unseen Environments
We introduce a simple method that employs pre-trained CLIP encoders to enhance model generalization in the ALFRED task. In contrast to previous literature where CLIP replaces the visual encoder, we suggest using CLIP as an additional module through an auxiliary object detection objective. We validate our method on the recently proposed Episodic Transformer architecture and demonstrate that incorporating CLIP improves task performance on the unseen validation set. Additionally, our analysis results support that CLIP especially helps with leveraging object descriptions, detecting small objects, and interpreting rare words.
☆ Cloaked Classifiers: Pseudonymization Strategies on Sensitive Classification Tasks
Protecting privacy is essential when sharing data, particularly in the case of an online radicalization dataset that may contain personal information. In this paper, we explore the balance between preserving data usefulness and ensuring robust privacy safeguards, since regulations like the European GDPR shape how personal information must be handled. We share our method for manually pseudonymizing a multilingual radicalization dataset, ensuring performance comparable to the original data. Furthermore, we highlight the importance of establishing comprehensive guidelines for processing sensitive NLP data by sharing our complete pseudonymization process, our guidelines, the challenges we encountered as well as the resulting dataset.
comment: Proceedings of the fifth Workshop on Privacy in Natural Language Processing
♻ ☆ Discrete Multimodal Transformers with a Pretrained Large Language Model for Mixed-Supervision Speech Processing
Recent work on discrete speech tokenization has paved the way for models that can seamlessly perform multiple tasks across modalities, e.g., speech recognition, text to speech, speech to speech translation. Moreover, large language models (LLMs) pretrained from vast text corpora contain rich linguistic information that can improve accuracy in a variety of tasks. In this paper, we present a decoder-only Discrete Multimodal Language Model (DMLM), which can be flexibly applied to multiple tasks (ASR, T2S, S2TT, etc.) and modalities (text, speech, vision). We explore several critical aspects of discrete multi-modal models, including the loss function, weight initialization, mixed training supervision, and codebook. Our results show that DMLM benefits significantly, across multiple tasks and datasets, from a combination of supervised and unsupervised training. Moreover, for ASR, it benefits from initializing DMLM from a pretrained LLM, and from a codebook derived from Whisper activations.
♻ ☆ Large Language Models in Healthcare: A Comprehensive Benchmark
The adoption of large language models (LLMs) to assist clinicians has attracted remarkable attention. Existing works mainly adopt the close-ended question-answering (QA) task with answer options for evaluation. However, many clinical decisions involve answering open-ended questions without pre-set options. To better understand LLMs in the clinic, we construct a benchmark ClinicBench. We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks. Furthermore, we construct six novel datasets and complex clinical tasks that are close to real-world practice, i.e., referral QA, treatment recommendation, hospitalization (long document) summarization, patient education, pharmacology QA and drug interaction for emerging drugs. We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings. Finally, we invite medical experts to evaluate the clinical usefulness of LLMs.
♻ ☆ A Data-Centric Approach To Generate Faithful and High Quality Patient Summaries with Large Language Models
Patients often face difficulties in understanding their hospitalizations, while healthcare workers have limited resources to provide explanations. In this work, we investigate the potential of large language models to generate patient summaries based on doctors' notes and study the effect of training data on the faithfulness and quality of the generated summaries. To this end, we release (i) a rigorous labeling protocol for errors in medical texts and (ii) a publicly available dataset of annotated hallucinations in 100 doctor-written and 100 generated summaries. We show that fine-tuning on hallucination-free data effectively reduces hallucinations from 2.60 to 1.55 per summary for Llama 2, while preserving relevant information. We observe a similar effect on GPT-4 (0.70 to 0.40), when the few-shot examples are hallucination-free. We also conduct a qualitative evaluation using hallucination-free and improved training data. We find that common quantitative metrics do not correlate well with faithfulness and quality. Finally, we test GPT-4 for automatic hallucination detection, which clearly outperforms common baselines.
♻ ☆ Reward Steering with Evolutionary Heuristics for Decoding-time Alignment
The widespread applicability and increasing omnipresence of LLMs have instigated a need to align LLM responses to user and stakeholder preferences. Many preference optimization approaches have been proposed that fine-tune LLM parameters to achieve good alignment. However, such parameter tuning is known to interfere with model performance on many tasks. Moreover, keeping up with shifting user preferences is tricky in such a situation. Decoding-time alignment with reward model guidance solves these issues at the cost of increased inference time. However, most of such methods fail to strike the right balance between exploration and exploitation of reward -- often due to the conflated formulation of these two aspects - to give well-aligned responses. To remedy this we decouple these two aspects and implement them in an evolutionary fashion: exploration is enforced by decoding from mutated instructions and exploitation is represented as the periodic replacement of poorly-rewarded generations with well-rewarded ones. Empirical evidences indicate that this strategy outperforms many preference optimization and decode-time alignment approaches on two widely accepted alignment benchmarks AlpacaEval 2 and MT-Bench. Our implementation will be available at: https://darwin-alignment.github.io.
♻ ☆ An Embedded Diachronic Sense Change Model with a Case Study from Ancient Greek
Word meanings change over time, and word senses evolve, emerge or die out in the process. For ancient languages, where the corpora are often small and sparse, modelling such changes accurately proves challenging, and quantifying uncertainty in sense-change estimates consequently becomes important. GASC (Genre-Aware Semantic Change) and DiSC (Diachronic Sense Change) are existing generative models that have been used to analyse sense change for target words from an ancient Greek text corpus, using unsupervised learning without the help of any pre-training. These models represent the senses of a given target word such as "kosmos" (meaning decoration, order or world) as distributions over context words, and sense prevalence as a distribution over senses. The models are fitted using Markov Chain Monte Carlo (MCMC) methods to measure temporal changes in these representations. This paper introduces EDiSC, an Embedded DiSC model, which combines word embeddings with DiSC to provide superior model performance. It is shown empirically that EDiSC offers improved predictive accuracy, ground-truth recovery and uncertainty quantification, as well as better sampling efficiency and scalability properties with MCMC methods. The challenges of fitting these models are also discussed.
♻ ☆ Evaluating $n$-Gram Novelty of Language Models Using Rusty-DAWG
How novel are texts generated by language models (LMs) relative to their training corpora? In this work, we investigate the extent to which modern LMs generate $n$-grams from their training data, evaluating both (i) the probability LMs assign to complete training $n$-grams and (ii) $n$-novelty, the proportion of $n$-grams generated by an LM that did not appear in the training data (for arbitrarily large $n$). To enable arbitrary-length $n$-gram search over a corpus in constant time, we develop Rusty-DAWG, a novel search tool inspired by indexing of genomic data. We compare the novelty of LM-generated text to human-written text and explore factors that affect generation novelty, focusing on the Pythia models. We find that, for $n > 4$, LM-generated text is less novel than human-written text, though it is more novel for smaller $n$. Larger LMs and more constrained decoding strategies both decrease novelty. Finally, we show that LMs complete $n$-grams with lower loss if they are more frequent in the training data. Overall, our results reveal factors influencing the novelty of LM-generated text, and we release Rusty-DAWG to facilitate further pretraining data research.
comment: 8 page preprint + appendix. Minor fixes and appendix changes June 25, 2024
♻ ☆ Towards Zero-Shot Text-To-Speech for Arabic Dialects
Zero-shot multi-speaker text-to-speech (ZS-TTS) systems have advanced for English, however, it still lags behind due to insufficient resources. We address this gap for Arabic, a language of more than 450 million native speakers, by first adapting a sizeable existing dataset to suit the needs of speech synthesis. Additionally, we employ a set of Arabic dialect identification models to explore the impact of pre-defined dialect labels on improving the ZS-TTS model in a multi-dialect setting. Subsequently, we fine-tune the XTTS\footnote{https://docs.coqui.ai/en/latest/models/xtts.html}\footnote{https://medium.com/machine-learns/xtts-v2-new-version-of-the-open-source-text-to-speech-model-af73914db81f}\footnote{https://medium.com/@erogol/xtts-v1-techincal-notes-eb83ff05bdc} model, an open-source architecture. We then evaluate our models on a dataset comprising 31 unseen speakers and an in-house dialectal dataset. Our automated and human evaluation results show convincing performance while capable of generating dialectal speech. Our study highlights significant potential for improvements in this emerging area of research in Arabic.
♻ ☆ SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages
Question Answering (QA) datasets have been instrumental in developing and evaluating Large Language Model (LLM) capabilities. However, such datasets are scarce for languages other than English due to the cost and difficulties of collection and manual annotation. This means that producing novel models and measuring the performance of multilingual LLMs in low-resource languages is challenging. To mitigate this, we propose $\textbf{S}$yn$\textbf{DAR}$in, a method for generating and validating QA datasets for low-resource languages. We utilize parallel content mining to obtain $\textit{human-curated}$ paragraphs between English and the target language. We use the English data as context to $\textit{generate}$ synthetic multiple-choice (MC) question-answer pairs, which are automatically translated and further validated for quality. Combining these with their designated non-English $\textit{human-curated}$ paragraphs form the final QA dataset. The method allows to maintain the content quality, reduces the likelihood of factual errors, and circumvents the need for costly annotation. To test the method, we created a QA dataset with $1.2$K samples for the Armenian language. The human evaluation shows that $98\%$ of the generated English data maintains quality and diversity in the question types and topics, while the translation validation pipeline can filter out $\sim70\%$ of data with poor quality. We use the dataset to benchmark state-of-the-art LLMs, showing their inability to achieve human accuracy with some model performances closer to random chance. This shows that the generated dataset is non-trivial and can be used to evaluate reasoning capabilities in low-resource language.
♻ ☆ Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs
Existing methods for adapting large language models (LLMs) to new tasks are not suited to multi-task adaptation because they modify all the model weights -- causing destructive interference between tasks. The resulting effects, such as catastrophic forgetting of earlier tasks, make it challenging to obtain good performance on multiple tasks at the same time. To mitigate this, we propose Lottery Ticket Adaptation (LoTA), a sparse adaptation method that identifies and optimizes only a sparse subnetwork of the model. We evaluate LoTA on a wide range of challenging tasks such as instruction following, reasoning, math, and summarization. LoTA obtains better performance than full fine-tuning and low-rank adaptation (LoRA), and maintains good performance even after training on other tasks -- thus, avoiding catastrophic forgetting. By extracting and fine-tuning over lottery tickets (or sparse task vectors), LoTA also enables model merging over highly dissimilar tasks. Our code is made publicly available at https://github.com/kiddyboots216/lottery-ticket-adaptation.
♻ ☆ MedCalc-Bench: Evaluating Large Language Models for Medical Calculations
As opposed to evaluating computation and logic-based reasoning, current benchmarks for evaluating large language models (LLMs) in medicine are primarily focused on question-answering involving domain knowledge and descriptive reasoning. While such qualitative capabilities are vital to medical diagnosis, in real-world scenarios, doctors frequently use clinical calculators that follow quantitative equations and rule-based reasoning paradigms for evidence-based decision support. To this end, we propose MedCalc-Bench, a first-of-its-kind dataset focused on evaluating the medical calculation capability of LLMs. MedCalc-Bench contains an evaluation set of over 1000 manually reviewed instances from 55 different medical calculation tasks. Each instance in MedCalc-Bench consists of a patient note, a question requesting to compute a specific medical value, a ground truth answer, and a step-by-step explanation showing how the answer is obtained. While our evaluation results show the potential of LLMs in this area, none of them are effective enough for clinical settings. Common issues include extracting the incorrect entities, not using the correct equation or rules for a calculation task, or incorrectly performing the arithmetic for the computation. We hope our study highlights the quantitative knowledge and reasoning gaps in LLMs within medical settings, encouraging future improvements of LLMs for various clinical calculation tasks.
comment: Github link: https://github.com/ncbi-nlp/MedCalc-Bench HuggingFace link: https://huggingface.co/datasets/nsk7153/MedCalc-Bench
♻ ☆ Aligning Large Language Models by On-Policy Self-Judgment ACL 2024
Existing approaches for aligning large language models with human preferences face a trade-off that requires a separate reward model (RM) for on-policy learning. In this paper, we present a novel alignment framework, SELF-JUDGE that (1) does on-policy learning and 2) is parameter efficient, as it does not require an additional RM for evaluating the samples for on-policy learning. To this end, we propose Judge-augmented Supervised Fine-Tuning (JSFT) to train a single model to act as both a policy and a judge. Specifically, we view the pairwise judgment task, choosing the better response from a response pair, as a special case of the instruction-following task. The resulting model can judge preferences of on-the-fly responses from current policy initialized from itself. Experimental results show the efficacy of SELF-JUDGE, outperforming baselines in preference benchmarks. We also show that the rejecting sampling by itself can improve performance further without an additional evaluator.
comment: Published as a main conference paper at ACL 2024
♻ ☆ MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues ACL 2024
The advent of Large Language Models (LLMs) has drastically enhanced dialogue systems. However, comprehensively evaluating the dialogue abilities of LLMs remains a challenge. Previous benchmarks have primarily focused on single-turn dialogues or provided coarse-grained and incomplete assessments of multi-turn dialogues, overlooking the complexity and fine-grained nuances of real-life dialogues. To address this issue, we introduce MT-Bench-101, specifically designed to evaluate the fine-grained abilities of LLMs in multi-turn dialogues. By conducting a detailed analysis of real multi-turn dialogue data, we construct a three-tier hierarchical ability taxonomy comprising 4208 turns across 1388 multi-turn dialogues in 13 distinct tasks. We then evaluate 21 popular LLMs based on MT-Bench-101, conducting comprehensive analyses from both ability and task perspectives and observing differing trends in LLMs performance across dialogue turns within various tasks. Further analysis indicates that neither utilizing common alignment techniques nor chat-specific designs has led to obvious enhancements in the multi-turn abilities of LLMs. Extensive case studies suggest that our designed tasks accurately assess the corresponding multi-turn abilities. The data and code are available at \url{https://github.com/mtbench101/mt-bench-101}.
comment: [ACL 2024] The first three authors contribute equally, 34 pages, repo at https://github.com/mtbench101/mt-bench-101
♻ ☆ Representation Surgery: Theory and Practice of Affine Steering ICML 2024
Language models often exhibit undesirable behavior, e.g., generating toxic or gender-biased text. In the case of neural language models, an encoding of the undesirable behavior is often present in the model's representations. Thus, one natural (and common) approach to prevent the model from exhibiting undesirable behavior is to steer the model's representations in a manner that reduces the probability of it generating undesirable text. This paper investigates the formal and empirical properties of steering functions, i.e., transformation of the neural language model's representations that alter its behavior. First, we derive two optimal, in the least-squares sense, affine steering functions under different constraints. Our theory provides justification for existing approaches and offers a novel, improved steering approach. Second, we offer a series of experiments that demonstrate the empirical effectiveness of the methods in mitigating bias and reducing toxic generation.
comment: Accepted in ICML 2024
♻ ☆ Practical Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Calibration
Membership Inference Attacks (MIA) aim to infer whether a target data record has been utilized for model training or not. Prior attempts have quantified the privacy risks of language models (LMs) via MIAs, but there is still no consensus on whether existing MIA algorithms can cause remarkable privacy leakage on practical Large Language Models (LLMs). Existing MIAs designed for LMs can be classified into two categories: reference-free and reference-based attacks. They are both based on the hypothesis that training records consistently strike a higher probability of being sampled. Nevertheless, this hypothesis heavily relies on the overfitting of target models, which will be mitigated by multiple regularization methods and the generalization of LLMs. The reference-based attack seems to achieve promising effectiveness in LLMs, which measures a more reliable membership signal by comparing the probability discrepancy between the target model and the reference model. However, the performance of reference-based attack is highly dependent on a reference dataset that closely resembles the training dataset, which is usually inaccessible in the practical scenario. Overall, existing MIAs are unable to effectively unveil privacy leakage over practical fine-tuned LLMs that are overfitting-free and private. We propose a Membership Inference Attack based on Self-calibrated Probabilistic Variation (SPV-MIA). Specifically, since memorization in LLMs is inevitable during the training process and occurs before overfitting, we introduce a more reliable membership signal, probabilistic variation, which is based on memorization rather than overfitting. Furthermore, we introduce a self-prompt approach, which constructs the dataset to fine-tune the reference model by prompting the target LLM itself. In this manner, the adversary can collect a dataset with a similar distribution from public APIs.
comment: Repo: https://github.com/wjfu99/MIA-LLMs
♻ ☆ AudioBench: A Universal Benchmark for Audio Large Language Models
We introduce AudioBench, a new benchmark designed to evaluate audio large language models (AudioLLMs). AudioBench encompasses 8 distinct tasks and 26 carefully selected or newly curated datasets, focusing on speech understanding, voice interpretation, and audio scene understanding. Despite the rapid advancement of large language models, including multimodal versions, a significant gap exists in comprehensive benchmarks for thoroughly evaluating their capabilities. AudioBench addresses this gap by providing relevant datasets and evaluation metrics. In our study, we evaluated the capabilities of four models across various aspects and found that no single model excels consistently across all tasks. We outline the research outlook for AudioLLMs and anticipate that our open-source code, data, and leaderboard will offer a robust testbed for future model developments.
comment: 20 pages; v2 - typo update; Code: https://github.com/AudioLLMs/AudioBench
♻ ☆ High-Dimension Human Value Representation in Large Language Models
The widespread application of Large Language Models (LLMs) across various tasks and fields has necessitated the alignment of these models with human values and preferences. Given various approaches of human value alignment, ranging from Reinforcement Learning with Human Feedback (RLHF), to constitutional learning, etc. there is an urgent need to understand the scope and nature of human values injected into these models before their release. There is also a need for model alignment without a costly large scale human annotation effort. We propose UniVaR, a high-dimensional representation of human value distributions in LLMs, orthogonal to model architecture and training data. Trained from the value-relevant output of eight multilingual LLMs and tested on the output from four multilingual LLMs, namely LlaMA2, ChatGPT, JAIS and Yi, we show that UniVaR is a powerful tool to compare the distribution of human values embedded in different LLMs with different langauge sources. Through UniVaR, we explore how different LLMs prioritize various values in different languages and cultures, shedding light on the complex interplay between human values and language modeling.
♻ ☆ Embedding Ontologies via Incorporating Extensional and Intensional Knowledge
Ontologies contain rich knowledge within domain, which can be divided into two categories, namely extensional knowledge and intensional knowledge. Extensional knowledge provides information about the concrete instances that belong to specific concepts in the ontology, while intensional knowledge details inherent properties, characteristics, and semantic associations among concepts. However, existing ontology embedding approaches fail to take both extensional knowledge and intensional knowledge into fine consideration simultaneously. In this paper, we propose a novel ontology embedding approach named EIKE (Extensional and Intensional Knowledge Embedding) by representing ontologies in two spaces, called extensional space and intensional space. EIKE presents a unified framework for embedding instances, concepts and their relations in an ontology, applying a geometry-based method to model extensional knowledge and a pretrained language model to model intensional knowledge, which can capture both structure information and textual information. Experimental results show that EIKE significantly outperforms state-of-the-art methods in three datasets for both triple classification and link prediction, indicating that EIKE provides a more comprehensive and representative perspective of the domain.
♻ ☆ LLMs Are Few-Shot In-Context Low-Resource Language Learners
In-context learning (ICL) empowers large language models (LLMs) to perform diverse tasks in underrepresented languages using only short in-context information, offering a crucial avenue for narrowing the gap between high-resource and low-resource languages. Nonetheless, there is only a handful of works explored ICL for low-resource languages with most of them focusing on relatively high-resource languages, such as French and Spanish. In this work, we extensively study ICL and its cross-lingual variation (X-ICL) on 25 low-resource and 7 relatively higher-resource languages. Our study not only assesses the effectiveness of ICL with LLMs in low-resource languages but also identifies the shortcomings of in-context label alignment, and introduces a more effective alternative: query alignment. Moreover, we provide valuable insights into various facets of ICL for low-resource languages. Our study concludes the significance of few-shot in-context information on enhancing the low-resource understanding quality of LLMs through semantically relevant information by closing the language gap in the target language and aligning the semantics between the targeted low-resource and the high-resource language that the model is proficient in. Our work highlights the importance of advancing ICL research, particularly for low-resource languages. Our code is publicly released at https://github.com/SamuelCahyawijaya/in-context-alignment
♻ ☆ Embodied Question Answering via Multi-LLM Systems
Embodied Question Answering (EQA) is an important problem, which involves an agent exploring the environment to answer user queries. In the existing literature, EQA has exclusively been studied in single-agent scenarios, where exploration can be time-consuming and costly. In this work, we consider EQA in a multi-agent framework involving multiple large language models (LLM) based agents independently answering queries about a household environment. To generate one answer for each query, we use the individual responses to train a Central Answer Model (CAM) that aggregates responses for a robust answer. Using CAM, we observe a $50\%$ higher EQA accuracy when compared against aggregation methods for ensemble LLM, such as voting schemes and debates. CAM does not require any form of agent communication, alleviating it from the associated costs. We ablate CAM with various nonlinear (neural network, random forest, decision tree, XGBoost) and linear (logistic regression classifier, SVM) algorithms. Finally, we present a feature importance analysis for CAM via permutation feature importance (PFI), quantifying CAMs reliance on each independent agent and query context.
comment: 17 pages, 13 Figures, 4 Tables
♻ ☆ WRDScore: New Metric for Evaluation of Natural Language Generation Models
The problem of natural language generation, and, more specifically, method name prediction, faces significant difficulties when proposed models need to be evaluated on test data. Such a metric would need to consider the versatility with which a single method can be named, with respect to both semantics and syntax. Measuring the direct overlap between the predicted and reference (true) sequences will not be able to capture these subtleties. Other existing embedding based metrics either do not measure precision and recall or impose strict unrealistic assumptions on both sequences. To address these issues, we propose a new metric that, on the one hand, is very simple and lightweight, and, on the other hand, is able to calculate precision and recall without resorting to any assumptions while obtaining good performance with respect to the human judgement.
♻ ☆ DE-COP: Detecting Copyrighted Content in Language Models Training Data
How can we detect if copyrighted content was used in the training process of a language model, considering that the training data is typically undisclosed? We are motivated by the premise that a language model is likely to identify verbatim excerpts from its training text. We propose DE-COP, a method to determine whether a piece of copyrighted content was included in training. DE-COP's core approach is to probe an LLM with multiple-choice questions, whose options include both verbatim text and their paraphrases. We construct BookTection, a benchmark with excerpts from 165 books published prior and subsequent to a model's training cutoff, along with their paraphrases. Our experiments show that DE-COP surpasses the prior best method by 9.6% in detection performance (AUC) on models with logits available. Moreover, DE-COP also achieves an average accuracy of 72% for detecting suspect books on fully black-box models where prior methods give approximately 4% accuracy. The code and datasets are available at https://github.com/LeiLiLab/DE-COP.
♻ ☆ S$^3$HQA: A Three-Stage Approach for Multi-hop Text-Table Hybrid Question Answering ACL 2023
Answering multi-hop questions over hybrid factual knowledge from the given text and table (TextTableQA) is a challenging task. Existing models mainly adopt a retriever-reader framework, which have several deficiencies, such as noisy labeling in training retriever, insufficient utilization of heterogeneous information over text and table, and deficient ability for different reasoning operations. In this paper, we propose a three-stage TextTableQA framework S3HQA, which comprises of retriever, selector, and reasoner. We use a retriever with refinement training to solve the noisy labeling problem. Then, a hybrid selector considers the linked relationships between heterogeneous data to select the most relevant factual knowledge. For the final stage, instead of adapting a reading comprehension module like in previous methods, we employ a generation-based reasoner to obtain answers. This includes two approaches: a row-wise generator and an LLM prompting generator~(first time used in this task). The experimental results demonstrate that our method achieves competitive results in the few-shot setting. When trained on the full dataset, our approach outperforms all baseline methods, ranking first on the HybridQA leaderboard.
comment: ACL 2023
♻ ☆ Telecom Language Models: Must They Be Large?
The increasing interest in Large Language Models (LLMs) within the telecommunications sector underscores their potential to revolutionize operational efficiency. However, the deployment of these sophisticated models is often hampered by their substantial size and computational demands, raising concerns about their viability in resource-constrained environments. Addressing this challenge, recent advancements have seen the emergence of small language models that surprisingly exhibit performance comparable to their larger counterparts in many tasks, such as coding and common-sense reasoning. Phi-2, a compact yet powerful model, exemplifies this new wave of efficient small language models. This paper conducts a comprehensive evaluation of Phi-2's intrinsic understanding of the telecommunications domain. Recognizing the scale-related limitations, we enhance Phi-2's capabilities through a Retrieval-Augmented Generation approach, meticulously integrating an extensive knowledge base specifically curated with telecom standard specifications. The enhanced Phi-2 model demonstrates a profound improvement in accuracy, answering questions about telecom standards with a precision that closely rivals the more resource-intensive GPT-3.5. The paper further explores the refined capabilities of Phi-2 in addressing problem-solving scenarios within the telecom sector, highlighting its potential and limitations.
♻ ☆ R4: Reinforced Retriever-Reorder-Responder for Retrieval-Augmented Large Language Models
Retrieval-augmented large language models (LLMs) leverage relevant content retrieved by information retrieval systems to generate correct responses, aiming to alleviate the hallucination problem. However, existing retriever-responder methods typically append relevant documents to the prompt of LLMs to perform text generation tasks without considering the interaction of fine-grained structural semantics between the retrieved documents and the LLMs. This issue is particularly important for accurate response generation as LLMs tend to "lose in the middle" when dealing with input prompts augmented with lengthy documents. In this work, we propose a new pipeline named "Reinforced Retriever-Reorder-Responder" (R$^4$) to learn document orderings for retrieval-augmented LLMs, thereby further enhancing their generation abilities while the large numbers of parameters of LLMs remain frozen. The reordering learning process is divided into two steps according to the quality of the generated responses: document order adjustment and document representation enhancement. Specifically, document order adjustment aims to organize retrieved document orderings into beginning, middle, and end positions based on graph attention learning, which maximizes the reinforced reward of response quality. Document representation enhancement further refines the representations of retrieved documents for responses of poor quality via document-level gradient adversarial learning. Extensive experiments demonstrate that our proposed pipeline achieves better factual question-answering performance on knowledge-intensive tasks compared to strong baselines across various public datasets. The source codes and trained models will be released upon paper acceptance.
comment: need to further experiment
♻ ☆ Direct Multi-Turn Preference Optimization for Language Agents
Adapting Large Language Models (LLMs) for agent tasks is critical in developing language agents. Direct Preference Optimization (DPO) is a promising technique for this adaptation with the alleviation of compounding errors, offering a means to directly optimize Reinforcement Learning (RL) objectives. However, applying DPO to multi-turn tasks presents challenges due to the inability to cancel the partition function. Overcoming this obstacle involves making the partition function independent of the current state and addressing length disparities between preferred and dis-preferred trajectories. In this light, we replace the policy constraint with the state-action occupancy measure constraint in the RL objective and add length normalization to the Bradley-Terry model, yielding a novel loss function named DMPO for multi-turn agent tasks with theoretical explanations. Extensive experiments on three multi-turn agent task datasets confirm the effectiveness and superiority of the DMPO loss.
♻ ☆ Comprehensive Evaluation of Large Language Models for Topic Modeling
Recent work utilizes Large Language Models (LLMs) for topic modeling, generating comprehensible topic labels for given documents. However, their performance has mainly been evaluated qualitatively, and there remains room for quantitative investigation of their capabilities. In this paper, we quantitatively evaluate LLMs from multiple perspectives: the quality of topics, the impact of LLM-specific concerns, such as hallucination and shortcuts for limited documents, and LLMs' controllability of topic categories via prompts. Our findings show that LLMs can identify coherent and diverse topics with few hallucinations but may take shortcuts by focusing only on parts of documents. We also found that their controllability is limited.
♻ ☆ Harnessing Large Language Models as Post-hoc Correctors
As Machine Learning (ML) models grow in size and demand higher-quality training data, the expenses associated with re-training and fine-tuning these models are escalating rapidly. Inspired by recent impressive achievements of Large Language Models (LLMs) in different fields, this paper delves into the question: can LLMs efficiently improve an ML's performance at a minimal cost? We show that, through our proposed training-free framework LlmCorr, an LLM can work as a post-hoc corrector to propose corrections for the predictions of an arbitrary ML model. In particular, we form a contextual knowledge database by incorporating the dataset's label information and the ML model's predictions on the validation dataset. Leveraging the in-context learning capability of LLMs, we ask the LLM to summarise the instances in which the ML model makes mistakes and the correlation between primary predictions and true labels. Following this, the LLM can transfer its acquired knowledge to suggest corrections for the ML model's predictions. Our experimental results on text analysis and the challenging molecular predictions show that \model improves the performance of a number of models by up to 39%.
♻ ☆ PatentEval: Understanding Errors in Patent Generation
In this work, we introduce a comprehensive error typology specifically designed for evaluating two distinct tasks in machine-generated patent texts: claims-to-abstract generation, and the generation of the next claim given previous ones. We have also developed a benchmark, PatentEval, for systematically assessing language models in this context. Our study includes a comparative analysis, annotated by humans, of various models. These range from those specifically adapted during training for tasks within the patent domain to the latest general-purpose large language models (LLMs). Furthermore, we explored and evaluated some metrics to approximate human judgments in patent text evaluation, analyzing the extent to which these metrics align with expert assessments. These approaches provide valuable insights into the capabilities and limitations of current language models in the specialized field of patent text generation.
♻ ☆ Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding
Automated audio captioning (AAC) is an audio-to-text task to describe audio contents in natural language. Recently, the advancements in large language models (LLMs), with improvements in training approaches for audio encoders, have opened up possibilities for improving AAC. Thus, we explore enhancing AAC from three aspects: 1) a pre-trained audio encoder via consistent ensemble distillation (CED) is used to improve the effectivity of acoustic tokens, with a querying transformer (Q-Former) bridging the modality gap to LLM and compress acoustic tokens; 2) we investigate the advantages of using a Llama 2 with 7B parameters as the decoder; 3) another pre-trained LLM corrects text errors caused by insufficient training data and annotation ambiguities. Both the audio encoder and text decoder are optimized by low-rank adaptation (LoRA). Experiments show that each of these enhancements is effective. Our method obtains a 33.0 SPIDEr-FL score, outperforming the winner of DCASE 2023 Task 6A.
comment: Accepted by Interspeech 2024
♻ ☆ SegHist: A General Segmentation-based Framework for Chinese Historical Document Text Line Detection ICDAR2024
Text line detection is a key task in historical document analysis facing many challenges of arbitrary-shaped text lines, dense texts, and text lines with high aspect ratios, etc. In this paper, we propose a general framework for historical document text detection (SegHist), enabling existing segmentation-based text detection methods to effectively address the challenges, especially text lines with high aspect ratios. Integrating the SegHist framework with the commonly used method DB++, we develop DB-SegHist. This approach achieves SOTA on the CHDAC, MTHv2, and competitive results on HDRC datasets, with a significant improvement of 1.19% on the most challenging CHDAC dataset which features more text lines with high aspect ratios. Moreover, our method attains SOTA on rotated MTHv2 and rotated HDRC, demonstrating its rotational robustness. The code is available at https://github.com/LumionHXJ/SegHist.
comment: Accepted by ICDAR2024
♻ ☆ TemPrompt: Multi-Task Prompt Learning for Temporal Relation Extraction in RAG-based Crowdsourcing Systems
Temporal relation extraction (TRE) aims to grasp the evolution of events or actions, and thus shape the workflow of associated tasks, so it holds promise in helping understand task requests initiated by requesters in crowdsourcing systems. However, existing methods still struggle with limited and unevenly distributed annotated data. Therefore, inspired by the abundant global knowledge stored within pre-trained language models (PLMs), we propose a multi-task prompt learning framework for TRE (TemPrompt), incorporating prompt tuning and contrastive learning to tackle these issues. To elicit more effective prompts for PLMs, we introduce a task-oriented prompt construction approach that thoroughly takes the myriad factors of TRE into consideration for automatic prompt generation. In addition, we present temporal event reasoning as a supplement to bolster the model's focus on events and temporal cues. The experimental results demonstrate that TemPrompt outperforms all compared baselines across the majority of metrics under both standard and few-shot settings. A case study is provided to validate its effectiveness in crowdsourcing scenarios.
comment: 12 pages, 9 figures
♻ ☆ A Resilient and Accessible Distribution-Preserving Watermark for Large Language Models ICML 2024
Watermarking techniques offer a promising way to identify machine-generated content via embedding covert information into the contents generated from language models. A challenge in the domain lies in preserving the distribution of original generated content after watermarking. Our research extends and improves upon existing watermarking framework, placing emphasis on the importance of a \textbf{Di}stribution-\textbf{P}reserving (DiP) watermark. Contrary to the current strategies, our proposed DiPmark simultaneously preserves the original token distribution during watermarking (distribution-preserving), is detectable without access to the language model API and prompts (accessible), and is provably robust to moderate changes of tokens (resilient). DiPmark operates by selecting a random set of tokens prior to the generation of a word, then modifying the token distribution through a distribution-preserving reweight function to enhance the probability of these selected tokens during the sampling process. Extensive empirical evaluation on various language models and tasks demonstrates our approach's distribution-preserving property, accessibility, and resilience, making it a effective solution for watermarking tasks that demand impeccable quality preservation.
comment: ICML 2024
♻ ☆ Knowledge Crosswords: Geometric Knowledge Reasoning with Large Language Models
We propose Knowledge Crosswords, a geometric knowledge reasoning benchmark consisting of incomplete knowledge networks bounded by structured factual constraints, where LLMs are tasked with inferring the missing facts to meet all constraints. The novel setting of geometric knowledge reasoning necessitates new LM abilities beyond existing atomic/linear multi-hop QA, such as backtracking, verifying facts and constraints, reasoning with uncertainty, and more. Knowledge Crosswords contains 2,101 individual problems, covering diverse knowledge domains, and is further divided into three difficulty levels. We conduct extensive experiments to evaluate existing LLMs and approaches on Knowledge Crosswords. Results demonstrate that baseline approaches struggle with larger knowledge networks and semantically-equivalent entity distractors. In light of their limitations, we propose two new approaches, Staged Prompting and Verify-All, to augment LLMs' abilities for error-aware backtracking and constraint verification. Our Verify-All significantly outperforms prior methods and is more robust towards problems in the hard subset. Further analysis shows that geometric knowledge reasoning poses new challenges to LLMs' knowledge abilities, particularly in robustness towards varying option orders, complex structural constraints in knowledge networks, "none of the above" scenarios, and more.
♻ ☆ Universal Prompt Optimizer for Safe Text-to-Image Generation NAACL 2024
Text-to-Image (T2I) models have shown great performance in generating images based on textual prompts. However, these models are vulnerable to unsafe input to generate unsafe content like sexual, harassment and illegal-activity images. Existing studies based on image checker, model fine-tuning and embedding blocking are impractical in real-world applications. Hence, we propose the first universal prompt optimizer for safe T2I (POSI) generation in black-box scenario. We first construct a dataset consisting of toxic-clean prompt pairs by GPT-3.5 Turbo. To guide the optimizer to have the ability of converting toxic prompt to clean prompt while preserving semantic information, we design a novel reward function measuring toxicity and text alignment of generated images and train the optimizer through Proximal Policy Optimization. Experiments show that our approach can effectively reduce the likelihood of various T2I models in generating inappropriate images, with no significant impact on text alignment. It is also flexible to be combined with methods to achieve better performance. Our code is available at https://github.com/wzongyu/POSI.
comment: NAACL 2024
♻ ☆ NExT-GPT: Any-to-Any Multimodal LLM ICML 2024
While recently Multimodal Large Language Models (MM-LLMs) have made exciting strides, they mostly fall prey to the limitation of only input-side multimodal understanding, without the ability to produce content in multiple modalities. As we humans always perceive the world and communicate with people through various modalities, developing any-to-any MM-LLMs capable of accepting and delivering content in any modality becomes essential to human-level AI. To fill the gap, we present an end-to-end general-purpose any-to-any MM-LLM system, NExT-GPT. We connect an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio. By leveraging the existing well-trained highly-performing encoders and decoders, NExT-GPT is tuned with only a small amount of parameter (1%) of certain projection layers, which not only benefits low-cost training and also facilitates convenient expansion to more potential modalities. Moreover, we introduce a modality-switching instruction tuning (MosIT) and manually curate a high-quality dataset for MosIT, based on which NExT-GPT is empowered with complex cross-modal semantic understanding and content generation. Overall, our research showcases the promising possibility of building an AI agent capable of modeling universal modalities, paving the way for more human-like AI research in the community. Project page: https://next-gpt.github.io/
comment: ICML 2024 (Oral)
♻ ☆ LLMs Are Zero-Shot Context-Aware Simultaneous Translators
The advent of transformers has fueled progress in machine translation. More recently large language models (LLMs) have come to the spotlight thanks to their generality and strong performance in a wide range of language tasks, including translation. Here we show that open-source LLMs perform on par with or better than some state-of-the-art baselines in simultaneous machine translation (SiMT) tasks, zero-shot. We also demonstrate that injection of minimal background information, which is easy with an LLM, brings further performance gains, especially on challenging technical subject-matter. This highlights LLMs' potential for building next generation of massively multilingual, context-aware and terminologically accurate SiMT systems that require no resource-intensive training or fine-tuning.
♻ ☆ Modeling Emotions and Ethics with Large Language Models
This paper explores the integration of human-like emotions and ethical considerations into Large Language Models (LLMs). We first model eight fundamental human emotions, presented as opposing pairs, and employ collaborative LLMs to reinterpret and express these emotions across a spectrum of intensity. Our focus extends to embedding a latent ethical dimension within LLMs, guided by a novel self-supervised learning algorithm with human feedback (SSHF). This approach enables LLMs to perform self-evaluations and adjustments concerning ethical guidelines, enhancing their capability to generate content that is not only emotionally resonant but also ethically aligned. The methodologies and case studies presented herein illustrate the potential of LLMs to transcend mere text and image generation, venturing into the realms of empathetic interaction and principled decision-making, thereby setting a new precedent in the development of emotionally aware and ethically conscious AI systems.
comment: 8 pages, 4 figures, 3 tables
♻ ☆ When Large Language Models Meet Optical Networks: Paving the Way for Automation
Since the advent of GPT, large language models (LLMs) have brought about revolutionary advancements in all walks of life. As a superior natural language processing (NLP) technology, LLMs have consistently achieved state-of-the-art performance on numerous areas. However, LLMs are considered to be general-purpose models for NLP tasks, which may encounter challenges when applied to complex tasks in specialized fields such as optical networks. In this study, we propose a framework of LLM-empowered optical networks, facilitating intelligent control of the physical layer and efficient interaction with the application layer through an LLM-driven agent (AI-Agent) deployed in the control layer. The AI-Agent can leverage external tools and extract domain knowledge from a comprehensive resource library specifically established for optical networks. This is achieved through user input and well-crafted prompts, enabling the generation of control instructions and result representations for autonomous operation and maintenance in optical networks. To improve LLM's capability in professional fields and stimulate its potential on complex tasks, the details of performing prompt engineering, establishing domain knowledge library, and implementing complex tasks are illustrated in this study. Moreover, the proposed framework is verified on two typical tasks: network alarm analysis and network performance optimization. The good response accuracies and sematic similarities of 2,400 test situations exhibit the great potential of LLM in optical networks.
♻ ☆ GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements
State-of-the-art language models can exhibit impressive reasoning refinement capabilities on math, science or coding tasks. However, recent work demonstrates that even the best models struggle to identify \textit{when and where to refine} without access to external feedback. Outcome-based Reward Models (\textbf{ORMs}), trained to predict correctness of the final answer indicating when to refine, offer one convenient solution for deciding when to refine. Process Based Reward Models (\textbf{PRMs}), trained to predict correctness of intermediate steps, can then be used to indicate where to refine. But they are expensive to train, requiring extensive human annotations. In this paper, we propose Stepwise ORMs (\textbf{SORMs}) which are trained, only on synthetic data, to approximate the expected future reward of the optimal policy or $V^{\star}$. More specifically, SORMs are trained to predict the correctness of the final answer when sampling the current policy many times (rather than only once as in the case of ORMs). Our experiments show that SORMs can more accurately detect incorrect reasoning steps compared to ORMs, thus improving downstream accuracy when doing refinements. We then train \textit{global} refinement models, which take only the question and a draft solution as input and predict a corrected solution, and \textit{local} refinement models which also take as input a critique indicating the location of the first reasoning error. We generate training data for both models synthetically by reusing data used to train the SORM. We find combining global and local refinements, using the ORM as a reranker, significantly outperforms either one individually, as well as a best of three sample baseline. With this strategy we can improve the accuracy of a LLaMA-2 13B model (already fine-tuned with RL) on GSM8K from 53\% to 65\% when greedily sampled.
♻ ☆ Annotating FrameNet via Structure-Conditioned Language Generation ACL 2024
Despite the remarkable generative capabilities of language models in producing naturalistic language, their effectiveness on explicit manipulation and generation of linguistic structures remain understudied. In this paper, we investigate the task of generating new sentences preserving a given semantic structure, following the FrameNet formalism. We propose a framework to produce novel frame-semantically annotated sentences following an overgenerate-and-filter approach. Our results show that conditioning on rich, explicit semantic information tends to produce generations with high human acceptance, under both prompting and finetuning. Our generated frame-semantic structured annotations are effective at training data augmentation for frame-semantic role labeling in low-resource settings; however, we do not see benefits under higher resource settings. Our study concludes that while generating high-quality, semantically rich data might be within reach, the downstream utility of such generations remains to be seen, highlighting the outstanding challenges with automating linguistic annotation tasks.
comment: This paper has been accepted to ACL 2024
♻ ☆ OTCE: Hybrid SSM and Attention with Cross Domain Mixture of Experts to construct Observer-Thinker-Conceiver-Expresser
Recent research has shown that combining Mamba with Transformer architecture, which has selective state space and quadratic self-attention mechanism, outperforms using Mamba or Transformer architecture alone in language modeling tasks. The quadratic self-attention mechanism effectively alleviates the shortcomings of selective state space in handling long-term dependencies of any element in the sequence. We propose a position information injection method that connects the selective state space model with the quadratic attention, and integrates these two architectures with hybrid experts with cross-sharing domains, so that we can enjoy the advantages of both. We design a new architecture with a more biomimetic idea: Observer-Thinker-Conceiver-Expresser (OTCE), which can compete with well-known medium-scale open-source language models on a small scale in language modeling tasks.
♻ ☆ Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass
Many applications today provide users with multiple auto-complete drafts as they type, including GitHub's code completion, Gmail's smart compose, and Apple's messaging auto-suggestions. Under the hood, language models support this by running an autoregressive inference pass to provide a draft. Consequently, providing $k$ drafts to the user requires running an expensive language model $k$ times. To alleviate the computation cost of running $k$ inference passes, we propose Superposed Decoding, a new decoding algorithm that generates $k$ drafts at the computation cost of one autoregressive inference pass. We achieve this by feeding a superposition of the most recent token embeddings from the $k$ drafts as input to the next decoding step of the language model. At every inference step we combine the $k$ drafts with the top-$k$ tokens to get $k^2$ new drafts and cache the $k$ most likely options, using an n-gram interpolation with minimal compute overhead to filter out incoherent generations. Our experiments show that $k$ drafts from Superposed Decoding are at least as coherent and factual as Nucleus Sampling and Greedy Decoding respectively, while being at least $2.44\times$ faster for $k\ge3$. In a compute-normalized setting, user evaluations demonstrably favor text generated by Superposed Decoding over Nucleus Sampling. Code and more examples open-sourced at https://github.com/RAIVNLab/SuperposedDecoding.
comment: 22 pages, 15 figures
♻ ☆ Modeling the Sacred: Considerations when Using Religious Texts in Natural Language Processing NAACL2024
This position paper concerns the use of religious texts in Natural Language Processing (NLP), which is of special interest to the Ethics of NLP. Religious texts are expressions of culturally important values, and machine learned models have a propensity to reproduce cultural values encoded in their training data. Furthermore, translations of religious texts are frequently used by NLP researchers when language data is scarce. This repurposes the translations from their original uses and motivations, which often involve attracting new followers. This paper argues that NLP's use of such texts raises considerations that go beyond model biases, including data provenance, cultural contexts, and their use in proselytism. We argue for more consideration of researcher positionality, and of the perspectives of marginalized linguistic and religious communities.
comment: Findings of NAACL2024
♻ ☆ Investigating writing style as a contributor to gender gaps in science and technology
A growing stream of research finds that scientific contributions are evaluated differently depending on the gender of the author. In this article, we consider whether gender differences in writing styles - how men and women communicate their work - may contribute to these observed gender gaps. We ground our investigation in a framework for characterizing the linguistic style of written text, with two sets of features - informational (i.e., features that emphasize facts) and involved (i.e., features that emphasize relationships). Using a large sample of academic papers and patents, we find significant differences in writing style by gender, with women using more involved features in their writing. Papers and patents with more involved features also tend to be cited more by women. Our findings suggest that scientific text is not devoid of personal character, which could contribute to bias in evaluation, thereby compromising the norm of universalism as a foundational principle of science.
♻ ☆ Instruct, Not Assist: LLM-based Multi-Turn Planning and Hierarchical Questioning for Socratic Code Debugging
Socratic questioning is an effective teaching strategy, encouraging critical thinking and problem-solving. The conversational capabilities of large language models (LLMs) show great potential for providing scalable, real-time student guidance. However, current LLMs often give away solutions directly, making them ineffective instructors. We tackle this issue in the code debugging domain with TreeInstruct, an Instructor agent guided by a novel state space-based planning algorithm. TreeInstruct asks probing questions to help students independently identify and resolve errors. It estimates a student's conceptual and syntactical knowledge to dynamically construct a question tree based on their responses and current knowledge state, effectively addressing both independent and dependent mistakes concurrently in a multi-turn interaction setting. In addition to using an existing single-bug debugging benchmark, we construct a more challenging multi-bug dataset of 150 coding problems, incorrect solutions, and bug fixes -- all carefully constructed and annotated by experts. Extensive evaluation shows TreeInstruct's state-of-the-art performance on both datasets, proving it to be a more effective instructor than baselines. Furthermore, a real-world case study with five students of varying skill levels further demonstrates TreeInstruct's ability to guide students to debug their code efficiently with minimal turns and highly Socratic questioning.
♻ ☆ Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions
LLMs have demonstrated impressive performance in answering medical questions, such as achieving passing scores on medical licensing examinations. However, medical board exam or general clinical questions do not capture the complexity of realistic clinical cases. Moreover, the lack of reference explanations means we cannot easily evaluate the reasoning of model decisions, a crucial component of supporting doctors in making complex medical decisions. To address these challenges, we construct two new datasets: JAMA Clinical Challenge and Medbullets. JAMA Clinical Challenge consists of questions based on challenging clinical cases, while Medbullets comprises simulated clinical questions. Both datasets are structured as multiple-choice question-answering tasks, accompanied by expert-written explanations. We evaluate seven LLMs on the two datasets using various prompts. Experiments demonstrate that our datasets are harder than previous benchmarks. Human and automatic evaluations of model-generated explanations provide insights into the promise and deficiency of LLMs for explainable medical QA.
♻ ☆ Do Large Language Models Rank Fairly? An Empirical Study on the Fairness of LLMs as Rankers NAACL 2024
The integration of Large Language Models (LLMs) in information retrieval has raised a critical reevaluation of fairness in the text-ranking models. LLMs, such as GPT models and Llama2, have shown effectiveness in natural language understanding tasks, and prior works (e.g., RankGPT) have also demonstrated that the LLMs exhibit better performance than the traditional ranking models in the ranking task. However, their fairness remains largely unexplored. This paper presents an empirical study evaluating these LLMs using the TREC Fair Ranking dataset, focusing on the representation of binary protected attributes such as gender and geographic location, which are historically underrepresented in search outcomes. Our analysis delves into how these LLMs handle queries and documents related to these attributes, aiming to uncover biases in their ranking algorithms. We assess fairness from both user and content perspectives, contributing an empirical benchmark for evaluating LLMs as the fair ranker.
comment: Accepted at NAACL 2024 Main Conference
♻ ☆ SpecExec: Massively Parallel Speculative Decoding for Interactive LLM Inference on Consumer Devices
As large language models gain widespread adoption, running them efficiently becomes crucial. Recent works on LLM inference use speculative decoding to achieve extreme speedups. However, most of these works implicitly design their algorithms for high-end datacenter hardware. In this work, we ask the opposite question: how fast can we run LLMs on consumer machines? Consumer GPUs can no longer fit the largest available models (50B+ parameters) and must offload them to RAM or SSD. When running with offloaded parameters, the inference engine can process batches of hundreds or thousands of tokens at the same time as just one token, making it a natural fit for speculative decoding. We propose SpecExec (Speculative Execution), a simple parallel decoding method that can generate up to 20 tokens per target model iteration for popular LLM families. It utilizes the high spikiness of the token probabilities distribution in modern LLMs and a high degree of alignment between model output probabilities. SpecExec takes the most probable tokens continuation from the draft model to build a "cache" tree for the target model, which then gets validated in a single pass. Using SpecExec, we demonstrate inference of 50B+ parameter LLMs on consumer GPUs with RAM offloading at 4-6 tokens per second with 4-bit quantization or 2-3 tokens per second with 16-bit weights.
comment: preprint
♻ ☆ SpeechX: Neural Codec Language Model as a Versatile Speech Transformer
Recent advancements in generative speech models based on audio-text prompts have enabled remarkable innovations like high-quality zero-shot text-to-speech. However, existing models still face limitations in handling diverse audio-text speech generation tasks involving transforming input speech and processing audio captured in adverse acoustic conditions. This paper introduces SpeechX, a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks, dealing with both clean and noisy signals. SpeechX combines neural codec language modeling with multi-task learning using task-dependent prompting, enabling unified and extensible modeling and providing a consistent way for leveraging textual input in speech enhancement and transformation tasks. Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise, achieving comparable or superior performance to specialized models across tasks. See https://aka.ms/speechx for demo samples.
comment: To appear in TASLP. See https://aka.ms/speechx for demo samples
♻ ☆ Confabulation: The Surprising Value of Large Language Model Hallucinations ACL2024
This paper presents a systematic defense of large language model (LLM) hallucinations or 'confabulations' as a potential resource instead of a categorically negative pitfall. The standard view is that confabulations are inherently problematic and AI research should eliminate this flaw. In this paper, we argue and empirically demonstrate that measurable semantic characteristics of LLM confabulations mirror a human propensity to utilize increased narrativity as a cognitive resource for sense-making and communication. In other words, it has potential value. Specifically, we analyze popular hallucination benchmarks and reveal that hallucinated outputs display increased levels of narrativity and semantic coherence relative to veridical outputs. This finding reveals a tension in our usually dismissive understandings of confabulation. It suggests, counter-intuitively, that the tendency for LLMs to confabulate may be intimately associated with a positive capacity for coherent narrative-text generation.
comment: Forthcoming at ACL2024 main conference. 1 figure
Machine Learning 158
☆ EXTRACT: Efficient Policy Learning by Extracting Transferrable Robot Skills from Offline Data
Most reinforcement learning (RL) methods focus on learning optimal policies over low-level action spaces. While these methods can perform well in their training environments, they lack the flexibility to transfer to new tasks. Instead, RL agents that can act over useful, temporally extended skills rather than low-level actions can learn new tasks more easily. Prior work in skill-based RL either requires expert supervision to define useful skills, which is hard to scale, or learns a skill-space from offline data with heuristics that limit the adaptability of the skills, making them difficult to transfer during downstream RL. Our approach, EXTRACT, instead utilizes pre-trained vision language models to extract a discrete set of semantically meaningful skills from offline data, each of which is parameterized by continuous arguments, without human supervision. This skill parameterization allows robots to learn new tasks by only needing to learn when to select a specific skill and how to modify its arguments for the specific task. We demonstrate through experiments in sparse-reward, image-based, robot manipulation environments that EXTRACT can more quickly learn new tasks than prior works, with major gains in sample efficiency and performance over prior skill-based RL. Website at https://www.jessezhang.net/projects/extract/.
comment: 22 pages, 13 figures
☆ DiffusionPDE: Generative PDE-Solving Under Partial Observation
We introduce a general framework for solving partial differential equations (PDEs) using generative diffusion models. In particular, we focus on the scenarios where we do not have the full knowledge of the scene necessary to apply classical solvers. Most existing forward or inverse PDE approaches perform poorly when the observations on the data or the underlying coefficients are incomplete, which is a common assumption for real-world measurements. In this work, we propose DiffusionPDE that can simultaneously fill in the missing information and solve a PDE by modeling the joint distribution of the solution and coefficient spaces. We show that the learned generative priors lead to a versatile framework for accurately solving a wide range of PDEs under partial observation, significantly outperforming the state-of-the-art methods for both forward and inverse directions.
comment: Project page: https://jhhuangchloe.github.io/Diffusion-PDE/
☆ Solving Hard Mizar Problems with Instantiation and Strategy Invention
In this work, we prove over 3000 previously ATP-unproved Mizar/MPTP problems by using several ATP and AI methods, raising the number of ATP-solved Mizar problems from 75\% to above 80\%. First, we start to experiment with the cvc5 SMT solver which uses several instantiation-based heuristics that differ from the superposition-based systems, that were previously applied to Mizar,and add many new solutions. Then we use automated strategy invention to develop cvc5 strategies that largely improve cvc5's performance on the hard problems. In particular, the best invented strategy solves over 14\% more problems than the best previously available cvc5 strategy. We also show that different clausification methods have a high impact on such instantiation-based methods, again producing many new solutions. In total, the methods solve 3021 (21.3\%) of the 14163 previously unsolved hard Mizar problems. This is a new milestone over the Mizar large-theory benchmark and a large strengthening of the hammer methods for Mizar.
☆ CaLMQA: Exploring culturally specific long-form question answering across 23 languages
Large language models (LLMs) are commonly used for long-form question answering, which requires them to generate paragraph-length answers to complex questions. While long-form QA has been well-studied in English via many different datasets and evaluation metrics, this research has not been extended to cover most other languages. To bridge this gap, we introduce CaLMQA, a collection of 2.6K complex questions spanning 23 languages, including under-resourced, rarely-studied languages such as Fijian and Kirundi. Our dataset includes both naturally-occurring questions collected from community web forums as well as questions written by native speakers, whom we hire for this purpose. Our process yields diverse, complex questions that reflect cultural topics (e.g. traditions, laws, news) and the language usage of native speakers. We conduct automatic evaluation across a suite of open- and closed-source models using our novel metric CaLMScore, which detects incorrect language and token repetitions in answers, and observe that the quality of LLM-generated answers degrades significantly for some low-resource languages. We perform human evaluation on a subset of models and see that model performance is significantly worse for culturally specific questions than for culturally agnostic questions. Our findings highlight the need for further research in LLM multilingual capabilities and non-English LFQA evaluation.
comment: 39 pages, 16 figures. Code and data available at https://github.com/2015aroras/CaLMQA
☆ Interpreting Attention Layer Outputs with Sparse Autoencoders
Decomposing model activations into interpretable components is a key open problem in mechanistic interpretability. Sparse autoencoders (SAEs) are a popular method for decomposing the internal activations of trained transformers into sparse, interpretable features, and have been applied to MLP layers and the residual stream. In this work we train SAEs on attention layer outputs and show that also here SAEs find a sparse, interpretable decomposition. We demonstrate this on transformers from several model families and up to 2B parameters. We perform a qualitative study of the features computed by attention layers, and find multiple families: long-range context, short-range context and induction features. We qualitatively study the role of every head in GPT-2 Small, and estimate that at least 90% of the heads are polysemantic, i.e. have multiple unrelated roles. Further, we show that Sparse Autoencoders are a useful tool that enable researchers to explain model behavior in greater detail than prior work. For example, we explore the mystery of why models have so many seemingly redundant induction heads, use SAEs to motivate the hypothesis that some are long-prefix whereas others are short-prefix, and confirm this with more rigorous analysis. We use our SAEs to analyze the computation performed by the Indirect Object Identification circuit (Wang et al.), validating that the SAEs find causally meaningful intermediate variables, and deepening our understanding of the semantics of the circuit. We open-source the trained SAEs and a tool for exploring arbitrary prompts through the lens of Attention Output SAEs.
☆ Benchmarking Deep Learning Models on NVIDIA Jetson Nano for Real-Time Systems: An Empirical Investigation
The proliferation of complex deep learning (DL) models has revolutionized various applications, including computer vision-based solutions, prompting their integration into real-time systems. However, the resource-intensive nature of these models poses challenges for deployment on low-computational power and low-memory devices, like embedded and edge devices. This work empirically investigates the optimization of such complex DL models to analyze their functionality on an embedded device, particularly on the NVIDIA Jetson Nano. It evaluates the effectiveness of the optimized models in terms of their inference speed for image classification and video action detection. The experimental results reveal that, on average, optimized models exhibit a 16.11% speed improvement over their non-optimized counterparts. This not only emphasizes the critical need to consider hardware constraints and environmental sustainability in model development and deployment but also underscores the pivotal role of model optimization in enabling the widespread deployment of AI-assisted technologies on resource-constrained computational systems. It also serves as proof that prioritizing hardware-specific model optimization leads to efficient and scalable solutions that substantially decrease energy consumption and carbon footprint.
comment: 7 pages, 4 figures
☆ A New Perspective on Shampoo's Preconditioner
Shampoo, a second-order optimization algorithm which uses a Kronecker product preconditioner, has recently garnered increasing attention from the machine learning community. The preconditioner used by Shampoo can be viewed either as an approximation of the Gauss--Newton component of the Hessian or the covariance matrix of the gradients maintained by Adagrad. We provide an explicit and novel connection between the $\textit{optimal}$ Kronecker product approximation of these matrices and the approximation made by Shampoo. Our connection highlights a subtle but common misconception about Shampoo's approximation. In particular, the $\textit{square}$ of the approximation used by the Shampoo optimizer is equivalent to a single step of the power iteration algorithm for computing the aforementioned optimal Kronecker product approximation. Across a variety of datasets and architectures we empirically demonstrate that this is close to the optimal Kronecker product approximation. Additionally, for the Hessian approximation viewpoint, we empirically study the impact of various practical tricks to make Shampoo more computationally efficient (such as using the batch gradient and the empirical Fisher) on the quality of Hessian approximation.
☆ Probing the effects of broken symmetries in machine learning
Symmetry is one of the most central concepts in physics, and it is no surprise that it has also been widely adopted as an inductive bias for machine-learning models applied to the physical sciences. This is especially true for models targeting the properties of matter at the atomic scale. Both established and state-of-the-art approaches, with almost no exceptions, are built to be exactly equivariant to translations, permutations, and rotations of the atoms. Incorporating symmetries -- rotations in particular -- constrains the model design space and implies more complicated architectures that are often also computationally demanding. There are indications that non-symmetric models can easily learn symmetries from data, and that doing so can even be beneficial for the accuracy of the model. We put a model that obeys rotational invariance only approximately to the test, in realistic scenarios involving simulations of gas-phase, liquid, and solid water. We focus specifically on physical observables that are likely to be affected -- directly or indirectly -- by symmetry breaking, finding negligible consequences when the model is used in an interpolative, bulk, regime. Even for extrapolative gas-phase predictions, the model remains very stable, even though symmetry artifacts are noticeable. We also discuss strategies that can be used to systematically reduce the magnitude of symmetry breaking when it occurs, and assess their impact on the convergence of observables.
☆ Light-weight End-to-End Graph Interest Network for CTR Prediction in E-commerce Search
Click-through-rate (CTR) prediction has an essential impact on improving user experience and revenue in e-commerce search. With the development of deep learning, graph-based methods are well exploited to utilize graph structure extracted from user behaviors and other information to help embedding learning. However, most of the previous graph-based methods mainly focus on recommendation scenarios, and therefore their graph structures highly depend on item's sequential information from user behaviors, ignoring query's sequential signal and query-item correlation. In this paper, we propose a new approach named Light-weight End-to-End Graph Interest Network (EGIN) to effectively mine users' search interests and tackle previous challenges. (i) EGIN utilizes query and item's correlation and sequential information from the search system to build a heterogeneous graph for better CTR prediction in e-commerce search. (ii) EGIN's graph embedding learning shares the same training input and is jointly trained with CTR prediction, making the end-to-end framework effortless to deploy in large-scale search systems. The proposed EGIN is composed of three parts: query-item heterogeneous graph, light-weight graph sampling, and multi-interest network. The query-item heterogeneous graph captures correlation and sequential information of query and item efficiently by the proposed light-weight graph sampling. The multi-interest network is well designed to utilize graph embedding to capture various similarity relationships between query and item to enhance the final CTR prediction. We conduct extensive experiments on both public and industrial datasets to demonstrate the effectiveness of the proposed EGIN. At the same time, the training cost of graph learning is relatively low compared with the main CTR prediction task, ensuring efficiency in practical applications.
comment: 8 pages, 4 figures
☆ Structured Unrestricted-Rank Matrices for Parameter Efficient Fine-tuning
Recent efforts to scale Transformer models have demonstrated rapid progress across a wide range of tasks (Wei et al., 2022). However, fine-tuning these models for downstream tasks is expensive due to their large parameter counts. Parameter-efficient fine-tuning (PEFT) approaches have emerged as a viable alternative by allowing us to fine-tune models by updating only a small number of parameters. In this work, we propose a general framework for parameter efficient fine-tuning (PEFT), based on structured unrestricted-rank matrices (SURM) which can serve as a drop-in replacement for popular approaches such as Adapters and LoRA. Unlike other methods like LoRA, SURMs provides more flexibility in finding the right balance between compactness and expressiveness. This is achieved by using low displacement rank matrices (LDRMs), which hasn't been used in this context before. SURMs remain competitive with baselines, often providing significant quality improvements while using a smaller parameter budget. SURMs achieve 5-7% accuracy gains on various image classification tasks while replacing low-rank matrices in LoRA. It also results in up to 12x reduction of the number of parameters in adapters (with virtually no loss in quality) on the GLUE benchmark.
comment: Work in progress
☆ LLM Targeted Underperformance Disproportionately Impacts Vulnerable Users
While state-of-the-art Large Language Models (LLMs) have shown impressive performance on many tasks, there has been extensive research on undesirable model behavior such as hallucinations and bias. In this work, we investigate how the quality of LLM responses changes in terms of information accuracy, truthfulness, and refusals depending on three user traits: English proficiency, education level, and country of origin. We present extensive experimentation on three state-of-the-art LLMs and two different datasets targeting truthfulness and factuality. Our findings suggest that undesirable behaviors in state-of-the-art LLMs occur disproportionately more for users with lower English proficiency, of lower education status, and originating from outside the US, rendering these models unreliable sources of information towards their most vulnerable users.
☆ When does Self-Prediction help? Understanding Auxiliary Tasks in Reinforcement Learning
We investigate the impact of auxiliary learning tasks such as observation reconstruction and latent self-prediction on the representation learning problem in reinforcement learning. We also study how they interact with distractions and observation functions in the MDP. We provide a theoretical analysis of the learning dynamics of observation reconstruction, latent self-prediction, and TD learning in the presence of distractions and observation functions under linear model assumptions. With this formalization, we are able to explain why latent-self prediction is a helpful \emph{auxiliary task}, while observation reconstruction can provide more useful features when used in isolation. Our empirical analysis shows that the insights obtained from our learning dynamics framework predicts the behavior of these loss functions beyond the linear model assumption in non-linear neural networks. This reinforces the usefulness of the linear model framework not only for theoretical analysis, but also practical benefit for applied problems.
☆ Compositional Models for Estimating Causal Effects
Many real-world systems can be represented as sets of interacting components. Examples of such systems include computational systems such as query processors, natural systems such as cells, and social systems such as families. Many approaches have been proposed in traditional (associational) machine learning to model such structured systems, including statistical relational models and graph neural networks. Despite this prior work, existing approaches to estimating causal effects typically treat such systems as single units, represent them with a fixed set of variables and assume a homogeneous data-generating process. We study a compositional approach for estimating individual treatment effects (ITE) in structured systems, where each unit is represented by the composition of multiple heterogeneous components. This approach uses a modular architecture to model potential outcomes at each component and aggregates component-level potential outcomes to obtain the unit-level potential outcomes. We discover novel benefits of the compositional approach in causal inference - systematic generalization to estimate counterfactual outcomes of unseen combinations of components and improved overlap guarantees between treatment and control groups compared to the classical methods for causal effect estimation. We also introduce a set of novel environments for empirically evaluating the compositional approach and demonstrate the effectiveness of our approach using both simulated and real-world data.
☆ Data curation via joint example selection further accelerates multimodal learning
Data curation is an essential component of large-scale pretraining. In this work, we demonstrate that jointly selecting batches of data is more effective for learning than selecting examples independently. Multimodal contrastive objectives expose the dependencies between data and thus naturally yield criteria for measuring the joint learnability of a batch. We derive a simple and tractable algorithm for selecting such batches, which significantly accelerate training beyond individually-prioritized data points. As performance improves by selecting from larger super-batches, we also leverage recent advances in model approximation to reduce the associated computational overhead. As a result, our approach--multimodal contrastive learning with joint example selection (JEST)--surpasses state-of-the-art models with up to 13$\times$ fewer iterations and 10$\times$ less computation. Essential to the performance of JEST is the ability to steer the data selection process towards the distribution of smaller, well-curated datasets via pretrained reference models, exposing the level of data curation as a new dimension for neural scaling laws.
comment: Main text: 9 pages, 5 figures, 3 tables, 1 algorithm. Appendix: 7 pages, 5 figures, 1 table, 2. algorithm
☆ FedBiOT: LLM Local Fine-tuning in Federated Learning without Full Model KDD 2024
Large language models (LLMs) show amazing performance on many domain-specific tasks after fine-tuning with some appropriate data. However, many domain-specific data are privately distributed across multiple owners. Thus, this dilemma raises the interest in how to perform LLM fine-tuning in federated learning (FL). However, confronted with limited computation and communication capacities, FL clients struggle to fine-tune an LLM effectively. To this end, we introduce FedBiOT, a resource-efficient LLM fine-tuning approach to FL. Specifically, our method involves the server generating a compressed LLM and aligning its performance with the full model. Subsequently, the clients fine-tune a lightweight yet important part of the compressed model, referred to as an adapter. Notice that as the server has no access to the private data owned by the clients, the data used for alignment by the server has a different distribution from the one used for fine-tuning by clients. We formulate the problem into a bi-level optimization problem to minimize the negative effect of data discrepancy and derive the updating rules for the server and clients. We conduct extensive experiments on LLaMA-2, empirically showing that the adapter has exceptional performance when reintegrated into the global LLM. The results also indicate that the proposed FedBiOT significantly reduces resource consumption compared to existing benchmarks, all while achieving comparable performance levels.
comment: KDD 2024
☆ Can independent Metropolis beat crude Monte Carlo?
Assume that we would like to estimate the expected value of a function $F$ with respect to a density $\pi$. We prove that if $\pi$ is close enough under KL divergence to another density $q$, an independent Metropolis sampler estimator that obtains samples from $\pi$ with proposal density $q$, enriched with a variance reduction computational strategy based on control variates, achieves smaller asymptotic variance than that of the crude Monte Carlo estimator. The control variates construction requires no extra computational effort but assumes that the expected value of $F$ under $q$ is analytically available. We illustrate this result by calculating the marginal likelihood in a linear regression model with prior-likelihood conflict and a non-conjugate prior. Furthermore, we propose an adaptive independent Metropolis algorithm that adapts the proposal density such that its KL divergence with the target is being reduced. We demonstrate its applicability in a Bayesian logistic and Gaussian process regression problems and we rigorously justify our asymptotic arguments under easily verifiable and essentially minimal conditions.
comment: 37 pages, 3 figures
☆ Identifying Nonstationary Causal Structures with High-Order Markov Switching Models UAI2024
Causal discovery in time series is a rapidly evolving field with a wide variety of applications in other areas such as climate science and neuroscience. Traditional approaches assume a stationary causal graph, which can be adapted to nonstationary time series with time-dependent effects or heterogeneous noise. In this work we address nonstationarity via regime-dependent causal structures. We first establish identifiability for high-order Markov Switching Models, which provide the foundations for identifiable regime-dependent causal discovery. Our empirical studies demonstrate the scalability of our proposed approach for high-order regime-dependent structure estimation, and we illustrate its applicability on brain activity data.
comment: CI4TS Workshop @UAI2024
☆ HGTDP-DTA: Hybrid Graph-Transformer with Dynamic Prompt for Drug-Target Binding Affinity Prediction
Drug target binding affinity (DTA) is a key criterion for drug screening. Existing experimental methods are time-consuming and rely on limited structural and domain information. While learning-based methods can model sequence and structural information, they struggle to integrate contextual data and often lack comprehensive modeling of drug-target interactions. In this study, we propose a novel DTA prediction method, termed HGTDP-DTA, which utilizes dynamic prompts within a hybrid Graph-Transformer framework. Our method generates context-specific prompts for each drug-target pair, enhancing the model's ability to capture unique interactions. The introduction of prompt tuning further optimizes the prediction process by filtering out irrelevant noise and emphasizing task-relevant information, dynamically adjusting the input features of the molecular graph. The proposed hybrid Graph-Transformer architecture combines structural information from Graph Convolutional Networks (GCNs) with sequence information captured by Transformers, facilitating the interaction between global and local information. Additionally, we adopted the multi-view feature fusion method to project molecular graph views and affinity subgraph views into a common feature space, effectively combining structural and contextual information. Experiments on two widely used public datasets, Davis and KIBA, show that HGTDP-DTA outperforms state-of-the-art DTA prediction methods in both prediction performance and generalization ability.
☆ From Distributional to Overton Pluralism: Investigating Large Language Model Alignment
The alignment process changes several properties of a large language model's (LLM's) output distribution. We analyze two aspects of post-alignment distributional shift of LLM responses. First, we re-examine previously reported reductions in response diversity post-alignment. Our analysis suggests that an apparent drop in the diversity of responses is largely explained by quality control and information aggregation. Alignment suppresses irrelevant and unhelpful content while shifting the output distribution toward longer responses that cover information spanning several responses from the base LLM, essentially presenting diverse information in a single response. Finding little evidence that alignment suppresses useful information, it is natural to ask the opposite question: do aligned models surface information that cannot be recovered from base models? Our second investigation shows this is not the case and the behavior of aligned models is recoverable from base models without fine-tuning. A combination of in-context examples and lower-resolution semantic hints about response content can elicit responses from base LLMs that are as similar to alignment-tuned LLM responses as alignment-tuned LLM responses are to each other. Taken together, these results indicate that current alignment techniques capture but do not extend the useful subset of assistant-like base LLM behavior, providing further evidence for the Superficial Alignment Hypothesis. They also show that in-context alignment can go surprisingly far as a strategy for imitating aligned LLMs without fine-tuning. Our code and data is available at https://github.com/thomlake/investigating-alignment.
☆ LaTable: Towards Large Tabular Models
Tabular data is one of the most ubiquitous modalities, yet the literature on tabular generative foundation models is lagging far behind its text and vision counterparts. Creating such a model is hard, due to the heterogeneous feature spaces of different tabular datasets, tabular metadata (e.g. dataset description and feature headers), and tables lacking prior knowledge (e.g. feature order). In this work we propose LaTable: a novel tabular diffusion model that addresses these challenges and can be trained across different datasets. Through extensive experiments we find that LaTable outperforms baselines on in-distribution generation, and that finetuning LaTable can generate out-of-distribution datasets better with fewer samples. On the other hand, we explore the poor zero-shot performance of LaTable, and what it may teach us about building generative tabular foundation models with better zero- and few-shot generation capabilities.
☆ Grass: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients
Large language model (LLM) training and finetuning are often bottlenecked by limited GPU memory. While existing projection-based optimization methods address this by projecting gradients into a lower-dimensional subspace to reduce optimizer state memory, they typically rely on dense projection matrices, which can introduce computational and memory overheads. In this work, we propose Grass (GRAdient Stuctured Sparsification), a novel approach that leverages sparse projections to transform gradients into structured sparse updates. This design not only significantly reduces memory usage for optimizer states but also minimizes gradient memory footprint, computation, and communication costs, leading to substantial throughput improvements. Extensive experiments on pretraining and finetuning tasks demonstrate that Grass achieves competitive performance to full-rank training and existing projection-based methods. Notably, Grass enables half-precision pretraining of a 13B parameter LLaMA model on a single 40GB A100 GPU--a feat infeasible for previous methods--and yields up to a $2\times$ throughput improvement on an 8-GPU system. Code can be found at https://github.com/aashiqmuhamed/GRASS .
☆ Privacy Preserving Reinforcement Learning for Population Processes
We consider the problem of privacy protection in Reinforcement Learning (RL) algorithms that operate over population processes, a practical but understudied setting that includes, for example, the control of epidemics in large populations of dynamically interacting individuals. In this setting, the RL algorithm interacts with the population over $T$ time steps by receiving population-level statistics as state and performing actions which can affect the entire population at each time step. An individual's data can be collected across multiple interactions and their privacy must be protected at all times. We clarify the Bayesian semantics of Differential Privacy (DP) in the presence of correlated data in population processes through a Pufferfish Privacy analysis. We then give a meta algorithm that can take any RL algorithm as input and make it differentially private. This is achieved by taking an approach that uses DP mechanisms to privatize the state and reward signal at each time step before the RL algorithm receives them as input. Our main theoretical result shows that the value-function approximation error when applying standard RL algorithms directly to the privatized states shrinks quickly as the population size and privacy budget increase. This highlights that reasonable privacy-utility trade-offs are possible for differentially private RL algorithms in population processes. Our theoretical findings are validated by experiments performed on a simulated epidemic control problem over large population sizes.
☆ BayTTA: Uncertainty-aware medical image classification with optimized test-time augmentation using Bayesian model averaging
Test-time augmentation (TTA) is a well-known technique employed during the testing phase of computer vision tasks. It involves aggregating multiple augmented versions of input data. Combining predictions using a simple average formulation is a common and straightforward approach after performing TTA. This paper introduces a novel framework for optimizing TTA, called BayTTA (Bayesian-based TTA), which is based on Bayesian Model Averaging (BMA). First, we generate a model list associated with different variations of the input data created through TTA. Then, we use BMA to combine model predictions weighted by their respective posterior probabilities. Such an approach allows one to take into account model uncertainty, and thus to enhance the predictive performance of the related machine learning or deep learning model. We evaluate the performance of BayTTA on various public data, including three medical image datasets comprising skin cancer, breast cancer, and chest X-ray images and two well-known gene editing datasets, CRISPOR and GUIDE-seq. Our experimental results indicate that BayTTA can be effectively integrated into state-of-the-art deep learning models used in medical image analysis as well as into some popular pre-trained CNN models such as VGG-16, MobileNetV2, DenseNet201, ResNet152V2, and InceptionRes-NetV2, leading to the enhancement in their accuracy and robustness performance.
☆ Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP
Contrastive Language--Image Pre-training (CLIP) has manifested remarkable improvements in zero-shot classification and cross-modal vision-language tasks. Yet, from a geometrical point of view, the CLIP embedding space has been found to have a pronounced modality gap. This gap renders the embedding space overly sparse and disconnected, with different modalities being densely distributed in distinct subregions of the hypersphere. In this work, we aim at answering two main questions: 1. Does sharing the parameter space between the multi-modal encoders reduce the modality gap? 2. Can the gap be mitigated by pushing apart the uni-modal embeddings via intra-modality separation? We design AlignCLIP, in order to answer these questions and show that answers to both questions are positive. Through extensive experiments, we show that AlignCLIP achieves noticeable enhancements in the cross-modal alignment of the embeddings, and thereby, reduces the modality gap, while maintaining the performance across several downstream evaluations, such as zero-shot image classification, zero-shot multi-modal retrieval and zero-shot semantic text similarity.
☆ Knowledge Distillation in Automated Annotation: Supervised Text Classification with LLM-Generated Training Labels
Computational social science (CSS) practitioners often rely on human-labeled data to fine-tune supervised text classifiers. We assess the potential for researchers to augment or replace human-generated training data with surrogate training labels from generative large language models (LLMs). We introduce a recommended workflow and test this LLM application by replicating 14 classification tasks and measuring performance. We employ a novel corpus of English-language text classification data sets from recent CSS articles in high-impact journals. Because these data sets are stored in password-protected archives, our analyses are less prone to issues of contamination. For each task, we compare supervised classifiers fine-tuned using GPT-4 labels against classifiers fine-tuned with human annotations and against labels from GPT-4 and Mistral-7B with few-shot in-context learning. Our findings indicate that supervised classification models fine-tuned on LLM-generated labels perform comparably to models fine-tuned with labels from human annotators. Fine-tuning models using LLM-generated labels can be a fast, efficient and cost-effective method of building supervised text classifiers.
comment: In Proceedings of the Sixth Workshop on Natural Language Processing and Computational Social Science
☆ KANQAS: Kolmogorov Arnold Network for Quantum Architecture Search
Quantum architecture search~(QAS) is a promising direction for optimization and automated design of quantum circuits towards quantum advantage. Recent techniques in QAS focus on machine learning-based approaches from reinforcement learning, like deep Q-network. While multi-layer perceptron-based deep Q-networks have been applied for QAS, their interpretability remains challenging due to the high number of parameters. In this work, we evaluate the practicality of KANs in quantum architecture search problems, analyzing their efficiency in terms of the probability of success, frequency of optimal solutions and their dependencies on various degrees of freedom of the network. In a noiseless scenario, the probability of success and the number of optimal quantum circuit configurations to generate the multi-qubit maximally entangled states are significantly higher than MLPs. Moreover in noisy scenarios, KAN can achieve a better fidelity in approximating maximally entangled state than MLPs, where the performance of the MLP significantly depends on the choice of activation function. Further investigation reveals that KAN requires a very small number of learnable parameters compared to MLPs, however, the average time of executing each episode for KAN is much higher.
comment: 10 pages and 4 figures
☆ Querying Labeled Time Series Data with Scenario Programs
In order to ensure autonomous vehicles are safe for on-road deployment, simulation-based testing has become an integral complement to on-road testing. The rise in simulation testing and validation reflects a growing need to verify that AV behavior is consistent with desired outcomes even in edge case scenarios $-$ which may seldom or never appear in on-road testing data. This raises a critical question: to what extent are AV failures in simulation consistent with data collected from real-world testing? As a result of the gap between simulated and real sensor data (sim-to-real gap), failures in simulation can either be spurious (simulation- or simulator-specific issues) or relevant (safety-critical AV system issues). One possible method for validating if simulated time series failures are consistent with real world time series sensor data could involve retrieving instances of the failure scenario from a real-world time series dataset, in order to understand AV performance in these scenarios. Adopting this strategy, we propose a formal definition of what constitutes a match between a real-world labeled time series data item and a simulated scenario written from a fragment of the Scenic probabilistic programming language for simulation generation. With this definition of a match, we develop a querying algorithm that identifies the subset of a labeled time series dataset matching a given scenario. To allow this approach to be used to verify the safety of other cyber-physical systems (CPS), we present a definition and algorithm for matching scalable beyond the autonomous vehicles domain. Experiments demonstrate the precision and scalability of the algorithm for a set of challenging and uncommon time series scenarios identified from the nuScenes autonomous driving dataset. We include a full system implementation of the querying algorithm freely available for use across a wide range of CPS.
comment: 72 pages, 6 figures, 5 algorithms. Published on https://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-136.html
☆ Aligning Programming Language and Natural Language: Exploring Design Choices in Multi-Modal Transformer-Based Embedding for Bug Localization
Bug localization refers to the identification of source code files which is in a programming language and also responsible for the unexpected behavior of software using the bug report, which is a natural language. As bug localization is labor-intensive, bug localization models are employed to assist software developers. Due to the domain difference between source code files and bug reports, modern bug-localization systems, based on deep learning models, rely heavily on embedding techniques that project bug reports and source code files into a shared vector space. The creation of an embedding involves several design choices, but the impact of these choices on the quality of embedding and the performance of bug localization models remains unexplained in current research. To address this gap, our study evaluated 14 distinct embedding models to gain insights into the effects of various design choices. Subsequently, we developed bug localization models utilizing these embedding models to assess the influence of these choices on the performance of the localization models. Our findings indicate that the pre-training strategies significantly affect the quality of the embedding. Moreover, we discovered that the familiarity of the embedding models with the data has a notable impact on the bug localization model's performance. Notably, when the training and testing data are collected from different projects, the performance of the bug localization models exhibits substantial fluctuations.
☆ Distributed Training of Large Graph Neural Networks with Variable Communication Rates
Training Graph Neural Networks (GNNs) on large graphs presents unique challenges due to the large memory and computing requirements. Distributed GNN training, where the graph is partitioned across multiple machines, is a common approach to training GNNs on large graphs. However, as the graph cannot generally be decomposed into small non-interacting components, data communication between the training machines quickly limits training speeds. Compressing the communicated node activations by a fixed amount improves the training speeds, but lowers the accuracy of the trained GNN. In this paper, we introduce a variable compression scheme for reducing the communication volume in distributed GNN training without compromising the accuracy of the learned model. Based on our theoretical analysis, we derive a variable compression method that converges to a solution equivalent to the full communication case, for all graph partitioning schemes. Our empirical results show that our method attains a comparable performance to the one obtained with full communication. We outperform full communication at any fixed compression ratio for any communication budget.
☆ Diffusion-based Adversarial Purification for Intrusion Detection
The escalating sophistication of cyberattacks has encouraged the integration of machine learning techniques in intrusion detection systems, but the rise of adversarial examples presents a significant challenge. These crafted perturbations mislead ML models, enabling attackers to evade detection or trigger false alerts. As a reaction, adversarial purification has emerged as a compelling solution, particularly with diffusion models showing promising results. However, their purification potential remains unexplored in the context of intrusion detection. This paper demonstrates the effectiveness of diffusion models in purifying adversarial examples in network intrusion detection. Through a comprehensive analysis of the diffusion parameters, we identify optimal configurations maximizing adversarial robustness with minimal impact on normal performance. Importantly, this study reveals insights into the relationship between diffusion noise and diffusion steps, representing a novel contribution to the field. Our experiments are carried out on two datasets and against 5 adversarial attacks. The implementation code is publicly available.
☆ Constructing structured tensor priors for Bayesian inverse problems
Specifying a prior distribution is an essential part of solving Bayesian inverse problems. The prior encodes a belief on the nature of the solution and this regularizes the problem. In this article we completely characterize a Gaussian prior that encodes the belief that the solution is a structured tensor. We first define the notion of (A,b)-constrained tensors and show that they describe a large variety of different structures such as Hankel, circulant, triangular, symmetric, and so on. Then we completely characterize the Gaussian probability distribution of such tensors by specifying its mean vector and covariance matrix. Furthermore, explicit expressions are proved for the covariance matrix of tensors whose entries are invariant under a permutation. These results unlock a whole new class of priors for Bayesian inverse problems. We illustrate how new kernel functions can be designed and efficiently computed and apply our results on two particular Bayesian inverse problems: completing a Hankel matrix from a few noisy measurements and learning an image classifier of handwritten digits. The effectiveness of the proposed priors is demonstrated for both problems. All applications have been implemented as reactive Pluto notebooks in Julia.
☆ Learning Dynamic Bayesian Networks from Data: Foundations, First Principles and Numerical Comparisons
In this paper, we present a guide to the foundations of learning Dynamic Bayesian Networks (DBNs) from data in the form of multiple samples of trajectories for some length of time. We present the formalism for a generic as well as a set of common types of DBNs for particular variable distributions. We present the analytical form of the models, with a comprehensive discussion on the interdependence between structure and weights in a DBN model and their implications for learning. Next, we give a broad overview of learning methods and describe and categorize them based on the most important statistical features, and how they treat the interplay between learning structure and weights. We give the analytical form of the likelihood and Bayesian score functions, emphasizing the distinction from the static case. We discuss functions used in optimization to enforce structural requirements. We briefly discuss more complex extensions and representations. Finally we present a set of comparisons in different settings for various distinct but representative algorithms across the variants.
☆ Towards Compositional Interpretability for XAI
Artificial intelligence (AI) is currently based largely on black-box machine learning models which lack interpretability. The field of eXplainable AI (XAI) strives to address this major concern, being critical in high-stakes areas such as the finance, legal and health sectors. We present an approach to defining AI models and their interpretability based on category theory. For this we employ the notion of a compositional model, which sees a model in terms of formal string diagrams which capture its abstract structure together with its concrete implementation. This comprehensive view incorporates deterministic, probabilistic and quantum models. We compare a wide range of AI models as compositional models, including linear and rule-based models, (recurrent) neural networks, transformers, VAEs, and causal and DisCoCirc models. Next we give a definition of interpretation of a model in terms of its compositional structure, demonstrating how to analyse the interpretability of a model, and using this to clarify common themes in XAI. We find that what makes the standard 'intrinsically interpretable' models so transparent is brought out most clearly diagrammatically. This leads us to the more general notion of compositionally-interpretable (CI) models, which additionally include, for instance, causal, conceptual space, and DisCoCirc models. We next demonstrate the explainability benefits of CI models. Firstly, their compositional structure may allow the computation of other quantities of interest, and may facilitate inference from the model to the modelled phenomenon by matching its structure. Secondly, they allow for diagrammatic explanations for their behaviour, based on influence constraints, diagram surgery and rewrite explanations. Finally, we discuss many future directions for the approach, raising the question of how to learn such meaningfully structured models in practice.
☆ Leveraging Reinforcement Learning in Red Teaming for Advanced Ransomware Attack Simulations
Ransomware presents a significant and increasing threat to individuals and organizations by encrypting their systems and not releasing them until a large fee has been extracted. To bolster preparedness against potential attacks, organizations commonly conduct red teaming exercises, which involve simulated attacks to assess existing security measures. This paper proposes a novel approach utilizing reinforcement learning (RL) to simulate ransomware attacks. By training an RL agent in a simulated environment mirroring real-world networks, effective attack strategies can be learned quickly, significantly streamlining traditional, manual penetration testing processes. The attack pathways revealed by the RL agent can provide valuable insights to the defense team, helping them identify network weak points and develop more resilient defensive measures. Experimental results on a 152-host example network confirm the effectiveness of the proposed approach, demonstrating the RL agent's capability to discover and orchestrate attacks on high-value targets while evading honeyfiles (decoy files strategically placed to detect unauthorized access).
☆ Multi-property Steering of Large Language Models with Dynamic Activation Composition
Activation steering methods were shown to be effective in conditioning language model generation by additively intervening over models' intermediate representations. However, the evaluation of these techniques has so far been limited to single conditioning properties and synthetic settings. In this work, we conduct a comprehensive evaluation of various activation steering strategies, highlighting the property-dependent nature of optimal parameters to ensure a robust effect throughout generation. To address this issue, we propose Dynamic Activation Composition, an information-theoretic approach to modulate the steering intensity of one or more properties throughout generation. Our experiments on multi-property steering show that our method successfully maintains high conditioning while minimizing the impact of conditioning on generation fluency.
☆ Modularity Based Community Detection in Hypergraphs
In this paper, we propose a scalable community detection algorithm using hypergraph modularity function, h-Louvain. It is an adaptation of the classical Louvain algorithm in the context of hypergraphs. We observe that a direct application of the Louvain algorithm to optimize the hypergraph modularity function often fails to find meaningful communities. We propose a solution to this issue by adjusting the initial stage of the algorithm via carefully and dynamically tuned linear combination of the graph modularity function of the corresponding two-section graph and the desired hypergraph modularity function. The process is guided by Bayesian optimization of the hyper-parameters of the proposed procedure. Various experiments on synthetic as well as real-world networks are performed showing that this process yields improved results in various regimes.
comment: 21 pages, 8 figures, 4 tables
☆ CDQuant: Accurate Post-training Weight Quantization of Large Pre-trained Models using Greedy Coordinate Descent
Large language models (LLMs) have recently demonstrated remarkable performance across diverse language tasks. But their deployment is often constrained by their substantial computational and storage requirements. Quantization has emerged as a key technique for addressing this challenge, enabling the compression of large models with minimal impact on performance. The recent GPTQ algorithm, a post-training quantization (PTQ) method, has proven highly effective for compressing LLMs, sparking a wave of research that leverages GPTQ as a core component. Recognizing the pivotal role of GPTQ in the PTQ landscape, we introduce CDQuant, a simple and scalable alternative to GPTQ with improved performance. CDQuant uses coordinate descent to minimize the layer-wise reconstruction loss to achieve high-quality quantized weights. Our algorithm is easy to implement and scales efficiently to models with hundreds of billions of parameters. Through extensive evaluation on the PaLM2 model family, we demonstrate that CDQuant consistently outperforms GPTQ across diverse model sizes and quantization levels. In particular, for INT2 quantization of PaLM2-Otter, CDQuant achieves a 10% reduction in perplexity compared to GPTQ.
☆ SincVAE: a New Approach to Improve Anomaly Detection on EEG Data Using SincNet and Variational Autoencoder
Over the past few decades, electroencephalography (EEG) monitoring has become a pivotal tool for diagnosing neurological disorders, particularly for detecting seizures. Epilepsy, one of the most prevalent neurological diseases worldwide, affects approximately the 1 \% of the population. These patients face significant risks, underscoring the need for reliable, continuous seizure monitoring in daily life. Most of the techniques discussed in the literature rely on supervised Machine Learning (ML) methods. However, the challenge of accurately labeling variations in epileptic EEG waveforms complicates the use of these approaches. Additionally, the rarity of ictal events introduces an high imbalancing within the data, which could lead to poor prediction performance in supervised learning approaches. Instead, a semi-supervised approach allows to train the model only on data not containing seizures, thus avoiding the issues related to the data imbalancing. This work proposes a semi-supervised approach for detecting epileptic seizures from EEG data, utilizing a novel Deep Learning-based method called SincVAE. This proposal incorporates the learning of an ad-hoc array of bandpass filter as a first layer of a Variational Autoencoder (VAE), potentially eliminating the preprocessing stage where informative band frequencies are identified and isolated. Results indicate that SincVAE improves seizure detection in EEG data and is capable of identifying early seizures during the preictal stage as well as monitoring patients throughout the postictal stage.
☆ MedMNIST-C: Comprehensive benchmark and improved classifier robustness by simulating realistic image corruptions
The integration of neural-network-based systems into clinical practice is limited by challenges related to domain generalization and robustness. The computer vision community established benchmarks such as ImageNet-C as a fundamental prerequisite to measure progress towards those challenges. Similar datasets are largely absent in the medical imaging community which lacks a comprehensive benchmark that spans across imaging modalities and applications. To address this gap, we create and open-source MedMNIST-C, a benchmark dataset based on the MedMNIST+ collection covering 12 datasets and 9 imaging modalities. We simulate task and modality-specific image corruptions of varying severity to comprehensively evaluate the robustness of established algorithms against real-world artifacts and distribution shifts. We further provide quantitative evidence that our simple-to-use artificial corruptions allow for highly performant, lightweight data augmentation to enhance model robustness. Unlike traditional, generic augmentation strategies, our approach leverages domain knowledge, exhibiting significantly higher robustness when compared to widely adopted methods. By introducing MedMNIST-C and open-sourcing the corresponding library allowing for targeted data augmentations, we contribute to the development of increasingly robust methods tailored to the challenges of medical imaging. The code is available at https://github.com/francescodisalvo05/medmnistc-api}{github.com/francescodisalvo05/medmnistc-api.
☆ On the consistency of hyper-parameter selection in value-based deep reinforcement learning
Deep reinforcement learning (deep RL) has achieved tremendous success on various domains through a combination of algorithmic design and careful selection of hyper-parameters. Algorithmic improvements are often the result of iterative enhancements built upon prior approaches, while hyper-parameter choices are typically inherited from previous methods or fine-tuned specifically for the proposed technique. Despite their crucial impact on performance, hyper-parameter choices are frequently overshadowed by algorithmic advancements. This paper conducts an extensive empirical study focusing on the reliability of hyper-parameter selection for value-based deep reinforcement learning agents, including the introduction of a new score to quantify the consistency and reliability of various hyper-parameters. Our findings not only help establish which hyper-parameters are most critical to tune, but also help clarify which tunings remain consistent across different training regimes.
☆ Preserving Node Distinctness in Graph Autoencoders via Similarity Distillation
Graph autoencoders (GAEs), as a kind of generative self-supervised learning approach, have shown great potential in recent years. GAEs typically rely on distance-based criteria, such as mean-square-error (MSE), to reconstruct the input graph. However, relying solely on a single reconstruction criterion may lead to a loss of distinctiveness in the reconstructed graph, causing nodes to collapse into similar representations and resulting in sub-optimal performance. To address this issue, we have developed a simple yet effective strategy to preserve the necessary distinctness in the reconstructed graph. Inspired by the knowledge distillation technique, we found that the dual encoder-decoder architecture of GAEs can be viewed as a teacher-student relationship. Therefore, we propose transferring the knowledge of distinctness from the raw graph to the reconstructed graph, achieved through a simple KL constraint. Specifically, we compute pairwise node similarity scores in the raw graph and reconstructed graph. During the training process, the KL constraint is optimized alongside the reconstruction criterion. We conducted extensive experiments across three types of graph tasks, demonstrating the effectiveness and generality of our strategy. This indicates that the proposed approach can be employed as a plug-and-play method to avoid vague reconstructions and enhance overall performance.
☆ WAVE: Weight Template for Adaptive Initialization of Variable-sized Models
The expansion of model parameters underscores the significance of pre-trained models; however, the constraints encountered during model deployment necessitate models of variable sizes. Consequently, the traditional pre-training and fine-tuning paradigm fails to address the initialization problem when target models are incompatible with pre-trained models. We tackle this issue from a multitasking perspective and introduce \textbf{WAVE}, which incorporates a set of shared \textbf{W}eight templates for \textbf{A}daptive initialization of \textbf{V}ariable-siz\textbf{E}d Models. During initialization, target models will initialize the corresponding weight scalers tailored to their model size, which are sufficient to learn the connection rules of weight templates based on the Kronecker product from a limited amount of data. For the construction of the weight templates, WAVE utilizes the \textit{Learngene} framework, which structurally condenses common knowledge from ancestry models into weight templates as the learngenes through knowledge distillation. This process allows the integration of pre-trained models' knowledge into structured knowledge according to the rules of weight templates. We provide a comprehensive benchmark for the learngenes, and extensive experiments demonstrate the efficacy of WAVE. The results show that WAVE achieves state-of-the-art performance when initializing models with various depth and width, and even outperforms the direct pre-training of $n$ entire models, particularly for smaller models, saving approximately $n\times$ and $5\times$ in computational and storage resources, respectively. WAVE simultaneously achieves the most efficient knowledge transfer across a series of datasets, specifically achieving an average improvement of 1.8\% and 1.2\% on 7 downstream datasets.
☆ BricksRL: A Platform for Democratizing Robotics and Reinforcement Learning Research and Education with LEGO
We present BricksRL, a platform designed to democratize access to robotics for reinforcement learning research and education. BricksRL facilitates the creation, design, and training of custom LEGO robots in the real world by interfacing them with the TorchRL library for reinforcement learning agents. The integration of TorchRL with the LEGO hubs, via Bluetooth bidirectional communication, enables state-of-the-art reinforcement learning training on GPUs for a wide variety of LEGO builds. This offers a flexible and cost-efficient approach for scaling and also provides a robust infrastructure for robot-environment-algorithm communication. We present various experiments across tasks and robot configurations, providing built plans and training results. Furthermore, we demonstrate that inexpensive LEGO robots can be trained end-to-end in the real world to achieve simple tasks, with training times typically under 120 minutes on a normal laptop. Moreover, we show how users can extend the capabilities, exemplified by the successful integration of non-LEGO sensors. By enhancing accessibility to both robotics and reinforcement learning, BricksRL establishes a strong foundation for democratized robotic learning in research and educational settings.
☆ Towards Federated Low-Rank Adaptation with Rank-Heterogeneous Communication
Low-rank adaptation (LoRA) is an attractive alternative of adapting full weights for the federated fine-tuning of large pretrained models, which can significantly reduce the memory and communication burden. In principle, federated LoRA can provide an effective mean to allocate different resources to each client by tuning ranks for each client, which can be useful in achieving a better communication-performance tradeoff. We find, however, that the empirical performance of LoRA is highly unstable with respect to such rank-heterogeneity, severely limiting the applicability to the scenarios where it is desirable or even required to allocate nonuniform communication bandwidth to each client due to constrained total bandwidth. Our investigation reveals that the root cause of this instability is the zero-padding-based aggregation strategy adopted in conventional federated LoRA frameworks, which causes the information from high rank clients to get diluted during the aggregation process. To address this issue, we propose a new replication-based padding strategy, which allows us to better leverage the information from clients with high-quality datasets. This method ensures that valuable information from high rank clients is retained during the aggregation process, accelerating the convergence speed and enhancing the overall prediction quality of the global model.
☆ Performative Debias with Fair-exposure Optimization Driven by Strategic Agents in Recommender Systems KDD 2024
Data bias, e.g., popularity impairs the dynamics of two-sided markets within recommender systems. This overshadows the less visible but potentially intriguing long-tail items that could capture user interest. Despite the abundance of research surrounding this issue, it still poses challenges and remains a hot topic in academic circles. Along this line, in this paper, we developed a re-ranking approach in dynamic settings with fair-exposure optimization driven by strategic agents. Designed for the producer side, the execution of agents assumes content creators can modify item features based on strategic incentives to maximize their exposure. This iterative process entails an end-to-end optimization, employing differentiable ranking operators that simultaneously target accuracy and fairness. Joint objectives ensure the performance of recommendations while enhancing the visibility of tail items. We also leveraged the performativity nature of predictions to illustrate how strategic learning influences content creators to shift towards fairness efficiently, thereby incentivizing features of tail items. Through comprehensive experiments on both public and industrial datasets, we have substantiated the effectiveness and dominance of the proposed method especially on unveiling the potential of tail items.
comment: SIGKDD 2024 accepted paper
☆ Dynamic Scheduling for Vehicle-to-Vehicle Communications Enhanced Federated Learning
Leveraging the computing and sensing capabilities of vehicles, vehicular federated learning (VFL) has been applied to edge training for connected vehicles. The dynamic and interconnected nature of vehicular networks presents unique opportunities to harness direct vehicle-to-vehicle (V2V) communications, enhancing VFL training efficiency. In this paper, we formulate a stochastic optimization problem to optimize the VFL training performance, considering the energy constraints and mobility of vehicles, and propose a V2V-enhanced dynamic scheduling (VEDS) algorithm to solve it. The model aggregation requirements of VFL and the limited transmission time due to mobility result in a stepwise objective function, which presents challenges in solving the problem. We thus propose a derivative-based drift-plus-penalty method to convert the long-term stochastic optimization problem to an online mixed integer nonlinear programming (MINLP) problem, and provide a theoretical analysis to bound the performance gap between the online solution and the offline optimal solution. Further analysis of the scheduling priority reduces the original problem into a set of convex optimization problems, which are efficiently solved using the interior-point method. Experimental results demonstrate that compared with the state-of-the-art benchmarks, the proposed algorithm enhances the image classification accuracy on the CIFAR-10 dataset by 3.18% and reduces the average displacement errors on the Argoverse trajectory prediction dataset by 10.21%.
comment: Submitted to IEEE for possible publication
☆ Early learning of the optimal constant solution in neural networks and humans
Deep neural networks learn increasingly complex functions over the course of training. Here, we show both empirically and theoretically that learning of the target function is preceded by an early phase in which networks learn the optimal constant solution (OCS) - that is, initial model responses mirror the distribution of target labels, while entirely ignoring information provided in the input. Using a hierarchical category learning task, we derive exact solutions for learning dynamics in deep linear networks trained with bias terms. Even when initialized to zero, this simple architectural feature induces substantial changes in early dynamics. We identify hallmarks of this early OCS phase and illustrate how these signatures are observed in deep linear networks and larger, more complex (and nonlinear) convolutional neural networks solving a hierarchical learning task based on MNIST and CIFAR10. We explain these observations by proving that deep linear networks necessarily learn the OCS during early learning. To further probe the generality of our results, we train human learners over the course of three days on the category learning task. We then identify qualitative signatures of this early OCS phase in terms of the dynamics of true negative (correct-rejection) rates. Surprisingly, we find the same early reliance on the OCS in the behaviour of human learners. Finally, we show that learning of the OCS can emerge even in the absence of bias terms and is equivalently driven by generic correlations in the input data. Overall, our work suggests the OCS as a universal learning principle in supervised, error-corrective learning, and the mechanistic reasons for its prevalence.
☆ Mind the Graph When Balancing Data for Fairness or Robustness
Failures of fairness or robustness in machine learning predictive settings can be due to undesired dependencies between covariates, outcomes and auxiliary factors of variation. A common strategy to mitigate these failures is data balancing, which attempts to remove those undesired dependencies. In this work, we define conditions on the training distribution for data balancing to lead to fair or robust models. Our results display that, in many cases, the balanced distribution does not correspond to selectively removing the undesired dependencies in a causal graph of the task, leading to multiple failure modes and even interference with other mitigation techniques such as regularization. Overall, our results highlight the importance of taking the causal graph into account before performing data balancing.
☆ A Critical Analysis of the Theoretical Framework of the Extreme Learning Machine
Despite the number of successful applications of the Extreme Learning Machine (ELM), we show that its underlying foundational principles do not have a rigorous mathematical justification. Specifically, we refute the proofs of two main statements, and we also create a dataset that provides a counterexample to the ELM learning algorithm and explain its design, which leads to many such counterexamples. Finally, we provide alternative statements of the foundations, which justify the efficiency of ELM in some theoretical cases.
☆ CuDA2: An approach for Incorporating Traitor Agents into Cooperative Multi-Agent Systems
Cooperative Multi-Agent Reinforcement Learning (CMARL) strategies are well known to be vulnerable to adversarial perturbations. Previous works on adversarial attacks have primarily focused on white-box attacks that directly perturb the states or actions of victim agents, often in scenarios with a limited number of attacks. However, gaining complete access to victim agents in real-world environments is exceedingly difficult. To create more realistic adversarial attacks, we introduce a novel method that involves injecting traitor agents into the CMARL system. We model this problem as a Traitor Markov Decision Process (TMDP), where traitors cannot directly attack the victim agents but can influence their formation or positioning through collisions. In TMDP, traitors are trained using the same MARL algorithm as the victim agents, with their reward function set as the negative of the victim agents' reward. Despite this, the training efficiency for traitors remains low because it is challenging for them to directly associate their actions with the victim agents' rewards. To address this issue, we propose the Curiosity-Driven Adversarial Attack (CuDA2) framework. CuDA2 enhances the efficiency and aggressiveness of attacks on the specified victim agents' policies while maintaining the optimal policy invariance of the traitors. Specifically, we employ a pre-trained Random Network Distillation (RND) module, where the extra reward generated by the RND module encourages traitors to explore states unencountered by the victim agents. Extensive experiments on various scenarios from SMAC demonstrate that our CuDA2 framework offers comparable or superior adversarial attack capabilities compared to other baselines.
☆ SE-VGAE: Unsupervised Disentangled Representation Learning for Interpretable Architectural Layout Design Graph Generation
Despite the suitability of graphs for capturing the relational structures inherent in architectural layout designs, there is a notable dearth of research on interpreting architectural design space using graph-based representation learning and exploring architectural design graph generation. Concurrently, disentangled representation learning in graph generation faces challenges such as node permutation invariance and representation expressiveness. To address these challenges, we introduce an unsupervised disentangled representation learning framework, Style-based Edge-augmented Variational Graph Auto-Encoder (SE-VGAE), aiming to generate architectural layout in the form of attributed adjacency multi-graphs while prioritizing representation disentanglement. The framework is designed with three alternative pipelines, each integrating a transformer-based edge-augmented encoder, a latent space disentanglement module, and a style-based decoder. These components collectively facilitate the decomposition of latent factors influencing architectural layout graph generation, enhancing generation fidelity and diversity. We also provide insights into optimizing the framework by systematically exploring graph feature augmentation schemes and evaluating their effectiveness for disentangling architectural layout representation through extensive experiments. Additionally, we contribute a new benchmark large-scale architectural layout graph dataset extracted from real-world floor plan images to facilitate the exploration of graph data-based architectural design representation space interpretation. This study pioneered disentangled representation learning for the architectural layout graph generation. The code and dataset of this study will be open-sourced.
☆ Variable Layer-Wise Quantization: A Simple and Effective Approach to Quantize LLMs EMNLP
We present a simple variable quantization approach that quantizes different layers of a large language model (LLM) at different bit levels. Specifically, we quantize the most important layers to higher bit precision and less important layers to lower bits to achieve floating point quantization levels. We propose two effective strategies to measure the importance of layers within LLMs: the first measures the importance of a layer based on how different its output embeddings are from the input embeddings (the higher the better); the second estimates the importance of a layer using the number of layer weights that are much larger than average (the smaller the better). We show that quantizing different layers at varying bits according to our importance scores results in minimal performance drop with a far more compressed model size. Finally, we present several practical key takeaways from our variable layer-wise quantization experiments: (a) LLM performance under variable quantization remains close to the original model until 25-50% of layers are moved in lower quantization using our proposed ordering but only until 5-10% if moved using no specific ordering; (b) Quantizing LLMs to lower bits performs substantially better than pruning unless extreme quantization (2-bit) is used; and (c) Layer-wise quantization to lower bits works better in the case of larger LLMs with more layers compared to smaller LLMs with fewer layers. The code used to run the experiments is available at: https://github.com/RazvanDu/LayerwiseQuant.
comment: submitted to EMNLP, 15 pages, 10 figures, 4 tables
☆ Make Some Noise: Unlocking Language Model Parallel Inference Capability through Noisy Training
Existing speculative decoding methods typically require additional model structure and training processes to assist the model for draft token generation. This makes the migration of acceleration methods to the new model more costly and more demanding on device memory. To address this problem, we propose the Make Some Noise (MSN) training framework as a replacement for the supervised fine-tuning stage of the large language model. The training method simply introduces some noise at the input for the model to learn the denoising task. It significantly enhances the parallel decoding capability of the model without affecting the original task capability. In addition, we propose a tree-based retrieval-augmented Jacobi (TR-Jacobi) decoding strategy to further improve the inference speed of MSN models. Experiments in both the general and code domains have shown that MSN can improve inference speed by 2.3-2.7x times without compromising model performance. The MSN model also achieves comparable acceleration ratios to the SOTA model with additional model structure on Spec-Bench.
comment: 11 pages, 6 figures
☆ GradCheck: Analyzing classifier guidance gradients for conditional diffusion sampling
To sample from an unconditionally trained Denoising Diffusion Probabilistic Model (DDPM), classifier guidance adds conditional information during sampling, but the gradients from classifiers, especially those not trained on noisy images, are often unstable. This study conducts a gradient analysis comparing robust and non-robust classifiers, as well as multiple gradient stabilization techniques. Experimental results demonstrate that these techniques significantly improve the quality of class-conditional samples for non-robust classifiers by providing more stable and informative classifier guidance gradients. The findings highlight the importance of gradient stability in enhancing the performance of classifier guidance, especially on non-robust classifiers.
☆ Double Momentum Method for Lower-Level Constrained Bilevel Optimization
Bilevel optimization (BO) has recently gained prominence in many machine learning applications due to its ability to capture the nested structure inherent in these problems. Recently, many hypergradient methods have been proposed as effective solutions for solving large-scale problems. However, current hypergradient methods for the lower-level constrained bilevel optimization (LCBO) problems need very restrictive assumptions, namely, where optimality conditions satisfy the differentiability and invertibility conditions and lack a solid analysis of the convergence rate. What's worse, existing methods require either double-loop updates, which are sometimes less efficient. To solve this problem, in this paper, we propose a new hypergradient of LCBO leveraging the theory of nonsmooth implicit function theorem instead of using the restrive assumptions. In addition, we propose a \textit{single-loop single-timescale} algorithm based on the double-momentum method and adaptive step size method and prove it can return a $(\delta, \epsilon)$-stationary point with $\tilde{\mathcal{O}}(d_2^2\epsilon^{-4})$ iterations. Experiments on two applications demonstrate the effectiveness of our proposed method.
comment: 27pages, 9 figures
☆ Forget but Recall: Incremental Latent Rectification in Continual Learning
Intrinsic capability to continuously learn a changing data stream is a desideratum of deep neural networks (DNNs). However, current DNNs suffer from catastrophic forgetting, which hinders remembering past knowledge. To mitigate this issue, existing Continual Learning (CL) approaches either retain exemplars for replay, regularize learning, or allocate dedicated capacity for new tasks. This paper investigates an unexplored CL direction for incremental learning called Incremental Latent Rectification or ILR. In a nutshell, ILR learns to propagate with correction (or rectify) the representation from the current trained DNN backward to the representation space of the old task, where performing predictive decisions is easier. This rectification process only employs a chain of small representation mapping networks, called rectifier units. Empirical experiments on several continual learning benchmarks, including CIFAR10, CIFAR100, and Tiny ImageNet, demonstrate the effectiveness and potential of this novel CL direction compared to existing representative CL methods.
☆ Generalizability of experimental studies
Experimental studies are a cornerstone of machine learning (ML) research. A common, but often implicit, assumption is that the results of a study will generalize beyond the study itself, e.g. to new data. That is, there is a high probability that repeating the study under different conditions will yield similar results. Despite the importance of the concept, the problem of measuring generalizability remains open. This is probably due to the lack of a mathematical formalization of experimental studies. In this paper, we propose such a formalization and develop a quantifiable notion of generalizability. This notion allows to explore the generalizability of existing studies and to estimate the number of experiments needed to achieve the generalizability of new studies. To demonstrate its usefulness, we apply it to two recently published benchmarks to discern generalizable and non-generalizable results. We also publish a Python module that allows our analysis to be repeated for other experimental studies.
comment: Under review
☆ Development of a digital tool for monitoring the behaviour of pre-weaned calves using accelerometer neck-collars
Automatic monitoring of calf behaviour is a promising way of assessing animal welfare from their first week on farms. This study aims to (i) develop machine learning models from accelerometer data to classify the main behaviours of pre-weaned calves and (ii) set up a digital tool for monitoring the behaviour of pre-weaned calves from the models' prediction. Thirty pre-weaned calves were equipped with a 3-D accelerometer attached to a neck-collar for two months and filmed simultaneously. The behaviours were annotated, resulting in 27.4 hours of observation aligned with the accelerometer data. The time-series were then split into 3 seconds windows. Two machine learning models were tuned using data from 80% of the calves: (i) a Random Forest model to classify between active and inactive behaviours using a set of 11 hand-craft features [model 1] and (ii) a RidgeClassifierCV model to classify between lying, running, drinking milk and other behaviours using ROCKET features [model 2]. The performance of the models was tested using data from the remaining 20% of the calves. Model 1 achieved a balanced accuracy of 0.92. Model 2 achieved a balanced accuracy of 0.84. Behavioural metrics such as daily activity ratio and episodes of running, lying, drinking milk, and other behaviours expressed over time were deduced from the predictions. All the development was finally embedded into a Python dashboard so that the individual calf metrics could be displayed directly from the raw accelerometer files.
☆ Stacked Confusion Reject Plots (SCORE)
Machine learning is more and more applied in critical application areas like health and driver assistance. To minimize the risk of wrong decisions, in such applications it is necessary to consider the certainty of a classification to reject uncertain samples. An established tool for this are reject curves that visualize the trade-off between the number of rejected samples and classification performance metrics. We argue that common reject curves are too abstract and hard to interpret by non-experts. We propose Stacked Confusion Reject Plots (SCORE) that offer a more intuitive understanding of the used data and the classifier's behavior. We present example plots on artificial Gaussian data to document the different options of SCORE and provide the code as a Python package.
comment: 6 pages, 2 figures
☆ Generative Modelling of Structurally Constrained Graphs
Graph diffusion models have emerged as state-of-the-art techniques in graph generation, yet integrating domain knowledge into these models remains challenging. Domain knowledge is particularly important in real-world scenarios, where invalid generated graphs hinder deployment in practical applications. Unconstrained and conditioned graph generative models fail to guarantee such domain-specific structural properties. We present ConStruct, a novel framework that allows for hard-constraining graph diffusion models to incorporate specific properties, such as planarity or acyclicity. Our approach ensures that the sampled graphs remain within the domain of graphs that verify the specified property throughout the entire trajectory in both the forward and reverse processes. This is achieved by introducing a specific edge-absorbing noise model and a new projector operator. ConStruct demonstrates versatility across several structural and edge-deletion invariant constraints and achieves state-of-the-art performance for both synthetic benchmarks and attributed real-world datasets. For example, by leveraging planarity in digital pathology graph datasets, the proposed method outperforms existing baselines and enhances generated data validity by up to 71.1 percentage points.
☆ Robustly Optimized Deep Feature Decoupling Network for Fatty Liver Diseases Detection MICCAI 2024
Current medical image classification efforts mainly aim for higher average performance, often neglecting the balance between different classes. This can lead to significant differences in recognition accuracy between classes and obvious recognition weaknesses. Without the support of massive data, deep learning faces challenges in fine-grained classification of fatty liver. In this paper, we propose an innovative deep learning framework that combines feature decoupling and adaptive adversarial training. Firstly, we employ two iteratively compressed decouplers to supervised decouple common features and specific features related to fatty liver in abdominal ultrasound images. Subsequently, the decoupled features are concatenated with the original image after transforming the color space and are fed into the classifier. During adversarial training, we adaptively adjust the perturbation and balance the adversarial strength by the accuracy of each class. The model will eliminate recognition weaknesses by correctly classifying adversarial samples, thus improving recognition robustness. Finally, the accuracy of our method improved by 4.16%, achieving 82.95%. As demonstrated by extensive experiments, our method is a generalized learning framework that can be directly used to eliminate the recognition weaknesses of any classifier while improving its average performance. Code is available at https://github.com/HP-ML/MICCAI2024.
comment: MICCAI 2024
☆ A Thorough Performance Benchmarking on Lightweight Embedding-based Recommender Systems
Since the creation of the Web, recommender systems (RSs) have been an indispensable mechanism in information filtering. State-of-the-art RSs primarily depend on categorical features, which ecoded by embedding vectors, resulting in excessively large embedding tables. To prevent over-parameterized embedding tables from harming scalability, both academia and industry have seen increasing efforts in compressing RS embeddings. However, despite the prosperity of lightweight embedding-based RSs (LERSs), a wide diversity is seen in evaluation protocols, resulting in obstacles when relating LERS performance to real-world usability. Moreover, despite the common goal of lightweight embeddings, LERSs are evaluated with a single choice between the two main recommendation tasks -- collaborative filtering and content-based recommendation. This lack of discussions on cross-task transferability hinders the development of unified, more scalable solutions. Motivated by these issues, this study investigates various LERSs' performance, efficiency, and cross-task transferability via a thorough benchmarking process. Additionally, we propose an efficient embedding compression method using magnitude pruning, which is an easy-to-deploy yet highly competitive baseline that outperforms various complex LERSs. Our study reveals the distinct performance of LERSs across the two tasks, shedding light on their effectiveness and generalizability. To support edge-based recommendations, we tested all LERSs on a Raspberry Pi 4, where the efficiency bottleneck is exposed. Finally, we conclude this paper with critical summaries of LERS performance, model selection suggestions, and underexplored challenges around LERSs for future research. To encourage future research, we publish source codes and artifacts at \href{this link}{https://github.com/chenxing1999/recsys-benchmark}.
☆ XAMI -- A Benchmark Dataset for Artefact Detection in XMM-Newton Optical Images SP
Reflected or scattered light produce artefacts in astronomical observations that can negatively impact the scientific study. Hence, automated detection of these artefacts is highly beneficial, especially with the increasing amounts of data gathered. Machine learning methods are well-suited to this problem, but currently there is a lack of annotated data to train such approaches to detect artefacts in astronomical observations. In this work, we present a dataset of images from the XMM-Newton space telescope Optical Monitoring camera showing different types of artefacts. We hand-annotated a sample of 1000 images with artefacts which we use to train automated ML methods. We further demonstrate techniques tailored for accurate detection and masking of artefacts using instance segmentation. We adopt a hybrid approach, combining knowledge from both convolutional neural networks (CNNs) and transformer-based models and use their advantages in segmentation. The presented method and dataset will advance artefact detection in astronomical observations by providing a reproducible baseline. All code and data are made available (https://github.com/ESA-Datalabs/XAMI-model and https://github.com/ESA-Datalabs/XAMI-dataset).
comment: submitted to SPAICE 2024
☆ ALPBench: A Benchmark for Active Learning Pipelines on Tabular Data
In settings where only a budgeted amount of labeled data can be afforded, active learning seeks to devise query strategies for selecting the most informative data points to be labeled, aiming to enhance learning algorithms' efficiency and performance. Numerous such query strategies have been proposed and compared in the active learning literature. However, the community still lacks standardized benchmarks for comparing the performance of different query strategies. This particularly holds for the combination of query strategies with different learning algorithms into active learning pipelines and examining the impact of the learning algorithm choice. To close this gap, we propose ALPBench, which facilitates the specification, execution, and performance monitoring of active learning pipelines. It has built-in measures to ensure evaluations are done reproducibly, saving exact dataset splits and hyperparameter settings of used algorithms. In total, ALPBench consists of 86 real-world tabular classification datasets and 5 active learning settings, yielding 430 active learning problems. To demonstrate its usefulness and broad compatibility with various learning algorithms and query strategies, we conduct an exemplary study evaluating 9 query strategies paired with 8 learning algorithms in 2 different settings. We provide ALPBench here: https://github.com/ValentinMargraf/ActiveLearningPipelines.
☆ A review of unsupervised learning in astronomy
This review summarizes popular unsupervised learning methods, and gives an overview of their past, current, and future uses in astronomy. Unsupervised learning aims to organise the information content of a dataset, in such a way that knowledge can be extracted. Traditionally this has been achieved through dimensionality reduction techniques that aid the ranking of a dataset, for example through principal component analysis or by using auto-encoders, or simpler visualisation of a high dimensional space, for example through the use of a self organising map. Other desirable properties of unsupervised learning include the identification of clusters, i.e. groups of similar objects, which has traditionally been achieved by the k-means algorithm and more recently through density-based clustering such as HDBSCAN. More recently, complex frameworks have emerged, that chain together dimensionality reduction and clustering methods. However, no dataset is fully unknown. Thus, nowadays a lot of research has been directed towards self-supervised and semi-supervised methods that stand to gain from both supervised and unsupervised learning.
comment: 30 pages, 6 figures. Invited contribution to special issue in Astronomy & Computing
☆ Improving Realized LGD Approximation: A Novel Framework with XGBoost for Handling Missing Cash-Flow Data
The scope for the accurate calculation of the Loss Given Default (LGD) parameter is comprehensive in terms of financial data. In this research, we aim to explore methods for improving the approximation of realized LGD in conditions of limited access to the cash-flow data. We enhance the performance of the method which relies on the differences between exposure values (delta outstanding approach) by employing machine learning (ML) techniques. The research utilizes the data from the mortgage portfolio of one of the European countries and assumes a close resemblance to similar economic contexts. It incorporates non-financial variables and macroeconomic data related to the housing market, improving the accuracy of loss severity approximation. The proposed methodology attempts to mitigate the country-specific (related to the local legal) or portfolio-specific factors in aim to show the general advantage of applying ML techniques, rather than case-specific relation. We developed an XGBoost model that does not rely on cash-flow data yet enhances the accuracy of realized LGD estimation compared to results obtained with the delta outstanding approach. A novel aspect of our work is the detailed exploration of the delta outstanding approach and the methodology for addressing conditions of limited access to cash-flow data through machine learning models.
comment: 36 pages, 5 figures, 9 tables
☆ Towards Efficient and Scalable Training of Differentially Private Deep Learning ICML 2024
Differentially private stochastic gradient descent (DP-SGD) is the standard algorithm for training machine learning models under differential privacy (DP). The major drawback of DP-SGD is the drop in utility which prior work has comprehensively studied. However, in practice another major drawback that hinders the large-scale deployment is the significantly higher computational cost. We conduct a comprehensive empirical study to quantify the computational cost of training deep learning models under DP and benchmark methods that aim at reducing the cost. Among these are more efficient implementations of DP-SGD and training with lower precision. Finally, we study the scaling behaviour using up to 80 GPUs.
comment: 15 pages, 12 figures, Accepted to the Workshop on Advancing Neural Network Training at International Conference on Machine Learning (WANT@ICML 2024)
☆ BlockLLM: Memory-Efficient Adaptation of LLMs by Selecting and Optimizing the Right Coordinate Blocks
Training large language models (LLMs) for pretraining or adapting to new tasks and domains has become increasingly critical as their applications expand. However, as the model and the data sizes grow, the training process presents significant memory challenges, often requiring a prohibitive amount of GPU memory that may not be readily available. Existing methods such as low-rank adaptation (LoRA) add trainable low-rank matrix factorizations, altering the training dynamics and limiting the model's parameter search to a low-rank subspace. GaLore, a more recent method, employs Gradient Low-Rank Projection to reduce the memory footprint, in the full parameter training setting. However GaLore can only be applied to a subset of the LLM layers that satisfy the "reversibility" property, thus limiting their applicability. In response to these challenges, we introduce BlockLLM, an approach inspired by block coordinate descent. Our method carefully selects and updates a very small subset of the trainable parameters without altering any part of its architecture and training procedure. BlockLLM achieves state-of-the-art performance in both finetuning and pretraining tasks, while reducing the memory footprint of the underlying optimization process. Our experiments demonstrate that fine-tuning with only less than 5% of the parameters, BlockLLM achieves state-of-the-art perplexity scores on the GLUE benchmarks. On Llama model pretrained on C4 dataset, BlockLLM is able to train with significantly less memory than the state-of-the-art, while still maintaining competitive performance.
comment: 16 pages, 7 figures
☆ MatText: Do Language Models Need More than Text & Scale for Materials Modeling?
Effectively representing materials as text has the potential to leverage the vast advancements of large language models (LLMs) for discovering new materials. While LLMs have shown remarkable success in various domains, their application to materials science remains underexplored. A fundamental challenge is the lack of understanding of how to best utilize text-based representations for materials modeling. This challenge is further compounded by the absence of a comprehensive benchmark to rigorously evaluate the capabilities and limitations of these text representations in capturing the complexity of material systems. To address this gap, we propose MatText, a suite of benchmarking tools and datasets designed to systematically evaluate the performance of language models in modeling materials. MatText encompasses nine distinct text-based representations for material systems, including several novel representations. Each representation incorporates unique inductive biases that capture relevant information and integrate prior physical knowledge about materials. Additionally, MatText provides essential tools for training and benchmarking the performance of language models in the context of materials science. These tools include standardized dataset splits for each representation, probes for evaluating sensitivity to geometric factors, and tools for seamlessly converting crystal structures into text. Using MatText, we conduct an extensive analysis of the capabilities of language models in modeling materials. Our findings reveal that current language models consistently struggle to capture the geometric information crucial for materials modeling across all representations. Instead, these models tend to leverage local information, which is emphasized in some of our novel representations. Our analysis underscores MatText's ability to reveal shortcomings of text-based methods for materials design.
☆ EON-1: A Brain-Inspired Processor for Near-Sensor Extreme Edge Online Feature Extraction
For Edge AI applications, deploying online learning and adaptation on resource-constrained embedded devices can deal with fast sensor-generated streams of data in changing environments. However, since maintaining low-latency and power-efficient inference is paramount at the Edge, online learning and adaptation on the device should impose minimal additional overhead for inference. With this goal in mind, we explore energy-efficient learning and adaptation on-device for streaming-data Edge AI applications using Spiking Neural Networks (SNNs), which follow the principles of brain-inspired computing, such as high-parallelism, neuron co-located memory and compute, and event-driven processing. We propose EON-1, a brain-inspired processor for near-sensor extreme edge online feature extraction, that integrates a fast online learning and adaptation algorithm. We report results of only 1% energy overhead for learning, by far the lowest overhead when compared to other SoTA solutions, while attaining comparable inference accuracy. Furthermore, we demonstrate that EON-1 is up for the challenge of low-latency processing of HD and UHD streaming video in real-time, with learning enabled.
☆ Distance Recomputator and Topology Reconstructor for Graph Neural Networks
This paper introduces novel methodologies, the Distance Recomputator and Topology Reconstructor, aimed at enhancing Graph Neural Networks (GNNs). The Distance Recomputator dynamically recalibrates node distances within k-hop neighborhoods using a dynamic encoding scheme, thereby improving the accuracy and adaptability of node representations. Concurrently, the Topology Reconstructor adjusts local graph structures based on computed "similarity distances," optimizing network configurations for improved learning outcomes. These methods address the limitations of static node representations and fixed aggregation schemes in traditional GNNs, offering a more nuanced approach to modeling complex and dynamic graph topologies. Furthermore, our experimental evaluations demonstrate significant performance advantages over existing methods across various benchmark datasets. The proposed Distance Recomputator and Topology Reconstructor not only enhance node relationship modeling accuracy but also optimize information aggregation efficiency through an asynchronous aggregation mechanism. This approach proves particularly effective in scenarios involving dynamic or large-scale graphs, showcasing the methods' robustness and applicability in real-world graph learning tasks.
☆ Can We Trust the Performance Evaluation of Uncertainty Estimation Methods in Text Summarization?
Text summarization, a key natural language generation (NLG) task, is vital in various domains. However, the high cost of inaccurate summaries in risk-critical applications, particularly those involving human-in-the-loop decision-making, raises concerns about the reliability of uncertainty estimation on text summarization (UE-TS) evaluation methods. This concern stems from the dependency of uncertainty model metrics on diverse and potentially conflicting NLG metrics. To address this issue, we introduce a comprehensive UE-TS benchmark incorporating 31 NLG metrics across four dimensions. The benchmark evaluates the uncertainty estimation capabilities of two large language models and one pre-trained language model on three datasets, with human-annotation analysis incorporated where applicable. We also assess the performance of 14 common uncertainty estimation methods within this benchmark. Our findings emphasize the importance of considering multiple uncorrelated NLG metrics and diverse uncertainty estimation methods to ensure reliable and efficient evaluation of UE-TS techniques.
comment: 63 pages, 41 figures, 11 tables
☆ A Comprehensive Solution to Connect Speech Encoder and Large Language Model for ASR
Recent works have shown promising results in connecting speech encoders to large language models (LLMs) for speech recognition. However, several limitations persist, including limited fine-tuning options, a lack of mechanisms to enforce speech-text alignment, and high insertion errors especially in domain mismatch conditions. This paper presents a comprehensive solution to address these issues. We begin by investigating more thoughtful fine-tuning schemes. Next, we propose a matching loss to enhance alignment between modalities. Finally, we explore training and inference methods to mitigate high insertion errors. Experimental results on the Librispeech corpus demonstrate that partially fine-tuning the encoder and LLM using parameter-efficient methods, such as LoRA, is the most cost-effective approach. Additionally, the matching loss improves modality alignment, enhancing performance. The proposed training and inference methods significantly reduce insertion errors.
☆ AG-LSEC: Audio Grounded Lexical Speaker Error Correction INTERSPEECH 2024
Speaker Diarization (SD) systems are typically audio-based and operate independently of the ASR system in traditional speech transcription pipelines and can have speaker errors due to SD and/or ASR reconciliation, especially around speaker turns and regions of speech overlap. To reduce these errors, a Lexical Speaker Error Correction (LSEC), in which an external language model provides lexical information to correct the speaker errors, was recently proposed. Though the approach achieves good Word Diarization error rate (WDER) improvements, it does not use any additional acoustic information and is prone to miscorrections. In this paper, we propose to enhance and acoustically ground the LSEC system with speaker scores directly derived from the existing SD pipeline. This approach achieves significant relative WDER reductions in the range of 25-40% over the audio-based SD, ASR system and beats the LSEC system by 15-25% relative on RT03-CTS, Callhome American English and Fisher datasets.
comment: Accepted at INTERSPEECH 2024
☆ Efficient, Multimodal, and Derivative-Free Bayesian Inference With Fisher-Rao Gradient Flows
In this paper, we study efficient approximate sampling for probability distributions known up to normalization constants. We specifically focus on a problem class arising in Bayesian inference for large-scale inverse problems in science and engineering applications. The computational challenges we address with the proposed methodology are: (i) the need for repeated evaluations of expensive forward models; (ii) the potential existence of multiple modes; and (iii) the fact that gradient of, or adjoint solver for, the forward model might not be feasible. While existing Bayesian inference methods meet some of these challenges individually, we propose a framework that tackles all three systematically. Our approach builds upon the Fisher-Rao gradient flow in probability space, yielding a dynamical system for probability densities that converges towards the target distribution at a uniform exponential rate. This rapid convergence is advantageous for the computational burden outlined in (i). We apply Gaussian mixture approximations with operator splitting techniques to simulate the flow numerically; the resulting approximation can capture multiple modes thus addressing (ii). Furthermore, we employ the Kalman methodology to facilitate a derivative-free update of these Gaussian components and their respective weights, addressing the issue in (iii). The proposed methodology results in an efficient derivative-free sampler flexible enough to handle multi-modal distributions: Gaussian Mixture Kalman Inversion (GMKI). The effectiveness of GMKI is demonstrated both theoretically and numerically in several experiments with multimodal target distributions, including proof-of-concept and two-dimensional examples, as well as a large-scale application: recovering the Navier-Stokes initial condition from solution data at positive times.
comment: 42 pages, 9 figures
☆ TopoGCL: Topological Graph Contrastive Learning
Graph contrastive learning (GCL) has recently emerged as a new concept which allows for capitalizing on the strengths of graph neural networks (GNNs) to learn rich representations in a wide variety of applications which involve abundant unlabeled information. However, existing GCL approaches largely tend to overlook the important latent information on higher-order graph substructures. We address this limitation by introducing the concepts of topological invariance and extended persistence on graphs to GCL. In particular, we propose a new contrastive mode which targets topological representations of the two augmented views from the same graph, yielded by extracting latent shape properties of the graph at multiple resolutions. Along with the extended topological layer, we introduce a new extended persistence summary, namely, extended persistence landscapes (EPL) and derive its theoretical stability guarantees. Our extensive numerical results on biological, chemical, and social interaction graphs show that the new Topological Graph Contrastive Learning (TopoGCL) model delivers significant performance gains in unsupervised graph classification for 11 out of 12 considered datasets and also exhibits robustness under noisy scenarios.
☆ Unlocking Continual Learning Abilities in Language Models
Language models (LMs) exhibit impressive performance and generalization capabilities. However, LMs struggle with the persistent challenge of catastrophic forgetting, which undermines their long-term sustainability in continual learning (CL). Existing approaches usually address the issue by incorporating old task data or task-wise inductive bias into LMs. However, old data and accurate task information are often unavailable or costly to collect, hindering the availability of current CL approaches for LMs. To address this limitation, we introduce $\textbf{MIGU}$ ($\textbf{M}$agn$\textbf{I}$tude-based $\textbf{G}$radient $\textbf{U}$pdating for continual learning), a rehearsal-free and task-label-free method that only updates the model parameters with large magnitudes of output in LMs' linear layers. MIGU is based on our observation that the L1-normalized magnitude distribution of the output in LMs' linear layers is different when the LM models deal with different task data. By imposing this simple constraint on the gradient update process, we can leverage the inherent behaviors of LMs, thereby unlocking their innate CL abilities. Our experiments demonstrate that MIGU is universally applicable to all three LM architectures (T5, RoBERTa, and Llama2), delivering state-of-the-art or on-par performance across continual finetuning and continual pre-training settings on four CL benchmarks. For example, MIGU brings a 15.2% average accuracy improvement over conventional parameter-efficient finetuning baselines in a 15-task CL benchmark. MIGU can also seamlessly integrate with all three existing CL types to further enhance performance. Code is available at \href{https://github.com/wenyudu/MIGU}{this https URL}.
comment: preprint, 19 pages
☆ Expansive Synthesis: Generating Large-Scale Datasets from Minimal Samples
The challenge of limited availability of data for training in machine learning arises in many applications and the impact on performance and generalization is serious. Traditional data augmentation methods aim to enhance training with a moderately sufficient data set. Generative models like Generative Adversarial Networks (GANs) often face problematic convergence when generating significant and diverse data samples. Diffusion models, though effective, still struggle with high computational cost and long training times. This paper introduces an innovative Expansive Synthesis model that generates large-scale, high-fidelity datasets from minimal samples. The proposed approach exploits expander graph mappings and feature interpolation to synthesize expanded datasets while preserving the intrinsic data distribution and feature structural relationships. The rationale of the model is rooted in the non-linear property of neural networks' latent space and in its capture by a Koopman operator to yield a linear space of features to facilitate the construction of larger and enriched consistent datasets starting with a much smaller dataset. This process is optimized by an autoencoder architecture enhanced with self-attention layers and further refined for distributional consistency by optimal transport. We validate our Expansive Synthesis by training classifiers on the generated datasets and comparing their performance to classifiers trained on larger, original datasets. Experimental results demonstrate that classifiers trained on synthesized data achieve performance metrics on par with those trained on full-scale datasets, showcasing the model's potential to effectively augment training data. This work represents a significant advancement in data generation, offering a robust solution to data scarcity and paving the way for enhanced data availability in machine learning applications.
comment: 14 pages. arXiv admin note: text overlap with arXiv:2405.13866
☆ Inherent Challenges of Post-Hoc Membership Inference for Large Language Models
Large Language Models (LLMs) are often trained on vast amounts of undisclosed data, motivating the development of post-hoc Membership Inference Attacks (MIAs) to gain insight into their training data composition. However, in this paper, we identify inherent challenges in post-hoc MIA evaluation due to potential distribution shifts between collected member and non-member datasets. Using a simple bag-of-words classifier, we demonstrate that datasets used in recent post-hoc MIAs suffer from significant distribution shifts, in some cases achieving near-perfect distinction between members and non-members. This implies that previously reported high MIA performance may be largely attributable to these shifts rather than model memorization. We confirm that randomized, controlled setups eliminate such shifts and thus enable the development and fair evaluation of new MIAs. However, we note that such randomized setups are rarely available for the latest LLMs, making post-hoc data collection still required to infer membership for real-world LLMs. As a potential solution, we propose a Regression Discontinuity Design (RDD) approach for post-hoc data collection, which substantially mitigates distribution shifts. Evaluating various MIA methods on this RDD setup yields performance barely above random guessing, in stark contrast to previously reported results. Overall, our findings highlight the challenges in accurately measuring LLM memorization and the need for careful experimental design in (post-hoc) membership inference tasks.
☆ LABOR-LLM: Language-Based Occupational Representations with Large Language Models
Many empirical studies of labor market questions rely on estimating relatively simple predictive models using small, carefully constructed longitudinal survey datasets based on hand-engineered features. Large Language Models (LLMs), trained on massive datasets, encode vast quantities of world knowledge and can be used for the next job prediction problem. However, while an off-the-shelf LLM produces plausible career trajectories when prompted, the probability with which an LLM predicts a particular job transition conditional on career history will not, in general, align with the true conditional probability in a given population. Recently, Vafa et al. (2024) introduced a transformer-based "foundation model", CAREER, trained using a large, unrepresentative resume dataset, that predicts transitions between jobs; it further demonstrated how transfer learning techniques can be used to leverage the foundation model to build better predictive models of both transitions and wages that reflect conditional transition probabilities found in nationally representative survey datasets. This paper considers an alternative where the fine-tuning of the CAREER foundation model is replaced by fine-tuning LLMs. For the task of next job prediction, we demonstrate that models trained with our approach outperform several alternatives in terms of predictive performance on the survey data, including traditional econometric models, CAREER, and LLMs with in-context learning, even though the LLM can in principle predict job titles that are not allowed in the survey data. Further, we show that our fine-tuned LLM-based models' predictions are more representative of the career trajectories of various workforce subpopulations than off-the-shelf LLM models and CAREER. We conduct experiments and analyses that highlight the sources of the gains in the performance of our models for representative predictions.
☆ Efficient Document Ranking with Learnable Late Interactions
Cross-Encoder (CE) and Dual-Encoder (DE) models are two fundamental approaches for query-document relevance in information retrieval. To predict relevance, CE models use joint query-document embeddings, while DE models maintain factorized query and document embeddings; usually, the former has higher quality while the latter benefits from lower latency. Recently, late-interaction models have been proposed to realize more favorable latency-quality tradeoffs, by using a DE structure followed by a lightweight scorer based on query and document token embeddings. However, these lightweight scorers are often hand-crafted, and there is no understanding of their approximation power; further, such scorers require access to individual document token embeddings, which imposes an increased latency and storage burden. In this paper, we propose novel learnable late-interaction models (LITE) that resolve these issues. Theoretically, we prove that LITE is a universal approximator of continuous scoring functions, even for relatively small embedding dimension. Empirically, LITE outperforms previous late-interaction models such as ColBERT on both in-domain and zero-shot re-ranking tasks. For instance, experiments on MS MARCO passage re-ranking show that LITE not only yields a model with better generalization, but also lowers latency and requires 0.25x storage compared to ColBERT.
☆ Empowering Interdisciplinary Insights with Dynamic Graph Embedding Trajectories
We developed DyGETViz, a novel framework for effectively visualizing dynamic graphs (DGs) that are ubiquitous across diverse real-world systems. This framework leverages recent advancements in discrete-time dynamic graph (DTDG) models to adeptly handle the temporal dynamics inherent in dynamic graphs. DyGETViz effectively captures both micro- and macro-level structural shifts within these graphs, offering a robust method for representing complex and massive dynamic graphs. The application of DyGETViz extends to a diverse array of domains, including ethology, epidemiology, finance, genetics, linguistics, communication studies, social studies, and international relations. Through its implementation, DyGETViz has revealed or confirmed various critical insights. These include the diversity of content sharing patterns and the degree of specialization within online communities, the chronological evolution of lexicons across decades, and the distinct trajectories exhibited by aging-related and non-related genes. Importantly, DyGETViz enhances the accessibility of scientific findings to non-domain experts by simplifying the complexities of dynamic graphs. Our framework is released as an open-source Python package for use across diverse disciplines. Our work not only addresses the ongoing challenges in visualizing and analyzing DTDG models but also establishes a foundational framework for future investigations into dynamic graph representation and analysis across various disciplines.
comment: 25 pages, 11 figures
♻ ☆ Towards Diverse Evaluation of Class Incremental Learning: A Representation Learning Perspective
Class incremental learning (CIL) algorithms aim to continually learn new object classes from incrementally arriving data while not forgetting past learned classes. The common evaluation protocol for CIL algorithms is to measure the average test accuracy across all classes learned so far -- however, we argue that solely focusing on maximizing the test accuracy may not necessarily lead to developing a CIL algorithm that also continually learns and updates the representations, which may be transferred to the downstream tasks. To that end, we experimentally analyze neural network models trained by CIL algorithms using various evaluation protocols in representation learning and propose new analysis methods. Our experiments show that most state-of-the-art algorithms prioritize high stability and do not significantly change the learned representation, and sometimes even learn a representation of lower quality than a naive baseline. However, we observe that these algorithms can still achieve high test accuracy because they enable a model to learn a classifier that closely resembles an estimated linear classifier trained for linear probing. Furthermore, the base model learned in the first task, which involves single-task learning, exhibits varying levels of representation quality across different algorithms, and this variance impacts the final performance of CIL algorithms. Therefore, we suggest that the representation-level evaluation should be considered as an additional recipe for more diverse evaluation for CIL algorithms.
comment: CoLLAs 2024 camera-ready version
♻ ☆ Adam-mini: Use Fewer Learning Rates To Gain More
We propose Adam-mini, an optimizer that achieves on-par or better performance than AdamW with 45% to 50% less memory footprint. Adam-mini reduces memory by cutting down the number of learning rates in Adam: Instead of assigning an individual learning rate for each parameter using $1/\sqrt{v}$, Adam-mini uses the average of $v$ within a pre-defined parameter block as the learning rate for that block. Such a design is inspired by two empirical findings. First, the Hessian of Transformers exhibits a near-block diagonal structure with different sizes of dense sub-blocks. Second, for each of these dense sub-blocks, there exists a single high-quality learning rate that can outperform Adam, provided that sufficient resources are available to search it out. Adam-mini provides one cost-effective way to find these good learning rates and manage to cut down $\geq$ 90% $v$ in Adam. Empirically, we verify that Adam-mini performs on par or better than AdamW on various language models sized from 125M to 7B for pre-training, supervised fine-tuning, and RLHF. The reduced memory footprint of Adam-mini also alleviates communication overheads among GPUs and CPUs, thereby increasing throughput. For instance, Adam-mini achieves 49.6% higher throughput than AdamW when pre-training Llama2-7B on 2x A800-80GB GPUs, which saves 33% wall-clock time for pre-training.
♻ ☆ Discrete Multimodal Transformers with a Pretrained Large Language Model for Mixed-Supervision Speech Processing
Recent work on discrete speech tokenization has paved the way for models that can seamlessly perform multiple tasks across modalities, e.g., speech recognition, text to speech, speech to speech translation. Moreover, large language models (LLMs) pretrained from vast text corpora contain rich linguistic information that can improve accuracy in a variety of tasks. In this paper, we present a decoder-only Discrete Multimodal Language Model (DMLM), which can be flexibly applied to multiple tasks (ASR, T2S, S2TT, etc.) and modalities (text, speech, vision). We explore several critical aspects of discrete multi-modal models, including the loss function, weight initialization, mixed training supervision, and codebook. Our results show that DMLM benefits significantly, across multiple tasks and datasets, from a combination of supervised and unsupervised training. Moreover, for ASR, it benefits from initializing DMLM from a pretrained LLM, and from a codebook derived from Whisper activations.
♻ ☆ Regularization and Optimal Multiclass Learning COLT 2024
The quintessential learning algorithm of empirical risk minimization (ERM) is known to fail in various settings for which uniform convergence does not characterize learning. It is therefore unsurprising that the practice of machine learning is rife with considerably richer algorithmic techniques for successfully controlling model capacity. Nevertheless, no such technique or principle has broken away from the pack to characterize optimal learning in these more general settings. The purpose of this work is to characterize the role of regularization in perhaps the simplest setting for which ERM fails: multiclass learning with arbitrary label sets. Using one-inclusion graphs (OIGs), we exhibit optimal learning algorithms that dovetail with tried-and-true algorithmic principles: Occam's Razor as embodied by structural risk minimization (SRM), the principle of maximum entropy, and Bayesian reasoning. Most notably, we introduce an optimal learner which relaxes structural risk minimization on two dimensions: it allows the regularization function to be "local" to datapoints, and uses an unsupervised learning stage to learn this regularizer at the outset. We justify these relaxations by showing that they are necessary: removing either dimension fails to yield a near-optimal learner. We also extract from OIGs a combinatorial sequence we term the Hall complexity, which is the first to characterize a problem's transductive error rate exactly. Lastly, we introduce a generalization of OIGs and the transductive learning setting to the agnostic case, where we show that optimal orientations of Hamming graphs -- judged using nodes' outdegrees minus a system of node-dependent credits -- characterize optimal learners exactly. We demonstrate that an agnostic version of the Hall complexity again characterizes error rates exactly, and exhibit an optimal learner using maximum entropy programs.
comment: COLT 2024; 43 pages
♻ ☆ Enhancing Active Learning for Sentinel 2 Imagery through Contrastive Learning and Uncertainty Estimation
In this paper, we introduce a novel method designed to enhance label efficiency in satellite imagery analysis by integrating semi-supervised learning (SSL) with active learning strategies. Our approach utilizes contrastive learning together with uncertainty estimations via Monte Carlo Dropout (MC Dropout), with a particular focus on Sentinel-2 imagery analyzed using the Eurosat dataset. We explore the effectiveness of our method in scenarios featuring both balanced and unbalanced class distributions. Our results show that the proposed method performs better than several other popular methods in this field, enabling significant savings in labeling effort while maintaining high classification accuracy. These findings highlight the potential of our approach to facilitate scalable and cost-effective satellite image analysis, particularly advantageous for extensive environmental monitoring and land use classification tasks.
♻ ☆ Diverse Part Synthesis for 3D Shape Creation
Methods that use neural networks for synthesizing 3D shapes in the form of a part-based representation have been introduced over the last few years. These methods represent shapes as a graph or hierarchy of parts and enable a variety of applications such as shape sampling and reconstruction. However, current methods do not allow easily regenerating individual shape parts according to user preferences. In this paper, we investigate techniques that allow the user to generate multiple, diverse suggestions for individual parts. Specifically, we experiment with multimodal deep generative models that allow sampling diverse suggestions for shape parts and focus on models which have not been considered in previous work on shape synthesis. To provide a comparative study of these techniques, we introduce a method for synthesizing 3D shapes in a part-based representation and evaluate all the part suggestion techniques within this synthesis method. In our method, which is inspired by previous work, shapes are represented as a set of parts in the form of implicit functions which are then positioned in space to form the final shape. Synthesis in this representation is enabled by a neural network architecture based on an implicit decoder and a spatial transformer. We compare the various multimodal generative models by evaluating their performance in generating part suggestions. Our contribution is to show with qualitative and quantitative evaluations which of the new techniques for multimodal part generation perform the best and that a synthesis method based on the top-performing techniques allows the user to more finely control the parts that are generated in the 3D shapes while maintaining high shape fidelity when reconstructing shapes.
♻ ☆ The Best Arm Evades: Near-optimal Multi-pass Streaming Lower Bounds for Pure Exploration in Multi-armed Bandits COLT 2024
We give a near-optimal sample-pass trade-off for pure exploration in multi-armed bandits (MABs) via multi-pass streaming algorithms: any streaming algorithm with sublinear memory that uses the optimal sample complexity of $O(\frac{n}{\Delta^2})$ requires $\Omega(\frac{\log{(1/\Delta)}}{\log\log{(1/\Delta)}})$ passes. Here, $n$ is the number of arms and $\Delta$ is the reward gap between the best and the second-best arms. Our result matches the $O(\log(\frac{1}{\Delta}))$-pass algorithm of Jin et al. [ICML'21] (up to lower order terms) that only uses $O(1)$ memory and answers an open question posed by Assadi and Wang [STOC'20].
comment: COLT 2024
♻ ☆ Fast gradient-free activation maximization for neurons in spiking neural networks
Elements of neural networks, both biological and artificial, can be described by their selectivity for specific cognitive features. Understanding these features is important for understanding the inner workings of neural networks. For a living system, such as a neuron, whose response to a stimulus is unknown and not differentiable, the only way to reveal these features is through a feedback loop that exposes it to a large set of different stimuli. The properties of these stimuli should be varied iteratively in order to maximize the neuronal response. To utilize this feedback loop for a biological neural network, it is important to run it quickly and efficiently in order to reach the stimuli that maximizes certain neurons' activation with the least number of iterations possible. Here we present a framework with an efficient design for such a loop. We successfully tested it on an artificial spiking neural network (SNN), which is a model that simulates the asynchronous spiking activity of neurons in living brains. Our optimization method for activation maximization is based on the low-rank Tensor Train decomposition of the discrete activation function. The optimization space is the latent parameter space of images generated by SN-GAN or VQ-VAE generative models. To our knowledge, this is the first time that effective AM has been applied to SNNs. We track changes in the optimal stimuli for artificial neurons during training and show that highly selective neurons can form already in the early epochs of training and in the early layers of a convolutional spiking network. This formation of refined optimal stimuli is associated with an increase in classification accuracy. Some neurons, especially in the deeper layers, may gradually change the concepts they are selective for during learning, potentially explaining their importance for model performance.
♻ ☆ A Temporal Stochastic Bias Correction using a Machine Learning Attention model
Climate models are biased with respect to real-world observations. They usually need to be adjusted before being used in impact studies. The suite of statistical methods that enable such adjustments is called bias correction (BC). However, BC methods currently struggle to adjust temporal biases. Because they mostly disregard the dependence between consecutive time points. As a result, climate statistics with long-range temporal properties, such as heatwave duration and frequency, cannot be corrected accurately. This makes it more difficult to produce reliable impact studies on such climate statistics. This paper offers a novel BC methodology to correct temporal biases. This is made possible by rethinking the philosophy behind BC. We will introduce BC as a time-indexed regression task with stochastic outputs. Rethinking BC enables us to adapt state-of-the-art machine learning (ML) attention models and thereby learn different types of biases, including temporal asynchronicities. With a case study of heatwave duration statistics in Abuja, Nigeria, and Tokyo, Japan, we show more accurate results than current climate model outputs and alternative BC methods.
comment: 38 pages, 31 figures
♻ ☆ A Data-Centric Approach To Generate Faithful and High Quality Patient Summaries with Large Language Models
Patients often face difficulties in understanding their hospitalizations, while healthcare workers have limited resources to provide explanations. In this work, we investigate the potential of large language models to generate patient summaries based on doctors' notes and study the effect of training data on the faithfulness and quality of the generated summaries. To this end, we release (i) a rigorous labeling protocol for errors in medical texts and (ii) a publicly available dataset of annotated hallucinations in 100 doctor-written and 100 generated summaries. We show that fine-tuning on hallucination-free data effectively reduces hallucinations from 2.60 to 1.55 per summary for Llama 2, while preserving relevant information. We observe a similar effect on GPT-4 (0.70 to 0.40), when the few-shot examples are hallucination-free. We also conduct a qualitative evaluation using hallucination-free and improved training data. We find that common quantitative metrics do not correlate well with faithfulness and quality. Finally, we test GPT-4 for automatic hallucination detection, which clearly outperforms common baselines.
♻ ☆ XCube: Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies CVPR 2024
We present XCube (abbreviated as $\mathcal{X}^3$), a novel generative model for high-resolution sparse 3D voxel grids with arbitrary attributes. Our model can generate millions of voxels with a finest effective resolution of up to $1024^3$ in a feed-forward fashion without time-consuming test-time optimization. To achieve this, we employ a hierarchical voxel latent diffusion model which generates progressively higher resolution grids in a coarse-to-fine manner using a custom framework built on the highly efficient VDB data structure. Apart from generating high-resolution objects, we demonstrate the effectiveness of XCube on large outdoor scenes at scales of 100m$\times$100m with a voxel size as small as 10cm. We observe clear qualitative and quantitative improvements over past approaches. In addition to unconditional generation, we show that our model can be used to solve a variety of tasks such as user-guided editing, scene completion from a single scan, and text-to-3D. The source code and more results can be found at https://research.nvidia.com/labs/toronto-ai/xcube/.
comment: CVPR 2024 Highlight. Code: https://github.com/nv-tlabs/XCube/ Website: https://research.nvidia.com/labs/toronto-ai/xcube/
♻ ☆ Analysis of learning a flow-based generative model from limited sample complexity
We study the problem of training a flow-based generative model, parametrized by a two-layer autoencoder, to sample from a high-dimensional Gaussian mixture. We provide a sharp end-to-end analysis of the problem. First, we provide a tight closed-form characterization of the learnt velocity field, when parametrized by a shallow denoising auto-encoder trained on a finite number $n$ of samples from the target distribution. Building on this analysis, we provide a sharp description of the corresponding generative flow, which pushes the base Gaussian density forward to an approximation of the target density. In particular, we provide closed-form formulae for the distance between the mean of the generated mixture and the mean of the target mixture, which we show decays as $\Theta_n(\frac{1}{n})$. Finally, this rate is shown to be in fact Bayes-optimal.
♻ ☆ PiPar: Pipeline Parallelism for Collaborative Machine Learning
Collaborative machine learning (CML) techniques, such as federated learning, have been proposed to train deep learning models across multiple mobile devices and a server. CML techniques are privacy-preserving as a local model that is trained on each device instead of the raw data from the device is shared with the server. However, CML training is inefficient due to low resource utilization. We identify idling resources on the server and devices due to sequential computation and communication as the principal cause of low resource utilization. A novel framework PiPar that leverages pipeline parallelism for CML techniques is developed to substantially improve resource utilization. A new training pipeline is designed to parallelize the computations on different hardware resources and communication on different bandwidth resources, thereby accelerating the training process in CML. A low overhead automated parameter selection method is proposed to optimize the pipeline, maximizing the utilization of available resources. The experimental results confirm the validity of the underlying approach of PiPar and highlight that when compared to federated learning: (i) the idle time of the server can be reduced by up to 64.1x, and (ii) the overall training time can be accelerated by up to 34.6x under varying network conditions for a collection of six small and large popular deep neural networks and four datasets without sacrificing accuracy. It is also experimentally demonstrated that PiPar achieves performance benefits when incorporating differential privacy methods and operating in environments with heterogeneous devices and changing bandwidths.
♻ ☆ Feudal Graph Reinforcement Learning
Graph-based representations and message-passing modular policies constitute prominent approaches to tackling composable control problems in Reinforcement Learning (RL). However, as shown by recent graph deep learning literature, such local message-passing operators can create information bottlenecks and hinder global coordination. The issue becomes more serious in tasks requiring high-level planning. In this work, we propose a novel methodology, named Feudal Graph Reinforcement Learning (FGRL), that addresses such challenges by relying on hierarchical RL and a pyramidal message-passing architecture. In particular, FGRL defines a hierarchy of policies where high-level commands are propagated from the top of the hierarchy down through a layered graph structure. The bottom layers mimic the morphology of the physical system, while the upper layers correspond to higher-order sub-modules. The resulting agents are then characterized by a committee of policies where actions at a certain level set goals for the level below, thus implementing a hierarchical decision-making structure that can naturally implement task decomposition. We evaluate the proposed framework on a graph clustering problem and MuJoCo locomotion tasks; simulation results show that FGRL compares favorably against relevant baselines. Furthermore, an in-depth analysis of the command propagation mechanism provides evidence that the introduced message-passing scheme favors learning hierarchical decision-making policies.
♻ ☆ Multi-Modal Conformal Prediction Regions with Simple Structures by Optimizing Convex Shape Templates
Conformal prediction is a statistical tool for producing prediction regions for machine learning models that are valid with high probability. A key component of conformal prediction algorithms is a \emph{non-conformity score function} that quantifies how different a model's prediction is from the unknown ground truth value. Essentially, these functions determine the shape and the size of the conformal prediction regions. While prior work has gone into creating score functions that produce multi-model prediction regions, such regions are generally too complex for use in downstream planning and control problems. We propose a method that optimizes parameterized \emph{shape template functions} over calibration data, which results in non-conformity score functions that produce prediction regions with minimum volume. Our approach results in prediction regions that are \emph{multi-modal}, so they can properly capture residuals of distributions that have multiple modes, and \emph{practical}, so each region is convex and can be easily incorporated into downstream tasks, such as a motion planner using conformal prediction regions. Our method applies to general supervised learning tasks, while we illustrate its use in time-series prediction. We provide a toolbox and present illustrative case studies of F16 fighter jets and autonomous vehicles, showing an up to $68\%$ reduction in prediction region area compared to a circular baseline region.
comment: Accepted to L4DC 2024. 14 pages, 3 figures. The source code and toolbox are available at https://github.com/nandantumu/conformal_region_designer
♻ ☆ Locally Differentially Private Distributed Online Learning with Guaranteed Optimality
Distributed online learning is gaining increased traction due to its unique ability to process large-scale datasets and streaming data. To address the growing public awareness and concern on privacy protection, plenty of algorithms have been proposed to enable differential privacy in distributed online optimization and learning. However, these algorithms often face the dilemma of trading learning accuracy for privacy. By exploiting the unique characteristics of online learning, this paper proposes an approach that tackles the dilemma and ensures both differential privacy and learning accuracy in distributed online learning. More specifically, while ensuring a diminishing expected instantaneous regret, the approach can simultaneously ensure a finite cumulative privacy budget, even in the infinite time horizon. To cater for the fully distributed setting, we adopt the local differential-privacy framework, which avoids the reliance on a trusted data curator, and, hence, provides stronger protection than the classic "centralized" (global) differential privacy. To the best of our knowledge, this is the first algorithm that successfully ensures both rigorous local differential privacy and learning accuracy. The effectiveness of the proposed algorithm is evaluated using machine learning tasks, including logistic regression on the the "mushrooms" datasets and CNN-based image classification on the "MNIST" and "CIFAR-10" datasets.
comment: 23 pages, 9 figures
♻ ☆ GLAD: Improving Latent Graph Generative Modeling with Simple Quantization ICML 2024
Exploring the graph latent structures has not garnered much attention in the graph generative research field. Yet, exploiting the latent space is as crucial as working on the data space for discrete data such as graphs. However, previous methods either failed to preserve the permutation symmetry of graphs or lacked an effective approaches to model appropriately within the latent space. To mitigate those issues, we propose a simple, yet effective discrete latent graph diffusion generative model. Our model, namely GLAD, not only overcomes the drawbacks of existing latent approaches, but also alleviates inherent issues present in diffusion methods applied on the graph space. We validate our generative model on the molecular benchmark datasets, on which it demonstrates competitive performance compared with the state-of-the-art baselines.
comment: Accepted in the 2nd Structured Probabilistic Inference & Generative Modeling workshop of ICML 2024
♻ ☆ Controlling Moments with Kernel Stein Discrepancies
Kernel Stein discrepancies (KSDs) measure the quality of a distributional approximation and can be computed even when the target density has an intractable normalizing constant. Notable applications include the diagnosis of approximate MCMC samplers and goodness-of-fit tests for unnormalized statistical models. The present work analyzes the convergence control properties of KSDs. We first show that standard KSDs used for weak convergence control fail to control moment convergence. To address this limitation, we next provide sufficient conditions under which alternative diffusion KSDs control both moment and weak convergence. As an immediate consequence we develop, for each $q > 0$, the first KSDs known to exactly characterize $q$-Wasserstein convergence.
comment: 103 pages, 10 figures
♻ ☆ Insights into the Lottery Ticket Hypothesis and Iterative Magnitude Pruning
Lottery ticket hypothesis for deep neural networks emphasizes the importance of initialization used to re-train the sparser networks obtained using the iterative magnitude pruning process. An explanation for why the specific initialization proposed by the lottery ticket hypothesis tends to work better in terms of generalization (and training) performance has been lacking. Moreover, the underlying principles in iterative magnitude pruning, like the pruning of smaller magnitude weights and the role of the iterative process, lack full understanding and explanation. In this work, we attempt to provide insights into these phenomena by empirically studying the volume/geometry and loss landscape characteristics of the solutions obtained at various stages of the iterative magnitude pruning process.
♻ ☆ Fundamental Bounds on Online Strategic Classification
We study the problem of online binary classification where strategic agents can manipulate their observable features in predefined ways, modeled by a manipulation graph, in order to receive a positive classification. We show this setting differs in fundamental ways from non-strategic online classification. For instance, whereas in the non-strategic case, a mistake bound of $\ln|H|$ is achievable via the halving algorithm when the target function belongs to a known class $H$, we show that no deterministic algorithm can achieve a mistake bound $o(\Delta)$ in the strategic setting, where $\Delta$ is the maximum degree of the manipulation graph (even when $|H|=O(\Delta)$). We obtain an algorithm achieving mistake bound $O(\Delta\ln|H|)$. We also extend this to the agnostic setting and obtain an algorithm with a $\Delta$ multiplicative regret, and we show no deterministic algorithm can achieve $o(\Delta)$ multiplicative regret. Next, we study two randomized models based on whether the random choices are made before or after agents respond, and show they exhibit fundamental differences. In the first model, at each round the learner deterministically chooses a probability distribution over classifiers inducing expected values on each vertex (probabilities of being classified as positive), which the strategic agents respond to. We show that any learner in this model has to suffer linear regret. On the other hand, in the second model, while the adversary who selects the next agent must respond to the learner's probability distribution over classifiers, the agent then responds to the actual hypothesis classifier drawn from this distribution. Surprisingly, we show this model is more advantageous to the learner, and we design randomized algorithms that achieve sublinear regret bounds against both oblivious and adaptive adversaries.
♻ ☆ MgNO: Efficient Parameterization of Linear Operators via Multigrid
In this work, we propose a concise neural operator architecture for operator learning. Drawing an analogy with a conventional fully connected neural network, we define the neural operator as follows: the output of the $i$-th neuron in a nonlinear operator layer is defined by $\mathcal O_i(u) = \sigma\left( \sum_j \mathcal W_{ij} u + \mathcal B_{ij}\right)$. Here, $\mathcal W_{ij}$ denotes the bounded linear operator connecting $j$-th input neuron to $i$-th output neuron, and the bias $\mathcal B_{ij}$ takes the form of a function rather than a scalar. Given its new universal approximation property, the efficient parameterization of the bounded linear operators between two neurons (Banach spaces) plays a critical role. As a result, we introduce MgNO, utilizing multigrid structures to parameterize these linear operators between neurons. This approach offers both mathematical rigor and practical expressivity. Additionally, MgNO obviates the need for conventional lifting and projecting operators typically required in previous neural operators. Moreover, it seamlessly accommodates diverse boundary conditions. Our empirical observations reveal that MgNO exhibits superior ease of training compared to other CNN-based models, while also displaying a reduced susceptibility to overfitting when contrasted with spectral-type neural operators. We demonstrate the efficiency and accuracy of our method with consistently state-of-the-art performance on different types of partial differential equations (PDEs).
♻ ☆ Learning with Noisy Labels through Learnable Weighting and Centroid Similarity
We introduce a novel method for training machine learning models in the presence of noisy labels, which are prevalent in domains such as medical diagnosis and autonomous driving and have the potential to degrade a model's generalization performance. Inspired by established literature that highlights how deep learning models are prone to overfitting to noisy samples in the later epochs of training, we propose a strategic approach. This strategy leverages the distance to class centroids in the latent space and incorporates a discounting mechanism, aiming to diminish the influence of samples that lie distant from all class centroids. By doing so, we effectively counteract the adverse effects of noisy labels. The foundational premise of our approach is the assumption that samples situated further from their respective class centroid in the initial stages of training are more likely to be associated with noise. Our methodology is grounded in robust theoretical principles and has been validated empirically through extensive experiments on several benchmark datasets. Our results show that our method consistently outperforms the existing state-of-the-art techniques, achieving significant improvements in classification accuracy in the presence of noisy labels. The code for our proposed loss function and supplementary materials is available at https://github.com/wanifarooq/NCOD
♻ ☆ Generalized Graph Prompt: Toward a Unification of Pre-Training and Downstream Tasks on Graphs
Graph neural networks have emerged as a powerful tool for graph representation learning, but their performance heavily relies on abundant task-specific supervision. To reduce labeling requirement, the "pre-train, prompt" paradigms have become increasingly common. However, existing study of prompting on graphs is limited, lacking a universal treatment to appeal to different downstream tasks. In this paper, we propose GraphPrompt, a novel pre-training and prompting framework on graphs. GraphPrompt not only unifies pre-training and downstream tasks into a common task template but also employs a learnable prompt to assist a downstream task in locating the most relevant knowledge from the pre-trained model in a task-specific manner. To further enhance GraphPrompt in these two stages, we extend it into GraphPrompt+ with two major enhancements. First, we generalize several popular graph pre-training tasks beyond simple link prediction to broaden the compatibility with our task template. Second, we propose a more generalized prompt design that incorporates a series of prompt vectors within every layer of the pre-trained graph encoder, in order to capitalize on the hierarchical information across different layers beyond just the readout layer. Finally, we conduct extensive experiments on five public datasets to evaluate and analyze GraphPrompt and GraphPrompt+.
comment: Extension of "GraphPrompt: Unifying Pre-Training and Downstream Tasks for Graph Neural Networks". arXiv admin note: substantial text overlap with arXiv:2302.08043
♻ ☆ A Cost-Efficient FPGA Implementation of Tiny Transformer Model using Neural ODE
Transformer has been adopted to a wide range of tasks and shown to outperform CNNs and RNNs while it suffers from high training cost and computational complexity. To address these issues, a hybrid approach has become a recent research trend, which replaces a part of ResNet with an MHSA (Multi-Head Self-Attention). In this paper, we propose a lightweight hybrid model which uses Neural ODE (Ordinary Differential Equation) as a backbone instead of ResNet for 12.1$\times$ parameter reduction. For the STL10 dataset, the proposed model achieves 80.15% top-1 accuracy which is comparable to ResNet50. Then, the proposed model is deployed on a modest-sized FPGA device for edge computing. To further reduce FPGA resource utilization, the model is quantized following QAT (Quantization Aware Training) scheme instead of PTQ (Post Training Quantization) to suppress the accuracy loss. As a result, an extremely lightweight Transformer-based model can be implemented on resource-limited FPGAs. The weights of the feature extraction network are stored on-chip to minimize the memory transfer overhead, allowing faster inference. By eliminating the overhead of memory transfers, inference can be executed seamlessly, leading to accelerated inference. The proposed FPGA implementation achieves a 34.01$\times$ speedup for the backbone and MHSA parts, and it achieves an overall 9.85$\times$ speedup when taking into account software pre- and post-processing. It also achieves an overall 7.10$\times$ higher energy efficiency compared to the ARM Cortex-A53 CPU.
♻ ☆ SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages
Question Answering (QA) datasets have been instrumental in developing and evaluating Large Language Model (LLM) capabilities. However, such datasets are scarce for languages other than English due to the cost and difficulties of collection and manual annotation. This means that producing novel models and measuring the performance of multilingual LLMs in low-resource languages is challenging. To mitigate this, we propose $\textbf{S}$yn$\textbf{DAR}$in, a method for generating and validating QA datasets for low-resource languages. We utilize parallel content mining to obtain $\textit{human-curated}$ paragraphs between English and the target language. We use the English data as context to $\textit{generate}$ synthetic multiple-choice (MC) question-answer pairs, which are automatically translated and further validated for quality. Combining these with their designated non-English $\textit{human-curated}$ paragraphs form the final QA dataset. The method allows to maintain the content quality, reduces the likelihood of factual errors, and circumvents the need for costly annotation. To test the method, we created a QA dataset with $1.2$K samples for the Armenian language. The human evaluation shows that $98\%$ of the generated English data maintains quality and diversity in the question types and topics, while the translation validation pipeline can filter out $\sim70\%$ of data with poor quality. We use the dataset to benchmark state-of-the-art LLMs, showing their inability to achieve human accuracy with some model performances closer to random chance. This shows that the generated dataset is non-trivial and can be used to evaluate reasoning capabilities in low-resource language.
♻ ☆ Instance-level quantitative saliency in multiple sclerosis lesion segmentation
In recent years, explainable methods for artificial intelligence (XAI) have tried to reveal and describe models' decision mechanisms in the case of classification tasks. However, XAI for semantic segmentation and in particular for single instances has been little studied to date. Understanding the process underlying automatic segmentation of single instances is crucial to reveal what information was used to detect and segment a given object of interest. In this study, we proposed two instance-level explanation maps for semantic segmentation based on SmoothGrad and Grad-CAM++ methods. Then, we investigated their relevance for the detection and segmentation of white matter lesions (WML), a magnetic resonance imaging (MRI) biomarker in multiple sclerosis (MS). 687 patients diagnosed with MS for a total of 4043 FLAIR and MPRAGE MRI scans were collected at the University Hospital of Basel, Switzerland. Data were randomly split into training, validation and test sets to train a 3D U-Net for MS lesion segmentation. We observed 3050 true positive (TP), 1818 false positive (FP), and 789 false negative (FN) cases. We generated instance-level explanation maps for semantic segmentation, by developing two XAI methods based on SmoothGrad and Grad-CAM++. We investigated: 1) the distribution of gradients in saliency maps with respect to both input MRI sequences; 2) the model's response in the case of synthetic lesions; 3) the amount of perilesional tissue needed by the model to segment a lesion. Saliency maps (based on SmoothGrad) in FLAIR showed positive values inside a lesion and negative in its neighborhood. Peak values of saliency maps generated for these four groups of volumes presented distributions that differ significantly from one another, suggesting a quantitative nature of the proposed saliency. Contextual information of 7mm around the lesion border was required for their segmentation.
♻ ☆ Overcoming the Paradox of Certified Training with Gaussian Smoothing
Training neural networks with high certified accuracy against adversarial examples remains an open problem despite significant efforts. While certification methods can effectively leverage tight convex relaxations for bound computation, in training, these methods perform worse than looser relaxations. Prior work hypothesized that this is caused by the discontinuity and perturbation sensitivity of the loss surface induced by these tighter relaxations. In this work, we show theoretically that Gaussian Loss Smoothing can alleviate both issues. We confirm this empirically by proposing a certified training method combining PGPE, an algorithm computing gradients of a smoothed loss, with different convex relaxations. When using this training method, we observe that tighter bounds indeed lead to strictly better networks. While scaling PGPE training remains challenging due to high computational cost, we show that by using a not theoretically sound, yet much cheaper smoothing approximation, we obtain better certified accuracies than state-of-the-art methods when training on the same network architecture. Our results clearly demonstrate the promise of Gaussian Loss Smoothing for training certifiably robust neural networks.
♻ ☆ Aligning Large Language Models by On-Policy Self-Judgment ACL 2024
Existing approaches for aligning large language models with human preferences face a trade-off that requires a separate reward model (RM) for on-policy learning. In this paper, we present a novel alignment framework, SELF-JUDGE that (1) does on-policy learning and 2) is parameter efficient, as it does not require an additional RM for evaluating the samples for on-policy learning. To this end, we propose Judge-augmented Supervised Fine-Tuning (JSFT) to train a single model to act as both a policy and a judge. Specifically, we view the pairwise judgment task, choosing the better response from a response pair, as a special case of the instruction-following task. The resulting model can judge preferences of on-the-fly responses from current policy initialized from itself. Experimental results show the efficacy of SELF-JUDGE, outperforming baselines in preference benchmarks. We also show that the rejecting sampling by itself can improve performance further without an additional evaluator.
comment: Published as a main conference paper at ACL 2024
♻ ☆ DEM: A Method for Certifying Deep Neural Network Classifier Outputs in Aerospace SC 2024
Software development in the aerospace domain requires adhering to strict, high-quality standards. While there exist regulatory guidelines for commercial software in this domain (e.g., ARP-4754 and DO-178), these do not apply to software with deep neural network (DNN) components. Consequently, it is unclear how to allow aerospace systems to benefit from the deep learning revolution. Our work here seeks to address this challenge with a novel, output-centric approach for DNN certification. Our method employs statistical verification techniques, and has the key advantage of being able to flag specific inputs for which the DNN's output may be unreliable - so that they may be later inspected by a human expert. To achieve this, our method conducts a statistical analysis of the DNN's predictions for other, nearby inputs, in order to detect inconsistencies. This is in contrast to existing techniques, which typically attempt to certify the entire DNN, as opposed to individual outputs. Our method uses the DNN as a black-box, and makes no assumptions about its topology. We hope that this work constitutes another step towards integrating DNNs in safety-critical applications - especially in the aerospace domain, where high standards of quality and reliability are crucial.
comment: This is a preprint version of a paper that will appear at 43rd Digital Avionics Systems Conference (DASC 2024)
♻ ☆ Scoreformer: A Surrogate Model For Large-Scale Prediction of Docking Scores ICML 2024
In this study, we present ScoreFormer, a novel graph transformer model designed to accurately predict molecular docking scores, thereby optimizing high-throughput virtual screening (HTVS) in drug discovery. The architecture integrates Principal Neighborhood Aggregation (PNA) and Learnable Random Walk Positional Encodings (LRWPE), enhancing the model's ability to understand complex molecular structures and their relationship with their respective docking scores. This approach significantly surpasses traditional HTVS methods and recent Graph Neural Network (GNN) models in both recovery and efficiency due to a wider coverage of the chemical space and enhanced performance. Our results demonstrate that ScoreFormer achieves competitive performance in docking score prediction and offers a substantial 1.65-fold reduction in inference time compared to existing models. We evaluated ScoreFormer across multiple datasets under various conditions, confirming its robustness and reliability in identifying potential drug candidates rapidly.
comment: Accepted at the 1st Machine Learning for Life and Material Sciences Workshop at ICML 2024
♻ ☆ Accelerating Look-ahead in Bayesian Optimization: Multilevel Monte Carlo is All you Need ICML 2024
We leverage multilevel Monte Carlo (MLMC) to improve the performance of multi-step look-ahead Bayesian optimization (BO) methods that involve nested expectations and maximizations. Often these expectations must be computed by Monte Carlo (MC). The complexity rate of naive MC degrades for nested operations, whereas MLMC is capable of achieving the canonical MC convergence rate for this type of problem, independently of dimension and without any smoothness assumptions. Our theoretical study focuses on the approximation improvements for twoand three-step look-ahead acquisition functions, but, as we discuss, the approach is generalizable in various ways, including beyond the context of BO. Our findings are verified numerically and the benefits of MLMC for BO are illustrated on several benchmark examples. Code is available at https://github.com/Shangda-Yang/MLMCBO .
comment: Preprint ICML 2024
♻ ☆ In value-based deep reinforcement learning, a pruned network is a good network
Recent work has shown that deep reinforcement learning agents have difficulty in effectively using their network parameters. We leverage prior insights into the advantages of sparse training techniques and demonstrate that gradual magnitude pruning enables value-based agents to maximize parameter effectiveness. This results in networks that yield dramatic performance improvements over traditional networks, using only a small fraction of the full network parameters.
♻ ☆ Accelerating Electronic Stopping Power Predictions by 10 Million Times with a Combination of Time-Dependent Density Functional Theory and Machine Learning
Knowing the rate at which particle radiation releases energy in a material, the stopping power, is key to designing nuclear reactors, medical treatments, semiconductor and quantum materials, and many other technologies. While the nuclear contribution to stopping power, i.e., elastic scattering between atoms, is well understood in the literature, the route for gathering data on the electronic contribution has for decades remained costly and reliant on many simplifying assumptions, including that materials are isotropic. We establish a method that combines time-dependent density functional theory (TDDFT) and machine learning to reduce the time to assess new materials to mere hours on a supercomputer and provides valuable data on how atomic details influence electronic stopping. Our approach uses TDDFT to compute the electronic stopping contributions to stopping power from first principles in several directions and then machine learning to interpolate to other directions at a cost of 10 million times fewer core-hours. We demonstrate the combined approach in a study of proton irradiation in aluminum and employ it to predict how the depth of maximum energy deposition, the "Bragg Peak," varies depending on incident angle -- a quantity otherwise inaccessible to modelers. The lack of any experimental information requirement makes our method applicable to most materials, and its speed makes it a prime candidate for enabling quantum-to-continuum models of radiation damage. The prospect of reusing valuable TDDFT data for training the model make our approach appealing for applications in the age of materials data science.
♻ ☆ Bayesian Exploration Networks
Bayesian reinforcement learning (RL) offers a principled and elegant approach for sequential decision making under uncertainty. Most notably, Bayesian agents do not face an exploration/exploitation dilemma, a major pathology of frequentist methods. However theoretical understanding of model-free approaches is lacking. In this paper, we introduce a novel Bayesian model-free formulation and the first analysis showing that model-free approaches can yield Bayes-optimal policies. We show all existing model-free approaches make approximations that yield policies that can be arbitrarily Bayes-suboptimal. As a first step towards model-free Bayes optimality, we introduce the Bayesian exploration network (BEN) which uses normalising flows to model both the aleatoric uncertainty (via density estimation) and epistemic uncertainty (via variational inference) in the Bellman operator. In the limit of complete optimisation, BEN learns true Bayes-optimal policies, but like in variational expectation-maximisation, partial optimisation renders our approach tractable. Empirical results demonstrate that BEN can learn true Bayes-optimal policies in tasks where existing model-free approaches fail.
comment: Typos fixed and provided clearer proof of Theorem 3.2
♻ ☆ Representation Surgery: Theory and Practice of Affine Steering ICML 2024
Language models often exhibit undesirable behavior, e.g., generating toxic or gender-biased text. In the case of neural language models, an encoding of the undesirable behavior is often present in the model's representations. Thus, one natural (and common) approach to prevent the model from exhibiting undesirable behavior is to steer the model's representations in a manner that reduces the probability of it generating undesirable text. This paper investigates the formal and empirical properties of steering functions, i.e., transformation of the neural language model's representations that alter its behavior. First, we derive two optimal, in the least-squares sense, affine steering functions under different constraints. Our theory provides justification for existing approaches and offers a novel, improved steering approach. Second, we offer a series of experiments that demonstrate the empirical effectiveness of the methods in mitigating bias and reducing toxic generation.
comment: Accepted in ICML 2024
♻ ☆ Jigsaw: Supporting Designers to Prototype Multimodal Applications by Chaining AI Foundation Models
Recent advancements in AI foundation models have made it possible for them to be utilized off-the-shelf for creative tasks, including ideating design concepts or generating visual prototypes. However, integrating these models into the creative process can be challenging as they often exist as standalone applications tailored to specific tasks. To address this challenge, we introduce Jigsaw, a prototype system that employs puzzle pieces as metaphors to represent foundation models. Jigsaw allows designers to combine different foundation model capabilities across various modalities by assembling compatible puzzle pieces. To inform the design of Jigsaw, we interviewed ten designers and distilled design goals. In a user study, we showed that Jigsaw enhanced designers' understanding of available foundation model capabilities, provided guidance on combining capabilities across different modalities and tasks, and served as a canvas to support design exploration, prototyping, and documentation.
comment: https://jigsaw.to
♻ ☆ Practical Membership Inference Attacks against Fine-tuned Large Language Models via Self-prompt Calibration
Membership Inference Attacks (MIA) aim to infer whether a target data record has been utilized for model training or not. Prior attempts have quantified the privacy risks of language models (LMs) via MIAs, but there is still no consensus on whether existing MIA algorithms can cause remarkable privacy leakage on practical Large Language Models (LLMs). Existing MIAs designed for LMs can be classified into two categories: reference-free and reference-based attacks. They are both based on the hypothesis that training records consistently strike a higher probability of being sampled. Nevertheless, this hypothesis heavily relies on the overfitting of target models, which will be mitigated by multiple regularization methods and the generalization of LLMs. The reference-based attack seems to achieve promising effectiveness in LLMs, which measures a more reliable membership signal by comparing the probability discrepancy between the target model and the reference model. However, the performance of reference-based attack is highly dependent on a reference dataset that closely resembles the training dataset, which is usually inaccessible in the practical scenario. Overall, existing MIAs are unable to effectively unveil privacy leakage over practical fine-tuned LLMs that are overfitting-free and private. We propose a Membership Inference Attack based on Self-calibrated Probabilistic Variation (SPV-MIA). Specifically, since memorization in LLMs is inevitable during the training process and occurs before overfitting, we introduce a more reliable membership signal, probabilistic variation, which is based on memorization rather than overfitting. Furthermore, we introduce a self-prompt approach, which constructs the dataset to fine-tune the reference model by prompting the target LLM itself. In this manner, the adversary can collect a dataset with a similar distribution from public APIs.
comment: Repo: https://github.com/wjfu99/MIA-LLMs
♻ ☆ A Probabilistic Fluctuation based Membership Inference Attack for Diffusion Models
Membership Inference Attack (MIA) identifies whether a record exists in a machine learning model's training set by querying the model. MIAs on the classic classification models have been well-studied, and recent works have started to explore how to transplant MIA onto generative models. Our investigation indicates that existing MIAs designed for generative models mainly depend on the overfitting in target models. However, overfitting can be avoided by employing various regularization techniques, whereas existing MIAs demonstrate poor performance in practice. Unlike overfitting, memorization is essential for deep learning models to attain optimal performance, making it a more prevalent phenomenon. Memorization in generative models leads to an increasing trend in the probability distribution of generating records around the member record. Therefore, we propose a Probabilistic Fluctuation Assessing Membership Inference Attack (PFAMI), a black-box MIA that infers memberships by detecting these trends via analyzing the overall probabilistic fluctuations around given records. We conduct extensive experiments across multiple generative models and datasets, which demonstrate PFAMI can improve the attack success rate (ASR) by about 27.9% when compared with the best baseline.
comment: Repo: https://github.com/wjfu99/MIA-Gen
♻ ☆ Efficient 3D Molecular Generation with Flow Matching and Scale Optimal Transport
Generative models for 3D drug design have gained prominence recently for their potential to design ligands directly within protein pockets. Current approaches, however, often suffer from very slow sampling times or generate molecules with poor chemical validity. Addressing these limitations, we propose Semla, a scalable E(3)-equivariant message passing architecture. We further introduce a molecular generation model, SemlaFlow, which is trained using flow matching along with scale optimal transport, a novel extension of equivariant optimal transport. Our model produces state-of-the-art results on benchmark datasets with just 100 sampling steps. Crucially, SemlaFlow samples high quality molecules with as few as 20 steps, corresponding to a two order-of-magnitude speed-up compared to state-of-the-art, without sacrificing performance. Furthermore, we highlight limitations of current evaluation methods for 3D generation and propose new benchmark metrics for unconditional molecular generators. Finally, using these new metrics, we compare our model's ability to generate high quality samples against current approaches and further demonstrate SemlaFlow's strong performance.
comment: Preprint. Code to be released upon full publication
♻ ☆ Towards Unbiased Calibration using Meta-Regularization
Model miscalibration has been frequently identified in modern deep neural networks. Recent work aims to improve model calibration directly through a differentiable calibration proxy. However, the calibration produced is often biased due to the binning mechanism. In this work, we propose to learn better-calibrated models via meta-regularization, which has two components: (1) gamma network (gamma-net), a meta learner that outputs sample-wise gamma values (continuous variable) for Focal loss for regularizing the backbone network; (2) smooth expected calibration error (SECE), a Gaussian-kernel based, unbiased, and differentiable surrogate to ECE that enables the smooth optimization of gamma-Net. We evaluate the effectiveness of the proposed approach in regularizing neural networks towards improved and unbiased calibration on three computer vision datasets. We empirically demonstrate that: (a) learning sample-wise gamma as continuous variables can effectively improve calibration; (b) SECE smoothly optimizes gamma-net towards unbiased and robust calibration with respect to the binning schemes; and (c) the combination of gamma-net and SECE achieves the best calibration performance across various calibration metrics while retaining very competitive predictive performance as compared to multiple recently proposed methods.
comment: 20 pages. Accepted at TMLR: https://openreview.net/forum?id=Yf8iHCfG4W
♻ ☆ Embodied Question Answering via Multi-LLM Systems
Embodied Question Answering (EQA) is an important problem, which involves an agent exploring the environment to answer user queries. In the existing literature, EQA has exclusively been studied in single-agent scenarios, where exploration can be time-consuming and costly. In this work, we consider EQA in a multi-agent framework involving multiple large language models (LLM) based agents independently answering queries about a household environment. To generate one answer for each query, we use the individual responses to train a Central Answer Model (CAM) that aggregates responses for a robust answer. Using CAM, we observe a $50\%$ higher EQA accuracy when compared against aggregation methods for ensemble LLM, such as voting schemes and debates. CAM does not require any form of agent communication, alleviating it from the associated costs. We ablate CAM with various nonlinear (neural network, random forest, decision tree, XGBoost) and linear (logistic regression classifier, SVM) algorithms. Finally, we present a feature importance analysis for CAM via permutation feature importance (PFI), quantifying CAMs reliance on each independent agent and query context.
comment: 17 pages, 13 Figures, 4 Tables
♻ ☆ High-Performance Hybrid Algorithm for Minimum Sum-of-Squares Clustering of Infinitely Tall Data
This paper introduces a novel formulation of the clustering problem, namely the Minimum Sum-of-Squares Clustering of Infinitely Tall Data (MSSC-ITD), and presents HPClust, an innovative set of hybrid parallel approaches for its effective solution. By utilizing modern high-performance computing techniques, HPClust enhances key clustering metrics: effectiveness, computational efficiency, and scalability. In contrast to vanilla data parallelism, which only accelerates processing time through the MapReduce framework, our approach unlocks superior performance by leveraging the multi-strategy competitive-cooperative parallelism and intricate properties of the objective function landscape. Unlike other available algorithms that struggle to scale, our algorithm is inherently parallel in nature, improving solution quality through increased scalability and parallelism, and outperforming even advanced algorithms designed for small and medium-sized datasets. Our evaluation of HPClust, featuring four parallel strategies, demonstrates its superiority over traditional and cutting-edge methods by offering better performance in the key metrics. These results also show that parallel processing not only enhances the clustering efficiency, but the accuracy as well. Additionally, we explore the balance between computational efficiency and clustering quality, providing insights into optimal parallel strategies based on dataset specifics and resource availability. This research advances our understanding of parallelism in clustering algorithms, demonstrating that a judicious hybridization of advanced parallel approaches yields optimal results for MSSC-ITD. Experiments on synthetic data further confirm HPClust's exceptional scalability and robustness to noise.
comment: Published in the MDPI "Mathematics" journal
♻ ☆ DE-COP: Detecting Copyrighted Content in Language Models Training Data
How can we detect if copyrighted content was used in the training process of a language model, considering that the training data is typically undisclosed? We are motivated by the premise that a language model is likely to identify verbatim excerpts from its training text. We propose DE-COP, a method to determine whether a piece of copyrighted content was included in training. DE-COP's core approach is to probe an LLM with multiple-choice questions, whose options include both verbatim text and their paraphrases. We construct BookTection, a benchmark with excerpts from 165 books published prior and subsequent to a model's training cutoff, along with their paraphrases. Our experiments show that DE-COP surpasses the prior best method by 9.6% in detection performance (AUC) on models with logits available. Moreover, DE-COP also achieves an average accuracy of 72% for detecting suspect books on fully black-box models where prior methods give approximately 4% accuracy. The code and datasets are available at https://github.com/LeiLiLab/DE-COP.
♻ ☆ Evaluating ML-Based Anomaly Detection Across Datasets of Varied Integrity: A Case Study
Cybersecurity remains a critical challenge in the digital age, with network traffic flow anomaly detection being a key pivotal instrument in the fight against cyber threats. In this study, we address the prevalent issue of data integrity in network traffic datasets, which are instrumental in developing machine learning (ML) models for anomaly detection. We introduce two refined versions of the CICIDS-2017 dataset, NFS-2023-nTE and NFS-2023-TE, processed using NFStream to ensure methodologically sound flow expiration and labeling. Our research contrasts the performance of the Random Forest (RF) algorithm across the original CICIDS-2017, its refined counterparts WTMC-2021 and CRiSIS-2022, and our NFStream-generated datasets, in both binary and multi-class classification contexts. We observe that the RF model exhibits exceptional robustness, achieving consistent high-performance metrics irrespective of the underlying dataset quality, which prompts a critical discussion on the actual impact of data integrity on ML efficacy. Our study underscores the importance of continual refinement and methodological rigor in dataset generation for network security research. As the landscape of network threats evolves, so must the tools and techniques used to detect and analyze them.
♻ ☆ Essentially Sharp Estimates on the Entropy Regularization Error in Discrete Discounted Markov Decision Processes
We study the error introduced by entropy regularization of infinite-horizon discrete discounted Markov decision processes. We show that this error decreases exponentially in the inverse regularization strength both in a weighted KL-divergence and in value with a problem-specific exponent. We provide a lower bound matching our upper bound up to a polynomial factor. Our proof relies on the correspondence of the solutions of entropy-regularized Markov decision processes with gradient flows of the unregularized reward with respect to a Riemannian metric common in natural policy gradient methods. Further, this correspondence allows us to identify the limit of the gradient flow as the generalized maximum entropy optimal policy, thereby characterizing the implicit bias of the Kakade gradient flow which corresponds to a time-continuous version of the natural policy gradient method. We use this to show that for entropy-regularized natural policy gradient methods the overall error decays exponentially in the square root of the number of iterations improving existing sublinear guarantees.
comment: 26 pages, 1 figure
♻ ☆ Low-Cost Privacy-Aware Decentralized Learning
This paper introduces ZIP-DL, a novel privacy-aware decentralized learning (DL) algorithm that exploits correlated noise to provide strong privacy protection against a local adversary while yielding efficient convergence guarantees for a low communication cost. The progressive neutralization of the added noise during the distributed aggregation process results in ZIP-DL fostering a high model accuracy under privacy guarantees. ZIP-DL further uses a single communication round between each gradient descent, thus minimizing communication overhead. We provide theoretical guarantees for both convergence speed and privacy guarantees, thereby making ZIP-DL applicable to practical scenarios. Our extensive experimental study shows that ZIP-DL significantly outperforms the state-of-the-art in terms of vulnerability/accuracy trade-off. In particular, ZIP-DL (i) reduces the efficacy of linkability attacks by up to 52 percentage points compared to baseline DL, (ii) improves accuracy by up to 37 percent w.r.t. the state-of-the-art privacy-preserving mechanism operating under the same threat model as ours, when configured to provide the same protection against membership inference attacks, and (iii) reduces communication by up to 10.5x against the same competitor for the same level of protection.
♻ ☆ A Numerical Proof of Shell Model Turbulence Closure
The development of turbulence closure models, parametrizing the influence of small non-resolved scales on the dynamics of large resolved ones, is an outstanding theoretical challenge with vast applicative relevance. We present a closure, based on deep recurrent neural networks, that quantitatively reproduces, within statistical errors, Eulerian and Lagrangian structure functions and the intermittent statistics of the energy cascade, including those of subgrid fluxes. To achieve high-order statistical accuracy, and thus a stringent statistical test, we employ shell models of turbulence. Our results encourage the development of similar approaches for 3D Navier-Stokes turbulence.
♻ ☆ Classification with neural networks with quadratic decision functions
Neural networks with quadratic decision functions have been introduced as alternatives to standard neural networks with affine linear ones. They are advantageous when the objects or classes to be identified are compact and of basic geometries like circles, ellipses etc. In this paper we investigate the use of such ansatz functions for classification. In particular we test and compare the algorithm on the MNIST dataset for classification of handwritten digits and for classification of subspecies. We also show, that the implementation can be based on the neural network structure in the software Tensorflow and Keras, respectively.
♻ ☆ Telecom Language Models: Must They Be Large?
The increasing interest in Large Language Models (LLMs) within the telecommunications sector underscores their potential to revolutionize operational efficiency. However, the deployment of these sophisticated models is often hampered by their substantial size and computational demands, raising concerns about their viability in resource-constrained environments. Addressing this challenge, recent advancements have seen the emergence of small language models that surprisingly exhibit performance comparable to their larger counterparts in many tasks, such as coding and common-sense reasoning. Phi-2, a compact yet powerful model, exemplifies this new wave of efficient small language models. This paper conducts a comprehensive evaluation of Phi-2's intrinsic understanding of the telecommunications domain. Recognizing the scale-related limitations, we enhance Phi-2's capabilities through a Retrieval-Augmented Generation approach, meticulously integrating an extensive knowledge base specifically curated with telecom standard specifications. The enhanced Phi-2 model demonstrates a profound improvement in accuracy, answering questions about telecom standards with a precision that closely rivals the more resource-intensive GPT-3.5. The paper further explores the refined capabilities of Phi-2 in addressing problem-solving scenarios within the telecom sector, highlighting its potential and limitations.
♻ ☆ Explainable Online Unsupervised Anomaly Detection for Cyber-Physical Systems via Causal Discovery from Time Series
Online unsupervised detection of anomalies is crucial to guarantee the correct operation of cyber-physical systems and the safety of humans interacting with them. State-of-the-art approaches based on deep learning via neural networks achieve outstanding performance at anomaly recognition, evaluating the discrepancy between a normal model of the system (with no anomalies) and the real-time stream of sensor time series. However, large training data and time are typically required, and explainability is still a challenge to identify the root of the anomaly and implement predictive maintainance. In this paper, we use causal discovery to learn a normal causal graph of the system, and we evaluate the persistency of causal links during real-time acquisition of sensor data to promptly detect anomalies. On two benchmark anomaly detection datasets, we show that our method has higher training efficiency, outperforms the accuracy of state-of-the-art neural architectures and correctly identifies the sources of >10 different anomalies. The code is at https://github.com/Isla-lab/causal_anomaly_detection.
comment: In publication for IEEE Conference on Automation and Smart Engineering (CASE) 2024
♻ ☆ FedPop: Federated Population-based Hyperparameter Tuning
Federated Learning (FL) is a distributed machine learning (ML) paradigm, in which multiple clients collaboratively train ML models without centralizing their local data. Similar to conventional ML pipelines, the client local optimization and server aggregation procedure in FL are sensitive to the hyperparameter (HP) selection. Despite extensive research on tuning HPs for centralized ML, these methods yield suboptimal results when employed in FL. This is mainly because their "training-after-tuning" framework is unsuitable for FL with limited client computation power. While some approaches have been proposed for HP-Tuning in FL, they are limited to the HPs for client local updates. In this work, we propose a novel HP-tuning algorithm, called Federated Population-based Hyperparameter Tuning (FedPop), to address this vital yet challenging problem. FedPop employs population-based evolutionary algorithms to optimize the HPs, which accommodates various HP types at both client and server sides. Compared with prior tuning methods, FedPop employs an online "tuning-while-training" framework, offering computational efficiency and enabling the exploration of a broader HP search space. Our empirical validation on the common FL benchmarks and complex real-world FL datasets demonstrates the effectiveness of the proposed method, which substantially outperforms the concurrent state-of-the-art HP tuning methods for FL.
comment: Code: https://github.com/HaokunChen245/FedPop
♻ ☆ SoK: Facial Deepfake Detectors
Deepfakes have rapidly emerged as a profound and serious threat to society, primarily due to their ease of creation and dissemination. This situation has triggered an accelerated development of deepfake detection technologies. However, many existing detectors rely heavily on lab-generated datasets for validation, which may not effectively prepare them for novel, emerging, and real-world deepfake techniques. In this paper, we conduct an extensive and comprehensive review and analysis of the latest state-of-the-art deepfake detectors, evaluating them against several critical criteria. These criteria facilitate the categorization of these detectors into 4 high-level groups and 13 fine-grained sub-groups, all aligned with a unified standard conceptual framework. This classification and framework offer deep and practical insights into the factors that affect detector efficacy. We assess the generalizability of 16 leading detectors across various standard attack scenarios, including black-box, white-box, and gray-box settings. Our systematized analysis and experimentation lay the groundwork for a deeper understanding of deepfake detectors and their generalizability, paving the way for future research focused on creating detectors adept at countering various attack scenarios. Additionally, this work offers insights for developing more proactive defenses against deepfakes.
comment: 18 pages, 6 figures, 5 table, under peer-review
♻ ☆ On the numerical reliability of nonsmooth autodiff: a MaxPool case study
This paper considers the reliability of automatic differentiation (AD) for neural networks involving the nonsmooth MaxPool operation. We investigate the behavior of AD across different precision levels (16, 32, 64 bits) and convolutional architectures (LeNet, VGG, and ResNet) on various datasets (MNIST, CIFAR10, SVHN, and ImageNet). Although AD can be incorrect, recent research has shown that it coincides with the derivative almost everywhere, even in the presence of nonsmooth operations (such as MaxPool and ReLU). On the other hand, in practice, AD operates with floating-point numbers (not real numbers), and there is, therefore, a need to explore subsets on which AD can be numerically incorrect. These subsets include a bifurcation zone (where AD is incorrect over reals) and a compensation zone (where AD is incorrect over floating-point numbers but correct over reals). Using SGD for the training process, we study the impact of different choices of the nonsmooth Jacobian for the MaxPool function on the precision of 16 and 32 bits. These findings suggest that nonsmooth MaxPool Jacobians with lower norms help maintain stable and efficient test accuracy, whereas those with higher norms can result in instability and decreased performance. We also observe that the influence of MaxPool's nonsmooth Jacobians on learning can be reduced by using batch normalization, Adam-like optimizers, or increasing the precision level.
♻ ☆ Computational-Statistical Gaps for Improper Learning in Sparse Linear Regression
We study computational-statistical gaps for improper learning in sparse linear regression. More specifically, given $n$ samples from a $k$-sparse linear model in dimension $d$, we ask what is the minimum sample complexity to efficiently (in time polynomial in $d$, $k$, and $n$) find a potentially dense estimate for the regression vector that achieves non-trivial prediction error on the $n$ samples. Information-theoretically this can be achieved using $\Theta(k \log (d/k))$ samples. Yet, despite its prominence in the literature, there is no polynomial-time algorithm known to achieve the same guarantees using less than $\Theta(d)$ samples without additional restrictions on the model. Similarly, existing hardness results are either restricted to the proper setting, in which the estimate must be sparse as well, or only apply to specific algorithms. We give evidence that efficient algorithms for this task require at least (roughly) $\Omega(k^2)$ samples. In particular, we show that an improper learning algorithm for sparse linear regression can be used to solve sparse PCA problems (with a negative spike) in their Wishart form, in regimes in which efficient algorithms are widely believed to require at least $\Omega(k^2)$ samples. We complement our reduction with low-degree and statistical query lower bounds for the sparse PCA problems from which we reduce. Our hardness results apply to the (correlated) random design setting in which the covariates are drawn i.i.d. from a mean-zero Gaussian distribution with unknown covariance.
comment: 24 pages; updated typos, some explanations, and references
♻ ☆ Direct Multi-Turn Preference Optimization for Language Agents
Adapting Large Language Models (LLMs) for agent tasks is critical in developing language agents. Direct Preference Optimization (DPO) is a promising technique for this adaptation with the alleviation of compounding errors, offering a means to directly optimize Reinforcement Learning (RL) objectives. However, applying DPO to multi-turn tasks presents challenges due to the inability to cancel the partition function. Overcoming this obstacle involves making the partition function independent of the current state and addressing length disparities between preferred and dis-preferred trajectories. In this light, we replace the policy constraint with the state-action occupancy measure constraint in the RL objective and add length normalization to the Bradley-Terry model, yielding a novel loss function named DMPO for multi-turn agent tasks with theoretical explanations. Extensive experiments on three multi-turn agent task datasets confirm the effectiveness and superiority of the DMPO loss.
♻ ☆ Efficiently Predicting Mutational Effect on Homologous Proteins by Evolution Encoding
Predicting protein properties is paramount for biological and medical advancements. Current protein engineering mutates on a typical protein, called the wild-type, to construct a family of homologous proteins and study their properties. Yet, existing methods easily neglect subtle mutations, failing to capture the effect on the protein properties. To this end, we propose EvolMPNN, Evolution-aware Message Passing Neural Network, an efficient model to learn evolution-aware protein embeddings. EvolMPNN samples sets of anchor proteins, computes evolutionary information by means of residues and employs a differentiable evolution-aware aggregation scheme over these sampled anchors. This way, EvolMPNN can efficiently utilise a novel message-passing method to capture the mutation effect on proteins with respect to the anchor proteins. Afterwards, the aggregated evolution-aware embeddings are integrated with sequence embeddings to generate final comprehensive protein embeddings. Our model shows up to 6.4% better than state-of-the-art methods and attains 36X inference speedup in comparison with large pre-trained models. Code and models are available at https://github.com/zhiqiangzhongddu/EvolMPNN.
♻ ☆ Harnessing Large Language Models as Post-hoc Correctors
As Machine Learning (ML) models grow in size and demand higher-quality training data, the expenses associated with re-training and fine-tuning these models are escalating rapidly. Inspired by recent impressive achievements of Large Language Models (LLMs) in different fields, this paper delves into the question: can LLMs efficiently improve an ML's performance at a minimal cost? We show that, through our proposed training-free framework LlmCorr, an LLM can work as a post-hoc corrector to propose corrections for the predictions of an arbitrary ML model. In particular, we form a contextual knowledge database by incorporating the dataset's label information and the ML model's predictions on the validation dataset. Leveraging the in-context learning capability of LLMs, we ask the LLM to summarise the instances in which the ML model makes mistakes and the correlation between primary predictions and true labels. Following this, the LLM can transfer its acquired knowledge to suggest corrections for the ML model's predictions. Our experimental results on text analysis and the challenging molecular predictions show that \model improves the performance of a number of models by up to 39%.
♻ ☆ TabVFL: Improving Latent Representation in Vertical Federated Learning
Autoencoders are popular neural networks that are able to compress high dimensional data to extract relevant latent information. TabNet is a state-of-the-art neural network model designed for tabular data that utilizes an autoencoder architecture for training. Vertical Federated Learning (VFL) is an emerging distributed machine learning paradigm that allows multiple parties to train a model collaboratively on vertically partitioned data while maintaining data privacy. The existing design of training autoencoders in VFL is to train a separate autoencoder in each participant and aggregate the latent representation later. This design could potentially break important correlations between feature data of participating parties, as each autoencoder is trained on locally available features while disregarding the features of others. In addition, traditional autoencoders are not specifically designed for tabular data, which is ubiquitous in VFL settings. Moreover, the impact of client failures during training on the model robustness is under-researched in the VFL scene. In this paper, we propose TabVFL, a distributed framework designed to improve latent representation learning using the joint features of participants. The framework (i) preserves privacy by mitigating potential data leakage with the addition of a fully-connected layer, (ii) conserves feature correlations by learning one latent representation vector, and (iii) provides enhanced robustness against client failures during training phase. Extensive experiments on five classification datasets show that TabVFL can outperform the prior work design, with 26.12% of improvement on f1-score.
comment: This document is a preprint of a paper accepted at IEEE SRDS 2024
♻ ☆ CoSMo: a Framework to Instantiate Conditioned Process Simulation Models
Process simulation is gaining attention for its ability to assess potential performance improvements and risks associated with business process changes. The existing literature presents various techniques, generally grounded in process models discovered from event log data or built upon deep learning algorithms. These techniques have specific strengths and limitations. Traditional data-driven approaches offer increased interpretability, while deep learning-based excel at generalizing changes across large event logs. However, the practical application of deep learning faces challenges related to managing stochasticity and integrating information for what-if analysis. This paper introduces a novel recurrent neural architecture tailored to discover COnditioned process Simulation MOdels (CoSMo) based on user-based constraints or any other nature of a-priori knowledge. This architecture facilitates the simulation of event logs that adhere to specific constraints by incorporating declarative-based rules into the learning phase as an attempt to fill the gap of incorporating information into deep learning models to perform what-if analysis. Experimental validation illustrates CoSMo's efficacy in simulating event logs while adhering to predefined declarative conditions, emphasizing both control-flow and data-flow perspectives.
♻ ☆ Straight-Through meets Sparse Recovery: the Support Exploration Algorithm
The {\it straight-through estimator} (STE) is commonly used to optimize quantized neural networks, yet its contexts of effective performance are still unclear despite empirical successes.To make a step forward in this comprehension, we apply STE to a well-understood problem: {\it sparse support recovery}. We introduce the {\it Support Exploration Algorithm} (SEA), a novel algorithm promoting sparsity, and we analyze its performance in support recovery (a.k.a. model selection) problems. SEA explores more supports than the state-of-the-art, leading to superior performance in experiments, especially when the columns of $A$ are strongly coherent.The theoretical analysis considers recovery guarantees when the linear measurements matrix $A$ satisfies the {\it Restricted Isometry Property} (RIP).The sufficient conditions of recovery are comparable but more stringent than those of the state-of-the-art in sparse support recovery. Their significance lies mainly in their applicability to an instance of the STE.
♻ ☆ Adaptive Collaborative Correlation Learning-based Semi-Supervised Multi-Label Feature Selection
Semi-supervised multi-label feature selection has recently been developed to solve the curse of dimensionality problem in high-dimensional multi-label data with certain samples missing labels. Although many efforts have been made, most existing methods use a predefined graph approach to capture the sample similarity or the label correlation. In this manner, the presence of noise and outliers within the original feature space can undermine the reliability of the resulting sample similarity graph. It also fails to precisely depict the label correlation due to the existence of unknown labels. Besides, these methods only consider the discriminative power of selected features, while neglecting their redundancy. In this paper, we propose an Adaptive Collaborative Correlation lEarning-based Semi-Supervised Multi-label Feature Selection (Access-MFS) method to address these issues. Specifically, a generalized regression model equipped with an extended uncorrelated constraint is introduced to select discriminative yet irrelevant features and maintain consistency between predicted and ground-truth labels in labeled data, simultaneously. Then, the instance correlation and label correlation are integrated into the proposed regression model to adaptively learn both the sample similarity graph and the label similarity graph, which mutually enhance feature selection performance. Extensive experimental results demonstrate the superiority of the proposed Access-MFS over other state-of-the-art methods.
♻ ☆ Expert Q-learning: Deep Reinforcement Learning with Coarse State Values from Offline Expert Examples
In this article, we propose a novel algorithm for deep reinforcement learning named Expert Q-learning. Expert Q-learning is inspired by Dueling Q-learning and aims at incorporating semi-supervised learning into reinforcement learning through splitting Q-values into state values and action advantages. We require that an offline expert assesses the value of a state in a coarse manner using three discrete values. An expert network is designed in addition to the Q-network, which updates each time following the regular offline minibatch update whenever the expert example buffer is not empty. Using the board game Othello, we compare our algorithm with the baseline Q-learning algorithm, which is a combination of Double Q-learning and Dueling Q-learning. Our results show that Expert Q-learning is indeed useful and more resistant to the overestimation bias. The baseline Q-learning algorithm exhibits unstable and suboptimal behavior in non-deterministic settings, whereas Expert Q-learning demonstrates more robust performance with higher scores, illustrating that our algorithm is indeed suitable to integrate state values from expert examples into Q-learning.
comment: Camera-ready version
♻ ☆ A Resilient and Accessible Distribution-Preserving Watermark for Large Language Models ICML 2024
Watermarking techniques offer a promising way to identify machine-generated content via embedding covert information into the contents generated from language models. A challenge in the domain lies in preserving the distribution of original generated content after watermarking. Our research extends and improves upon existing watermarking framework, placing emphasis on the importance of a \textbf{Di}stribution-\textbf{P}reserving (DiP) watermark. Contrary to the current strategies, our proposed DiPmark simultaneously preserves the original token distribution during watermarking (distribution-preserving), is detectable without access to the language model API and prompts (accessible), and is provably robust to moderate changes of tokens (resilient). DiPmark operates by selecting a random set of tokens prior to the generation of a word, then modifying the token distribution through a distribution-preserving reweight function to enhance the probability of these selected tokens during the sampling process. Extensive empirical evaluation on various language models and tasks demonstrates our approach's distribution-preserving property, accessibility, and resilience, making it a effective solution for watermarking tasks that demand impeccable quality preservation.
comment: ICML 2024
♻ ☆ Approximation Theory of Tree Tensor Networks: Tensorized Multivariate Functions
We study the approximation of multivariate functions with tensor networks (TNs). The main conclusion of this work is an answer to the following two questions: ``What are the approximation capabilities of TNs?" and "What is an appropriate model class of functions that can be approximated with TNs?" To answer the former, we show that TNs can (near to) optimally replicate $h$-uniform and $h$-adaptive approximation, for any smoothness order of the target function. Tensor networks thus exhibit universal expressivity w.r.t. isotropic, anisotropic and mixed smoothness spaces that is comparable with more general neural networks families such as deep rectified linear unit (ReLU) networks. Put differently, TNs have the capacity to (near to) optimally approximate many function classes -- without being adapted to the particular class in question. To answer the latter, as a candidate model class we consider approximation classes of TNs and show that these are (quasi-)Banach spaces, that many types of classical smoothness spaces are continuously embedded into said approximation classes and that TN approximation classes are themselves not embedded in any classical smoothness space.
comment: This work is a continuation of M. Ali and A. Nouy. Approximation theory of tree tensor networks: Tensorized univariate functions. Constructive Approximation, 58(2):463-544, 2023. It is also available in two parts on arXiv: for part I see arXiv:2007.00118, for part II see arXiv:2007.00128
♻ ☆ Latent Optimal Paths by Gumbel Propagation for Variational Bayesian Dynamic Programming ICML 2024
We propose the stochastic optimal path which solves the classical optimal path problem by a probability-softening solution. This unified approach transforms a wide range of DP problems into directed acyclic graphs in which all paths follow a Gibbs distribution. We show the equivalence of the Gibbs distribution to a message-passing algorithm by the properties of the Gumbel distribution and give all the ingredients required for variational Bayesian inference of a latent path, namely Bayesian dynamic programming (BDP). We demonstrate the usage of BDP in the latent space of variational autoencoders (VAEs) and propose the BDP-VAE which captures structured sparse optimal paths as latent variables. This enables end-to-end training for generative tasks in which models rely on unobserved structural information. At last, we validate the behavior of our approach and showcase its applicability in two real-world applications: text-to-speech and singing voice synthesis. Our implementation code is available at \url{https://github.com/XinleiNIU/LatentOptimalPathsBayesianDP}.
comment: Accepted by ICML 2024
♻ ☆ NExT-GPT: Any-to-Any Multimodal LLM ICML 2024
While recently Multimodal Large Language Models (MM-LLMs) have made exciting strides, they mostly fall prey to the limitation of only input-side multimodal understanding, without the ability to produce content in multiple modalities. As we humans always perceive the world and communicate with people through various modalities, developing any-to-any MM-LLMs capable of accepting and delivering content in any modality becomes essential to human-level AI. To fill the gap, we present an end-to-end general-purpose any-to-any MM-LLM system, NExT-GPT. We connect an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio. By leveraging the existing well-trained highly-performing encoders and decoders, NExT-GPT is tuned with only a small amount of parameter (1%) of certain projection layers, which not only benefits low-cost training and also facilitates convenient expansion to more potential modalities. Moreover, we introduce a modality-switching instruction tuning (MosIT) and manually curate a high-quality dataset for MosIT, based on which NExT-GPT is empowered with complex cross-modal semantic understanding and content generation. Overall, our research showcases the promising possibility of building an AI agent capable of modeling universal modalities, paving the way for more human-like AI research in the community. Project page: https://next-gpt.github.io/
comment: ICML 2024 (Oral)
♻ ☆ Pretrained deep models outperform GBDTs in Learning-To-Rank under label scarcity ICML
On tabular data, a significant body of literature has shown that current deep learning (DL) models perform at best similarly to Gradient Boosted Decision Trees (GBDTs), while significantly underperforming them on outlier data. However, these works often study idealized problem settings which may fail to capture complexities of real-world scenarios. We identify a natural tabular data setting where DL models can outperform GBDTs: tabular Learning-to-Rank (LTR) under label scarcity. Tabular LTR applications, including search and recommendation, often have an abundance of unlabeled data, and scarce labeled data. We show that DL rankers can utilize unsupervised pretraining to exploit this unlabeled data. In extensive experiments over both public and proprietary datasets, we show that pretrained DL rankers consistently outperform GBDT rankers on ranking metrics -- sometimes by as much as 38% -- both overall and on outliers.
comment: ICML-MFPL 2023 Workshop Oral, SPIGM@ICML2024
♻ ☆ Local primordial non-Gaussianity from the large-scale clustering of photometric DESI luminous red galaxies
We use angular clustering of luminous red galaxies from the Dark Energy Spectroscopic Instrument (DESI) imaging surveys to constrain the local primordial non-Gaussianity parameter $\fnl$. Our sample comprises over 12 million targets, covering 14,000 square degrees of the sky, with redshifts in the range $0.2< z < 1.35$. We identify Galactic extinction, survey depth, and astronomical seeing as the primary sources of systematic error, and employ linear regression and artificial neural networks to alleviate non-cosmological excess clustering on large scales. Our methods are tested against simulations with and without $\fnl$ and systematics, showing superior performance of the neural network treatment. The neural network with a set of nine imaging property maps passes our systematic null test criteria, and is chosen as the fiducial treatment. Assuming the universality relation, we find $\fnl = 34^{+24(+50)}_{-44(-73)}$ at 68\%(95\%) confidence. We apply a series of robustness tests (e.g., cuts on imaging, declination, or scales used) that show consistency in the obtained constraints. We study how the regression method biases the measured angular power-spectrum and degrades the $\fnl$ constraining power. The use of the nine maps more than doubles the uncertainty compared to using only the three primary maps in the regression. Our results thus motivate the development of more efficient methods that avoid over-correction, protect large-scale clustering information, and preserve constraining power. Additionally, our results encourage further studies of $\fnl$ with DESI spectroscopic samples, where the inclusion of 3D clustering modes should help separate imaging systematics and lessen the degradation in the $\fnl$ uncertainty.
comment: 21 pages, 17 figures, 7 tables (Appendix excluded). Published in MNRAS
♻ ☆ Distillation Enhanced Time Series Forecasting Network with Momentum Contrastive Learning
Contrastive representation learning is crucial in time series analysis as it alleviates the issue of data noise and incompleteness as well as sparsity of supervision signal. However, existing constrastive learning frameworks usually focus on intral-temporal features, which fails to fully exploit the intricate nature of time series data. To address this issue, we propose DE-TSMCL, an innovative distillation enhanced framework for long sequence time series forecasting. Specifically, we design a learnable data augmentation mechanism which adaptively learns whether to mask a timestamp to obtain optimized sub-sequences. Then, we propose a contrastive learning task with momentum update to explore inter-sample and intra-temporal correlations of time series to learn the underlying structure feature on the unlabeled time series. Meanwhile, we design a supervised task to learn more robust representations and facilitate the contrastive learning process. Finally, we jointly optimize the above two tasks. By developing model loss from multiple tasks, we can learn effective representations for downstream forecasting task. Extensive experiments, in comparison with state-of-the-arts, well demonstrate the effectiveness of DE-TSMCL, where the maximum improvement can reach to 27.3%.
♻ ☆ CEST-KAN: Kolmogorov-Arnold Networks for CEST MRI Data Analysis
Purpose: This study aims to propose and investigate the feasibility of using Kolmogorov-Arnold Network (KAN) for CEST MRI data analysis (CEST-KAN). Methods: CEST MRI data were acquired from twelve healthy volunteers at 3T. Data from ten subjects were used for training, while the remaining two were reserved for testing. The performance of multi-layer perceptron (MLP) and KAN models with the same network settings were evaluated and compared to the conventional multi-pool Lorentzian fitting (MPLF) method in generating water and multiple CEST contrasts, including amide, relayed nuclear Overhauser effect (rNOE), and magnetization transfer (MT). Results: The water and CEST maps generated by both MLP and KAN were visually comparable to the MPLF results. However, the KAN model demonstrated higher accuracy in extrapolating the CEST fitting metrics, as evidenced by the smaller validation loss during training and smaller absolute error during testing. Voxel-wise correlation analysis showed that all four CEST fitting metrics generated by KAN consistently exhibited higher Pearson coefficients than the MLP results, indicating superior performance. Moreover, the KAN models consistently outperformed the MLP models in varying hidden layer numbers despite longer training time. Conclusion: In this study, we demonstrated for the first time the feasibility of utilizing KAN for CEST MRI data analysis, highlighting its superiority over MLP in this task. The findings suggest that CEST-KAN has the potential to be a robust and reliable post-analysis tool for CEST MRI in clinical settings.
♻ ☆ Graph-Augmented LLMs for Personalized Health Insights: A Case Study in Sleep Analysis
Health monitoring systems have revolutionized modern healthcare by enabling the continuous capture of physiological and behavioral data, essential for preventive measures and early health intervention. While integrating this data with Large Language Models (LLMs) has shown promise in delivering interactive health advice, traditional methods like Retrieval-Augmented Generation (RAG) and fine-tuning often fail to fully utilize the complex, multi-dimensional, and temporally relevant data from wearable devices. These conventional approaches typically provide limited actionable and personalized health insights due to their inadequate capacity to dynamically integrate and interpret diverse health data streams. In response, this paper introduces a graph-augmented LLM framework designed to significantly enhance the personalization and clarity of health insights. Utilizing a hierarchical graph structure, the framework captures inter and intra-patient relationships, enriching LLM prompts with dynamic feature importance scores derived from a Random Forest Model. The effectiveness of this approach is demonstrated through a sleep analysis case study involving 20 college students during the COVID-19 lockdown, highlighting the potential of our model to generate actionable and personalized health insights efficiently. We leverage another LLM to evaluate the insights for relevance, comprehensiveness, actionability, and personalization, addressing the critical need for models that process and interpret complex health data effectively. Our findings show that augmenting prompts with our framework yields significant improvements in all 4 criteria. Through our framework, we can elicit well-crafted, more thoughtful responses tailored to a specific patient.
♻ ☆ GLoRe: When, Where, and How to Improve LLM Reasoning via Global and Local Refinements
State-of-the-art language models can exhibit impressive reasoning refinement capabilities on math, science or coding tasks. However, recent work demonstrates that even the best models struggle to identify \textit{when and where to refine} without access to external feedback. Outcome-based Reward Models (\textbf{ORMs}), trained to predict correctness of the final answer indicating when to refine, offer one convenient solution for deciding when to refine. Process Based Reward Models (\textbf{PRMs}), trained to predict correctness of intermediate steps, can then be used to indicate where to refine. But they are expensive to train, requiring extensive human annotations. In this paper, we propose Stepwise ORMs (\textbf{SORMs}) which are trained, only on synthetic data, to approximate the expected future reward of the optimal policy or $V^{\star}$. More specifically, SORMs are trained to predict the correctness of the final answer when sampling the current policy many times (rather than only once as in the case of ORMs). Our experiments show that SORMs can more accurately detect incorrect reasoning steps compared to ORMs, thus improving downstream accuracy when doing refinements. We then train \textit{global} refinement models, which take only the question and a draft solution as input and predict a corrected solution, and \textit{local} refinement models which also take as input a critique indicating the location of the first reasoning error. We generate training data for both models synthetically by reusing data used to train the SORM. We find combining global and local refinements, using the ORM as a reranker, significantly outperforms either one individually, as well as a best of three sample baseline. With this strategy we can improve the accuracy of a LLaMA-2 13B model (already fine-tuned with RL) on GSM8K from 53\% to 65\% when greedily sampled.
♻ ☆ Gradient Coding with Iterative Block Leverage Score Sampling
We generalize the leverage score sampling sketch for $\ell_2$-subspace embeddings, to accommodate sampling subsets of the transformed data, so that the sketching approach is appropriate for distributed settings. This is then used to derive an approximate coded computing approach for first-order methods; known as gradient coding, to accelerate linear regression in the presence of failures in distributed computational networks, \textit{i.e.} stragglers. We replicate the data across the distributed network, to attain the approximation guarantees through the induced sampling distribution. The significance and main contribution of this work, is that it unifies randomized numerical linear algebra with approximate coded computing, while attaining an induced $\ell_2$-subspace embedding through uniform sampling. The transition to uniform sampling is done without applying a random projection, as in the case of the subsampled randomized Hadamard transform. Furthermore, by incorporating this technique to coded computing, our scheme is an iterative sketching approach to approximately solving linear regression. We also propose weighting when sketching takes place through sampling with replacement, for further compression.
comment: 26 pages, 6 figures, 1 table
♻ ☆ Neural Optimization with Adaptive Heuristics for Intelligent Marketing System KDD 2024
Computational marketing has become increasingly important in today's digital world, facing challenges such as massive heterogeneous data, multi-channel customer journeys, and limited marketing budgets. In this paper, we propose a general framework for marketing AI systems, the Neural Optimization with Adaptive Heuristics (NOAH) framework. NOAH is the first general framework for marketing optimization that considers both to-business (2B) and to-consumer (2C) products, as well as both owned and paid channels. We describe key modules of the NOAH framework, including prediction, optimization, and adaptive heuristics, providing examples for bidding and content optimization. We then detail the successful application of NOAH to LinkedIn's email marketing system, showcasing significant wins over the legacy ranking system. Additionally, we share details and insights that are broadly useful, particularly on: (i) addressing delayed feedback with lifetime value, (ii) performing large-scale linear programming with randomization, (iii) improving retrieval with audience expansion, (iv) reducing signal dilution in targeting tests, and (v) handling zero-inflated heavy-tail metrics in statistical testing.
comment: KDD 2024
♻ ☆ Towards Synthesizing Twelve-Lead Electrocardiograms from Two Asynchronous Leads
The electrocardiogram (ECG) records electrical signals in a non-invasive way to observe the condition of the heart, typically looking at the heart from 12 different directions. Several types of the cardiac disease are diagnosed by using 12-lead ECGs Recently, various wearable devices have enabled immediate access to the ECG without the use of wieldy equipment. However, they only provide ECGs with a couple of leads. This results in an inaccurate diagnosis of cardiac disease due to lacking of required leads. We propose a deep generative model for ECG synthesis from two asynchronous leads to ten leads. It first represents a heart condition referring to two leads, and then generates ten leads based on the represented heart condition. Both the rhythm and amplitude of leads generated resemble those of the original ones, while the technique removes noise and the baseline wander appearing in the original leads. As a data augmentation method, our model improves the classification performance of models compared with models using ECGs with only one or two leads.
♻ ☆ Listening to the Noise: Blind Denoising with Gibbs Diffusion
In recent years, denoising problems have become intertwined with the development of deep generative models. In particular, diffusion models are trained like denoisers, and the distribution they model coincide with denoising priors in the Bayesian picture. However, denoising through diffusion-based posterior sampling requires the noise level and covariance to be known, preventing blind denoising. We overcome this limitation by introducing Gibbs Diffusion (GDiff), a general methodology addressing posterior sampling of both the signal and the noise parameters. Assuming arbitrary parametric Gaussian noise, we develop a Gibbs algorithm that alternates sampling steps from a conditional diffusion model trained to map the signal prior to the family of noise distributions, and a Monte Carlo sampler to infer the noise parameters. Our theoretical analysis highlights potential pitfalls, guides diagnostic usage, and quantifies errors in the Gibbs stationary distribution caused by the diffusion model. We showcase our method for 1) blind denoising of natural images involving colored noises with unknown amplitude and spectral index, and 2) a cosmology problem, namely the analysis of cosmic microwave background data, where Bayesian inference of "noise" parameters means constraining models of the evolution of the Universe.
comment: 12+9 pages, 7+5 figures, 1+1 tables; accepted to 2024 International Conference on Machine Learning; code: https://github.com/rubenohana/Gibbs-Diffusion
Systems and Control 35
☆ Automatic Parameter Tuning of Self-Driving Vehicles
Modern automated driving solutions utilize trajectory planning and control components with numerous parameters that need to be tuned for different driving situations and vehicle types to achieve optimal performance. This paper proposes a method to automatically tune such parameters to resemble expert demonstrations. We utilize a cost function which captures deviations of the closed-loop operation of the controller from the recorded desired driving behavior. Parameter tuning is then accomplished by using local optimization techniques. Three optimization alternatives are compared in a case study, where a trajectory planner is tuned for lane following in a real-world driving scenario. The results suggest that the proposed approach improves manually tuned initial parameters significantly even with respect to noisy demonstration data.
comment: Preprint to appear at the 2024 IEEE Conference on Control Technology and Applications (CCTA)
☆ Neuro-Modeling Infused EMT Analytics
The paper presents a systematic approach to developing Physics-Informed neuro-Models (PIM) for the transient analysis of power grids interconnected with renewables. PIM verifies itself as an adequate digital twin of power components, taking full advantage of physical constraints while requiring only a small fraction of data for training. Three new contributions are presented: 1) An PINN-enabled neuro-modeling approach is devised to construct an accurate EMT model; 2) A data-physics hybrid learning approach is substantiated to demonstrate its data efficiency and adaptability at various levels of data; 3) A balanced-adaptive PIM exemplifies its applicability when multiple terms are predicted simultaneously while maintaining alignment with physical principles. Under various operational scenarios, tests on rotating and static electric components as well as an IEEE test system verifies the efficacy and efficiency of the new learning-infused transient grid analytics.
☆ Position and Time Determination without Prior State Knowledge via Onboard Optical Observations of Delta Scuti Variable Stars
We present a navigation concept for solving the lost in space and time problem using optical observations of $\delta$ Scuti variable stars. Only a small number of techniques exist that allow a spacecraft to recover from being lost in both space and time, which can be caused by a failure of the onboard clock and navigation systems. Optical observations of $\delta$ Scuti stars, which can be collected onboard from star trackers or navigation cameras, may enable faster and/or more accurate position and time solutions compared to existing techniques without requiring additional equipment. Our results indicate that one day of observation by the OSIRIS-APEX MapCam may enable position and time determination accuracy within 0.05 au (3$\sigma$) and 15 s (3$\sigma$).
comment: 28 pages, 9 figures, to be submitted to the Journal of the Astronautical Sciences
☆ Optimizing Energy-Efficient Braking Trajectories with Anticipatory Road Data for Automated Vehicles
Trajectory planning in automated driving typically focuses on satisfying safety and comfort requirements within the vehicle's onboard sensor range. This paper introduces a method that leverages anticipatory road data, such as speed limits, road slopes, and traffic lights, beyond the local perception range to optimize energy-efficient braking trajectories. For that, coasting, which reduces energy consumption, and active braking are combined to transition from the current vehicle velocity to a lower target velocity at a given distance ahead. Finding the switching instants between the coasting phases and the continuous control for the braking phase is addressed as an optimal trade-off between maximizing coasting periods and minimizing braking effort. The resulting switched optimal control problem is solved by deriving necessary optimality conditions. To facilitate the incorporation of additional feasibility constraints for multi-phase trajectories, a sub-optimal alternative solution based on parametric optimization is proposed. Both methods are compared in simulation.
comment: Preprint to appear at the European Control Conference 2024 (ECC'24)
☆ Constructing structured tensor priors for Bayesian inverse problems
Specifying a prior distribution is an essential part of solving Bayesian inverse problems. The prior encodes a belief on the nature of the solution and this regularizes the problem. In this article we completely characterize a Gaussian prior that encodes the belief that the solution is a structured tensor. We first define the notion of (A,b)-constrained tensors and show that they describe a large variety of different structures such as Hankel, circulant, triangular, symmetric, and so on. Then we completely characterize the Gaussian probability distribution of such tensors by specifying its mean vector and covariance matrix. Furthermore, explicit expressions are proved for the covariance matrix of tensors whose entries are invariant under a permutation. These results unlock a whole new class of priors for Bayesian inverse problems. We illustrate how new kernel functions can be designed and efficiently computed and apply our results on two particular Bayesian inverse problems: completing a Hankel matrix from a few noisy measurements and learning an image classifier of handwritten digits. The effectiveness of the proposed priors is demonstrated for both problems. All applications have been implemented as reactive Pluto notebooks in Julia.
☆ Optimal DC-link Voltage from Weight and Loss Perspective for eVTOLs
Electric vertical takeoff and landing (eVTOL) aircraft are emerging as a modern transportation solution aimed at reducing urban traffic congestion and improving the carbon footprint. The power architecture in eVTOLs is defined by the dc bus formed by the battery packs and the power converter used to drive eVTOL motors. A high dc bus voltage is preferred for the power architecture since it can reduce the weight of power cables for a given power rating. However, the impact of high dc bus voltage on the efficiency of the drivetrain power converter must be considered, since reduced efficiency leads to poor battery pack utilization. In this paper, a systematic optimization study is performed considering SiC-based inverter for the drivetrain power converter. Optimal value of dc bus voltage is determined considering the flight profile of eVTOLs. A power converter topology is proposed that can provide optimal performance and enhance the lifetime of the batteries along with providing better monitoring, diagnostics and protection. The optimization strategy is validated experimentally, demonstrating the proposed power architecture's ability to maximize efficiency while enhancing the safety of the battery energy storage system in eVTOLs.
☆ Operating envelopes for the grid-constrained use of distributed flexibility in balancing markets
The increasing share of distributed energy sources enhances the participation potential of distributed flexibility in the provision of system services. However, this participation can endanger the grid-safety of the distribution networks (DNs) from which this flexibility originates. In this paper, the use of operating envelopes (OE) to enable the grid-safe procurement of distributed flexibility in centralized balancing markets is proposed. Two classes of approaches for calculating OEs (one-step and two-step methods) are compared in terms of the level of distribution grid safety they can provide, the impact they can have on the market efficiency, and the volume of discarded flexibility they can yield. A case study considering different system scenarios, based on Monte Carlo simulations, highlights a trade-off between the market efficiency, DN flexibility resource utilization, and the grid safety delivered by the different OE methods. The results showcase that the use of the two-step OE approach results in a more grid-secure albeit less-efficient use of distributed flexibility.
comment: 5 pages
☆ Robust Pareto Design of GaN HEMTs for Millimeter-Wave Applications
This paper introduces a robust Pareto design approach for selecting Gallium Nitride (GaN) High Electron Mobility Transistors (HEMTs), particularly for power amplifier (PA) and low-noise amplifier (LNA) designs in 5G applications. We consider five key design variables and two settings (PAs and LNAs) where we have multiple objectives. We assess designs based on three critical objectives, evaluating each by its worst-case performance across a range of Gate-Source Voltages ($V_{\text{GS}}$). We conduct simulations across a range of $V_{\text{GS}}$ values to ensure a thorough and robust analysis. For PAs, the optimization goals are to maximize the worst-case modulated average output power ($P_{\text{out,avg}}$) and power-added efficiency ($PAE_{\text{avg}}$) while minimizing the worst-case average junction temperature ($T_{\text{j,avg}}$) under a modulated 64-QAM signal stimulus. In contrast, for LNAs, the focus is on maximizing the worst-case maximum oscillation frequency ($f_{\text{max}}$) and Gain, and minimizing the worst-case minimum noise figure ($NF_{\text{min}}$). We utilize a derivative-free optimization method to effectively identify robust Pareto optimal device designs. This approach enhances our comprehension of the trade-off space, facilitating more informed decision-making. Furthermore, this method is general across different applications. Although it does not guarantee a globally optimal design, we demonstrate its effectiveness in GaN device sizing. The primary advantage of this method is that it enables the attainment of near-optimal or even optimal designs with just a fraction of the simulations required for an exhaustive full-grid search.
☆ Prioritized experience replay-based DDQN for Unmanned Vehicle Path Planning
Path planning module is a key module for autonomous vehicle navigation, which directly affects its operating efficiency and safety. In complex environments with many obstacles, traditional planning algorithms often cannot meet the needs of intelligence, which may lead to problems such as dead zones in unmanned vehicles. This paper proposes a path planning algorithm based on DDQN and combines it with the prioritized experience replay method to solve the problem that traditional path planning algorithms often fall into dead zones. A series of simulation experiment results prove that the path planning algorithm based on DDQN is significantly better than other methods in terms of speed and accuracy, especially the ability to break through dead zones in extreme environments. Research shows that the path planning algorithm based on DDQN performs well in terms of path quality and safety. These research results provide an important reference for the research on automatic navigation of autonomous vehicles.
comment: 4 pages, 6 figures, 2024 5th International Conference on Information Science, Parallel and Distributed Systems
☆ A Near-Field Super-Resolution Network for Accelerating Antenna Characterization
We present a deep neural network-enabled method to accelerate near-field (NF) antenna measurement. We develop a Near-field Super-resolution Network (NFS-Net) to reconstruct significantly undersampled near-field data as high-resolution data, which considerably reduces the number of sampling points required for NF measurement and thus improves measurement efficiency. The high-resolution near-field data reconstructed by the network is further processed by a near-field-to-far-field (NF2FF) transformation to obtain far-field antenna radiation patterns. Our experiments demonstrate that the NFS-Net exhibits both accuracy and generalizability in restoring high-resolution near-field data from low-resolution input. The NF measurement workflow that combines the NFS-Net and the NF2FF algorithm enables accurate radiation pattern characterization with only 11% of the Nyquist rate samples. Though the experiments in this study are conducted on a planar setup with a uniform grid, the proposed method can serve as a universal strategy to accelerate measurements under different setups and conditions.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
☆ Enabling Large Language Models to Perform Power System Simulations with Previously Unseen Tools: A Case of Daline
The integration of experiment technologies with large language models (LLMs) is transforming scientific research, offering AI capabilities beyond specialized problem-solving to becoming research assistants for human scientists. In power systems, simulations are essential for research. However, LLMs face significant challenges in power system simulations due to limited pre-existing knowledge and the complexity of power grids. To address this issue, this work proposes a modular framework that integrates expertise from both the power system and LLM domains. This framework enhances LLMs' ability to perform power system simulations on previously unseen tools. Validated using 34 simulation tasks in Daline, a (optimal) power flow simulation and linearization toolbox not yet exposed to LLMs, the proposed framework improved GPT-4o's simulation coding accuracy from 0% to 96.07%, also outperforming the ChatGPT-4o web interface's 33.8% accuracy (with the entire knowledge base uploaded). These results highlight the potential of LLMs as research assistants in power systems.
☆ Coded Kalman Filtering over MIMO Gaussian Channels with Feedback
We consider the problem of remotely stabilizing a linear dynamical system. In this setting, a sensor co-located with the system communicates the system's state to a controller over a noisy communication channel with feedback. The objective of the controller (decoder) is to use the channel outputs to estimate the vector state with finite zero-delay mean squared error (MSE) at the infinite horizon. It has been shown in [1] that for a vector Gauss-Markov source and either a single-input multiple-output (SIMO) or a multiple-input single-output (MISO) channel, linear codes require the minimum capacity to achieve finite MSE. This paper considers the more general problem of linear zero-delay joint-source channel coding (JSCC) of a vector-valued source over a multiple-input multiple-output (MIMO) Gaussian channel with feedback. We study sufficient and necessary conditions for linear codes to achieve finite MSE. For sufficiency, we introduce a coding scheme where each unstable source mode is allocated to a single channel for estimation. Our proof for the necessity of this scheme relies on a matrix-algebraic conjecture that we prove to be true if either the source or channel is scalar. We show that linear codes achieve finite MSE for a scalar source over a MIMO channel if and only if the best scalar sub-channel can achieve finite MSE. Finally, we provide a new counter-example demonstrating that linear codes are generally sub-optimal for coding over MIMO channels.
comment: Accepted for presentation at the 2024 IEEE International Symposium on Information Theory
☆ Hierarchical Framework for Optimizing Wildfire Surveillance and Suppression using Human-Autonomous Teaming
The integration of manned and unmanned aircraft can help improve wildfire response. Wildfire containment failures occur when resources available to first responders, who execute the initial stages of wildfire management referred to as the initial attack, are ineffective or insufficient. Initial attack surveillance and suppression models have linked action spaces and objectives, making their optimization computationally challenging. The initial attack may be formulated as a multi-agent partially observable Markov decision process (MPOMDP). We divide the initial attack MPOMDP into surveillance and suppression processes with their respective planners operating on different, but constant, time scales. A hierarchical framework iterates between surveillance and suppression planners while also providing collision avoidance. This framework is exemplified by a set of multi-rotor unmanned aircraft surveying an initial attack fire while a manned helicopter suppresses the fire with a water bucket. Wildfire-specific solver extensions are formulated to reduce the otherwise vast action spaces. The hierarchical framework outperforms firefighting techniques and a myopic baseline by up to 242% for moderate wildfires and 60% for rapid wildfires when simulated in abstracted and actual case studies. We also validate the early dispatching of additional suppression assets using regression models to ensure wildfire containment to thresholds established by wildfire agencies.
☆ The Role of Electric Grid Research in Addressing Climate Change
Addressing the urgency of climate change necessitates a coordinated and inclusive effort from all relevant stakeholders. Critical to this effort is the modeling, analysis, control, and integration of technological innovations within the electric energy system, which plays a crucial role in scaling up climate change solutions. This perspective article presents a set of research challenges and opportunities in the area of electric power systems that would be crucial in accelerating Gigaton-level decarbonization. Furthermore, it highlights institutional challenges associated with developing market mechanisms and regulatory architectures, ensuring that incentives are aligned for stakeholders to effectively implement the technological solutions on a large scale.
comment: 17 pages, 2 figures
☆ Koopman-LQR Controller for Quadrotor UAVs from Data
Quadrotor systems are common and beneficial for many fields, but their intricate behavior often makes it challenging to design effective and optimal control strategies. Some traditional approaches to nonlinear control often rely on local linearizations or complex nonlinear models, which can be inaccurate or computationally expensive. We present a data-driven approach to identify the dynamics of a given quadrotor system using Koopman operator theory. Koopman theory offers a framework for representing nonlinear dynamics as linear operators acting on observable functions of the state space. This allows to approximate nonlinear systems with globally linear models in a higher dimensional space, which can be analyzed and controlled using standard linear optimal control techniques. We leverage the method of extended dynamic mode decomposition (EDMD) to identify Koopman operator from data with total least squares. We demonstrate that the identified model can be stabilized and controllable by designing a controller using linear quadratic regulator (LQR).
☆ Input-to-state stability meets small-gain theory
Input-to-state stability (ISS) unifies global asymptotic stability with respect to variations of initial conditions with robustness with respect to external disturbances. First, we present Lyapunov characterizations for input-to-state stability as well as ISS superpositions theorems showing relations of ISS to other robust stability properties. Next, we present one of the characteristic applications of the ISS framework - the design of event-based control schemes for the stabilization of nonlinear systems. In the second half of the paper, we focus on small-gain theorems for stability analysis of finite and infinite networks with input-to-state stable components. First, we present a classical small-gain theorem in terms of trajectories for the feedback interconnection of 2 nonlinear systems. Finally, a recent Lyapunov-based small-gain result for a network with infinitely many ISS components is shown.
☆ Delta-V-Optimal Centralized Guidance Strategy For Under-actuated N-Satellite Formations
This paper addresses the computation of Delta-V-optimal, safe, relative orbit reconfigurations for satellite formations in a centralized fashion. The formations under consideration comprise an uncontrolled chief spacecraft flying with an arbitrary number, N, of deputy satellites, where each deputy is equipped with a single electric thruster. Indeed, this represents a technological solution that is becoming widely employed by the producers of small-satellite platforms. While adopting a single electric thruster does reduce the required power, weight, and size of the orbit control system, it comes at the cost of rendering the satellite under-actuated. In this setting, the satellite can provide a desired thrust vector only after an attitude maneuver is carried out to redirect the thruster nozzle opposite to the desired thrust direction. In order to further extend the applicability range of such under-actuated platforms, guidance strategies are developed to support different reconfiguration scenarios for N-satellite formations. This paper starts from a classical non-convex quadratically constrained trajectory optimization formulation, which passes through multiple simplifications and approximations to arrive to two novel convex formulations, namely a second-order cone programming formulation, and a linear programming one. Out of five guidance formulations proposed in this article, the most promising three were compared through an extensive benchmark analysis that is applied to fifteen of the most widely-used solvers. This benchmark experiment provides information about the key distinctions between the different problem formulations, and under which conditions each one of them can be recommended.
☆ Equity-aware Load Shedding Optimization
Load shedding is usually the last resort to balance generation and demand to maintain stable operation of the electric grid after major disturbances. Current load-shedding optimization practices focus mainly on the physical optimality of the network power flow. This might lead to an uneven allocation of load curtailment, disadvantaging some loads more than others. Addressing this oversight, this paper introduces an innovative equity-aware load-shedding optimization model that emphasizes a fair allocation of load curtailment across the network. By proposing a novel equity indicator for load shedding and integrating it into an ACOPF-based optimization framework, we offer grid operators a more balanced and equitable load shedding strategy. Case studies highlight the importance of equity considerations in determining optimal load curtailment between buses.
comment: Contact email for corresponding and first author: allen.fangxin@gmail.com
☆ Translation of Nagumo's Foundational Work on Barrier Functions: On the Location of Integral Curves of Ordinary Differential Equations
In 1942, Prof. Mitio Nagumo published his seminal paper on the location of integral curves of ordinary differential equations. Nagumo's paper provides the foundation of the set invariance of ordinary differential equations and barrier functions, which have recently gained popularity for the control design of safety critical dynamical systems. This translation shall serve the community with an easily accessible version of the original 1942 paper in English. A copy of Nagumo's paper in German is also attached as a reference. That copy was created by the Boeing Company, Germany, in an attempt to improve pdf format readability of the original paper.
♻ ☆ Multi-Modal Conformal Prediction Regions with Simple Structures by Optimizing Convex Shape Templates
Conformal prediction is a statistical tool for producing prediction regions for machine learning models that are valid with high probability. A key component of conformal prediction algorithms is a \emph{non-conformity score function} that quantifies how different a model's prediction is from the unknown ground truth value. Essentially, these functions determine the shape and the size of the conformal prediction regions. While prior work has gone into creating score functions that produce multi-model prediction regions, such regions are generally too complex for use in downstream planning and control problems. We propose a method that optimizes parameterized \emph{shape template functions} over calibration data, which results in non-conformity score functions that produce prediction regions with minimum volume. Our approach results in prediction regions that are \emph{multi-modal}, so they can properly capture residuals of distributions that have multiple modes, and \emph{practical}, so each region is convex and can be easily incorporated into downstream tasks, such as a motion planner using conformal prediction regions. Our method applies to general supervised learning tasks, while we illustrate its use in time-series prediction. We provide a toolbox and present illustrative case studies of F16 fighter jets and autonomous vehicles, showing an up to $68\%$ reduction in prediction region area compared to a circular baseline region.
comment: Accepted to L4DC 2024. 14 pages, 3 figures. The source code and toolbox are available at https://github.com/nandantumu/conformal_region_designer
♻ ☆ Neuromorphic Control of a Pendulum
We illustrate the potential of neuromorphic control on the simple mechanical model of a pendulum, with both event-based actuation and sensing. The controller and the pendulum are regarded as event-based systems that occasionally interact to coordinate their respective rhythms. Control occurs through a proper timing of the interacting events. We illustrate the mixed nature of the control design: the design of a rhythmic automaton, able to generate the right sequence of events, and the design of a feedback regulator, able to tune the timing of events.
comment: 6 pages, 10 figures. Accepted by IEEE Control Systems Letters
♻ ☆ PretVM: Predictable, Efficient Virtual Machine for Real-Time Concurrency
This paper introduces the Precision-Timed Virtual Machine (PretVM), an intermediate platform facilitating the execution of quasi-static schedules compiled from a subset of programs written in the Lingua Franca (LF) coordination language. The subset consists of those programs that in principle should have statically verifiable and predictable timing behavior. The PretVM provides a schedule with well-defined worst-case timing bounds. The PretVM provides a clean separation between application logic and coordination logic, yielding more analyzable program executions. Experiments compare the PretVM against the default (more dynamic) LF scheduler and show that it delivers time-accurate deterministic execution.
♻ ☆ Convex Optimization of Initial Perturbations toward Quantitative Weather Control
This study proposes introducing convex optimization to find initial perturbations of atmospheric models for realizing specified changes in subsequent forecasts. In the proposed method, we formulate and solve an inverse problem to find effective perturbations in atmospheric variables so that controlled variables satisfy specified changes at a specified time. The proposed method first constructs a sensitivity matrix of controlled variables, such as accumulated precipitation, to the initial atmospheric variables, such as temperature and humidity, through sensitivity analysis using numerical weather prediction (NWP) models. The sensitivity matrix is used to solve the inverse problem as convex optimization, in which a global optimal solution can be found computationally efficiently. The proposed method was validated through a benchmark warm bubble experiment using an NWP model. The experiments showed that identified perturbation successfully realized specified spatial distributions of accumulated precipitation. These results demonstrated the possibility of controlling the real atmosphere by solving inverse problems and adding small perturbations to atmospheric states.
comment: revised to improve readability; some figures in Appendix omitted to improve conciseness
♻ ☆ Low-rank approximated Kalman-Bucy filters using Oja's principal component flow for linear time-invariant systems
The Kalman-Bucy filter is extensively utilized across various applications. However, its computational complexity increases significantly in large-scale systems. To mitigate this challenge, a low-rank approximated Kalman--Bucy filter was proposed, comprising Oja's principal component flow and a low-dimensional Riccati differential equation. Previously, the estimation error was confirmed solely for linear time-invariant systems with a symmetric system matrix. This study extends the application by eliminating the constraint on the symmetricity of the system matrix and describes the equilibrium points of the Oja flow along with their stability for general matrices. In addition, the domain of attraction for a set of stable equilibrium points is estimated. Based on these findings, we demonstrate that the low-rank approximated Kalman--Bucy filter with a suitable rank maintains a bounded estimation error covariance matrix if the system is controllable and observable.
comment: fixed a few typos
♻ ☆ Coordinating O&M and Logistical Resources to Enhance Post-Disaster Resilience of Interdependent Power and Natural Gas Distribution Systems
Electric power and natural gas systems are becoming increasingly interdependent, driven by the growth of natural gas-fired generation and the electrification of the gas industry. Recent energy crises have underscored the urgent need for enhanced resilience in these interdependent systems. In response to this challenge, this paper focuses on the interdependent electric power and natural gas distribution systems and proposes a comprehensive strategy for enhancing their resilience, which involves optimally coordinating various resources encompassing emergency operation and maintenance (O&M) and logistical support, to efficiently restore damaged services and infrastructure. Specifically, generators, including those with fuel switching capability, are dispatched to provide emergency power, with the logistical process of fuel supply being particularly managed by coordinated scheduling of mobile fuel tankers. Additionally, integrated demand responses for power and gas are dispatched to alleviate operational stresses using a zonewise approach that groups customers with intertwined energy supplies into individual, dispatchable units. Furthermore, the strategy incorporates the repair of damaged facilities, with particular consideration given to the varying repair efficiencies dependent on the dispatched repair units, aiming to closely simulate actual repair scenarios. The strategy is encapsulated in a mixed-integer second-order cone programming model, with case studies conducted to validate its effectiveness.
comment: 10 pages, 9 figures
♻ ☆ Essentially Sharp Estimates on the Entropy Regularization Error in Discrete Discounted Markov Decision Processes
We study the error introduced by entropy regularization of infinite-horizon discrete discounted Markov decision processes. We show that this error decreases exponentially in the inverse regularization strength both in a weighted KL-divergence and in value with a problem-specific exponent. We provide a lower bound matching our upper bound up to a polynomial factor. Our proof relies on the correspondence of the solutions of entropy-regularized Markov decision processes with gradient flows of the unregularized reward with respect to a Riemannian metric common in natural policy gradient methods. Further, this correspondence allows us to identify the limit of the gradient flow as the generalized maximum entropy optimal policy, thereby characterizing the implicit bias of the Kakade gradient flow which corresponds to a time-continuous version of the natural policy gradient method. We use this to show that for entropy-regularized natural policy gradient methods the overall error decays exponentially in the square root of the number of iterations improving existing sublinear guarantees.
comment: 26 pages, 1 figure
♻ ☆ Mean-Square Exponential Stabilization of Mixed-Autonomy Traffic PDE System
Control of mixed-autonomy traffic where Human-driven Vehicles (HVs) and Autonomous Vehicles (AVs) coexist on the road has gained increasing attention over the recent decades. This paper addresses the boundary stabilization problem for mixed traffic on freeways. The traffic dynamics are described by uncertain coupled hyperbolic partial differential equations (PDEs) with Markov jumping parameters, which aim to address the distinctive driving strategies between AVs and HVs. Considering that the spacing policies of AVs vary in mixed traffic, the stochastic impact area of AVs is governed by a continuous Markov chain. The interactions between HVs and AVs such as overtaking or lane changing are mainly induced by impact areas. Using backstepping design, we develop a full-state feedback boundary control law to stabilize the deterministic system (nominal system). Applying Lyapunov analysis, we demonstrate that the nominal backstepping control law is able to stabilize the traffic system with Markov jumping parameters, provided the nominal parameters are sufficiently close to the stochastic ones on average. The mean-square exponential stability conditions are derived, and the results are validated by numerical simulations.
♻ ☆ Explainable Online Unsupervised Anomaly Detection for Cyber-Physical Systems via Causal Discovery from Time Series
Online unsupervised detection of anomalies is crucial to guarantee the correct operation of cyber-physical systems and the safety of humans interacting with them. State-of-the-art approaches based on deep learning via neural networks achieve outstanding performance at anomaly recognition, evaluating the discrepancy between a normal model of the system (with no anomalies) and the real-time stream of sensor time series. However, large training data and time are typically required, and explainability is still a challenge to identify the root of the anomaly and implement predictive maintainance. In this paper, we use causal discovery to learn a normal causal graph of the system, and we evaluate the persistency of causal links during real-time acquisition of sensor data to promptly detect anomalies. On two benchmark anomaly detection datasets, we show that our method has higher training efficiency, outperforms the accuracy of state-of-the-art neural architectures and correctly identifies the sources of >10 different anomalies. The code is at https://github.com/Isla-lab/causal_anomaly_detection.
comment: In publication for IEEE Conference on Automation and Smart Engineering (CASE) 2024
♻ ☆ When Large Language Models Meet Optical Networks: Paving the Way for Automation
Since the advent of GPT, large language models (LLMs) have brought about revolutionary advancements in all walks of life. As a superior natural language processing (NLP) technology, LLMs have consistently achieved state-of-the-art performance on numerous areas. However, LLMs are considered to be general-purpose models for NLP tasks, which may encounter challenges when applied to complex tasks in specialized fields such as optical networks. In this study, we propose a framework of LLM-empowered optical networks, facilitating intelligent control of the physical layer and efficient interaction with the application layer through an LLM-driven agent (AI-Agent) deployed in the control layer. The AI-Agent can leverage external tools and extract domain knowledge from a comprehensive resource library specifically established for optical networks. This is achieved through user input and well-crafted prompts, enabling the generation of control instructions and result representations for autonomous operation and maintenance in optical networks. To improve LLM's capability in professional fields and stimulate its potential on complex tasks, the details of performing prompt engineering, establishing domain knowledge library, and implementing complex tasks are illustrated in this study. Moreover, the proposed framework is verified on two typical tasks: network alarm analysis and network performance optimization. The good response accuracies and sematic similarities of 2,400 test situations exhibit the great potential of LLM in optical networks.
♻ ☆ Who Plays First? Optimizing the Order of Play in Stackelberg Games with Many Robots
We consider the multi-agent spatial navigation problem of computing the socially optimal order of play, i.e., the sequence in which the agents commit to their decisions, and its associated equilibrium in an N-player Stackelberg trajectory game. We model this problem as a mixed-integer optimization problem over the space of all possible Stackelberg games associated with the order of play's permutations. To solve the problem, we introduce Branch and Play (B&P), an efficient and exact algorithm that provably converges to a socially optimal order of play and its Stackelberg equilibrium. As a subroutine for B&P, we employ and extend sequential trajectory planning, i.e., a popular multi-agent control approach, to scalably compute valid local Stackelberg equilibria for any given order of play. We demonstrate the practical utility of B&P to coordinate air traffic control, swarm formation, and delivery vehicle fleets. We find that B&P consistently outperforms various baselines, and computes the socially optimal equilibrium.
comment: Robotics: Science and Systems (RSS) 2024
♻ ☆ Multi-AUV Kinematic Task Assignment based on Self-organizing Map Neural Network and Dubins Path Generator
To deal with the task assignment problem of multi-AUV systems under kinematic constraints, which means steering capability constraints for underactuated AUVs or other vehicles likely, an improved task assignment algorithm is proposed combining the Dubins Path algorithm with improved SOM neural network algorithm. At first, the aimed tasks are assigned to the AUVs by improved SOM neural network method based on workload balance and neighborhood function. When there exists kinematic constraints or obstacles which may cause failure of trajectory planning, task re-assignment will be implemented by change the weights of SOM neurals, until the AUVs can have paths to reach all the targets. Then, the Dubins paths are generated in several limited cases. AUV's yaw angle is limited, which result in new assignments to the targets. Computation flow is designed so that the algorithm in MATLAB and Python can realizes the path planning to multiple targets. Finally, simulation results prove that the proposed algorithm can effectively accomplish the task assignment task for multi-AUV system.
♻ ☆ A Simple Finite-Time Analysis of TD Learning with Linear Function Approximation
We study the finite-time convergence of TD learning with linear function approximation under Markovian sampling. Existing proofs for this setting either assume a projection step in the algorithm to simplify the analysis, or require a fairly intricate argument to ensure stability of the iterates. We ask: \textit{Is it possible to retain the simplicity of a projection-based analysis without actually performing a projection step in the algorithm?} Our main contribution is to show this is possible via a novel two-step argument. In the first step, we use induction to prove that under a standard choice of a constant step-size $\alpha$, the iterates generated by TD learning remain uniformly bounded in expectation. In the second step, we establish a recursion that mimics the steady-state dynamics of TD learning up to a bounded perturbation on the order of $O(\alpha^2)$ that captures the effect of Markovian sampling. Combining these pieces leads to an overall approach that considerably simplifies existing proofs. We conjecture that our inductive proof technique will find applications in the analyses of more complex stochastic approximation algorithms, and conclude by providing some examples of such applications.
♻ ☆ Bounding Stochastic Safety: Leveraging Freedman's Inequality with Discrete-Time Control Barrier Functions
When deployed in the real world, safe control methods must be robust to unstructured uncertainties such as modeling error and external disturbances. Typical robust safety methods achieve their guarantees by always assuming that the worst-case disturbance will occur. In contrast, this paper utilizes Freedman's inequality in the context of discrete-time control barrier functions (DTCBFs) and c-martingales to provide stronger (less conservative) safety guarantees for stochastic systems. Our approach accounts for the underlying disturbance distribution instead of relying exclusively on its worst-case bound and does not require the barrier function to be upper-bounded, which makes the resulting safety probability bounds more directly useful for intuitive safety constraints such as signed distance. We compare our results with existing safety guarantees, such as input-to-state safety (ISSf) and martingale results that rely on Ville's inequality. When the assumptions for all methods hold, we provide a range of parameters for which our guarantee is stronger. Finally, we present simulation examples, including a bipedal walking robot, that demonstrate the utility and tightness of our safety guarantee.
comment: 10 pages, 3 figures
♻ ☆ Learning Low-dimensional Latent Dynamics from High-dimensional Observations: Non-asymptotics and Lower Bounds
In this paper, we focus on learning a linear time-invariant (LTI) model with low-dimensional latent variables but high-dimensional observations. We provide an algorithm that recovers the high-dimensional features, i.e. column space of the observer, embeds the data into low dimensions and learns the low-dimensional model parameters. Our algorithm enjoys a sample complexity guarantee of order $\tilde{\mathcal{O}}(n/\epsilon^2)$, where $n$ is the observation dimension. We further establish a fundamental lower bound indicating this complexity bound is optimal up to logarithmic factors and dimension-independent constants. We show that this inevitable linear factor of $n$ is due to the learning error of the observer's column space in the presence of high-dimensional noises. Extending our results, we consider a meta-learning problem inspired by various real-world applications, where the observer column space can be collectively learned from datasets of multiple LTI systems. An end-to-end algorithm is then proposed, facilitating learning LTI systems from a meta-dataset which breaks the sample complexity lower bound in certain scenarios.
♻ ☆ The Brain-Inspired Cooperative Shared Control for Brain-Machine Interface
In the practical application of brain-machine interface technology, the problem often faced is the low information content and high noise of the neural signals collected by the electrode and the difficulty of decoding by the decoder, which makes it difficult for the robotic to obtain stable instructions to complete the task. The idea based on the principle of cooperative shared control can be achieved by extracting general motor commands from brain activity, while the fine details of the movement can be hosted to the robot for completion, or the brain can have complete control. This study proposes a brain-machine interface shared control system based on spiking neural networks for robotic arm movement control and wheeled robots wheel speed control and steering, respectively. The former can reliably control the robotic arm to move to the destination position, while the latter controls the wheeled robots for object tracking and map generation. The results show that the shared control based on brain-inspired intelligence can perform some typical tasks in complex environments and positively improve the fluency and ease of use of brain-machine interaction, and also demonstrate the potential of this control method in clinical applications of brain-machine interfaces.
comment: This article need to update the corrected figure and data
Multiagent Systems 12
☆ Fairness in Social Influence Maximization via Optimal Transport
We study fairness in social influence maximization, whereby one seeks to select seeds that spread a given information throughout a network, ensuring balanced outreach among different communities (e.g. demographic groups). In the literature, fairness is often quantified in terms of the expected outreach within individual communities. In this paper, we demonstrate that such fairness metrics can be misleading since they ignore the stochastic nature of information diffusion processes. When information diffusion occurs in a probabilistic manner, multiple outreach scenarios can occur. As such, outcomes such as "in 50% of the cases, no one of group 1 receives the information and everyone in group 2 receives it and in other 50%, the opposite happens", which always results in largely unfair outcomes, are classified as fair by a variety of fairness metrics in the literature. We tackle this problem by designing a new fairness metric, mutual fairness, that captures variability in outreach through optimal transport theory. We propose a new seed selection algorithm that optimizes both outreach and mutual fairness, and we show its efficacy on several real datasets. We find that our algorithm increases fairness with only a minor decrease (and at times, even an increase) in efficiency.
☆ CuDA2: An approach for Incorporating Traitor Agents into Cooperative Multi-Agent Systems
Cooperative Multi-Agent Reinforcement Learning (CMARL) strategies are well known to be vulnerable to adversarial perturbations. Previous works on adversarial attacks have primarily focused on white-box attacks that directly perturb the states or actions of victim agents, often in scenarios with a limited number of attacks. However, gaining complete access to victim agents in real-world environments is exceedingly difficult. To create more realistic adversarial attacks, we introduce a novel method that involves injecting traitor agents into the CMARL system. We model this problem as a Traitor Markov Decision Process (TMDP), where traitors cannot directly attack the victim agents but can influence their formation or positioning through collisions. In TMDP, traitors are trained using the same MARL algorithm as the victim agents, with their reward function set as the negative of the victim agents' reward. Despite this, the training efficiency for traitors remains low because it is challenging for them to directly associate their actions with the victim agents' rewards. To address this issue, we propose the Curiosity-Driven Adversarial Attack (CuDA2) framework. CuDA2 enhances the efficiency and aggressiveness of attacks on the specified victim agents' policies while maintaining the optimal policy invariance of the traitors. Specifically, we employ a pre-trained Random Network Distillation (RND) module, where the extra reward generated by the RND module encourages traitors to explore states unencountered by the victim agents. Extensive experiments on various scenarios from SMAC demonstrate that our CuDA2 framework offers comparable or superior adversarial attack capabilities compared to other baselines.
☆ Learnings from Implementation of a BDI Agent-based Battery-less Wireless Sensor
Battery-less embedded devices powered by energy harvesting are increasingly being used in wireless sensing applications. However, their limited and often uncertain energy availability challenges designing application programs. To examine if BDI-based agent programming can address this challenge, we used it for a real-life application involving an environmental sensor that works on energy harvested from ambient light. This yielded the first ever implementation of a BDI agent on a low-power battery-less and energy-harvesting embedded system. Furthermore, it uncovered conceptual integration challenges between embedded systems and BDI-based agent programming that, if overcome, will simplify the deployment of more autonomous systems on low-power devices with non-deterministic energy availability. Specifically, we (1) mapped essential device states to default \textit{internal} beliefs, (2) recognized and addressed the need for beliefs in general to be \textit{short-} or \textit{long-term}, and (3) propose dynamic annotation of intentions with their run-time energy impact. We show that incorporating these extensions not only simplified the programming but also improved code readability and understanding of its behavior.
☆ Model Checking of vGOAL AAMAS
Developing autonomous decision-making requires safety assurance. Agent programming languages like AgentSpeak and Gwendolen provide tools for programming autonomous decision-making. However, despite numerous efforts to apply model checking to these languages, challenges persist such as a faithful semantic mapping between agent programs and the generated models, efficient model generation, and efficient model checking. As an extension of the agent programming language GOAL, vGOAL has been proposed to formally specify autonomous decisions with an emphasis on safety. This paper tackles the mentioned challenges through two automated model-checking processes for vGOAL: one for Computation Tree Logic and another for Probabilistic Computation Tree Logic. Compared with the existing model-checking approaches of agent programming languages, it has three main advantages. First, it efficiently performs automated model-checking analysis for a given vGOAL specification, including efficiently generating input models for NuSMV and Storm and leveraging these efficient model checkers. Second, the semantic equivalence is established for both nondeterministic models and probabilistic models of vGOAL: from vGOAL to transition systems or DTMCs. Third, an algorithm is proposed for efficiently detecting errors, which is particularly useful for vGOAL specifications that describe complex scenarios. Validation and experiments in a real-world autonomous logistic system with three autonomous mobile robots illustrate both the efficiency and practical usability of the automated CTL and PCTL model-checking process for vGOAL.
comment: 21 pages, 2 figures, it is a draft version of a paper that plans to submit to JAAMAS
☆ Hierarchical Framework for Optimizing Wildfire Surveillance and Suppression using Human-Autonomous Teaming
The integration of manned and unmanned aircraft can help improve wildfire response. Wildfire containment failures occur when resources available to first responders, who execute the initial stages of wildfire management referred to as the initial attack, are ineffective or insufficient. Initial attack surveillance and suppression models have linked action spaces and objectives, making their optimization computationally challenging. The initial attack may be formulated as a multi-agent partially observable Markov decision process (MPOMDP). We divide the initial attack MPOMDP into surveillance and suppression processes with their respective planners operating on different, but constant, time scales. A hierarchical framework iterates between surveillance and suppression planners while also providing collision avoidance. This framework is exemplified by a set of multi-rotor unmanned aircraft surveying an initial attack fire while a manned helicopter suppresses the fire with a water bucket. Wildfire-specific solver extensions are formulated to reduce the otherwise vast action spaces. The hierarchical framework outperforms firefighting techniques and a myopic baseline by up to 242% for moderate wildfires and 60% for rapid wildfires when simulated in abstracted and actual case studies. We also validate the early dispatching of additional suppression assets using regression models to ensure wildfire containment to thresholds established by wildfire agencies.
☆ The Overcooked Generalisation Challenge
We introduce the Overcooked Generalisation Challenge (OGC) - the first benchmark to study agents' zero-shot cooperation abilities when faced with novel partners and levels in the Overcooked-AI environment. This perspective starkly contrasts a large body of previous work that has trained and evaluated cooperating agents only on the same level, failing to capture generalisation abilities required for real-world human-AI cooperation. Our challenge interfaces with state-of-the-art dual curriculum design (DCD) methods to generate auto-curricula for training general agents in Overcooked. It is the first cooperative multi-agent environment specially designed for DCD methods and, consequently, the first benchmarked with state-of-the-art methods. It is fully GPU-accelerated, built on the DCD benchmark suite minimax, and freely available under an open-source license: https://git.hcics.simtech.uni-stuttgart.de/public-projects/OGC. We show that current DCD algorithms struggle to produce useful policies in this novel challenge, even if combined with recent network architectures that were designed for scalability and generalisability. The OGC pushes the boundaries of real-world human-AI cooperation by enabling the research community to study the impact of generalisation on cooperating agents.
comment: 9 pages
☆ Towards Hypermedia Environments for Adaptive Coordination in Industrial Automation
Electromechanical systems manage physical processes through a network of inter-connected components. Today, programming the interactions required for coordinating these components is largely a manual process. This process is time-consuming and requires manual adaptation when system features change. To overcome this issue, we use autonomous software agents that process semantic descriptions of the system to determine coordination requirements and constraints; on this basis, they then interact with one another to control the system in a decentralized and coordinated manner.Our core insight is that coordination requirements between individual components are, ultimately, largely due to underlying physical interdependencies between the components, which can be (and, in many cases, already are) semantically modeled in automation projects. Agents then use hypermedia to discover, at run time, the plans and protocols required for enacting the coordination. A key novelty of our approach is the use of hypermedia-driven interaction: it reduces coupling in the system and enables its run-time adaptation as features change.
♻ ☆ Locally Differentially Private Distributed Online Learning with Guaranteed Optimality
Distributed online learning is gaining increased traction due to its unique ability to process large-scale datasets and streaming data. To address the growing public awareness and concern on privacy protection, plenty of algorithms have been proposed to enable differential privacy in distributed online optimization and learning. However, these algorithms often face the dilemma of trading learning accuracy for privacy. By exploiting the unique characteristics of online learning, this paper proposes an approach that tackles the dilemma and ensures both differential privacy and learning accuracy in distributed online learning. More specifically, while ensuring a diminishing expected instantaneous regret, the approach can simultaneously ensure a finite cumulative privacy budget, even in the infinite time horizon. To cater for the fully distributed setting, we adopt the local differential-privacy framework, which avoids the reliance on a trusted data curator, and, hence, provides stronger protection than the classic "centralized" (global) differential privacy. To the best of our knowledge, this is the first algorithm that successfully ensures both rigorous local differential privacy and learning accuracy. The effectiveness of the proposed algorithm is evaluated using machine learning tasks, including logistic regression on the the "mushrooms" datasets and CNN-based image classification on the "MNIST" and "CIFAR-10" datasets.
comment: 23 pages, 9 figures
♻ ☆ Detection of Malicious Agents in Social Learning
Non-Bayesian social learning is a framework for distributed hypothesis testing aimed at learning the true state of the environment. Traditionally, the agents are assumed to receive observations conditioned on the same true state, although it is also possible to examine the case of heterogeneous models across the graph. One important special case is when heterogeneity is caused by the presence of malicious agents whose goal is to move the agents toward a wrong hypothesis. In this work, we propose an algorithm that allows discovering the true state of every individual agent based on the sequence of their beliefs. In so doing, the methodology is also able to locate malicious behavior.
♻ ☆ TornadoDrone: Bio-inspired DRL-based Drone Landing on 6D Platform with Wind Force Disturbances
Autonomous drone navigation faces a critical challenge in achieving accurate landings on dynamic platforms, especially under unpredictable conditions such as wind turbulence. Our research introduces TornadoDrone, a novel Deep Reinforcement Learning (DRL) model that adopts bio-inspired mechanisms to adapt to wind forces, mirroring the natural adaptability seen in birds. This model, unlike traditional approaches, derives its adaptability from indirect cues such as changes in position and velocity, rather than direct wind force measurements. TornadoDrone was rigorously trained in the gym-pybullet-drone simulator, which closely replicates the complexities of wind dynamics in the real world. Through extensive testing with Crazyflie 2.1 drones in both simulated and real windy conditions, TornadoDrone demonstrated a high performance in maintaining high-precision landing accuracy on moving platforms, surpassing conventional control methods such as PID controllers with Extended Kalman Filters. The study not only highlights the potential of DRL to tackle complex aerodynamic challenges but also paves the way for advanced autonomous systems that can adapt to environmental changes in real-time. The success of TornadoDrone signifies a leap forward in drone technology, particularly for critical applications such as surveillance and emergency response, where reliability and precision are paramount.
comment: Submitted to IEEE. arXiv admin note: substantial text overlap with arXiv:2403.06572
♻ ☆ Instruct, Not Assist: LLM-based Multi-Turn Planning and Hierarchical Questioning for Socratic Code Debugging
Socratic questioning is an effective teaching strategy, encouraging critical thinking and problem-solving. The conversational capabilities of large language models (LLMs) show great potential for providing scalable, real-time student guidance. However, current LLMs often give away solutions directly, making them ineffective instructors. We tackle this issue in the code debugging domain with TreeInstruct, an Instructor agent guided by a novel state space-based planning algorithm. TreeInstruct asks probing questions to help students independently identify and resolve errors. It estimates a student's conceptual and syntactical knowledge to dynamically construct a question tree based on their responses and current knowledge state, effectively addressing both independent and dependent mistakes concurrently in a multi-turn interaction setting. In addition to using an existing single-bug debugging benchmark, we construct a more challenging multi-bug dataset of 150 coding problems, incorrect solutions, and bug fixes -- all carefully constructed and annotated by experts. Extensive evaluation shows TreeInstruct's state-of-the-art performance on both datasets, proving it to be a more effective instructor than baselines. Furthermore, a real-world case study with five students of varying skill levels further demonstrates TreeInstruct's ability to guide students to debug their code efficiently with minimal turns and highly Socratic questioning.
♻ ☆ Community Consensus: Converging Locally despite Adversaries and Heterogeneous Connectivity
We introduce the concept of community consensus in the presence of malicious agents using a well-known median-based consensus algorithm. We consider networks that have multiple well-connected regions that we term communities, characterized by specific robustness and minimum degree properties. Prior work derives conditions on properties that are necessary and sufficient for achieving global consensus in a network. This, however, requires the minimum degree of the network graph to be proportional to the number of malicious agents in the network, which is not very practical in large networks. In this work, we present a natural generalization of this previous result. We characterize cases where, although global consensus is not reached, some subsets of agents $V_i$ will still converge to the same values $M_i$ among themselves. To reach this new type of consensus, we define more relaxed requirements in terms of the number of malicious agents in each community, and the number $k$ of edges connecting an agent in a community to agents external to the community.
Artificial Intelligence 140
☆ EXTRACT: Efficient Policy Learning by Extracting Transferrable Robot Skills from Offline Data
Most reinforcement learning (RL) methods focus on learning optimal policies over low-level action spaces. While these methods can perform well in their training environments, they lack the flexibility to transfer to new tasks. Instead, RL agents that can act over useful, temporally extended skills rather than low-level actions can learn new tasks more easily. Prior work in skill-based RL either requires expert supervision to define useful skills, which is hard to scale, or learns a skill-space from offline data with heuristics that limit the adaptability of the skills, making them difficult to transfer during downstream RL. Our approach, EXTRACT, instead utilizes pre-trained vision language models to extract a discrete set of semantically meaningful skills from offline data, each of which is parameterized by continuous arguments, without human supervision. This skill parameterization allows robots to learn new tasks by only needing to learn when to select a specific skill and how to modify its arguments for the specific task. We demonstrate through experiments in sparse-reward, image-based, robot manipulation environments that EXTRACT can more quickly learn new tasks than prior works, with major gains in sample efficiency and performance over prior skill-based RL. Website at https://www.jessezhang.net/projects/extract/.
comment: 22 pages, 13 figures
☆ BMIKE-53: Investigating Cross-Lingual Knowledge Editing with In-Context Learning
Large language models (LLMs) possess extensive parametric knowledge, but this knowledge is difficult to update with new information because retraining is very expensive and infeasible for closed-source models. Knowledge editing (KE) has emerged as a viable solution for updating the knowledge of LLMs without compromising their overall performance. On-the-fly KE methods, inspired by in-context learning (ICL), have shown great promise and allow LLMs to be treated as black boxes. In the past, KE was primarily employed in English contexts, whereas the potential for cross-lingual KE in current English-centric LLMs has not been fully explored. To foster more research in this direction, we introduce the BMIKE-53 benchmark for evaluating cross-lingual KE on 53 diverse languages across three KE task types. We also propose a gradient-free KE method called Multilingual In-context Knowledge Editing (MIKE) and evaluate it on BMIKE-53. Our evaluation focuses on cross-lingual knowledge transfer in terms of reliability, generality, locality, and portability, offering valuable insights and a framework for future research in cross-lingual KE. Our code and data are publicly accessible via the anonymous repository at https://anonymous.4open.science/r/MIKE.
comment: 12 pages, 4 figures
☆ DiffusionPDE: Generative PDE-Solving Under Partial Observation
We introduce a general framework for solving partial differential equations (PDEs) using generative diffusion models. In particular, we focus on the scenarios where we do not have the full knowledge of the scene necessary to apply classical solvers. Most existing forward or inverse PDE approaches perform poorly when the observations on the data or the underlying coefficients are incomplete, which is a common assumption for real-world measurements. In this work, we propose DiffusionPDE that can simultaneously fill in the missing information and solve a PDE by modeling the joint distribution of the solution and coefficient spaces. We show that the learned generative priors lead to a versatile framework for accurately solving a wide range of PDEs under partial observation, significantly outperforming the state-of-the-art methods for both forward and inverse directions.
comment: Project page: https://jhhuangchloe.github.io/Diffusion-PDE/
☆ Solving Hard Mizar Problems with Instantiation and Strategy Invention
In this work, we prove over 3000 previously ATP-unproved Mizar/MPTP problems by using several ATP and AI methods, raising the number of ATP-solved Mizar problems from 75\% to above 80\%. First, we start to experiment with the cvc5 SMT solver which uses several instantiation-based heuristics that differ from the superposition-based systems, that were previously applied to Mizar,and add many new solutions. Then we use automated strategy invention to develop cvc5 strategies that largely improve cvc5's performance on the hard problems. In particular, the best invented strategy solves over 14\% more problems than the best previously available cvc5 strategy. We also show that different clausification methods have a high impact on such instantiation-based methods, again producing many new solutions. In total, the methods solve 3021 (21.3\%) of the 14163 previously unsolved hard Mizar problems. This is a new milestone over the Mizar large-theory benchmark and a large strengthening of the hammer methods for Mizar.
☆ CaLMQA: Exploring culturally specific long-form question answering across 23 languages
Large language models (LLMs) are commonly used for long-form question answering, which requires them to generate paragraph-length answers to complex questions. While long-form QA has been well-studied in English via many different datasets and evaluation metrics, this research has not been extended to cover most other languages. To bridge this gap, we introduce CaLMQA, a collection of 2.6K complex questions spanning 23 languages, including under-resourced, rarely-studied languages such as Fijian and Kirundi. Our dataset includes both naturally-occurring questions collected from community web forums as well as questions written by native speakers, whom we hire for this purpose. Our process yields diverse, complex questions that reflect cultural topics (e.g. traditions, laws, news) and the language usage of native speakers. We conduct automatic evaluation across a suite of open- and closed-source models using our novel metric CaLMScore, which detects incorrect language and token repetitions in answers, and observe that the quality of LLM-generated answers degrades significantly for some low-resource languages. We perform human evaluation on a subset of models and see that model performance is significantly worse for culturally specific questions than for culturally agnostic questions. Our findings highlight the need for further research in LLM multilingual capabilities and non-English LFQA evaluation.
comment: 39 pages, 16 figures. Code and data available at https://github.com/2015aroras/CaLMQA
☆ Measuring and Benchmarking Large Language Models' Capabilities to Generate Persuasive Language
We are exposed to much information trying to influence us, such as teaser messages, debates, politically framed news, and propaganda - all of which use persuasive language. With the recent interest in Large Language Models (LLMs), we study the ability of LLMs to produce persuasive text. As opposed to prior work which focuses on particular domains or types of persuasion, we conduct a general study across various domains to measure and benchmark to what degree LLMs produce persuasive text - both when explicitly instructed to rewrite text to be more or less persuasive and when only instructed to paraphrase. To this end, we construct a new dataset, Persuasive-Pairs, of pairs each consisting of a short text and of a text rewritten by an LLM to amplify or diminish persuasive language. We multi-annotate the pairs on a relative scale for persuasive language. This data is not only a valuable resource in itself, but we also show that it can be used to train a regression model to predict a score of persuasive language between text pairs. This model can score and benchmark new LLMs across domains, thereby facilitating the comparison of different LLMs. Finally, we discuss effects observed for different system prompts. Notably, we find that different 'personas' in the system prompt of LLaMA3 change the persuasive language in the text substantially, even when only instructed to paraphrase. These findings underscore the importance of investigating persuasive language in LLM generated text.
☆ Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon
Memorization in language models is typically treated as a homogenous phenomenon, neglecting the specifics of the memorized data. We instead model memorization as the effect of a set of complex factors that describe each sample and relate it to the model and corpus. To build intuition around these factors, we break memorization down into a taxonomy: recitation of highly duplicated sequences, reconstruction of inherently predictable sequences, and recollection of sequences that are neither. We demonstrate the usefulness of our taxonomy by using it to construct a predictive model for memorization. By analyzing dependencies and inspecting the weights of the predictive model, we find that different factors influence the likelihood of memorization differently depending on the taxonomic category.
☆ Point-SAM: Promptable 3D Segmentation Model for Point Clouds
The development of 2D foundation models for image segmentation has been significantly advanced by the Segment Anything Model (SAM). However, achieving similar success in 3D models remains a challenge due to issues such as non-unified data formats, lightweight models, and the scarcity of labeled data with diverse masks. To this end, we propose a 3D promptable segmentation model (Point-SAM) focusing on point clouds. Our approach utilizes a transformer-based method, extending SAM to the 3D domain. We leverage part-level and object-level annotations and introduce a data engine to generate pseudo labels from SAM, thereby distilling 2D knowledge into our 3D model. Our model outperforms state-of-the-art models on several indoor and outdoor benchmarks and demonstrates a variety of applications, such as 3D annotation. Codes and demo can be found at https://github.com/zyc00/Point-SAM.
☆ Structured Unrestricted-Rank Matrices for Parameter Efficient Fine-tuning
Recent efforts to scale Transformer models have demonstrated rapid progress across a wide range of tasks (Wei et al., 2022). However, fine-tuning these models for downstream tasks is expensive due to their large parameter counts. Parameter-efficient fine-tuning (PEFT) approaches have emerged as a viable alternative by allowing us to fine-tune models by updating only a small number of parameters. In this work, we propose a general framework for parameter efficient fine-tuning (PEFT), based on structured unrestricted-rank matrices (SURM) which can serve as a drop-in replacement for popular approaches such as Adapters and LoRA. Unlike other methods like LoRA, SURMs provides more flexibility in finding the right balance between compactness and expressiveness. This is achieved by using low displacement rank matrices (LDRMs), which hasn't been used in this context before. SURMs remain competitive with baselines, often providing significant quality improvements while using a smaller parameter budget. SURMs achieve 5-7% accuracy gains on various image classification tasks while replacing low-rank matrices in LoRA. It also results in up to 12x reduction of the number of parameters in adapters (with virtually no loss in quality) on the GLUE benchmark.
comment: Work in progress
☆ Find Parent then Label Children: A Two-stage Taxonomy Completion Method with Pre-trained Language Model
Taxonomies, which organize domain concepts into hierarchical structures, are crucial for building knowledge systems and downstream applications. As domain knowledge evolves, taxonomies need to be continuously updated to include new concepts. Previous approaches have mainly focused on adding concepts to the leaf nodes of the existing hierarchical tree, which does not fully utilize the taxonomy's knowledge and is unable to update the original taxonomy structure (usually involving non-leaf nodes). In this paper, we propose a two-stage method called ATTEMPT for taxonomy completion. Our method inserts new concepts into the correct position by finding a parent node and labeling child nodes. Specifically, by combining local nodes with prompts to generate natural sentences, we take advantage of pre-trained language models for hypernym/hyponymy recognition. Experimental results on two public datasets (including six domains) show that ATTEMPT performs best on both taxonomy completion and extension tasks, surpassing existing methods.
☆ LLM Targeted Underperformance Disproportionately Impacts Vulnerable Users
While state-of-the-art Large Language Models (LLMs) have shown impressive performance on many tasks, there has been extensive research on undesirable model behavior such as hallucinations and bias. In this work, we investigate how the quality of LLM responses changes in terms of information accuracy, truthfulness, and refusals depending on three user traits: English proficiency, education level, and country of origin. We present extensive experimentation on three state-of-the-art LLMs and two different datasets targeting truthfulness and factuality. Our findings suggest that undesirable behaviors in state-of-the-art LLMs occur disproportionately more for users with lower English proficiency, of lower education status, and originating from outside the US, rendering these models unreliable sources of information towards their most vulnerable users.
☆ Compositional Models for Estimating Causal Effects
Many real-world systems can be represented as sets of interacting components. Examples of such systems include computational systems such as query processors, natural systems such as cells, and social systems such as families. Many approaches have been proposed in traditional (associational) machine learning to model such structured systems, including statistical relational models and graph neural networks. Despite this prior work, existing approaches to estimating causal effects typically treat such systems as single units, represent them with a fixed set of variables and assume a homogeneous data-generating process. We study a compositional approach for estimating individual treatment effects (ITE) in structured systems, where each unit is represented by the composition of multiple heterogeneous components. This approach uses a modular architecture to model potential outcomes at each component and aggregates component-level potential outcomes to obtain the unit-level potential outcomes. We discover novel benefits of the compositional approach in causal inference - systematic generalization to estimate counterfactual outcomes of unseen combinations of components and improved overlap guarantees between treatment and control groups compared to the classical methods for causal effect estimation. We also introduce a set of novel environments for empirically evaluating the compositional approach and demonstrate the effectiveness of our approach using both simulated and real-world data.
☆ Data curation via joint example selection further accelerates multimodal learning
Data curation is an essential component of large-scale pretraining. In this work, we demonstrate that jointly selecting batches of data is more effective for learning than selecting examples independently. Multimodal contrastive objectives expose the dependencies between data and thus naturally yield criteria for measuring the joint learnability of a batch. We derive a simple and tractable algorithm for selecting such batches, which significantly accelerate training beyond individually-prioritized data points. As performance improves by selecting from larger super-batches, we also leverage recent advances in model approximation to reduce the associated computational overhead. As a result, our approach--multimodal contrastive learning with joint example selection (JEST)--surpasses state-of-the-art models with up to 13$\times$ fewer iterations and 10$\times$ less computation. Essential to the performance of JEST is the ability to steer the data selection process towards the distribution of smaller, well-curated datasets via pretrained reference models, exposing the level of data curation as a new dimension for neural scaling laws.
comment: Main text: 9 pages, 5 figures, 3 tables, 1 algorithm. Appendix: 7 pages, 5 figures, 1 table, 2. algorithm
☆ HGTDP-DTA: Hybrid Graph-Transformer with Dynamic Prompt for Drug-Target Binding Affinity Prediction
Drug target binding affinity (DTA) is a key criterion for drug screening. Existing experimental methods are time-consuming and rely on limited structural and domain information. While learning-based methods can model sequence and structural information, they struggle to integrate contextual data and often lack comprehensive modeling of drug-target interactions. In this study, we propose a novel DTA prediction method, termed HGTDP-DTA, which utilizes dynamic prompts within a hybrid Graph-Transformer framework. Our method generates context-specific prompts for each drug-target pair, enhancing the model's ability to capture unique interactions. The introduction of prompt tuning further optimizes the prediction process by filtering out irrelevant noise and emphasizing task-relevant information, dynamically adjusting the input features of the molecular graph. The proposed hybrid Graph-Transformer architecture combines structural information from Graph Convolutional Networks (GCNs) with sequence information captured by Transformers, facilitating the interaction between global and local information. Additionally, we adopted the multi-view feature fusion method to project molecular graph views and affinity subgraph views into a common feature space, effectively combining structural and contextual information. Experiments on two widely used public datasets, Davis and KIBA, show that HGTDP-DTA outperforms state-of-the-art DTA prediction methods in both prediction performance and generalization ability.
☆ Unified Auto-Encoding with Masked Diffusion
At the core of both successful generative and self-supervised representation learning models there is a reconstruction objective that incorporates some form of image corruption. Diffusion models implement this approach through a scheduled Gaussian corruption process, while masked auto-encoder models do so by masking patches of the image. Despite their different approaches, the underlying similarity in their methodologies suggests a promising avenue for an auto-encoder capable of both de-noising tasks. We propose a unified self-supervised objective, dubbed Unified Masked Diffusion (UMD), that combines patch-based and noise-based corruption techniques within a single auto-encoding framework. Specifically, UMD modifies the diffusion transformer (DiT) training process by introducing an additional noise-free, high masking representation step in the diffusion noising schedule, and utilizes a mixed masked and noised image for subsequent timesteps. By integrating features useful for diffusion modeling and for predicting masked patch tokens, UMD achieves strong performance in downstream generative and representation learning tasks, including linear probing and class-conditional generation. This is achieved without the need for heavy data augmentations, multiple views, or additional encoders. Furthermore, UMD improves over the computational efficiency of prior diffusion based methods in total training time. We release our code at https://github.com/philippe-eecs/small-vision.
comment: 19 Pages, 8 Figures, 3Tables
☆ LLM-ARC: Enhancing LLMs with an Automated Reasoning Critic
We introduce LLM-ARC, a neuro-symbolic framework designed to enhance the logical reasoning capabilities of Large Language Models (LLMs), by combining them with an Automated Reasoning Critic (ARC). LLM-ARC employs an Actor-Critic method where the LLM Actor generates declarative logic programs along with tests for semantic correctness, while the Automated Reasoning Critic evaluates the code, runs the tests and provides feedback on test failures for iterative refinement. Implemented using Answer Set Programming (ASP), LLM-ARC achieves a new state-of-the-art accuracy of 88.32% on the FOLIO benchmark which tests complex logical reasoning capabilities. Our experiments demonstrate significant improvements over LLM-only baselines, highlighting the importance of logic test generation and iterative self-refinement. We achieve our best result using a fully automated self-supervised training loop where the Actor is trained on end-to-end dialog traces with Critic feedback. We discuss potential enhancements and provide a detailed error analysis, showcasing the robustness and efficacy of LLM-ARC for complex natural language reasoning tasks.
☆ DKPROMPT: Domain Knowledge Prompting Vision-Language Models for Open-World Planning
Vision-language models (VLMs) have been applied to robot task planning problems, where the robot receives a task in natural language and generates plans based on visual inputs. While current VLMs have demonstrated strong vision-language understanding capabilities, their performance is still far from being satisfactory in planning tasks. At the same time, although classical task planners, such as PDDL-based, are strong in planning for long-horizon tasks, they do not work well in open worlds where unforeseen situations are common. In this paper, we propose a novel task planning and execution framework, called DKPROMPT, which automates VLM prompting using domain knowledge in PDDL for classical planning in open worlds. Results from quantitative experiments show that DKPROMPT outperforms classical planning, pure VLM-based and a few other competitive baselines in task completion rate.
☆ MDHA: Multi-Scale Deformable Transformer with Hybrid Anchors for Multi-View 3D Object Detection
Multi-view 3D object detection is a crucial component of autonomous driving systems. Contemporary query-based methods primarily depend either on dataset-specific initialization of 3D anchors, introducing bias, or utilize dense attention mechanisms, which are computationally inefficient and unscalable. To overcome these issues, we present MDHA, a novel sparse query-based framework, which constructs adaptive 3D output proposals using hybrid anchors from multi-view, multi-scale input. Fixed 2D anchors are combined with depth predictions to form 2.5D anchors, which are projected to obtain 3D proposals. To ensure high efficiency, our proposed Anchor Encoder performs sparse refinement and selects the top-k anchors and features. Moreover, while existing multi-view attention mechanisms rely on projecting reference points to multiple images, our novel Circular Deformable Attention mechanism only projects to a single image but allows reference points to seamlessly attend to adjacent images, improving efficiency without compromising on performance. On the nuScenes val set, it achieves 46.4% mAP and 55.0% NDS with a ResNet101 backbone. MDHA significantly outperforms the baseline, where anchor proposals are modelled as learnable embeddings.
☆ Leveraging Large Language Models for Software Model Completion: Results from Industrial and Public Datasets
Modeling structure and behavior of software systems plays a crucial role in the industrial practice of software engineering. As with other software engineering artifacts, software models are subject to evolution. Supporting modelers in evolving software models with recommendations for model completions is still an open problem, though. In this paper, we explore the potential of large language models for this task. In particular, we propose an approach, retrieval-augmented generation, leveraging large language models, model histories, and retrieval-augmented generation for model completion. Through experiments on three datasets, including an industrial application, one public open-source community dataset, and one controlled collection of simulated model repositories, we evaluate the potential of large language models for model completion with retrieval-augmented generation. We found that large language models are indeed a promising technology for supporting software model evolution (62.30% semantically correct completions on real-world industrial data and up to 86.19% type-correct completions). The general inference capabilities of large language models are particularly useful when dealing with concepts for which there are few, noisy, or no examples at all.
☆ ELIZA Reinterpreted: The world's first chatbot was not intended as a chatbot at all
ELIZA, often considered the world's first chatbot, was written by Joseph Weizenbaum in the early 1960s. Weizenbaum did not intend to invent the chatbot, but rather to build a platform for research into human-machine conversation and the important cognitive processes of interpretation and misinterpretation. His purpose was obscured by ELIZA's fame, resulting in large part from the fortuitous timing of it's creation, and it's escape into the wild. In this paper I provide a rich historical context for ELIZA's creation, demonstrating that ELIZA arose from the intersection of some of the central threads in the technical history of AI. I also briefly discuss how ELIZA escaped into the world, and how its accidental escape, along with several coincidental turns of the programming language screws, led both to the misapprehension that ELIZA was intended as a chatbot, and to the loss of the original ELIZA to history for over 50 years.
comment: In review in IEEE Annals of the History of Computing (submitted Apr 2024)
☆ Banishing LLM Hallucinations Requires Rethinking Generalization
Despite their powerful chat, coding, and reasoning abilities, Large Language Models (LLMs) frequently hallucinate. Conventional wisdom suggests that hallucinations are a consequence of a balance between creativity and factuality, which can be mitigated, but not eliminated, by grounding the LLM in external knowledge sources. Through extensive systematic experiments, we show that these traditional approaches fail to explain why LLMs hallucinate in practice. Specifically, we show that LLMs augmented with a massive Mixture of Memory Experts (MoME) can easily memorize large datasets of random numbers. We corroborate these experimental findings with a theoretical construction showing that simple neural networks trained to predict the next token hallucinate when the training loss is above a threshold as it usually does in practice when training on internet scale data. We interpret our findings by comparing against traditional retrieval methods for mitigating hallucinations. We use our findings to design a first generation model for removing hallucinations -- Lamini-1 -- that stores facts in a massive mixture of millions of memory experts that are retrieved dynamically.
☆ BayTTA: Uncertainty-aware medical image classification with optimized test-time augmentation using Bayesian model averaging
Test-time augmentation (TTA) is a well-known technique employed during the testing phase of computer vision tasks. It involves aggregating multiple augmented versions of input data. Combining predictions using a simple average formulation is a common and straightforward approach after performing TTA. This paper introduces a novel framework for optimizing TTA, called BayTTA (Bayesian-based TTA), which is based on Bayesian Model Averaging (BMA). First, we generate a model list associated with different variations of the input data created through TTA. Then, we use BMA to combine model predictions weighted by their respective posterior probabilities. Such an approach allows one to take into account model uncertainty, and thus to enhance the predictive performance of the related machine learning or deep learning model. We evaluate the performance of BayTTA on various public data, including three medical image datasets comprising skin cancer, breast cancer, and chest X-ray images and two well-known gene editing datasets, CRISPOR and GUIDE-seq. Our experimental results indicate that BayTTA can be effectively integrated into state-of-the-art deep learning models used in medical image analysis as well as into some popular pre-trained CNN models such as VGG-16, MobileNetV2, DenseNet201, ResNet152V2, and InceptionRes-NetV2, leading to the enhancement in their accuracy and robustness performance.
☆ Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP
Contrastive Language--Image Pre-training (CLIP) has manifested remarkable improvements in zero-shot classification and cross-modal vision-language tasks. Yet, from a geometrical point of view, the CLIP embedding space has been found to have a pronounced modality gap. This gap renders the embedding space overly sparse and disconnected, with different modalities being densely distributed in distinct subregions of the hypersphere. In this work, we aim at answering two main questions: 1. Does sharing the parameter space between the multi-modal encoders reduce the modality gap? 2. Can the gap be mitigated by pushing apart the uni-modal embeddings via intra-modality separation? We design AlignCLIP, in order to answer these questions and show that answers to both questions are positive. Through extensive experiments, we show that AlignCLIP achieves noticeable enhancements in the cross-modal alignment of the embeddings, and thereby, reduces the modality gap, while maintaining the performance across several downstream evaluations, such as zero-shot image classification, zero-shot multi-modal retrieval and zero-shot semantic text similarity.
☆ Aligning Diffusion Models with Noise-Conditioned Perception
Recent advancements in human preference optimization, initially developed for Language Models (LMs), have shown promise for text-to-image Diffusion Models, enhancing prompt alignment, visual appeal, and user preference. Unlike LMs, Diffusion Models typically optimize in pixel or VAE space, which does not align well with human perception, leading to slower and less efficient training during the preference alignment stage. We propose using a perceptual objective in the U-Net embedding space of the diffusion model to address these issues. Our approach involves fine-tuning Stable Diffusion 1.5 and XL using Direct Preference Optimization (DPO), Contrastive Preference Optimization (CPO), and supervised fine-tuning (SFT) within this embedding space. This method significantly outperforms standard latent-space implementations across various metrics, including quality and computational cost. For SDXL, our approach provides 60.8\% general preference, 62.2\% visual appeal, and 52.1\% prompt following against original open-sourced SDXL-DPO on the PartiPrompts dataset, while significantly reducing compute. Our approach not only improves the efficiency and quality of human preference alignment for diffusion models but is also easily integrable with other optimization techniques. The training code and LoRA weights will be available here: https://huggingface.co/alexgambashidze/SDXL\_NCP-DPO\_v0.1
☆ KANQAS: Kolmogorov Arnold Network for Quantum Architecture Search
Quantum architecture search~(QAS) is a promising direction for optimization and automated design of quantum circuits towards quantum advantage. Recent techniques in QAS focus on machine learning-based approaches from reinforcement learning, like deep Q-network. While multi-layer perceptron-based deep Q-networks have been applied for QAS, their interpretability remains challenging due to the high number of parameters. In this work, we evaluate the practicality of KANs in quantum architecture search problems, analyzing their efficiency in terms of the probability of success, frequency of optimal solutions and their dependencies on various degrees of freedom of the network. In a noiseless scenario, the probability of success and the number of optimal quantum circuit configurations to generate the multi-qubit maximally entangled states are significantly higher than MLPs. Moreover in noisy scenarios, KAN can achieve a better fidelity in approximating maximally entangled state than MLPs, where the performance of the MLP significantly depends on the choice of activation function. Further investigation reveals that KAN requires a very small number of learnable parameters compared to MLPs, however, the average time of executing each episode for KAN is much higher.
comment: 10 pages and 4 figures
☆ CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference EMNLP 2024
As large language models (LLMs) constantly evolve, ensuring their safety remains a critical research problem. Previous red-teaming approaches for LLM safety have primarily focused on single prompt attacks or goal hijacking. To the best of our knowledge, we are the first to study LLM safety in multi-turn dialogue coreference. We created a dataset of 1,400 questions across 14 categories, each featuring multi-turn coreference safety attacks. We then conducted detailed evaluations on five widely used open-source LLMs. The results indicated that under multi-turn coreference safety attacks, the highest attack success rate was 56% with the LLaMA2-Chat-7b model, while the lowest was 13.9% with the Mistral-7B-Instruct model. These findings highlight the safety vulnerabilities in LLMs during dialogue coreference interactions.
comment: Submitted to EMNLP 2024
☆ Self-assessment, Exhibition, and Recognition: a Review of Personality in Large Language Models
As large language models (LLMs) appear to behave increasingly human-like in text-based interactions, more and more researchers become interested in investigating personality in LLMs. However, the diversity of psychological personality research and the rapid development of LLMs have led to a broad yet fragmented landscape of studies in this interdisciplinary field. Extensive studies across different research focuses, different personality psychometrics, and different LLMs make it challenging to have a holistic overview and further pose difficulties in applying findings to real-world applications. In this paper, we present a comprehensive review by categorizing current studies into three research problems: self-assessment, exhibition, and recognition, based on the intrinsic characteristics and external manifestations of personality in LLMs. For each problem, we provide a thorough analysis and conduct in-depth comparisons of their corresponding solutions. Besides, we summarize research findings and open challenges from current studies and further discuss their underlying causes. We also collect extensive publicly available resources to facilitate interested researchers and developers. Lastly, we discuss the potential future research directions and application scenarios. Our paper is the first comprehensive survey of up-to-date literature on personality in LLMs. By presenting a clear taxonomy, in-depth analysis, promising future directions, and extensive resource collections, we aim to provide a better understanding and facilitate further advancements in this emerging field.
☆ Aligning Programming Language and Natural Language: Exploring Design Choices in Multi-Modal Transformer-Based Embedding for Bug Localization
Bug localization refers to the identification of source code files which is in a programming language and also responsible for the unexpected behavior of software using the bug report, which is a natural language. As bug localization is labor-intensive, bug localization models are employed to assist software developers. Due to the domain difference between source code files and bug reports, modern bug-localization systems, based on deep learning models, rely heavily on embedding techniques that project bug reports and source code files into a shared vector space. The creation of an embedding involves several design choices, but the impact of these choices on the quality of embedding and the performance of bug localization models remains unexplained in current research. To address this gap, our study evaluated 14 distinct embedding models to gain insights into the effects of various design choices. Subsequently, we developed bug localization models utilizing these embedding models to assess the influence of these choices on the performance of the localization models. Our findings indicate that the pre-training strategies significantly affect the quality of the embedding. Moreover, we discovered that the familiarity of the embedding models with the data has a notable impact on the bug localization model's performance. Notably, when the training and testing data are collected from different projects, the performance of the bug localization models exhibits substantial fluctuations.
☆ Diffusion-based Adversarial Purification for Intrusion Detection
The escalating sophistication of cyberattacks has encouraged the integration of machine learning techniques in intrusion detection systems, but the rise of adversarial examples presents a significant challenge. These crafted perturbations mislead ML models, enabling attackers to evade detection or trigger false alerts. As a reaction, adversarial purification has emerged as a compelling solution, particularly with diffusion models showing promising results. However, their purification potential remains unexplored in the context of intrusion detection. This paper demonstrates the effectiveness of diffusion models in purifying adversarial examples in network intrusion detection. Through a comprehensive analysis of the diffusion parameters, we identify optimal configurations maximizing adversarial robustness with minimal impact on normal performance. Importantly, this study reveals insights into the relationship between diffusion noise and diffusion steps, representing a novel contribution to the field. Our experiments are carried out on two datasets and against 5 adversarial attacks. The implementation code is publicly available.
☆ Towards Compositional Interpretability for XAI
Artificial intelligence (AI) is currently based largely on black-box machine learning models which lack interpretability. The field of eXplainable AI (XAI) strives to address this major concern, being critical in high-stakes areas such as the finance, legal and health sectors. We present an approach to defining AI models and their interpretability based on category theory. For this we employ the notion of a compositional model, which sees a model in terms of formal string diagrams which capture its abstract structure together with its concrete implementation. This comprehensive view incorporates deterministic, probabilistic and quantum models. We compare a wide range of AI models as compositional models, including linear and rule-based models, (recurrent) neural networks, transformers, VAEs, and causal and DisCoCirc models. Next we give a definition of interpretation of a model in terms of its compositional structure, demonstrating how to analyse the interpretability of a model, and using this to clarify common themes in XAI. We find that what makes the standard 'intrinsically interpretable' models so transparent is brought out most clearly diagrammatically. This leads us to the more general notion of compositionally-interpretable (CI) models, which additionally include, for instance, causal, conceptual space, and DisCoCirc models. We next demonstrate the explainability benefits of CI models. Firstly, their compositional structure may allow the computation of other quantities of interest, and may facilitate inference from the model to the modelled phenomenon by matching its structure. Secondly, they allow for diagrammatic explanations for their behaviour, based on influence constraints, diagram surgery and rewrite explanations. Finally, we discuss many future directions for the approach, raising the question of how to learn such meaningfully structured models in practice.
☆ Leveraging Reinforcement Learning in Red Teaming for Advanced Ransomware Attack Simulations
Ransomware presents a significant and increasing threat to individuals and organizations by encrypting their systems and not releasing them until a large fee has been extracted. To bolster preparedness against potential attacks, organizations commonly conduct red teaming exercises, which involve simulated attacks to assess existing security measures. This paper proposes a novel approach utilizing reinforcement learning (RL) to simulate ransomware attacks. By training an RL agent in a simulated environment mirroring real-world networks, effective attack strategies can be learned quickly, significantly streamlining traditional, manual penetration testing processes. The attack pathways revealed by the RL agent can provide valuable insights to the defense team, helping them identify network weak points and develop more resilient defensive measures. Experimental results on a 152-host example network confirm the effectiveness of the proposed approach, demonstrating the RL agent's capability to discover and orchestrate attacks on high-value targets while evading honeyfiles (decoy files strategically placed to detect unauthorized access).
☆ Multi-property Steering of Large Language Models with Dynamic Activation Composition
Activation steering methods were shown to be effective in conditioning language model generation by additively intervening over models' intermediate representations. However, the evaluation of these techniques has so far been limited to single conditioning properties and synthetic settings. In this work, we conduct a comprehensive evaluation of various activation steering strategies, highlighting the property-dependent nature of optimal parameters to ensure a robust effect throughout generation. To address this issue, we propose Dynamic Activation Composition, an information-theoretic approach to modulate the steering intensity of one or more properties throughout generation. Our experiments on multi-property steering show that our method successfully maintains high conditioning while minimizing the impact of conditioning on generation fluency.
☆ CDQuant: Accurate Post-training Weight Quantization of Large Pre-trained Models using Greedy Coordinate Descent
Large language models (LLMs) have recently demonstrated remarkable performance across diverse language tasks. But their deployment is often constrained by their substantial computational and storage requirements. Quantization has emerged as a key technique for addressing this challenge, enabling the compression of large models with minimal impact on performance. The recent GPTQ algorithm, a post-training quantization (PTQ) method, has proven highly effective for compressing LLMs, sparking a wave of research that leverages GPTQ as a core component. Recognizing the pivotal role of GPTQ in the PTQ landscape, we introduce CDQuant, a simple and scalable alternative to GPTQ with improved performance. CDQuant uses coordinate descent to minimize the layer-wise reconstruction loss to achieve high-quality quantized weights. Our algorithm is easy to implement and scales efficiently to models with hundreds of billions of parameters. Through extensive evaluation on the PaLM2 model family, we demonstrate that CDQuant consistently outperforms GPTQ across diverse model sizes and quantization levels. In particular, for INT2 quantization of PaLM2-Otter, CDQuant achieves a 10% reduction in perplexity compared to GPTQ.
☆ SincVAE: a New Approach to Improve Anomaly Detection on EEG Data Using SincNet and Variational Autoencoder
Over the past few decades, electroencephalography (EEG) monitoring has become a pivotal tool for diagnosing neurological disorders, particularly for detecting seizures. Epilepsy, one of the most prevalent neurological diseases worldwide, affects approximately the 1 \% of the population. These patients face significant risks, underscoring the need for reliable, continuous seizure monitoring in daily life. Most of the techniques discussed in the literature rely on supervised Machine Learning (ML) methods. However, the challenge of accurately labeling variations in epileptic EEG waveforms complicates the use of these approaches. Additionally, the rarity of ictal events introduces an high imbalancing within the data, which could lead to poor prediction performance in supervised learning approaches. Instead, a semi-supervised approach allows to train the model only on data not containing seizures, thus avoiding the issues related to the data imbalancing. This work proposes a semi-supervised approach for detecting epileptic seizures from EEG data, utilizing a novel Deep Learning-based method called SincVAE. This proposal incorporates the learning of an ad-hoc array of bandpass filter as a first layer of a Variational Autoencoder (VAE), potentially eliminating the preprocessing stage where informative band frequencies are identified and isolated. Results indicate that SincVAE improves seizure detection in EEG data and is capable of identifying early seizures during the preictal stage as well as monitoring patients throughout the postictal stage.
☆ Disce aut Deficere: Evaluating LLMs Proficiency on the INVALSI Italian Benchmark
Recent advancements in Large Language Models (LLMs) have significantly enhanced their ability to generate and manipulate human language, highlighting their potential across various applications. Evaluating LLMs in languages other than English is crucial for ensuring their linguistic versatility, cultural relevance, and applicability in diverse global contexts, thus broadening their usability and effectiveness. We tackle this challenge by introducing a structured benchmark using the INVALSI tests, a set of well-established assessments designed to measure educational competencies across Italy. Our study makes three primary contributions: Firstly, we adapt the INVALSI benchmark for automated LLM evaluation, which involves rigorous adaptation of the test format to suit automated processing while retaining the essence of the original tests. Secondly, we provide a detailed assessment of current LLMs, offering a crucial reference point for the academic community. Finally, we visually compare the performance of these models against human results. Additionally, researchers are invited to submit their models for ongoing evaluation, ensuring the benchmark remains a current and valuable resource.
☆ Can Large Language Models Understand DL-Lite Ontologies? An Empirical Study
Large language models (LLMs) have shown significant achievements in solving a wide range of tasks. Recently, LLMs' capability to store, retrieve and infer with symbolic knowledge has drawn a great deal of attention, showing their potential to understand structured information. However, it is not yet known whether LLMs can understand Description Logic (DL) ontologies. In this work, we empirically analyze the LLMs' capability of understanding DL-Lite ontologies covering 6 representative tasks from syntactic and semantic aspects. With extensive experiments, we demonstrate both the effectiveness and limitations of LLMs in understanding DL-Lite ontologies. We find that LLMs can understand formal syntax and model-theoretic semantics of concepts and roles. However, LLMs struggle with understanding TBox NI transitivity and handling ontologies with large ABoxes. We hope that our experiments and analyses provide more insights into LLMs and inspire to build more faithful knowledge engineering solutions.
☆ Enhancing LLM-Based Human-Robot Interaction with Nuances for Diversity Awareness
This paper presents a system for diversity-aware autonomous conversation leveraging the capabilities of large language models (LLMs). The system adapts to diverse populations and individuals, considering factors like background, personality, age, gender, and culture. The conversation flow is guided by the structure of the system's pre-established knowledge base, while LLMs are tasked with various functions, including generating diversity-aware sentences. Achieving diversity-awareness involves providing carefully crafted prompts to the models, incorporating comprehensive information about users, conversation history, contextual details, and specific guidelines. To assess the system's performance, we conducted both controlled and real-world experiments, measuring a wide range of performance indicators.
comment: 8 pages, 6 figures, 7 tables. This paper has been accepted for publication at IEEE ROMAN 2024
☆ On the consistency of hyper-parameter selection in value-based deep reinforcement learning
Deep reinforcement learning (deep RL) has achieved tremendous success on various domains through a combination of algorithmic design and careful selection of hyper-parameters. Algorithmic improvements are often the result of iterative enhancements built upon prior approaches, while hyper-parameter choices are typically inherited from previous methods or fine-tuned specifically for the proposed technique. Despite their crucial impact on performance, hyper-parameter choices are frequently overshadowed by algorithmic advancements. This paper conducts an extensive empirical study focusing on the reliability of hyper-parameter selection for value-based deep reinforcement learning agents, including the introduction of a new score to quantify the consistency and reliability of various hyper-parameters. Our findings not only help establish which hyper-parameters are most critical to tune, but also help clarify which tunings remain consistent across different training regimes.
☆ Enhancing Explainability of Knowledge Learning Paths: Causal Knowledge Networks
A reliable knowledge structure is a prerequisite for building effective adaptive learning systems and intelligent tutoring systems. Pursuing an explainable and trustworthy knowledge structure, we propose a method for constructing causal knowledge networks. This approach leverages Bayesian networks as a foundation and incorporates causal relationship analysis to derive a causal network. Additionally, we introduce a dependable knowledge-learning path recommendationHuman-Centric eXplainable AI in Education technique built upon this framework, improving teaching and learning quality while maintaining transparency in the decision-making process.
comment: 8 pages, 3 figures, Educational Data Mining 2024, Human-Centric eXplainable AI in Education
☆ Preserving Node Distinctness in Graph Autoencoders via Similarity Distillation
Graph autoencoders (GAEs), as a kind of generative self-supervised learning approach, have shown great potential in recent years. GAEs typically rely on distance-based criteria, such as mean-square-error (MSE), to reconstruct the input graph. However, relying solely on a single reconstruction criterion may lead to a loss of distinctiveness in the reconstructed graph, causing nodes to collapse into similar representations and resulting in sub-optimal performance. To address this issue, we have developed a simple yet effective strategy to preserve the necessary distinctness in the reconstructed graph. Inspired by the knowledge distillation technique, we found that the dual encoder-decoder architecture of GAEs can be viewed as a teacher-student relationship. Therefore, we propose transferring the knowledge of distinctness from the raw graph to the reconstructed graph, achieved through a simple KL constraint. Specifically, we compute pairwise node similarity scores in the raw graph and reconstructed graph. During the training process, the KL constraint is optimized alongside the reconstruction criterion. We conducted extensive experiments across three types of graph tasks, demonstrating the effectiveness and generality of our strategy. This indicates that the proposed approach can be employed as a plug-and-play method to avoid vague reconstructions and enhance overall performance.
☆ Benchmarking Mental State Representations in Language Models ICML 2024
While numerous works have assessed the generative performance of language models (LMs) on tasks requiring Theory of Mind reasoning, research into the models' internal representation of mental states remains limited. Recent work has used probing to demonstrate that LMs can represent beliefs of themselves and others. However, these claims are accompanied by limited evaluation, making it difficult to assess how mental state representations are affected by model design and training choices. We report an extensive benchmark with various LM types with different model sizes, fine-tuning approaches, and prompt designs to study the robustness of mental state representations and memorisation issues within the probes. Our results show that the quality of models' internal representations of the beliefs of others increases with model size and, more crucially, with fine-tuning. We are the first to study how prompt variations impact probing performance on theory of mind tasks. We demonstrate that models' representations are sensitive to prompt variations, even when such variations should be beneficial. Finally, we complement previous activation editing experiments on Theory of Mind tasks and show that it is possible to improve models' reasoning performance by steering their activations without the need to train any probe.
comment: ICML 2024 Workshop on Mechanistic Interpretability
Transformer-based Named Entity Recognition with Combined Data Representation
This study examines transformer-based models and their effectiveness in named entity recognition tasks. The study investigates data representation strategies, including single, merged, and context, which respectively use one sentence, multiple sentences, and sentences joined with attention to context per vector. Analysis shows that training models with a single strategy may lead to poor performance on different data representations. To address this limitation, the study proposes a combined training procedure that utilizes all three strategies to improve model stability and adaptability. The results of this approach are presented and discussed for four languages (English, Polish, Czech, and German) across various datasets, demonstrating the effectiveness of the combined strategy.
comment: 14 pages, 6 figures
☆ TSynD: Targeted Synthetic Data Generation for Enhanced Medical Image Classification
The usage of medical image data for the training of large-scale machine learning approaches is particularly challenging due to its scarce availability and the costly generation of data annotations, typically requiring the engagement of medical professionals. The rapid development of generative models allows towards tackling this problem by leveraging large amounts of realistic synthetically generated data for the training process. However, randomly choosing synthetic samples, might not be an optimal strategy. In this work, we investigate the targeted generation of synthetic training data, in order to improve the accuracy and robustness of image classification. Therefore, our approach aims to guide the generative model to synthesize data with high epistemic uncertainty, since large measures of epistemic uncertainty indicate underrepresented data points in the training set. During the image generation we feed images reconstructed by an auto encoder into the classifier and compute the mutual information over the class-probability distribution as a measure for uncertainty.We alter the feature space of the autoencoder through an optimization process with the objective of maximizing the classifier uncertainty on the decoded image. By training on such data we improve the performance and robustness against test time data augmentations and adversarial attacks on several classifications tasks.
☆ Dynamic Scheduling for Vehicle-to-Vehicle Communications Enhanced Federated Learning
Leveraging the computing and sensing capabilities of vehicles, vehicular federated learning (VFL) has been applied to edge training for connected vehicles. The dynamic and interconnected nature of vehicular networks presents unique opportunities to harness direct vehicle-to-vehicle (V2V) communications, enhancing VFL training efficiency. In this paper, we formulate a stochastic optimization problem to optimize the VFL training performance, considering the energy constraints and mobility of vehicles, and propose a V2V-enhanced dynamic scheduling (VEDS) algorithm to solve it. The model aggregation requirements of VFL and the limited transmission time due to mobility result in a stepwise objective function, which presents challenges in solving the problem. We thus propose a derivative-based drift-plus-penalty method to convert the long-term stochastic optimization problem to an online mixed integer nonlinear programming (MINLP) problem, and provide a theoretical analysis to bound the performance gap between the online solution and the offline optimal solution. Further analysis of the scheduling priority reduces the original problem into a set of convex optimization problems, which are efficiently solved using the interior-point method. Experimental results demonstrate that compared with the state-of-the-art benchmarks, the proposed algorithm enhances the image classification accuracy on the CIFAR-10 dataset by 3.18% and reduces the average displacement errors on the Argoverse trajectory prediction dataset by 10.21%.
comment: Submitted to IEEE for possible publication
☆ Enhancing Tool Retrieval with Iterative Feedback from Large Language Models
Tool learning aims to enhance and expand large language models' (LLMs) capabilities with external tools, which has gained significant attention recently. Current methods have shown that LLMs can effectively handle a certain amount of tools through in-context learning or fine-tuning. However, in real-world scenarios, the number of tools is typically extensive and irregularly updated, emphasizing the necessity for a dedicated tool retrieval component. Tool retrieval is nontrivial due to the following challenges: 1) complex user instructions and tool descriptions; 2) misalignment between tool retrieval and tool usage models. To address the above issues, we propose to enhance tool retrieval with iterative feedback from the large language model. Specifically, we prompt the tool usage model, i.e., the LLM, to provide feedback for the tool retriever model in multi-round, which could progressively improve the tool retriever's understanding of instructions and tools and reduce the gap between the two standalone components. We build a unified and comprehensive benchmark to evaluate tool retrieval models. The extensive experiments indicate that our proposed approach achieves advanced performance in both in-domain evaluation and out-of-domain evaluation.
☆ The Tree of Diffusion Life: Evolutionary Embeddings to Understand the Generation Process of Diffusion Models
Diffusion models generate high-quality samples by corrupting data with Gaussian noise and iteratively reconstructing it with deep learning, slowly transforming noisy images into refined outputs. Understanding this data evolution is important for interpretability but is complex due to its high-dimensional evolutionary nature. While traditional dimensionality reduction methods like t-distributed stochastic neighborhood embedding (t-SNE) aid in understanding high-dimensional spaces, they neglect evolutionary structure preservation. Hence, we propose Tree of Diffusion Life (TDL), a method to understand data evolution in the generative process of diffusion models. TDL samples a diffusion model's generative space via instances with varying prompts and employs image encoders to extract semantic meaning from these samples, projecting them to an intermediate space. It employs a novel evolutionary embedding algorithm that explicitly encodes the iterations while preserving the high-dimensional relations, facilitating the visualization of data evolution. This embedding leverages three metrics: a standard t-SNE loss to group semantically similar elements, a displacement loss to group elements from the same iteration step, and an instance alignment loss to align elements of the same instance across iterations. We present rectilinear and radial layouts to represent iterations, enabling comprehensive exploration. We assess various feature extractors and highlight TDL's potential with prominent diffusion models like GLIDE and Stable Diffusion with different prompt sets. TDL simplifies understanding data evolution within diffusion models, offering valuable insights into their functioning.
☆ Improving Grammatical Error Correction via Contextual Data Augmentation ACL 2024
Nowadays, data augmentation through synthetic data has been widely used in the field of Grammatical Error Correction (GEC) to alleviate the problem of data scarcity. However, these synthetic data are mainly used in the pre-training phase rather than the data-limited fine-tuning phase due to inconsistent error distribution and noisy labels. In this paper, we propose a synthetic data construction method based on contextual augmentation, which can ensure an efficient augmentation of the original data with a more consistent error distribution. Specifically, we combine rule-based substitution with model-based generation, using the generative model to generate a richer context for the extracted error patterns. Besides, we also propose a relabeling-based data cleaning method to mitigate the effects of noisy labels in synthetic data. Experiments on CoNLL14 and BEA19-Test show that our proposed augmentation method consistently and substantially outperforms strong baselines and achieves the state-of-the-art level with only a few synthetic data.
comment: Accepted as Findings of ACL 2024
☆ Pseudo Labelling for Enhanced Masked Autoencoders
Masked Image Modeling (MIM)-based models, such as SdAE, CAE, GreenMIM, and MixAE, have explored different strategies to enhance the performance of Masked Autoencoders (MAE) by modifying prediction, loss functions, or incorporating additional architectural components. In this paper, we propose an enhanced approach that boosts MAE performance by integrating pseudo labelling for both class and data tokens, alongside replacing the traditional pixel-level reconstruction with token-level reconstruction. This strategy uses cluster assignments as pseudo labels to promote instance-level discrimination within the network, while token reconstruction requires generation of discrete tokens encapturing local context. The targets for pseudo labelling and reconstruction needs to be generated by a teacher network. To disentangle the generation of target pseudo labels and the reconstruction of the token features, we decouple the teacher into two distinct models, where one serves as a labelling teacher and the other as a reconstruction teacher. This separation proves empirically superior to a single teacher, while having negligible impact on throughput and memory consumption. Incorporating pseudo-labelling as an auxiliary task has demonstrated notable improvements in ImageNet-1K and other downstream tasks, including classification, semantic segmentation, and detection.
☆ CuDA2: An approach for Incorporating Traitor Agents into Cooperative Multi-Agent Systems
Cooperative Multi-Agent Reinforcement Learning (CMARL) strategies are well known to be vulnerable to adversarial perturbations. Previous works on adversarial attacks have primarily focused on white-box attacks that directly perturb the states or actions of victim agents, often in scenarios with a limited number of attacks. However, gaining complete access to victim agents in real-world environments is exceedingly difficult. To create more realistic adversarial attacks, we introduce a novel method that involves injecting traitor agents into the CMARL system. We model this problem as a Traitor Markov Decision Process (TMDP), where traitors cannot directly attack the victim agents but can influence their formation or positioning through collisions. In TMDP, traitors are trained using the same MARL algorithm as the victim agents, with their reward function set as the negative of the victim agents' reward. Despite this, the training efficiency for traitors remains low because it is challenging for them to directly associate their actions with the victim agents' rewards. To address this issue, we propose the Curiosity-Driven Adversarial Attack (CuDA2) framework. CuDA2 enhances the efficiency and aggressiveness of attacks on the specified victim agents' policies while maintaining the optimal policy invariance of the traitors. Specifically, we employ a pre-trained Random Network Distillation (RND) module, where the extra reward generated by the RND module encourages traitors to explore states unencountered by the victim agents. Extensive experiments on various scenarios from SMAC demonstrate that our CuDA2 framework offers comparable or superior adversarial attack capabilities compared to other baselines.
☆ Leave No Document Behind: Benchmarking Long-Context LLMs with Extended Multi-Doc QA
Long-context modeling capabilities have garnered widespread attention, leading to the emergence of Large Language Models (LLMs) with ultra-context windows. Meanwhile, benchmarks for evaluating long-context LLMs are gradually catching up. However, existing benchmarks employ irrelevant noise texts to artificially extend the length of test cases, diverging from the real-world scenarios of long-context applications. To bridge this gap, we propose a novel long-context benchmark, Loong, aligning with realistic scenarios through extended multi-document question answering (QA). Unlike typical document QA, in Loong's test cases, each document is relevant to the final answer, ignoring any document will lead to the failure of the answer. Furthermore, Loong introduces four types of tasks with a range of context lengths: Spotlight Locating, Comparison, Clustering, and Chain of Reasoning, to facilitate a more realistic and comprehensive evaluation of long-context understanding. Extensive experiments indicate that existing long-context language models still exhibit considerable potential for enhancement. Retrieval augmented generation (RAG) achieves poor performance, demonstrating that Loong can reliably assess the model's long-context modeling capabilities.
comment: We release our code and data publicly at https://github.com/MozerWang/Loong
☆ SE-VGAE: Unsupervised Disentangled Representation Learning for Interpretable Architectural Layout Design Graph Generation
Despite the suitability of graphs for capturing the relational structures inherent in architectural layout designs, there is a notable dearth of research on interpreting architectural design space using graph-based representation learning and exploring architectural design graph generation. Concurrently, disentangled representation learning in graph generation faces challenges such as node permutation invariance and representation expressiveness. To address these challenges, we introduce an unsupervised disentangled representation learning framework, Style-based Edge-augmented Variational Graph Auto-Encoder (SE-VGAE), aiming to generate architectural layout in the form of attributed adjacency multi-graphs while prioritizing representation disentanglement. The framework is designed with three alternative pipelines, each integrating a transformer-based edge-augmented encoder, a latent space disentanglement module, and a style-based decoder. These components collectively facilitate the decomposition of latent factors influencing architectural layout graph generation, enhancing generation fidelity and diversity. We also provide insights into optimizing the framework by systematically exploring graph feature augmentation schemes and evaluating their effectiveness for disentangling architectural layout representation through extensive experiments. Additionally, we contribute a new benchmark large-scale architectural layout graph dataset extracted from real-world floor plan images to facilitate the exploration of graph data-based architectural design representation space interpretation. This study pioneered disentangled representation learning for the architectural layout graph generation. The code and dataset of this study will be open-sourced.
☆ Variable Layer-Wise Quantization: A Simple and Effective Approach to Quantize LLMs EMNLP
We present a simple variable quantization approach that quantizes different layers of a large language model (LLM) at different bit levels. Specifically, we quantize the most important layers to higher bit precision and less important layers to lower bits to achieve floating point quantization levels. We propose two effective strategies to measure the importance of layers within LLMs: the first measures the importance of a layer based on how different its output embeddings are from the input embeddings (the higher the better); the second estimates the importance of a layer using the number of layer weights that are much larger than average (the smaller the better). We show that quantizing different layers at varying bits according to our importance scores results in minimal performance drop with a far more compressed model size. Finally, we present several practical key takeaways from our variable layer-wise quantization experiments: (a) LLM performance under variable quantization remains close to the original model until 25-50% of layers are moved in lower quantization using our proposed ordering but only until 5-10% if moved using no specific ordering; (b) Quantizing LLMs to lower bits performs substantially better than pruning unless extreme quantization (2-bit) is used; and (c) Layer-wise quantization to lower bits works better in the case of larger LLMs with more layers compared to smaller LLMs with fewer layers. The code used to run the experiments is available at: https://github.com/RazvanDu/LayerwiseQuant.
comment: submitted to EMNLP, 15 pages, 10 figures, 4 tables
☆ Double Momentum Method for Lower-Level Constrained Bilevel Optimization
Bilevel optimization (BO) has recently gained prominence in many machine learning applications due to its ability to capture the nested structure inherent in these problems. Recently, many hypergradient methods have been proposed as effective solutions for solving large-scale problems. However, current hypergradient methods for the lower-level constrained bilevel optimization (LCBO) problems need very restrictive assumptions, namely, where optimality conditions satisfy the differentiability and invertibility conditions and lack a solid analysis of the convergence rate. What's worse, existing methods require either double-loop updates, which are sometimes less efficient. To solve this problem, in this paper, we propose a new hypergradient of LCBO leveraging the theory of nonsmooth implicit function theorem instead of using the restrive assumptions. In addition, we propose a \textit{single-loop single-timescale} algorithm based on the double-momentum method and adaptive step size method and prove it can return a $(\delta, \epsilon)$-stationary point with $\tilde{\mathcal{O}}(d_2^2\epsilon^{-4})$ iterations. Experiments on two applications demonstrate the effectiveness of our proposed method.
comment: 27pages, 9 figures
☆ Temporal-Channel Modeling in Multi-head Self-Attention for Synthetic Speech Detection INTERSPEECH 2024
Recent synthetic speech detectors leveraging the Transformer model have superior performance compared to the convolutional neural network counterparts. This improvement could be due to the powerful modeling ability of the multi-head self-attention (MHSA) in the Transformer model, which learns the temporal relationship of each input token. However, artifacts of synthetic speech can be located in specific regions of both frequency channels and temporal segments, while MHSA neglects this temporal-channel dependency of the input sequence. In this work, we proposed a Temporal-Channel Modeling (TCM) module to enhance MHSA's capability for capturing temporal-channel dependencies. Experimental results on the ASVspoof 2021 show that with only 0.03M additional parameters, the TCM module can outperform the state-of-the-art system by 9.25% in EER. Further ablation study reveals that utilizing both temporal and channel information yields the most improvement for detecting synthetic speech.
comment: Accepted by INTERSPEECH 2024
☆ Q-DiT: Accurate Post-Training Quantization for Diffusion Transformers
Recent advancements in diffusion models, particularly the trend of architectural transformation from UNet-based Diffusion to Diffusion Transformer (DiT), have significantly improved the quality and scalability of image synthesis. Despite the incredible generative quality, the large computational requirements of these large-scale models significantly hinder the deployments in real-world scenarios. Post-training Quantization (PTQ) offers a promising solution by compressing model sizes and speeding up inference for the pretrained models while eliminating model retraining. However, we have observed the existing PTQ frameworks exclusively designed for both ViT and conventional Diffusion models fall into biased quantization and result in remarkable performance degradation. In this paper, we find that the DiTs typically exhibit considerable variance in terms of both weight and activation, which easily runs out of the limited numerical representations. To address this issue, we devise Q-DiT, which seamlessly integrates three techniques: fine-grained quantization to manage substantial variance across input channels of weights and activations, an automatic search strategy to optimize the quantization granularity and mitigate redundancies, and dynamic activation quantization to capture the activation changes across timesteps. Extensive experiments on the ImageNet dataset demonstrate the effectiveness of the proposed Q-DiT. Specifically, when quantizing DiT-XL/2 to W8A8 on ImageNet 256x256, Q-DiT achieves a remarkable reduction in FID by 1.26 compared to the baseline. Under a W4A8 setting, it maintains high fidelity in image generation, showcasing only a marginal increase in FID and setting a new benchmark for efficient, high-quality quantization in diffusion transformers. Code is available at \href{https://github.com/Juanerx/Q-DiT}{https://github.com/Juanerx/Q-DiT}.
☆ Masked Generative Extractor for Synergistic Representation and 3D Generation of Point Clouds
In the field of 2D image generation modeling and representation learning, Masked Generative Encoder (MAGE) has demonstrated the synergistic potential between generative modeling and representation learning. Inspired by this, we propose Point-MAGE to extend this concept to point cloud data. Specifically, this framework first utilizes a Vector Quantized Variational Autoencoder (VQVAE) to reconstruct a neural field representation of 3D shapes, thereby learning discrete semantic features of point patches. Subsequently, by combining the masking model with variable masking ratios, we achieve synchronous training for both generation and representation learning. Furthermore, our framework seamlessly integrates with existing point cloud self-supervised learning (SSL) models, thereby enhancing their performance. We extensively evaluate the representation learning and generation capabilities of Point-MAGE. In shape classification tasks, Point-MAGE achieved an accuracy of 94.2% on the ModelNet40 dataset and 92.9% (+1.3%) on the ScanObjectNN dataset. Additionally, it achieved new state-of-the-art performance in few-shot learning and part segmentation tasks. Experimental results also confirmed that Point-MAGE can generate detailed and high-quality 3D shapes in both unconditional and conditional settings.
☆ Joint Admission Control and Resource Allocation of Virtual Network Embedding via Hierarchical Deep Reinforcement Learning SC
As an essential resource management problem in network virtualization, virtual network embedding (VNE) aims to allocate the finite resources of physical network to sequentially arriving virtual network requests (VNRs) with different resource demands. Since this is an NP-hard combinatorial optimization problem, many efforts have been made to provide viable solutions. However, most existing approaches have either ignored the admission control of VNRs, which has a potential impact on long-term performances, or not fully exploited the temporal and topological features of the physical network and VNRs. In this paper, we propose a deep Hierarchical Reinforcement Learning approach to learn a joint Admission Control and Resource Allocation policy for VNE, named HRL-ACRA. Specifically, the whole VNE process is decomposed into an upper-level policy for deciding whether to admit the arriving VNR or not and a lower-level policy for allocating resources of the physical network to meet the requirement of VNR through the HRL approach. Considering the proximal policy optimization as the basic training algorithm, we also adopt the average reward method to address the infinite horizon problem of the upper-level agent and design a customized multi-objective intrinsic reward to alleviate the sparse reward issue of the lower-level agent. Moreover, we develop a deep feature-aware graph neural network to capture the features of VNR and physical network and exploit a sequence-to-sequence model to generate embedding actions iteratively. Finally, extensive experiments are conducted in various settings, and show that HRL-ACRA outperforms state-of-the-art baselines in terms of both the acceptance ratio and long-term average revenue. Our code is available at \url{https://github.com/GeminiLight/hrl-acra}.
comment: Accepted by IEEE Transactions on Services Computing (TSC)
☆ Dual-Space Knowledge Distillation for Large Language Models
Knowledge distillation (KD) is known as a promising solution to compress large language models (LLMs) via transferring their knowledge to smaller models. During this process, white-box KD methods usually minimize the distance between the output distributions of the two models so that more knowledge can be transferred. However, in the current white-box KD framework, the output distributions are from the respective output spaces of the two models, using their own prediction heads. We argue that the space discrepancy will lead to low similarity between the teacher model and the student model on both representation and distribution levels. Furthermore, this discrepancy also hinders the KD process between models with different vocabularies, which is common for current LLMs. To address these issues, we propose a dual-space knowledge distillation (DSKD) framework that unifies the output spaces of the two models for KD. On the basis of DSKD, we further develop a cross-model attention mechanism, which can automatically align the representations of the two models with different vocabularies. Thus, our framework is not only compatible with various distance functions for KD (e.g., KL divergence) like the current framework, but also supports KD between any two LLMs regardless of their vocabularies. Experiments on task-agnostic instruction-following benchmarks show that DSKD significantly outperforms the current white-box KD framework with various distance functions, and also surpasses existing KD methods for LLMs with different vocabularies.
comment: 17 pages, 11 figures, code available at: https://github.com/songmzhang/DSKD
☆ The State-Action-Reward-State-Action Algorithm in Spatial Prisoner's Dilemma Game
Cooperative behavior is prevalent in both human society and nature. Understanding the emergence and maintenance of cooperation among self-interested individuals remains a significant challenge in evolutionary biology and social sciences. Reinforcement learning (RL) provides a suitable framework for studying evolutionary game theory as it can adapt to environmental changes and maximize expected benefits. In this study, we employ the State-Action-Reward-State-Action (SARSA) algorithm as the decision-making mechanism for individuals in evolutionary game theory. Initially, we apply SARSA to imitation learning, where agents select neighbors to imitate based on rewards. This approach allows us to observe behavioral changes in agents without independent decision-making abilities. Subsequently, SARSA is utilized for primary agents to independently choose cooperation or betrayal with their neighbors. We evaluate the impact of SARSA on cooperation rates by analyzing variations in rewards and the distribution of cooperators and defectors within the network.
☆ ALPBench: A Benchmark for Active Learning Pipelines on Tabular Data
In settings where only a budgeted amount of labeled data can be afforded, active learning seeks to devise query strategies for selecting the most informative data points to be labeled, aiming to enhance learning algorithms' efficiency and performance. Numerous such query strategies have been proposed and compared in the active learning literature. However, the community still lacks standardized benchmarks for comparing the performance of different query strategies. This particularly holds for the combination of query strategies with different learning algorithms into active learning pipelines and examining the impact of the learning algorithm choice. To close this gap, we propose ALPBench, which facilitates the specification, execution, and performance monitoring of active learning pipelines. It has built-in measures to ensure evaluations are done reproducibly, saving exact dataset splits and hyperparameter settings of used algorithms. In total, ALPBench consists of 86 real-world tabular classification datasets and 5 active learning settings, yielding 430 active learning problems. To demonstrate its usefulness and broad compatibility with various learning algorithms and query strategies, we conduct an exemplary study evaluating 9 query strategies paired with 8 learning algorithms in 2 different settings. We provide ALPBench here: https://github.com/ValentinMargraf/ActiveLearningPipelines.
☆ Towards Open-set Camera 3D Object Detection
Traditional camera 3D object detectors are typically trained to recognize a predefined set of known object classes. In real-world scenarios, these detectors may encounter unknown objects outside the training categories and fail to identify them correctly. To address this gap, we present OS-Det3D (Open-set Camera 3D Object Detection), a two-stage training framework enhancing the ability of camera 3D detectors to identify both known and unknown objects. The framework involves our proposed 3D Object Discovery Network (ODN3D), which is specifically trained using geometric cues such as the location and scale of 3D boxes to discover general 3D objects. ODN3D is trained in a class-agnostic manner, and the provided 3D object region proposals inherently come with data noise. To boost accuracy in identifying unknown objects, we introduce a Joint Objectness Selection (JOS) module. JOS selects the pseudo ground truth for unknown objects from the 3D object region proposals of ODN3D by combining the ODN3D objectness and camera feature attention objectness. Experiments on the nuScenes and KITTI datasets demonstrate the effectiveness of our framework in enabling camera 3D detectors to successfully identify unknown objects while also improving their performance on known objects.
☆ Hyperbolic Knowledge Transfer in Cross-Domain Recommendation System
Cross-Domain Recommendation (CDR) seeks to utilize knowledge from different domains to alleviate the problem of data sparsity in the target recommendation domain, and it has been gaining more attention in recent years. Although there have been notable advancements in this area, most current methods represent users and items in Euclidean space, which is not ideal for handling long-tail distributed data in recommendation systems. Additionally, adding data from other domains can worsen the long-tail characteristics of the entire dataset, making it harder to train CDR models effectively. Recent studies have shown that hyperbolic methods are particularly suitable for modeling long-tail distributions, which has led us to explore hyperbolic representations for users and items in CDR scenarios. However, due to the distinct characteristics of the different domains, applying hyperbolic representation learning to CDR tasks is quite challenging. In this paper, we introduce a new framework called Hyperbolic Contrastive Learning (HCTS), designed to capture the unique features of each domain while enabling efficient knowledge transfer between domains. We achieve this by embedding users and items from each domain separately and mapping them onto distinct hyperbolic manifolds with adjustable curvatures for prediction. To improve the representations of users and items in the target domain, we develop a hyperbolic contrastive learning module for knowledge transfer. Extensive experiments on real-world datasets demonstrate that hyperbolic manifolds are a promising alternative to Euclidean space for CDR tasks.
☆ Predicting the Big Five Personality Traits in Chinese Counselling Dialogues Using Large Language Models
Accurate assessment of personality traits is crucial for effective psycho-counseling, yet traditional methods like self-report questionnaires are time-consuming and biased. This study exams whether Large Language Models (LLMs) can predict the Big Five personality traits directly from counseling dialogues and introduces an innovative framework to perform the task. Our framework applies role-play and questionnaire-based prompting to condition LLMs on counseling sessions, simulating client responses to the Big Five Inventory. We evaluated our framework on 853 real-world counseling sessions, finding a significant correlation between LLM-predicted and actual Big Five traits, proving the validity of framework. Moreover, ablation studies highlight the importance of role-play simulations and task simplification via questionnaires in enhancing prediction accuracy. Meanwhile, our fine-tuned Llama3-8B model, utilizing Direct Preference Optimization with Supervised Fine-Tuning, achieves a 130.95\% improvement, surpassing the state-of-the-art Qwen1.5-110B by 36.94\% in personality prediction validity. In conclusion, LLMs can predict personality based on counseling dialogues. Our code and model are publicly available at \url{https://github.com/kuri-leo/BigFive-LLM-Predictor}, providing a valuable tool for future research in computational psychometrics.
☆ EON-1: A Brain-Inspired Processor for Near-Sensor Extreme Edge Online Feature Extraction
For Edge AI applications, deploying online learning and adaptation on resource-constrained embedded devices can deal with fast sensor-generated streams of data in changing environments. However, since maintaining low-latency and power-efficient inference is paramount at the Edge, online learning and adaptation on the device should impose minimal additional overhead for inference. With this goal in mind, we explore energy-efficient learning and adaptation on-device for streaming-data Edge AI applications using Spiking Neural Networks (SNNs), which follow the principles of brain-inspired computing, such as high-parallelism, neuron co-located memory and compute, and event-driven processing. We propose EON-1, a brain-inspired processor for near-sensor extreme edge online feature extraction, that integrates a fast online learning and adaptation algorithm. We report results of only 1% energy overhead for learning, by far the lowest overhead when compared to other SoTA solutions, while attaining comparable inference accuracy. Furthermore, we demonstrate that EON-1 is up for the challenge of low-latency processing of HD and UHD streaming video in real-time, with learning enabled.
☆ Learning Decentralized Multi-Biped Control for Payload Transport
Payload transport over flat terrain via multi-wheel robot carriers is well-understood, highly effective, and configurable. In this paper, our goal is to provide similar effectiveness and configurability for transport over rough terrain that is more suitable for legs rather than wheels. For this purpose, we consider multi-biped robot carriers, where wheels are replaced by multiple bipedal robots attached to the carrier. Our main contribution is to design a decentralized controller for such systems that can be effectively applied to varying numbers and configurations of rigidly attached bipedal robots without retraining. We present a reinforcement learning approach for training the controller in simulation that supports transfer to the real world. Our experiments in simulation provide quantitative metrics showing the effectiveness of the approach over a wide variety of simulated transport scenarios. In addition, we demonstrate the controller in the real-world for systems composed of two and three Cassie robots. To our knowledge, this is the first example of a scalable multi-biped payload transport system.
comment: Submitted to CoRL 2024, Project website: decmbc.github.io
☆ AG-LSEC: Audio Grounded Lexical Speaker Error Correction INTERSPEECH 2024
Speaker Diarization (SD) systems are typically audio-based and operate independently of the ASR system in traditional speech transcription pipelines and can have speaker errors due to SD and/or ASR reconciliation, especially around speaker turns and regions of speech overlap. To reduce these errors, a Lexical Speaker Error Correction (LSEC), in which an external language model provides lexical information to correct the speaker errors, was recently proposed. Though the approach achieves good Word Diarization error rate (WDER) improvements, it does not use any additional acoustic information and is prone to miscorrections. In this paper, we propose to enhance and acoustically ground the LSEC system with speaker scores directly derived from the existing SD pipeline. This approach achieves significant relative WDER reductions in the range of 25-40% over the audio-based SD, ASR system and beats the LSEC system by 15-25% relative on RT03-CTS, Callhome American English and Fisher datasets.
comment: Accepted at INTERSPEECH 2024
☆ Image-Guided Outdoor LiDAR Perception Quality Assessment for Autonomous Driving
LiDAR is one of the most crucial sensors for autonomous vehicle perception. However, current LiDAR-based point cloud perception algorithms lack comprehensive and rigorous LiDAR quality assessment methods, leading to uncertainty in detection performance. Additionally, existing point cloud quality assessment algorithms are predominantly designed for indoor environments or single-object scenarios. In this paper, we introduce a novel image-guided point cloud quality assessment algorithm for outdoor autonomous driving environments, named the Image-Guided Outdoor Point Cloud Quality Assessment (IGO-PQA) algorithm. Our proposed algorithm comprises two main components. The first component is the IGO-PQA generation algorithm, which leverages point cloud data, corresponding RGB surrounding view images, and agent objects' ground truth annotations to generate an overall quality score for a single-frame LiDAR-based point cloud. The second component is a transformer-based IGO-PQA regression algorithm for no-reference outdoor point cloud quality assessment. This regression algorithm allows for the direct prediction of IGO-PQA scores in an online manner, without requiring image data and object ground truth annotations. We evaluate our proposed algorithm using the nuScenes and Waymo open datasets. The IGO-PQA generation algorithm provides consistent and reasonable perception quality indices. Furthermore, our proposed IGO-PQA regression algorithm achieves a Pearson Linear Correlation Coefficient (PLCC) of 0.86 on the nuScenes dataset and 0.97 on the Waymo dataset.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
☆ TopoGCL: Topological Graph Contrastive Learning
Graph contrastive learning (GCL) has recently emerged as a new concept which allows for capitalizing on the strengths of graph neural networks (GNNs) to learn rich representations in a wide variety of applications which involve abundant unlabeled information. However, existing GCL approaches largely tend to overlook the important latent information on higher-order graph substructures. We address this limitation by introducing the concepts of topological invariance and extended persistence on graphs to GCL. In particular, we propose a new contrastive mode which targets topological representations of the two augmented views from the same graph, yielded by extracting latent shape properties of the graph at multiple resolutions. Along with the extended topological layer, we introduce a new extended persistence summary, namely, extended persistence landscapes (EPL) and derive its theoretical stability guarantees. Our extensive numerical results on biological, chemical, and social interaction graphs show that the new Topological Graph Contrastive Learning (TopoGCL) model delivers significant performance gains in unsupervised graph classification for 11 out of 12 considered datasets and also exhibits robustness under noisy scenarios.
☆ Beyond Silence: Bias Analysis through Loss and Asymmetric Approach in Audio Anti-Spoofing
Current trends in audio anti-spoofing detection research strive to improve models' ability to generalize across unseen attacks by learning to identify a variety of spoofing artifacts. This emphasis has primarily focused on the spoof class. Recently, several studies have noted that the distribution of silence differs between the two classes, which can serve as a shortcut. In this paper, we extend class-wise interpretations beyond silence. We employ loss analysis and asymmetric methodologies to move away from traditional attack-focused and result-oriented evaluations towards a deeper examination of model behaviors. Our investigations highlight the significant differences in training dynamics between the two classes, emphasizing the need for future research to focus on robust modeling of the bonafide class.
comment: 5 pages, 1 figure, 5 tables
☆ Unlocking Continual Learning Abilities in Language Models
Language models (LMs) exhibit impressive performance and generalization capabilities. However, LMs struggle with the persistent challenge of catastrophic forgetting, which undermines their long-term sustainability in continual learning (CL). Existing approaches usually address the issue by incorporating old task data or task-wise inductive bias into LMs. However, old data and accurate task information are often unavailable or costly to collect, hindering the availability of current CL approaches for LMs. To address this limitation, we introduce $\textbf{MIGU}$ ($\textbf{M}$agn$\textbf{I}$tude-based $\textbf{G}$radient $\textbf{U}$pdating for continual learning), a rehearsal-free and task-label-free method that only updates the model parameters with large magnitudes of output in LMs' linear layers. MIGU is based on our observation that the L1-normalized magnitude distribution of the output in LMs' linear layers is different when the LM models deal with different task data. By imposing this simple constraint on the gradient update process, we can leverage the inherent behaviors of LMs, thereby unlocking their innate CL abilities. Our experiments demonstrate that MIGU is universally applicable to all three LM architectures (T5, RoBERTa, and Llama2), delivering state-of-the-art or on-par performance across continual finetuning and continual pre-training settings on four CL benchmarks. For example, MIGU brings a 15.2% average accuracy improvement over conventional parameter-efficient finetuning baselines in a 15-task CL benchmark. MIGU can also seamlessly integrate with all three existing CL types to further enhance performance. Code is available at \href{https://github.com/wenyudu/MIGU}{this https URL}.
comment: preprint, 19 pages
☆ Task-Agnostic Federated Learning
In the realm of medical imaging, leveraging large-scale datasets from various institutions is crucial for developing precise deep learning models, yet privacy concerns frequently impede data sharing. federated learning (FL) emerges as a prominent solution for preserving privacy while facilitating collaborative learning. However, its application in real-world scenarios faces several obstacles, such as task & data heterogeneity, label scarcity, non-identically distributed (non-IID) data, computational vaiation, etc. In real-world, medical institutions may not want to disclose their tasks to FL server and generalization challenge of out-of-network institutions with un-seen task want to join the on-going federated system. This study address task-agnostic and generalization problem on un-seen tasks by adapting self-supervised FL framework. Utilizing Vision Transformer (ViT) as consensus feature encoder for self-supervised pre-training, no initial labels required, the framework enabling effective representation learning across diverse datasets and tasks. Our extensive evaluations, using various real-world non-IID medical imaging datasets, validate our approach's efficacy, retaining 90\% of F1 accuracy with only 5\% of the training data typically required for centralized approaches and exhibiting superior adaptability to out-of-distribution task. The result indicate that federated learning architecture can be a potential approach toward multi-task foundation modeling.
☆ Large Language Models are Interpretable Learners
The trade-off between expressiveness and interpretability remains a core challenge when building human-centric predictive models for classification and decision-making. While symbolic rules offer interpretability, they often lack expressiveness, whereas neural networks excel in performance but are known for being black boxes. In this paper, we show a combination of Large Language Models (LLMs) and symbolic programs can bridge this gap. In the proposed LLM-based Symbolic Programs (LSPs), the pretrained LLM with natural language prompts provides a massive set of interpretable modules that can transform raw input into natural language concepts. Symbolic programs then integrate these modules into an interpretable decision rule. To train LSPs, we develop a divide-and-conquer approach to incrementally build the program from scratch, where the learning process of each step is guided by LLMs. To evaluate the effectiveness of LSPs in extracting interpretable and accurate knowledge from data, we introduce IL-Bench, a collection of diverse tasks, including both synthetic and real-world scenarios across different modalities. Empirical results demonstrate LSP's superior performance compared to traditional neurosymbolic programs and vanilla automatic prompt tuning methods. Moreover, as the knowledge learned by LSP is a combination of natural language descriptions and symbolic rules, it is easily transferable to humans (interpretable), and other LLMs, and generalizes well to out-of-distribution samples.
comment: Preliminary Version, Code at [this url](https://github.com/ruocwang/llm-symbolic-program)
☆ Machine Unlearning Fails to Remove Data Poisoning Attacks
We revisit the efficacy of several practical methods for approximate machine unlearning developed for large-scale deep learning. In addition to complying with data deletion requests, one often-cited potential application for unlearning methods is to remove the effects of training on poisoned data. We experimentally demonstrate that, while existing unlearning methods have been demonstrated to be effective in a number of evaluation settings (e.g., alleviating membership inference attacks), they fail to remove the effects of data poisoning, across a variety of types of poisoning attacks (indiscriminate, targeted, and a newly-introduced Gaussian poisoning attack) and models (image classifiers and LLMs); even when granted a relatively large compute budget. In order to precisely characterize unlearning efficacy, we introduce new evaluation metrics for unlearning based on data poisoning. Our results suggest that a broader perspective, including a wider variety of evaluations, is required to avoid a false sense of confidence in machine unlearning procedures for deep learning without provable guarantees. Moreover, while unlearning methods show some signs of being useful to efficiently remove poisoned datapoints without having to retrain, our work suggests that these methods are not yet "ready for prime time", and currently provide limited benefit over retraining.
☆ Enabling Large Language Models to Perform Power System Simulations with Previously Unseen Tools: A Case of Daline
The integration of experiment technologies with large language models (LLMs) is transforming scientific research, offering AI capabilities beyond specialized problem-solving to becoming research assistants for human scientists. In power systems, simulations are essential for research. However, LLMs face significant challenges in power system simulations due to limited pre-existing knowledge and the complexity of power grids. To address this issue, this work proposes a modular framework that integrates expertise from both the power system and LLM domains. This framework enhances LLMs' ability to perform power system simulations on previously unseen tools. Validated using 34 simulation tasks in Daline, a (optimal) power flow simulation and linearization toolbox not yet exposed to LLMs, the proposed framework improved GPT-4o's simulation coding accuracy from 0% to 96.07%, also outperforming the ChatGPT-4o web interface's 33.8% accuracy (with the entire knowledge base uploaded). These results highlight the potential of LLMs as research assistants in power systems.
☆ Geometric Median (GM) Matching for Robust Data Pruning
Data pruning, the combinatorial task of selecting a small and informative subset from a large dataset, is crucial for mitigating the enormous computational costs associated with training data-hungry modern deep learning models at scale. Since large-scale data collections are invariably noisy, developing data pruning strategies that remain robust even in the presence of corruption is critical in practice. Unfortunately, the existing heuristics for (robust) data pruning lack theoretical coherence and rely on heroic assumptions, that are, often unattainable, by the very nature of the problem setting. Moreover, these strategies often yield sub-optimal neural scaling laws even compared to random sampling, especially in scenarios involving strong corruption and aggressive pruning rates -- making provably robust data pruning an open challenge. In response, in this work, we propose Geometric Median ($\gm$) Matching -- a herding~\citep{welling2009herding} style greedy algorithm -- that yields a $k$-subset such that the mean of the subset approximates the geometric median of the (potentially) noisy dataset. Theoretically, we show that $\gm$ Matching enjoys an improved $\gO(1/k)$ scaling over $\gO(1/\sqrt{k})$ scaling of uniform sampling; while achieving the optimal breakdown point of 1/2 even under arbitrary corruption. Extensive experiments across popular deep learning benchmarks indicate that $\gm$ Matching consistently outperforms prior state-of-the-art; the gains become more profound at high rates of corruption and aggressive pruning rates; making $\gm$ Matching a strong baseline for future research in robust data pruning.
☆ Encourage or Inhibit Monosemanticity? Revisit Monosemanticity from a Feature Decorrelation Perspective
To better interpret the intrinsic mechanism of large language models (LLMs), recent studies focus on monosemanticity on its basic units. A monosemantic neuron is dedicated to a single and specific concept, which forms a one-to-one correlation between neurons and concepts. Despite extensive research in monosemanticity probing, it remains unclear whether monosemanticity is beneficial or harmful to model capacity. To explore this question, we revisit monosemanticity from the feature decorrelation perspective and advocate for its encouragement. We experimentally observe that the current conclusion by wang2024learning, which suggests that decreasing monosemanticity enhances model performance, does not hold when the model changes. Instead, we demonstrate that monosemanticity consistently exhibits a positive correlation with model capacity, in the preference alignment process. Consequently, we apply feature correlation as a proxy for monosemanticity and incorporate a feature decorrelation regularizer into the dynamic preference optimization process. The experiments show that our method not only enhances representation diversity and activation sparsity but also improves preference alignment performance.
☆ Efficient Document Ranking with Learnable Late Interactions
Cross-Encoder (CE) and Dual-Encoder (DE) models are two fundamental approaches for query-document relevance in information retrieval. To predict relevance, CE models use joint query-document embeddings, while DE models maintain factorized query and document embeddings; usually, the former has higher quality while the latter benefits from lower latency. Recently, late-interaction models have been proposed to realize more favorable latency-quality tradeoffs, by using a DE structure followed by a lightweight scorer based on query and document token embeddings. However, these lightweight scorers are often hand-crafted, and there is no understanding of their approximation power; further, such scorers require access to individual document token embeddings, which imposes an increased latency and storage burden. In this paper, we propose novel learnable late-interaction models (LITE) that resolve these issues. Theoretically, we prove that LITE is a universal approximator of continuous scoring functions, even for relatively small embedding dimension. Empirically, LITE outperforms previous late-interaction models such as ColBERT on both in-domain and zero-shot re-ranking tasks. For instance, experiments on MS MARCO passage re-ranking show that LITE not only yields a model with better generalization, but also lowers latency and requires 0.25x storage compared to ColBERT.
☆ NormTab: Improving Symbolic Reasoning in LLMs Through Tabular Data Normalization
In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities in parsing textual data and generating code. However, their performance in tasks involving tabular data, especially those requiring symbolic reasoning, faces challenges due to the structural variance and inconsistency in table cell values often found in web tables. In this paper, we introduce NormTab, a novel framework aimed at enhancing the symbolic reasoning performance of LLMs by normalizing web tables. We study table normalization as a stand-alone, one-time preprocessing step using LLMs to support symbolic reasoning on tabular data. Our experimental evaluation, conducted on challenging web table datasets such as WikiTableQuestion and TabFact, demonstrates that leveraging NormTab significantly improves symbolic reasoning performance, showcasing the importance and effectiveness of web table normalization for enhancing LLM-based symbolic reasoning tasks.
comment: Work in Progress
☆ MAGIC: Meta-Ability Guided Interactive Chain-of-Distillation for Effective-and-Efficient Vision-and-Language Navigation
Despite the remarkable developments of recent large models in Embodied Artificial Intelligence (E-AI), their integration into robotics is hampered by their excessive parameter sizes and computational demands. Towards the Vision-and-Language Navigation (VLN) task, a core task in E-AI, this paper reveals the great potential of using knowledge distillation for obtaining lightweight student models by proposing a Meta-Ability Guided Interactive Chain-of-distillation (MAGIC) method. Specifically, a Meta-Ability Knowledge Distillation (MAKD) framework is proposed for decoupling and refining the necessary meta-abilities of VLN agents. A Meta-Knowledge Randomization Weighting (MKRW) and a Meta-Knowledge Transferable Determination (MKTD) module are incorporated to dynamically adjust aggregation weights at the meta-ability and sample levels, respectively. Move beyond the traditional one-step unidirectional distillation, an Interactive Chain-of-Distillation (ICoD) learning strategy is proposed to allow students to give feedback to teachers, forming a new multi-step teacher-student co-evolution pipeline. Remarkably, on the R2R test unseen public leaderboard, our smallest model, MAGIC-S, with only 5% (11M) of the teacher's size, outperforms all previous methods under the same training data. Additionally, our largest model, MAGIC-L, surpasses the previous state-of-the-art by 5.84% in SPL and 3.18% in SR. Furthermore, a new dataset was collected and annotated from our living environments, where MAGIC-S demonstrated superior performance and real-time efficiency. Our code is publicly available on https://github.com/CrystalSixone/VLN-MAGIC.
☆ Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment INTERSPEECH 2024
Large Language Model (LLM) based text-to-speech (TTS) systems have demonstrated remarkable capabilities in handling large speech datasets and generating natural speech for new speakers. However, LLM-based TTS models are not robust as the generated output can contain repeating words, missing words and mis-aligned speech (referred to as hallucinations or attention errors), especially when the text contains multiple occurrences of the same token. We examine these challenges in an encoder-decoder transformer model and find that certain cross-attention heads in such models implicitly learn the text and speech alignment when trained for predicting speech tokens for a given text. To make the alignment more robust, we propose techniques utilizing CTC loss and attention priors that encourage monotonic cross-attention over the text tokens. Our guided attention training technique does not introduce any new learnable parameters and significantly improves robustness of LLM-based TTS models.
comment: Published as a conference paper at INTERSPEECH 2024
☆ The Overcooked Generalisation Challenge
We introduce the Overcooked Generalisation Challenge (OGC) - the first benchmark to study agents' zero-shot cooperation abilities when faced with novel partners and levels in the Overcooked-AI environment. This perspective starkly contrasts a large body of previous work that has trained and evaluated cooperating agents only on the same level, failing to capture generalisation abilities required for real-world human-AI cooperation. Our challenge interfaces with state-of-the-art dual curriculum design (DCD) methods to generate auto-curricula for training general agents in Overcooked. It is the first cooperative multi-agent environment specially designed for DCD methods and, consequently, the first benchmarked with state-of-the-art methods. It is fully GPU-accelerated, built on the DCD benchmark suite minimax, and freely available under an open-source license: https://git.hcics.simtech.uni-stuttgart.de/public-projects/OGC. We show that current DCD algorithms struggle to produce useful policies in this novel challenge, even if combined with recent network architectures that were designed for scalability and generalisability. The OGC pushes the boundaries of real-world human-AI cooperation by enabling the research community to study the impact of generalisation on cooperating agents.
comment: 9 pages
☆ Semi-supervised classification of dental conditions in panoramic radiographs using large language model and instance segmentation: A real-world dataset evaluation
Dental panoramic radiographs offer vast diagnostic opportunities, but training supervised deep learning networks for automatic analysis of those radiology images is hampered by a shortage of labeled data. Here, a different perspective on this problem is introduced. A semi-supervised learning framework is proposed to classify thirteen dental conditions on panoramic radiographs, with a particular emphasis on teeth. Large language models were explored to annotate the most common dental conditions based on dental reports. Additionally, a masked autoencoder was employed to pre-train the classification neural network, and a Vision Transformer was used to leverage the unlabeled data. The analyses were validated using two of the most extensive datasets in the literature, comprising 8,795 panoramic radiographs and 8,029 paired reports and images. Encouragingly, the results consistently met or surpassed the baseline metrics for the Matthews correlation coefficient. A comparison of the proposed solution with human practitioners, supported by statistical analysis, highlighted its effectiveness and performance limitations; based on the degree of agreement among specialists, the solution demonstrated an accuracy level comparable to that of a junior specialist.
comment: 43 pages, 12 figures, 9 tables
☆ Transforming Software Development: Evaluating the Efficiency and Challenges of GitHub Copilot in Real-World Projects
Generative AI technologies promise to transform the product development lifecycle. This study evaluates the efficiency gains, areas for improvement, and emerging challenges of using GitHub Copilot, an AI-powered coding assistant. We identified 15 software development tasks and assessed Copilot's benefits through real-world projects on large proprietary code bases. Our findings indicate significant reductions in developer toil, with up to 50% time saved in code documentation and autocompletion, and 30-40% in repetitive coding tasks, unit test generation, debugging, and pair programming. However, Copilot struggles with complex tasks, large functions, multiple files, and proprietary contexts, particularly with C/C++ code. We project a 33-36% time reduction for coding-related tasks in a cloud-first software development lifecycle. This study aims to quantify productivity improvements, identify underperforming scenarios, examine practical benefits and challenges, investigate performance variations across programming languages, and discuss emerging issues related to code quality, security, and developer experience.
comment: 13 pages, 8 figures
☆ Unbiasing on the Fly: Explanation-Guided Human Oversight of Machine Learning System Decisions
The widespread adoption of ML systems across critical domains like hiring, finance, and healthcare raises growing concerns about their potential for discriminatory decision-making based on protected attributes. While efforts to ensure fairness during development are crucial, they leave deployed ML systems vulnerable to potentially exhibiting discrimination during their operations. To address this gap, we propose a novel framework for on-the-fly tracking and correction of discrimination in deployed ML systems. Leveraging counterfactual explanations, the framework continuously monitors the predictions made by an ML system and flags discriminatory outcomes. When flagged, post-hoc explanations related to the original prediction and the counterfactual alternatives are presented to a human reviewer for real-time intervention. This human-in-the-loop approach empowers reviewers to accept or override the ML system decision, enabling fair and responsible ML operation under dynamic settings. While further work is needed for validation and refinement, this framework offers a promising avenue for mitigating discrimination and building trust in ML systems deployed in a wide range of domains.
comment: 13 pages
☆ Application of Liquid Rank Reputation System for Twitter Trend Analysis on Bitcoin
Analyzing social media trends can create a win-win situation for both creators and consumers. Creators can receive fair compensation, while consumers gain access to engaging, relevant, and personalized content. This paper proposes a new model for analyzing Bitcoin trends on Twitter by incorporating a 'liquid democracy' approach based on user reputation. This system aims to identify the most impactful trends and their influence on Bitcoin prices and trading volume. It uses a Twitter sentiment analysis model based on a reputation rating system to determine the impact on Bitcoin price change and traded volume. In addition, the reputation model considers the users' higher-order friends on the social network (the initial Twitter input channels in our case study) to improve the accuracy and diversity of the reputation results. We analyze Bitcoin-related news on Twitter to understand how trends and user sentiment, measured through our Liquid Rank Reputation System, affect Bitcoin price fluctuations and trading activity within the studied time frame. This reputation model can also be used as an additional layer in other trend and sentiment analysis models. The paper proposes the implementation, challenges, and future scope of the liquid rank reputation model.
comment: Under publication in 2024 Ural-Siberian Conference on Biomedical Engineering, Radioelectronics and Information Technology, Yekaterinburg, Russia
☆ Domain Adaptation of Echocardiography Segmentation Via Reinforcement Learning
Performance of deep learning segmentation models is significantly challenged in its transferability across different medical imaging domains, particularly when aiming to adapt these models to a target domain with insufficient annotated data for effective fine-tuning. While existing domain adaptation (DA) methods propose strategies to alleviate this problem, these methods do not explicitly incorporate human-verified segmentation priors, compromising the potential of a model to produce anatomically plausible segmentations. We introduce RL4Seg, an innovative reinforcement learning framework that reduces the need to otherwise incorporate large expertly annotated datasets in the target domain, and eliminates the need for lengthy manual human review. Using a target dataset of 10,000 unannotated 2D echocardiographic images, RL4Seg not only outperforms existing state-of-the-art DA methods in accuracy but also achieves 99% anatomical validity on a subset of 220 expert-validated subjects from the target domain. Furthermore, our framework's reward network offers uncertainty estimates comparable with dedicated state-of-the-art uncertainty methods, demonstrating the utility and effectiveness of RL4Seg in overcoming domain adaptation challenges in medical image segmentation.
comment: 9 pages
☆ Human-centered In-building Embodied Delivery Benchmark
Recently, the concept of embodied intelligence has been widely accepted and popularized, leading people to naturally consider the potential for commercialization in this field. In this work, we propose a specific commercial scenario simulation, human-centered in-building embodied delivery. Furthermore, for this scenario, we have developed a brand-new virtual environment system from scratch, constructing a multi-level connected building space modeled after a polar research station. This environment also includes autonomous human characters and robots with grasping and mobility capabilities, as well as a large number of interactive items. Based on this environment, we have built a delivery dataset containing 13k language instructions to guide robots in providing services. We simulate human behavior through human characters and sample their various needs in daily life. Finally, we proposed a method centered around a large multimodal model to serve as the baseline system for this dataset. Compared to past embodied data work, our work focuses on a virtual environment centered around human-robot interaction for commercial scenarios. We believe this will bring new perspectives and exploration angles to the embodied community.
☆ CTBench: A Comprehensive Benchmark for Evaluating Language Model Capabilities in Clinical Trial Design
CTBench is introduced as a benchmark to assess language models (LMs) in aiding clinical study design. Given study-specific metadata, CTBench evaluates AI models' ability to determine the baseline features of a clinical trial (CT), which include demographic and relevant features collected at the trial's start from all participants. These baseline features, typically presented in CT publications (often as Table 1), are crucial for characterizing study cohorts and validating results. Baseline features, including confounders and covariates, are also necessary for accurate treatment effect estimation in studies involving observational data. CTBench consists of two datasets: "CT-Repo," containing baseline features from 1,690 clinical trials sourced from clinicaltrials.gov, and "CT-Pub," a subset of 100 trials with more comprehensive baseline features gathered from relevant publications. Two LM-based evaluation methods are developed to compare the actual baseline feature lists against LM-generated responses. "ListMatch-LM" and "ListMatch-BERT" use GPT-4o and BERT scores (at various thresholds), respectively, for evaluation. To establish baseline results, advanced prompt engineering techniques using LLaMa3-70B-Instruct and GPT-4o in zero-shot and three-shot learning settings are applied to generate potential baseline features. The performance of GPT-4o as an evaluator is validated through human-in-the-loop evaluations on the CT-Pub dataset, where clinical experts confirm matches between actual and LM-generated features. The results highlight a promising direction with significant potential for improvement, positioning CTBench as a useful tool for advancing research on AI in CT design and potentially enhancing the efficacy and robustness of CTs.
☆ Federated Dynamical Low-Rank Training with Global Loss Convergence Guarantees
In this work, we propose a federated dynamical low-rank training (FeDLRT) scheme to reduce client compute and communication costs - two significant performance bottlenecks in horizontal federated learning. Our method builds upon dynamical low-rank splitting schemes for manifold-constrained optimization to create a global low-rank basis of network weights, which enables client training on a small coefficient matrix. A consistent global low-rank basis allows us to incorporate a variance correction scheme and prove global loss descent and convergence to a stationary point. Dynamic augmentation and truncation of the low-rank bases automatically optimizes computing and communication resource utilization. We demonstrate the efficiency of FeDLRT in an array of computer vision benchmarks and show a reduction of client compute and communication costs by up to an order of magnitude with minimal impacts on global accuracy.
☆ Enabling Regional Explainability by Automatic and Model-agnostic Rule Extraction
In Explainable AI, rule extraction translates model knowledge into logical rules, such as IF-THEN statements, crucial for understanding patterns learned by black-box models. This could significantly aid in fields like disease diagnosis, disease progression estimation, or drug discovery. However, such application domains often contain imbalanced data, with the class of interest underrepresented. Existing methods inevitably compromise the performance of rules for the minor class to maximise the overall performance. As the first attempt in this field, we propose a model-agnostic approach for extracting rules from specific subgroups of data, featuring automatic rule generation for numerical features. This method enhances the regional explainability of machine learning models and offers wider applicability compared to existing methods. We additionally introduce a new method for selecting features to compose rules, reducing computational costs in high-dimensional spaces. Experiments across various datasets and models demonstrate the effectiveness of our methods.
comment: 18 pages
☆ ET tu, CLIP? Addressing Common Object Errors for Unseen Environments
We introduce a simple method that employs pre-trained CLIP encoders to enhance model generalization in the ALFRED task. In contrast to previous literature where CLIP replaces the visual encoder, we suggest using CLIP as an additional module through an auxiliary object detection objective. We validate our method on the recently proposed Episodic Transformer architecture and demonstrate that incorporating CLIP improves task performance on the unseen validation set. Additionally, our analysis results support that CLIP especially helps with leveraging object descriptions, detecting small objects, and interpreting rare words.
♻ ☆ Adam-mini: Use Fewer Learning Rates To Gain More
We propose Adam-mini, an optimizer that achieves on-par or better performance than AdamW with 45% to 50% less memory footprint. Adam-mini reduces memory by cutting down the number of learning rates in Adam: Instead of assigning an individual learning rate for each parameter using $1/\sqrt{v}$, Adam-mini uses the average of $v$ within a pre-defined parameter block as the learning rate for that block. Such a design is inspired by two empirical findings. First, the Hessian of Transformers exhibits a near-block diagonal structure with different sizes of dense sub-blocks. Second, for each of these dense sub-blocks, there exists a single high-quality learning rate that can outperform Adam, provided that sufficient resources are available to search it out. Adam-mini provides one cost-effective way to find these good learning rates and manage to cut down $\geq$ 90% $v$ in Adam. Empirically, we verify that Adam-mini performs on par or better than AdamW on various language models sized from 125M to 7B for pre-training, supervised fine-tuning, and RLHF. The reduced memory footprint of Adam-mini also alleviates communication overheads among GPUs and CPUs, thereby increasing throughput. For instance, Adam-mini achieves 49.6% higher throughput than AdamW when pre-training Llama2-7B on 2x A800-80GB GPUs, which saves 33% wall-clock time for pre-training.
♻ ☆ Large Language Models in Healthcare: A Comprehensive Benchmark
The adoption of large language models (LLMs) to assist clinicians has attracted remarkable attention. Existing works mainly adopt the close-ended question-answering (QA) task with answer options for evaluation. However, many clinical decisions involve answering open-ended questions without pre-set options. To better understand LLMs in the clinic, we construct a benchmark ClinicBench. We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks. Furthermore, we construct six novel datasets and complex clinical tasks that are close to real-world practice, i.e., referral QA, treatment recommendation, hospitalization (long document) summarization, patient education, pharmacology QA and drug interaction for emerging drugs. We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings. Finally, we invite medical experts to evaluate the clinical usefulness of LLMs.
♻ ☆ A Data-Centric Approach To Generate Faithful and High Quality Patient Summaries with Large Language Models
Patients often face difficulties in understanding their hospitalizations, while healthcare workers have limited resources to provide explanations. In this work, we investigate the potential of large language models to generate patient summaries based on doctors' notes and study the effect of training data on the faithfulness and quality of the generated summaries. To this end, we release (i) a rigorous labeling protocol for errors in medical texts and (ii) a publicly available dataset of annotated hallucinations in 100 doctor-written and 100 generated summaries. We show that fine-tuning on hallucination-free data effectively reduces hallucinations from 2.60 to 1.55 per summary for Llama 2, while preserving relevant information. We observe a similar effect on GPT-4 (0.70 to 0.40), when the few-shot examples are hallucination-free. We also conduct a qualitative evaluation using hallucination-free and improved training data. We find that common quantitative metrics do not correlate well with faithfulness and quality. Finally, we test GPT-4 for automatic hallucination detection, which clearly outperforms common baselines.
♻ ☆ Deep Pulse-Signal Magnification for remote Heart Rate Estimation in Compressed Videos
Recent advancements in data-driven approaches for remote photoplethysmography (rPPG) have significantly improved the accuracy of remote heart rate estimation. However, the performance of such approaches worsens considerably under video compression, which is nevertheless necessary to store and transmit video data efficiently. In this paper, we present a novel approach to address the impact of video compression on rPPG estimation, which leverages a pulse-signal magnification transformation to adapt compressed videos to an uncompressed data domain in which the rPPG signal is magnified. We validate the effectiveness of our model by exhaustive evaluations on two publicly available datasets, UCLA-rPPG and UBFC-rPPG, employing both intra- and cross-database performance at several compression rates. Additionally, we assess the robustness of our approach on two additional highly compressed and widely-used datasets, MAHNOB-HCI and COHFACE, which reveal outstanding heart rate estimation results.
♻ ☆ Truthful Aggregation of LLMs with an Application to Online Advertising
Online platforms generate hundreds of billions of dollars in revenue per year by showing advertisements alongside their own content. Currently, these platforms are integrating Large Language Models (LLMs) into their services. This makes revenue generation from LLM-generated content the next major challenge in online advertising. We consider a scenario where advertisers aim to influence the responses of an LLM to align with their interests, while platforms seek to maximize advertiser value and ensure user satisfaction. We introduce an auction mechanism for this problem that operates without LLM fine-tuning or access to model weights and provably converges to the output of the optimally fine-tuned LLM for the platform's objective as computational resources increase. Our mechanism ensures that truthful reporting is a dominant strategy for advertisers and it aligns each advertiser's utility with their contribution to social welfare - an essential feature for long-term viability. Additionally, it can incorporate contextual information about the advertisers, significantly accelerating convergence. Via experiments with a publicly available LLM, we show that our mechanism significantly boosts advertiser value and platform revenue, with low computational overhead. While our motivating application is online advertising, our mechanism can be applied in any setting with monetary transfers, making it a general-purpose solution for truthfully aggregating the preferences of self-interested agents over LLM-generated replies.
♻ ☆ iWISDM: Assessing instruction following in multimodal models at scale
The ability to perform complex tasks from detailed instructions is a key to many remarkable achievements of our species. As humans, we are not only capable of performing a wide variety of tasks but also very complex ones that may entail hundreds or thousands of steps to complete. Large language models and their more recent multimodal counterparts that integrate textual and visual inputs have achieved unprecedented success in performing complex tasks. Yet, most existing benchmarks are largely confined to single-modality inputs (either text or vision), narrowing the scope of multimodal assessments, particularly for instruction-following in multimodal contexts. To bridge this gap, we introduce the instructed-Virtual VISual Decision Making (iWISDM) environment engineered to generate a limitless array of vision-language tasks of varying complexity. Using iWISDM, we compiled three distinct benchmarks of instruction following visual tasks across varying complexity levels and evaluated several newly developed multimodal models on these benchmarks. Our findings establish iWISDM as a robust benchmark for assessing the instructional adherence of both existing and emergent multimodal models and highlight a large gap between these models' ability to precisely follow instructions with that of humans.The code of iWISDM is available on GitHub at https://github.com/BashivanLab/iWISDM.
♻ ☆ Evaluating $n$-Gram Novelty of Language Models Using Rusty-DAWG
How novel are texts generated by language models (LMs) relative to their training corpora? In this work, we investigate the extent to which modern LMs generate $n$-grams from their training data, evaluating both (i) the probability LMs assign to complete training $n$-grams and (ii) $n$-novelty, the proportion of $n$-grams generated by an LM that did not appear in the training data (for arbitrarily large $n$). To enable arbitrary-length $n$-gram search over a corpus in constant time, we develop Rusty-DAWG, a novel search tool inspired by indexing of genomic data. We compare the novelty of LM-generated text to human-written text and explore factors that affect generation novelty, focusing on the Pythia models. We find that, for $n > 4$, LM-generated text is less novel than human-written text, though it is more novel for smaller $n$. Larger LMs and more constrained decoding strategies both decrease novelty. Finally, we show that LMs complete $n$-grams with lower loss if they are more frequent in the training data. Overall, our results reveal factors influencing the novelty of LM-generated text, and we release Rusty-DAWG to facilitate further pretraining data research.
comment: 8 page preprint + appendix. Minor fixes and appendix changes June 25, 2024
♻ ☆ Learning with Noisy Labels through Learnable Weighting and Centroid Similarity
We introduce a novel method for training machine learning models in the presence of noisy labels, which are prevalent in domains such as medical diagnosis and autonomous driving and have the potential to degrade a model's generalization performance. Inspired by established literature that highlights how deep learning models are prone to overfitting to noisy samples in the later epochs of training, we propose a strategic approach. This strategy leverages the distance to class centroids in the latent space and incorporates a discounting mechanism, aiming to diminish the influence of samples that lie distant from all class centroids. By doing so, we effectively counteract the adverse effects of noisy labels. The foundational premise of our approach is the assumption that samples situated further from their respective class centroid in the initial stages of training are more likely to be associated with noise. Our methodology is grounded in robust theoretical principles and has been validated empirically through extensive experiments on several benchmark datasets. Our results show that our method consistently outperforms the existing state-of-the-art techniques, achieving significant improvements in classification accuracy in the presence of noisy labels. The code for our proposed loss function and supplementary materials is available at https://github.com/wanifarooq/NCOD
♻ ☆ SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages
Question Answering (QA) datasets have been instrumental in developing and evaluating Large Language Model (LLM) capabilities. However, such datasets are scarce for languages other than English due to the cost and difficulties of collection and manual annotation. This means that producing novel models and measuring the performance of multilingual LLMs in low-resource languages is challenging. To mitigate this, we propose $\textbf{S}$yn$\textbf{DAR}$in, a method for generating and validating QA datasets for low-resource languages. We utilize parallel content mining to obtain $\textit{human-curated}$ paragraphs between English and the target language. We use the English data as context to $\textit{generate}$ synthetic multiple-choice (MC) question-answer pairs, which are automatically translated and further validated for quality. Combining these with their designated non-English $\textit{human-curated}$ paragraphs form the final QA dataset. The method allows to maintain the content quality, reduces the likelihood of factual errors, and circumvents the need for costly annotation. To test the method, we created a QA dataset with $1.2$K samples for the Armenian language. The human evaluation shows that $98\%$ of the generated English data maintains quality and diversity in the question types and topics, while the translation validation pipeline can filter out $\sim70\%$ of data with poor quality. We use the dataset to benchmark state-of-the-art LLMs, showing their inability to achieve human accuracy with some model performances closer to random chance. This shows that the generated dataset is non-trivial and can be used to evaluate reasoning capabilities in low-resource language.
♻ ☆ Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs
Existing methods for adapting large language models (LLMs) to new tasks are not suited to multi-task adaptation because they modify all the model weights -- causing destructive interference between tasks. The resulting effects, such as catastrophic forgetting of earlier tasks, make it challenging to obtain good performance on multiple tasks at the same time. To mitigate this, we propose Lottery Ticket Adaptation (LoTA), a sparse adaptation method that identifies and optimizes only a sparse subnetwork of the model. We evaluate LoTA on a wide range of challenging tasks such as instruction following, reasoning, math, and summarization. LoTA obtains better performance than full fine-tuning and low-rank adaptation (LoRA), and maintains good performance even after training on other tasks -- thus, avoiding catastrophic forgetting. By extracting and fine-tuning over lottery tickets (or sparse task vectors), LoTA also enables model merging over highly dissimilar tasks. Our code is made publicly available at https://github.com/kiddyboots216/lottery-ticket-adaptation.
♻ ☆ MedCalc-Bench: Evaluating Large Language Models for Medical Calculations
As opposed to evaluating computation and logic-based reasoning, current benchmarks for evaluating large language models (LLMs) in medicine are primarily focused on question-answering involving domain knowledge and descriptive reasoning. While such qualitative capabilities are vital to medical diagnosis, in real-world scenarios, doctors frequently use clinical calculators that follow quantitative equations and rule-based reasoning paradigms for evidence-based decision support. To this end, we propose MedCalc-Bench, a first-of-its-kind dataset focused on evaluating the medical calculation capability of LLMs. MedCalc-Bench contains an evaluation set of over 1000 manually reviewed instances from 55 different medical calculation tasks. Each instance in MedCalc-Bench consists of a patient note, a question requesting to compute a specific medical value, a ground truth answer, and a step-by-step explanation showing how the answer is obtained. While our evaluation results show the potential of LLMs in this area, none of them are effective enough for clinical settings. Common issues include extracting the incorrect entities, not using the correct equation or rules for a calculation task, or incorrectly performing the arithmetic for the computation. We hope our study highlights the quantitative knowledge and reasoning gaps in LLMs within medical settings, encouraging future improvements of LLMs for various clinical calculation tasks.
comment: Github link: https://github.com/ncbi-nlp/MedCalc-Bench HuggingFace link: https://huggingface.co/datasets/nsk7153/MedCalc-Bench
♻ ☆ Aligning Large Language Models by On-Policy Self-Judgment ACL 2024
Existing approaches for aligning large language models with human preferences face a trade-off that requires a separate reward model (RM) for on-policy learning. In this paper, we present a novel alignment framework, SELF-JUDGE that (1) does on-policy learning and 2) is parameter efficient, as it does not require an additional RM for evaluating the samples for on-policy learning. To this end, we propose Judge-augmented Supervised Fine-Tuning (JSFT) to train a single model to act as both a policy and a judge. Specifically, we view the pairwise judgment task, choosing the better response from a response pair, as a special case of the instruction-following task. The resulting model can judge preferences of on-the-fly responses from current policy initialized from itself. Experimental results show the efficacy of SELF-JUDGE, outperforming baselines in preference benchmarks. We also show that the rejecting sampling by itself can improve performance further without an additional evaluator.
comment: Published as a main conference paper at ACL 2024
♻ ☆ MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues ACL 2024
The advent of Large Language Models (LLMs) has drastically enhanced dialogue systems. However, comprehensively evaluating the dialogue abilities of LLMs remains a challenge. Previous benchmarks have primarily focused on single-turn dialogues or provided coarse-grained and incomplete assessments of multi-turn dialogues, overlooking the complexity and fine-grained nuances of real-life dialogues. To address this issue, we introduce MT-Bench-101, specifically designed to evaluate the fine-grained abilities of LLMs in multi-turn dialogues. By conducting a detailed analysis of real multi-turn dialogue data, we construct a three-tier hierarchical ability taxonomy comprising 4208 turns across 1388 multi-turn dialogues in 13 distinct tasks. We then evaluate 21 popular LLMs based on MT-Bench-101, conducting comprehensive analyses from both ability and task perspectives and observing differing trends in LLMs performance across dialogue turns within various tasks. Further analysis indicates that neither utilizing common alignment techniques nor chat-specific designs has led to obvious enhancements in the multi-turn abilities of LLMs. Extensive case studies suggest that our designed tasks accurately assess the corresponding multi-turn abilities. The data and code are available at \url{https://github.com/mtbench101/mt-bench-101}.
comment: [ACL 2024] The first three authors contribute equally, 34 pages, repo at https://github.com/mtbench101/mt-bench-101
♻ ☆ Scoreformer: A Surrogate Model For Large-Scale Prediction of Docking Scores ICML 2024
In this study, we present ScoreFormer, a novel graph transformer model designed to accurately predict molecular docking scores, thereby optimizing high-throughput virtual screening (HTVS) in drug discovery. The architecture integrates Principal Neighborhood Aggregation (PNA) and Learnable Random Walk Positional Encodings (LRWPE), enhancing the model's ability to understand complex molecular structures and their relationship with their respective docking scores. This approach significantly surpasses traditional HTVS methods and recent Graph Neural Network (GNN) models in both recovery and efficiency due to a wider coverage of the chemical space and enhanced performance. Our results demonstrate that ScoreFormer achieves competitive performance in docking score prediction and offers a substantial 1.65-fold reduction in inference time compared to existing models. We evaluated ScoreFormer across multiple datasets under various conditions, confirming its robustness and reliability in identifying potential drug candidates rapidly.
comment: Accepted at the 1st Machine Learning for Life and Material Sciences Workshop at ICML 2024
♻ ☆ In value-based deep reinforcement learning, a pruned network is a good network
Recent work has shown that deep reinforcement learning agents have difficulty in effectively using their network parameters. We leverage prior insights into the advantages of sparse training techniques and demonstrate that gradual magnitude pruning enables value-based agents to maximize parameter effectiveness. This results in networks that yield dramatic performance improvements over traditional networks, using only a small fraction of the full network parameters.
♻ ☆ Jigsaw: Supporting Designers to Prototype Multimodal Applications by Chaining AI Foundation Models
Recent advancements in AI foundation models have made it possible for them to be utilized off-the-shelf for creative tasks, including ideating design concepts or generating visual prototypes. However, integrating these models into the creative process can be challenging as they often exist as standalone applications tailored to specific tasks. To address this challenge, we introduce Jigsaw, a prototype system that employs puzzle pieces as metaphors to represent foundation models. Jigsaw allows designers to combine different foundation model capabilities across various modalities by assembling compatible puzzle pieces. To inform the design of Jigsaw, we interviewed ten designers and distilled design goals. In a user study, we showed that Jigsaw enhanced designers' understanding of available foundation model capabilities, provided guidance on combining capabilities across different modalities and tasks, and served as a canvas to support design exploration, prototyping, and documentation.
comment: https://jigsaw.to
♻ ☆ A Probabilistic Fluctuation based Membership Inference Attack for Diffusion Models
Membership Inference Attack (MIA) identifies whether a record exists in a machine learning model's training set by querying the model. MIAs on the classic classification models have been well-studied, and recent works have started to explore how to transplant MIA onto generative models. Our investigation indicates that existing MIAs designed for generative models mainly depend on the overfitting in target models. However, overfitting can be avoided by employing various regularization techniques, whereas existing MIAs demonstrate poor performance in practice. Unlike overfitting, memorization is essential for deep learning models to attain optimal performance, making it a more prevalent phenomenon. Memorization in generative models leads to an increasing trend in the probability distribution of generating records around the member record. Therefore, we propose a Probabilistic Fluctuation Assessing Membership Inference Attack (PFAMI), a black-box MIA that infers memberships by detecting these trends via analyzing the overall probabilistic fluctuations around given records. We conduct extensive experiments across multiple generative models and datasets, which demonstrate PFAMI can improve the attack success rate (ASR) by about 27.9% when compared with the best baseline.
comment: Repo: https://github.com/wjfu99/MIA-Gen
♻ ☆ High-Dimension Human Value Representation in Large Language Models
The widespread application of Large Language Models (LLMs) across various tasks and fields has necessitated the alignment of these models with human values and preferences. Given various approaches of human value alignment, ranging from Reinforcement Learning with Human Feedback (RLHF), to constitutional learning, etc. there is an urgent need to understand the scope and nature of human values injected into these models before their release. There is also a need for model alignment without a costly large scale human annotation effort. We propose UniVaR, a high-dimensional representation of human value distributions in LLMs, orthogonal to model architecture and training data. Trained from the value-relevant output of eight multilingual LLMs and tested on the output from four multilingual LLMs, namely LlaMA2, ChatGPT, JAIS and Yi, we show that UniVaR is a powerful tool to compare the distribution of human values embedded in different LLMs with different langauge sources. Through UniVaR, we explore how different LLMs prioritize various values in different languages and cultures, shedding light on the complex interplay between human values and language modeling.
♻ ☆ Embedding Ontologies via Incorporating Extensional and Intensional Knowledge
Ontologies contain rich knowledge within domain, which can be divided into two categories, namely extensional knowledge and intensional knowledge. Extensional knowledge provides information about the concrete instances that belong to specific concepts in the ontology, while intensional knowledge details inherent properties, characteristics, and semantic associations among concepts. However, existing ontology embedding approaches fail to take both extensional knowledge and intensional knowledge into fine consideration simultaneously. In this paper, we propose a novel ontology embedding approach named EIKE (Extensional and Intensional Knowledge Embedding) by representing ontologies in two spaces, called extensional space and intensional space. EIKE presents a unified framework for embedding instances, concepts and their relations in an ontology, applying a geometry-based method to model extensional knowledge and a pretrained language model to model intensional knowledge, which can capture both structure information and textual information. Experimental results show that EIKE significantly outperforms state-of-the-art methods in three datasets for both triple classification and link prediction, indicating that EIKE provides a more comprehensive and representative perspective of the domain.
♻ ☆ LLMs Are Few-Shot In-Context Low-Resource Language Learners
In-context learning (ICL) empowers large language models (LLMs) to perform diverse tasks in underrepresented languages using only short in-context information, offering a crucial avenue for narrowing the gap between high-resource and low-resource languages. Nonetheless, there is only a handful of works explored ICL for low-resource languages with most of them focusing on relatively high-resource languages, such as French and Spanish. In this work, we extensively study ICL and its cross-lingual variation (X-ICL) on 25 low-resource and 7 relatively higher-resource languages. Our study not only assesses the effectiveness of ICL with LLMs in low-resource languages but also identifies the shortcomings of in-context label alignment, and introduces a more effective alternative: query alignment. Moreover, we provide valuable insights into various facets of ICL for low-resource languages. Our study concludes the significance of few-shot in-context information on enhancing the low-resource understanding quality of LLMs through semantically relevant information by closing the language gap in the target language and aligning the semantics between the targeted low-resource and the high-resource language that the model is proficient in. Our work highlights the importance of advancing ICL research, particularly for low-resource languages. Our code is publicly released at https://github.com/SamuelCahyawijaya/in-context-alignment
♻ ☆ Efficient 3D Molecular Generation with Flow Matching and Scale Optimal Transport
Generative models for 3D drug design have gained prominence recently for their potential to design ligands directly within protein pockets. Current approaches, however, often suffer from very slow sampling times or generate molecules with poor chemical validity. Addressing these limitations, we propose Semla, a scalable E(3)-equivariant message passing architecture. We further introduce a molecular generation model, SemlaFlow, which is trained using flow matching along with scale optimal transport, a novel extension of equivariant optimal transport. Our model produces state-of-the-art results on benchmark datasets with just 100 sampling steps. Crucially, SemlaFlow samples high quality molecules with as few as 20 steps, corresponding to a two order-of-magnitude speed-up compared to state-of-the-art, without sacrificing performance. Furthermore, we highlight limitations of current evaluation methods for 3D generation and propose new benchmark metrics for unconditional molecular generators. Finally, using these new metrics, we compare our model's ability to generate high quality samples against current approaches and further demonstrate SemlaFlow's strong performance.
comment: Preprint. Code to be released upon full publication
♻ ☆ From Text to Test: AI-Generated Control Software for Materials Science Instruments
Large language models (LLMs) are transforming the landscape of chemistry and materials science. Recent examples of LLM-accelerated experimental research include virtual assistants for parsing synthesis recipes from the literature, or using the extracted knowledge to guide synthesis and characterization. Despite these advancements, their application is constrained to labs with automated instruments and control software, leaving much of materials science reliant on manual processes. Here, we demonstrate the rapid deployment of a Python-based control module for a Keithley 2400 electrical source measure unit using ChatGPT-4. Through iterative refinement, we achieved effective instrument management with minimal human intervention. Additionally, a user-friendly graphical user interface (GUI) was created, effectively linking all instrument controls to interactive screen elements. Finally, we integrated this AI-crafted instrument control software with a high-performance stochastic optimization algorithm to facilitate rapid and automated extraction of electronic device parameters related to semiconductor charge transport mechanisms from current-voltage (IV) measurement data. This integration resulted in a comprehensive open-source toolkit for semiconductor device characterization and analysis using IV curve measurements. We demonstrate the application of these tools by acquiring, analyzing, and parameterizing IV data from a Pt/Cr$_2$O$_3$:Mg/$\beta$-Ga$_2$O$_3$ heterojunction diode, a novel stack for high-power and high-temperature electronic devices. This approach underscores the powerful synergy between LLMs and the development of instruments for scientific inquiry, showcasing a path for further acceleration in materials science.
♻ ☆ Embodied Question Answering via Multi-LLM Systems
Embodied Question Answering (EQA) is an important problem, which involves an agent exploring the environment to answer user queries. In the existing literature, EQA has exclusively been studied in single-agent scenarios, where exploration can be time-consuming and costly. In this work, we consider EQA in a multi-agent framework involving multiple large language models (LLM) based agents independently answering queries about a household environment. To generate one answer for each query, we use the individual responses to train a Central Answer Model (CAM) that aggregates responses for a robust answer. Using CAM, we observe a $50\%$ higher EQA accuracy when compared against aggregation methods for ensemble LLM, such as voting schemes and debates. CAM does not require any form of agent communication, alleviating it from the associated costs. We ablate CAM with various nonlinear (neural network, random forest, decision tree, XGBoost) and linear (logistic regression classifier, SVM) algorithms. Finally, we present a feature importance analysis for CAM via permutation feature importance (PFI), quantifying CAMs reliance on each independent agent and query context.
comment: 17 pages, 13 Figures, 4 Tables
♻ ☆ WRDScore: New Metric for Evaluation of Natural Language Generation Models
The problem of natural language generation, and, more specifically, method name prediction, faces significant difficulties when proposed models need to be evaluated on test data. Such a metric would need to consider the versatility with which a single method can be named, with respect to both semantics and syntax. Measuring the direct overlap between the predicted and reference (true) sequences will not be able to capture these subtleties. Other existing embedding based metrics either do not measure precision and recall or impose strict unrealistic assumptions on both sequences. To address these issues, we propose a new metric that, on the one hand, is very simple and lightweight, and, on the other hand, is able to calculate precision and recall without resorting to any assumptions while obtaining good performance with respect to the human judgement.
♻ ☆ FedPop: Federated Population-based Hyperparameter Tuning
Federated Learning (FL) is a distributed machine learning (ML) paradigm, in which multiple clients collaboratively train ML models without centralizing their local data. Similar to conventional ML pipelines, the client local optimization and server aggregation procedure in FL are sensitive to the hyperparameter (HP) selection. Despite extensive research on tuning HPs for centralized ML, these methods yield suboptimal results when employed in FL. This is mainly because their "training-after-tuning" framework is unsuitable for FL with limited client computation power. While some approaches have been proposed for HP-Tuning in FL, they are limited to the HPs for client local updates. In this work, we propose a novel HP-tuning algorithm, called Federated Population-based Hyperparameter Tuning (FedPop), to address this vital yet challenging problem. FedPop employs population-based evolutionary algorithms to optimize the HPs, which accommodates various HP types at both client and server sides. Compared with prior tuning methods, FedPop employs an online "tuning-while-training" framework, offering computational efficiency and enabling the exploration of a broader HP search space. Our empirical validation on the common FL benchmarks and complex real-world FL datasets demonstrates the effectiveness of the proposed method, which substantially outperforms the concurrent state-of-the-art HP tuning methods for FL.
comment: Code: https://github.com/HaokunChen245/FedPop
♻ ☆ PatentEval: Understanding Errors in Patent Generation
In this work, we introduce a comprehensive error typology specifically designed for evaluating two distinct tasks in machine-generated patent texts: claims-to-abstract generation, and the generation of the next claim given previous ones. We have also developed a benchmark, PatentEval, for systematically assessing language models in this context. Our study includes a comparative analysis, annotated by humans, of various models. These range from those specifically adapted during training for tasks within the patent domain to the latest general-purpose large language models (LLMs). Furthermore, we explored and evaluated some metrics to approximate human judgments in patent text evaluation, analyzing the extent to which these metrics align with expert assessments. These approaches provide valuable insights into the capabilities and limitations of current language models in the specialized field of patent text generation.
♻ ☆ CoSMo: a Framework to Instantiate Conditioned Process Simulation Models
Process simulation is gaining attention for its ability to assess potential performance improvements and risks associated with business process changes. The existing literature presents various techniques, generally grounded in process models discovered from event log data or built upon deep learning algorithms. These techniques have specific strengths and limitations. Traditional data-driven approaches offer increased interpretability, while deep learning-based excel at generalizing changes across large event logs. However, the practical application of deep learning faces challenges related to managing stochasticity and integrating information for what-if analysis. This paper introduces a novel recurrent neural architecture tailored to discover COnditioned process Simulation MOdels (CoSMo) based on user-based constraints or any other nature of a-priori knowledge. This architecture facilitates the simulation of event logs that adhere to specific constraints by incorporating declarative-based rules into the learning phase as an attempt to fill the gap of incorporating information into deep learning models to perform what-if analysis. Experimental validation illustrates CoSMo's efficacy in simulating event logs while adhering to predefined declarative conditions, emphasizing both control-flow and data-flow perspectives.
♻ ☆ Straight-Through meets Sparse Recovery: the Support Exploration Algorithm
The {\it straight-through estimator} (STE) is commonly used to optimize quantized neural networks, yet its contexts of effective performance are still unclear despite empirical successes.To make a step forward in this comprehension, we apply STE to a well-understood problem: {\it sparse support recovery}. We introduce the {\it Support Exploration Algorithm} (SEA), a novel algorithm promoting sparsity, and we analyze its performance in support recovery (a.k.a. model selection) problems. SEA explores more supports than the state-of-the-art, leading to superior performance in experiments, especially when the columns of $A$ are strongly coherent.The theoretical analysis considers recovery guarantees when the linear measurements matrix $A$ satisfies the {\it Restricted Isometry Property} (RIP).The sufficient conditions of recovery are comparable but more stringent than those of the state-of-the-art in sparse support recovery. Their significance lies mainly in their applicability to an instance of the STE.
♻ ☆ FAIIR: Building Toward A Conversational AI Agent Assistant for Youth Mental Health Service Provision
The world's healthcare systems and mental health agencies face both a growing demand for youth mental health services, alongside a simultaneous challenge of limited resources. Here, we focus on frontline crisis support, where Crisis Responders (CRs) engage in conversations for youth mental health support and assign an issue tag to each conversation. In this study, we develop FAIIR (Frontline Assistant: Issue Identification and Recommendation), an advanced tool leveraging an ensemble of domain-adapted and fine-tuned transformer models trained on a large conversational dataset comprising 780,000 conversations. The primary aim is to reduce the cognitive burden on CRs, enhance the accuracy of issue identification, and streamline post-conversation administrative tasks. We evaluate FAIIR on both retrospective and prospective conversations, emphasizing human-in-the-loop design with active CR engagement for model refinement, consensus-building, and overall assessment. Our results indicate that FAIIR achieves an average AUCROC of 94%, a sample average F1-score of 64%, and a sample average recall score of 81% on the retrospective test set. We also demonstrate the robustness and generalizability of the FAIIR tool during the silent testing phase, with less than a 2% drop in all performance metrics. Notably, CRs' responses exhibited an overall agreement of 90.9% with FAIIR's predictions. Furthermore, expert agreement with FAIIR surpassed their agreement with the original labels. To conclude, our findings indicate that assisting with the identification of issues of relevance helps reduce the burden on CRs, ensuring that appropriate resources can be provided and that active rescues and mandatory reporting can take place in critical situations requiring immediate de-escalation.
♻ ☆ Expert Q-learning: Deep Reinforcement Learning with Coarse State Values from Offline Expert Examples
In this article, we propose a novel algorithm for deep reinforcement learning named Expert Q-learning. Expert Q-learning is inspired by Dueling Q-learning and aims at incorporating semi-supervised learning into reinforcement learning through splitting Q-values into state values and action advantages. We require that an offline expert assesses the value of a state in a coarse manner using three discrete values. An expert network is designed in addition to the Q-network, which updates each time following the regular offline minibatch update whenever the expert example buffer is not empty. Using the board game Othello, we compare our algorithm with the baseline Q-learning algorithm, which is a combination of Double Q-learning and Dueling Q-learning. Our results show that Expert Q-learning is indeed useful and more resistant to the overestimation bias. The baseline Q-learning algorithm exhibits unstable and suboptimal behavior in non-deterministic settings, whereas Expert Q-learning demonstrates more robust performance with higher scores, illustrating that our algorithm is indeed suitable to integrate state values from expert examples into Q-learning.
comment: Camera-ready version
♻ ☆ Knowledge Crosswords: Geometric Knowledge Reasoning with Large Language Models
We propose Knowledge Crosswords, a geometric knowledge reasoning benchmark consisting of incomplete knowledge networks bounded by structured factual constraints, where LLMs are tasked with inferring the missing facts to meet all constraints. The novel setting of geometric knowledge reasoning necessitates new LM abilities beyond existing atomic/linear multi-hop QA, such as backtracking, verifying facts and constraints, reasoning with uncertainty, and more. Knowledge Crosswords contains 2,101 individual problems, covering diverse knowledge domains, and is further divided into three difficulty levels. We conduct extensive experiments to evaluate existing LLMs and approaches on Knowledge Crosswords. Results demonstrate that baseline approaches struggle with larger knowledge networks and semantically-equivalent entity distractors. In light of their limitations, we propose two new approaches, Staged Prompting and Verify-All, to augment LLMs' abilities for error-aware backtracking and constraint verification. Our Verify-All significantly outperforms prior methods and is more robust towards problems in the hard subset. Further analysis shows that geometric knowledge reasoning poses new challenges to LLMs' knowledge abilities, particularly in robustness towards varying option orders, complex structural constraints in knowledge networks, "none of the above" scenarios, and more.
♻ ☆ A Survey on Safe Multi-Modal Learning System
In the rapidly evolving landscape of artificial intelligence, multimodal learning systems (MMLS) have gained traction for their ability to process and integrate information from diverse modality inputs. Their expanding use in vital sectors such as healthcare has made safety assurance a critical concern. However, the absence of systematic research into their safety is a significant barrier to progress in this field. To bridge the gap, we present the first taxonomy that systematically categorizes and assesses MMLS safety. This taxonomy is structured around four fundamental pillars that are critical to ensuring the safety of MMLS: robustness, alignment, monitoring, and controllability. Leveraging this taxonomy, we review existing methodologies, benchmarks, and the current state of research, while also pinpointing the principal limitations and gaps in knowledge. Finally, we discuss unique challenges in MMLS safety. In illuminating these challenges, we aim to pave the way for future research, proposing potential directions that could lead to significant advancements in the safety protocols of MMLS.
♻ ☆ NExT-GPT: Any-to-Any Multimodal LLM ICML 2024
While recently Multimodal Large Language Models (MM-LLMs) have made exciting strides, they mostly fall prey to the limitation of only input-side multimodal understanding, without the ability to produce content in multiple modalities. As we humans always perceive the world and communicate with people through various modalities, developing any-to-any MM-LLMs capable of accepting and delivering content in any modality becomes essential to human-level AI. To fill the gap, we present an end-to-end general-purpose any-to-any MM-LLM system, NExT-GPT. We connect an LLM with multimodal adaptors and different diffusion decoders, enabling NExT-GPT to perceive inputs and generate outputs in arbitrary combinations of text, images, videos, and audio. By leveraging the existing well-trained highly-performing encoders and decoders, NExT-GPT is tuned with only a small amount of parameter (1%) of certain projection layers, which not only benefits low-cost training and also facilitates convenient expansion to more potential modalities. Moreover, we introduce a modality-switching instruction tuning (MosIT) and manually curate a high-quality dataset for MosIT, based on which NExT-GPT is empowered with complex cross-modal semantic understanding and content generation. Overall, our research showcases the promising possibility of building an AI agent capable of modeling universal modalities, paving the way for more human-like AI research in the community. Project page: https://next-gpt.github.io/
comment: ICML 2024 (Oral)
♻ ☆ Pretrained deep models outperform GBDTs in Learning-To-Rank under label scarcity ICML
On tabular data, a significant body of literature has shown that current deep learning (DL) models perform at best similarly to Gradient Boosted Decision Trees (GBDTs), while significantly underperforming them on outlier data. However, these works often study idealized problem settings which may fail to capture complexities of real-world scenarios. We identify a natural tabular data setting where DL models can outperform GBDTs: tabular Learning-to-Rank (LTR) under label scarcity. Tabular LTR applications, including search and recommendation, often have an abundance of unlabeled data, and scarce labeled data. We show that DL rankers can utilize unsupervised pretraining to exploit this unlabeled data. In extensive experiments over both public and proprietary datasets, we show that pretrained DL rankers consistently outperform GBDT rankers on ranking metrics -- sometimes by as much as 38% -- both overall and on outliers.
comment: ICML-MFPL 2023 Workshop Oral, SPIGM@ICML2024
♻ ☆ Modeling Emotions and Ethics with Large Language Models
This paper explores the integration of human-like emotions and ethical considerations into Large Language Models (LLMs). We first model eight fundamental human emotions, presented as opposing pairs, and employ collaborative LLMs to reinterpret and express these emotions across a spectrum of intensity. Our focus extends to embedding a latent ethical dimension within LLMs, guided by a novel self-supervised learning algorithm with human feedback (SSHF). This approach enables LLMs to perform self-evaluations and adjustments concerning ethical guidelines, enhancing their capability to generate content that is not only emotionally resonant but also ethically aligned. The methodologies and case studies presented herein illustrate the potential of LLMs to transcend mere text and image generation, venturing into the realms of empathetic interaction and principled decision-making, thereby setting a new precedent in the development of emotionally aware and ethically conscious AI systems.
comment: 8 pages, 4 figures, 3 tables
♻ ☆ Distillation Enhanced Time Series Forecasting Network with Momentum Contrastive Learning
Contrastive representation learning is crucial in time series analysis as it alleviates the issue of data noise and incompleteness as well as sparsity of supervision signal. However, existing constrastive learning frameworks usually focus on intral-temporal features, which fails to fully exploit the intricate nature of time series data. To address this issue, we propose DE-TSMCL, an innovative distillation enhanced framework for long sequence time series forecasting. Specifically, we design a learnable data augmentation mechanism which adaptively learns whether to mask a timestamp to obtain optimized sub-sequences. Then, we propose a contrastive learning task with momentum update to explore inter-sample and intra-temporal correlations of time series to learn the underlying structure feature on the unlabeled time series. Meanwhile, we design a supervised task to learn more robust representations and facilitate the contrastive learning process. Finally, we jointly optimize the above two tasks. By developing model loss from multiple tasks, we can learn effective representations for downstream forecasting task. Extensive experiments, in comparison with state-of-the-arts, well demonstrate the effectiveness of DE-TSMCL, where the maximum improvement can reach to 27.3%.
♻ ☆ Large Multimodal Model Compression via Efficient Pruning and Distillation at AntGroup
The deployment of Large Multimodal Models (LMMs) within AntGroup has significantly advanced multimodal tasks in payment, security, and advertising, notably enhancing advertisement audition tasks in Alipay. However, the deployment of such sizable models introduces challenges, particularly in increased latency and carbon emissions, which are antithetical to the ideals of Green AI. This paper introduces a novel multi-stage compression strategy for our proprietary LLM, AntGMM. Our methodology pivots on three main aspects: employing small training sample sizes, addressing multi-level redundancy through multi-stage pruning, and introducing an advanced distillation loss design. In our research, we constructed a dataset, the Multimodal Advertisement Audition Dataset (MAAD), from real-world scenarios within Alipay, and conducted experiments to validate the reliability of our proposed strategy. Furthermore, the effectiveness of our strategy is evident in its operational success in Alipay's real-world multimodal advertisement audition for three months from September 2023. Notably, our approach achieved a substantial reduction in latency, decreasing it from 700ms to 90ms, while maintaining online performance with only a slight performance decrease. Moreover, our compressed model is estimated to reduce electricity consumption by approximately 75 million kWh annually compared to the direct deployment of AntGMM, demonstrating our commitment to green AI initiatives. We will publicly release our code and the MAAD dataset after some reviews\footnote{https://github.com/MorinW/AntGMM$\_$Pruning}.
♻ ☆ LKM-UNet: Large Kernel Vision Mamba UNet for Medical Image Segmentation MICCAI 2024
In clinical practice, medical image segmentation provides useful information on the contours and dimensions of target organs or tissues, facilitating improved diagnosis, analysis, and treatment. In the past few years, convolutional neural networks (CNNs) and Transformers have dominated this area, but they still suffer from either limited receptive fields or costly long-range modeling. Mamba, a State Space Sequence Model (SSM), recently emerged as a promising paradigm for long-range dependency modeling with linear complexity. In this paper, we introduce a Large Kernel Vision Mamba U-shape Network, or LKM-UNet, for medical image segmentation. A distinguishing feature of our LKM-UNet is its utilization of large Mamba kernels, excelling in locally spatial modeling compared to small kernel-based CNNs and Transformers, while maintaining superior efficiency in global modeling compared to self-attention with quadratic complexity. Additionally, we design a novel hierarchical and bidirectional Mamba block to further enhance Mamba's global and neighborhood spatial modeling capability for vision inputs. Comprehensive experiments demonstrate the feasibility and the effectiveness of using large-size Mamba kernels to achieve large receptive fields. Codes are available at https://github.com/wjh892521292/LKM-UNet.
comment: Accepted by MICCAI 2024
♻ ☆ When Large Language Models Meet Optical Networks: Paving the Way for Automation
Since the advent of GPT, large language models (LLMs) have brought about revolutionary advancements in all walks of life. As a superior natural language processing (NLP) technology, LLMs have consistently achieved state-of-the-art performance on numerous areas. However, LLMs are considered to be general-purpose models for NLP tasks, which may encounter challenges when applied to complex tasks in specialized fields such as optical networks. In this study, we propose a framework of LLM-empowered optical networks, facilitating intelligent control of the physical layer and efficient interaction with the application layer through an LLM-driven agent (AI-Agent) deployed in the control layer. The AI-Agent can leverage external tools and extract domain knowledge from a comprehensive resource library specifically established for optical networks. This is achieved through user input and well-crafted prompts, enabling the generation of control instructions and result representations for autonomous operation and maintenance in optical networks. To improve LLM's capability in professional fields and stimulate its potential on complex tasks, the details of performing prompt engineering, establishing domain knowledge library, and implementing complex tasks are illustrated in this study. Moreover, the proposed framework is verified on two typical tasks: network alarm analysis and network performance optimization. The good response accuracies and sematic similarities of 2,400 test situations exhibit the great potential of LLM in optical networks.
♻ ☆ Graph-Augmented LLMs for Personalized Health Insights: A Case Study in Sleep Analysis
Health monitoring systems have revolutionized modern healthcare by enabling the continuous capture of physiological and behavioral data, essential for preventive measures and early health intervention. While integrating this data with Large Language Models (LLMs) has shown promise in delivering interactive health advice, traditional methods like Retrieval-Augmented Generation (RAG) and fine-tuning often fail to fully utilize the complex, multi-dimensional, and temporally relevant data from wearable devices. These conventional approaches typically provide limited actionable and personalized health insights due to their inadequate capacity to dynamically integrate and interpret diverse health data streams. In response, this paper introduces a graph-augmented LLM framework designed to significantly enhance the personalization and clarity of health insights. Utilizing a hierarchical graph structure, the framework captures inter and intra-patient relationships, enriching LLM prompts with dynamic feature importance scores derived from a Random Forest Model. The effectiveness of this approach is demonstrated through a sleep analysis case study involving 20 college students during the COVID-19 lockdown, highlighting the potential of our model to generate actionable and personalized health insights efficiently. We leverage another LLM to evaluate the insights for relevance, comprehensiveness, actionability, and personalization, addressing the critical need for models that process and interpret complex health data effectively. Our findings show that augmenting prompts with our framework yields significant improvements in all 4 criteria. Through our framework, we can elicit well-crafted, more thoughtful responses tailored to a specific patient.
♻ ☆ Rapid and Accurate Diagnosis of Acute Aortic Syndrome using Non-contrast CT: A Large-scale, Retrospective, Multi-center and AI-based Study
Chest pain symptoms are highly prevalent in emergency departments (EDs), where acute aortic syndrome (AAS) is a catastrophic cardiovascular emergency with a high fatality rate, especially when timely and accurate treatment is not administered. However, current triage practices in the ED can cause up to approximately half of patients with AAS to have an initially missed diagnosis or be misdiagnosed as having other acute chest pain conditions. Subsequently, these AAS patients will undergo clinically inaccurate or suboptimal differential diagnosis. Fortunately, even under these suboptimal protocols, nearly all these patients underwent non-contrast CT covering the aorta anatomy at the early stage of differential diagnosis. In this study, we developed an artificial intelligence model (DeepAAS) using non-contrast CT, which is highly accurate for identifying AAS and provides interpretable results to assist in clinical decision-making. Performance was assessed in two major phases: a multi-center retrospective study (n = 20,750) and an exploration in real-world emergency scenarios (n = 137,525). In the multi-center cohort, DeepAAS achieved a mean area under the receiver operating characteristic curve of 0.958 (95% CI 0.950-0.967). In the real-world cohort, DeepAAS detected 109 AAS patients with misguided initial suspicion, achieving 92.6% (95% CI 76.2%-97.5%) in mean sensitivity and 99.2% (95% CI 99.1%-99.3%) in mean specificity. Our AI model performed well on non-contrast CT at all applicable early stages of differential diagnosis workflows, effectively reduced the overall missed diagnosis and misdiagnosis rate from 48.8% to 4.8% and shortened the diagnosis time for patients with misguided initial suspicion from an average of 681.8 (74-11,820) mins to 68.5 (23-195) mins. DeepAAS could effectively fill the gap in the current clinical workflow without requiring additional tests.
comment: under peer review
♻ ☆ Who Plays First? Optimizing the Order of Play in Stackelberg Games with Many Robots
We consider the multi-agent spatial navigation problem of computing the socially optimal order of play, i.e., the sequence in which the agents commit to their decisions, and its associated equilibrium in an N-player Stackelberg trajectory game. We model this problem as a mixed-integer optimization problem over the space of all possible Stackelberg games associated with the order of play's permutations. To solve the problem, we introduce Branch and Play (B&P), an efficient and exact algorithm that provably converges to a socially optimal order of play and its Stackelberg equilibrium. As a subroutine for B&P, we employ and extend sequential trajectory planning, i.e., a popular multi-agent control approach, to scalably compute valid local Stackelberg equilibria for any given order of play. We demonstrate the practical utility of B&P to coordinate air traffic control, swarm formation, and delivery vehicle fleets. We find that B&P consistently outperforms various baselines, and computes the socially optimal equilibrium.
comment: Robotics: Science and Systems (RSS) 2024
♻ ☆ OTCE: Hybrid SSM and Attention with Cross Domain Mixture of Experts to construct Observer-Thinker-Conceiver-Expresser
Recent research has shown that combining Mamba with Transformer architecture, which has selective state space and quadratic self-attention mechanism, outperforms using Mamba or Transformer architecture alone in language modeling tasks. The quadratic self-attention mechanism effectively alleviates the shortcomings of selective state space in handling long-term dependencies of any element in the sequence. We propose a position information injection method that connects the selective state space model with the quadratic attention, and integrates these two architectures with hybrid experts with cross-sharing domains, so that we can enjoy the advantages of both. We design a new architecture with a more biomimetic idea: Observer-Thinker-Conceiver-Expresser (OTCE), which can compete with well-known medium-scale open-source language models on a small scale in language modeling tasks.
♻ ☆ I Don't Know You, But I Can Catch You: Real-Time Defense against Diverse Adversarial Patches for Object Detectors
Deep neural networks (DNNs) have revolutionized the field of computer vision like object detection with their unparalleled performance. However, existing research has shown that DNNs are vulnerable to adversarial attacks. In the physical world, an adversary could exploit adversarial patches to implement a Hiding Attack (HA) which patches the target object to make it disappear from the detector, and an Appearing Attack (AA) which fools the detector into misclassifying the patch as a specific object. Recently, many defense methods for detectors have been proposed to mitigate the potential threats of adversarial patches. However, such methods still have limitations in generalization, robustness and efficiency. Most defenses are only effective against the HA, leaving the detector vulnerable to the AA. In this paper, we propose \textit{NutNet}, an innovative model for detecting adversarial patches, with high generalization, robustness and efficiency. With experiments for six detectors including YOLOv2-v4, SSD, Faster RCNN and DETR on both digital and physical domains, the results show that our proposed method can effectively defend against both the HA and AA, with only 0.4\% sacrifice of the clean performance. We compare NutNet with four baseline defense methods for detectors, and our method exhibits an average defense performance that is over 2.4 times and 4.7 times higher than existing approaches for HA and AA, respectively. In addition, NutNet only increases the inference time by 8\%, which can meet the real-time requirements of the detection systems. Demos of NutNet are available at: \url{https://sites.google.com/view/nutnet}.
♻ ☆ Neural Optimization with Adaptive Heuristics for Intelligent Marketing System KDD 2024
Computational marketing has become increasingly important in today's digital world, facing challenges such as massive heterogeneous data, multi-channel customer journeys, and limited marketing budgets. In this paper, we propose a general framework for marketing AI systems, the Neural Optimization with Adaptive Heuristics (NOAH) framework. NOAH is the first general framework for marketing optimization that considers both to-business (2B) and to-consumer (2C) products, as well as both owned and paid channels. We describe key modules of the NOAH framework, including prediction, optimization, and adaptive heuristics, providing examples for bidding and content optimization. We then detail the successful application of NOAH to LinkedIn's email marketing system, showcasing significant wins over the legacy ranking system. Additionally, we share details and insights that are broadly useful, particularly on: (i) addressing delayed feedback with lifetime value, (ii) performing large-scale linear programming with randomization, (iii) improving retrieval with audience expansion, (iv) reducing signal dilution in targeting tests, and (v) handling zero-inflated heavy-tail metrics in statistical testing.
comment: KDD 2024
♻ ☆ LEDITS++: Limitless Image Editing using Text-to-Image Models CVPR
Text-to-image diffusion models have recently received increasing interest for their astonishing ability to produce high-fidelity images from solely text inputs. Subsequent research efforts aim to exploit and apply their capabilities to real image editing. However, existing image-to-image methods are often inefficient, imprecise, and of limited versatility. They either require time-consuming finetuning, deviate unnecessarily strongly from the input image, and/or lack support for multiple, simultaneous edits. To address these issues, we introduce LEDITS++, an efficient yet versatile and precise textual image manipulation technique. LEDITS++'s novel inversion approach requires no tuning nor optimization and produces high-fidelity results with a few diffusion steps. Second, our methodology supports multiple simultaneous edits and is architecture-agnostic. Third, we use a novel implicit masking technique that limits changes to relevant image regions. We propose the novel TEdBench++ benchmark as part of our exhaustive evaluation. Our results demonstrate the capabilities of LEDITS++ and its improvements over previous methods.
comment: Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) The project page is available at https://leditsplusplus-project.static.hf.space
♻ ☆ ODIN: A Single Model for 2D and 3D Segmentation CVPR 2024
State-of-the-art models on contemporary 3D segmentation benchmarks like ScanNet consume and label dataset-provided 3D point clouds, obtained through post processing of sensed multiview RGB-D images. They are typically trained in-domain, forego large-scale 2D pre-training and outperform alternatives that featurize the posed RGB-D multiview images instead. The gap in performance between methods that consume posed images versus post-processed 3D point clouds has fueled the belief that 2D and 3D perception require distinct model architectures. In this paper, we challenge this view and propose ODIN (Omni-Dimensional INstance segmentation), a model that can segment and label both 2D RGB images and 3D point clouds, using a transformer architecture that alternates between 2D within-view and 3D cross-view information fusion. Our model differentiates 2D and 3D feature operations through the positional encodings of the tokens involved, which capture pixel coordinates for 2D patch tokens and 3D coordinates for 3D feature tokens. ODIN achieves state-of-the-art performance on ScanNet200, Matterport3D and AI2THOR 3D instance segmentation benchmarks, and competitive performance on ScanNet, S3DIS and COCO. It outperforms all previous works by a wide margin when the sensed 3D point cloud is used in place of the point cloud sampled from 3D mesh. When used as the 3D perception engine in an instructable embodied agent architecture, it sets a new state-of-the-art on the TEACh action-from-dialogue benchmark. Our code and checkpoints can be found at the project website (https://odin-seg.github.io).
comment: Camera Ready (CVPR 2024, Highlight)
♻ ☆ Optimizing Speculative Decoding for Serving Large Language Models Using Goodput
Reducing the inference latency of large language models (LLMs) is crucial, and speculative decoding (SD) stands out as one of the most effective techniques. Rather than letting the LLM generate all tokens directly, speculative decoding employs effective proxies to predict potential outputs, which are then verified by the LLM without compromising the generation quality. Yet, deploying SD in real online LLM serving systems (with continuous batching) does not always yield improvement -- under higher request rates or low speculation accuracy, it paradoxically increases latency. Furthermore, there is no best speculation length work for all workloads under different system loads. Based on the observations, we develop a dynamic framework SmartSpec. SmartSpec dynamically determines the best speculation length for each request (from 0, i.e., no speculation, to many tokens) -- hence the associated speculative execution costs -- based on a new metric called goodput, which characterizes the current observed load of the entire system and the speculation accuracy. We show that SmartSpec consistently reduces average request latency by up to 3.2x compared to non-speculative decoding baselines across different sizes of target models, draft models, request rates, and datasets. Moreover, SmartSpec can be applied to different styles of speculative decoding, including traditional, model-based approaches as well as model-free methods like prompt lookup and tree-style decoding.
♻ ☆ Confabulation: The Surprising Value of Large Language Model Hallucinations ACL2024
This paper presents a systematic defense of large language model (LLM) hallucinations or 'confabulations' as a potential resource instead of a categorically negative pitfall. The standard view is that confabulations are inherently problematic and AI research should eliminate this flaw. In this paper, we argue and empirically demonstrate that measurable semantic characteristics of LLM confabulations mirror a human propensity to utilize increased narrativity as a cognitive resource for sense-making and communication. In other words, it has potential value. Specifically, we analyze popular hallucination benchmarks and reveal that hallucinated outputs display increased levels of narrativity and semantic coherence relative to veridical outputs. This finding reveals a tension in our usually dismissive understandings of confabulation. It suggests, counter-intuitively, that the tendency for LLMs to confabulate may be intimately associated with a positive capacity for coherent narrative-text generation.
comment: Forthcoming at ACL2024 main conference. 1 figure
Information Retrieval 12
☆ Light-weight End-to-End Graph Interest Network for CTR Prediction in E-commerce Search
Click-through-rate (CTR) prediction has an essential impact on improving user experience and revenue in e-commerce search. With the development of deep learning, graph-based methods are well exploited to utilize graph structure extracted from user behaviors and other information to help embedding learning. However, most of the previous graph-based methods mainly focus on recommendation scenarios, and therefore their graph structures highly depend on item's sequential information from user behaviors, ignoring query's sequential signal and query-item correlation. In this paper, we propose a new approach named Light-weight End-to-End Graph Interest Network (EGIN) to effectively mine users' search interests and tackle previous challenges. (i) EGIN utilizes query and item's correlation and sequential information from the search system to build a heterogeneous graph for better CTR prediction in e-commerce search. (ii) EGIN's graph embedding learning shares the same training input and is jointly trained with CTR prediction, making the end-to-end framework effortless to deploy in large-scale search systems. The proposed EGIN is composed of three parts: query-item heterogeneous graph, light-weight graph sampling, and multi-interest network. The query-item heterogeneous graph captures correlation and sequential information of query and item efficiently by the proposed light-weight graph sampling. The multi-interest network is well designed to utilize graph embedding to capture various similarity relationships between query and item to enhance the final CTR prediction. We conduct extensive experiments on both public and industrial datasets to demonstrate the effectiveness of the proposed EGIN. At the same time, the training cost of graph learning is relatively low compared with the main CTR prediction task, ensuring efficiency in practical applications.
comment: 8 pages, 4 figures
☆ LumberChunker: Long-Form Narrative Document Segmentation
Modern NLP tasks increasingly rely on dense retrieval methods to access up-to-date and relevant contextual information. We are motivated by the premise that retrieval benefits from segments that can vary in size such that a content's semantic independence is better captured. We propose LumberChunker, a method leveraging an LLM to dynamically segment documents, which iteratively prompts the LLM to identify the point within a group of sequential passages where the content begins to shift. To evaluate our method, we introduce GutenQA, a benchmark with 3000 "needle in a haystack" type of question-answer pairs derived from 100 public domain narrative books available on Project Gutenberg. Our experiments show that LumberChunker not only outperforms the most competitive baseline by 7.37% in retrieval performance (DCG@20) but also that, when integrated into a RAG pipeline, LumberChunker proves to be more effective than other chunking methods and competitive baselines, such as the Gemini 1.5M Pro. Our Code and Data are available at https://github.com/joaodsmarques/LumberChunker
☆ ACE: A Generative Cross-Modal Retrieval Framework with Coarse-To-Fine Semantic Modeling
Generative retrieval, which has demonstrated effectiveness in text-to-text retrieval, utilizes a sequence-to-sequence model to directly generate candidate identifiers based on natural language queries. Without explicitly computing the similarity between queries and candidates, generative retrieval surpasses dual-tower models in both speed and accuracy on large-scale corpora, providing new insights for cross-modal retrieval. However, constructing identifiers for multimodal data remains an untapped problem, and the modality gap between natural language queries and multimodal candidates hinders retrieval performance due to the absence of additional encoders. To this end, we propose a pioneering generAtive Cross-modal rEtrieval framework (ACE), which is a comprehensive framework for end-to-end cross-modal retrieval based on coarse-to-fine semantic modeling. We propose combining K-Means and RQ-VAE to construct coarse and fine tokens, serving as identifiers for multimodal data. Correspondingly, we design the coarse-to-fine feature fusion strategy to efficiently align natural language queries and candidate identifiers. ACE is the first work to comprehensively demonstrate the feasibility of generative approach on text-to-image/audio/video retrieval, challenging the dominance of the embedding-based dual-tower architecture. Extensive experiments show that ACE achieves state-of-the-art performance in cross-modal retrieval and outperforms the strong baselines on Recall@1 by 15.27% on average.
☆ Performative Debias with Fair-exposure Optimization Driven by Strategic Agents in Recommender Systems KDD 2024
Data bias, e.g., popularity impairs the dynamics of two-sided markets within recommender systems. This overshadows the less visible but potentially intriguing long-tail items that could capture user interest. Despite the abundance of research surrounding this issue, it still poses challenges and remains a hot topic in academic circles. Along this line, in this paper, we developed a re-ranking approach in dynamic settings with fair-exposure optimization driven by strategic agents. Designed for the producer side, the execution of agents assumes content creators can modify item features based on strategic incentives to maximize their exposure. This iterative process entails an end-to-end optimization, employing differentiable ranking operators that simultaneously target accuracy and fairness. Joint objectives ensure the performance of recommendations while enhancing the visibility of tail items. We also leveraged the performativity nature of predictions to illustrate how strategic learning influences content creators to shift towards fairness efficiently, thereby incentivizing features of tail items. Through comprehensive experiments on both public and industrial datasets, we have substantiated the effectiveness and dominance of the proposed method especially on unveiling the potential of tail items.
comment: SIGKDD 2024 accepted paper
☆ A Text is Worth Several Tokens: Text Embedding from LLMs Secretly Aligns Well with The Key Tokens
Text embeddings from large language models (LLMs) have achieved excellent results in tasks such as information retrieval, semantic textual similarity, etc. In this work, we show an interesting finding: when feeding a text into the embedding LLMs, the obtained text embedding will be able to be aligned with the key tokens in the input text. We first fully analyze this phenomenon on eight embedding LLMs and show that this phenomenon is universal and is not affected by model architecture, training strategy, and embedding method. With a deeper analysis, we then find that the main change in embedding space between the embedding LLMs and their original generative LLMs is in the first principal component. By adjusting the first principal component, we can align text embedding with the key tokens. Finally, we give several examples to demonstrate the vast application potential of this finding: (1) we propose a simple and practical sparse retrieval method based on the aligned tokens, which can achieve 80\% of the dense retrieval effect of the same model while reducing the computation significantly; (2) we show that our findings provide a fresh perspective to help understand fuzzy concepts (e.g., semantic relatedness vs. semantic similarity) and emerging technologies (e.g., instruction-following embedding) in this field.
comment: Work in Progress
☆ A Thorough Performance Benchmarking on Lightweight Embedding-based Recommender Systems
Since the creation of the Web, recommender systems (RSs) have been an indispensable mechanism in information filtering. State-of-the-art RSs primarily depend on categorical features, which ecoded by embedding vectors, resulting in excessively large embedding tables. To prevent over-parameterized embedding tables from harming scalability, both academia and industry have seen increasing efforts in compressing RS embeddings. However, despite the prosperity of lightweight embedding-based RSs (LERSs), a wide diversity is seen in evaluation protocols, resulting in obstacles when relating LERS performance to real-world usability. Moreover, despite the common goal of lightweight embeddings, LERSs are evaluated with a single choice between the two main recommendation tasks -- collaborative filtering and content-based recommendation. This lack of discussions on cross-task transferability hinders the development of unified, more scalable solutions. Motivated by these issues, this study investigates various LERSs' performance, efficiency, and cross-task transferability via a thorough benchmarking process. Additionally, we propose an efficient embedding compression method using magnitude pruning, which is an easy-to-deploy yet highly competitive baseline that outperforms various complex LERSs. Our study reveals the distinct performance of LERSs across the two tasks, shedding light on their effectiveness and generalizability. To support edge-based recommendations, we tested all LERSs on a Raspberry Pi 4, where the efficiency bottleneck is exposed. Finally, we conclude this paper with critical summaries of LERS performance, model selection suggestions, and underexplored challenges around LERSs for future research. To encourage future research, we publish source codes and artifacts at \href{this link}{https://github.com/chenxing1999/recsys-benchmark}.
☆ Hyperbolic Knowledge Transfer in Cross-Domain Recommendation System
Cross-Domain Recommendation (CDR) seeks to utilize knowledge from different domains to alleviate the problem of data sparsity in the target recommendation domain, and it has been gaining more attention in recent years. Although there have been notable advancements in this area, most current methods represent users and items in Euclidean space, which is not ideal for handling long-tail distributed data in recommendation systems. Additionally, adding data from other domains can worsen the long-tail characteristics of the entire dataset, making it harder to train CDR models effectively. Recent studies have shown that hyperbolic methods are particularly suitable for modeling long-tail distributions, which has led us to explore hyperbolic representations for users and items in CDR scenarios. However, due to the distinct characteristics of the different domains, applying hyperbolic representation learning to CDR tasks is quite challenging. In this paper, we introduce a new framework called Hyperbolic Contrastive Learning (HCTS), designed to capture the unique features of each domain while enabling efficient knowledge transfer between domains. We achieve this by embedding users and items from each domain separately and mapping them onto distinct hyperbolic manifolds with adjustable curvatures for prediction. To improve the representations of users and items in the target domain, we develop a hyperbolic contrastive learning module for knowledge transfer. Extensive experiments on real-world datasets demonstrate that hyperbolic manifolds are a promising alternative to Euclidean space for CDR tasks.
☆ Efficient Document Ranking with Learnable Late Interactions
Cross-Encoder (CE) and Dual-Encoder (DE) models are two fundamental approaches for query-document relevance in information retrieval. To predict relevance, CE models use joint query-document embeddings, while DE models maintain factorized query and document embeddings; usually, the former has higher quality while the latter benefits from lower latency. Recently, late-interaction models have been proposed to realize more favorable latency-quality tradeoffs, by using a DE structure followed by a lightweight scorer based on query and document token embeddings. However, these lightweight scorers are often hand-crafted, and there is no understanding of their approximation power; further, such scorers require access to individual document token embeddings, which imposes an increased latency and storage burden. In this paper, we propose novel learnable late-interaction models (LITE) that resolve these issues. Theoretically, we prove that LITE is a universal approximator of continuous scoring functions, even for relatively small embedding dimension. Empirically, LITE outperforms previous late-interaction models such as ColBERT on both in-domain and zero-shot re-ranking tasks. For instance, experiments on MS MARCO passage re-ranking show that LITE not only yields a model with better generalization, but also lowers latency and requires 0.25x storage compared to ColBERT.
☆ NormTab: Improving Symbolic Reasoning in LLMs Through Tabular Data Normalization
In recent years, Large Language Models (LLMs) have demonstrated remarkable capabilities in parsing textual data and generating code. However, their performance in tasks involving tabular data, especially those requiring symbolic reasoning, faces challenges due to the structural variance and inconsistency in table cell values often found in web tables. In this paper, we introduce NormTab, a novel framework aimed at enhancing the symbolic reasoning performance of LLMs by normalizing web tables. We study table normalization as a stand-alone, one-time preprocessing step using LLMs to support symbolic reasoning on tabular data. Our experimental evaluation, conducted on challenging web table datasets such as WikiTableQuestion and TabFact, demonstrates that leveraging NormTab significantly improves symbolic reasoning performance, showcasing the importance and effectiveness of web table normalization for enhancing LLM-based symbolic reasoning tasks.
comment: Work in Progress
♻ ☆ Gradient Coding with Iterative Block Leverage Score Sampling
We generalize the leverage score sampling sketch for $\ell_2$-subspace embeddings, to accommodate sampling subsets of the transformed data, so that the sketching approach is appropriate for distributed settings. This is then used to derive an approximate coded computing approach for first-order methods; known as gradient coding, to accelerate linear regression in the presence of failures in distributed computational networks, \textit{i.e.} stragglers. We replicate the data across the distributed network, to attain the approximation guarantees through the induced sampling distribution. The significance and main contribution of this work, is that it unifies randomized numerical linear algebra with approximate coded computing, while attaining an induced $\ell_2$-subspace embedding through uniform sampling. The transition to uniform sampling is done without applying a random projection, as in the case of the subsampled randomized Hadamard transform. Furthermore, by incorporating this technique to coded computing, our scheme is an iterative sketching approach to approximately solving linear regression. We also propose weighting when sketching takes place through sampling with replacement, for further compression.
comment: 26 pages, 6 figures, 1 table
♻ ☆ Neural Optimization with Adaptive Heuristics for Intelligent Marketing System KDD 2024
Computational marketing has become increasingly important in today's digital world, facing challenges such as massive heterogeneous data, multi-channel customer journeys, and limited marketing budgets. In this paper, we propose a general framework for marketing AI systems, the Neural Optimization with Adaptive Heuristics (NOAH) framework. NOAH is the first general framework for marketing optimization that considers both to-business (2B) and to-consumer (2C) products, as well as both owned and paid channels. We describe key modules of the NOAH framework, including prediction, optimization, and adaptive heuristics, providing examples for bidding and content optimization. We then detail the successful application of NOAH to LinkedIn's email marketing system, showcasing significant wins over the legacy ranking system. Additionally, we share details and insights that are broadly useful, particularly on: (i) addressing delayed feedback with lifetime value, (ii) performing large-scale linear programming with randomization, (iii) improving retrieval with audience expansion, (iv) reducing signal dilution in targeting tests, and (v) handling zero-inflated heavy-tail metrics in statistical testing.
comment: KDD 2024
♻ ☆ Do Large Language Models Rank Fairly? An Empirical Study on the Fairness of LLMs as Rankers NAACL 2024
The integration of Large Language Models (LLMs) in information retrieval has raised a critical reevaluation of fairness in the text-ranking models. LLMs, such as GPT models and Llama2, have shown effectiveness in natural language understanding tasks, and prior works (e.g., RankGPT) have also demonstrated that the LLMs exhibit better performance than the traditional ranking models in the ranking task. However, their fairness remains largely unexplored. This paper presents an empirical study evaluating these LLMs using the TREC Fair Ranking dataset, focusing on the representation of binary protected attributes such as gender and geographic location, which are historically underrepresented in search outcomes. Our analysis delves into how these LLMs handle queries and documents related to these attributes, aiming to uncover biases in their ranking algorithms. We assess fairness from both user and content perspectives, contributing an empirical benchmark for evaluating LLMs as the fair ranker.
comment: Accepted at NAACL 2024 Main Conference
Multimedia 5
☆ MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization
Pre-trained models have been a foundational approach in speech recognition, albeit with associated additional costs. In this study, we propose a regularization technique that facilitates the training of visual and audio-visual speech recognition models (VSR and AVSR) from scratch. This approach, abbreviated as \textbf{MSRS} (Multimodal Speech Recognition from Scratch), introduces a sparse regularization that rapidly learns sparse structures within the dense model at the very beginning of training, which receives healthier gradient flow than the dense equivalent. Once the sparse mask stabilizes, our method allows transitioning to a dense model or keeping a sparse model by updating non-zero values. MSRS achieves competitive results in VSR and AVSR with 21.1% and 0.9% WER on the LRS3 benchmark, while reducing training time by at least 2x. We explore other sparse approaches and show that only MSRS enables training from scratch by implicitly masking the weights affected by vanishing gradients.
comment: Accepted at Interspeech 2024
☆ SonicSense: Object Perception from In-Hand Acoustic Vibration
We introduce SonicSense, a holistic design of hardware and software to enable rich robot object perception through in-hand acoustic vibration sensing. While previous studies have shown promising results with acoustic sensing for object perception, current solutions are constrained to a handful of objects with simple geometries and homogeneous materials, single-finger sensing, and mixing training and testing on the same objects. SonicSense enables container inventory status differentiation, heterogeneous material prediction, 3D shape reconstruction, and object re-identification from a diverse set of 83 real-world objects. Our system employs a simple but effective heuristic exploration policy to interact with the objects as well as end-to-end learning-based algorithms to fuse vibration signals to infer object properties. Our framework underscores the significance of in-hand acoustic vibration sensing in advancing robot tactile perception.
comment: Our project website is at: http://generalroboticslab.com/SonicSense
♻ ☆ Soundify: Matching Sound Effects to Video
In the art of video editing, sound helps add character to an object and immerse the viewer within a space. Through formative interviews with professional editors (N=10), we found that the task of adding sounds to video can be challenging. This paper presents Soundify, a system that assists editors in matching sounds to video. Given a video, Soundify identifies matching sounds, synchronizes the sounds to the video, and dynamically adjusts panning and volume to create spatial audio. In a human evaluation study (N=889), we show that Soundify is capable of matching sounds to video out-of-the-box for a diverse range of audio categories. In a within-subjects expert study (N=12), we demonstrate the usefulness of Soundify in helping video editors match sounds to video with lighter workload, reduced task completion time, and improved usability.
comment: https://soundify.cc
♻ ☆ VideoMap: Supporting Video Editing Exploration, Brainstorming, and Prototyping in the Latent Space
Video editing is a creative and complex endeavor and we believe that there is potential for reimagining a new video editing interface to better support the creative and exploratory nature of video editing. We take inspiration from latent space exploration tools that help users find patterns and connections within complex datasets. We present VideoMap, a proof-of-concept video editing interface that operates on video frames projected onto a latent space. We support intuitive navigation through map-inspired navigational elements and facilitate transitioning between different latent spaces through swappable lenses. We built three VideoMap components to support editors in three common video tasks. In a user study with both professionals and non-professionals, editors found that VideoMap helps reduce grunt work, offers a user-friendly experience, provides an inspirational way of editing, and effectively supports the exploratory nature of video editing. We further demonstrate the versatility of VideoMap by implementing three extended applications. For interactive examples, we invite you to visit our project page: https://humanvideointeraction.github.io/videomap.
comment: https://humanvideointeraction.github.io/videomap
♻ ☆ Videogenic: Identifying Highlight Moments in Videos with Professional Photographs as a Prior
This paper investigates the challenge of extracting highlight moments from videos. To perform this task, we need to understand what constitutes a highlight for arbitrary video domains while at the same time being able to scale across different domains. Our key insight is that photographs taken by photographers tend to capture the most remarkable or photogenic moments of an activity. Drawing on this insight, we present Videogenic, a technique capable of creating domain-specific highlight videos for a diverse range of domains. In a human evaluation study (N=50), we show that a high-quality photograph collection combined with CLIP-based retrieval (which uses a neural network with semantic knowledge of images) can serve as an excellent prior for finding video highlights. In a within-subjects expert study (N=12), we demonstrate the usefulness of Videogenic in helping video editors create highlight videos with lighter workload, shorter task completion time, and better usability.
comment: https://humanvideointeraction.github.io/videogenic
Computer Vision and Pattern Recognition 142
☆ StableNormal: Reducing Diffusion Variance for Stable and Sharp Normal
This work addresses the challenge of high-quality surface normal estimation from monocular colored inputs (i.e., images and videos), a field which has recently been revolutionized by repurposing diffusion priors. However, previous attempts still struggle with stochastic inference, conflicting with the deterministic nature of the Image2Normal task, and costly ensembling step, which slows down the estimation process. Our method, StableNormal, mitigates the stochasticity of the diffusion process by reducing inference variance, thus producing "Stable-and-Sharp" normal estimates without any additional ensembling process. StableNormal works robustly under challenging imaging conditions, such as extreme lighting, blurring, and low quality. It is also robust against transparent and reflective surfaces, as well as cluttered scenes with numerous objects. Specifically, StableNormal employs a coarse-to-fine strategy, which starts with a one-step normal estimator (YOSO) to derive an initial normal guess, that is relatively coarse but reliable, then followed by a semantic-guided refinement process (SG-DRN) that refines the normals to recover geometric details. The effectiveness of StableNormal is demonstrated through competitive performance in standard datasets such as DIODE-indoor, iBims, ScannetV2 and NYUv2, and also in various downstream tasks, such as surface reconstruction and normal enhancement. These results evidence that StableNormal retains both the "stability" and "sharpness" for accurate normal estimation. StableNormal represents a baby attempt to repurpose diffusion priors for deterministic estimation. To democratize this, code and models have been publicly available in hf.co/Stable-X
comment: HF Demo: hf.co/Stable-X, Video: https://www.youtube.com/watch?v=sylXTxG_U2U
☆ Revisiting Referring Expression Comprehension Evaluation in the Era of Large Multimodal Models
Referring expression comprehension (REC) involves localizing a target instance based on a textual description. Recent advancements in REC have been driven by large multimodal models (LMMs) like CogVLM, which achieved 92.44% accuracy on RefCOCO. However, this study questions whether existing benchmarks such as RefCOCO, RefCOCO+, and RefCOCOg, capture LMMs' comprehensive capabilities. We begin with a manual examination of these benchmarks, revealing high labeling error rates: 14% in RefCOCO, 24% in RefCOCO+, and 5% in RefCOCOg, which undermines the authenticity of evaluations. We address this by excluding problematic instances and reevaluating several LMMs capable of handling the REC task, showing significant accuracy improvements, thus highlighting the impact of benchmark noise. In response, we introduce Ref-L4, a comprehensive REC benchmark, specifically designed to evaluate modern REC models. Ref-L4 is distinguished by four key features: 1) a substantial sample size with 45,341 annotations; 2) a diverse range of object categories with 365 distinct types and varying instance scales from 30 to 3,767; 3) lengthy referring expressions averaging 24.2 words; and 4) an extensive vocabulary comprising 22,813 unique words. We evaluate a total of 24 large models on Ref-L4 and provide valuable insights. The cleaned versions of RefCOCO, RefCOCO+, and RefCOCOg, as well as our Ref-L4 benchmark and evaluation code, are available at https://github.com/JierunChen/Ref-L4.
☆ FreeTraj: Tuning-Free Trajectory Control in Video Diffusion Models
Diffusion model has demonstrated remarkable capability in video generation, which further sparks interest in introducing trajectory control into the generation process. While existing works mainly focus on training-based methods (e.g., conditional adapter), we argue that diffusion model itself allows decent control over the generated content without requiring any training. In this study, we introduce a tuning-free framework to achieve trajectory-controllable video generation, by imposing guidance on both noise construction and attention computation. Specifically, 1) we first show several instructive phenomenons and analyze how initial noises influence the motion trajectory of generated content. 2) Subsequently, we propose FreeTraj, a tuning-free approach that enables trajectory control by modifying noise sampling and attention mechanisms. 3) Furthermore, we extend FreeTraj to facilitate longer and larger video generation with controllable trajectories. Equipped with these designs, users have the flexibility to provide trajectories manually or opt for trajectories automatically generated by the LLM trajectory planner. Extensive experiments validate the efficacy of our approach in enhancing the trajectory controllability of video diffusion models.
comment: Project Page: http://haonanqiu.com/projects/FreeTraj.html, Code Repo: https://github.com/arthur-qiu/FreeTraj
☆ Dreamitate: Real-World Visuomotor Policy Learning via Video Generation
A key challenge in manipulation is learning a policy that can robustly generalize to diverse visual environments. A promising mechanism for learning robust policies is to leverage video generative models, which are pretrained on large-scale datasets of internet videos. In this paper, we propose a visuomotor policy learning framework that fine-tunes a video diffusion model on human demonstrations of a given task. At test time, we generate an example of an execution of the task conditioned on images of a novel scene, and use this synthesized execution directly to control the robot. Our key insight is that using common tools allows us to effortlessly bridge the embodiment gap between the human hand and the robot manipulator. We evaluate our approach on four tasks of increasing complexity and demonstrate that harnessing internet-scale generative models allows the learned policy to achieve a significantly higher degree of generalization than existing behavior cloning approaches.
comment: Project page: https://dreamitate.cs.columbia.edu/
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
We introduce Cambrian-1, a family of multimodal LLMs (MLLMs) designed with a vision-centric approach. While stronger language models can enhance multimodal capabilities, the design choices for vision components are often insufficiently explored and disconnected from visual representation learning research. This gap hinders accurate sensory grounding in real-world scenarios. Our study uses LLMs and visual instruction tuning as an interface to evaluate various visual representations, offering new insights into different models and architectures -- self-supervised, strongly supervised, or combinations thereof -- based on experiments with over 20 vision encoders. We critically examine existing MLLM benchmarks, addressing the difficulties involved in consolidating and interpreting results from various tasks, and introduce a new vision-centric benchmark, CV-Bench. To further improve visual grounding, we propose the Spatial Vision Aggregator (SVA), a dynamic and spatially-aware connector that integrates high-resolution vision features with LLMs while reducing the number of tokens. Additionally, we discuss the curation of high-quality visual instruction-tuning data from publicly available sources, emphasizing the importance of data source balancing and distribution ratio. Collectively, Cambrian-1 not only achieves state-of-the-art performance but also serves as a comprehensive, open cookbook for instruction-tuned MLLMs. We provide model weights, code, supporting tools, datasets, and detailed instruction-tuning and evaluation recipes. We hope our release will inspire and accelerate advancements in multimodal systems and visual representation learning.
comment: Website at https://cambrian-mllm.github.io
☆ DreamBench++: A Human-Aligned Benchmark for Personalized Image Generation
Personalized image generation holds great promise in assisting humans in everyday work and life due to its impressive function in creatively generating personalized content. However, current evaluations either are automated but misalign with humans or require human evaluations that are time-consuming and expensive. In this work, we present DreamBench++, a human-aligned benchmark automated by advanced multimodal GPT models. Specifically, we systematically design the prompts to let GPT be both human-aligned and self-aligned, empowered with task reinforcement. Further, we construct a comprehensive dataset comprising diverse images and prompts. By benchmarking 7 modern generative models, we demonstrate that DreamBench++ results in significantly more human-aligned evaluation, helping boost the community with innovative findings.
comment: Project page: https://dreambenchplus.github.io/
☆ Long Context Transfer from Language to Vision
Video sequences offer valuable temporal information, but existing large multimodal models (LMMs) fall short in understanding extremely long videos. Many works address this by reducing the number of visual tokens using visual resamplers. Alternatively, in this paper, we approach this problem from the perspective of the language model. By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training. We call this phenomenon long context transfer and carefully ablate its properties. To effectively measure LMMs' ability to generalize to long contexts in the vision modality, we develop V-NIAH (Visual Needle-In-A-Haystack), a purely synthetic long vision benchmark inspired by the language model's NIAH test. Our proposed Long Video Assistant (LongVA) can process 2000 frames or over 200K visual tokens without additional complexities. With its extended context length, LongVA achieves state-of-the-art performance on Video-MME among 7B-scale models by densely sampling more input frames. Our work is open-sourced at https://github.com/EvolvingLMMs-Lab/LongVA.
☆ Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts
We present LoCoVQA, a dynamic benchmark generator for evaluating long-context extractive reasoning in vision language models (VLMs). LoCoVQA augments test examples for mathematical reasoning, VQA, and character recognition tasks with increasingly long visual contexts composed of both in-distribution and out-of-distribution distractor images. Across these tasks, a diverse set of VLMs rapidly lose performance as the visual context length grows, often exhibiting a striking exponential decay trend. This test assesses how well VLMs can ignore irrelevant information when answering queries -- a task that is quite easy for language models (LMs) in the text domain -- demonstrating that current state-of-the-art VLMs lack this essential capability for many long-context applications.
comment: Under review
☆ From Perfect to Noisy World Simulation: Customizable Embodied Multi-modal Perturbations for SLAM Robustness Benchmarking
Embodied agents require robust navigation systems to operate in unstructured environments, making the robustness of Simultaneous Localization and Mapping (SLAM) models critical to embodied agent autonomy. While real-world datasets are invaluable, simulation-based benchmarks offer a scalable approach for robustness evaluations. However, the creation of a challenging and controllable noisy world with diverse perturbations remains under-explored. To this end, we propose a novel, customizable pipeline for noisy data synthesis, aimed at assessing the resilience of multi-modal SLAM models against various perturbations. The pipeline comprises a comprehensive taxonomy of sensor and motion perturbations for embodied multi-modal (specifically RGB-D) sensing, categorized by their sources and propagation order, allowing for procedural composition. We also provide a toolbox for synthesizing these perturbations, enabling the transformation of clean environments into challenging noisy simulations. Utilizing the pipeline, we instantiate the large-scale Noisy-Replica benchmark, which includes diverse perturbation types, to evaluate the risk tolerance of existing advanced RGB-D SLAM models. Our extensive analysis uncovers the susceptibilities of both neural (NeRF and Gaussian Splatting -based) and non-neural SLAM models to disturbances, despite their demonstrated accuracy in standard benchmarks. Our code is publicly available at https://github.com/Xiaohao-Xu/SLAM-under-Perturbation.
comment: 50 pages. arXiv admin note: substantial text overlap with arXiv:2402.08125
☆ Unsupervised Domain Adaptation for Pediatric Brain Tumor Segmentation
Significant advances have been made toward building accurate automatic segmentation models for adult gliomas. However, the performance of these models often degrades when applied to pediatric glioma due to their imaging and clinical differences (domain shift). Obtaining sufficient annotated data for pediatric glioma is typically difficult because of its rare nature. Also, manual annotations are scarce and expensive. In this work, we propose Domain-Adapted nnU-Net (DA-nnUNet) to perform unsupervised domain adaptation from adult glioma (source domain) to pediatric glioma (target domain). Specifically, we add a domain classifier connected with a gradient reversal layer (GRL) to a backbone nnU-Net. Once the classifier reaches a very high accuracy, the GRL is activated with the goal of transferring domain-invariant features from the classifier to the segmentation model while preserving segmentation accuracy on the source domain. The accuracy of the classifier slowly degrades to chance levels. No annotations are used in the target domain. The method is compared to 8 different supervised models using BraTS-Adult glioma (N=1251) and BraTS-PED glioma data (N=99). The proposed method shows notable performance enhancements in the tumor core (TC) region compared to the model that only uses adult data: ~32% better Dice scores and ~20 better 95th percentile Hausdorff distances. Moreover, our unsupervised approach shows no statistically significant difference compared to the practical upper bound model using manual annotations from both datasets in TC region. The code is shared at https://github.com/Fjr9516/DA_nnUNet.
comment: 10 pages, 4 figures, conference
GPT-4V Explorations: Mining Autonomous Driving
This paper explores the application of the GPT-4V(ision) large visual language model to autonomous driving in mining environments, where traditional systems often falter in understanding intentions and making accurate decisions during emergencies. GPT-4V introduces capabilities for visual question answering and complex scene comprehension, addressing challenges in these specialized settings.Our evaluation focuses on its proficiency in scene understanding, reasoning, and driving functions, with specific tests on its ability to recognize and interpret elements such as pedestrians, various vehicles, and traffic devices. While GPT-4V showed robust comprehension and decision-making skills, it faced difficulties in accurately identifying specific vehicle types and managing dynamic interactions. Despite these challenges, its effective navigation and strategic decision-making demonstrate its potential as a reliable agent for autonomous driving in the complex conditions of mining environments, highlighting its adaptability and operational viability in industrial settings.
☆ ClotheDreamer: Text-Guided Garment Generation with 3D Gaussians
High-fidelity 3D garment synthesis from text is desirable yet challenging for digital avatar creation. Recent diffusion-based approaches via Score Distillation Sampling (SDS) have enabled new possibilities but either intricately couple with human body or struggle to reuse. We introduce ClotheDreamer, a 3D Gaussian-based method for generating wearable, production-ready 3D garment assets from text prompts. We propose a novel representation Disentangled Clothe Gaussian Splatting (DCGS) to enable separate optimization. DCGS represents clothed avatar as one Gaussian model but freezes body Gaussian splats. To enhance quality and completeness, we incorporate bidirectional SDS to supervise clothed avatar and garment RGBD renderings respectively with pose conditions and propose a new pruning strategy for loose clothing. Our approach can also support custom clothing templates as input. Benefiting from our design, the synthetic 3D garment can be easily applied to virtual try-on and support physically accurate animation. Extensive experiments showcase our method's superior and competitive performance. Our project page is at https://ggxxii.github.io/clothedreamer.
comment: Project Page: https://ggxxii.github.io/clothedreamer
☆ Beyond Thumbs Up/Down: Untangling Challenges of Fine-Grained Feedback for Text-to-Image Generation
Human feedback plays a critical role in learning and refining reward models for text-to-image generation, but the optimal form the feedback should take for learning an accurate reward function has not been conclusively established. This paper investigates the effectiveness of fine-grained feedback which captures nuanced distinctions in image quality and prompt-alignment, compared to traditional coarse-grained feedback (for example, thumbs up/down or ranking between a set of options). While fine-grained feedback holds promise, particularly for systems catering to diverse societal preferences, we show that demonstrating its superiority to coarse-grained feedback is not automatic. Through experiments on real and synthetic preference data, we surface the complexities of building effective models due to the interplay of model choice, feedback type, and the alignment between human judgment and computational interpretation. We identify key challenges in eliciting and utilizing fine-grained feedback, prompting a reassessment of its assumed benefits and practicality. Our findings -- e.g., that fine-grained feedback can lead to worse models for a fixed budget, in some settings; however, in controlled settings with known attributes, fine grained rewards can indeed be more helpful -- call for careful consideration of feedback attributes and potentially beckon novel modeling approaches to appropriately unlock the potential value of fine-grained feedback in-the-wild.
☆ The Progression of Transformers from Language to Vision to MOT: A Literature Review on Multi-Object Tracking with Transformers
The transformer neural network architecture allows for autoregressive sequence-to-sequence modeling through the use of attention layers. It was originally created with the application of machine translation but has revolutionized natural language processing. Recently, transformers have also been applied across a wide variety of pattern recognition tasks, particularly in computer vision. In this literature review, we describe major advances in computer vision utilizing transformers. We then focus specifically on Multi-Object Tracking (MOT) and discuss how transformers are increasingly becoming competitive in state-of-the-art MOT works, yet still lag behind traditional deep learning methods.
comment: This report was written in November 2022, and may not contain more recent works since then
☆ Instance Consistency Regularization for Semi-Supervised 3D Instance Segmentation
Large-scale datasets with point-wise semantic and instance labels are crucial to 3D instance segmentation but also expensive. To leverage unlabeled data, previous semi-supervised 3D instance segmentation approaches have explored self-training frameworks, which rely on high-quality pseudo labels for consistency regularization. They intuitively utilize both instance and semantic pseudo labels in a joint learning manner. However, semantic pseudo labels contain numerous noise derived from the imbalanced category distribution and natural confusion of similar but distinct categories, which leads to severe collapses in self-training. Motivated by the observation that 3D instances are non-overlapping and spatially separable, we ask whether we can solely rely on instance consistency regularization for improved semi-supervised segmentation. To this end, we propose a novel self-training network InsTeacher3D to explore and exploit pure instance knowledge from unlabeled data. We first build a parallel base 3D instance segmentation model DKNet, which distinguishes each instance from the others via discriminative instance kernels without reliance on semantic segmentation. Based on DKNet, we further design a novel instance consistency regularization framework to generate and leverage high-quality instance pseudo labels. Experimental results on multiple large-scale datasets show that the InsTeacher3D significantly outperforms prior state-of-the-art semi-supervised approaches. Code is available: https://github.com/W1zheng/InsTeacher3D.
comment: 14 pages, 10 figures
☆ The MRI Scanner as a Diagnostic: Image-less Active Sampling MICCAI 2024
Despite the high diagnostic accuracy of Magnetic Resonance Imaging (MRI), using MRI as a Point-of-Care (POC) disease identification tool poses significant accessibility challenges due to the use of high magnetic field strength and lengthy acquisition times. We ask a simple question: Can we dynamically optimise acquired samples, at the patient level, according to an (automated) downstream decision task, while discounting image reconstruction? We propose an ML-based framework that learns an active sampling strategy, via reinforcement learning, at a patient-level to directly infer disease from undersampled k-space. We validate our approach by inferring Meniscus Tear in undersampled knee MRI data, where we achieve diagnostic performance comparable with ML-based diagnosis, using fully sampled k-space data. We analyse task-specific sampling policies, showcasing the adaptability of our active sampling approach. The introduced frugal sampling strategies have the potential to reduce high field strength requirements that in turn strengthen the viability of MRI-based POC disease identification and associated preliminary screening tools.
comment: Accepted in MICCAI 2024
☆ μ-Net: A Deep Learning-Based Architecture for μ-CT Segmentation
X-ray computed microtomography ({\mu}-CT) is a non-destructive technique that can generate high-resolution 3D images of the internal anatomy of medical and biological samples. These images enable clinicians to examine internal anatomy and gain insights into the disease or anatomical morphology. However, extracting relevant information from 3D images requires semantic segmentation of the regions of interest, which is usually done manually and results time-consuming and tedious. In this work, we propose a novel framework that uses a convolutional neural network (CNN) to automatically segment the full morphology of the heart of Carassius auratus. The framework employs an optimized 2D CNN architecture that can infer a 3D segmentation of the sample, avoiding the high computational cost of a 3D CNN architecture. We tackle the challenges of handling large and high-resoluted image data (over a thousand pixels in each dimension) and a small training database (only three samples) by proposing a standard protocol for data normalization and processing. Moreover, we investigate how the noise, contrast, and spatial resolution of the sample and the training of the architecture are affected by the reconstruction technique, which depends on the number of input images. Experiments show that our framework significantly reduces the time required to segment new samples, allowing a faster microtomography analysis of the Carassius auratus heart shape. Furthermore, our framework can work with any bio-image (biological and medical) from {\mu}-CT with high-resolution and small dataset size
☆ Portrait3D: 3D Head Generation from Single In-the-wild Portrait Image
While recent works have achieved great success on one-shot 3D common object generation, high quality and fidelity 3D head generation from a single image remains a great challenge. Previous text-based methods for generating 3D heads were limited by text descriptions and image-based methods struggled to produce high-quality head geometry. To handle this challenging problem, we propose a novel framework, Portrait3D, to generate high-quality 3D heads while preserving their identities. Our work incorporates the identity information of the portrait image into three parts: 1) geometry initialization, 2) geometry sculpting, and 3) texture generation stages. Given a reference portrait image, we first align the identity features with text features to realize ID-aware guidance enhancement, which contains the control signals representing the face information. We then use the canny map, ID features of the portrait image, and a pre-trained text-to-normal/depth diffusion model to generate ID-aware geometry supervision, and 3D-GAN inversion is employed to generate ID-aware geometry initialization. Furthermore, with the ability to inject identity information into 3D head generation, we use ID-aware guidance to calculate ID-aware Score Distillation (ISD) for geometry sculpting. For texture generation, we adopt the ID Consistent Texture Inpainting and Refinement which progressively expands the view for texture inpainting to obtain an initialization UV texture map. We then use the id-aware guidance to provide image-level supervision for noisy multi-view images to obtain a refined texture map. Extensive experiments demonstrate that we can generate high-quality 3D heads with accurate geometry and texture from single in-the-wild portrait images. The project page is at https://jinkun-hao.github.io/Portrait3D/.
comment: https://jinkun-hao.github.io/Portrait3D/
☆ Demystifying the Effect of Receptive Field Size in U-Net Models for Medical Image Segmentation
Medical image segmentation is a critical task in healthcare applications, and U-Nets have demonstrated promising results. This work delves into the understudied aspect of receptive field (RF) size and its impact on the U-Net and Attention U-Net architectures. This work explores several critical elements including the relationship between RF size, characteristics of the region of interest, and model performance, as well as the balance between RF size and computational costs for U-Net and Attention U-Net methods for different datasets. This work also proposes a mathematical notation for representing the theoretical receptive field (TRF) of a given layer in a network and proposes two new metrics - effective receptive field (ERF) rate and the Object rate to quantify the fraction of significantly contributing pixels within the ERF against the TRF area and assessing the relative size of the segmentation object compared to the TRF size respectively. The results demonstrate that there exists an optimal TRF size that successfully strikes a balance between capturing a wider global context and maintaining computational efficiency, thereby optimizing model performance. Interestingly, a distinct correlation is observed between the data complexity and the required TRF size; segmentation based solely on contrast achieved peak performance even with smaller TRF sizes, whereas more complex segmentation tasks necessitated larger TRFs. Attention U-Net models consistently outperformed their U-Net counterparts, highlighting the value of attention mechanisms regardless of TRF size. These novel insights present an invaluable resource for developing more efficient U-Net-based architectures for medical imaging and pave the way for future exploration. A tool is also developed that calculates the TRF for a U-Net (and Attention U-Net) model, and also suggest an appropriate TRF size for a given model and dataset.
☆ Geometry-Aware Score Distillation via 3D Consistent Noising and Gradient Consistency Modeling
Score distillation sampling (SDS), the methodology in which the score from pretrained 2D diffusion models is distilled into 3D representation, has recently brought significant advancements in text-to-3D generation task. However, this approach is still confronted with critical geometric inconsistency problems such as the Janus problem. Starting from a hypothesis that such inconsistency problems may be induced by multiview inconsistencies between 2D scores predicted from various viewpoints, we introduce GSD, a simple and general plug-and-play framework for incorporating 3D consistency and therefore geometry awareness into the SDS process. Our methodology is composed of three components: 3D consistent noising, designed to produce 3D consistent noise maps that perfectly follow the standard Gaussian distribution, geometry-based gradient warping for identifying correspondences between predicted gradients of different viewpoints, and novel gradient consistency loss to optimize the scene geometry toward producing more consistent gradients. We demonstrate that our method significantly improves performance, successfully addressing the geometric inconsistency problems in text-to-3D generation task with minimal computation cost and being compatible with existing score distillation-based models. Our project page is available at https://ku-cvlab.github.io/GSD/.
☆ Repulsive Score Distillation for Diverse Sampling of Diffusion Models
Score distillation sampling has been pivotal for integrating diffusion models into generation of complex visuals. Despite impressive results it suffers from mode collapse and lack of diversity. To cope with this challenge, we leverage the gradient flow interpretation of score distillation to propose Repulsive Score Distillation (RSD). In particular, we propose a variational framework based on repulsion of an ensemble of particles that promotes diversity. Using a variational approximation that incorporates a coupling among particles, the repulsion appears as a simple regularization that allows interaction of particles based on their relative pairwise similarity, measured e.g., via radial basis kernels. We design RSD for both unconstrained and constrained sampling scenarios. For constrained sampling we focus on inverse problems in the latent space that leads to an augmented variational formulation, that strikes a good balance between compute, quality and diversity. Our extensive experiments for text-to-image generation, and inverse problems demonstrate that RSD achieves a superior trade-off between diversity and quality compared with state-of-the-art alternatives.
☆ Sampling Strategies in Bayesian Inversion: A Study of RTO and Langevin Methods
This paper studies two classes of sampling methods for the solution of inverse problems, namely Randomize-Then-Optimize (RTO), which is rooted in sensitivity analysis, and Langevin methods, which are rooted in the Bayesian framework. The two classes of methods correspond to different assumptions and yield samples from different target distributions. We highlight the main conceptual and theoretical differences between the two approaches and compare them from a practical point of view by tackling two classical inverse problems in imaging: deblurring and inpainting. We show that the choice of the sampling method has a significant impact on the quality of the reconstruction and that the RTO method is more robust to the choice of the parameters.
☆ Vision-Language Consistency Guided Multi-modal Prompt Learning for Blind AI Generated Image Quality Assessment
Recently, textual prompt tuning has shown inspirational performance in adapting Contrastive Language-Image Pre-training (CLIP) models to natural image quality assessment. However, such uni-modal prompt learning method only tunes the language branch of CLIP models. This is not enough for adapting CLIP models to AI generated image quality assessment (AGIQA) since AGIs visually differ from natural images. In addition, the consistency between AGIs and user input text prompts, which correlates with the perceptual quality of AGIs, is not investigated to guide AGIQA. In this letter, we propose vision-language consistency guided multi-modal prompt learning for blind AGIQA, dubbed CLIP-AGIQA. Specifically, we introduce learnable textual and visual prompts in language and vision branches of CLIP models, respectively. Moreover, we design a text-to-image alignment quality prediction task, whose learned vision-language consistency knowledge is used to guide the optimization of the above multi-modal prompts. Experimental results on two public AGIQA datasets demonstrate that the proposed method outperforms state-of-the-art quality assessment models. The source code is available at https://github.com/JunFu1995/CLIP-AGIQA.
comment: Accepted by IEEE Signal Processing Letter
☆ Feature Fusion for Human Activity Recognition using Parameter-Optimized Multi-Stage Graph Convolutional Network and Transformer Models
Human activity recognition (HAR) is a crucial area of research that involves understanding human movements using computer and machine vision technology. Deep learning has emerged as a powerful tool for this task, with models such as Convolutional Neural Networks (CNNs) and Transformers being employed to capture various aspects of human motion. One of the key contributions of this work is the demonstration of the effectiveness of feature fusion in improving HAR accuracy by capturing spatial and temporal features, which has important implications for the development of more accurate and robust activity recognition systems. The study uses sensory data from HuGaDB, PKU-MMD, LARa, and TUG datasets. Two model, the PO-MS-GCN and a Transformer were trained and evaluated, with PO-MS-GCN outperforming state-of-the-art models. HuGaDB and TUG achieved high accuracies and f1-scores, while LARa and PKU-MMD had lower scores. Feature fusion improved results across datasets.
comment: 7 pages, 1 figure, conference
☆ MLAAN: Scaling Supervised Local Learning with Multilaminar Leap Augmented Auxiliary Network
End-to-end (E2E) training approaches are commonly plagued by high memory consumption, reduced efficiency in training, challenges in model parallelization, and suboptimal biocompatibility. Local learning is considered a novel interactive training method that holds promise as an alternative to E2E. Nonetheless, conventional local learning methods fall short in achieving high model accuracy due to inadequate local inter-module interactions. In this paper, we introduce a new model known as the Scaling Supervised Local Learning with Multilaminar Leap Augmented Auxiliary Network (MLAAN). MLAAN features an innovative supervised local learning approach coupled with a robust reinforcement module. This dual-component design enables the MLAAN to integrate smoothly with established local learning techniques, thereby enhancing the efficacy of the foundational methods. The method simultaneously acquires the local and global features of the model separately by constructing an independent auxiliary network and a cascade auxiliary network on the one hand and incorporates a leap augmented module, which serves to counteract the reduced learning capacity often associated with weaker supervision. This architecture not only augments the exchange of information amongst the local modules but also effectively mitigates the model's tendency toward myopia. The experimental evaluations conducted on four benchmark datasets, CIFAR-10, STL-10, SVHN, and ImageNet, demonstrate that the integration of MLAAN with existing supervised local learning methods significantly enhances the original methodologies. Of particular note, MLAAN enables local learning methods to comprehensively outperform end-to-end training approaches in terms of optimal performance while saving GPU memory.
☆ Articulate your NeRF: Unsupervised articulated object modeling via conditional view synthesis
We propose a novel unsupervised method to learn the pose and part-segmentation of articulated objects with rigid parts. Given two observations of an object in different articulation states, our method learns the geometry and appearance of object parts by using an implicit model from the first observation, distils the part segmentation and articulation from the second observation while rendering the latter observation. Additionally, to tackle the complexities in the joint optimization of part segmentation and articulation, we propose a voxel grid-based initialization strategy and a decoupled optimization procedure. Compared to the prior unsupervised work, our model obtains significantly better performance, and generalizes to objects with multiple parts while it can be efficiently from few views for the latter observation.
comment: 9 pages for the maincontent, excluding references and supplementaries
☆ OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer
Recent advancements in Large Language Models (LLMs) have expanded their capabilities to multimodal contexts, including comprehensive video understanding. However, processing extensive videos such as 24-hour CCTV footage or full-length films presents significant challenges due to the vast data and processing demands. Traditional methods, like extracting key frames or converting frames to text, often result in substantial information loss. To address these shortcomings, we develop OmAgent, efficiently stores and retrieves relevant video frames for specific queries, preserving the detailed content of videos. Additionally, it features an Divide-and-Conquer Loop capable of autonomous reasoning, dynamically invoking APIs and tools to enhance query processing and accuracy. This approach ensures robust video understanding, significantly reducing information loss. Experimental results affirm OmAgent's efficacy in handling various types of videos and complex tasks. Moreover, we have endowed it with greater autonomy and a robust tool-calling system, enabling it to accomplish even more intricate tasks.
☆ The Championship-Winning Solution for the 5th CLVISION Challenge 2024
In this paper, we introduce our approach to the 5th CLVision Challenge, which presents distinctive challenges beyond traditional class incremental learning. Unlike standard settings, this competition features the recurrence of previously encountered classes and includes unlabeled data that may contain Out-of-Distribution (OOD) categories. Our approach is based on Winning Subnetworks to allocate independent parameter spaces for each task addressing the catastrophic forgetting problem in class incremental learning and employ three training strategies: supervised classification learning, unsupervised contrastive learning, and pseudo-label classification learning to fully utilize the information in both labeled and unlabeled data, enhancing the classification performance of each subnetwork. Furthermore, during the inference stage, we have devised an interaction strategy between subnetworks, where the prediction for a specific class of a particular sample is the average logits across different subnetworks corresponding to that class, leveraging the knowledge learned from different subnetworks on recurring classes to improve classification accuracy. These strategies can be simultaneously applied to the three scenarios of the competition, effectively solving the difficulties in the competition scenarios. Experimentally, our method ranks first in both the pre-selection and final evaluation stages, with an average accuracy of 0.4535 during the preselection stage and an average accuracy of 0.4805 during the final evaluation stage.
☆ When Invariant Representation Learning Meets Label Shift: Insufficiency and Theoretical Insights
As a crucial step toward real-world learning scenarios with changing environments, dataset shift theory and invariant representation learning algorithm have been extensively studied to relax the identical distribution assumption in classical learning setting. Among the different assumptions on the essential of shifting distributions, generalized label shift (GLS) is the latest developed one which shows great potential to deal with the complex factors within the shift. In this paper, we aim to explore the limitations of current dataset shift theory and algorithm, and further provide new insights by presenting a comprehensive understanding of GLS. From theoretical aspect, two informative generalization bounds are derived, and the GLS learner is proved to be sufficiently close to optimal target model from the Bayesian perspective. The main results show the insufficiency of invariant representation learning, and prove the sufficiency and necessity of GLS correction for generalization, which provide theoretical supports and innovations for exploring generalizable model under dataset shift. From methodological aspect, we provide a unified view of existing shift correction frameworks, and propose a kernel embedding-based correction algorithm (KECA) to minimize the generalization error and achieve successful knowledge transfer. Both theoretical results and extensive experiment evaluations demonstrate the sufficiency and necessity of GLS correction for addressing dataset shift and the superiority of proposed algorithm.
comment: Accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
☆ Do As I Do: Pose Guided Human Motion Copy
Human motion copy is an intriguing yet challenging task in artificial intelligence and computer vision, which strives to generate a fake video of a target person performing the motion of a source person. The problem is inherently challenging due to the subtle human-body texture details to be generated and the temporal consistency to be considered. Existing approaches typically adopt a conventional GAN with an L1 or L2 loss to produce the target fake video, which intrinsically necessitates a large number of training samples that are challenging to acquire. Meanwhile, current methods still have difficulties in attaining realistic image details and temporal consistency, which unfortunately can be easily perceived by human observers. Motivated by this, we try to tackle the issues from three aspects: (1) We constrain pose-to-appearance generation with a perceptual loss and a theoretically motivated Gromov-Wasserstein loss to bridge the gap between pose and appearance. (2) We present an episodic memory module in the pose-to-appearance generation to propel continuous learning that helps the model learn from its past poor generations. We also utilize geometrical cues of the face to optimize facial details and refine each key body part with a dedicated local GAN. (3) We advocate generating the foreground in a sequence-to-sequence manner rather than a single-frame manner, explicitly enforcing temporal inconsistency. Empirical results on five datasets, iPER, ComplexMotion, SoloDance, Fish, and Mouse datasets, demonstrate that our method is capable of generating realistic target videos while precisely copying motion from a source video. Our method significantly outperforms state-of-the-art approaches and gains 7.2% and 12.4% improvements in PSNR and FID respectively.
☆ Measuring the Recyclability of Electronic Components to Assist Automatic Disassembly and Sorting Waste Printed Circuit Boards
The waste of electrical and electronic equipment has been increased due to the fast evolution of technology products and competition of many IT sectors. Every year millions of tons of electronic waste are thrown into the environment which causes high consequences for human health. Therefore, it is crucial to control this waste flow using technology, especially using Artificial Intelligence but also reclamation of critical raw materials for new production processes. In this paper, we focused on the measurement of recyclability of waste electronic components (WECs) from waste printed circuit boards (WPCBs) using mathematical innovation model. This innovative approach evaluates both the recyclability and recycling difficulties of WECs, integrating an AI model for improved disassembly and sorting. Assessing the recyclability of individual electronic components present on WPCBs provides insight into the recovery potential of valuable materials and indicates the level of complexity involved in recycling in terms of economic worth and production utility. This novel measurement approach helps AI models in accurately determining the number of classes to be identified and sorted during the automated disassembly of discarded PCBs. It also facilitates the model in iterative training and validation of individual electronic components.
comment: 15 pages, 6 figures
☆ Toward Fairer Face Recognition Datasets
Face recognition and verification are two computer vision tasks whose performance has progressed with the introduction of deep representations. However, ethical, legal, and technical challenges due to the sensitive character of face data and biases in real training datasets hinder their development. Generative AI addresses privacy by creating fictitious identities, but fairness problems persist. We promote fairness by introducing a demographic attributes balancing mechanism in generated training datasets. We experiment with an existing real dataset, three generated training datasets, and the balanced versions of a diffusion-based dataset. We propose a comprehensive evaluation that considers accuracy and fairness equally and includes a rigorous regression-based statistical analysis of attributes. The analysis shows that balancing reduces demographic unfairness. Also, a performance gap persists despite generation becoming more accurate with time. The proposed balancing method and comprehensive verification evaluation promote fairer and transparent face recognition and verification.
☆ Personalized federated learning based on feature fusion
Federated learning enables distributed clients to collaborate on training while storing their data locally to protect client privacy. However, due to the heterogeneity of data, models, and devices, the final global model may need to perform better for tasks on each client. Communication bottlenecks, data heterogeneity, and model heterogeneity have been common challenges in federated learning. In this work, we considered a label distribution skew problem, a type of data heterogeneity easily overlooked. In the context of classification, we propose a personalized federated learning approach called pFedPM. In our process, we replace traditional gradient uploading with feature uploading, which helps reduce communication costs and allows for heterogeneous client models. These feature representations play a role in preserving privacy to some extent. We use a hyperparameter $a$ to mix local and global features, which enables us to control the degree of personalization. We also introduced a relation network as an additional decision layer, which provides a non-linear learnable classifier to predict labels. Experimental results show that, with an appropriate setting of $a$, our scheme outperforms several recent FL methods on MNIST, FEMNIST, and CRIFAR10 datasets and achieves fewer communications.
☆ FASTC: A Fast Attentional Framework for Semantic Traversability Classification Using Point Cloud ECAI2023
Producing traversability maps and understanding the surroundings are crucial prerequisites for autonomous navigation. In this paper, we address the problem of traversability assessment using point clouds. We propose a novel pillar feature extraction module that utilizes PointNet to capture features from point clouds organized in vertical volume and a 2D encoder-decoder structure to conduct traversability classification instead of the widely used 3D convolutions. This results in less computational cost while even better performance is achieved at the same time. We then propose a new spatio-temporal attention module to fuse multi-frame information, which can properly handle the varying density problem of LIDAR point clouds, and this makes our module able to assess distant areas more accurately. Comprehensive experimental results on augmented Semantic KITTI and RELLIS-3D datasets show that our method is able to achieve superior performance over existing approaches both quantitatively and quantitatively.
comment: Accepted to ECAI2023 Our code is publicly available at [this](https://github.com/chenyirui/FASTC)
☆ EvalAlign: Evaluating Text-to-Image Models through Precision Alignment of Multimodal Large Models with Supervised Fine-Tuning to Human Annotations
The recent advancements in text-to-image generative models have been remarkable. Yet, the field suffers from a lack of evaluation metrics that accurately reflect the performance of these models, particularly lacking fine-grained metrics that can guide the optimization of the models. In this paper, we propose EvalAlign, a metric characterized by its accuracy, stability, and fine granularity. Our approach leverages the capabilities of Multimodal Large Language Models (MLLMs) pre-trained on extensive datasets. We develop evaluation protocols that focus on two key dimensions: image faithfulness and text-image alignment. Each protocol comprises a set of detailed, fine-grained instructions linked to specific scoring options, enabling precise manual scoring of the generated images. We Supervised Fine-Tune (SFT) the MLLM to align closely with human evaluative judgments, resulting in a robust evaluation model. Our comprehensive tests across 24 text-to-image generation models demonstrate that EvalAlign not only provides superior metric stability but also aligns more closely with human preferences than existing metrics, confirming its effectiveness and utility in model assessment.
comment: Github Repository: https://github.com/SAIS-FUXI/EvalAlign
☆ Hierarchical B-frame Video Coding for Long Group of Pictures
Learned video compression methods already outperform VVC in the low-delay (LD) case, but the random-access (RA) scenario remains challenging. Most works on learned RA video compression either use HEVC as an anchor or compare it to VVC in specific test conditions, using RGB-PSNR metric instead of Y-PSNR and avoiding comprehensive evaluation. Here, we present an end-to-end learned video codec for random access that combines training on long sequences of frames, rate allocation designed for hierarchical coding and content adaptation on inference. We show that under common test conditions (JVET-CTC), it achieves results comparable to VTM (VVC reference software) in terms of YUV-PSNR BD-Rate on some classes of videos, and outperforms it on almost all test sets in terms of VMAF BD-Rate. On average it surpasses open LD and RA end-to-end solutions in terms of VMAF and YUV BD-Rates.
☆ Improving robustness to corruptions with multiplicative weight perturbations
Deep neural networks (DNNs) excel on clean images but struggle with corrupted ones. Incorporating specific corruptions into the data augmentation pipeline can improve robustness to those corruptions but may harm performance on clean images and other types of distortion. In this paper, we introduce an alternative approach that improves the robustness of DNNs to a wide range of corruptions without compromising accuracy on clean images. We first demonstrate that input perturbations can be mimicked by multiplicative perturbations in the weight space. Leveraging this, we propose Data Augmentation via Multiplicative Perturbation (DAMP), a training method that optimizes DNNs under random multiplicative weight perturbations. We also examine the recently proposed Adaptive Sharpness-Aware Minimization (ASAM) and show that it optimizes DNNs under adversarial multiplicative weight perturbations. Experiments on image classification datasets (CIFAR-10/100, TinyImageNet and ImageNet) and neural network architectures (ResNet50, ViT-S/16) show that DAMP enhances model generalization performance in the presence of corruptions across different settings. Notably, DAMP is able to train a ViT-S/16 on ImageNet from scratch, reaching the top-1 error of 23.7% which is comparable to ResNet50 without extensive data augmentations.
comment: Under review
☆ Character-Adapter: Prompt-Guided Region Control for High-Fidelity Character Customization
Customized image generation, which seeks to synthesize images with consistent characters, holds significant relevance for applications such as storytelling, portrait generation, and character design. However, previous approaches have encountered challenges in preserving characters with high-fidelity consistency due to inadequate feature extraction and concept confusion of reference characters. Therefore, we propose Character-Adapter, a plug-and-play framework designed to generate images that preserve the details of reference characters, ensuring high-fidelity consistency. Character-Adapter employs prompt-guided segmentation to ensure fine-grained regional features of reference characters and dynamic region-level adapters to mitigate concept confusion. Extensive experiments are conducted to validate the effectiveness of Character-Adapter. Both quantitative and qualitative results demonstrate that Character-Adapter achieves the state-of-the-art performance of consistent character generation, with an improvement of 24.8% compared with other methods
☆ GIM: A Million-scale Benchmark for Generative Image Manipulation Detection and Localization
The extraordinary ability of generative models emerges as a new trend in image editing and generating realistic images, posing a serious threat to the trustworthiness of multimedia data and driving the research of image manipulation detection and location(IMDL). However, the lack of a large-scale data foundation makes IMDL task unattainable. In this paper, a local manipulation pipeline is designed, incorporating the powerful SAM, ChatGPT and generative models. Upon this basis, We propose the GIM dataset, which has the following advantages: 1) Large scale, including over one million pairs of AI-manipulated images and real images. 2) Rich Image Content, encompassing a broad range of image classes 3) Diverse Generative Manipulation, manipulated images with state-of-the-art generators and various manipulation tasks. The aforementioned advantages allow for a more comprehensive evaluation of IMDL methods, extending their applicability to diverse images. We introduce two benchmark settings to evaluate the generalization capability and comprehensive performance of baseline methods. In addition, we propose a novel IMDL framework, termed GIMFormer, which consists of a ShadowTracer, Frequency-Spatial Block (FSB), and a Multi-window Anomalous Modelling (MWAM) Module. Extensive experiments on the GIM demonstrate that GIMFormer surpasses previous state-of-the-art works significantly on two different benchmarks.
comment: Code page: https://github.com/chenyirui/GIM
☆ Vision Mamba-based autonomous crack segmentation on concrete, asphalt, and masonry surfaces
Convolutional neural networks (CNNs) and Transformers have shown advanced accuracy in crack detection under certain conditions. Yet, the fixed local attention can compromise the generalisation of CNNs, and the quadratic complexity of the global self-attention restricts the practical deployment of Transformers. Given the emergence of the new-generation architecture of Mamba, this paper proposes a Vision Mamba (VMamba)-based framework for crack segmentation on concrete, asphalt, and masonry surfaces, with high accuracy, generalisation, and less computational complexity. Having 15.6% - 74.5% fewer parameters, the encoder-decoder network integrated with VMamba could obtain up to 2.8% higher mDS than representative CNN-based models while showing about the same performance as Transformer-based models. Moreover, the VMamba-based encoder-decoder network could process high-resolution image input with up to 90.6% lower floating-point operations.
comment: 23 pages, 9 figures
☆ Multi-Modal Vision Transformers for Crop Mapping from Satellite Image Time Series
Using images acquired by different satellite sensors has shown to improve classification performance in the framework of crop mapping from satellite image time series (SITS). Existing state-of-the-art architectures use self-attention mechanisms to process the temporal dimension and convolutions for the spatial dimension of SITS. Motivated by the success of purely attention-based architectures in crop mapping from single-modal SITS, we introduce several multi-modal multi-temporal transformer-based architectures. Specifically, we investigate the effectiveness of Early Fusion, Cross Attention Fusion and Synchronized Class Token Fusion within the Temporo-Spatial Vision Transformer (TSViT). Experimental results demonstrate significant improvements over state-of-the-art architectures with both convolutional and self-attention components.
comment: 5 pages, 2 figures, 1 table. Accepted at IEEE International Geoscience and Remote Sensing Symposium (IGARSS) 2024. Our code is available at https://git.tu-berlin.de/rsim/mmtsvit
☆ LOGCAN++: Local-global class-aware network for semantic segmentation of remote sensing images
Remote sensing images usually characterized by complex backgrounds, scale and orientation variations, and large intra-class variance. General semantic segmentation methods usually fail to fully investigate the above issues, and thus their performances on remote sensing image segmentation are limited. In this paper, we propose our LOGCAN++, a semantic segmentation model customized for remote sensing images, which is made up of a Global Class Awareness (GCA) module and several Local Class Awareness (LCA) modules. The GCA module captures global representations for class-level context modeling to reduce the interference of background noise. The LCA module generates local class representations as intermediate perceptual elements to indirectly associate pixels with the global class representations, targeting at dealing with the large intra-class variance problem. In particular, we introduce affine transformations in the LCA module for adaptive extraction of local class representations to effectively tolerate scale and orientation variations in remotely sensed images. Extensive experiments on three benchmark datasets show that our LOGCAN++ outperforms current mainstream general and remote sensing semantic segmentation methods and achieves a better trade-off between speed and accuracy. Code is available at https://github.com/xwmaxwma/rssegmentation.
comment: Under Review
☆ UNICAD: A Unified Approach for Attack Detection, Noise Reduction and Novel Class Identification
As the use of Deep Neural Networks (DNNs) becomes pervasive, their vulnerability to adversarial attacks and limitations in handling unseen classes poses significant challenges. The state-of-the-art offers discrete solutions aimed to tackle individual issues covering specific adversarial attack scenarios, classification or evolving learning. However, real-world systems need to be able to detect and recover from a wide range of adversarial attacks without sacrificing classification accuracy and to flexibly act in {\bf unseen} scenarios. In this paper, UNICAD, is proposed as a novel framework that integrates a variety of techniques to provide an adaptive solution. For the targeted image classification, UNICAD achieves accurate image classification, detects unseen classes, and recovers from adversarial attacks using Prototype and Similarity-based DNNs with denoising autoencoders. Our experiments performed on the CIFAR-10 dataset highlight UNICAD's effectiveness in adversarial mitigation and unseen class classification, outperforming traditional models.
☆ Improving Quaternion Neural Networks with Quaternionic Activation Functions
In this paper, we propose novel quaternion activation functions where we modify either the quaternion magnitude or the phase, as an alternative to the commonly used split activation functions. We define criteria that are relevant for quaternion activation functions, and subsequently we propose our novel activation functions based on this analysis. Instead of applying a known activation function like the ReLU or Tanh on the quaternion elements separately, these activation functions consider the quaternion properties and respect the quaternion space $\mathbb{H}$. In particular, all quaternion components are utilized to calculate all output components, carrying out the benefit of the Hamilton product in e.g. the quaternion convolution to the activation functions. The proposed activation functions can be incorporated in arbitrary quaternion valued neural networks trained with gradient descent techniques. We further discuss the derivatives of the proposed activation functions where we observe beneficial properties for the activation functions affecting the phase. Specifically, they prove to be sensitive on basically the whole input range, thus improved gradient flow can be expected. We provide an elaborate experimental evaluation of our proposed quaternion activation functions including comparison with the split ReLU and split Tanh on two image classification tasks using the CIFAR-10 and SVHN dataset. There, especially the quaternion activation functions affecting the phase consistently prove to provide better performance.
☆ DaLPSR: Leverage Degradation-Aligned Language Prompt for Real-World Image Super-Resolution
Image super-resolution pursuits reconstructing high-fidelity high-resolution counterpart for low-resolution image. In recent years, diffusion-based models have garnered significant attention due to their capabilities with rich prior knowledge. The success of diffusion models based on general text prompts has validated the effectiveness of textual control in the field of text2image. However, given the severe degradation commonly presented in low-resolution images, coupled with the randomness characteristics of diffusion models, current models struggle to adequately discern semantic and degradation information within severely degraded images. This often leads to obstacles such as semantic loss, visual artifacts, and visual hallucinations, which pose substantial challenges for practical use. To address these challenges, this paper proposes to leverage degradation-aligned language prompt for accurate, fine-grained, and high-fidelity image restoration. Complementary priors including semantic content descriptions and degradation prompts are explored. Specifically, on one hand, image-restoration prompt alignment decoder is proposed to automatically discern the degradation degree of LR images, thereby generating beneficial degradation priors for image restoration. On the other hand, much richly tailored descriptions from pretrained multimodal large language model elicit high-level semantic priors closely aligned with human perception, ensuring fidelity control for image restoration. Comprehensive comparisons with state-of-the-art methods have been done on several popular synthetic and real-world benchmark datasets. The quantitative and qualitative analysis have demonstrated that the proposed method achieves a new state-of-the-art perceptual quality level, especially in real-world cases based on reference-free metrics.
☆ ResMaster: Mastering High-Resolution Image Generation via Structural and Fine-Grained Guidance
Diffusion models excel at producing high-quality images; however, scaling to higher resolutions, such as 4K, often results in over-smoothed content, structural distortions, and repetitive patterns. To this end, we introduce ResMaster, a novel, training-free method that empowers resolution-limited diffusion models to generate high-quality images beyond resolution restrictions. Specifically, ResMaster leverages a low-resolution reference image created by a pre-trained diffusion model to provide structural and fine-grained guidance for crafting high-resolution images on a patch-by-patch basis. To ensure a coherent global structure, ResMaster meticulously aligns the low-frequency components of high-resolution patches with the low-resolution reference at each denoising step. For fine-grained guidance, tailored image prompts based on the low-resolution reference and enriched textual prompts produced by a vision-language model are incorporated. This approach could significantly mitigate local pattern distortions and improve detail refinement. Extensive experiments validate that ResMaster sets a new benchmark for high-resolution image generation and demonstrates promising efficiency. The project page is https://shuweis.github.io/ResMaster .
☆ Seeking Certainty In Uncertainty: Dual-Stage Unified Framework Solving Uncertainty in Dynamic Facial Expression Recognition
The contemporary state-of-the-art of Dynamic Facial Expression Recognition (DFER) technology facilitates remarkable progress by deriving emotional mappings of facial expressions from video content, underpinned by training on voluminous datasets. Yet, the DFER datasets encompass a substantial volume of noise data. Noise arises from low-quality captures that defy logical labeling, and instances that suffer from mislabeling due to annotation bias, engendering two principal types of uncertainty: the uncertainty regarding data usability and the uncertainty concerning label reliability. Addressing the two types of uncertainty, we have meticulously crafted a two-stage framework aiming at \textbf{S}eeking \textbf{C}ertain data \textbf{I}n extensive \textbf{U}ncertain data (SCIU). This initiative aims to purge the DFER datasets of these uncertainties, thereby ensuring that only clean, verified data is employed in training processes. To mitigate the issue of low-quality samples, we introduce the Coarse-Grained Pruning (CGP) stage, which assesses sample weights and prunes those deemed unusable due to their low weight. For samples with incorrect annotations, the Fine-Grained Correction (FGC) stage evaluates prediction stability to rectify mislabeled data. Moreover, SCIU is conceived as a universally compatible, plug-and-play framework, tailored to integrate seamlessly with prevailing DFER methodologies. Rigorous experiments across prevalent DFER datasets and against numerous benchmark methods substantiates SCIU's capacity to markedly elevate performance metrics.
☆ Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration
To create culturally inclusive vision-language models (VLMs), the foremost requirement is developing a test benchmark that can diagnose the models' ability to respond to questions reflecting cultural elements. This paper addresses the necessity for such benchmarks, noting that existing research has relied on human annotators' manual efforts, which impedes diversity and efficiency. We propose a semi-automated pipeline for constructing cultural VLM benchmarks to enhance diversity and efficiency. This pipeline leverages human-VLM collaboration, where VLMs generate questions based on guidelines, human-annotated examples, and image-wise relevant knowledge, which are then reviewed by native speakers for quality and cultural relevance. The effectiveness of our adaptable pipeline is demonstrated through a specific application: creating a dataset tailored to Korean culture, dubbed K-Viscuit. The resulting benchmark features two types of questions: Type 1 questions measure visual recognition abilities, while Type 2 assess fine-grained visual reasoning skills. This ensures a thorough diagnosis of VLM models across various aspects. Our evaluation using K-Viscuit revealed that open-source models notably lag behind proprietary models in understanding Korean culture, highlighting areas for improvement. We provided diverse analyses of VLM performance across different cultural aspects. Besides, we explored the potential of incorporating external knowledge retrieval to enhance the generation process, suggesting future directions for improving cultural interpretation ability of VLMs. Our dataset and code will be made publicly available.
☆ SLOctolyzer: Fully automatic analysis toolkit for segmentation and feature extracting in scanning laser ophthalmoscopy images
Purpose: To describe SLOctolyzer: an open-source analysis toolkit for en face retinal vessels appearing in infrared reflectance scanning laser ophthalmoscopy (SLO) images. Methods: SLOctolyzer includes two main modules: segmentation and measurement. The segmentation module use deep learning methods to delineate retinal anatomy, while the measurement module quantifies key retinal vascular features such as vessel complexity, density, tortuosity, and calibre. We evaluate the segmentation module using unseen data and measure its reproducibility. Results: SLOctolyzer's segmentation module performed well against unseen internal test data (Dice for all-vessels, 0.9097; arteries, 0.8376; veins, 0.8525; optic disc, 0.9430; fovea, 0.8837). External validation against severe retinal pathology showed decreased performance (Dice for arteries, 0.7180; veins, 0.7470; optic disc, 0.9032). SLOctolyzer had good reproducibility (mean difference for fractal dimension, -0.0007; vessel density, -0.0003; vessel calibre, -0.3154 $\mu$m; tortuosity density, 0.0013). SLOctolyzer can process a macula-centred SLO image in under 20 seconds and a disc-centred SLO image in under 30 seconds using a standard laptop CPU. Conclusions: To our knowledge, SLOctolyzer is the first open-source tool to convert raw SLO images into reproducible and clinically meaningful retinal vascular parameters. SLO images are captured simultaneous to optical coherence tomography (OCT), and we believe our software will be useful for extracting retinal vascular measurements from large OCT image sets and linking them to ocular or systemic diseases. It requires no specialist knowledge or proprietary software, and allows manual correction of segmentations and re-computing of vascular metrics. SLOctolyzer is freely available at https://github.com/jaburke166/SLOctolyzer.
comment: 10 pages, 5 figures, 6 tables + Supplementary (7 pages, 10 figures, 4 tables). Submitted for peer review at Translational Vision Science and Technology
☆ InterCLIP-MEP: Interactive CLIP and Memory-Enhanced Predictor for Multi-modal Sarcasm Detection
The prevalence of sarcasm in social media, conveyed through text-image combinations, presents significant challenges for sentiment analysis and intention mining. Current multi-modal sarcasm detection methods have been proven to struggle with biases from spurious cues, leading to a superficial understanding of the complex interactions between text and image. To address these issues, we propose InterCLIP-MEP, a robust framework for multi-modal sarcasm detection. InterCLIP-MEP introduces a refined variant of CLIP, Interactive CLIP (InterCLIP), as the backbone, enhancing sample representations by embedding cross-modality information in each encoder. Furthermore, a novel training strategy is designed to adapt InterCLIP for a Memory-Enhanced Predictor (MEP). MEP uses dynamic dual-channel memory to store valuable historical knowledge of test samples and then leverages this memory as a non-parametric classifier to derive the final prediction. By using InterCLIP to encode text-image interactions more effectively and incorporating MEP, InterCLIP-MEP offers a more robust recognition of multi-modal sarcasm. Experiments demonstrate that InterCLIP-MEP achieves state-of-the-art performance on the MMSD2.0 benchmark. Code and data are available at [https://github.com/CoderChen01/InterCLIP-MEP](https://github.com/CoderChen01/InterCLIP-MEP).
comment: 8 pages, 6 figures, 6 tables
☆ Suppressing Uncertainties in Degradation Estimation for Blind Super-Resolution
The problem of blind image super-resolution aims to recover high-resolution (HR) images from low-resolution (LR) images with unknown degradation modes. Most existing methods model the image degradation process using blur kernels. However, this explicit modeling approach struggles to cover the complex and varied degradation processes encountered in the real world, such as high-order combinations of JPEG compression, blur, and noise. Implicit modeling for the degradation process can effectively overcome this issue, but a key challenge of implicit modeling is the lack of accurate ground truth labels for the degradation process to conduct supervised training. To overcome this limitations inherent in implicit modeling, we propose an \textbf{U}ncertainty-based degradation representation for blind \textbf{S}uper-\textbf{R}esolution framework (\textbf{USR}). By suppressing the uncertainty of local degradation representations in images, USR facilitated self-supervised learning of degradation representations. The USR consists of two components: Adaptive Uncertainty-Aware Degradation Extraction (AUDE) and a feature extraction network composed of Variable Depth Dynamic Convolution (VDDC) blocks. To extract Uncertainty-based Degradation Representation from LR images, the AUDE utilizes the Self-supervised Uncertainty Contrast module with Uncertainty Suppression Loss to suppress the inherent model uncertainty of the Degradation Extractor. Furthermore, VDDC block integrates degradation information through dynamic convolution. Rhe VDDC also employs an Adaptive Intensity Scaling operation that adaptively adjusts the degradation representation according to the network hierarchy, thereby facilitating the effective integration of degradation information. Quantitative and qualitative experiments affirm the superiority of our approach.
☆ Evaluating and Analyzing Relationship Hallucinations in LVLMs ICML2024
The issue of hallucinations is a prevalent concern in existing Large Vision-Language Models (LVLMs). Previous efforts have primarily focused on investigating object hallucinations, which can be easily alleviated by introducing object detectors. However, these efforts neglect hallucinations in inter-object relationships, which is essential for visual comprehension. In this work, we introduce R-Bench, a novel benchmark for evaluating Vision Relationship Hallucination. R-Bench features image-level questions that focus on the existence of relationships and instance-level questions that assess local visual comprehension. We identify three types of relationship co-occurrences that lead to hallucinations: relationship-relationship, subject-relationship, and relationship-object. The visual instruction tuning dataset's long-tail distribution significantly impacts LVLMs' understanding of visual relationships. Furthermore, our analysis reveals that current LVLMs tend to disregard visual content and overly rely on the common sense knowledge of Large Language Models. They also struggle with reasoning about spatial relationships based on contextual information.
comment: ICML2024
☆ EmoLLM: Multimodal Emotional Understanding Meets Large Language Models
Multi-modal large language models (MLLMs) have achieved remarkable performance on objective multimodal perception tasks, but their ability to interpret subjective, emotionally nuanced multimodal content remains largely unexplored. Thus, it impedes their ability to effectively understand and react to the intricate emotions expressed by humans through multimodal media. To bridge this gap, we introduce EmoBench, the first comprehensive benchmark designed specifically to evaluate the emotional capabilities of MLLMs across five popular emotional tasks, using a diverse dataset of 287k images and videos paired with corresponding textual instructions. Meanwhile, we propose EmoLLM, a novel model for multimodal emotional understanding, incorporating with two core techniques. 1) Multi-perspective Visual Projection, it captures diverse emotional cues from visual data from multiple perspectives. 2) EmoPrompt, it guides MLLMs to reason about emotions in the correct direction. Experimental results demonstrate that EmoLLM significantly elevates multimodal emotional understanding performance, with an average improvement of 12.1% across multiple foundation models on EmoBench. Our work contributes to the advancement of MLLMs by facilitating a deeper and more nuanced comprehension of intricate human emotions, paving the way for the development of artificial emotional intelligence capabilities with wide-ranging applications in areas such as human-computer interaction, mental health support, and empathetic AI systems. Code, data, and model will be released.
comment: 9 pages
☆ Exploring Test-Time Adaptation for Object Detection in Continually Changing Environments
For real-world applications, neural network models are commonly deployed in dynamic environments, where the distribution of the target domain undergoes temporal changes. Continual Test-Time Adaptation (CTTA) has recently emerged as a promising technique to gradually adapt a source-trained model to test data drawn from a continually changing target domain. Despite recent advancements in addressing CTTA, two critical issues remain: 1) The use of a fixed threshold for pseudo-labeling in existing methodologies leads to the generation of low-quality pseudo-labels, as model confidence varies across categories and domains; 2) While current solutions utilize stochastic parameter restoration to mitigate catastrophic forgetting, their capacity to preserve critical information is undermined by its intrinsic randomness. To tackle these challenges, we present CTAOD, aiming to enhance the performance of detection models in CTTA scenarios. Inspired by prior CTTA works for effective adaptation, CTAOD is founded on the mean-teacher framework, characterized by three core components. Firstly, the object-level contrastive learning module tailored for object detection extracts object-level features using the teacher's region of interest features and optimizes them through contrastive learning. Secondly, the dynamic threshold strategy updates the category-specific threshold based on predicted confidence scores to improve the quality of pseudo-labels. Lastly, we design a data-driven stochastic restoration mechanism to selectively reset inactive parameters using the gradients as weights for a random mask matrix, thereby ensuring the retention of essential knowledge. We demonstrate the effectiveness of our approach on four CTTA tasks for object detection, where CTAOD outperforms existing methods, especially achieving a 3.0 mAP improvement on the Cityscapes-to-Cityscapes-C CTTA task.
☆ Multi-threshold Deep Metric Learning for Facial Expression Recognition
Effective expression feature representations generated by a triplet-based deep metric learning are highly advantageous for facial expression recognition (FER). The performance of triplet-based deep metric learning is contingent upon identifying the best threshold for triplet loss. Threshold validation, however, is tough and challenging, as the ideal threshold changes among datasets and even across classes within the same dataset. In this paper, we present the multi-threshold deep metric learning technique, which not only avoids the difficult threshold validation but also vastly increases the capacity of triplet loss learning to construct expression feature representations. We find that each threshold of the triplet loss intrinsically determines a distinctive distribution of inter-class variations and corresponds, thus, to a unique expression feature representation. Therefore, rather than selecting a single optimal threshold from a valid threshold range, we thoroughly sample thresholds across the range, allowing the representation characteristics manifested by thresholds within the range to be fully extracted and leveraged for FER. To realize this approach, we partition the embedding layer of the deep metric learning network into a collection of slices and model training these embedding slices as an end-to-end multi-threshold deep metric learning problem. Each embedding slice corresponds to a sample threshold and is learned by enforcing the corresponding triplet loss, yielding a set of distinct expression features, one for each embedding slice. It makes the embedding layer, which is composed of a set of slices, a more informative and discriminative feature, hence enhancing the FER accuracy. Extensive evaluations demonstrate the superior performance of the proposed approach on both posed and spontaneous facial expression datasets.
comment: accepted by Pattern Recognition
☆ Dynamic Pseudo Label Optimization in Point-Supervised Nuclei Segmentation MICCAI2024
Deep learning has achieved impressive results in nuclei segmentation, but the massive requirement for pixel-wise labels remains a significant challenge. To alleviate the annotation burden, existing methods generate pseudo masks for model training using point labels. However, the generated masks are inevitably different from the ground truth, and these dissimilarities are not handled reasonably during the network training, resulting in the subpar performance of the segmentation model. To tackle this issue, we propose a framework named DoNuSeg, enabling \textbf{D}ynamic pseudo label \textbf{O}ptimization in point-supervised \textbf{Nu}clei \textbf{Seg}mentation. Specifically, DoNuSeg takes advantage of class activation maps (CAMs) to adaptively capture regions with semantics similar to annotated points. To leverage semantic diversity in the hierarchical feature levels, we design a dynamic selection module to choose the optimal one among CAMs from different encoder blocks as pseudo masks. Meanwhile, a CAM-guided contrastive module is proposed to further enhance the accuracy of pseudo masks. In addition to exploiting the semantic information provided by CAMs, we consider location priors inherent to point labels, developing a task-decoupled structure for effectively differentiating nuclei. Extensive experiments demonstrate that DoNuSeg outperforms state-of-the-art point-supervised methods. The code is available at https://github.com/shinning0821/MICCAI24-DoNuSeg.
comment: early accepted by MICCAI2024
☆ Exploring Cross-Domain Few-Shot Classification via Frequency-Aware Prompting
Cross-Domain Few-Shot Learning has witnessed great stride with the development of meta-learning. However, most existing methods pay more attention to learning domain-adaptive inductive bias (meta-knowledge) through feature-wise manipulation or task diversity improvement while neglecting the phenomenon that deep networks tend to rely more on high-frequency cues to make the classification decision, which thus degenerates the robustness of learned inductive bias since high-frequency information is vulnerable and easy to be disturbed by noisy information. Hence in this paper, we make one of the first attempts to propose a Frequency-Aware Prompting method with mutual attention for Cross-Domain Few-Shot classification, which can let networks simulate the human visual perception of selecting different frequency cues when facing new recognition tasks. Specifically, a frequency-aware prompting mechanism is first proposed, in which high-frequency components of the decomposed source image are switched either with normal distribution sampling or zeroing to get frequency-aware augment samples. Then, a mutual attention module is designed to learn generalizable inductive bias under CD-FSL settings. More importantly, the proposed method is a plug-and-play module that can be directly applied to most off-the-shelf CD-FLS methods. Experimental results on CD-FSL benchmarks demonstrate the effectiveness of our proposed method as well as robustly improve the performance of existing CD-FLS methods. Resources at https://github.com/tinkez/FAP_CDFSC.
☆ High-resolution open-vocabulary object 6D pose estimation CVPR
The generalisation to unseen objects in the 6D pose estimation task is very challenging. While Vision-Language Models (VLMs) enable using natural language descriptions to support 6D pose estimation of unseen objects, these solutions underperform compared to model-based methods. In this work we present Horyon, an open-vocabulary VLM-based architecture that addresses relative pose estimation between two scenes of an unseen object, described by a textual prompt only. We use the textual prompt to identify the unseen object in the scenes and then obtain high-resolution multi-scale features. These features are used to extract cross-scene matches for registration. We evaluate our model on a benchmark with a large variety of unseen objects across four datasets, namely REAL275, Toyota-Light, Linemod, and YCB-Video. Our method achieves state-of-the-art performance on all datasets, outperforming by 12.6 in Average Recall the previous best-performing approach.
comment: Technical report. Extension of CVPR paper "Open-vocabulary object 6D pose estimation". Project page: https://jcorsetti.github.io/oryon
☆ MIRReS: Multi-bounce Inverse Rendering using Reservoir Sampling
We present MIRReS, a novel two-stage inverse rendering framework that jointly reconstructs and optimizes the explicit geometry, material, and lighting from multi-view images. Unlike previous methods that rely on implicit irradiance fields or simplified path tracing algorithms, our method extracts an explicit geometry (triangular mesh) in stage one, and introduces a more realistic physically-based inverse rendering model that utilizes multi-bounce path tracing and Monte Carlo integration. By leveraging multi-bounce path tracing, our method effectively estimates indirect illumination, including self-shadowing and internal reflections, which improves the intrinsic decomposition of shape, material, and lighting. Moreover, we incorporate reservoir sampling into our framework to address the noise in Monte Carlo integration, enhancing convergence and facilitating gradient-based optimization with low sample counts. Through qualitative and quantitative evaluation of several scenarios, especially in challenging scenarios with complex shadows, we demonstrate that our method achieves state-of-the-art performance on decomposition results. Additionally, our optimized explicit geometry enables applications such as scene editing, relighting, and material editing with modern graphics engines or CAD software. The source code is available at https://brabbitdousha.github.io/MIRReS/
comment: 16 pages, 14 figures
☆ Improving Generative Adversarial Networks for Video Super-Resolution
In this research, we explore different ways to improve generative adversarial networks for video super-resolution tasks from a base single image super-resolution GAN model. Our primary objective is to identify potential techniques that enhance these models and to analyze which of these techniques yield the most significant improvements. We evaluate our results using Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM). Our findings indicate that the most effective techniques include temporal smoothing, long short-term memory (LSTM) layers, and a temporal loss function. The integration of these methods results in an 11.97% improvement in PSNR and an 8% improvement in SSIM compared to the baseline video super-resolution generative adversarial network (GAN) model. This substantial improvement suggests potential further applications to enhance current state-of-the-art models.
☆ Directed Domain Fine-Tuning: Tailoring Separate Modalities for Specific Training Tasks
Large language models (LLMs) and large visual language models (LVLMs) have been at the forefront of the artificial intelligence field, particularly for tasks like text generation, video captioning, and question-answering. Typically, it is more applicable to train these models on broader knowledge bases or datasets to increase generalizability, learn relationships between topics, and recognize patterns. Instead, we propose to provide instructional datasets specific to the task of each modality within a distinct domain and then fine-tune the parameters of the model using LORA. With our approach, we can eliminate all noise irrelevant to the given task while also ensuring that the model generates with enhanced precision. For this work, we use Video-LLaVA to generate recipes given cooking videos without transcripts. Video-LLaVA's multimodal architecture allows us to provide cooking images to its image encoder, cooking videos to its video encoder, and general cooking questions to its text encoder. Thus, we aim to remove all noise unrelated to cooking while improving our model's capabilities to generate specific ingredient lists and detailed instructions. As a result, our approach to fine-tuning Video-LLaVA leads to gains over the baseline Video-LLaVA by 2% on the YouCook2 dataset. While this may seem like a marginal increase, our model trains on an image instruction dataset 2.5% the size of Video-LLaVA's and a video instruction dataset 23.76% of Video-LLaVA's.
☆ VideoHallucer: Evaluating Intrinsic and Extrinsic Hallucinations in Large Video-Language Models
Recent advancements in Multimodal Large Language Models (MLLMs) have extended their capabilities to video understanding. Yet, these models are often plagued by "hallucinations", where irrelevant or nonsensical content is generated, deviating from the actual video context. This work introduces VideoHallucer, the first comprehensive benchmark for hallucination detection in large video-language models (LVLMs). VideoHallucer categorizes hallucinations into two main types: intrinsic and extrinsic, offering further subcategories for detailed analysis, including object-relation, temporal, semantic detail, extrinsic factual, and extrinsic non-factual hallucinations. We adopt an adversarial binary VideoQA method for comprehensive evaluation, where pairs of basic and hallucinated questions are crafted strategically. By evaluating eleven LVLMs on VideoHallucer, we reveal that i) the majority of current models exhibit significant issues with hallucinations; ii) while scaling datasets and parameters improves models' ability to detect basic visual cues and counterfactuals, it provides limited benefit for detecting extrinsic factual hallucinations; iii) existing models are more adept at detecting facts than identifying hallucinations. As a byproduct, these analyses further instruct the development of our self-PEP framework, achieving an average of 5.38% improvement in hallucination resistance across all model architectures.
Prompt-Consistency Image Generation (PCIG): A Unified Framework Integrating LLMs, Knowledge Graphs, and Controllable Diffusion Models
The rapid advancement of Text-to-Image(T2I) generative models has enabled the synthesis of high-quality images guided by textual descriptions. Despite this significant progress, these models are often susceptible in generating contents that contradict the input text, which poses a challenge to their reliability and practical deployment. To address this problem, we introduce a novel diffusion-based framework to significantly enhance the alignment of generated images with their corresponding descriptions, addressing the inconsistency between visual output and textual input. Our framework is built upon a comprehensive analysis of inconsistency phenomena, categorizing them based on their manifestation in the image. Leveraging a state-of-the-art large language module, we first extract objects and construct a knowledge graph to predict the locations of these objects in potentially generated images. We then integrate a state-of-the-art controllable image generation model with a visual text generation module to generate an image that is consistent with the original prompt, guided by the predicted object locations. Through extensive experiments on an advanced multimodal hallucination benchmark, we demonstrate the efficacy of our approach in accurately generating the images without the inconsistency with the original prompt. The code can be accessed via https://github.com/TruthAI-Lab/PCIG.
☆ Lesion-Aware Cross-Phase Attention Network for Renal Tumor Subtype Classification on Multi-Phase CT Scans
Multi-phase computed tomography (CT) has been widely used for the preoperative diagnosis of kidney cancer due to its non-invasive nature and ability to characterize renal lesions. However, since enhancement patterns of renal lesions across CT phases are different even for the same lesion type, the visual assessment by radiologists suffers from inter-observer variability in clinical practice. Although deep learning-based approaches have been recently explored for differential diagnosis of kidney cancer, they do not explicitly model the relationships between CT phases in the network design, limiting the diagnostic performance. In this paper, we propose a novel lesion-aware cross-phase attention network (LACPANet) that can effectively capture temporal dependencies of renal lesions across CT phases to accurately classify the lesions into five major pathological subtypes from time-series multi-phase CT images. We introduce a 3D inter-phase lesion-aware attention mechanism to learn effective 3D lesion features that are used to estimate attention weights describing the inter-phase relations of the enhancement patterns. We also present a multi-scale attention scheme to capture and aggregate temporal patterns of lesion features at different spatial scales for further improvement. Extensive experiments on multi-phase CT scans of kidney cancer patients from the collected dataset demonstrate that our LACPANet outperforms state-of-the-art approaches in diagnostic accuracy.
comment: This article has been accepted for publication in Computers in Biology and Medicine
☆ Artistic-style text detector and a new Movie-Poster dataset
Although current text detection algorithms demonstrate effectiveness in general scenarios, their performance declines when confronted with artistic-style text featuring complex structures. This paper proposes a method that utilizes Criss-Cross Attention and residual dense block to address the incomplete and misdiagnosis of artistic-style text detection by current algorithms. Specifically, our method mainly consists of a feature extraction backbone, a feature enhancement network, a multi-scale feature fusion module, and a boundary discrimination module. The feature enhancement network significantly enhances the model's perceptual capabilities in complex environments by fusing horizontal and vertical contextual information, allowing it to capture detailed features overlooked in artistic-style text. We incorporate residual dense block into the Feature Pyramid Network to suppress the effect of background noise during feature fusion. Aiming to omit the complex post-processing, we explore a boundary discrimination module that guides the correct generation of boundary proposals. Furthermore, given that movie poster titles often use stylized art fonts, we collected a Movie-Poster dataset to address the scarcity of artistic-style text data. Extensive experiments demonstrate that our proposed method performs superiorly on the Movie-Poster dataset and produces excellent results on multiple benchmark datasets. The code and the Movie-Poster dataset will be available at: https://github.com/biedaxiaohua/Artistic-style-text-detection
☆ UBiSS: A Unified Framework for Bimodal Semantic Summarization of Videos ICMR'24
With the surge in the amount of video data, video summarization techniques, including visual-modal(VM) and textual-modal(TM) summarization, are attracting more and more attention. However, unimodal summarization inevitably loses the rich semantics of the video. In this paper, we focus on a more comprehensive video summarization task named Bimodal Semantic Summarization of Videos (BiSSV). Specifically, we first construct a large-scale dataset, BIDS, in (video, VM-Summary, TM-Summary) triplet format. Unlike traditional processing methods, our construction procedure contains a VM-Summary extraction algorithm aiming to preserve the most salient content within long videos. Based on BIDS, we propose a Unified framework UBiSS for the BiSSV task, which models the saliency information in the video and generates a TM-summary and VM-summary simultaneously. We further optimize our model with a list-wise ranking-based objective to improve its capacity to capture highlights. Lastly, we propose a metric, $NDCG_{MS}$, to provide a joint evaluation of the bimodal summary. Experiments show that our unified framework achieves better performance than multi-stage summarization pipelines. Code and data are available at https://github.com/MeiYutingg/UBiSS.
comment: Accepted by ACM International Conference on Multimedia Retrieval (ICMR'24)
☆ Priorformer: A UGC-VQA Method with content and distortion priors
User Generated Content (UGC) videos are susceptible to complicated and variant degradations and contents, which prevents the existing blind video quality assessment (BVQA) models from good performance since the lack of the adapability of distortions and contents. To mitigate this, we propose a novel prior-augmented perceptual vision transformer (PriorFormer) for the BVQA of UGC, which boots its adaptability and representation capability for divergent contents and distortions. Concretely, we introduce two powerful priors, i.e., the content and distortion priors, by extracting the content and distortion embeddings from two pre-trained feature extractors. Then we adopt these two powerful embeddings as the adaptive prior tokens, which are transferred to the vision transformer backbone jointly with implicit quality features. Based on the above strategy, the proposed PriorFormer achieves state-of-the-art performance on three public UGC VQA datasets including KoNViD-1K, LIVE-VQC and YouTube-UGC.
comment: 7 pages
☆ Crowd-Sourced NeRF: Collecting Data from Production Vehicles for 3D Street View Reconstruction
Recently, Neural Radiance Fields (NeRF) achieved impressive results in novel view synthesis. Block-NeRF showed the capability of leveraging NeRF to build large city-scale models. For large-scale modeling, a mass of image data is necessary. Collecting images from specially designed data-collection vehicles can not support large-scale applications. How to acquire massive high-quality data remains an opening problem. Noting that the automotive industry has a huge amount of image data, crowd-sourcing is a convenient way for large-scale data collection. In this paper, we present a crowd-sourced framework, which utilizes substantial data captured by production vehicles to reconstruct the scene with the NeRF model. This approach solves the key problem of large-scale reconstruction, that is where the data comes from and how to use them. Firstly, the crowd-sourced massive data is filtered to remove redundancy and keep a balanced distribution in terms of time and space. Then a structure-from-motion module is performed to refine camera poses. Finally, images, as well as poses, are used to train the NeRF model in a certain block. We highlight that we present a comprehensive framework that integrates multiple modules, including data selection, sparse 3D reconstruction, sequence appearance embedding, depth supervision of ground surface, and occlusion completion. The complete system is capable of effectively processing and reconstructing high-quality 3D scenes from crowd-sourced data. Extensive quantitative and qualitative experiments were conducted to validate the performance of our system. Moreover, we proposed an application, named first-view navigation, which leveraged the NeRF model to generate 3D street view and guide the driver with a synthesized video.
☆ SegNet4D: Effective and Efficient 4D LiDAR Semantic Segmentation in Autonomous Driving Environments
4D LiDAR semantic segmentation, also referred to as multi-scan semantic segmentation, plays a crucial role in enhancing the environmental understanding capabilities of autonomous vehicles. It entails identifying the semantic category of each point in the LiDAR scan and distinguishing whether it is dynamic, a critical aspect in downstream tasks such as path planning and autonomous navigation. Existing methods for 4D semantic segmentation often rely on computationally intensive 4D convolutions for multi-scan input, resulting in poor real-time performance. In this article, we introduce SegNet4D, a novel real-time multi-scan semantic segmentation method leveraging a projection-based approach for fast motion feature encoding, showcasing outstanding performance. SegNet4D treats 4D semantic segmentation as two distinct tasks: single-scan semantic segmentation and moving object segmentation, each addressed by dedicated head. These results are then fused in the proposed motion-semantic fusion module to achieve comprehensive multi-scan semantic segmentation. Besides, we propose extracting instance information from the current scan and incorporating it into the network for instance-aware segmentation. Our approach exhibits state-of-the-art performance across multiple datasets and stands out as a real-time multi-scan semantic segmentation method. The implementation of SegNet4D will be made available at \url{https://github.com/nubot-nudt/SegNet4D}.
comment: 10 pages, 5 figures
☆ YouDream: Generating Anatomically Controllable Consistent Text-to-3D Animals
3D generation guided by text-to-image diffusion models enables the creation of visually compelling assets. However previous methods explore generation based on image or text. The boundaries of creativity are limited by what can be expressed through words or the images that can be sourced. We present YouDream, a method to generate high-quality anatomically controllable animals. YouDream is guided using a text-to-image diffusion model controlled by 2D views of a 3D pose prior. Our method generates 3D animals that are not possible to create using previous text-to-3D generative methods. Additionally, our method is capable of preserving anatomic consistency in the generated animals, an area where prior text-to-3D approaches often struggle. Moreover, we design a fully automated pipeline for generating commonly found animals. To circumvent the need for human intervention to create a 3D pose, we propose a multi-agent LLM that adapts poses from a limited library of animal 3D poses to represent the desired animal. A user study conducted on the outcomes of YouDream demonstrates the preference of the animal models generated by our method over others. Turntable results and code are released at https://youdream3d.github.io/
☆ Repairing Catastrophic-Neglect in Text-to-Image Diffusion Models via Attention-Guided Feature Enhancement
Text-to-Image Diffusion Models (T2I DMs) have garnered significant attention for their ability to generate high-quality images from textual descriptions. However, these models often produce images that do not fully align with the input prompts, resulting in semantic inconsistencies. The most prominent issue among these semantic inconsistencies is catastrophic-neglect, where the images generated by T2I DMs miss key objects mentioned in the prompt. We first conduct an empirical study on this issue, exploring the prevalence of catastrophic-neglect, potential mitigation strategies with feature enhancement, and the insights gained. Guided by the empirical findings, we propose an automated repair approach named Patcher to address catastrophic-neglect in T2I DMs. Specifically, Patcher first determines whether there are any neglected objects in the prompt, and then applies attention-guided feature enhancement to these neglected objects, resulting in a repaired prompt. Experimental results on three versions of Stable Diffusion demonstrate that Patcher effectively repairs the issue of catastrophic-neglect, achieving 10.1%-16.3% higher Correct Rate in image generation compared to baselines.
comment: 11 pages, 3 figures
☆ Feature-prompting GBMSeg: One-Shot Reference Guided Training-Free Prompt Engineering for Glomerular Basement Membrane Segmentation MICCAI2024
Assessment of the glomerular basement membrane (GBM) in transmission electron microscopy (TEM) is crucial for diagnosing chronic kidney disease (CKD). The lack of domain-independent automatic segmentation tools for the GBM necessitates an AI-based solution to automate the process. In this study, we introduce GBMSeg, a training-free framework designed to automatically segment the GBM in TEM images guided only by a one-shot annotated reference. Specifically, GBMSeg first exploits the robust feature matching capabilities of the pretrained foundation model to generate initial prompt points, then introduces a series of novel automatic prompt engineering techniques across the feature and physical space to optimize the prompt scheme. Finally, GBMSeg employs a class-agnostic foundation segmentation model with the generated prompt scheme to obtain accurate segmentation results. Experimental results on our collected 2538 TEM images confirm that GBMSeg achieves superior segmentation performance with a Dice similarity coefficient (DSC) of 87.27% using only one labeled reference image in a training-free manner, outperforming recently proposed one-shot or few-shot methods. In summary, GBMSeg introduces a distinctive automatic prompt framework that facilitates robust domain-independent segmentation performance without training, particularly advancing the automatic prompting of foundation segmentation models for medical images. Future work involves automating the thickness measurement of segmented GBM and quantifying pathological indicators, holding significant potential for advancing pathology assessments in clinical applications. The source code is available on https://github.com/SnowRain510/GBMSeg
comment: Accepted for MICCAI2024
☆ Video-Infinity: Distributed Long Video Generation
Diffusion models have recently achieved remarkable results for video generation. Despite the encouraging performances, the generated videos are typically constrained to a small number of frames, resulting in clips lasting merely a few seconds. The primary challenges in producing longer videos include the substantial memory requirements and the extended processing time required on a single GPU. A straightforward solution would be to split the workload across multiple GPUs, which, however, leads to two issues: (1) ensuring all GPUs communicate effectively to share timing and context information, and (2) modifying existing video diffusion models, which are usually trained on short sequences, to create longer videos without additional training. To tackle these, in this paper we introduce Video-Infinity, a distributed inference pipeline that enables parallel processing across multiple GPUs for long-form video generation. Specifically, we propose two coherent mechanisms: Clip parallelism and Dual-scope attention. Clip parallelism optimizes the gathering and sharing of context information across GPUs which minimizes communication overhead, while Dual-scope attention modulates the temporal self-attention to balance local and global contexts efficiently across the devices. Together, the two mechanisms join forces to distribute the workload and enable the fast generation of long videos. Under an 8 x Nvidia 6000 Ada GPU (48G) setup, our method generates videos up to 2,300 frames in approximately 5 minutes, enabling long video generation at a speed 100 times faster than the prior methods.
☆ POPCat: Propagation of particles for complex annotation tasks
Novel dataset creation for all multi-object tracking, crowd-counting, and industrial-based videos is arduous and time-consuming when faced with a unique class that densely populates a video sequence. We propose a time efficient method called POPCat that exploits the multi-target and temporal features of video data to produce a semi-supervised pipeline for segmentation or box-based video annotation. The method retains the accuracy level associated with human level annotation while generating a large volume of semi-supervised annotations for greater generalization. The method capitalizes on temporal features through the use of a particle tracker to expand the domain of human-provided target points. This is done through the use of a particle tracker to reassociate the initial points to a set of images that follow the labeled frame. A YOLO model is then trained with this generated data, and then rapidly infers on the target video. Evaluations are conducted on GMOT-40, AnimalTrack, and Visdrone-2019 benchmarks. These multi-target video tracking/detection sets contain multiple similar-looking targets, camera movements, and other features that would commonly be seen in "wild" situations. We specifically choose these difficult datasets to demonstrate the efficacy of the pipeline and for comparison purposes. The method applied on GMOT-40, AnimalTrack, and Visdrone shows a margin of improvement on recall/mAP50/mAP over the best results by a value of 24.5%/9.6%/4.8%, -/43.1%/27.8%, and 7.5%/9.4%/7.5% where metrics were collected.
comment: 10 pages, 5 figures, Accepted in "Conference on Robots and Vision 2024"
☆ Diff3Dformer: Leveraging Slice Sequence Diffusion for Enhanced 3D CT Classification with Transformer Networks
The manifestation of symptoms associated with lung diseases can vary in different depths for individual patients, highlighting the significance of 3D information in CT scans for medical image classification. While Vision Transformer has shown superior performance over convolutional neural networks in image classification tasks, their effectiveness is often demonstrated on sufficiently large 2D datasets and they easily encounter overfitting issues on small medical image datasets. To address this limitation, we propose a Diffusion-based 3D Vision Transformer (Diff3Dformer), which utilizes the latent space of the Diffusion model to form the slice sequence for 3D analysis and incorporates clustering attention into ViT to aggregate repetitive information within 3D CT scans, thereby harnessing the power of the advanced transformer in 3D classification tasks on small datasets. Our method exhibits improved performance on two different scales of small datasets of 3D lung CT scans, surpassing the state of the art 3D methods and other transformer-based approaches that emerged during the COVID-19 pandemic, demonstrating its robust and superior performance across different scales of data. Experimental results underscore the superiority of our proposed method, indicating its potential for enhancing medical image classification tasks in real-world scenarios.
comment: conference
☆ Virtual Mines -- Component-level recycling of printed circuit boards using deep learning
This contribution gives an overview of an ongoing project using machine learning and computer vision components for improving the electronic waste recycling process. In circular economy, the "virtual mines" concept refers to production cycles where interesting raw materials are reclaimed in an efficient and cost-effective manner from end-of-life items. In particular, the growth of e-waste, due to the increasingly shorter life cycle of hi-tech goods, is a global problem. In this paper, we describe a pipeline based on deep learning model to recycle printed circuit boards at the component level. A pre-trained YOLOv5 model is used to analyze the results of the locally developed dataset. With a different distribution of class instances, YOLOv5 managed to achieve satisfactory precision and recall, with the ability to optimize with large component instances.
comment: 10 pages, 5 figures
☆ Unambiguous Recognition Should Not Rely Solely on Natural Language Training
In LaTeX text recognition using Transformer-based architectures, this paper identifies certain "bias" issues. For instance, $e-t$ is frequently misrecognized as $e^{-t}$. This bias stems from the inherent characteristics of the dataset. To mitigate this bias, we propose a LaTeX printed text recognition model trained on a mixed dataset of pseudo-formulas and pseudo-text. The model employs a Swin Transformer as the encoder and a RoBERTa model as the decoder. Experimental results demonstrate that this approach reduces "bias", enhancing the accuracy and robustness of text recognition. For clear images, the model strictly adheres to the image content; for blurred images, it integrates both image and contextual information to produce reasonable recognition results.
☆ Vastextures: Vast repository of textures and PBR materials extracted from real-world images using unsupervised methods
Vastextures is a vast repository of 500,000 textures and PBR materials extracted from real-world images using an unsupervised process. The extracted materials and textures are extremely diverse and cover a vast range of real-world patterns, but at the same time less refined compared to existing repositories. The repository is composed of 2D textures cropped from natural images and SVBRDF/PBR materials generated from these textures. Textures and PBR materials are essential for CGI. Existing materials repositories focus on games, animation, and arts, that demand a limited amount of high-quality assets. However, virtual worlds and synthetic data are becoming increasingly important for training A.I systems for computer vision. This application demands a huge amount of diverse assets but at the same time less affected by noisy and unrefined assets. Vastexture aims to address this need by creating a free, huge, and diverse assets repository that covers as many real-world materials as possible. The materials are automatically extracted from natural images in two steps: 1) Automatically scanning a giant amount of images to identify and crop regions with uniform textures. This is done by splitting the image into a grid of cells and identifying regions in which all of the cells share a similar statistical distribution. 2) Extracting the properties of the PBR material from the cropped texture. This is done by randomly guessing every correlation between the properties of the texture image and the properties of the PBR material. The resulting PBR materials exhibit a vast amount of real-world patterns as well as unexpected emergent properties. Neutral nets trained on this repository outperformed nets trained using handcrafted assets.
comment: Vastexture was published as part of Learning Zero-Shot Material States Segmentation, by Implanting Natural Image Patterns in Synthetic Data, refer to this work in citations. This document gives a more detailed and technical discussion of this repository
☆ MM-SpuBench: Towards Better Understanding of Spurious Biases in Multimodal LLMs
Spurious bias, a tendency to use spurious correlations between non-essential input attributes and target variables for predictions, has revealed a severe robustness pitfall in deep learning models trained on single modality data. Multimodal Large Language Models (MLLMs), which integrate both vision and language models, have demonstrated strong capability in joint vision-language understanding. However, whether spurious biases are prevalent in MLLMs remains under-explored. We mitigate this gap by analyzing the spurious biases in a multimodal setting, uncovering the specific test data patterns that can manifest this problem when biases in the vision model cascade into the alignment between visual and text tokens in MLLMs. To better understand this problem, we introduce MM-SpuBench, a comprehensive visual question-answering (VQA) benchmark designed to evaluate MLLMs' reliance on nine distinct categories of spurious correlations from five open-source image datasets. The VQA dataset is built from human-understandable concept information (attributes). Leveraging this benchmark, we conduct a thorough evaluation of current state-of-the-art MLLMs. Our findings illuminate the persistence of the reliance on spurious correlations from these models and underscore the urge for new methodologies to mitigate spurious biases. To support the MLLM robustness research, we release our VQA benchmark at https://huggingface.co/datasets/mmbench/MM-SpuBench.
☆ Accelerating Phase Field Simulations Through a Hybrid Adaptive Fourier Neural Operator with U-Net Backbone
Prolonged contact between a corrosive liquid and metal alloys can cause progressive dealloying. For such liquid-metal dealloying (LMD) process, phase field models have been developed. However, the governing equations often involve coupled non-linear partial differential equations (PDE), which are challenging to solve numerically. In particular, stiffness in the PDEs requires an extremely small time steps (e.g. $10^{-12}$ or smaller). This computational bottleneck is especially problematic when running LMD simulation until a late time horizon is required. This motivates the development of surrogate models capable of leaping forward in time, by skipping several consecutive time steps at-once. In this paper, we propose U-Shaped Adaptive Fourier Neural Operators (U-AFNO), a machine learning (ML) model inspired by recent advances in neural operator learning. U-AFNO employs U-Nets for extracting and reconstructing local features within the physical fields, and passes the latent space through a vision transformer (ViT) implemented in the Fourier space (AFNO). We use U-AFNOs to learn the dynamics mapping the field at a current time step into a later time step. We also identify global quantities of interest (QoI) describing the corrosion process (e.g. the deformation of the liquid-metal interface) and show that our proposed U-AFNO model is able to accurately predict the field dynamics, in-spite of the chaotic nature of LMD. Our model reproduces the key micro-structure statistics and QoIs with a level of accuracy on-par with the high-fidelity numerical solver. We also investigate the opportunity of using hybrid simulations, in which we alternate forward leap in time using the U-AFNO with high-fidelity time stepping. We demonstrate that while advantageous for some surrogate model design choices, our proposed U-AFNO model in fully auto-regressive settings consistently outperforms hybrid schemes.
☆ Speeding Up Image Classifiers with Little Companions
Scaling up neural networks has been a key recipe to the success of large language and vision models. However, in practice, up-scaled models can be disproportionately costly in terms of computations, providing only marginal improvements in performance; for example, EfficientViT-L3-384 achieves <2% improvement on ImageNet-1K accuracy over the base L1-224 model, while requiring $14\times$ more multiply-accumulate operations (MACs). In this paper, we investigate scaling properties of popular families of neural networks for image classification, and find that scaled-up models mostly help with "difficult" samples. Decomposing the samples by difficulty, we develop a simple model-agnostic two-pass Little-Big algorithm that first uses a light-weight "little" model to make predictions of all samples, and only passes the difficult ones for the "big" model to solve. Good little companion achieve drastic MACs reduction for a wide variety of model families and scales. Without loss of accuracy or modification of existing models, our Little-Big models achieve MACs reductions of 76% for EfficientViT-L3-384, 81% for EfficientNet-B7-600, 71% for DeiT3-L-384 on ImageNet-1K. Little-Big also speeds up the InternImage-G-512 model by 62% while achieving 90% ImageNet-1K top-1 accuracy, serving both as a strong baseline and as a simple practical method for large model compression.
☆ Evaluating the Quality of Hallucination Benchmarks for Large Vision-Language Models
Despite the rapid progress and outstanding performance of Large Vision-Language Models (LVLMs) in recent years, LVLMs have been plagued by the issue of hallucination, i.e., LVLMs tend to generate responses that are inconsistent with the corresponding visual inputs. To evaluate the degree of hallucination in LVLMs, previous works have proposed a series of benchmarks featuring different types of tasks and evaluation metrics. However, we find that the quality of the existing hallucination benchmarks varies, with some suffering from problems, e.g., inconsistent evaluation results under repeated tests, and misalignment with human evaluation. To this end, we propose a Hallucination benchmark Quality Measurement framework (HQM), which leverages various indicators to assess the reliability and validity of existing hallucination benchmarks separately. Specifically, for reliability we explore test-retest reliability and parallel-forms reliability, while for validity we examine criterion validity and coverage of hallucination types. Furthermore, based on the results of our quality measurement, we construct a High-Quality Hallucination Benchmark (HQH) for LVLMs. We conduct an extensive evaluation of over 10 representative LVLMs, including GPT-4o and Gemini-Vision-Pro, to provide an in-depth analysis of the hallucination issues in existing models. Our benchmark is publicly available at https://github.com/HQHBench/HQHBench.
☆ GMT: Guided Mask Transformer for Leaf Instance Segmentation
Leaf instance segmentation is a challenging multi-instance segmentation task, aiming to separate and delineate each leaf in an image of a plant. The delineation of each leaf is a necessary prerequisite task for several biology-related applications such as the fine-grained monitoring of plant growth, and crop yield estimation. The task is challenging because self-similarity of instances is high (similar shape and colour) and instances vary greatly in size under heavy occulusion. We believe that the key to overcoming the aforementioned challenges lies in the specific spatial patterns of leaf distribution. For example, leaves typically grow around the plant's center, with smaller leaves clustering and overlapped near this central point. In this paper, we propose a novel approach named Guided Mask Transformer (GMT), which contains three key components, namely Guided Positional Encoding (GPE), Guided Embedding Fusion Module (GEFM) and Guided Dynamic Positional Queries (GDPQ), to extend the meta-architecture of Mask2Former and incorporate with a set of harmonic guide functions. These guide functions are tailored to the pixel positions of instances and trained to separate distinct instances in an embedding space. The proposed GMT consistently outperforms State-of-the-Art models on three public plant datasets.
☆ Fine-tuning Diffusion Models for Enhancing Face Quality in Text-to-image Generation
Diffusion models (DMs) have achieved significant success in generating imaginative images given textual descriptions. However, they are likely to fall short when it comes to real-life scenarios with intricate details.The low-quality, unrealistic human faces in text-to-image generation are one of the most prominent issues, hindering the wide application of DMs in practice. Targeting addressing such an issue, we first assess the face quality of generations from popular pre-trained DMs with the aid of human annotators and then evaluate the alignment between existing metrics such as ImageReward, Human Preference Score, Aesthetic Score Predictor, and Face Quality Assessment, with human judgments. Observing that existing metrics can be unsatisfactory for quantifying face quality, we develop a novel metric named Face Score (FS) by fine-tuning ImageReward on a dataset of (good, bad) face pairs cheaply crafted by an inpainting pipeline of DMs. Extensive studies reveal that FS enjoys a superior alignment with humans. On the other hand, FS opens up the door for refining DMs for better face generation. To achieve this, we incorporate a guidance loss on the denoising trajectories of the aforementioned face pairs for fine-tuning pre-trained DMs such as Stable Diffusion V1.5 and Realistic Vision V5.1. Intuitively, such a loss pushes the trajectory of bad faces toward that of good ones. Comprehensive experiments verify the efficacy of our approach for improving face quality while preserving general capability.
comment: Under review
☆ Multi-Aperture Fusion of Transformer-Convolutional Network (MFTC-Net) for 3D Medical Image Segmentation and Visualization
Vision Transformers have shown superior performance to the traditional convolutional-based frameworks in many vision applications, including but not limited to the segmentation of 3D medical images. To further advance this area, this study introduces the Multi-Aperture Fusion of Transformer-Convolutional Network (MFTC-Net), which integrates the output of Swin Transformers and their corresponding convolutional blocks using 3D fusion blocks. The Multi-Aperture incorporates each image patch at its original resolutions with its pyramid representation to better preserve minute details. The proposed architecture has demonstrated a score of 89.73 and 7.31 for Dice and HD95, respectively, on the Synapse multi-organs dataset an improvement over the published results. The improved performance also comes with the added benefits of the reduced complexity of approximately 40 million parameters. Our code is available at https://github.com/Siyavashshabani/MFTC-Net
☆ Reducing the Memory Footprint of 3D Gaussian Splatting
3D Gaussian splatting provides excellent visual quality for novel view synthesis, with fast training and real-time rendering; unfortunately, the memory requirements of this method for storing and transmission are unreasonably high. We first analyze the reasons for this, identifying three main areas where storage can be reduced: the number of 3D Gaussian primitives used to represent a scene, the number of coefficients for the spherical harmonics used to represent directional radiance, and the precision required to store Gaussian primitive attributes. We present a solution to each of these issues. First, we propose an efficient, resolution-aware primitive pruning approach, reducing the primitive count by half. Second, we introduce an adaptive adjustment method to choose the number of coefficients used to represent directional radiance for each Gaussian primitive, and finally a codebook-based quantization method, together with a half-float representation for further memory reduction. Taken together, these three components result in a 27 reduction in overall size on disk on the standard datasets we tested, along with a 1.7 speedup in rendering speed. We demonstrate our method on standard datasets and show how our solution results in significantly reduced download times when using the method on a mobile device.
comment: Project website: https://repo-sam.inria.fr/fungraph/reduced_3dgs/
☆ Leveraging Knowledge Distillation for Lightweight Skin Cancer Classification: Balancing Accuracy and Computational Efficiency
Skin cancer is a major concern to public health, accounting for one-third of the reported cancers. If not detected early, the cancer has the potential for severe consequences. Recognizing the critical need for effective skin cancer classification, we address the limitations of existing models, which are often too large to deploy in areas with limited computational resources. In response, we present a knowledge distillation based approach for creating a lightweight yet high-performing classifier. The proposed solution involves fusing three models, namely ResNet152V2, ConvNeXtBase, and ViT Base, to create an effective teacher model. The teacher model is then employed to guide a lightweight student model of size 2.03 MB. This student model is further compressed to 469.77 KB using 16-bit quantization, enabling smooth incorporation into edge devices. With six-stage image preprocessing, data augmentation, and a rigorous ablation study, the model achieves an impressive accuracy of 98.75% on the HAM10000 dataset and 98.94% on the Kaggle dataset in classifying benign and malignant skin cancers. With its high accuracy and compact size, our model appears to be a potential choice for accurate skin cancer classification, particularly in resource-constrained settings.
☆ Enhancing Scientific Figure Captioning Through Cross-modal Learning
Scientific charts are essential tools for effectively communicating research findings, serving as a vital medium for conveying information and revealing data patterns. With the rapid advancement of science and technology, coupled with the advent of the big data era, the volume and diversity of scientific research data have surged, leading to an increase in the number and variety of charts. This trend presents new challenges for researchers, particularly in efficiently and accurately generating appropriate titles for these charts to better convey their information and results. Automatically generated chart titles can enhance information retrieval systems by providing precise data for detailed chart classification. As research in image captioning and text summarization matures, the automatic generation of scientific chart titles has gained significant attention. By leveraging natural language processing, machine learning, and multimodal techniques, it is possible to automatically extract key information from charts and generate accurate, concise titles that better serve the needs of researchers. This paper presents a novel approach to scientific chart title generation, demonstrating its effectiveness in improving the clarity and accessibility of research data.
comment: 7 pages
☆ Dwarf: Disease-weighted network for attention map refinement
The interpretability of deep learning is crucial for evaluating the reliability of medical imaging models and reducing the risks of inaccurate patient recommendations. This study addresses the "human out of the loop" and "trustworthiness" issues in medical image analysis by integrating medical professionals into the interpretability process. We propose a disease-weighted attention map refinement network (Dwarf) that leverages expert feedback to enhance model relevance and accuracy. Our method employs cyclic training to iteratively improve diagnostic performance, generating precise and interpretable feature maps. Experimental results demonstrate significant improvements in interpretability and diagnostic accuracy across multiple medical imaging datasets. This approach fosters effective collaboration between AI systems and healthcare professionals, ultimately aiming to improve patient outcomes
♻ ☆ Single-image camera calibration with model-free distortion correction
Camera calibration is a process of paramount importance in computer vision applications that require accurate quantitative measurements. The popular method developed by Zhang relies on the use of a large number of images of a planar grid of fiducial points captured in multiple poses. Although flexible and easy to implement, Zhang's method has some limitations. The simultaneous optimization of the entire parameter set, including the coefficients of a predefined distortion model, may result in poor distortion correction at the image boundaries or in miscalculation of the intrinsic parameters, even with a reasonably small reprojection error. Indeed, applications involving image stitching (e.g. multi-camera systems) require accurate mapping of distortion up to the outermost regions of the image. Moreover, intrinsic parameters affect the accuracy of camera pose estimation, which is fundamental for applications such as vision servoing in robot navigation and automated assembly. This paper proposes a method for estimating the complete set of calibration parameters from a single image of a planar speckle pattern covering the entire sensor. The correspondence between image points and physical points on the calibration target is obtained using Digital Image Correlation. The effective focal length and the extrinsic parameters are calculated separately after a prior evaluation of the principal point. At the end of the procedure, a dense and uniform model-free distortion map is obtained over the entire image. Synthetic data with different noise levels were used to test the feasibility of the proposed method and to compare its metrological performance with Zhang's method. Real-world tests demonstrate the potential of the developed method to reveal aspects of the image formation that are hidden by averaging over multiple images.
comment: Accepted manuscript
♻ ☆ VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation
The recent years have witnessed great advances in video generation. However, the development of automatic video metrics is lagging significantly behind. None of the existing metric is able to provide reliable scores over generated videos. The main barrier is the lack of large-scale human-annotated dataset. In this paper, we release VideoFeedback, the first large-scale dataset containing human-provided multi-aspect score over 37.6K synthesized videos from 11 existing video generative models. We train VideoScore (initialized from Mantis) based on VideoFeedback to enable automatic video quality assessment. Experiments show that the Spearman correlation between VideoScore and humans can reach 77.1 on VideoFeedback-test, beating the prior best metrics by about 50 points. Further result on other held-out EvalCrafter, GenAI-Bench, and VBench show that VideoScore has consistently much higher correlation with human judges than other metrics. Due to these results, we believe VideoScore can serve as a great proxy for human raters to (1) rate different video models to track progress (2) simulate fine-grained human feedback in Reinforcement Learning with Human Feedback (RLHF) to improve current video generation models.
♻ ☆ Unlearnable Examples for Diffusion Models: Protect Data from Unauthorized Exploitation
Diffusion models have demonstrated remarkable performance in image generation tasks, paving the way for powerful AIGC applications. However, these widely-used generative models can also raise security and privacy concerns, such as copyright infringement, and sensitive data leakage. To tackle these issues, we propose a method, Unlearnable Diffusion Perturbation, to safeguard images from unauthorized exploitation. Our approach involves designing an algorithm to generate sample-wise perturbation noise for each image to be protected. This imperceptible protective noise makes the data almost unlearnable for diffusion models, i.e., diffusion models trained or fine-tuned on the protected data cannot generate high-quality and diverse images related to the protected training data. Theoretically, we frame this as a max-min optimization problem and introduce EUDP, a noise scheduler-based method to enhance the effectiveness of the protective noise. We evaluate our methods on both Denoising Diffusion Probabilistic Model and Latent Diffusion Models, demonstrating that training diffusion models on the protected data lead to a significant reduction in the quality of the generated images. Especially, the experimental results on Stable Diffusion demonstrate that our method effectively safeguards images from being used to train Diffusion Models in various tasks, such as training specific objects and styles. This achievement holds significant importance in real-world scenarios, as it contributes to the protection of privacy and copyright against AI-generated content.
♻ ☆ EGTR: Extracting Graph from Transformer for Scene Graph Generation CVPR 2024
Scene Graph Generation (SGG) is a challenging task of detecting objects and predicting relationships between objects. After DETR was developed, one-stage SGG models based on a one-stage object detector have been actively studied. However, complex modeling is used to predict the relationship between objects, and the inherent relationship between object queries learned in the multi-head self-attention of the object detector has been neglected. We propose a lightweight one-stage SGG model that extracts the relation graph from the various relationships learned in the multi-head self-attention layers of the DETR decoder. By fully utilizing the self-attention by-products, the relation graph can be extracted effectively with a shallow relation extraction head. Considering the dependency of the relation extraction task on the object detection task, we propose a novel relation smoothing technique that adjusts the relation label adaptively according to the quality of the detected objects. By the relation smoothing, the model is trained according to the continuous curriculum that focuses on object detection task at the beginning of training and performs multi-task learning as the object detection performance gradually improves. Furthermore, we propose a connectivity prediction task that predicts whether a relation exists between object pairs as an auxiliary task of the relation extraction. We demonstrate the effectiveness and efficiency of our method for the Visual Genome and Open Image V6 datasets. Our code is publicly available at https://github.com/naver-ai/egtr.
comment: CVPR 2024 (Best paper award candidate)
♻ ☆ Attribute Diversity Determines the Systematicity Gap in VQA
The degree to which neural networks can generalize to new combinations of familiar concepts, and the conditions under which they are able to do so, has long been an open question. In this work, we study the systematicity gap in visual question answering: the performance difference between reasoning on previously seen and unseen combinations of object attributes. To test, we introduce a novel diagnostic dataset, CLEVR-HOPE. We find that while increased quantity of training data does not reduce the systematicity gap, increased training data diversity of the attributes in the unseen combination does. In all, our experiments suggest that the more distinct attribute type combinations are seen during training, the more systematic we can expect the resulting model to be.
comment: 33 pages, 20 figures
♻ ☆ Can Protective Perturbation Safeguard Personal Data from Being Exploited by Stable Diffusion?
Stable Diffusion has established itself as a foundation model in generative AI artistic applications, receiving widespread research and application. Some recent fine-tuning methods have made it feasible for individuals to implant personalized concepts onto the basic Stable Diffusion model with minimal computational costs on small datasets. However, these innovations have also given rise to issues like facial privacy forgery and artistic copyright infringement. In recent studies, researchers have explored the addition of imperceptible adversarial perturbations to images to prevent potential unauthorized exploitation and infringements when personal data is used for fine-tuning Stable Diffusion. Although these studies have demonstrated the ability to protect images, it is essential to consider that these methods may not be entirely applicable in real-world scenarios. In this paper, we systematically evaluate the use of perturbations to protect images within a practical threat model. The results suggest that these approaches may not be sufficient to safeguard image privacy and copyright effectively. Furthermore, we introduce a purification method capable of removing protected perturbations while preserving the original image structure to the greatest extent possible. Experiments reveal that Stable Diffusion can effectively learn from purified images over all protective methods.
♻ ☆ Versatile Backdoor Attack with Visible, Semantic, Sample-Specific, and Compatible Triggers
Deep neural networks (DNNs) can be manipulated to exhibit specific behaviors when exposed to specific trigger patterns, without affecting their performance on benign samples, dubbed \textit{backdoor attack}. Currently, implementing backdoor attacks in physical scenarios still faces significant challenges. Physical attacks are labor-intensive and time-consuming, and the triggers are selected in a manual and heuristic way. Moreover, expanding digital attacks to physical scenarios faces many challenges due to their sensitivity to visual distortions and the absence of counterparts in the real world. To address these challenges, we define a novel trigger called the \textbf{V}isible, \textbf{S}emantic, \textbf{S}ample-Specific, and \textbf{C}ompatible (VSSC) trigger, to achieve effective, stealthy and robust simultaneously, which can also be effectively deployed in the physical scenario using corresponding objects. To implement the VSSC trigger, we propose an automated pipeline comprising three modules: a trigger selection module that systematically identifies suitable triggers leveraging large language models, a trigger insertion module that employs generative models to seamlessly integrate triggers into images, and a quality assessment module that ensures the natural and successful insertion of triggers through vision-language models. Extensive experimental results and analysis validate the effectiveness, stealthiness, and robustness of the VSSC trigger. It can not only maintain robustness under visual distortions but also demonstrates strong practicality in the physical scenario. We hope that the proposed VSSC trigger and implementation approach could inspire future studies on designing more practical triggers in backdoor attacks.
comment: 23 pages, 21 figures, 18 tables
♻ ☆ Sim2Real Bilevel Adaptation for Object Surface Classification using Vision-Based Tactile Sensors ICRA 2024
In this paper, we address the Sim2Real gap in the field of vision-based tactile sensors for classifying object surfaces. We train a Diffusion Model to bridge this gap using a relatively small dataset of real-world images randomly collected from unlabeled everyday objects via the DIGIT sensor. Subsequently, we employ a simulator to generate images by uniformly sampling the surface of objects from the YCB Model Set. These simulated images are then translated into the real domain using the Diffusion Model and automatically labeled to train a classifier. During this training, we further align features of the two domains using an adversarial procedure. Our evaluation is conducted on a dataset of tactile images obtained from a set of ten 3D printed YCB objects. The results reveal a total accuracy of 81.9%, a significant improvement compared to the 34.7% achieved by the classifier trained solely on simulated images. This demonstrates the effectiveness of our approach. We further validate our approach using the classifier on a 6D object pose estimation task from tactile data.
comment: 6 pages, accepted to ICRA 2024
♻ ☆ LatentExplainer: Explaining Latent Representations in Deep Generative Models with Multi-modal Foundation Models
Deep generative models like VAEs and diffusion models have advanced various generation tasks by leveraging latent variables to learn data distributions and generate high-quality samples. Despite the field of explainable AI making strides in interpreting machine learning models, understanding latent variables in generative models remains challenging. This paper introduces LatentExplainer, a framework for automatically generating semantically meaningful explanations of latent variables in deep generative models. LatentExplainer tackles three main challenges: inferring the meaning of latent variables, aligning explanations with inductive biases, and handling varying degrees of explainability. By perturbing latent variables and interpreting changes in generated data, the framework provides a systematic approach to understanding and controlling the data generation process, enhancing the transparency and interpretability of deep generative models. We evaluate our proposed method on several real-world and synthetic datasets, and the results demonstrate superior performance in generating high-quality explanations of latent variables.
♻ ☆ Asymmetrical Siamese Network for Point Clouds Normal Estimation
In recent years, deep learning-based point cloud normal estimation has made great progress. However, existing methods mainly rely on the PCPNet dataset, leading to overfitting. In addition, the correlation between point clouds with different noise scales remains unexplored, resulting in poor performance in cross-domain scenarios. In this paper, we explore the consistency of intrinsic features learned from clean and noisy point clouds using an Asymmetric Siamese Network architecture. By applying reasonable constraints between features extracted from different branches, we enhance the quality of normal estimation. Moreover, we introduce a novel multi-view normal estimation dataset that includes a larger variety of shapes with different noise levels. Evaluation of existing methods on this new dataset reveals their inability to adapt to different types of shapes, indicating a degree of overfitting. Extensive experiments show that the proposed dataset poses significant challenges for point cloud normal estimation and that our feature constraint mechanism effectively improves upon existing methods and reduces overfitting in current architectures.
♻ ☆ QUBIQ: Uncertainty Quantification for Biomedical Image Segmentation Challenge
Uncertainty in medical image segmentation tasks, especially inter-rater variability, arising from differences in interpretations and annotations by various experts, presents a significant challenge in achieving consistent and reliable image segmentation. This variability not only reflects the inherent complexity and subjective nature of medical image interpretation but also directly impacts the development and evaluation of automated segmentation algorithms. Accurately modeling and quantifying this variability is essential for enhancing the robustness and clinical applicability of these algorithms. We report the set-up and summarize the benchmark results of the Quantification of Uncertainties in Biomedical Image Quantification Challenge (QUBIQ), which was organized in conjunction with International Conferences on Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2020 and 2021. The challenge focuses on the uncertainty quantification of medical image segmentation which considers the omnipresence of inter-rater variability in imaging datasets. The large collection of images with multi-rater annotations features various modalities such as MRI and CT; various organs such as the brain, prostate, kidney, and pancreas; and different image dimensions 2D-vs-3D. A total of 24 teams submitted different solutions to the problem, combining various baseline models, Bayesian neural networks, and ensemble model techniques. The obtained results indicate the importance of the ensemble models, as well as the need for further research to develop efficient 3D methods for uncertainty quantification methods in 3D segmentation tasks.
comment: initial technical report
♻ ☆ A Systematic Review of Few-Shot Learning in Medical Imaging
The lack of annotated medical images limits the performance of deep learning models, which usually need large-scale labelled datasets. Few-shot learning techniques can reduce data scarcity issues and enhance medical image analysis, especially with meta-learning. This systematic review gives a comprehensive overview of few-shot learning in medical imaging. We searched the literature systematically and selected 80 relevant articles published from 2018 to 2023. We clustered the articles based on medical outcomes, such as tumour segmentation, disease classification, and image registration; anatomical structure investigated (i.e. heart, lung, etc.); and the meta-learning method used. For each cluster, we examined the papers' distributions and the results provided by the state-of-the-art. In addition, we identified a generic pipeline shared among all the studies. The review shows that few-shot learning can overcome data scarcity in most outcomes and that meta-learning is a popular choice to perform few-shot learning because it can adapt to new tasks with few labelled samples. In addition, following meta-learning, supervised learning and semi-supervised learning stand out as the predominant techniques employed to tackle few-shot learning challenges in medical imaging and also best performing. Lastly, we observed that the primary application areas predominantly encompass cardiac, pulmonary, and abdominal domains. This systematic review aims to inspire further research to improve medical image analysis and patient care.
comment: 48 pages, 29 figures, 10 tables, submitted to Elsevier on 19 Sep 2023
♻ ☆ NeST: Neural Stress Tensor Tomography by leveraging 3D Photoelasticity
Photoelasticity enables full-field stress analysis in transparent objects through stress-induced birefringence. Existing techniques are limited to 2D slices and require destructively slicing the object. Recovering the internal 3D stress distribution of the entire object is challenging as it involves solving a tensor tomography problem and handling phase wrapping ambiguities. We introduce NeST, an analysis-by-synthesis approach for reconstructing 3D stress tensor fields as neural implicit representations from polarization measurements. Our key insight is to jointly handle phase unwrapping and tensor tomography using a differentiable forward model based on Jones calculus. Our non-linear model faithfully matches real captures, unlike prior linear approximations. We develop an experimental multi-axis polariscope setup to capture 3D photoelasticity and experimentally demonstrate that NeST reconstructs the internal stress distribution for objects with varying shape and force conditions. Additionally, we showcase novel applications in stress analysis, such as visualizing photoelastic fringes by virtually slicing the object and viewing photoelastic fringes from unseen viewpoints. NeST paves the way for scalable non-destructive 3D photoelastic analysis.
comment: Project webpage: https://akshatdave.github.io/nest
♻ ☆ MonoBox: Tightness-free Box-supervised Polyp Segmentation using Monotonicity Constraint
We propose MonoBox, an innovative box-supervised segmentation method constrained by monotonicity to liberate its training from the user-unfriendly box-tightness assumption. In contrast to conventional box-supervised segmentation, where the box edges must precisely touch the target boundaries, MonoBox leverages imprecisely-annotated boxes to achieve robust pixel-wise segmentation. The 'linchpin' is that, within the noisy zones around box edges, MonoBox discards the traditional misguiding multiple-instance learning loss, and instead optimizes a carefully-designed objective, termed monotonicity constraint. Along directions transitioning from the foreground to background, this new constraint steers responses to adhere to a trend of monotonically decreasing values. Consequently, the originally unreliable learning within the noisy zones is transformed into a correct and effective monotonicity optimization. Moreover, an adaptive label correction is introduced, enabling MonoBox to enhance the tightness of box annotations using predicted masks from the previous epoch and dynamically shrink the noisy zones as training progresses. We verify MonoBox in the box-supervised segmentation task of polyps, where satisfying box-tightness is challenging due to the vague boundaries between the polyp and normal tissues. Experiments on both public synthetic and in-house real noisy datasets demonstrate that MonoBox exceeds other anti-noise state-of-the-arts by improving Dice by at least 5.5% and 3.3%, respectively. Codes are at https://github.com/Huster-Hq/MonoBox.
♻ ☆ RGB-Sonar Tracking Benchmark and Spatial Cross-Attention Transformer Tracker
Vision camera and sonar are naturally complementary in the underwater environment. Combining the information from two modalities will promote better observation of underwater targets. However, this problem has not received sufficient attention in previous research. Therefore, this paper introduces a new challenging RGB-Sonar (RGB-S) tracking task and investigates how to achieve efficient tracking of an underwater target through the interaction of RGB and sonar modalities. Specifically, we first propose an RGBS50 benchmark dataset containing 50 sequences and more than 87000 high-quality annotated bounding boxes. Experimental results show that the RGBS50 benchmark poses a challenge to currently popular SOT trackers. Second, we propose an RGB-S tracker called SCANet, which includes a spatial cross-attention module (SCAM) consisting of a novel spatial cross-attention layer and two independent global integration modules. The spatial cross-attention is used to overcome the problem of spatial misalignment of between RGB and sonar images. Third, we propose a SOT data-based RGB-S simulation training method (SRST) to overcome the lack of RGB-S training datasets. It converts RGB images into sonar-like saliency images to construct pseudo-data pairs, enabling the model to learn the semantic structure of RGB-S-like data. Comprehensive experiments show that the proposed spatial cross-attention effectively achieves the interaction between RGB and sonar modalities and SCANet achieves state-of-the-art performance on the proposed benchmark. The code is available at https://github.com/LiYunfengLYF/RGBS50.
♻ ☆ AdaTreeFormer: Few Shot Domain Adaptation for Tree Counting from a Single High-Resolution Image
The process of estimating and counting tree density using only a single aerial or satellite image is a difficult task in the fields of photogrammetry and remote sensing. However, it plays a crucial role in the management of forests. The huge variety of trees in varied topography severely hinders tree counting models to perform well. The purpose of this paper is to propose a framework that is learnt from the source domain with sufficient labeled trees and is adapted to the target domain with only a limited number of labeled trees. Our method, termed as AdaTreeFormer, contains one shared encoder with a hierarchical feature extraction scheme to extract robust features from the source and target domains. It also consists of three subnets: two for extracting self-domain attention maps from source and target domains respectively and one for extracting cross-domain attention maps. For the latter, an attention-to-adapt mechanism is introduced to distill relevant information from different domains while generating tree density maps; a hierarchical cross-domain feature alignment scheme is proposed that progressively aligns the features from the source and target domains. We also adopt adversarial learning into the framework to further reduce the gap between source and target domains. Our AdaTreeFormer is evaluated on six designed domain adaptation tasks using three tree counting datasets, \ie Jiangsu, Yosemite, and London. Experimental results show that AdaTreeFormer significantly surpasses the state of the art, \eg in the cross domain from the Yosemite to Jiangsu dataset, it achieves a reduction of 15.9 points in terms of the absolute counting errors and an increase of 10.8\% in the accuracy of the detected trees' locations. The codes and datasets are available at \emph{\color{magenta}{https://github.com/HAAClassic/AdaTreeFormer}}.
♻ ☆ Knowledge Accumulation in Continually Learned Representations and the Issue of Feature Forgetting
Continual learning research has shown that neural networks suffer from catastrophic forgetting "at the output level", but it is debated whether this is also the case at the level of learned representations. Multiple recent studies ascribe representations a certain level of innate robustness against forgetting -- that they only forget minimally in comparison with forgetting at the output level. We revisit and expand upon the experiments that revealed this difference in forgetting and illustrate the coexistence of two phenomena that affect the quality of continually learned representations: knowledge accumulation and feature forgetting. Taking both aspects into account, we show that, even though forgetting in the representation (i.e. feature forgetting) can be small in absolute terms, when measuring relative to how much was learned during a task, forgetting in the representation tends to be just as catastrophic as forgetting at the output level. Next we show that this feature forgetting is problematic as it substantially slows down the incremental learning of good general representations (i.e. knowledge accumulation). Finally, we study how feature forgetting and knowledge accumulation are affected by different types of continual learning methods.
comment: TMLR 2024
♻ ☆ Continual Road-Scene Semantic Segmentation via Feature-Aligned Symmetric Multi-Modal Network ICIP 2024
State-of-the-art multimodal semantic segmentation strategies combining LiDAR and color data are usually designed on top of asymmetric information-sharing schemes and assume that both modalities are always available. This strong assumption may not hold in real-world scenarios, where sensors are prone to failure or can face adverse conditions that make the acquired information unreliable. This problem is exacerbated when continual learning scenarios are considered since they have stringent data reliability constraints. In this work, we re-frame the task of multimodal semantic segmentation by enforcing a tightly coupled feature representation and a symmetric information-sharing scheme, which allows our approach to work even when one of the input modalities is missing. We also introduce an ad-hoc class-incremental continual learning scheme, proving our approach's effectiveness and reliability even in safety-critical settings, such as autonomous driving. We evaluate our approach on the SemanticKITTI dataset, achieving impressive performances.
comment: Accepted ad ICIP 2024, 6 pages, 5 figures, 3 tables, 7 equations
♻ ☆ RaDe-GS: Rasterizing Depth in Gaussian Splatting
Gaussian Splatting (GS) has proven to be highly effective in novel view synthesis, achieving high-quality and real-time rendering. However, its potential for reconstructing detailed 3D shapes has not been fully explored. Existing methods often suffer from limited shape accuracy due to the discrete and unstructured nature of Gaussian splats, which complicates the shape extraction. While recent techniques like 2D GS have attempted to improve shape reconstruction, they often reformulate the Gaussian primitives in ways that reduce both rendering quality and computational efficiency. To address these problems, our work introduces a rasterized approach to render the depth maps and surface normal maps of general 3D Gaussian splats. Our method not only significantly enhances shape reconstruction accuracy but also maintains the computational efficiency intrinsic to Gaussian Splatting. It achieves a Chamfer distance error comparable to NeuraLangelo on the DTU dataset and maintains similar computational efficiency as the original 3D GS methods. Our method is a significant advancement in Gaussian Splatting and can be directly integrated into existing Gaussian Splatting-based methods.
♻ ☆ P-Mamba: Marrying Perona Malik Diffusion with Mamba for Efficient Pediatric Echocardiographic Left Ventricular Segmentation
In pediatric cardiology, the accurate and immediate assessment of cardiac function through echocardiography is crucial since it can determine whether urgent intervention is required in many emergencies. However, echocardiography is characterized by ambiguity and heavy background noise interference, causing more difficulty in accurate segmentation. Present methods lack efficiency and are prone to mistakenly segmenting some background noise areas, such as the left ventricular area, due to noise disturbance. To address these issues, we introduce P-Mamba, which integrates the Mixture of Experts (MoE) concept for efficient pediatric echocardiographic left ventricular segmentation. Specifically, we utilize the recently proposed ViM layers from the vision mamba to enhance our model's computational and memory efficiency while modeling global dependencies.In the DWT-based Perona-Malik Diffusion (PMD) Block, we devise a PMD Block for noise suppression while preserving the left ventricle's local shape cues. Consequently, our proposed P-Mamba innovatively combines the PMD's noise suppression and local feature extraction capabilities with Mamba's efficient design for global dependency modeling. We conducted segmentation experiments on two pediatric ultrasound datasets and a general ultrasound dataset, namely Echonet-dynamic, and achieved state-of-the-art (SOTA) results. Leveraging the strengths of the P-Mamba block, our model demonstrates superior accuracy and efficiency compared to established models, including vision transformers with quadratic and linear computational complexity.
♻ ☆ OpenDlign: Enhancing Open-World 3D Learning with Depth-Aligned Images
Recent open-world 3D representation learning methods using Vision-Language Models (VLMs) to align 3D data with image-text information have shown superior 3D zero-shot performance. However, CAD-rendered images for this alignment often lack realism and texture variation, compromising alignment robustness. Moreover, the volume discrepancy between 3D and 2D pretraining datasets highlights the need for effective strategies to transfer the representational abilities of VLMs to 3D learning. In this paper, we present OpenDlign, a novel open-world 3D model using depth-aligned images generated from a diffusion model for robust multimodal alignment. These images exhibit greater texture diversity than CAD renderings due to the stochastic nature of the diffusion model. By refining the depth map projection pipeline and designing depth-specific prompts, OpenDlign leverages rich knowledge in pre-trained VLM for 3D representation learning with streamlined fine-tuning. Our experiments show that OpenDlign achieves high zero-shot and few-shot performance on diverse 3D tasks, despite only fine-tuning 6 million parameters on a limited ShapeNet dataset. In zero-shot classification, OpenDlign surpasses previous models by 8.0% on ModelNet40 and 16.4% on OmniObject3D. Additionally, using depth-aligned images for multimodal alignment consistently enhances the performance of other state-of-the-art models.
comment: 12 pages
♻ ☆ Of Mice and Mates: Automated Classification and Modelling of Mouse Behaviour in Groups using a Single Model across Cages
Behavioural experiments often happen in specialised arenas, but this may confound the analysis. To address this issue, we provide tools to study mice in the home-cage environment, equipping biologists with the possibility to capture the temporal aspect of the individual's behaviour and model the interaction and interdependence between cage-mates with minimal human intervention. Our main contribution is the novel Group Behaviour Model (GBM) which summarises the joint behaviour of groups of mice across cages, using a permutation matrix to match the mouse identities in each cage to the model. In support of the above, we also (a) developed the Activity Labelling Module (ALM) to automatically classify mouse behaviour from video, and (b) released two datasets, ABODe for training behaviour classifiers and IMADGE for modelling behaviour.
comment: International Journal of Computer Vision (2024)
♻ ☆ Advancing Surgical VQA with Scene Graph Knowledge
Modern operating room is becoming increasingly complex, requiring innovative intra-operative support systems. While the focus of surgical data science has largely been on video analysis, integrating surgical computer vision with language capabilities is emerging as a necessity. Our work aims to advance Visual Question Answering (VQA) in the surgical context with scene graph knowledge, addressing two main challenges in the current surgical VQA systems: removing question-condition bias in the surgical VQA dataset and incorporating scene-aware reasoning in the surgical VQA model design. First, we propose a Surgical Scene Graph-based dataset, SSG-QA, generated by employing segmentation and detection models on publicly available datasets. We build surgical scene graphs using spatial and action information of instruments and anatomies. These graphs are fed into a question engine, generating diverse QA pairs. Our SSG-QA dataset provides a more complex, diverse, geometrically grounded, unbiased, and surgical action-oriented dataset compared to existing surgical VQA datasets. We then propose SSG-QA-Net, a novel surgical VQA model incorporating a lightweight Scene-embedded Interaction Module (SIM), which integrates geometric scene knowledge in the VQA model design by employing cross-attention between the textual and the scene features. Our comprehensive analysis of the SSG-QA dataset shows that SSG-QA-Net outperforms existing methods across different question types and complexities. We highlight that the primary limitation in the current surgical VQA systems is the lack of scene knowledge to answer complex queries. We present a novel surgical VQA dataset and model and show that results can be significantly improved by incorporating geometric scene features in the VQA model design. The source code and the dataset will be made publicly available at: https://github.com/CAMMA-public/SSG-QA
comment: IPCAI 2024, Int J CARS (2024)
♻ ☆ Surgical Triplet Recognition via Diffusion Model
Surgical triplet recognition is an essential building block to enable next-generation context-aware operating rooms. The goal is to identify the combinations of instruments, verbs, and targets presented in surgical video frames. In this paper, we propose DiffTriplet, a new generative framework for surgical triplet recognition employing the diffusion model, which predicts surgical triplets via iterative denoising. To handle the challenge of triplet association, two unique designs are proposed in our diffusion framework, i.e., association learning and association guidance. During training, we optimize the model in the joint space of triplets and individual components to capture the dependencies among them. At inference, we integrate association constraints into each update of the iterative denoising process, which refines the triplet prediction using the information of individual components. Experiments on the CholecT45 and CholecT50 datasets show the superiority of the proposed method in achieving a new state-of-the-art performance for surgical triplet recognition. Our codes will be released.
♻ ☆ Investigating the impact of 2D gesture representation on co-speech gesture generation
Co-speech gestures play a crucial role in the interactions between humans and embodied conversational agents (ECA). Recent deep learning methods enable the generation of realistic, natural co-speech gestures synchronized with speech, but such approaches require large amounts of training data. "In-the-wild" datasets, which compile videos from sources such as YouTube through human pose detection models, offer a solution by providing 2D skeleton sequences that are paired with speech. Concurrently, innovative lifting models have emerged, capable of transforming these 2D pose sequences into their 3D counterparts, leading to large and diverse datasets of 3D gestures. However, the derived 3D pose estimation is essentially a pseudo-ground truth, with the actual ground truth being the 2D motion data. This distinction raises questions about the impact of gesture representation dimensionality on the quality of generated motions, a topic that, to our knowledge, remains largely unexplored. In this work, we evaluate the impact of the dimensionality of the training data, 2D or 3D joint coordinates, on the performance of a multimodal speech-to-gesture deep generative model. We use a lifting model to convert 2D-generated sequences of body pose to 3D. Then, we compare the sequence of gestures generated directly in 3D to the gestures generated in 2D and lifted to 3D as post-processing.
comment: 8 pages. Paper accepted at WACAI 2024
♻ ☆ SWAP-NAS: Sample-Wise Activation Patterns for Ultra-fast NAS ICLR2024
Training-free metrics (a.k.a. zero-cost proxies) are widely used to avoid resource-intensive neural network training, especially in Neural Architecture Search (NAS). Recent studies show that existing training-free metrics have several limitations, such as limited correlation and poor generalisation across different search spaces and tasks. Hence, we propose Sample-Wise Activation Patterns and its derivative, SWAP-Score, a novel high-performance training-free metric. It measures the expressivity of networks over a batch of input samples. The SWAP-Score is strongly correlated with ground-truth performance across various search spaces and tasks, outperforming 15 existing training-free metrics on NAS-Bench-101/201/301 and TransNAS-Bench-101. The SWAP-Score can be further enhanced by regularisation, which leads to even higher correlations in cell-based search space and enables model size control during the search. For example, Spearman's rank correlation coefficient between regularised SWAP-Score and CIFAR-100 validation accuracies on NAS-Bench-201 networks is 0.90, significantly higher than 0.80 from the second-best metric, NWOT. When integrated with an evolutionary algorithm for NAS, our SWAP-NAS achieves competitive performance on CIFAR-10 and ImageNet in approximately 6 minutes and 9 minutes of GPU time respectively.
comment: ICLR2024 Spotlight
VCR: Visual Caption Restoration
We introduce Visual Caption Restoration (VCR), a novel vision-language task that challenges models to accurately restore partially obscured texts using pixel-level hints within images. This task stems from the observation that text embedded in images is intrinsically different from common visual elements and natural language due to the need to align the modalities of vision, text, and text embedded in images. While numerous works have integrated text embedded in images into visual question-answering tasks, approaches to these tasks generally rely on optical character recognition or masked language modeling, thus reducing the task to mainly text-based processing. However, text-based processing becomes ineffective in VCR as accurate text restoration depends on the combined information from provided images, context, and subtle cues from the tiny exposed areas of masked texts. We develop a pipeline to generate synthetic images for the VCR task using image-caption pairs, with adjustable caption visibility to control the task difficulty. With this pipeline, we construct a dataset for VCR called VCR-Wiki using images with captions from Wikipedia, comprising 2.11M English and 346K Chinese entities in both easy and hard split variants. Our results reveal that current vision language models significantly lag behind human performance in the VCR task, and merely fine-tuning the models on our dataset does not lead to notable improvements. We release VCR-Wiki and the data construction code to facilitate future research.
comment: 17 pages, 2 figures
♻ ☆ LIR: A Lightweight Baseline for Image Restoration
Recently, there have been significant advancements in Image Restoration based on CNN and transformer. However, the inherent characteristics of the Image Restoration task are often overlooked in many works. They, instead, tend to focus on the basic block design and stack numerous such blocks to the model, leading to parameters redundant and computations unnecessary. Thus, the efficiency of the image restoration is hindered. In this paper, we propose a Lightweight Baseline network for Image Restoration called LIR to efficiently restore the image and remove degradations. First of all, through an ingenious structural design, LIR removes the degradations existing in the local and global residual connections that are ignored by modern networks. Then, a Lightweight Adaptive Attention (LAA) Block is introduced which is mainly composed of proposed Adaptive Filters and Attention Blocks. The proposed Adaptive Filter is used to adaptively extract high-frequency information and enhance object contours in various IR tasks, and Attention Block involves a novel Patch Attention module to approximate the self-attention part of the transformer. On the deraining task, our LIR achieves the state-of-the-art Structure Similarity Index Measure (SSIM) and comparable performance to state-of-the-art models on Peak Signal-to-Noise Ratio (PSNR). For denoising, dehazing, and deblurring tasks, LIR also achieves a comparable performance to state-of-the-art models with a parameter size of about 30\%. In addition, it is worth noting that our LIR produces better visual results that are more in line with the human aesthetic.
♻ ☆ CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction
Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech. It still suffers from low speaker similarity and poor prosody naturalness. In this paper, we propose a multi-modal DSR model by leveraging neural codec language modeling to improve the reconstruction results, especially for the speaker similarity and prosody naturalness. Our proposed model consists of: (i) a multi-modal content encoder to extract robust phoneme embeddings from dysarthric speech with auxiliary visual inputs; (ii) a speaker codec encoder to extract and normalize the speaker-aware codecs from the dysarthric speech, in order to provide original timbre and normal prosody; (iii) a codec language model based speech decoder to reconstruct the speech based on the extracted phoneme embeddings and normalized codecs. Evaluations on the commonly used UASpeech corpus show that our proposed model can achieve significant improvements in terms of speaker similarity and prosody naturalness.
comment: Accepted by Interspeech 2024
♻ ☆ Learning to utilize image second-order derivative information for crisp edge detection
Edge detection is a fundamental task in computer vision. It has made great progress under the development of deep convolutional neural networks (DCNNs), some of which have achieved a beyond human-level performance. However, recent top-performing edge detection methods tend to generate thick and noisy edge lines. In this work, we solve this problem from two aspects: (1) leveraging the precise edge pixel location characteristics of second-order image derivatives, and (2) alleviating the issue of imbalanced pixel distribution. We propose a second-order derivative-based multi-scale contextual enhancement module (SDMC) to help the model locate true edge pixels accurately and construct a hybrid focal loss function (HFL) to alleviate the imbalanced distribution issue. We test our method on three standard benchmarks and the experiment results illustrate that our method can make the output edge maps crisp and achieves a top performance among several state-of-the-art methods on the BSDS500 dataset (ODS F-score in standard evaluation is 0.829, in crispness evaluation is 0.720), NYUD-V2 dataset (ODS F-score in standard evaluation is 0.768, in crispness evaluation is 0.546), and BIPED dataset (ODS F-score in standard evaluation is 0.903).
♻ ☆ Style-NeRF2NeRF: 3D Style Transfer From Style-Aligned Multi-View Images
We propose a simple yet effective pipeline for stylizing a 3D scene, harnessing the power of 2D image diffusion models. Given a NeRF model reconstructed from a set of multi-view images, we perform 3D style transfer by refining the source NeRF model using stylized images generated by a style-aligned image-to-image diffusion model. Given a target style prompt, we first generate perceptually similar multi-view images by leveraging a depth-conditioned diffusion model with an attention-sharing mechanism. Next, based on the stylized multi-view images, we propose to guide the style transfer process with the sliced Wasserstein loss based on the feature maps extracted from a pre-trained CNN model. Our pipeline consists of decoupled steps, allowing users to test various prompt ideas and preview the stylized 3D result before proceeding to the NeRF fine-tuning stage. We demonstrate that our method can transfer diverse artistic styles to real-world 3D scenes with competitive quality. Result videos are also available on our project page: https://haruolabs.github.io/style-n2n/
comment: 16 pages, 9 figures
♻ ☆ Diffeomorphic Template Registration for Atmospheric Turbulence Mitigation
We describe a method for recovering the irradiance underlying a collection of images corrupted by atmospheric turbulence. Since supervised data is often technically impossible to obtain, assumptions and biases have to be imposed to solve this inverse problem, and we choose to model them explicitly. Rather than initializing a latent irradiance ("template") by heuristics to estimate deformation, we select one of the images as a reference, and model the deformation in this image by the aggregation of the optical flow from it to other images, exploiting a prior imposed by Central Limit Theorem. Then with a novel flow inversion module, the model registers each image TO the template but WITHOUT the template, avoiding artifacts related to poor template initialization. To illustrate the robustness of the method, we simply (i) select the first frame as the reference and (ii) use the simplest optical flow to estimate the warpings, yet the improvement in registration is decisive in the final reconstruction, as we achieve state-of-the-art performance despite its simplicity. The method establishes a strong baseline that can be further improved by integrating it seamlessly into more sophisticated pipelines, or with domain-specific methods if so desired.
♻ ☆ CCC++: Optimized Color Classified Colorization with Segment Anything Model (SAM) Empowered Object Selective Color Harmonization
In this paper, we formulate the colorization problem into a multinomial classification problem and then apply a weighted function to classes. We propose a set of formulas to transform color values into color classes and vice versa. To optimize the classes, we experiment with different bin sizes for color class transformation. Observing class appearance, standard deviation, and model parameters on various extremely large-scale real-time images in practice we propose 532 color classes for our classification task. During training, we propose a class-weighted function based on true class appearance in each batch to ensure proper saturation of individual objects. We adjust the weights of the major classes, which are more frequently observed, by lowering them, while escalating the weights of the minor classes, which are less commonly observed. In our class re-weight formula, we propose a hyper-parameter for finding the optimal trade-off between the major and minor appeared classes. As we apply regularization to enhance the stability of the minor class, occasional minor noise may appear at the object's edges. We propose a novel object-selective color harmonization method empowered by the Segment Anything Model (SAM) to refine and enhance these edges. We propose two new color image evaluation metrics, the Color Class Activation Ratio (CCAR), and the True Activation Ratio (TAR), to quantify the richness of color components. We compare our proposed model with state-of-the-art models using six different dataset: Place, ADE, Celeba, COCO, Oxford 102 Flower, and ImageNet, in qualitative and quantitative approaches. The experimental results show that our proposed model outstrips other models in visualization, CNR and in our proposed CCAR and TAR measurement criteria while maintaining satisfactory performance in regression (MSE, PSNR), similarity (SSIM, LPIPS, UIUI), and generative criteria (FID).
comment: arXiv admin note: text overlap with arXiv:2403.01476
♻ ☆ MiM-ISTD: Mamba-in-Mamba for Efficient Infrared Small Target Detection
Recently, infrared small target detection (ISTD) has made significant progress, thanks to the development of basic models. Specifically, the models combining CNNs with transformers can successfully extract both local and global features. However, the disadvantage of the transformer is also inherited, i.e., the quadratic computational complexity to sequence length. Inspired by the recent basic model with linear complexity for long-distance modeling, Mamba, we explore the potential of this state space model for ISTD task in terms of effectiveness and efficiency in the paper. However, directly applying Mamba achieves suboptimal performances due to the insufficient harnessing of local features, which are imperative for detecting small targets. Instead, we tailor a nested structure, Mamba-in-Mamba (MiM-ISTD), for efficient ISTD. It consists of Outer and Inner Mamba blocks to adeptly capture both global and local features. Specifically, we treat the local patches as "visual sentences" and use the Outer Mamba to explore the global information. We then decompose each visual sentence into sub-patches as "visual words" and use the Inner Mamba to further explore the local information among words in the visual sentence with negligible computational costs. By aggregating the visual word and visual sentence features, our MiM-ISTD can effectively explore both global and local information. Experiments on NUAA-SIRST and IRSTD-1k show the superior accuracy and efficiency of our method. Specifically, MiM-ISTD is $8 \times$ faster than the SOTA method and reduces GPU memory usage by 62.2$\%$ when testing on $2048 \times 2048$ images, overcoming the computation and memory constraints on high-resolution infrared images.
comment: The first Mamba-based model for infrared small target detection
♻ ☆ MovieLLM: Enhancing Long Video Understanding with AI-Generated Movies
Development of multimodal models has marked a significant step forward in how machines understand videos. These models have shown promise in analyzing short video clips. However, when it comes to longer formats like movies, they often fall short. The main hurdles are the lack of high-quality, diverse video data and the intensive work required to collect or annotate such data. In face of these challenges, we propose MovieLLM, a novel framework designed to synthesize consistent and high-quality video data for instruction tuning. The pipeline is carefully designed to control the style of videos by improving textual inversion technique with powerful text generation capability of GPT-4. As the first framework to do such thing, our approach stands out for its flexibility and scalability, empowering users to create customized movies with only one description. This makes it a superior alternative to traditional data collection methods. Our extensive experiments validate that the data produced by MovieLLM significantly improves the performance of multimodal models in understanding complex video narratives, overcoming the limitations of existing datasets regarding scarcity and bias.
♻ ☆ Comparing the Decision-Making Mechanisms by Transformers and CNNs via Explanation Methods CVPR24
In order to gain insights about the decision-making of different visual recognition backbones, we propose two methodologies, sub-explanation counting and cross-testing, that systematically applies deep explanation algorithms on a dataset-wide basis, and compares the statistics generated from the amount and nature of the explanations. These methodologies reveal the difference among networks in terms of two properties called compositionality and disjunctivism. Transformers and ConvNeXt are found to be more compositional, in the sense that they jointly consider multiple parts of the image in building their decisions, whereas traditional CNNs and distilled transformers are less compositional and more disjunctive, which means that they use multiple diverse but smaller set of parts to achieve a confident prediction. Through further experiments, we pinpointed the choice of normalization to be especially important in the compositionality of a model, in that batch normalization leads to less compositionality while group and layer normalization lead to more. Finally, we also analyze the features shared by different backbones and plot a landscape of different models based on their feature-use similarity.
comment: 25 pages with 37 figures, to be published in CVPR24. Project Webpage: https://mingqij.github.io/projects/cdmmtc/
♻ ☆ Adaptively Clustering Neighbor Elements for Image-Text Generation
We propose a novel Transformer-based image-to-text generation model termed as \textbf{ACF} that adaptively clusters vision patches into object regions and language words into phrases to implicitly learn object-phrase alignments for better visual-text coherence. To achieve this, we design a novel self-attention layer that applies self-attention over the elements in a local cluster window instead of the whole sequence. The window size is softly decided by a clustering matrix that is calculated by the current input data and thus this process is adaptive. By stacking these revised self-attention layers to construct ACF, the small clusters in the lower layers can be grouped into a bigger cluster, \eg vision/language. ACF clusters small objects/phrases into bigger ones. In this gradual clustering process, a parsing tree is generated which embeds the hierarchical knowledge of the input sequence. As a result, by using ACF to build the vision encoder and language decoder, the hierarchical object-phrase alignments are embedded and then transferred from vision to language domains in two popular image-to-text tasks: Image captioning and Visual Question Answering. The experiment results demonstrate the effectiveness of ACF, which outperforms most SOTA captioning and VQA models and achieves comparable scores compared with some large-scale pre-trained models. Our code is available \href{https://github.com/ZihuaEvan/ACFModel/}{[here]}.
comment: Compared to v1 and v2, we expanded this method to VQA. And it proved that our method can be applied on more general image-text generation tasks
♻ ☆ WeatherQA: Can Multimodal Language Models Reason about Severe Weather?
Severe convective weather events, such as hail, tornadoes, and thunderstorms, often occur quickly yet cause significant damage, costing billions of dollars every year. This highlights the importance of forecasting severe weather threats hours in advance to better prepare meteorologists and residents in at-risk areas. Can modern large foundation models perform such forecasting? Existing weather benchmarks typically focus only on predicting time-series changes in certain weather parameters (e.g., temperature, moisture) with text-only features. In this work, we introduce WeatherQA, the first multimodal dataset designed for machines to reason about complex combinations of weather parameters (a.k.a., ingredients) and predict severe weather in real-world scenarios. The dataset includes over 8,000 (multi-images, text) pairs for diverse severe weather events. Each pair contains rich information crucial for forecasting -- the images describe the ingredients capturing environmental instability, surface observations, and radar reflectivity, and the text contains forecast analyses written by human experts. With WeatherQA, we evaluate state-of-the-art vision language models, including GPT4, Claude3.5, Gemini-1.5, and a fine-tuned Llama3-based VLM, by designing two challenging tasks: (1) multi-choice QA for predicting affected area and (2) classification of the development potential of severe convection. These tasks require deep understanding of domain knowledge (e.g., atmospheric dynamics) and complex reasoning over multimodal data (e.g., interactions between weather parameters). We show a substantial gap between the strongest VLM, GPT4o, and human reasoning. Our comprehensive case study with meteorologists further reveals the weaknesses of the models, suggesting that better training and data integration are necessary to bridge this gap. WeatherQA link: https://github.com/chengqianma/WeatherQA.
comment: 26 pages, 9 figures
♻ ☆ Accurately Classifying Out-Of-Distribution Data in Facial Recognition
Standard classification theory assumes that the distribution of images in the test and training sets are identical. Unfortunately, real-life scenarios typically feature unseen data ("out-of-distribution data") which is different from data in the training distribution("in-distribution"). This issue is most prevalent in social justice problems where data from under-represented groups may appear in the test data without representing an equal proportion of the training data. This may result in a model returning confidently wrong decisions and predictions. We are interested in the following question: Can the performance of a neural network improve on facial images of out-of-distribution data when it is trained simultaneously on multiple datasets of in-distribution data? We approach this problem by incorporating the Outlier Exposure model and investigate how the model's performance changes when other datasets of facial images were implemented. We observe that the accuracy and other metrics of the model can be increased by applying Outlier Exposure, incorporating a trainable weight parameter to increase the machine's emphasis on outlier images, and by re-weighting the importance of different class labels. We also experimented with whether sorting the images and determining outliers via image features would have more of an effect on the metrics than sorting by average pixel value. Our goal was to make models not only more accurate but also more fair by scanning a more expanded range of images. We also tested the datasets in reverse order to see whether a more fair dataset with balanced features has an effect on the model's accuracy.
comment: 18 pages, 6 tables, 6 figures
♻ ☆ PSAvatar: A Point-based Shape Model for Real-Time Head Avatar Animation with 3D Gaussian Splatting
Despite much progress, achieving real-time high-fidelity head avatar animation is still difficult and existing methods have to trade-off between speed and quality. 3DMM based methods often fail to model non-facial structures such as eyeglasses and hairstyles, while neural implicit models suffer from deformation inflexibility and rendering inefficiency. Although 3D Gaussian has been demonstrated to possess promising capability for geometry representation and radiance field reconstruction, applying 3D Gaussian in head avatar creation remains a major challenge since it is difficult for 3D Gaussian to model the head shape variations caused by changing poses and expressions. In this paper, we introduce PSAvatar, a novel framework for animatable head avatar creation that utilizes discrete geometric primitive to create a parametric morphable shape model and employs 3D Gaussian for fine detail representation and high fidelity rendering. The parametric morphable shape model is a Point-based Morphable Shape Model (PMSM) which uses points instead of meshes for 3D representation to achieve enhanced representation flexibility. The PMSM first converts the FLAME mesh to points by sampling on the surfaces as well as off the meshes to enable the reconstruction of not only surface-like structures but also complex geometries such as eyeglasses and hairstyles. By aligning these points with the head shape in an analysis-by-synthesis manner, the PMSM makes it possible to utilize 3D Gaussian for fine detail representation and appearance modeling, thus enabling the creation of high-fidelity avatars. We show that PSAvatar can reconstruct high-fidelity head avatars of a variety of subjects and the avatars can be animated in real-time ($\ge$ 25 fps at a resolution of 512 $\times$ 512 ).
comment: 13 pages, 10 figures
♻ ☆ EasyEdit: An Easy-to-use Knowledge Editing Framework for Large Language Models ACL 2024
Large Language Models (LLMs) usually suffer from knowledge cutoff or fallacy issues, which means they are unaware of unseen events or generate text with incorrect facts owing to outdated/noisy data. To this end, many knowledge editing approaches for LLMs have emerged -- aiming to subtly inject/edit updated knowledge or adjust undesired behavior while minimizing the impact on unrelated inputs. Nevertheless, due to significant differences among various knowledge editing methods and the variations in task setups, there is no standard implementation framework available for the community, which hinders practitioners from applying knowledge editing to applications. To address these issues, we propose EasyEdit, an easy-to-use knowledge editing framework for LLMs. It supports various cutting-edge knowledge editing approaches and can be readily applied to many well-known LLMs such as T5, GPT-J, LlaMA, etc. Empirically, we report the knowledge editing results on LlaMA-2 with EasyEdit, demonstrating that knowledge editing surpasses traditional fine-tuning in terms of reliability and generalization. We have released the source code on GitHub, along with Google Colab tutorials and comprehensive documentation for beginners to get started. Besides, we present an online system for real-time knowledge editing, and a demo video.
comment: ACL 2024 System Demonstrations; Code: https://github.com/zjunlp/EasyEdit HF Demo: https://huggingface.co/spaces/zjunlp/EasyEdit Video: https://youtu.be/Gm6T0QaaskU Docs: https://zjunlp.gitbook.io/easyedit
♻ ☆ Interpreting the Second-Order Effects of Neurons in CLIP
We interpret the function of individual neurons in CLIP by automatically describing them using text. Analyzing the direct effects (i.e. the flow from a neuron through the residual stream to the output) or the indirect effects (overall contribution) fails to capture the neurons' function in CLIP. Therefore, we present the "second-order lens", analyzing the effect flowing from a neuron through the later attention heads, directly to the output. We find that these effects are highly selective: for each neuron, the effect is significant for <2% of the images. Moreover, each effect can be approximated by a single direction in the text-image space of CLIP. We describe neurons by decomposing these directions into sparse sets of text representations. The sets reveal polysemantic behavior - each neuron corresponds to multiple, often unrelated, concepts (e.g. ships and cars). Exploiting this neuron polysemy, we mass-produce "semantic" adversarial examples by generating images with concepts spuriously correlated to the incorrect class. Additionally, we use the second-order effects for zero-shot segmentation and attribute discovery in images. Our results indicate that a scalable understanding of neurons can be used for model deception and for introducing new model capabilities.
comment: project page: https://yossigandelsman.github.io/clip_neurons/index.html
♻ ☆ Intrinsic LoRA: A Generalist Approach for Discovering Knowledge in Generative Models
Generative models excel at creating images that closely mimic real scenes, suggesting they inherently encode scene representations. We introduce Intrinsic LoRA (I-LoRA), a general approach that uses Low-Rank Adaptation (LoRA) to discover scene intrinsics such as normals, depth, albedo, and shading from a wide array of generative models. I-LoRA is lightweight, adding minimally to the model's parameters and requiring very small datasets for this knowledge discovery. Our approach, applicable to Diffusion models, GANs, and Autoregressive models alike, generates intrinsics using the same output head as the original images. Through control experiments, we establish a correlation between the generative model's quality and the extracted intrinsics' accuracy. Finally, scene intrinsics obtained by our method with just hundreds to thousands of labeled images, perform on par with those from supervised methods trained on millions of labeled examples.
comment: https://intrinsic-lora.github.io/
♻ ☆ Non-rigid Structure-from-Motion: Temporally-smooth Procrustean Alignment and Spatially-variant Deformation Modeling CVPR 2024
Even though Non-rigid Structure-from-Motion (NRSfM) has been extensively studied and great progress has been made, there are still key challenges that hinder their broad real-world applications: 1) the inherent motion/rotation ambiguity requires either explicit camera motion recovery with extra constraint or complex Procrustean Alignment; 2) existing low-rank modeling of the global shape can over-penalize drastic deformations in the 3D shape sequence. This paper proposes to resolve the above issues from a spatial-temporal modeling perspective. First, we propose a novel Temporally-smooth Procrustean Alignment module that estimates 3D deforming shapes and adjusts the camera motion by aligning the 3D shape sequence consecutively. Our new alignment module remedies the requirement of complex reference 3D shape during alignment, which is more conductive to non-isotropic deformation modeling. Second, we propose a spatial-weighted approach to enforce the low-rank constraint adaptively at different locations to accommodate drastic spatially-variant deformation reconstruction better. Our modeling outperform existing low-rank based methods, and extensive experiments across different datasets validate the effectiveness of our method.
comment: Accepted by CVPR 2024; V2 adds new experiments
♻ ☆ EFUF: Efficient Fine-grained Unlearning Framework for Mitigating Hallucinations in Multimodal Large Language Models
Multimodal large language models (MLLMs) have attracted increasing attention in the past few years, but they may still generate descriptions that include objects not present in the corresponding images, a phenomenon known as object hallucination. To eliminate hallucinations, existing methods manually annotate paired responses with and without hallucinations, and then employ various alignment algorithms to improve the alignment capability between images and text. However, they not only demand considerable computation resources during the finetuning stage but also require expensive human annotation to construct paired data needed by the alignment algorithms. To address these issues, we borrow the idea of unlearning and propose an efficient fine-grained unlearning framework (EFUF), which can eliminate hallucinations without the need for paired data. Extensive experiments show that our method consistently reduces hallucinations while preserving the generation quality with modest computational overhead. Our code and datasets will be publicly available.
♻ ☆ Source-Free Domain Adaptation with Diffusion-Guided Source Data Generation
This paper introduces a novel approach to leverage the generalizability of Diffusion Models for Source-Free Domain Adaptation (DM-SFDA). Our proposed DMSFDA method involves fine-tuning a pre-trained text-to-image diffusion model to generate source domain images using features from the target images to guide the diffusion process. Specifically, the pre-trained diffusion model is fine-tuned to generate source samples that minimize entropy and maximize confidence for the pre-trained source model. We then use a diffusion model-based image mixup strategy to bridge the domain gap between the source and target domains. We validate our approach through comprehensive experiments across a range of datasets, including Office-31 [39], Office-Home [48], and VisDA [35]. The results demonstrate significant improvements in SFDA performance, highlighting the potential of diffusion models in generating contextually relevant, domain-specific images.
comment: arXiv admin note: substantial text overlap with arXiv:2310.01701
♻ ☆ MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions ICML 2024
Image retrieval, i.e., finding desired images given a reference image, inherently encompasses rich, multi-faceted search intents that are difficult to capture solely using image-based measures. Recent works leverage text instructions to allow users to more freely express their search intents. However, they primarily focus on image pairs that are visually similar and/or can be characterized by a small set of pre-defined relations. The core thesis of this paper is that text instructions can enable retrieving images with richer relations beyond visual similarity. To show this, we introduce MagicLens, a series of self-supervised image retrieval models that support open-ended instructions. MagicLens is built on a key novel insight: image pairs that naturally occur on the same web pages contain a wide range of implicit relations (e.g., inside view of), and we can bring those implicit relations explicit by synthesizing instructions via foundation models. Trained on 36.7M (query image, instruction, target image) triplets with rich semantic relations mined from the web, MagicLens achieves results comparable with or better than prior best on eight benchmarks of various image retrieval tasks, while maintaining high parameter efficiency with a significantly smaller model size. Additional human analyses on a 1.4M-image unseen corpus further demonstrate the diversity of search intents supported by MagicLens. Code and models are publicly available at https://open-vision-language.github.io/MagicLens/.
comment: ICML 2024 (Oral); Project Website: https://open-vision-language.github.io/MagicLens/
♻ ☆ LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models
In the rapidly evolving landscape of artificial intelligence, multi-modal large language models are emerging as a significant area of interest. These models, which combine various forms of data input, are becoming increasingly popular. However, understanding their internal mechanisms remains a complex task. Numerous advancements have been made in the field of explainability tools and mechanisms, yet there is still much to explore. In this work, we present a novel interactive application aimed towards understanding the internal mechanisms of large vision-language models. Our interface is designed to enhance the interpretability of the image patches, which are instrumental in generating an answer, and assess the efficacy of the language model in grounding its output in the image. With our application, a user can systematically investigate the model and uncover system limitations, paving the way for enhancements in system capabilities. Finally, we present a case study of how our application can aid in understanding failure mechanisms in a popular large multi-modal model: LLaVA.
♻ ☆ Rethinking Pruning for Vision-Language Models: Strategies for Effective Sparsity and Performance Restoration
Vision-Language Models (VLMs) integrate information from multiple modalities and have shown remarkable success across various tasks. However, deploying large-scale VLMs in resource-constrained scenarios is challenging. Pruning followed by finetuning offers a potential solution but remains underexplored for VLMs. This study addresses two key questions: how to distribute sparsity across different modality-specific models, and how to restore the performance of pruned sparse VLMs. Our preliminary studies identified two effective pruning settings: applying the same sparsity to both vision and language models, and pruning only the language models. While LoRA finetuning aims to restore sparse models, it faces challenges due to incompatibility with sparse models, disrupting the pruned sparsity. To overcome these issues, we propose SparseLoRA, which applies sparsity directly to LoRA weights. Our experimental results demonstrate significant improvements, including an 11.3\% boost under 2:4 sparsity and a 47.6\% enhancement under unstructured 70\% sparsity. Code is released at: \url{https://github.com/Shwai-He/VLM-Compression}.
♻ ☆ WATT: Weight Average Test-Time Adaptation of CLIP
Vision-Language Models (VLMs) such as CLIP have yielded unprecedented performance for zero-shot image classification, yet their generalization capability may still be seriously challenged when confronted to domain shifts. In response, we present Weight Average Test-Time Adaptation (WATT) of CLIP, a pioneering approach facilitating full test-time adaptation (TTA) of this VLM. Our method employs a diverse set of templates for text prompts, augmenting the existing framework of CLIP. Predictions are utilized as pseudo labels for model updates, followed by weight averaging to consolidate the learned information globally. Furthermore, we introduce a text ensemble strategy, enhancing overall test performance by aggregating diverse textual cues. Our findings underscore the efficacy of WATT in enhancing performance across diverse datasets, including CIFAR-10-C, CIFAR-10.1, CIFAR-100-C, VisDA-C, and several other challenging datasets, effectively covering a wide range of domain shifts. Notably, these enhancements are achieved without necessitating additional model transformations or trainable modules. Moreover, compared to other Test-Time Adaptation methods, our approach can operate effectively with just a single image. Highlighting the potential of innovative test-time strategies, this research emphasizes their role in fortifying the adaptability of VLMs. The implementation is available at: \url{https://github.com/Mehrdad-Noori/WATT.git}.
♻ ☆ UCM-Net: A Lightweight and Efficient Solution for Skin Lesion Segmentation using MLP and CNN
Skin cancer poses a significant public health challenge, necessitating efficient diagnostic tools. We introduce UCM-Net, a novel skin lesion segmentation model combining Multi-Layer Perceptrons (MLP) and Convolutional Neural Networks (CNN). This lightweight, efficient architecture, deviating from traditional UNet designs, dramatically reduces computational demands, making it ideal for mobile health applications. Evaluated on PH2, ISIC 2017, and ISIC 2018 datasets, UCM-Net demonstrates robust performance with fewer than 50KB parameters and requires less than 0.05 Giga Operations Per Second (GLOPs). Moreover, its minimal memory requirement is just 1.19MB in CPU environment positions. It is a potential benchmark for efficiency in skin lesion segmentation, suitable for deployment in resource-constrained settings. In order to facilitate accessibility and further research in the field, the UCM-Net source code is https://github.com/chunyuyuan/UCM-Net.
comment: 17 pages, accepted by Journal of Biomedical Signal Processing and Control
♻ ☆ Improving the Explain-Any-Concept by Introducing Nonlinearity to the Trainable Surrogate Model
In the evolving field of Explainable AI (XAI), interpreting the decisions of deep neural networks (DNNs) in computer vision tasks is an important process. While pixel-based XAI methods focus on identifying significant pixels, existing concept-based XAI methods use pre-defined or human-annotated concepts. The recently proposed Segment Anything Model (SAM) achieved a significant step forward to prepare automatic concept sets via comprehensive instance segmentation. Building upon this, the Explain Any Concept (EAC) model emerged as a flexible method for explaining DNN decisions. EAC model is based on using a surrogate model which has one trainable linear layer to simulate the target model. In this paper, by introducing an additional nonlinear layer to the original surrogate model, we show that we can improve the performance of the EAC model. We compare our proposed approach to the original EAC model and report improvements obtained on both ImageNet and MS COCO datasets.
comment: This paper is accepted for publication at IEEE SIU conference, 2024
♻ ☆ MRISegmentator-Abdomen: A Fully Automated Multi-Organ and Structure Segmentation Tool for T1-weighted Abdominal MRI
Background: Segmentation of organs and structures in abdominal MRI is useful for many clinical applications, such as disease diagnosis and radiotherapy. Current approaches have focused on delineating a limited set of abdominal structures (13 types). To date, there is no publicly available abdominal MRI dataset with voxel-level annotations of multiple organs and structures. Consequently, a segmentation tool for multi-structure segmentation is also unavailable. Methods: We curated a T1-weighted abdominal MRI dataset consisting of 195 patients who underwent imaging at National Institutes of Health (NIH) Clinical Center. The dataset comprises of axial pre-contrast T1, arterial, venous, and delayed phases for each patient, thereby amounting to a total of 780 series (69,248 2D slices). Each series contains voxel-level annotations of 62 abdominal organs and structures. A 3D nnUNet model, dubbed as MRISegmentator-Abdomen (MRISegmentator in short), was trained on this dataset, and evaluation was conducted on an internal test set and two large external datasets: AMOS22 and Duke Liver. The predicted segmentations were compared against the ground-truth using the Dice Similarity Coefficient (DSC) and Normalized Surface Distance (NSD). Findings: MRISegmentator achieved an average DSC of 0.861$\pm$0.170 and a NSD of 0.924$\pm$0.163 in the internal test set. On the AMOS22 dataset, MRISegmentator attained an average DSC of 0.829$\pm$0.133 and a NSD of 0.908$\pm$0.067. For the Duke Liver dataset, an average DSC of 0.933$\pm$0.015 and a NSD of 0.929$\pm$0.021 was obtained. Interpretation: The proposed MRISegmentator provides automatic, accurate, and robust segmentations of 62 organs and structures in T1-weighted abdominal MRI sequences. The tool has the potential to accelerate research on various clinical topics, such as abnormality detection, radiotherapy, disease classification among others.
comment: We made the segmentation model publicly available
Computation and Language 154
☆ EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
Inference with modern Large Language Models (LLMs) is expensive and time-consuming, and speculative sampling has proven to be an effective solution. Most speculative sampling methods such as EAGLE use a static draft tree, implicitly assuming that the acceptance rate of draft tokens depends only on their position. Interestingly, we found that the acceptance rate of draft tokens is also context-dependent. In this paper, building upon EAGLE, we propose EAGLE-2, which introduces a new technique of context-aware dynamic draft tree into drafting modeling. This improvement leverages the fact that the draft model of EAGLE is well-calibrated: the confidence scores from the draft model approximate acceptance rates with small errors. We conducted extensive evaluations on three series of LLMs and six tasks, with EAGLE-2 achieving speedup ratios 3.05x-4.26x, which is 20%-40% faster than EAGLE-1. EAGLE-2 also ensures that the distribution of the generated text remains unchanged, making it a lossless acceleration algorithm.
☆ Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts
We present LoCoVQA, a dynamic benchmark generator for evaluating long-context extractive reasoning in vision language models (VLMs). LoCoVQA augments test examples for mathematical reasoning, VQA, and character recognition tasks with increasingly long visual contexts composed of both in-distribution and out-of-distribution distractor images. Across these tasks, a diverse set of VLMs rapidly lose performance as the visual context length grows, often exhibiting a striking exponential decay trend. This test assesses how well VLMs can ignore irrelevant information when answering queries -- a task that is quite easy for language models (LMs) in the text domain -- demonstrating that current state-of-the-art VLMs lack this essential capability for many long-context applications.
comment: Under review
☆ RaTEScore: A Metric for Radiology Report Generation
This paper introduces a novel, entity-aware metric, termed as Radiological Report (Text) Evaluation (RaTEScore), to assess the quality of medical reports generated by AI models. RaTEScore emphasizes crucial medical entities such as diagnostic outcomes and anatomical details, and is robust against complex medical synonyms and sensitive to negation expressions. Technically, we developed a comprehensive medical NER dataset, RaTE-NER, and trained an NER model specifically for this purpose. This model enables the decomposition of complex radiological reports into constituent medical entities. The metric itself is derived by comparing the similarity of entity embeddings, obtained from a language model, based on their types and relevance to clinical significance. Our evaluations demonstrate that RaTEScore aligns more closely with human preference than existing metrics, validated both on established public benchmarks and our newly proposed RaTE-Eval benchmark.
☆ Exploring Factual Entailment with NLI: A News Media Study
We explore the relationship between factuality and Natural Language Inference (NLI) by introducing FactRel -- a novel annotation scheme that models \textit{factual} rather than \textit{textual} entailment, and use it to annotate a dataset of naturally occurring sentences from news articles. Our analysis shows that 84\% of factually supporting pairs and 63\% of factually undermining pairs do not amount to NLI entailment or contradiction, respectively, suggesting that factual relationships are more apt for analyzing media discourse. We experiment with models for pairwise classification on the new dataset, and find that in some cases, generating synthetic data with GPT-4 on the basis of the annotated dataset can improve performance. Surprisingly, few-shot learning with GPT-4 yields strong results on par with medium LMs (DeBERTa) trained on the labelled dataset. We hypothesize that these results indicate the fundamental dependence of this task on both world knowledge and advanced reasoning abilities.
comment: Presented at *SEM 2024
☆ From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models
One of the most striking findings in modern research on large language models (LLMs) is that scaling up compute during training leads to better results. However, less attention has been given to the benefits of scaling compute during inference. This survey focuses on these inference-time approaches. We explore three areas under a unified mathematical formalism: token-level generation algorithms, meta-generation algorithms, and efficient generation. Token-level generation algorithms, often called decoding algorithms, operate by sampling a single token at a time or constructing a token-level search space and then selecting an output. These methods typically assume access to a language model's logits, next-token distributions, or probability scores. Meta-generation algorithms work on partial or full sequences, incorporating domain knowledge, enabling backtracking, and integrating external information. Efficient generation methods aim to reduce token costs and improve the speed of generation. Our survey unifies perspectives from three research communities: traditional natural language processing, modern LLMs, and machine learning systems.
☆ USDC: A Dataset of $\underline{U}$ser $\underline{S}$tance and $\underline{D}$ogmatism in Long $\underline{C}$onversations
Identifying user's opinions and stances in long conversation threads on various topics can be extremely critical for enhanced personalization, market research, political campaigns, customer service, conflict resolution, targeted advertising, and content moderation. Hence, training language models to automate this task is critical. However, to train such models, gathering manual annotations has multiple challenges: 1) It is time-consuming and costly; 2) Conversation threads could be very long, increasing chances of noisy annotations; and 3) Interpreting instances where a user changes their opinion within a conversation is difficult because often such transitions are subtle and not expressed explicitly. Inspired by the recent success of large language models (LLMs) for complex natural language processing (NLP) tasks, we leverage Mistral Large and GPT-4 to automate the human annotation process on the following two tasks while also providing reasoning: i) User Stance classification, which involves labeling a user's stance of a post in a conversation on a five-point scale; ii) User Dogmatism classification, which deals with labeling a user's overall opinion in the conversation on a four-point scale. The majority voting on zero-shot, one-shot, and few-shot annotations from these two LLMs on 764 multi-user Reddit conversations helps us curate the USDC dataset. USDC is then used to finetune and instruction-tune multiple deployable small language models for the 5-class stance and 4-class dogmatism classification tasks. We make the code and dataset publicly available [https://anonymous.4open.science/r/USDC-0F7F].
comment: 32 pages, 18 figures
☆ Understanding and Mitigating Tokenization Bias in Language Models
State-of-the-art language models are autoregressive and operate on subword units known as tokens. Specifically, one must encode the conditioning string into a list of tokens before passing to the language models for next-token prediction. We show that, for encoding schemes such as maximum prefix matching, tokenization induces a sampling bias that cannot be mitigated with more training or data. To counter this universal problem, we propose a novel algorithm to obtain unbiased estimates from a model that was trained on tokenized data. Our method does not require finetuning the model, and its complexity, defined as the number of model runs, scales linearly with the sequence length. As a consequence, we show that one can simulate token-free behavior from a tokenized language model. We empirically verify the correctness of our method through a Markov-chain setup, where it accurately recovers the transition probabilities, as opposed to the conventional method of directly prompting tokens into the language model.
☆ Ragnarök: A Reusable RAG Framework and Baselines for TREC 2024 Retrieval-Augmented Generation Track
Did you try out the new Bing Search? Or maybe you fiddled around with Google AI~Overviews? These might sound familiar because the modern-day search stack has recently evolved to include retrieval-augmented generation (RAG) systems. They allow searching and incorporating real-time data into large language models (LLMs) to provide a well-informed, attributed, concise summary in contrast to the traditional search paradigm that relies on displaying a ranked list of documents. Therefore, given these recent advancements, it is crucial to have an arena to build, test, visualize, and systematically evaluate RAG-based search systems. With this in mind, we propose the TREC 2024 RAG Track to foster innovation in evaluating RAG systems. In our work, we lay out the steps we've made towards making this track a reality -- we describe the details of our reusable framework, Ragnar\"ok, explain the curation of the new MS MARCO V2.1 collection choice, release the development topics for the track, and standardize the I/O definitions which assist the end user. Next, using Ragnar\"ok, we identify and provide key industrial baselines such as OpenAI's GPT-4o or Cohere's Command R+. Further, we introduce a web-based user interface for an interactive arena allowing benchmarking pairwise RAG systems by crowdsourcing. We open-source our Ragnar\"ok framework and baselines to achieve a unified standard for future RAG systems.
☆ PISTOL: Dataset Compilation Pipeline for Structural Unlearning of LLMs
Recently, machine unlearning, which seeks to erase specific data stored in the pre-trained or fine-tuned models, has emerged as a crucial protective measure for LLMs. However, unlearning approaches for LLMs that have been considered thus far have focused on the removal of independent data points and have not taken into account that the stored facts are logically connected to one another and form an implicit knowledge graph. To facilitate the development of structural unlearning methods, which are essential for the practical application of unlearning, we propose PISTOL, a pipeline for compiling multi-scenario datasets for benchmarking structural LLM unlearning. Additionally, leveraging sample datasets synthesized using PISTOL, we conducted benchmarks with four distinct unlearning methods on both Llama2-7B and Mistral-7B models. This analysis helps to illustrate the prevailing challenges in effectively and robustly removing highly inter-connected data, batched data, or data skewed towards a specific domain. It also highlights the choice of pre-trained model can impact unlearning performance. This work not only advances our understandings on the limitation of current LLMs unlearning methods and proposes future research directions, but also provides a replicable framework for ongoing exploration and validation in the field.
☆ Beyond Thumbs Up/Down: Untangling Challenges of Fine-Grained Feedback for Text-to-Image Generation
Human feedback plays a critical role in learning and refining reward models for text-to-image generation, but the optimal form the feedback should take for learning an accurate reward function has not been conclusively established. This paper investigates the effectiveness of fine-grained feedback which captures nuanced distinctions in image quality and prompt-alignment, compared to traditional coarse-grained feedback (for example, thumbs up/down or ranking between a set of options). While fine-grained feedback holds promise, particularly for systems catering to diverse societal preferences, we show that demonstrating its superiority to coarse-grained feedback is not automatic. Through experiments on real and synthetic preference data, we surface the complexities of building effective models due to the interplay of model choice, feedback type, and the alignment between human judgment and computational interpretation. We identify key challenges in eliciting and utilizing fine-grained feedback, prompting a reassessment of its assumed benefits and practicality. Our findings -- e.g., that fine-grained feedback can lead to worse models for a fixed budget, in some settings; however, in controlled settings with known attributes, fine grained rewards can indeed be more helpful -- call for careful consideration of feedback attributes and potentially beckon novel modeling approaches to appropriately unlock the potential value of fine-grained feedback in-the-wild.
☆ RES-Q: Evaluating Code-Editing Large Language Model Systems at the Repository Scale
The instruction-following ability of Large Language Models (LLMs) has cultivated a class of LLM-based systems capable of approaching complex tasks such as making edits to large code repositories. Due to the high sensitivity and unpredictability of LLM behavior in response to changes in prompting, robust evaluation tools are needed to drive future iteration of these systems. We propose RES-Q, a natural language instruction-based benchmark for evaluating $\textbf{R}$epository $\textbf{E}$diting $\textbf{S}$ystems, which consists of 100 repository editing tasks derived from real GitHub commits. Given an edit instruction and a code repository, RES-Q evaluates an LLM system's ability to gather information and construct an edit that satisfies the criteria set by the instruction. We argue that evaluating LLMs in this way addresses issues with traditional benchmarks and provides a more holistic assessment of a model's abilities. We evaluate various state-of-the-art LLMs as language agents in a repository-editing system built on Qurrent OS, our language agent development software. Despite their 1% pass@1 performance difference on HumanEval, we find Claude Sonnet 3.5 outperforms GPT-4o by 12% pass@1 on RES-Q, indicating RES-Q's capacity to differentiate model capability as traditional benchmarks approach saturation. We further investigate token efficiency, performance relationships with existing benchmarks, and interesting disparities between closed and open-source LLMs. Code and dataset are available at https://github.com/Qurrent-AI/RES-Q.
☆ Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs
Existing methods for adapting large language models (LLMs) to new tasks are not suited to multi-task adaptation because they modify all the model weights -- causing destructive interference between tasks. The resulting effects, such as catastrophic forgetting of earlier tasks, make it challenging to obtain good performance on multiple tasks at the same time. To mitigate this, we propose Lottery Ticket Adaptation (LoTA), a sparse adaptation method that identifies and optimizes only a sparse subnetwork of the model. We evaluate LoTA on a wide range of challenging tasks such as instruction following, reasoning, math, and summarization. LoTA obtains better performance than full fine-tuning and low-rank adaptation (LoRA), and maintains good performance even after training on other tasks -- thus, avoiding catastrophic forgetting. By extracting and fine-tuning over \emph{lottery tickets} (or \emph{sparse task vectors}), LoTA also enables model merging over highly dissimilar tasks.
☆ M2Lingual: Enhancing Multilingual, Multi-Turn Instruction Alignment in Large Language Models
Instruction finetuning (IFT) is critical for aligning Large Language Models (LLMs) to follow instructions. Numerous effective IFT datasets have been proposed in the recent past, but most focus on high resource languages such as English. In this work, we propose a fully synthetic, novel taxonomy (Evol) guided Multilingual, Multi-turn instruction finetuning dataset, called M2Lingual, to better align LLMs on a diverse set of languages and tasks. M2Lingual contains a total of 182K IFT pairs that are built upon diverse seeds, covering 70 languages, 17 NLP tasks and general instruction-response pairs. LLMs finetuned with M2Lingual substantially outperform the majority of existing multilingual IFT datasets. Importantly, LLMs trained with M2Lingual consistently achieve competitive results across a wide variety of evaluation benchmarks compared to existing multilingual IFT datasets. Specifically, LLMs finetuned with M2Lingual achieve strong performance on our translated multilingual, multi-turn evaluation benchmark as well as a wide variety of multilingual tasks. Thus we contribute, and the 2 step Evol taxonomy used for its creation. M2Lingual repository - https://huggingface.co/datasets/ServiceNow-AI/M2Lingual
comment: 39 pages
☆ It Is Not About What You Say, It Is About How You Say It: A Surprisingly Simple Approach for Improving Reading Comprehension ACL
Natural language processing has seen rapid progress over the past decade. Due to the speed of developments, some practices get established without proper evaluation. Considering one such case and focusing on reading comprehension, we ask our first research question: 1) How does the order of inputs -- i.e., question and context -- affect model performance? Additionally, given recent advancements in input emphasis, we ask a second research question: 2) Does emphasizing either the question, the context, or both enhance performance? Experimenting with 9 large language models across 3 datasets, we find that presenting the context before the question improves model performance, with an accuracy increase of up to $31\%$. Furthermore, emphasizing the context yields superior results compared to question emphasis, and in general, emphasizing parts of the input is particularly effective for addressing questions that models lack the parametric knowledge to answer. Experimenting with both prompt-based and attention-based emphasis methods, we additionally find that the best method is surprisingly simple: it only requires concatenating a few tokens to the input and results in an accuracy improvement of up to $36\%$, allowing smaller models to outperform their significantly larger counterparts.
comment: Accepted to ACL Findings
☆ Finding Transformer Circuits with Edge Pruning
The path to interpreting a language model often proceeds via analysis of circuits -- sparse computational subgraphs of the model that capture specific aspects of its behavior. Recent work has automated the task of discovering circuits. Yet, these methods have practical limitations, as they rely either on inefficient search algorithms or inaccurate approximations. In this paper, we frame automated circuit discovery as an optimization problem and propose *Edge Pruning* as an effective and scalable solution. Edge Pruning leverages gradient-based pruning techniques, but instead of removing neurons or components, it prunes the \emph{edges} between components. Our method finds circuits in GPT-2 that use less than half the number of edges compared to circuits found by previous methods while being equally faithful to the full model predictions on standard circuit-finding tasks. Edge Pruning is efficient even with as many as 100K examples, outperforming previous methods in speed and producing substantially better circuits. It also perfectly recovers the ground-truth circuits in two models compiled with Tracr. Thanks to its efficiency, we scale Edge Pruning to CodeLlama-13B, a model over 100x the scale that prior methods operate on. We use this setting for a case study comparing the mechanisms behind instruction prompting and in-context learning. We find two circuits with more than 99.96% sparsity that match the performance of the full model and reveal that the mechanisms in the two settings overlap substantially. Our case study shows that Edge Pruning is a practical and scalable tool for interpretability and sheds light on behaviors that only emerge in large models.
comment: We release our code and data publicly at https://github.com/princeton-nlp/Edge-Pruning
☆ Blending LLMs into Cascaded Speech Translation: KIT's Offline Speech Translation System for IWSLT 2024
Large Language Models (LLMs) are currently under exploration for various tasks, including Automatic Speech Recognition (ASR), Machine Translation (MT), and even End-to-End Speech Translation (ST). In this paper, we present KIT's offline submission in the constrained + LLM track by incorporating recently proposed techniques that can be added to any cascaded speech translation. Specifically, we integrate Mistral-7B\footnote{mistralai/Mistral-7B-Instruct-v0.1} into our system to enhance it in two ways. Firstly, we refine the ASR outputs by utilizing the N-best lists generated by our system and fine-tuning the LLM to predict the transcript accurately. Secondly, we refine the MT outputs at the document level by fine-tuning the LLM, leveraging both ASR and MT predictions to improve translation quality. We find that integrating the LLM into the ASR and MT systems results in an absolute improvement of $0.3\%$ in Word Error Rate and $0.65\%$ in COMET for tst2019 test set. In challenging test sets with overlapping speakers and background noise, we find that integrating LLM is not beneficial due to poor ASR performance. Here, we use ASR with chunked long-form decoding to improve context usage that may be unavailable when transcribing with Voice Activity Detection segmentation alone.
☆ OlympicArena Medal Ranks: Who Is the Most Intelligent AI So Far?
In this report, we pose the following question: Who is the most intelligent AI model to date, as measured by the OlympicArena (an Olympic-level, multi-discipline, multi-modal benchmark for superintelligent AI)? We specifically focus on the most recently released models: Claude-3.5-Sonnet, Gemini-1.5-Pro, and GPT-4o. For the first time, we propose using an Olympic medal Table approach to rank AI models based on their comprehensive performance across various disciplines. Empirical results reveal: (1) Claude-3.5-Sonnet shows highly competitive overall performance over GPT-4o, even surpassing GPT-4o on a few subjects (i.e., Physics, Chemistry, and Biology). (2) Gemini-1.5-Pro and GPT-4V are ranked consecutively just behind GPT-4o and Claude-3.5-Sonnet, but with a clear performance gap between them. (3) The performance of AI models from the open-source community significantly lags behind these proprietary models. (4) The performance of these models on this benchmark has been less than satisfactory, indicating that we still have a long way to go before achieving superintelligence. We remain committed to continuously tracking and evaluating the performance of the latest powerful models on this benchmark (available at https://github.com/GAIR-NLP/OlympicArena).
comment: 10 pages
☆ The GPT-WritingPrompts Dataset: A Comparative Analysis of Character Portrayal in Short Stories
The improved generative capabilities of large language models have made them a powerful tool for creative writing and storytelling. It is therefore important to quantitatively understand the nature of generated stories, and how they differ from human storytelling. We augment the Reddit WritingPrompts dataset with short stories generated by GPT-3.5, given the same prompts. We quantify and compare the emotional and descriptive features of storytelling from both generative processes, human and machine, along a set of six dimensions. We find that generated stories differ significantly from human stories along all six dimensions, and that human and machine generations display similar biases when grouped according to the narrative point-of-view and gender of the main protagonist. We release our dataset and code at https://github.com/KristinHuangg/gpt-writing-prompts.
☆ Towards Fast Multilingual LLM Inference: Speculative Decoding and Specialized Drafters
Large language models (LLMs) have revolutionized natural language processing and broadened their applicability across diverse commercial applications. However, the deployment of these models is constrained by high inference time in multilingual settings. To mitigate this challenge, this paper explores a training recipe of an assistant model in speculative decoding, which are leveraged to draft and-then its future tokens are verified by the target LLM. We show that language-specific draft models, optimized through a targeted pretrain-and-finetune strategy, substantially brings a speedup of inference time compared to the previous methods. We validate these models across various languages in inference time, out-of-domain speedup, and GPT-4o evaluation.
☆ Towards Zero-Shot Text-To-Speech for Arabic Dialects
Zero-shot multi-speaker text-to-speech (ZS-TTS) systems have advanced for English, however, it still lags behind due to insufficient resources. We address this gap for Arabic, a language of more than 450 million native speakers, by first adapting a sizeable existing dataset to suit the needs of speech synthesis. Additionally, we employ a set of Arabic dialect identification models to explore the impact of pre-defined dialect labels on improving the ZS-TTS model in a multi-dialect setting. Subsequently, we fine-tune the XTTS\footnote{https://docs.coqui.ai/en/latest/models/xtts.html}\footnote{https://medium.com/machine-learns/xtts-v2-new-version-of-the-open-source-text-to-speech-model-af73914db81f}\footnote{https://medium.com/@erogol/xtts-v1-techincal-notes-eb83ff05bdc} model, an open-source architecture. We then evaluate our models on a dataset comprising 31 unseen speakers and an in-house dialectal dataset. Our automated and human evaluation results show convincing performance while capable of generating dialectal speech. Our study highlights significant potential for improvements in this emerging area of research in Arabic.
☆ OCALM: Object-Centric Assessment with Language Models
Properly defining a reward signal to efficiently train a reinforcement learning (RL) agent is a challenging task. Designing balanced objective functions from which a desired behavior can emerge requires expert knowledge, especially for complex environments. Learning rewards from human feedback or using large language models (LLMs) to directly provide rewards are promising alternatives, allowing non-experts to specify goals for the agent. However, black-box reward models make it difficult to debug the reward. In this work, we propose Object-Centric Assessment with Language Models (OCALM) to derive inherently interpretable reward functions for RL agents from natural language task descriptions. OCALM uses the extensive world-knowledge of LLMs while leveraging the object-centric nature common to many environments to derive reward functions focused on relational concepts, providing RL agents with the ability to derive policies from task descriptions.
comment: Accepted at the RLBRew Workshop at RLC 2024
☆ Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers
Accommodating long sequences efficiently in autoregressive Transformers, especially within an extended context window, poses significant challenges due to the quadratic computational complexity and substantial KV memory requirements inherent in self-attention mechanisms. In this work, we introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome these computational and memory obstacles while maintaining performance. Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query, thereby enabling gradient-based optimization. As a result, SPARSEK Attention offers linear time complexity and constant memory footprint during generation. Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods and provides significant speed improvements during both training and inference, particularly in language modeling and downstream tasks. Furthermore, our method can be seamlessly integrated into pre-trained Large Language Models (LLMs) with minimal fine-tuning, offering a practical solution for effectively managing long-range dependencies in diverse applications.
comment: preprint
☆ The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources
Foundation model development attracts a rapidly expanding body of contributors, scientists, and applications. To help shape responsible development practices, we introduce the Foundation Model Development Cheatsheet: a growing collection of 250+ tools and resources spanning text, vision, and speech modalities. We draw on a large body of prior work to survey resources (e.g. software, documentation, frameworks, guides, and practical tools) that support informed data selection, processing, and understanding, precise and limitation-aware artifact documentation, efficient model training, advance awareness of the environmental impact from training, careful model evaluation of capabilities, risks, and claims, as well as responsible model release, licensing and deployment practices. We hope this curated collection of resources helps guide more responsible development. The process of curating this list, enabled us to review the AI development ecosystem, revealing what tools are critically missing, misused, or over-used in existing practices. We find that (i) tools for data sourcing, model evaluation, and monitoring are critically under-serving ethical and real-world needs, (ii) evaluations for model safety, capabilities, and environmental impact all lack reproducibility and transparency, (iii) text and particularly English-centric analyses continue to dominate over multilingual and multi-modal analyses, and (iv) evaluation of systems, rather than just models, is needed so that capabilities and impact are assessed in context.
☆ Adversarial Contrastive Decoding: Boosting Safety Alignment of Large Language Models via Opposite Prompt Optimization
With the widespread application of Large Language Models (LLMs), it has become a significant concern to ensure their safety and prevent harmful responses. While current safe-alignment methods based on instruction fine-tuning and Reinforcement Learning from Human Feedback (RLHF) can effectively reduce harmful responses from LLMs, they often require high-quality datasets and heavy computational overhead during model training. Another way to align language models is to modify the logit of tokens in model outputs without heavy training. Recent studies have shown that contrastive decoding can enhance the performance of language models by reducing the likelihood of confused tokens. However, these methods require the manual selection of contrastive models or instruction templates. To this end, we propose Adversarial Contrastive Decoding (ACD), an optimization-based framework to generate two opposite system prompts for prompt-based contrastive decoding. ACD only needs to apply a lightweight prompt tuning on a rather small anchor dataset (< 3 min for each model) without training the target model. Experiments conducted on extensive models and benchmarks demonstrate that the proposed method achieves much better safety performance than previous model training-free decoding methods without sacrificing its original generation ability.
☆ CLIMATELI: Evaluating Entity Linking on Climate Change Data
Climate Change (CC) is a pressing topic of global importance, attracting increasing attention across research fields, from social sciences to Natural Language Processing (NLP). CC is also discussed in various settings and communication platforms, from academic publications to social media forums. Understanding who and what is mentioned in such data is a first critical step to gaining new insights into CC. We present CLIMATELI (CLIMATe Entity LInking), the first manually annotated CC dataset that links 3,087 entity spans to Wikipedia. Using CLIMATELI (CLIMATe Entity LInking), we evaluate existing entity linking (EL) systems on the CC topic across various genres and propose automated filtering methods for CC entities. We find that the performance of EL models notably lags behind humans at both token and entity levels. Testing within the scope of retaining or excluding non-nominal and/or non-CC entities particularly impacts the models' performances.
comment: 7 pages, ClimateNLP 2024
☆ Venturing into Uncharted Waters: The Navigation Compass from Transformer to Mamba
Transformer, a deep neural network architecture, has long dominated the field of natural language processing and beyond. Nevertheless, the recent introduction of Mamba challenges its supremacy, sparks considerable interest among researchers, and gives rise to a series of Mamba-based models that have exhibited notable potential. This survey paper orchestrates a comprehensive discussion, diving into essential research dimensions, covering: (i) the functioning of the Mamba mechanism and its foundation on the principles of structured state space models; (ii) the proposed improvements and the integration of Mamba with various networks, exploring its potential as a substitute for Transformers; (iii) the combination of Transformers and Mamba to compensate for each other's shortcomings. We have also made efforts to interpret Mamba and Transformer in the framework of kernel functions, allowing for a comparison of their mathematical nature within a unified context. Our paper encompasses the vast majority of improvements related to Mamba to date.
☆ AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models
Although Large Language Models (LLMs) are becoming increasingly powerful, they still exhibit significant but subtle weaknesses, such as mistakes in instruction-following or coding tasks. As these unexpected errors could lead to severe consequences in practical deployments, it is crucial to investigate the limitations within LLMs systematically. Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies, while manual inspections are costly and not scalable. In this paper, we introduce a unified framework, AutoDetect, to automatically expose weaknesses in LLMs across various tasks. Inspired by the educational assessment process that measures students' learning outcomes, AutoDetect consists of three LLM-powered agents: Examiner, Questioner, and Assessor. The collaboration among these three agents is designed to realize comprehensive and in-depth weakness identification. Our framework demonstrates significant success in uncovering flaws, with an identification success rate exceeding 30% in prominent models such as ChatGPT and Claude. More importantly, these identified weaknesses can guide specific model improvements, proving more effective than untargeted data augmentation methods like Self-Instruct. Our approach has led to substantial enhancements in popular LLMs, including the Llama series and Mistral-7b, boosting their performance by over 10% across several benchmarks. Code and data are publicly available at https://github.com/thu-coai/AutoDetect.
☆ Task Oriented In-Domain Data Augmentation
Large Language Models (LLMs) have shown superior performance in various applications and fields. To achieve better performance on specialized domains such as law and advertisement, LLMs are often continue pre-trained on in-domain data. However, existing approaches suffer from two major issues. First, in-domain data are scarce compared with general domain-agnostic data. Second, data used for continual pre-training are not task-aware, such that they may not be helpful to downstream applications. We propose TRAIT, a task-oriented in-domain data augmentation framework. Our framework is divided into two parts: in-domain data selection and task-oriented synthetic passage generation. The data selection strategy identifies and selects a large amount of in-domain data from general corpora, and thus significantly enriches domain knowledge in the continual pre-training data. The synthetic passages contain guidance on how to use domain knowledge to answer questions about downstream tasks. By training on such passages, the model aligns with the need of downstream applications. We adapt LLMs to two domains: advertisement and math. On average, TRAIT improves LLM performance by 8% in the advertisement domain and 7.5% in the math domain.
☆ Scaling Laws for Linear Complexity Language Models
The interest in linear complexity models for large language models is on the rise, although their scaling capacity remains uncertain. In this study, we present the scaling laws for linear complexity language models to establish a foundation for their scalability. Specifically, we examine the scaling behaviors of three efficient linear architectures. These include TNL, a linear attention model with data-independent decay; HGRN2, a linear RNN with data-dependent decay; and cosFormer2, a linear attention model without decay. We also include LLaMA as a baseline architecture for softmax attention for comparison. These models were trained with six variants, ranging from 70M to 7B parameters on a 300B-token corpus, and evaluated with a total of 1,376 intermediate checkpoints on various downstream tasks. These tasks include validation loss, commonsense reasoning, and information retrieval and generation. The study reveals that existing linear complexity language models exhibit similar scaling capabilities as conventional transformer-based models while also demonstrating superior linguistic proficiency and knowledge retention.
comment: Technical report. Yiran Zhong is the corresponding author
☆ Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation
Segmenting text into sentences plays an early and crucial role in many NLP systems. This is commonly achieved by using rule-based or statistical methods relying on lexical features such as punctuation. Although some recent works no longer exclusively rely on punctuation, we find that no prior method achieves all of (i) robustness to missing punctuation, (ii) effective adaptability to new domains, and (iii) high efficiency. We introduce a new model - Segment any Text (SaT) - to solve this problem. To enhance robustness, we propose a new pretraining scheme that ensures less reliance on punctuation. To address adaptability, we introduce an extra stage of parameter-efficient fine-tuning, establishing state-of-the-art performance in distinct domains such as verses from lyrics and legal documents. Along the way, we introduce architectural modifications that result in a threefold gain in speed over the previous state of the art and solve spurious reliance on context far in the future. Finally, we introduce a variant of our model with fine-tuning on a diverse, multilingual mixture of sentence-segmented data, acting as a drop-in replacement and enhancement for existing segmentation tools. Overall, our contributions provide a universal approach for segmenting any text. Our method outperforms all baselines - including strong LLMs - across 8 corpora spanning diverse domains and languages, especially in practically relevant situations where text is poorly formatted. Our models and code, including documentation, are available at https://huggingface.co/segment-any-text under the MIT license.
☆ Computational Approaches to the Detection of Lesser-Known Rhetorical Figures: A Systematic Survey and Research Challenges
Rhetorical figures play a major role in our everyday communication as they make text more interesting, more memorable, or more persuasive. Therefore, it is important to computationally detect rhetorical figures to fully understand the meaning of a text. We provide a comprehensive overview of computational approaches to lesser-known rhetorical figures. We explore the linguistic and computational perspectives on rhetorical figures, emphasizing their significance for the domain of Natural Language Processing. We present different figures in detail, delving into datasets, definitions, rhetorical functions, and detection approaches. We identified challenges such as dataset scarcity, language limitations, and reliance on rule-based methods.
comment: Submitted to ACM Computing Surveys. 35 pages
☆ CAVE: Controllable Authorship Verification Explanations
Authorship Verification (AV) (do two documents have the same author?) is essential for many sensitive real-life applications. AV is often used in proprietary domains that require a private, offline model, making SOTA online models like ChatGPT undesirable. Other SOTA systems use methods, e.g. Siamese Networks, that are uninterpretable, and hence cannot be trusted in high-stakes applications. In this work, we take the first step to address the above challenges with our model CAVE (Controllable Authorship Verification Explanations): CAVE generates free-text AV explanations that are controlled to be 1) structured (can be decomposed into sub-explanations with respect to relevant linguistic features), and 2) easily verified for explanation-label consistency (via intermediate labels in sub-explanations). In this work, we train a Llama-3-8B as CAVE; since there are no human-written corpora for AV explanations, we sample silver-standard explanations from GPT-4-TURBO and distill them into a pretrained Llama-3-8B. Results on three difficult AV datasets IMdB2, Blog-Auth, and FanFiction show that CAVE generates high quality explanations (as measured by automatic and human evaluation) as well as competitive task accuracies.
☆ Large Language Models Are Cross-Lingual Knowledge-Free Reasoners
Large Language Models have demonstrated impressive reasoning capabilities across multiple languages. However, the relationship between capabilities in different languages is less explored. In this work, we decompose the process of reasoning tasks into two separated parts: knowledge retrieval and knowledge-free reasoning, and analyze the cross-lingual transferability of them. With adapted and constructed knowledge-free reasoning datasets, we show that the knowledge-free reasoning capability can be nearly perfectly transferred across various source-target language directions despite the secondary impact of resource in some specific target languages, while cross-lingual knowledge retrieval significantly hinders the transfer. Moreover, by analyzing the hidden states and feed-forward network neuron activation during the reasoning tasks, we show that higher similarity of hidden representations and larger overlap of activated neurons could explain the better cross-lingual transferability of knowledge-free reasoning than knowledge retrieval. Thus, we hypothesize that knowledge-free reasoning embeds in some language-shared mechanism, while knowledge is stored separately in different languages.
☆ ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models
The high power consumption and latency-sensitive deployments of large language models (LLMs) have motivated techniques like quantization and sparsity. Contextual sparsity, where the sparsity pattern is input-dependent, is crucial in LLMs because the permanent removal of attention heads or neurons from LLMs can significantly degrade accuracy. Prior work has attempted to model contextual sparsity using neural networks trained to predict activation magnitudes, which can be used to dynamically prune structures with low predicted activation magnitude. In this paper, we look beyond magnitude-based pruning criteria to assess attention head and neuron importance in LLMs. We developed a novel predictor called ShadowLLM, which can shadow the LLM behavior and enforce better sparsity patterns, resulting in over 15% improvement in end-to-end accuracy without increasing latency compared to previous methods. ShadowLLM achieves up to a 20\% speed-up over the state-of-the-art DejaVu framework. These enhancements are validated on models with up to 30 billion parameters. Our code is available at \href{https://github.com/abdelfattah-lab/shadow_llm/}{ShadowLLM}.
☆ Evaluation of Language Models in the Medical Context Under Resource-Constrained Settings
Since the emergence of the Transformer architecture, language model development has increased, driven by their promising potential. However, releasing these models into production requires properly understanding their behavior, particularly in sensitive domains such as medicine. Despite this need, the medical literature still lacks technical assessments of pre-trained language models, which are especially valuable in resource-constrained settings in terms of computational power or limited budget. To address this gap, we provide a comprehensive survey of language models in the medical domain. In addition, we selected a subset of these models for thorough evaluation, focusing on classification and text generation tasks. Our subset encompasses 53 models, ranging from 110 million to 13 billion parameters, spanning the three families of Transformer-based models and from diverse knowledge domains. This study employs a series of approaches for text classification together with zero-shot prompting instead of model training or fine-tuning, which closely resembles the limited resource setting in which many users of language models find themselves. Encouragingly, our findings reveal remarkable performance across various tasks and datasets, underscoring the latent potential of certain models to contain medical knowledge, even without domain specialization. Consequently, our study advocates for further exploration of model applications in medical contexts, particularly in resource-constrained settings. The code is available on https://github.com/anpoc/Language-models-in-medicine.
☆ CLEAR: Can Language Models Really Understand Causal Graphs?
Causal reasoning is a cornerstone of how humans interpret the world. To model and reason about causality, causal graphs offer a concise yet effective solution. Given the impressive advancements in language models, a crucial question arises: can they really understand causal graphs? To this end, we pioneer an investigation into language models' understanding of causal graphs. Specifically, we develop a framework to define causal graph understanding, by assessing language models' behaviors through four practical criteria derived from diverse disciplines (e.g., philosophy and psychology). We then develop CLEAR, a novel benchmark that defines three complexity levels and encompasses 20 causal graph-based tasks across these levels. Finally, based on our framework and benchmark, we conduct extensive experiments on six leading language models and summarize five empirical findings. Our results indicate that while language models demonstrate a preliminary understanding of causal graphs, significant potential for improvement remains. Our project website is at https://github.com/OpenCausaLab/CLEAR.
☆ Data Augmentation of Multi-turn Psychological Dialogue via Knowledge-driven Progressive Thought Prompting
Existing dialogue data augmentation (DA) techniques predominantly focus on augmenting utterance-level dialogues, which makes it difficult to take dialogue contextual information into account. The advent of large language models (LLMs) has simplified the implementation of multi-turn dialogues. Due to absence of professional understanding and knowledge, it remains challenging to deliver satisfactory performance in low-resource domain, like psychological dialogue dialogue. DA involves creating new training or prompting data based on the existing data, which help the model better understand and generate psychology-related responses. In this paper, we aim to address the issue of multi-turn dialogue data augmentation for boosted performance in the psychology domain. We propose a knowledge-driven progressive thought prompting method to guide LLM to generate multi-turn psychology-related dialogue. This method integrates a progressive thought generator, a psychology knowledge generator, and a multi-turn dialogue generator. The thought generated by the progressive thought generator serves as a prompt to prevent the generated dialogue from having significant semantic deviations, while the psychology knowledge generator produces psychological knowledge to serve as the dialogue history for the LLM, guiding the dialogue generator to create multi-turn psychological dialogue. To ensure the precision of multi-turn psychological dialogue generation by LLM, a meticulous professional evaluation is required. Extensive experiments conducted on three datasets related to psychological dialogue verify the effectiveness of the proposed method.
☆ Are there identifiable structural parts in the sentence embedding whole?
Sentence embeddings from transformer models encode in a fixed length vector much linguistic information. We explore the hypothesis that these embeddings consist of overlapping layers of information that can be separated, and on which specific types of information -- such as information about chunks and their structural and semantic properties -- can be detected. We show that this is the case using a dataset consisting of sentences with known chunk structure, and two linguistic intelligence datasets, solving which relies on detecting chunks and their grammatical number, and respectively, their semantic roles, and through analyses of the performance on the tasks and of the internal representations built during learning.
comment: 17 pages, 14 figures, 5 tables
☆ EvalAlign: Evaluating Text-to-Image Models through Precision Alignment of Multimodal Large Models with Supervised Fine-Tuning to Human Annotations
The recent advancements in text-to-image generative models have been remarkable. Yet, the field suffers from a lack of evaluation metrics that accurately reflect the performance of these models, particularly lacking fine-grained metrics that can guide the optimization of the models. In this paper, we propose EvalAlign, a metric characterized by its accuracy, stability, and fine granularity. Our approach leverages the capabilities of Multimodal Large Language Models (MLLMs) pre-trained on extensive datasets. We develop evaluation protocols that focus on two key dimensions: image faithfulness and text-image alignment. Each protocol comprises a set of detailed, fine-grained instructions linked to specific scoring options, enabling precise manual scoring of the generated images. We Supervised Fine-Tune (SFT) the MLLM to align closely with human evaluative judgments, resulting in a robust evaluation model. Our comprehensive tests across 24 text-to-image generation models demonstrate that EvalAlign not only provides superior metric stability but also aligns more closely with human preferences than existing metrics, confirming its effectiveness and utility in model assessment.
comment: Github Repository: https://github.com/SAIS-FUXI/EvalAlign
☆ LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training
Mixture-of-Experts (MoE) has gained increasing popularity as a promising framework for scaling up large language models (LLMs). However, training MoE from scratch in a large-scale setting still suffers from data-hungry and instability problems. Motivated by this limit, we investigate building MoE models from existing dense large language models. Specifically, based on the well-known LLaMA-2 7B model, we obtain an MoE model by: (1) Expert Construction, which partitions the parameters of original Feed-Forward Networks (FFNs) into multiple experts; (2) Continual Pre-training, which further trains the transformed MoE model and additional gate networks. In this paper, we comprehensively explore different methods for expert construction and various data sampling strategies for continual pre-training. After these stages, our LLaMA-MoE models could maintain language abilities and route the input tokens to specific experts with part of the parameters activated. Empirically, by training 200B tokens, LLaMA-MoE-3.5B models significantly outperform dense models that contain similar activation parameters. The source codes and models are available at https://github.com/pjlab-sys4nlp/llama-moe .
☆ C-LLM: Learn to Check Chinese Spelling Errors Character by Character
Chinese Spell Checking (CSC) aims to detect and correct spelling errors in sentences. Despite Large Language Models (LLMs) exhibit robust capabilities and are widely applied in various tasks, their performance on CSC is often unsatisfactory. We find that LLMs fail to meet the Chinese character-level constraints of the CSC task, namely equal length and phonetic similarity, leading to a performance bottleneck. Further analysis reveal that this issue stems from the granularity of tokenization, as current mixed character-word tokenization struggles to satisfy these character-level constraints. To address this issue, we propose C-LLM, a Large Language Model-based Chinese Spell Checking method that learns to check errors Character by Character. Character-level tokenization enables the model to learn character-level alignment, effectively mitigating issues related to character-level constraints. Furthermore, CSC is simplified to replication-dominated and substitution-supplemented tasks. Experiments on two CSC benchmarks demonstrate that C-LLM achieves an average improvement of 10% over existing methods. Specifically, it shows a 2.1% improvement in general scenarios and a significant 12% improvement in vertical domain scenarios, establishing state-of-the-art performance. The source code can be accessed at https://github.com/ktlKTL/C-LLM.
☆ Token-based Decision Criteria Are Suboptimal in In-context Learning
In-Context Learning (ICL) typically utilizes classification criteria from probabilities of manually selected label tokens. However, we argue that such token-based classification criteria lead to suboptimal decision boundaries, despite delicate calibrations through translation and constrained rotation. To address this problem, we propose Hidden Calibration, which renounces token probabilities and uses the nearest centroid classifier on the LM's last hidden states. In detail, we use the nearest centroid classification on the hidden states, assigning the category of the nearest centroid previously observed from a few-shot calibration set to the test sample as the predicted label. Our experiments on 3 models and 10 classification datasets indicate that Hidden Calibration consistently outperforms current token-based calibrations by about 20%. Our further analysis demonstrates that Hidden Calibration finds better classification criteria with less inter-categories overlap, and LMs provide linearly separable intra-category clusters with the help of demonstrations, which supports Hidden Calibration and gives new insights into the conventional ICL.
comment: 21 pages, 14 figures, 8 tables
☆ Towards Better Graph-based Cross-document Relation Extraction via Non-bridge Entity Enhancement and Prediction Debiasing ACL 2024
Cross-document Relation Extraction aims to predict the relation between target entities located in different documents. In this regard, the dominant models commonly retain useful information for relation prediction via bridge entities, which allows the model to elaborately capture the intrinsic interdependence between target entities. However, these studies ignore the non-bridge entities, each of which co-occurs with only one target entity and offers the semantic association between target entities for relation prediction. Besides, the commonly-used dataset--CodRED contains substantial NA instances, leading to the prediction bias during inference. To address these issues, in this paper, we propose a novel graph-based cross-document RE model with non-bridge entity enhancement and prediction debiasing. Specifically, we use a unified entity graph to integrate numerous non-bridge entities with target entities and bridge entities, modeling various associations between them, and then use a graph recurrent network to encode this graph. Finally, we introduce a novel debiasing strategy to calibrate the original prediction distribution. Experimental results on the closed and open settings show that our model significantly outperforms all baselines, including the GPT-3.5-turbo and InstructUIE, achieving state-of-the-art performance. Particularly, our model obtains 66.23% and 55.87% AUC points in the official leaderboard\footnote{\url{https://codalab.lisn.upsaclay.fr/competitions/3770#results}} under the two settings, respectively, ranking the first place in all submissions since December 2023. Our code is available at https://github.com/DeepLearnXMU/CoRE-NEPD.
comment: Accepted to ACL 2024 Findings
☆ Evaluating the Ability of Large Language Models to Reason about Cardinal Directions
We investigate the abilities of a representative set of Large language Models (LLMs) to reason about cardinal directions (CDs). To do so, we create two datasets: the first, co-created with ChatGPT, focuses largely on recall of world knowledge about CDs; the second is generated from a set of templates, comprehensively testing an LLM's ability to determine the correct CD given a particular scenario. The templates allow for a number of degrees of variation such as means of locomotion of the agent involved, and whether set in the first , second or third person. Even with a temperature setting of zero, Our experiments show that although LLMs are able to perform well in the simpler dataset, in the second more complex dataset no LLM is able to reliably determine the correct CD, even with a temperature setting of zero.
comment: 9 pages, 3 figures, 1 table. Short paper accepted by COSIT 24, The 16th Conference on Spatial Information Theory
☆ SyROCCo: Enhancing Systematic Reviews using Machine Learning
The sheer number of research outputs published every year makes systematic reviewing increasingly time- and resource-intensive. This paper explores the use of machine learning techniques to help navigate the systematic review process. ML has previously been used to reliably 'screen' articles for review - that is, identify relevant articles based on reviewers' inclusion criteria. The application of ML techniques to subsequent stages of a review, however, such as data extraction and evidence mapping, is in its infancy. We therefore set out to develop a series of tools that would assist in the profiling and analysis of 1,952 publications on the theme of 'outcomes-based contracting'. Tools were developed for the following tasks: assign publications into 'policy area' categories; identify and extract key information for evidence mapping, such as organisations, laws, and geographical information; connect the evidence base to an existing dataset on the same topic; and identify subgroups of articles that may share thematic content. An interactive tool using these techniques and a public dataset with their outputs have been released. Our results demonstrate the utility of ML techniques to enhance evidence accessibility and analysis within the systematic review processes. These efforts show promise in potentially yielding substantial efficiencies for future systematic reviewing and for broadening their analytical scope. Our work suggests that there may be implications for the ease with which policymakers and practitioners can access evidence. While ML techniques seem poised to play a significant role in bridging the gap between research and policy by offering innovative ways of gathering, accessing, and analysing data from systematic reviews, we also highlight their current limitations and the need to exercise caution in their application, particularly given the potential for errors and biases.
comment: 28 pages, 5 figures. To appear in Data & Policy journal
☆ The Privileged Students: On the Value of Initialization in Multilingual Knowledge Distillation
Knowledge distillation (KD) has proven to be a successful strategy to improve the performance of a smaller model in many NLP tasks. However, most of the work in KD only explores monolingual scenarios. In this paper, we investigate the value of KD in multilingual settings. We find the significance of KD and model initialization by analyzing how well the student model acquires multilingual knowledge from the teacher model. Our proposed method emphasizes copying the teacher model's weights directly to the student model to enhance initialization. Our finding shows that model initialization using copy-weight from the fine-tuned teacher contributes the most compared to the distillation process itself across various multilingual settings. Furthermore, we demonstrate that efficient weight initialization preserves multilingual capabilities even in low-resource scenarios.
comment: 8 pages
☆ Carrot and Stick: Inducing Self-Motivation with Positive & Negative Feedback
Positive thinking is thought to be an important component of self-motivation in various practical fields such as education and the workplace. Previous work, including sentiment transfer and positive reframing, has focused on the positive side of language. However, self-motivation that drives people to reach their goals has not yet been studied from a computational perspective. Moreover, negative feedback has not yet been explored, even though positive and negative feedback are both necessary to grow self-motivation. To facilitate self-motivation, we propose CArrot and STICk (CASTIC) dataset, consisting of 12,590 sentences with 5 different strategies for enhancing self-motivation. Our data and code are publicly available at here.
comment: 10 pages, 8 figures
☆ Large Vocabulary Size Improves Large Language Models
This paper empirically investigates the relationship between subword vocabulary size and the performance of large language models (LLMs) to provide insights on how to define the vocabulary size. Experimental results show that larger vocabulary sizes lead to better performance in LLMs. Moreover, we consider a continual training scenario where a pre-trained language model is trained on a different target language. We introduce a simple method to use a new vocabulary instead of the pre-defined one. We show that using the new vocabulary outperforms the model with the vocabulary used in pre-training.
comment: Work in progress
☆ OTCE: Hybrid SSM and Attention with Cross Domain Mixture of Experts to construct Observer-Thinker-Conceiver-Expresser
Recent research has shown that combining Mamba with Transformer architecture, which has selective state space and quadratic self-attention mechanism, outperforms using Mamba or Transformer architecture alone in language modeling tasks. The quadratic self-attention mechanism effectively alleviates the shortcomings of selective state space in handling long-term dependencies of any element in the sequence. We propose a position information injection method that connects the selective state space model with the quadratic attention, and integrates these two architectures with hybrid experts with cross-sharing domains, so that we can enjoy the advantages of both. We design a new architecture with a more biomimetic idea: Observer-Thinker-Conceiver-Expresser (OTCE), which can compete with well-known medium-scale open-source language models on a small scale in language modeling tasks.
☆ eagerlearners at SemEval2024 Task 5: The Legal Argument Reasoning Task in Civil Procedure
This study investigates the performance of the zero-shot method in classifying data using three large language models, alongside two models with large input token sizes and the two pre-trained models on legal data. Our main dataset comes from the domain of U.S. civil procedure. It includes summaries of legal cases, specific questions, potential answers, and detailed explanations for why each solution is relevant, all sourced from a book aimed at law students. By comparing different methods, we aimed to understand how effectively they handle the complexities found in legal datasets. Our findings show how well the zero-shot method of large language models can understand complicated data. We achieved our highest F1 score of 64% in these experiments.
☆ Deepfake tweets automatic detection
This study addresses the critical challenge of detecting DeepFake tweets by leveraging advanced natural language processing (NLP) techniques to distinguish between genuine and AI-generated texts. Given the increasing prevalence of misinformation, our research utilizes the TweepFake dataset to train and evaluate various machine learning models. The objective is to identify effective strategies for recognizing DeepFake content, thereby enhancing the integrity of digital communications. By developing reliable methods for detecting AI-generated misinformation, this work contributes to a more trustworthy online information environment.
☆ EMMI -- Empathic Multimodal Motivational Interviews Dataset: Analyses and Annotations
The study of multimodal interaction in therapy can yield a comprehensive understanding of therapist and patient behavior that can be used to develop a multimodal virtual agent supporting therapy. This investigation aims to uncover how therapists skillfully blend therapy's task goal (employing classical steps of Motivational Interviewing) with the social goal (building a trusting relationship and expressing empathy). Furthermore, we seek to categorize patients into various ``types'' requiring tailored therapeutic approaches. To this intent, we present multimodal annotations of a corpus consisting of simulated motivational interviewing conversations, wherein actors portray the roles of patients and therapists. We introduce EMMI, composed of two publicly available MI corpora, AnnoMI and the Motivational Interviewing Dataset, for which we add multimodal annotations. We analyze these annotations to characterize functional behavior for developing a virtual agent performing motivational interviews emphasizing social and empathic behaviors. Our analysis found three clusters of patients expressing significant differences in behavior and adaptation of the therapist's behavior to those types. This shows the importance of a therapist being able to adapt their behavior depending on the current situation within the dialog and the type of user.
comment: 9 pages
☆ DaLPSR: Leverage Degradation-Aligned Language Prompt for Real-World Image Super-Resolution
Image super-resolution pursuits reconstructing high-fidelity high-resolution counterpart for low-resolution image. In recent years, diffusion-based models have garnered significant attention due to their capabilities with rich prior knowledge. The success of diffusion models based on general text prompts has validated the effectiveness of textual control in the field of text2image. However, given the severe degradation commonly presented in low-resolution images, coupled with the randomness characteristics of diffusion models, current models struggle to adequately discern semantic and degradation information within severely degraded images. This often leads to obstacles such as semantic loss, visual artifacts, and visual hallucinations, which pose substantial challenges for practical use. To address these challenges, this paper proposes to leverage degradation-aligned language prompt for accurate, fine-grained, and high-fidelity image restoration. Complementary priors including semantic content descriptions and degradation prompts are explored. Specifically, on one hand, image-restoration prompt alignment decoder is proposed to automatically discern the degradation degree of LR images, thereby generating beneficial degradation priors for image restoration. On the other hand, much richly tailored descriptions from pretrained multimodal large language model elicit high-level semantic priors closely aligned with human perception, ensuring fidelity control for image restoration. Comprehensive comparisons with state-of-the-art methods have been done on several popular synthetic and real-world benchmark datasets. The quantitative and qualitative analysis have demonstrated that the proposed method achieves a new state-of-the-art perceptual quality level, especially in real-world cases based on reference-free metrics.
☆ Evaluating Visual and Cultural Interpretation: The K-Viscuit Benchmark with Human-VLM Collaboration
To create culturally inclusive vision-language models (VLMs), the foremost requirement is developing a test benchmark that can diagnose the models' ability to respond to questions reflecting cultural elements. This paper addresses the necessity for such benchmarks, noting that existing research has relied on human annotators' manual efforts, which impedes diversity and efficiency. We propose a semi-automated pipeline for constructing cultural VLM benchmarks to enhance diversity and efficiency. This pipeline leverages human-VLM collaboration, where VLMs generate questions based on guidelines, human-annotated examples, and image-wise relevant knowledge, which are then reviewed by native speakers for quality and cultural relevance. The effectiveness of our adaptable pipeline is demonstrated through a specific application: creating a dataset tailored to Korean culture, dubbed K-Viscuit. The resulting benchmark features two types of questions: Type 1 questions measure visual recognition abilities, while Type 2 assess fine-grained visual reasoning skills. This ensures a thorough diagnosis of VLM models across various aspects. Our evaluation using K-Viscuit revealed that open-source models notably lag behind proprietary models in understanding Korean culture, highlighting areas for improvement. We provided diverse analyses of VLM performance across different cultural aspects. Besides, we explored the potential of incorporating external knowledge retrieval to enhance the generation process, suggesting future directions for improving cultural interpretation ability of VLMs. Our dataset and code will be made publicly available.
☆ InterCLIP-MEP: Interactive CLIP and Memory-Enhanced Predictor for Multi-modal Sarcasm Detection
The prevalence of sarcasm in social media, conveyed through text-image combinations, presents significant challenges for sentiment analysis and intention mining. Current multi-modal sarcasm detection methods have been proven to struggle with biases from spurious cues, leading to a superficial understanding of the complex interactions between text and image. To address these issues, we propose InterCLIP-MEP, a robust framework for multi-modal sarcasm detection. InterCLIP-MEP introduces a refined variant of CLIP, Interactive CLIP (InterCLIP), as the backbone, enhancing sample representations by embedding cross-modality information in each encoder. Furthermore, a novel training strategy is designed to adapt InterCLIP for a Memory-Enhanced Predictor (MEP). MEP uses dynamic dual-channel memory to store valuable historical knowledge of test samples and then leverages this memory as a non-parametric classifier to derive the final prediction. By using InterCLIP to encode text-image interactions more effectively and incorporating MEP, InterCLIP-MEP offers a more robust recognition of multi-modal sarcasm. Experiments demonstrate that InterCLIP-MEP achieves state-of-the-art performance on the MMSD2.0 benchmark. Code and data are available at [https://github.com/CoderChen01/InterCLIP-MEP](https://github.com/CoderChen01/InterCLIP-MEP).
comment: 8 pages, 6 figures, 6 tables
☆ Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers
State-of-the-art results in large language models (LLMs) often rely on scale, which becomes computationally expensive. This has sparked a research agenda to reduce these models' parameter count and computational costs without significantly impacting their performance. Our study focuses on transformer-based LLMs, specifically targeting the computationally intensive feedforward networks (FFN), which are less studied than attention blocks. We consider three candidate linear layer approximations in the FFN by combining efficient low-rank and block-diagonal matrices. In contrast to many previous works that examined these approximations, our study i) explores these structures from the training-from-scratch perspective, ii) scales up to 1.3B parameters, and iii) is conducted within recent Transformer-based LLMs rather than convolutional architectures. We first demonstrate they can lead to actual computational gains in various scenarios, including online decoding when using a pre-merge technique. Additionally, we propose a novel training regime, called \textit{self-guided training}, aimed at improving the poor training dynamics that these approximations exhibit when used from initialization. Experiments on the large RefinedWeb dataset show that our methods are both efficient and effective for training and inference. Interestingly, these structured FFNs exhibit steeper scaling curves than the original models. Further applying self-guided training to the structured matrices with 32\% FFN parameters and 2.5$\times$ speed-up enables only a 0.4 perplexity increase under the same training FLOPs. Finally, we develop the wide and structured networks surpassing the current medium-sized and large-sized Transformer in perplexity and throughput performance. Our code is available at \url{https://github.com/CLAIRE-Labo/StructuredFFN/tree/main}.
☆ UniCoder: Scaling Code Large Language Model via Universal Code ACL 2024
Intermediate reasoning or acting steps have successfully improved large language models (LLMs) for handling various downstream natural language processing (NLP) tasks. When applying LLMs for code generation, recent works mainly focus on directing the models to articulate intermediate natural-language reasoning steps, as in chain-of-thought (CoT) prompting, and then output code with the natural language or other structured intermediate steps. However, such output is not suitable for code translation or generation tasks since the standard CoT has different logical structures and forms of expression with the code. In this work, we introduce the universal code (UniCode) as the intermediate representation. It is a description of algorithm steps using a mix of conventions of programming languages, such as assignment operator, conditional operator, and loop. Hence, we collect an instruction dataset UniCoder-Instruct to train our model UniCoder on multi-task learning objectives. UniCoder-Instruct comprises natural-language questions, code solutions, and the corresponding universal code. The alignment between the intermediate universal code representation and the final code solution significantly improves the quality of the generated code. The experimental results demonstrate that UniCoder with the universal code significantly outperforms the previous prompting methods by a large margin, showcasing the effectiveness of the structural clues in pseudo-code.
comment: Accepted by ACL 2024 (Main)
☆ Multilingual Knowledge Editing with Language-Agnostic Factual Neurons
Multilingual knowledge editing (MKE) aims to simultaneously revise factual knowledge across multilingual languages within large language models (LLMs). However, most existing MKE methods just adapt existing monolingual editing methods to multilingual scenarios, overlooking the deep semantic connections of the same factual knowledge between different languages, thereby limiting edit performance. To address this issue, we first investigate how LLMs represent multilingual factual knowledge and discover that the same factual knowledge in different languages generally activates a shared set of neurons, which we call language-agnostic factual neurons. These neurons represent the semantic connections between multilingual knowledge and are mainly located in certain layers. Inspired by this finding, we propose a new MKE method by locating and modifying Language-Agnostic Factual Neurons (LAFN) to simultaneously edit multilingual knowledge. Specifically, we first generate a set of paraphrases for each multilingual knowledge to be edited to precisely locate the corresponding language-agnostic factual neurons. Then we optimize the update values for modifying these located neurons to achieve simultaneous modification of the same factual knowledge in multiple languages. Experimental results on Bi-ZsRE and MzsRE benchmarks demonstrate that our method outperforms existing MKE methods and achieves remarkable edit performance, indicating the importance of considering the semantic connections among multilingual knowledge.
comment: 12 pages, 4 figures, 7 tables
☆ A Symmetry Property of Christoffel Words SC
Motivated by the theory of trapezoidal words, whose sequences of cardinality of factors by length are symmetric, we introduce a bivariate variant of this symmetry. We show that this symmetry characterizes Christoffel words, and establish other related results.
comment: In Proceedings GASCom 2024, arXiv:2406.14588
☆ UNO Arena for Evaluating Sequential Decision-Making Capability of Large Language Models
Sequential decision-making refers to algorithms that take into account the dynamics of the environment, where early decisions affect subsequent decisions. With large language models (LLMs) demonstrating powerful capabilities between tasks, we can't help but ask: Can Current LLMs Effectively Make Sequential Decisions? In order to answer this question, we propose the UNO Arena based on the card game UNO to evaluate the sequential decision-making capability of LLMs and explain in detail why we choose UNO. In UNO Arena, We evaluate the sequential decision-making capability of LLMs dynamically with novel metrics based Monte Carlo methods. We set up random players, DQN-based reinforcement learning players, and LLM players (e.g. GPT-4, Gemini-pro) for comparison testing. Furthermore, in order to improve the sequential decision-making capability of LLMs, we propose the TUTRI player, which can involves having LLMs reflect their own actions wtih the summary of game history and the game strategy. Numerous experiments demonstrate that the TUTRI player achieves a notable breakthrough in the performance of sequential decision-making compared to the vanilla LLM player.
☆ On the Transformations across Reward Model, Parameter Update, and In-Context Prompt
Despite the general capabilities of pre-trained large language models (LLMs), they still need further adaptation to better serve practical applications. In this paper, we demonstrate the interchangeability of three popular and distinct adaptation tools: parameter updating, reward modeling, and in-context prompting. This interchangeability establishes a triangular framework with six transformation directions, each of which facilitates a variety of applications. Our work offers a holistic view that unifies numerous existing studies and suggests potential research directions. We envision our work as a useful roadmap for future research on LLMs.
☆ KEHRL: Learning Knowledge-Enhanced Language Representations with Hierarchical Reinforcement Learning
Knowledge-enhanced pre-trained language models (KEPLMs) leverage relation triples from knowledge graphs (KGs) and integrate these external data sources into language models via self-supervised learning. Previous works treat knowledge enhancement as two independent operations, i.e., knowledge injection and knowledge integration. In this paper, we propose to learn Knowledge-Enhanced language representations with Hierarchical Reinforcement Learning (KEHRL), which jointly addresses the problems of detecting positions for knowledge injection and integrating external knowledge into the model in order to avoid injecting inaccurate or irrelevant knowledge. Specifically, a high-level reinforcement learning (RL) agent utilizes both internal and prior knowledge to iteratively detect essential positions in texts for knowledge injection, which filters out less meaningful entities to avoid diverting the knowledge learning direction. Once the entity positions are selected, a relevant triple filtration module is triggered to perform low-level RL to dynamically refine the triples associated with polysemic entities through binary-valued actions. Experiments validate KEHRL's effectiveness in probing factual knowledge and enhancing the model's performance on various natural language understanding tasks.
☆ UniPSDA: Unsupervised Pseudo Semantic Data Augmentation for Zero-Shot Cross-Lingual Natural Language Understanding
Cross-lingual representation learning transfers knowledge from resource-rich data to resource-scarce ones to improve the semantic understanding abilities of different languages. However, previous works rely on shallow unsupervised data generated by token surface matching, regardless of the global context-aware semantics of the surrounding text tokens. In this paper, we propose an Unsupervised Pseudo Semantic Data Augmentation (UniPSDA) mechanism for cross-lingual natural language understanding to enrich the training data without human interventions. Specifically, to retrieve the tokens with similar meanings for the semantic data augmentation across different languages, we propose a sequential clustering process in 3 stages: within a single language, across multiple languages of a language family, and across languages from multiple language families. Meanwhile, considering the multi-lingual knowledge infusion with context-aware semantics while alleviating computation burden, we directly replace the key constituents of the sentences with the above-learned multi-lingual family knowledge, viewed as pseudo-semantic. The infusion process is further optimized via three de-biasing techniques without introducing any neural parameters. Extensive experiments demonstrate that our model consistently improves the performance on general zero-shot cross-lingual natural language understanding tasks, including sequence classification, information extraction, and question answering.
☆ Evaluation of Instruction-Following Ability for Large Language Models on Story-Ending Generation
Instruction-tuned Large Language Models (LLMs) have achieved remarkable performance across various benchmark tasks. While providing instructions to LLMs for guiding their generations is user-friendly, assessing their instruction-following capabilities is still unclarified due to a lack of evaluation metrics. In this paper, we focus on evaluating the instruction-following ability of LLMs in the context of story-ending generation, which requires diverse and context-specific instructions. We propose an automatic evaluation pipeline that utilizes a machine reading comprehension (MRC) model to determine whether the generated story-ending reflects instruction. Our findings demonstrate that our proposed metric aligns with human evaluation. Furthermore, our experiments confirm that recent open-source LLMs can achieve instruction-following performance close to GPT-3.5, as assessed through automatic evaluation.
☆ ADVSCORE: A Metric for the Evaluation and Creation of Adversarial Benchmarks
Adversarial benchmarks validate model abilities by providing samples that fool models but not humans. However, despite the proliferation of datasets that claim to be adversarial, there does not exist an established metric to evaluate how adversarial these datasets are. To address this lacuna, we introduce ADVSCORE, a metric which quantifies how adversarial and discriminative an adversarial dataset is and exposes the features that make data adversarial. We then use ADVSCORE to underpin a dataset creation pipeline that incentivizes writing a high-quality adversarial dataset. As a proof of concept, we use ADVSCORE to collect an adversarial question answering (QA) dataset, ADVQA, from our pipeline. The high-quality questions in ADVQA surpasses three adversarial benchmarks across domains at fooling several models but not humans. We validate our result based on difficulty estimates from 9,347 human responses on four datasets and predictions from three models. Moreover, ADVSCORE uncovers which adversarial tactics used by human writers fool models (e.g., GPT-4) but not humans. Through ADVSCORE and its analyses, we offer guidance on revealing language model vulnerabilities and producing reliable adversarial examples.
comment: arXiv admin note: substantial text overlap with arXiv:2401.11185
☆ EHRCon: Dataset for Checking Consistency between Unstructured Notes and Structured Tables in Electronic Health Records
Electronic Health Records (EHRs) are integral for storing comprehensive patient medical records, combining structured data (e.g., medications) with detailed clinical notes (e.g., physician notes). These elements are essential for straightforward data retrieval and provide deep, contextual insights into patient care. However, they often suffer from discrepancies due to unintuitive EHR system designs and human errors, posing serious risks to patient safety. To address this, we developed EHRCon, a new dataset and task specifically designed to ensure data consistency between structured tables and unstructured notes in EHRs. EHRCon was crafted in collaboration with healthcare professionals using the MIMIC-III EHR dataset, and includes manual annotations of 3,943 entities across 105 clinical notes checked against database entries for consistency. EHRCon has two versions, one using the original MIMIC-III schema, and another using the OMOP CDM schema, in order to increase its applicability and generalizability. Furthermore, leveraging the capabilities of large language models, we introduce CheckEHR, a novel framework for verifying the consistency between clinical notes and database tables. CheckEHR utilizes an eight-stage process and shows promising results in both few-shot and zero-shot settings. The code is available at https://github.com/dustn1259/EHRCon.
☆ DemoRank: Selecting Effective Demonstrations for Large Language Models in Ranking Task
Recently, there has been increasing interest in applying large language models (LLMs) as zero-shot passage rankers. However, few studies have explored how to select appropriate in-context demonstrations for the passage ranking task, which is the focus of this paper. Previous studies mainly apply a demonstration retriever to retrieve demonstrations and use top-$k$ demonstrations for in-context learning (ICL). Although effective, this approach overlooks the dependencies between demonstrations, leading to inferior performance of few-shot ICL in the passage ranking task. In this paper, we formulate the demonstration selection as a \textit{retrieve-then-rerank} process and introduce the DemoRank framework. In this framework, we first use LLM feedback to train a demonstration retriever and construct a novel dependency-aware training samples to train a demonstration reranker to improve few-shot ICL. The construction of such training samples not only considers demonstration dependencies but also performs in an efficient way. Extensive experiments demonstrate DemoRank's effectiveness in in-domain scenarios and strong generalization to out-of-domain scenarios. Our codes are available at~\url{https://github.com/8421BCD/DemoRank}.
☆ Pruning via Merging: Compressing LLMs via Manifold Alignment Based Layer Merging
While large language models (LLMs) excel in many domains, their complexity and scale challenge deployment in resource-limited environments. Current compression techniques, such as parameter pruning, often fail to effectively utilize the knowledge from pruned parameters. To address these challenges, we propose Manifold-Based Knowledge Alignment and Layer Merging Compression (MKA), a novel approach that uses manifold learning and the Normalized Pairwise Information Bottleneck (NPIB) measure to merge similar layers, reducing model size while preserving essential performance. We evaluate MKA on multiple benchmark datasets and various LLMs. Our findings show that MKA not only preserves model performance but also achieves substantial compression ratios, outperforming traditional pruning methods. Moreover, when coupled with quantization, MKA delivers even greater compression. Specifically, on the MMLU dataset using the Llama3-8B model, MKA achieves a compression ratio of 43.75% with a minimal performance decrease of only 2.82\%. The proposed MKA method offers a resource-efficient and performance-preserving model compression technique for LLMs.
☆ What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Noise-free Text-Image Corruption and Evaluation
Vision-Language Models (VLMs) have gained community-spanning prominence due to their ability to integrate visual and textual inputs to perform complex tasks. Despite their success, the internal decision-making processes of these models remain opaque, posing challenges in high-stakes applications. To address this, we introduce NOTICE, the first Noise-free Text-Image Corruption and Evaluation pipeline for mechanistic interpretability in VLMs. NOTICE incorporates a Semantic Minimal Pairs (SMP) framework for image corruption and Symmetric Token Replacement (STR) for text. This approach enables semantically meaningful causal mediation analysis for both modalities, providing a robust method for analyzing multimodal integration within models like BLIP. Our experiments on the SVO-Probes, MIT-States, and Facial Expression Recognition datasets reveal crucial insights into VLM decision-making, identifying the significant role of middle-layer cross-attention heads. Further, we uncover a set of ``universal cross-attention heads'' that consistently contribute across tasks and modalities, each performing distinct functions such as implicit image segmentation, object inhibition, and outlier inhibition. This work paves the way for more transparent and interpretable multimodal systems.
☆ Modelled Multivariate Overlap: A method for measuring vowel merger
This paper introduces a novel method for quantifying vowel overlap. There is a tension in previous work between using multivariate measures, such as those derived from empirical distributions, and the ability to control for unbalanced data and extraneous factors, as is possible when using fitted model parameters. The method presented here resolves this tension by jointly modelling all acoustic dimensions of interest and by simulating distributions from the model to compute a measure of vowel overlap. An additional benefit of this method is that computation of uncertainty becomes straightforward. We evaluate this method on corpus speech data targeting the PIN-PEN merger in four dialects of English and find that using modelled distributions to calculate Bhattacharyya affinity substantially improves results compared to empirical distributions, while the difference between multivariate and univariate modelling is subtle.
comment: Accepted to Interspeech 2024
☆ Does Cross-Cultural Alignment Change the Commonsense Morality of Language Models? ACL 2024
Alignment of the language model with human preferences is a common approach to making a language model useful to end users. However, most alignment work is done in English, and human preference datasets are dominated by English, reflecting only the preferences of English-speaking annotators. Nevertheless, it is common practice to use the English preference data, either directly or by translating it into the target language, when aligning a multilingual language model. The question is whether such an alignment strategy marginalizes the preference of non-English speaking users. To this end, we investigate the effect of aligning Japanese language models with (mostly) English resources. In particular, we focus on evaluating whether the commonsense morality of the resulting fine-tuned models is aligned with Japanese culture using the JCommonsenseMorality (JCM) and ETHICS datasets. The experimental results show that the fine-tuned model outperforms the SFT model. However, it does not demonstrate the same level of improvement as a model fine-tuned using the JCM, suggesting that while some aspects of commonsense morality are transferable, others may not be.
comment: The 2nd Workshop on Cross-Cultural Considerations in NLP (C3NLP) at ACL 2024
☆ Song Data Cleansing for End-to-End Neural Singer Diarization Using Neural Analysis and Synthesis Framework INTERSPEECH 2024
We propose a data cleansing method that utilizes a neural analysis and synthesis (NANSY++) framework to train an end-to-end neural diarization model (EEND) for singer diarization. Our proposed model converts song data with choral singing which is commonly contained in popular music and unsuitable for generating a simulated dataset to the solo singing data. This cleansing is based on NANSY++, which is a framework trained to reconstruct an input non-overlapped audio signal. We exploit the pre-trained NANSY++ to convert choral singing into clean, non-overlapped audio. This cleansing process mitigates the mislabeling of choral singing to solo singing and helps the effective training of EEND models even when the majority of available song data contains choral singing sections. We experimentally evaluated the EEND model trained with a dataset using our proposed method using annotated popular duet songs. As a result, our proposed method improved 14.8 points in diarization error rate.
comment: INTERSPEECH 2024 accepted
☆ Anomaly Detection of Tabular Data Using LLMs
Large language models (LLMs) have shown their potential in long-context understanding and mathematical reasoning. In this paper, we study the problem of using LLMs to detect tabular anomalies and show that pre-trained LLMs are zero-shot batch-level anomaly detectors. That is, without extra distribution-specific model fitting, they can discover hidden outliers in a batch of data, demonstrating their ability to identify low-density data regions. For LLMs that are not well aligned with anomaly detection and frequently output factual errors, we apply simple yet effective data-generating processes to simulate synthetic batch-level anomaly detection datasets and propose an end-to-end fine-tuning strategy to bring out the potential of LLMs in detecting real anomalies. Experiments on a large anomaly detection benchmark (ODDS) showcase i) GPT-4 has on-par performance with the state-of-the-art transductive learning-based anomaly detection methods and ii) the efficacy of our synthetic dataset and fine-tuning strategy in aligning LLMs to this task.
comment: accepted at the Anomaly Detection with Foundation Models workshop
☆ Cascade Reward Sampling for Efficient Decoding-Time Alignment
Aligning large language models (LLMs) with human preferences is critical for their deployment. Recently, decoding-time alignment has emerged as an effective plug-and-play technique that requires no fine-tuning of model parameters. However, generating text that achieves both high reward and high likelihood remains a significant challenge. Existing methods often fail to generate high-reward text or incur substantial computational costs. In this paper, we propose Cascade Reward Sampling (CARDS) to address both issues, guaranteeing the generation of high-reward and high-likelihood text with significantly low costs. Based on our analysis of reward models (RMs) on incomplete text and our observation that high-reward prefixes induce high-reward complete text, we use rejection sampling to iteratively generate small semantic segments to form such prefixes. The segment length is dynamically determined by the predictive uncertainty of LLMs. This strategy guarantees desirable prefixes for subsequent generations and significantly reduces wasteful token re-generations and the number of reward model scoring. Our experiments demonstrate substantial gains in both generation efficiency and alignment ratings compared to the baselines, achieving five times faster text generation and 99\% win-ties in GPT-4/Claude-3 helpfulness evaluation.
☆ Compensate Quantization Errors: Make Weights Hierarchical to Compensate Each Other
Emergent Large Language Models (LLMs) use their extraordinary performance and powerful deduction capacity to discern from traditional language models. However, the expenses of computational resources and storage for these LLMs are stunning, quantization then arises as a trending conversation. To address accuracy decay caused by quantization, two streams of works in post-training quantization methods stand out. One uses other weights to compensate existing quantization error, while the other transfers the quantization difficulty to other parts in the model. Combining both merits, we introduce Learnable Singular value Increment (LSI) as an advanced solution. LSI uses Singular Value Decomposition to extract singular values of the weights and make them learnable to help weights compensate each other conditioned on activation. Incorporating LSI with existing techniques, we achieve state-of-the-art performance in diverse quantization settings, no matter in weight-only, weight-activation or extremely low bit scenarios. By unleashing the potential of LSI, efficient finetuning on quantized model is no longer a prohibitive problem.
comment: Efficient quantization method
☆ LangSuitE: Planning, Controlling and Interacting with Large Language Models in Embodied Text Environments
Recent advances in Large Language Models (LLMs) have shown inspiring achievements in constructing autonomous agents that rely on language descriptions as inputs. However, it remains unclear how well LLMs can function as few-shot or zero-shot embodied agents in dynamic interactive environments. To address this gap, we introduce LangSuitE, a versatile and simulation-free testbed featuring 6 representative embodied tasks in textual embodied worlds. Compared with previous LLM-based testbeds, LangSuitE (i) offers adaptability to diverse environments without multiple simulation engines, (ii) evaluates agents' capacity to develop ``internalized world knowledge'' with embodied observations, and (iii) allows easy customization of communication and action strategies. To address the embodiment challenge, we devise a novel chain-of-thought (CoT) schema, EmMem, which summarizes embodied states w.r.t. history information. Comprehensive benchmark results illustrate challenges and insights of embodied planning. LangSuitE represents a significant step toward building embodied generalists in the context of language models.
☆ Combining Supervised Learning and Reinforcement Learning for Multi-Label Classification Tasks with Partial Labels
Traditional supervised learning heavily relies on human-annotated datasets, especially in data-hungry neural approaches. However, various tasks, especially multi-label tasks like document-level relation extraction, pose challenges in fully manual annotation due to the specific domain knowledge and large class sets. Therefore, we address the multi-label positive-unlabelled learning (MLPUL) problem, where only a subset of positive classes is annotated. We propose Mixture Learner for Partially Annotated Classification (MLPAC), an RL-based framework combining the exploration ability of reinforcement learning and the exploitation ability of supervised learning. Experimental results across various tasks, including document-level relation extraction, multi-label image classification, and binary PU learning, demonstrate the generalization and effectiveness of our framework.
☆ PlagBench: Exploring the Duality of Large Language Models in Plagiarism Generation and Detection
Recent literature has highlighted potential risks to academic integrity associated with large language models (LLMs), as they can memorize parts of training instances and reproduce them in the generated texts without proper attribution. In addition, given their capabilities in generating high-quality texts, plagiarists can exploit LLMs to generate realistic paraphrases or summaries indistinguishable from original work. In response to possible malicious use of LLMs in plagiarism, we introduce PlagBench, a comprehensive dataset consisting of 46.5K synthetic plagiarism cases generated using three instruction-tuned LLMs across three writing domains. The quality of PlagBench is ensured through fine-grained automatic evaluation for each type of plagiarism, complemented by human annotation. We then leverage our proposed dataset to evaluate the plagiarism detection performance of five modern LLMs and three specialized plagiarism checkers. Our findings reveal that GPT-3.5 tends to generates paraphrases and summaries of higher quality compared to Llama2 and GPT-4. Despite LLMs' weak performance in summary plagiarism identification, they can surpass current commercial plagiarism detectors. Overall, our results highlight the potential of LLMs to serve as robust plagiarism detection tools.
comment: 9 pages
☆ Investigating the Influence of Prompt-Specific Shortcuts in AI Generated Text Detection
AI Generated Text (AIGT) detectors are developed with texts from humans and LLMs of common tasks. Despite the diversity of plausible prompt choices, these datasets are generally constructed with a limited number of prompts. The lack of prompt variation can introduce prompt-specific shortcut features that exist in data collected with the chosen prompt, but do not generalize to others. In this paper, we analyze the impact of such shortcuts in AIGT detection. We propose Feedback-based Adversarial Instruction List Optimization (FAILOpt), an attack that searches for instructions deceptive to AIGT detectors exploiting prompt-specific shortcuts. FAILOpt effectively drops the detection performance of the target detector, comparable to other attacks based on adversarial in-context examples. We also utilize our method to enhance the robustness of the detector by mitigating the shortcuts. Based on the findings, we further train the classifier with the dataset augmented by FAILOpt prompt. The augmented classifier exhibits improvements across generation models, tasks, and attacks. Our code will be available at https://github.com/zxcvvxcz/FAILOpt.
comment: 19 pages, 3 figures, 13 tables, under review
☆ One Thousand and One Pairs: A "novel" challenge for long-context language models
Synthetic long-context LLM benchmarks (e.g., "needle-in-the-haystack") test only surface-level retrieval capabilities, but how well can long-context LLMs retrieve, synthesize, and reason over information across book-length inputs? We address this question by creating NoCha, a dataset of 1,001 minimally different pairs of true and false claims about 67 recently-published English fictional books, written by human readers of those books. In contrast to existing long-context benchmarks, our annotators confirm that the largest share of pairs in NoCha require global reasoning over the entire book to verify. Our experiments show that while human readers easily perform this task, it is enormously challenging for all ten long-context LLMs that we evaluate: no open-weight model performs above random chance (despite their strong performance on synthetic benchmarks), while GPT-4o achieves the highest accuracy at 55.8%. Further analysis reveals that (1) on average, models perform much better on pairs that require only sentence-level retrieval vs. global reasoning; (2) model-generated explanations for their decisions are often inaccurate even for correctly-labeled claims; and (3) models perform substantially worse on speculative fiction books that contain extensive world-building. The methodology proposed in NoCha allows for the evolution of the benchmark dataset and the easy analysis of future models.
comment: preprint, 29 pages
☆ Confidence Regulation Neurons in Language Models
Despite their widespread use, the mechanisms by which large language models (LLMs) represent and regulate uncertainty in next-token predictions remain largely unexplored. This study investigates two critical components believed to influence this uncertainty: the recently discovered entropy neurons and a new set of components that we term token frequency neurons. Entropy neurons are characterized by an unusually high weight norm and influence the final layer normalization (LayerNorm) scale to effectively scale down the logits. Our work shows that entropy neurons operate by writing onto an unembedding null space, allowing them to impact the residual stream norm with minimal direct effect on the logits themselves. We observe the presence of entropy neurons across a range of models, up to 7 billion parameters. On the other hand, token frequency neurons, which we discover and describe here for the first time, boost or suppress each token's logit proportionally to its log frequency, thereby shifting the output distribution towards or away from the unigram distribution. Finally, we present a detailed case study where entropy neurons actively manage confidence in the setting of induction, i.e. detecting and continuing repeated subsequences.
comment: 25 pages, 14 figures
☆ LLMs assist NLP Researchers: Critique Paper (Meta-)Reviewing
This work is motivated by two key trends. On one hand, large language models (LLMs) have shown remarkable versatility in various generative tasks such as writing, drawing, and question answering, significantly reducing the time required for many routine tasks. On the other hand, researchers, whose work is not only time-consuming but also highly expertise-demanding, face increasing challenges as they have to spend more time reading, writing, and reviewing papers. This raises the question: how can LLMs potentially assist researchers in alleviating their heavy workload? This study focuses on the topic of LLMs assist NLP Researchers, particularly examining the effectiveness of LLM in assisting paper (meta-)reviewing and its recognizability. To address this, we constructed the ReviewCritique dataset, which includes two types of information: (i) NLP papers (initial submissions rather than camera-ready) with both human-written and LLM-generated reviews, and (ii) each review comes with "deficiency" labels and corresponding explanations for individual segments, annotated by experts. Using ReviewCritique, this study explores two threads of research questions: (i) "LLMs as Reviewers", how do reviews generated by LLMs compare with those written by humans in terms of quality and distinguishability? (ii) "LLMs as Metareviewers", how effectively can LLMs identify potential issues, such as Deficient or unprofessional review segments, within individual paper reviews? To our knowledge, this is the first work to provide such a comprehensive analysis.
☆ CLERC: A Dataset for Legal Case Retrieval and Retrieval-Augmented Analysis Generation
Legal professionals need to write analyses that rely on citations to relevant precedents, i.e., previous case decisions. Intelligent systems assisting legal professionals in writing such documents provide great benefits but are challenging to design. Such systems need to help locate, summarize, and reason over salient precedents in order to be useful. To enable systems for such tasks, we work with legal professionals to transform a large open-source legal corpus into a dataset supporting two important backbone tasks: information retrieval (IR) and retrieval-augmented generation (RAG). This dataset CLERC (Case Law Evaluation Retrieval Corpus), is constructed for training and evaluating models on their ability to (1) find corresponding citations for a given piece of legal analysis and to (2) compile the text of these citations (as well as previous context) into a cogent analysis that supports a reasoning goal. We benchmark state-of-the-art models on CLERC, showing that current approaches still struggle: GPT-4o generates analyses with the highest ROUGE F-scores but hallucinates the most, while zero-shot IR models only achieve 48.3% recall@1000.
☆ Vaporetto: Efficient Japanese Tokenization Based on Improved Pointwise Linear Classification
This paper proposes an approach to improve the runtime efficiency of Japanese tokenization based on the pointwise linear classification (PLC) framework, which formulates the whole tokenization process as a sequence of linear classification problems. Our approach optimizes tokenization by leveraging the characteristics of the PLC framework and the task definition. Our approach involves (1) composing multiple classifications into array-based operations, (2) efficient feature lookup with memory-optimized automata, and (3) three orthogonal pre-processing methods for reducing actual score calculation. Thus, our approach makes the tokenization speed 5.7 times faster than the current approach based on the same model without decreasing tokenization accuracy. Our implementation is available at https://github.com/daac-tools/vaporetto under the MIT or Apache-2.0 license.
☆ Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models
As Large Language Models (LLMs) continue to exhibit remarkable performance in natural language understanding tasks, there is a crucial need to measure their ability for human-like multi-step logical reasoning. Existing logical reasoning evaluation benchmarks often focus primarily on simplistic single-step or multi-step reasoning with a limited set of inference rules. Furthermore, the lack of datasets for evaluating non-monotonic reasoning represents a crucial gap since it aligns more closely with human-like reasoning. To address these limitations, we propose Multi-LogiEval, a comprehensive evaluation dataset encompassing multi-step logical reasoning with various inference rules and depths. Multi-LogiEval covers three logic types--propositional, first-order, and non-monotonic--consisting of more than 30 inference rules and more than 60 of their combinations with various depths. Leveraging this dataset, we conduct evaluations on a range of LLMs including GPT-4, ChatGPT, Gemini-Pro, Yi, Orca, and Mistral, employing a zero-shot chain-of-thought. Experimental results show that there is a significant drop in the performance of LLMs as the reasoning steps/depth increases (average accuracy of ~68% at depth-1 to ~43% at depth-5). We further conduct a thorough investigation of reasoning chains generated by LLMs which reveals several important findings. We believe that Multi-LogiEval facilitates future research for evaluating and enhancing the logical reasoning ability of LLMs. Data is available at https://github.com/Mihir3009/Multi-LogiEval.
comment: 23 Pages
☆ Paraphrase and Aggregate with Large Language Models for Minimizing Intent Classification Errors SIGIR 2024
Large language models (LLM) have achieved remarkable success in natural language generation but lesser focus has been given to their applicability in decision making tasks such as classification. We show that LLMs like LLaMa can achieve high performance on large multi-class classification tasks but still make classification errors and worse, generate out-of-vocabulary class labels. To address these critical issues, we introduce Paraphrase and AGgregate (PAG)-LLM approach wherein an LLM generates multiple paraphrases of the input query (parallel queries), performs multi-class classification for the original query and each paraphrase, and at the end aggregate all the classification labels based on their confidence scores. We evaluate PAG-LLM on two large multi-class classication datasets: CLINC, and Banking and show 22.7% and 15.1% error reduction. We show that PAG-LLM is especially effective for hard examples where LLM is uncertain, and reduces the critical misclassification and hallucinated label generation errors
comment: Accepted at SIGIR 2024
☆ DEXTER: A Benchmark for open-domain Complex Question Answering using LLMs
Open-domain complex Question Answering (QA) is a difficult task with challenges in evidence retrieval and reasoning. The complexity of such questions could stem from questions being compositional, hybrid evidence, or ambiguity in questions. While retrieval performance for classical QA tasks is well explored, their capabilities for heterogeneous complex retrieval tasks, especially in an open-domain setting, and the impact on downstream QA performance, are relatively unexplored. To address this, in this work, we propose a benchmark composing diverse complex QA tasks and provide a toolkit to evaluate state-of-the-art pre-trained dense and sparse retrieval models in an open-domain setting. We observe that late interaction models and surprisingly lexical models like BM25 perform well compared to other pre-trained dense retrieval models. In addition, since context-based reasoning is critical for solving complex QA tasks, we also evaluate the reasoning capabilities of LLMs and the impact of retrieval performance on their reasoning capabilities. Through experiments, we observe that much progress is to be made in retrieval for complex QA to improve downstream QA performance. Our software and related data can be accessed at https://github.com/VenkteshV/DEXTER
comment: under submission, 22 pages
☆ Testing network clustering algorithms with Natural Language Processing
The advent of online social networks has led to the development of an abundant literature on the study of online social groups and their relationship to individuals' personalities as revealed by their textual productions. Social structures are inferred from a wide range of social interactions. Those interactions form complex -- sometimes multi-layered -- networks, on which community detection algorithms are applied to extract higher order structures. The choice of the community detection algorithm is however hardily questioned in relation with the cultural production of the individual they classify. In this work, we assume the entangled nature of social networks and their cultural production to propose a definition of cultural based online social groups as sets of individuals whose online production can be categorized as social group-related. We take advantage of this apparently self-referential description of online social groups with a hybrid methodology that combines a community detection algorithm and a natural language processing classification algorithm. A key result of this analysis is the possibility to score community detection algorithms using their agreement with the natural language processing classification. A second result is that we can assign the opinion of a random user at >85% accuracy.
comment: 10 pages, 8 figures
☆ Automated Adversarial Discovery for Safety Classifiers NAACL 2024
Safety classifiers are critical in mitigating toxicity on online forums such as social media and in chatbots. Still, they continue to be vulnerable to emergent, and often innumerable, adversarial attacks. Traditional automated adversarial data generation methods, however, tend to produce attacks that are not diverse, but variations of previously observed harm types. We formalize the task of automated adversarial discovery for safety classifiers - to find new attacks along previously unseen harm dimensions that expose new weaknesses in the classifier. We measure progress on this task along two key axes (1) adversarial success: does the attack fool the classifier? and (2) dimensional diversity: does the attack represent a previously unseen harm type? Our evaluation of existing attack generation methods on the CivilComments toxicity task reveals their limitations: Word perturbation attacks fail to fool classifiers, while prompt-based LLM attacks have more adversarial success, but lack dimensional diversity. Even our best-performing prompt-based method finds new successful attacks on unseen harm dimensions of attacks only 5\% of the time. Automatically finding new harmful dimensions of attack is crucial and there is substantial headroom for future research on our new task.
comment: Published at Fourth Workshop on TrustworthyNLP (TrustNLP) at NAACL 2024
☆ Attention Instruction: Amplifying Attention in the Middle via Prompting
The context window of large language models has been extended to 128k tokens or more. However, language models still suffer from position bias and have difficulty in accessing and using the middle part of the context due to the lack of attention. We examine the relative position awareness of LLMs and the feasibility of mitigating disproportional attention through prompting. We augment the original task instruction with $\texttt{attention instructions}$ that direct language models to allocate more attention towards a selected segment of the context. We conduct a comprehensive investigation on multi-document question answering task with both position-based and index-based instructions. We find that language models do not have relative position awareness of the context. Nevertheless, they demonstrate the capacity to adapt attention to a specific segment using matching indexes. Our analysis contributes to a deeper understanding of position bias in LLMs and provides a pathway to mitigate this bias by instruction, thus benefiting LLMs in locating and utilizing relevant information from retrieved documents in RAG applications.
☆ Large Language Models Assume People are More Rational than We Really are
In order for AI systems to communicate effectively with people, they must understand how we make decisions. However, people's decisions are not always rational, so the implicit internal models of human decision-making in Large Language Models (LLMs) must account for this. Previous empirical evidence seems to suggest that these implicit models are accurate -- LLMs offer believable proxies of human behavior, acting how we expect humans would in everyday interactions. However, by comparing LLM behavior and predictions to a large dataset of human decisions, we find that this is actually not the case: when both simulating and predicting people's choices, a suite of cutting-edge LLMs (GPT-4o & 4-Turbo, Llama-3-8B & 70B, Claude 3 Opus) assume that people are more rational than we really are. Specifically, these models deviate from human behavior and align more closely with a classic model of rational choice -- expected value theory. Interestingly, people also tend to assume that other people are rational when interpreting their behavior. As a consequence, when we compare the inferences that LLMs and people draw from the decisions of others using another psychological dataset, we find that these inferences are highly correlated. Thus, the implicit decision-making models of LLMs appear to be aligned with the human expectation that other people will act rationally, rather than with how people actually act.
☆ modeLing: A Novel Dataset for Testing Linguistic Reasoning in Language Models
We introduce modeLing, a novel benchmark of Linguistics Olympiad-style puzzles which tests few-shot reasoning in AI systems. Solving these puzzles necessitates inferring aspects of a language's grammatical structure from a small number of examples. Such puzzles provide a natural testbed for language models, as they require compositional generalization and few-shot inductive reasoning. Consisting solely of new puzzles written specifically for this work, modeLing has no risk of appearing in the training data of existing AI systems: this ameliorates the risk of data leakage, a potential confounder for many prior evaluations of reasoning. Evaluating several large open source language models and GPT on our benchmark, we observe non-negligible accuracy, demonstrating few-shot emergent reasoning ability which cannot merely be attributed to shallow memorization. However, imperfect model performance suggests that modeLing can be used to measure further progress in linguistic reasoning.
♻ ☆ Reward Steering with Evolutionary Heuristics for Decoding-time Alignment
The widespread applicability and increasing omnipresence of LLMs have instigated a need to align LLM responses to user and stakeholder preferences. Many preference optimization approaches have been proposed that fine-tune LLM parameters to achieve good alignment. However, such parameter tuning is known to interfere with model performance on many tasks. Moreover, keeping up with shifting user preferences is tricky in such a situation. Decoding-time alignment with reward model guidance solves these issues at the cost of increased inference time. However, most of such methods fail to strike the right balance between exploration and exploitation of reward -- often due to the conflated formulation of these two aspects - to give well-aligned responses. To remedy this we decouple these two aspects and implement them in an evolutionary fashion: exploration is enforced by decoding from mutated instructions and exploitation is represented as the periodic replacement of poorly-rewarded generations with well-rewarded ones. Empirical evidences indicate that this strategy outperforms many preference optimization and decode-time alignment approaches on two widely accepted alignment benchmarks AlpacaEval 2 and MT-Bench. Our implementation will be available at: https://darwin-alignment.github.io.
♻ ☆ Low-Resource Multi-Granularity Academic Function Recognition Based on Multiple Prompt Knowledge
Fine-tuning pre-trained language models (PLMs), e.g., SciBERT, generally requires large numbers of annotated data to achieve state-of-the-art performance on a range of NLP tasks in the scientific domain. However, obtaining the fine-tune data for scientific NLP task is still challenging and expensive. Inspired by recent advancement in prompt learning, in this paper, we propose the Mix Prompt Tuning (MPT), which is a semi-supervised method to alleviate the dependence on annotated data and improve the performance of multi-granularity academic function recognition tasks with a small number of labeled examples. Specifically, the proposed method provides multi-perspective representations by combining manual prompt templates with automatically learned continuous prompt templates to help the given academic function recognition task take full advantage of knowledge in PLMs. Based on these prompt templates and the fine-tuned PLM, a large number of pseudo labels are assigned to the unlabeled examples. Finally, we fine-tune the PLM using the pseudo training set. We evaluate our method on three academic function recognition tasks of different granularity including the citation function, the abstract sentence function, and the keyword function, with datasets from computer science domain and biomedical domain. Extensive experiments demonstrate the effectiveness of our method and statistically significant improvements against strong baselines. In particular, it achieves an average increase of 5% in Macro-F1 score compared with fine-tuning, and 6% in Macro-F1 score compared with other semi-supervised method under low-resource settings. In addition, MPT is a general method that can be easily applied to other low-resource scientific classification tasks.
comment: This article has been accepted by The Electronic Library and the full article is now available on Emerald Insight
♻ ☆ Structured Packing in LLM Training Improves Long Context Utilization
Recent advancements in long-context large language models have attracted significant attention, yet their practical applications often suffer from suboptimal context utilization. This study investigates structuring training data to enhance semantic interdependence, demonstrating that this approach effectively improves context utilization. To this end, we introduce the Structured Packing for Long Context (SPLiCe) method, which utilizes retrieval to collate mutually relevant documents into long and coherent training examples. We validate SPLiCe empirically across models of varying sizes -- 3B, 7B, and 13B -- achieving improved performance in long-context tasks, such as Qasper and HotpotQA. Remarkably, even brief fine-tuning with SPLiCe is sufficient to realize these benefits. Additionally, SPLiCe effectively mitigates the lost-in-middle phenomenon often observed in large models. Our comprehensive analysis of SPLiCe explores its design choices and reveals intriguing transfer effects; for instance, training on programming code enhances performance on natural language tasks.
comment: new experiments with a 13B model
♻ ☆ Can Many-Shot In-Context Learning Help Long-Context LLM Judges? See More, Judge Better!
Leveraging Large Language Models (LLMs) as judges for evaluating the performance of LLMs has recently garnered attention. Nonetheless, this type of approach concurrently introduces potential biases from LLMs, raising concerns about the reliability of the evaluation results. To mitigate this issue, we propose and study two versions of many-shot in-context prompts, Reinforced and Unsupervised ICL, for helping GPT-4o-as-a-Judge in single answer grading. The former uses in-context examples with model-generated rationales, and the latter without. Based on the designed prompts, we investigate the impact of scaling the number of in-context examples on the agreement and quality of the evaluation. Furthermore, we first reveal the symbol bias in GPT-4o-as-a-Judge for pairwise comparison and then propose a simple yet effective approach to mitigate it. Experimental results show that advanced long-context LLMs, such as GPT-4o, perform better in the many-shot regime than in the zero-shot regime. Meanwhile, the experimental results further verify the effectiveness of the symbol bias mitigation approach.
comment: work in progress
♻ ☆ Attribute Diversity Determines the Systematicity Gap in VQA
The degree to which neural networks can generalize to new combinations of familiar concepts, and the conditions under which they are able to do so, has long been an open question. In this work, we study the systematicity gap in visual question answering: the performance difference between reasoning on previously seen and unseen combinations of object attributes. To test, we introduce a novel diagnostic dataset, CLEVR-HOPE. We find that while increased quantity of training data does not reduce the systematicity gap, increased training data diversity of the attributes in the unseen combination does. In all, our experiments suggest that the more distinct attribute type combinations are seen during training, the more systematic we can expect the resulting model to be.
comment: 33 pages, 20 figures
♻ ☆ Flow of Reasoning: Efficient Training of LLM Policy with Divergent Thinking
Divergent thinking, the cognitive process of generating diverse solutions, is a hallmark of human creativity and problem-solving. For machines, sampling diverse solution trajectories in complex reasoning problems is crucial for robust outcomes, data augmentation, and enhanced model generalization. Large language models (LLMs) often struggle with generating high-quality, diverse reasoning. While supervised fine-tuning helps with quality, it requires extensive supervision data to capture the full diversity of solutions. Alternatively, reinforcement learning methods like PPO aim to find limited highest-reward solutions while neglecting the solution diversity, akin to convergent thinking. To address these limitations, we propose Flow of Reasoning (FoR) -- an efficient LLM training approach enabling diverse reasoning with minimal data. FoR formulates multi-step LLM reasoning as a Markovian flow from an initial state to terminal states. The formulation allows to adapt principled GFlowNet approaches to train the LLM as a policy, which is able to sample multiple reasoning paths with probabilities proportional to the unnormalized reward. Empirical results show that, with limited training data (e.g., 15 examples), FoR can discover diverse high-quality solutions that excel greatly beyond current state-of-the-art methods across three tasks, including embodied reasoning (BlocksWorld), math puzzle solving (Game24), and logical reasoning (PrOntoQA). Code is available at https://github.com/Yu-Fangxu/FoR.
♻ ☆ FairytaleQA Translated: Enabling Educational Question and Answer Generation in Less-Resourced Languages
Question Answering (QA) datasets are crucial in assessing reading comprehension skills for both machines and humans. While numerous datasets have been developed in English for this purpose, a noticeable void exists in less-resourced languages. To alleviate this gap, our paper introduces machine-translated versions of FairytaleQA, a renowned QA dataset designed to assess and enhance narrative comprehension skills in young children. By employing fine-tuned, modest-scale models, we establish benchmarks for both Question Generation (QG) and QA tasks within the translated datasets. In addition, we present a case study proposing a model for generating question-answer pairs, with an evaluation incorporating quality metrics such as question well-formedness, answerability, relevance, and children suitability. Our evaluation prioritizes quantifying and describing error cases, along with providing directions for future work. This paper contributes to the advancement of QA and QG research in less-resourced languages, promoting accessibility and inclusivity in the development of these models for reading comprehension. The code and data is publicly available at github.com/bernardoleite/fairytaleqa-translated.
comment: Preprint - Accepted for publication at ECTEL 2024
♻ ☆ Children's Speech Recognition through Discrete Token Enhancement
Children's speech recognition is considered a low-resource task mainly due to the lack of publicly available data. There are several reasons for such data scarcity, including expensive data collection and annotation processes, and data privacy, among others. Transforming speech signals into discrete tokens that do not carry sensitive information but capture both linguistic and acoustic information could be a solution for privacy concerns. In this study, we investigate the integration of discrete speech tokens into children's speech recognition systems as input without significantly degrading the ASR performance. Additionally, we explored single-view and multi-view strategies for creating these discrete labels. Furthermore, we tested the models for generalization capabilities with unseen domain and nativity dataset. Results reveal that the discrete token ASR for children achieves nearly equivalent performance with an approximate 83% reduction in parameters.
comment: Accepted at Interspeech 2024
♻ ☆ LatentExplainer: Explaining Latent Representations in Deep Generative Models with Multi-modal Foundation Models
Deep generative models like VAEs and diffusion models have advanced various generation tasks by leveraging latent variables to learn data distributions and generate high-quality samples. Despite the field of explainable AI making strides in interpreting machine learning models, understanding latent variables in generative models remains challenging. This paper introduces LatentExplainer, a framework for automatically generating semantically meaningful explanations of latent variables in deep generative models. LatentExplainer tackles three main challenges: inferring the meaning of latent variables, aligning explanations with inductive biases, and handling varying degrees of explainability. By perturbing latent variables and interpreting changes in generated data, the framework provides a systematic approach to understanding and controlling the data generation process, enhancing the transparency and interpretability of deep generative models. We evaluate our proposed method on several real-world and synthetic datasets, and the results demonstrate superior performance in generating high-quality explanations of latent variables.
♻ ☆ When Parts are Greater Than Sums: Individual LLM Components Can Outperform Full Models
This paper studies in-context learning (ICL) by decomposing the output of large language models into the individual contributions of attention heads and MLPs (components). We observe curious components: good-performing ones that individually do well on a classification task, even when the model performs poorly; bad-performing ones that do much worse than chance; and label-biased components that always predict the same label. We find that component accuracies are well-correlated across different demonstration sets and perturbations of prompt templates, even when the full-model accuracy varies greatly. Based on our findings, we propose component reweighting, which learns to linearly re-scale the component activations from a few labeled examples. Given 24 labeled examples, our method improves by an average of 6.0% accuracy points over 24-shot ICL across 8 tasks on Llama-2-7B. Overall, this paper both enriches our understanding of ICL and provides a practical method for improvement by examining model internals.
comment: fix typos and citations; appendix
♻ ☆ Limited Out-of-Context Knowledge Reasoning in Large Language Models
Large Language Models (LLMs) have demonstrated strong capabilities as knowledge bases and significant in-context reasoning capabilities. However, previous work challenges their out-of-context reasoning ability, i.e., the ability to infer information from their training data, instead of from the context or prompt. This paper focuses on a significant facet of out-of-context reasoning: Out-of-Context Knowledge Reasoning (OCKR), which is to combine multiple knowledge to infer new knowledge. We designed a synthetic dataset with seven representative OCKR tasks to systematically assess the OCKR capabilities of LLMs. Using this dataset, we evaluated the LLaMA2-13B-chat model and discovered that its proficiency in this aspect is limited, regardless of whether the knowledge is trained in a separate or adjacent training settings. Moreover, training the model to reason with complete reasoning data did not result in significant improvement. Training the model to perform explicit knowledge retrieval helps in only one of the tasks, indicating that the model's limited OCKR capabilities are due to difficulties in retrieving relevant knowledge. Furthermore, we treat cross-lingual knowledge transfer as a distinct form of OCKR, and evaluate this ability. Our results show that the evaluated model also exhibits limited ability in transferring knowledge across languages. The dataset used in this study is available at https://github.com/NJUNLP/ID-OCKR.
♻ ☆ CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training
Selecting high-quality data for pre-training is crucial in shaping the downstream task performance of language models. A major challenge lies in identifying this optimal subset, a problem generally considered intractable, thus necessitating scalable and effective heuristics. In this work, we propose a data selection method, CoLoR-Filter (Conditional Loss Reduction Filtering), which leverages an empirical Bayes-inspired approach to derive a simple and computationally efficient selection criterion based on the relative loss values of two auxiliary models. In addition to the modeling rationale, we evaluate CoLoR-Filter empirically on two language modeling tasks: (1) selecting data from C4 for domain adaptation to evaluation on Books and (2) selecting data from C4 for a suite of downstream multiple-choice question answering tasks. We demonstrate favorable scaling both as we subselect more aggressively and using small auxiliary models to select data for large target models. As one headline result, CoLoR-Filter data selected using a pair of 150m parameter auxiliary models can train a 1.2b parameter target model to match a 1.2b parameter model trained on 25b randomly selected tokens with 25x less data for Books and 11x less data for the downstream tasks. Code: https://github.com/davidbrandfonbrener/color-filter-olmo Filtered data: https://huggingface.co/datasets/davidbrandfonbrener/color-filtered-c4
♻ ☆ A Comprehensive Survey on Relation Extraction: Recent Advances and New Frontiers
Relation extraction (RE) involves identifying the relations between entities from underlying content. RE serves as the foundation for many natural language processing (NLP) and information retrieval applications, such as knowledge graph completion and question answering. In recent years, deep neural networks have dominated the field of RE and made noticeable progress. Subsequently, the large pre-trained language models have taken the state-of-the-art RE to a new level. This survey provides a comprehensive review of existing deep learning techniques for RE. First, we introduce RE resources, including datasets and evaluation metrics. Second, we propose a new taxonomy to categorize existing works from three perspectives, i.e., text representation, context encoding, and triplet prediction. Third, we discuss several important challenges faced by RE and summarize potential techniques to tackle these challenges. Finally, we outline some promising future directions and prospects in this field. This survey is expected to facilitate researchers' collaborative efforts to address the challenges of real-world RE systems.
♻ ☆ Make Large Language Model a Better Ranker
Large Language Models (LLMs) demonstrate robust capabilities across various fields, leading to a paradigm shift in LLM-enhanced Recommender System (RS). Research to date focuses on point-wise and pair-wise recommendation paradigms, which are inefficient for LLM-based recommenders due to high computational costs. However, existing list-wise approaches also fall short in ranking tasks due to misalignment between ranking objectives and next-token prediction. Moreover, these LLM-based methods struggle to effectively address the order relation among candidates, particularly given the scale of ratings. To address these challenges, this paper introduces the large language model framework with Aligned Listwise Ranking Objectives (ALRO). ALRO is designed to bridge the gap between the capabilities of LLMs and the nuanced requirements of ranking tasks. Specifically, ALRO employs explicit feedback in a listwise manner by introducing soft lambda loss, a customized adaptation of lambda loss designed for optimizing order relations. This mechanism provides more accurate optimization goals, enhancing the ranking process. Additionally, ALRO incorporates a permutation-sensitive learning mechanism that addresses position bias, a prevalent issue in generative models, without imposing additional computational burdens during inference. Our evaluative studies reveal that ALRO outperforms both existing embedding-based recommendation methods and LLM-based recommendation baselines.
comment: 12 pages, 5 figures
♻ ☆ Deep Prompt Multi-task Network for Abuse Language Detection ICPR
The detection of abusive language remains a long-standing challenge with the extensive use of social networks. The detection task of abusive language suffers from limited accuracy. We argue that the existing detection methods utilize the fine-tuning technique of the pre-trained language models (PLMs) to handle downstream tasks. Hence, these methods fail to stimulate the general knowledge of the PLMs. To address the problem, we propose a novel Deep Prompt Multi-task Network (DPMN) for abuse language detection. Specifically, DPMN first attempts to design two forms of deep prompt tuning and light prompt tuning for the PLMs. The effects of different prompt lengths, tuning strategies, and prompt initialization methods on detecting abusive language are studied. In addition, we propose a Task Head based on Bi-LSTM and FFN, which can be used as a short text classifier. Eventually, DPMN utilizes multi-task learning to improve detection metrics further. The multi-task network has the function of transferring effective knowledge. The proposed DPMN is evaluated against eight typical methods on three public datasets: OLID, SOLID, and AbuseAnalyzer. The experimental results show that our DPMN outperforms the state-of-the-art methods.
comment: Accepted by the International Conference on Pattern Recognition (ICPR) 2024
♻ ☆ MELoRA: Mini-Ensemble Low-Rank Adapters for Parameter-Efficient Fine-Tuning ACL2024
Parameter-efficient fine-tuning (PEFT) is a popular method for tailoring pre-trained large language models (LLMs), especially as the models' scale and the diversity of tasks increase. Low-rank adaptation (LoRA) is based on the idea that the adaptation process is intrinsically low-dimensional, i.e., significant model changes can be represented with relatively few parameters. However, decreasing the rank encounters challenges with generalization errors for specific tasks when compared to full-parameter fine-tuning. We present MELoRA, a mini-ensemble low-rank adapters that uses fewer trainable parameters while maintaining a higher rank, thereby offering improved performance potential. The core idea is to freeze original pretrained weights and train a group of mini LoRAs with only a small number of parameters. This can capture a significant degree of diversity among mini LoRAs, thus promoting better generalization ability. We conduct a theoretical analysis and empirical studies on various NLP tasks. Our experimental results show that, compared to LoRA, MELoRA achieves better performance with 8 times fewer trainable parameters on natural language understanding tasks and 36 times fewer trainable parameters on instruction following tasks, which demonstrates the effectiveness of MELoRA.
comment: ACL2024
♻ ☆ LLMs Are Few-Shot In-Context Low-Resource Language Learners
In-context learning (ICL) empowers large language models (LLMs) to perform diverse tasks in underrepresented languages using only short in-context information, offering a crucial avenue for narrowing the gap between high-resource and low-resource languages. Nonetheless, there is only a handful of works explored ICL for low-resource languages with most of them focusing on relatively high-resource languages, such as French and Spanish. In this work, we extensively study ICL and its cross-lingual variation (X-ICL) on 25 low-resource and 7 relatively higher-resource languages. Our study not only assesses the effectiveness of ICL with LLMs in low-resource languages but also identifies the shortcomings of in-context label alignment, and introduces a more effective alternative: query alignment. Moreover, we provide valuable insights into various facets of ICL for low-resource languages. Our study concludes the significance of few-shot in-context information on enhancing the low-resource understanding quality of LLMs through semantically relevant information by closing the language gap in the target language and aligning the semantics between the targeted low-resource and the high-resource language that the model is proficient in. Our work highlights the importance of advancing ICL research, particularly for low-resource languages. Our code is publicly released at https://github.com/SamuelCahyawijaya/in-context-alignment
♻ ☆ CLUE: A Clinical Language Understanding Evaluation for LLMs
Large Language Models (LLMs) are expected to significantly contribute to patient care, diagnostics, and administrative processes. Emerging biomedical LLMs aim to address healthcare-specific challenges, including privacy demands and computational constraints. Assessing the models' suitability for this sensitive application area is of the utmost importance. However, evaluation has primarily been limited to non-clinical tasks, which do not reflect the complexity of practical clinical applications. To fill this gap, we present the Clinical Language Understanding Evaluation (CLUE), a benchmark tailored to evaluate LLMs on clinical tasks. CLUE includes six tasks to test the practical applicability of LLMs in complex healthcare settings. Our evaluation includes a total of $25$ LLMs. In contrast to previous evaluations, CLUE shows a decrease in performance for nine out of twelve biomedical models. Our benchmark represents a step towards a standardized approach to evaluating and developing LLMs in healthcare to align future model development with the real-world needs of clinical application. We open-source all evaluation scripts and datasets for future research at https://github.com/TIO-IKIM/CLUE.
♻ ☆ ESC-Eval: Evaluating Emotion Support Conversations in Large Language Models
Emotion Support Conversation (ESC) is a crucial application, which aims to reduce human stress, offer emotional guidance, and ultimately enhance human mental and physical well-being. With the advancement of Large Language Models (LLMs), many researchers have employed LLMs as the ESC models. However, the evaluation of these LLM-based ESCs remains uncertain. Inspired by the awesome development of role-playing agents, we propose an ESC Evaluation framework (ESC-Eval), which uses a role-playing agent to interact with ESC models, followed by a manual evaluation of the interactive dialogues. In detail, we first re-organize 2,801 role-playing cards from seven existing datasets to define the roles of the role-playing agent. Second, we train a specific role-playing model called ESC-Role which behaves more like a confused person than GPT-4. Third, through ESC-Role and organized role cards, we systematically conduct experiments using 14 LLMs as the ESC models, including general AI-assistant LLMs (ChatGPT) and ESC-oriented LLMs (ExTES-Llama). We conduct comprehensive human annotations on interactive multi-turn dialogues of different ESC models. The results show that ESC-oriented LLMs exhibit superior ESC abilities compared to general AI-assistant LLMs, but there is still a gap behind human performance. Moreover, to automate the scoring process for future ESC models, we developed ESC-RANK, which trained on the annotated data, achieving a scoring performance surpassing 35 points of GPT-4. Our data and code are available at https://github.com/haidequanbu/ESC-Eval.
comment: Pre-print
♻ ☆ A Survey on Neural Topic Models: Methods, Applications, and Challenges
Topic models have been prevalent for decades to discover latent topics and infer topic proportions of documents in an unsupervised fashion. They have been widely used in various applications like text analysis and context recommendation. Recently, the rise of neural networks has facilitated the emergence of a new research field -- Neural Topic Models (NTMs). Different from conventional topic models, NTMs directly optimize parameters without requiring model-specific derivations. This endows NTMs with better scalability and flexibility, resulting in significant research attention and plentiful new methods and applications. In this paper, we present a comprehensive survey on neural topic models concerning methods, applications, and challenges. Specifically, we systematically organize current NTM methods according to their network structures and introduce the NTMs for various scenarios like short texts and bilingual documents. We also discuss a wide range of popular applications built on NTMs. Finally, we highlight the challenges confronted by NTMs to inspire future research. We accompany this survey with a repository for easier access to the mentioned paper resources: https://github.com/bobxwu/Paper-Neural-Topic-Models.
comment: Accepted to Artificial Intelligence Review. See https://doi.org/10.1007/s10462-023-10661-7 and a paper list at https://github.com/BobXWu/Paper-Neural-Topic-Models
♻ ☆ Mirror: A Multiple-perspective Self-Reflection Method for Knowledge-rich Reasoning ACL24
While Large language models (LLMs) have the capability to iteratively reflect on their own outputs, recent studies have observed their struggles with knowledge-rich problems without access to external resources. In addition to the inefficiency of LLMs in self-assessment, we also observe that LLMs struggle to revisit their predictions despite receiving explicit negative feedback. Therefore, We propose Mirror, a Multiple-perspective self-reflection method for knowledge-rich reasoning, to avoid getting stuck at a particular reflection iteration. Mirror enables LLMs to reflect from multiple-perspective clues, achieved through a heuristic interaction between a Navigator and a Reasoner. It guides agents toward diverse yet plausibly reliable reasoning trajectory without access to ground truth by encouraging (1) diversity of directions generated by Navigator and (2) agreement among strategically induced perturbations in responses generated by the Reasoner. The experiments on five reasoning datasets demonstrate that Mirror's superiority over several contemporary self-reflection approaches. Additionally, the ablation study studies clearly indicate that our strategies alleviate the aforementioned challenges.
comment: ACL24, Main Conference, long paper. Code is available at https://github.com/hanqi-qi/Mirror.git
♻ ☆ Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions ACL 2024
This paper introduces Fundus, a user-friendly news scraper that enables users to obtain millions of high-quality news articles with just a few lines of code. Unlike existing news scrapers, we use manually crafted, bespoke content extractors that are specifically tailored to the formatting guidelines of each supported online newspaper. This allows us to optimize our scraping for quality such that retrieved news articles are textually complete and without HTML artifacts. Further, our framework combines both crawling (retrieving HTML from the web or large web archives) and content extraction into a single pipeline. By providing a unified interface for a predefined collection of newspapers, we aim to make Fundus broadly usable even for non-technical users. This paper gives an overview of the framework, discusses our design choices, and presents a comparative evaluation against other popular news scrapers. Our evaluation shows that Fundus yields significantly higher quality extractions (complete and artifact-free news articles) than prior work. The framework is available on GitHub under https://github.com/flairNLP/fundus and can be simply installed using pip.
comment: 10 pages, 4 figures, ACL 2024, for a screencast see https://www.youtube.com/watch?v=9GJExMelhdI
♻ ☆ ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming
When building Large Language Models (LLMs), it is paramount to bear safety in mind and protect them with guardrails. Indeed, LLMs should never generate content promoting or normalizing harmful, illegal, or unethical behavior that may contribute to harm to individuals or society. This principle applies to both normal and adversarial use. In response, we introduce ALERT, a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy. It is designed to evaluate the safety of LLMs through red teaming methodologies and consists of more than 45k instructions categorized using our novel taxonomy. By subjecting LLMs to adversarial testing scenarios, ALERT aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models. Furthermore, the fine-grained taxonomy enables researchers to perform an in-depth evaluation that also helps one to assess the alignment with various policies. In our experiments, we extensively evaluate 10 popular open- and closed-source LLMs and demonstrate that many of them still struggle to attain reasonable levels of safety.
comment: 17 pages, preprint
♻ ☆ Prompting with Divide-and-Conquer Program Makes Large Language Models Discerning to Hallucination and Deception
Foundation models, such as Large language Models (LLMs), have attracted significant amount of interest due to their large number of applications. However, when handling tasks involving repetitive sub-tasks and/or deceptive contents, such as arithmetic calculation and article-level fake news detection, simple instructional prompts suffer from inaccurate responses. Existing works show that more complicated prompting strategies, such as Chain-of-Thoughts and Least-to-Most, can unlock LLM's powerful capacity in diverse areas. Recent researches reveal that simple divide-and-conquer prompting strategy, i.e. simply dividing the input sequence to multiple sub-inputs, can also substantially improve LLM's performance in some specific tasks such as misinformation detection. In this paper, we aim at examining the utility of divide-and-conquer prompting strategy and answer on which kind of tasks this strategy gets advantages. Specifically, we provide a theoretic analysis to divide-and-conquer prompting strategy and help us identify the specific tasks where DaC prompting can bring performance boost with theoretic guarantee. We then present two cases (large integer arithmetic and fact verification) where experimental results aligns with our theoretic analysis.
comment: Preprint
♻ ☆ C-ICL: Contrastive In-context Learning for Information Extraction
There has been increasing interest in exploring the capabilities of advanced large language models (LLMs) in the field of information extraction (IE), specifically focusing on tasks related to named entity recognition (NER) and relation extraction (RE). Although researchers are exploring the use of few-shot information extraction through in-context learning with LLMs, they tend to focus only on using correct or positive examples for demonstration, neglecting the potential value of incorporating incorrect or negative examples into the learning process. In this paper, we present c-ICL, a novel few-shot technique that leverages both correct and incorrect sample constructions to create in-context learning demonstrations. This approach enhances the ability of LLMs to extract entities and relations by utilizing prompts that incorporate not only the positive samples but also the reasoning behind them. This method allows for the identification and correction of potential interface errors. Specifically, our proposed method taps into the inherent contextual information and valuable information in hard negative samples and the nearest positive neighbors to the test and then applies the in-context learning demonstrations based on LLMs. Our experiments on various datasets indicate that c-ICL outperforms previous few-shot in-context learning methods, delivering substantial enhancements in performance across a broad spectrum of related tasks. These improvements are noteworthy, showcasing the versatility of our approach in miscellaneous scenarios.
comment: 15 pages
♻ ☆ Improving In-context Learning via Bidirectional Alignment
Large language models (LLMs) have shown impressive few-shot generalization on many tasks via in-context learning (ICL). Despite their success in showing such emergent abilities, the scale and complexity of larger models also lead to unprecedentedly high computational demands and deployment challenges. In reaction, researchers explore transferring the powerful capabilities of larger models to more efficient and compact models by typically aligning the output of smaller (student) models with that of larger (teacher) models. Existing methods either train student models on the generated outputs of teacher models or imitate their token-level probability distributions. However, these distillation methods pay little to no attention to the input, which also plays a crucial role in ICL. Based on the finding that the performance of ICL is highly sensitive to the selection of demonstration examples, we propose Bidirectional Alignment (BiAlign) to fully leverage the models' preferences for ICL examples to improve the ICL abilities of student models. Specifically, we introduce the alignment of input preferences between student and teacher models by incorporating a novel ranking loss, in addition to aligning the token-level output distribution. With extensive experiments and analysis, we demonstrate that BiAlign can consistently outperform existing baselines on a variety of tasks involving language understanding, reasoning, and coding.
♻ ☆ TRUCE: Private Benchmarking to Prevent Contamination and Improve Comparative Evaluation of LLMs
Benchmarking is the de-facto standard for evaluating LLMs, due to its speed, replicability and low cost. However, recent work has pointed out that the majority of the open source benchmarks available today have been contaminated or leaked into LLMs, meaning that LLMs have access to test data during pretraining and/or fine-tuning. This raises serious concerns about the validity of benchmarking studies conducted so far and the future of evaluation using benchmarks. To solve this problem, we propose Private Benchmarking, a solution where test datasets are kept private and models are evaluated without revealing the test data to the model. We describe various scenarios (depending on the trust placed on model owners or dataset owners), and present solutions to avoid data contamination using private benchmarking. For scenarios where the model weights need to be kept private, we describe solutions from confidential computing and cryptography that can aid in private benchmarking. We build an end-to-end system, TRUCE, that enables such private benchmarking showing that the overheads introduced to protect models and benchmark are negligible (in the case of confidential computing) and tractable (when cryptographic security is required). Finally, we also discuss solutions to the problem of benchmark dataset auditing, to ensure that private benchmarks are of sufficiently high quality.
♻ ☆ LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models ACL 2024
Efficient fine-tuning is vital for adapting large language models (LLMs) to downstream tasks. However, it requires non-trivial efforts to implement these methods on different models. We present LlamaFactory, a unified framework that integrates a suite of cutting-edge efficient training methods. It provides a solution for flexibly customizing the fine-tuning of 100+ LLMs without the need for coding through the built-in web UI LlamaBoard. We empirically validate the efficiency and effectiveness of our framework on language modeling and text generation tasks. It has been released at https://github.com/hiyouga/LLaMA-Factory and received over 24,000 stars and 3,000 forks.
comment: 13 pages, accepted to ACL 2024 System Demonstration Track
♻ ☆ Investigating the impact of 2D gesture representation on co-speech gesture generation
Co-speech gestures play a crucial role in the interactions between humans and embodied conversational agents (ECA). Recent deep learning methods enable the generation of realistic, natural co-speech gestures synchronized with speech, but such approaches require large amounts of training data. "In-the-wild" datasets, which compile videos from sources such as YouTube through human pose detection models, offer a solution by providing 2D skeleton sequences that are paired with speech. Concurrently, innovative lifting models have emerged, capable of transforming these 2D pose sequences into their 3D counterparts, leading to large and diverse datasets of 3D gestures. However, the derived 3D pose estimation is essentially a pseudo-ground truth, with the actual ground truth being the 2D motion data. This distinction raises questions about the impact of gesture representation dimensionality on the quality of generated motions, a topic that, to our knowledge, remains largely unexplored. In this work, we evaluate the impact of the dimensionality of the training data, 2D or 3D joint coordinates, on the performance of a multimodal speech-to-gesture deep generative model. We use a lifting model to convert 2D-generated sequences of body pose to 3D. Then, we compare the sequence of gestures generated directly in 3D to the gestures generated in 2D and lifted to 3D as post-processing.
comment: 8 pages. Paper accepted at WACAI 2024
♻ ☆ Comprehensive Reassessment of Large-Scale Evaluation Outcomes in LLMs: A Multifaceted Statistical Approach
Amidst the rapid evolution of LLMs, the significance of evaluation in comprehending and propelling these models forward is increasingly paramount. Evaluations have revealed that factors such as scaling, training types, architectures and other factors profoundly impact the performance of LLMs. However, the extent and nature of these impacts continue to be subjects of debate because most assessments have been restricted to a limited number of models and data points. Clarifying the effects of these factors on performance scores can be more effectively achieved through a statistical lens. Our study embarks on a thorough re-examination of these LLMs, targeting the inadequacies in current evaluation methods. With the advent of a uniform evaluation framework, our research leverages an expansive dataset of evaluation results, introducing a comprehensive statistical methodology. This includes the application of ANOVA, Tukey HSD tests, GAMM, and clustering technique, offering a robust and transparent approach to deciphering LLM performance data. Contrary to prevailing findings, our results challenge assumptions about emergent abilities and the influence of given training types and architectures in LLMs. These findings furnish new perspectives on the characteristics, intrinsic nature, and developmental trajectories of LLMs. By providing straightforward and reliable methods to scrutinize and reassess LLM performance data, this study contributes a nuanced perspective on LLM efficiency and potentials.
♻ ☆ CaLM: Contrasting Large and Small Language Models to Verify Grounded Generation ACL 2024
Grounded generation aims to equip language models (LMs) with the ability to produce more credible and accountable responses by accurately citing verifiable sources. However, existing methods, by either feeding LMs with raw or preprocessed materials, remain prone to errors. To address this, we introduce CaLM, a novel verification framework. CaLM leverages the insight that a robust grounded response should be consistent with information derived solely from its cited sources. Our framework empowers smaller LMs, which rely less on parametric memory and excel at processing relevant information given a query, to validate the output of larger LMs. Larger LM responses that closely align with the smaller LMs' output, which relies exclusively on cited documents, are verified. Responses showing discrepancies are iteratively refined through a feedback loop. Experiments on three open-domain question-answering datasets demonstrate significant performance gains of 1.5% to 7% absolute average without any required model fine-tuning.
comment: ACL 2024 Camera Ready Version
♻ ☆ The Uli Dataset: An Exercise in Experience Led Annotation of oGBV
Online gender based violence has grown concomitantly with adoption of the internet and social media. Its effects are worse in the Global majority where many users use social media in languages other than English. The scale and volume of conversations on the internet has necessitated the need for automated detection of hate speech, and more specifically gendered abuse. There is, however, a lack of language specific and contextual data to build such automated tools. In this paper we present a dataset on gendered abuse in three languages- Hindi, Tamil and Indian English. The dataset comprises of tweets annotated along three questions pertaining to the experience of gender abuse, by experts who identify as women or a member of the LGBTQIA community in South Asia. Through this dataset we demonstrate a participatory approach to creating datasets that drive AI systems.
♻ ☆ Relation Extraction with Fine-Tuned Large Language Models in Retrieval Augmented Generation Frameworks
Information Extraction (IE) is crucial for converting unstructured data into structured formats like Knowledge Graphs (KGs). A key task within IE is Relation Extraction (RE), which identifies relationships between entities in text. Various RE methods exist, including supervised, unsupervised, weakly supervised, and rule-based approaches. Recent studies leveraging pre-trained language models (PLMs) have shown significant success in this area. In the current era dominated by Large Language Models (LLMs), fine-tuning these models can overcome limitations associated with zero-shot LLM prompting-based RE methods, especially regarding domain adaptation challenges and identifying implicit relations between entities in sentences. These implicit relations, which cannot be easily extracted from a sentence's dependency tree, require logical inference for accurate identification. This work explores the performance of fine-tuned LLMs and their integration into the Retrieval Augmented-based (RAG) RE approach to address the challenges of identifying implicit relations at the sentence level, particularly when LLMs act as generators within the RAG framework. Empirical evaluations on the TACRED, TACRED-Revisited (TACREV), Re-TACRED, and SemEVAL datasets show significant performance improvements with fine-tuned LLMs, including Llama2-7B, Mistral-7B, and T5 (Large). Notably, our approach achieves substantial gains on SemEVAL, where implicit relations are common, surpassing previous results on this dataset. Additionally, our method outperforms previous works on TACRED, TACREV, and Re-TACRED, demonstrating exceptional performance across diverse evaluation scenarios.
comment: preprint
♻ ☆ What Makes Two Language Models Think Alike?
Do architectural differences significantly affect the way models represent and process language? We propose a new approach, based on metric-learning encoding models (MLEMs), as a first step to answer this question. The approach provides a feature-based comparison of how any two layers of any two models represent linguistic information. We apply the method to BERT, GPT-2 and Mamba. Unlike previous methods, MLEMs offer a transparent comparison, by identifying the specific linguistic features responsible for similarities and differences. More generally, the method uses formal, symbolic descriptions of a domain, and use these to compare neural representations. As such, the approach can straightforwardly be extended to other domains, such as speech and vision, and to other neural systems, including human brains.
comment: 7 pages, 6 figures
♻ ☆ In-context Pretraining: Language Modeling Beyond Document Boundaries
Large language models (LMs) are currently trained to predict tokens given document prefixes, enabling them to directly perform long-form generation and prompting-style tasks which can be reduced to document completion. Existing pretraining pipelines train LMs by concatenating random sets of short documents to create input contexts but the prior documents provide no signal for predicting the next document. We instead present In-Context Pretraining, a new approach where language models are pretrained on a sequence of related documents, thereby explicitly encouraging them to read and reason across document boundaries. We can do In-Context Pretraining by simply changing the document ordering so that each context contains related documents, and directly applying existing pretraining pipelines. However, this document sorting problem is challenging. There are billions of documents and we would like the sort to maximize contextual similarity for every document without repeating any data. To do this, we introduce approximate algorithms for finding related documents with efficient nearest neighbor search and constructing coherent input contexts with a graph traversal algorithm. Our experiments show In-Context Pretraining offers a simple and scalable approach to significantly enhance LMs'performance: we see notable improvements in tasks that require more complex contextual reasoning, including in-context learning (+8%), reading comprehension (+15%), faithfulness to previous contexts (+16%), long-context reasoning (+5%), and retrieval augmentation (+9%).
♻ ☆ PromptKD: Distilling Student-Friendly Knowledge for Generative Language Models via Prompt Tuning
Recent advancements in large language models (LLMs) have raised concerns about inference costs, increasing the need for research into model compression. While knowledge distillation (KD) is a prominent method for this, research on KD for generative language models like LLMs is relatively sparse, and the approach of distilling student-friendly knowledge, which has shown promising performance in KD for classification models, remains unexplored in generative language models. To explore this approach, we propose PromptKD, a simple yet effective method that utilizes prompt tuning - for the first time in KD - to enable generative language models to transfer student-friendly knowledge. Unlike previous works in classification that require fine-tuning the entire teacher model for extracting student-friendly knowledge, PromptKD achieves similar effects by adding a small number of prompt tokens and tuning only the prompt with student guidance. Extensive experiments on instruction-following datasets show that PromptKD achieves state-of-the-art performance while adding only 0.0007% of the teacher's parameters as prompts. Further analysis suggests that distilling student-friendly knowledge alleviates exposure bias effectively throughout the entire training process, leading to performance enhancements.
comment: Code: https://github.com/gmkim-ai/PromptKD
♻ ☆ NaijaHate: Evaluating Hate Speech Detection on Nigerian Twitter Using Representative Data ACL 2024
To address the global issue of online hate, hate speech detection (HSD) systems are typically developed on datasets from the United States, thereby failing to generalize to English dialects from the Majority World. Furthermore, HSD models are often evaluated on non-representative samples, raising concerns about overestimating model performance in real-world settings. In this work, we introduce NaijaHate, the first dataset annotated for HSD which contains a representative sample of Nigerian tweets. We demonstrate that HSD evaluated on biased datasets traditionally used in the literature consistently overestimates real-world performance by at least two-fold. We then propose NaijaXLM-T, a pretrained model tailored to the Nigerian Twitter context, and establish the key role played by domain-adaptive pretraining and finetuning in maximizing HSD performance. Finally, owing to the modest performance of HSD systems in real-world conditions, we find that content moderators would need to review about ten thousand Nigerian tweets flagged as hateful daily to moderate 60% of all hateful content, highlighting the challenges of moderating hate speech at scale as social media usage continues to grow globally. Taken together, these results pave the way towards robust HSD systems and a better protection of social media users from hateful content in low-resource settings.
comment: ACL 2024 main conference. Data and models available at https://github.com/worldbank/NaijaHate
♻ ☆ Testing the Limits of Jailbreaking Defenses with the Purple Problem
The rise of "jailbreak" attacks on language models has led to a flurry of defenses aimed at preventing undesirable responses. We critically examine the two stages of the defense pipeline: (i) defining what constitutes unsafe outputs, and (ii) enforcing the definition via methods such as input processing or fine-tuning. To test the efficacy of existing enforcement mechanisms, we consider a simple and well-specified definition of unsafe outputs--outputs that contain the word "purple". Surprisingly, existing fine-tuning and input defenses fail on this simple problem, casting doubt on whether enforcement algorithms can be robust for more complicated definitions. We find that real safety benchmarks similarly test enforcement for a fixed definition. We hope that future research can lead to effective/fast enforcement as well as high quality definitions used for enforcement and evaluation.
♻ ☆ Hallmarks of Optimization Trajectories in Neural Networks: Directional Exploration and Redundancy
We propose a fresh take on understanding the mechanisms of neural networks by analyzing the rich directional structure of optimization trajectories, represented by their pointwise parameters. Towards this end, we introduce some natural notions of the complexity of optimization trajectories, both qualitative and quantitative, which hallmark the directional nature of optimization in neural networks: when is there redundancy, and when exploration. We use them to reveal the inherent nuance and interplay involved between various optimization choices, such as momentum and weight decay. Further, the trajectory perspective helps us see the effect of scale on regularizing the directional nature of trajectories, and as a by-product, we also observe an intriguing heterogeneity of Q,K,V dynamics in the middle attention layers in LLMs and which is homogenized by scale. Importantly, we put the significant directional redundancy observed to the test by demonstrating that training only scalar batchnorm parameters some while into training matches the performance of training the entire network, which thus exhibits the potential of hybrid optimization schemes that are geared towards efficiency.
comment: Preprint, 57 pages
♻ ☆ LLM-Assisted Content Conditional Debiasing for Fair Text Embedding
Mitigating biases in machine learning models has become an increasing concern in Natural Language Processing (NLP), particularly in developing fair text embeddings, which are crucial yet challenging for real-world applications like search engines. In response, this paper proposes a novel method for learning fair text embeddings. First, we define a novel content-conditional equal distance (CCED) fairness for text embeddings, ensuring content-conditional independence between sensitive attributes and text embeddings. Building on CCED, we introduce a content-conditional debiasing (CCD) loss to ensure that embeddings of texts with different sensitive attributes but identical content maintain the same distance from the embedding of their corresponding neutral text. Additionally, we tackle the issue of insufficient training data by using Large Language Models (LLMs) with instructions to fairly augment texts into different sensitive groups. Our extensive evaluations show that our approach effectively enhances fairness while maintaining the utility of embeddings. Furthermore, our augmented dataset, combined with the CCED metric, serves as an new benchmark for evaluating fairness.
♻ ☆ SafetyBench: Evaluating the Safety of Large Language Models ACL 2024
With the rapid development of Large Language Models (LLMs), increasing attention has been paid to their safety concerns. Consequently, evaluating the safety of LLMs has become an essential task for facilitating the broad applications of LLMs. Nevertheless, the absence of comprehensive safety evaluation benchmarks poses a significant impediment to effectively assess and enhance the safety of LLMs. In this work, we present SafetyBench, a comprehensive benchmark for evaluating the safety of LLMs, which comprises 11,435 diverse multiple choice questions spanning across 7 distinct categories of safety concerns. Notably, SafetyBench also incorporates both Chinese and English data, facilitating the evaluation in both languages. Our extensive tests over 25 popular Chinese and English LLMs in both zero-shot and few-shot settings reveal a substantial performance advantage for GPT-4 over its counterparts, and there is still significant room for improving the safety of current LLMs. We also demonstrate that the measured safety understanding abilities in SafetyBench are correlated with safety generation abilities. Data and evaluation guidelines are available at \url{https://github.com/thu-coai/SafetyBench}{https://github.com/thu-coai/SafetyBench}. Submission entrance and leaderboard are available at \url{https://llmbench.ai/safety}{https://llmbench.ai/safety}.
comment: ACL 2024 Main Conference
♻ ☆ WeatherQA: Can Multimodal Language Models Reason about Severe Weather?
Severe convective weather events, such as hail, tornadoes, and thunderstorms, often occur quickly yet cause significant damage, costing billions of dollars every year. This highlights the importance of forecasting severe weather threats hours in advance to better prepare meteorologists and residents in at-risk areas. Can modern large foundation models perform such forecasting? Existing weather benchmarks typically focus only on predicting time-series changes in certain weather parameters (e.g., temperature, moisture) with text-only features. In this work, we introduce WeatherQA, the first multimodal dataset designed for machines to reason about complex combinations of weather parameters (a.k.a., ingredients) and predict severe weather in real-world scenarios. The dataset includes over 8,000 (multi-images, text) pairs for diverse severe weather events. Each pair contains rich information crucial for forecasting -- the images describe the ingredients capturing environmental instability, surface observations, and radar reflectivity, and the text contains forecast analyses written by human experts. With WeatherQA, we evaluate state-of-the-art vision language models, including GPT4, Claude3.5, Gemini-1.5, and a fine-tuned Llama3-based VLM, by designing two challenging tasks: (1) multi-choice QA for predicting affected area and (2) classification of the development potential of severe convection. These tasks require deep understanding of domain knowledge (e.g., atmospheric dynamics) and complex reasoning over multimodal data (e.g., interactions between weather parameters). We show a substantial gap between the strongest VLM, GPT4o, and human reasoning. Our comprehensive case study with meteorologists further reveals the weaknesses of the models, suggesting that better training and data integration are necessary to bridge this gap. WeatherQA link: https://github.com/chengqianma/WeatherQA.
comment: 26 pages, 9 figures
♻ ☆ Smurfs: Leveraging Multiple Proficiency Agents with Context-Efficiency for Tool Planning
The emergence of large language models (LLMs) has opened up unprecedented possibilities for automating complex tasks that are often comparable to human performance. Despite their capabilities, LLMs still encounter difficulties in completing tasks that require high levels of accuracy and complexity due to their inherent limitations in handling multifaceted problems single-handedly. This paper introduces `Smurfs', a cutting-edge multi-agent framework designed to revolutionize the application of LLMs. By seamlessly transforming a conventional LLM into a synergistic multi-agent ensemble, Smurfs can enhance the model's ability to solve complex tasks at no additional cost. This is achieved through innovative prompting strategies that allocate distinct roles within the model, thereby facilitating collaboration among specialized agents and forming an intelligent multi-agent system. Our empirical investigation on both open-ended task of StableToolBench and closed-ended task on HotpotQA showcases Smurfs' superior capability in intricate tool utilization scenarios. Notably, Smurfs outmatches all the baseline methods in both experiments, setting new state-of-the-art performance. Furthermore, through comprehensive ablation studies, we dissect the contribution of the core components of the multi-agent framework to its overall efficacy. This not only verifies the effectiveness of the framework, but also sets a route for future exploration of multi-agent LLM systems.
♻ ☆ Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment
Best-of-N (BoN) sampling with a reward model has been shown to be an effective strategy for aligning Large Language Models (LLMs) to human preferences at the time of decoding. BoN sampling is susceptible to a problem known as reward hacking. Because the reward model is an imperfect proxy for the true objective, over-optimizing its value can compromise its performance on the true objective. A common solution to prevent reward hacking in preference learning techniques is to optimize a reward using proximity regularization (e.g., KL regularization), which ensures that the language model remains close to the reference model. In this research, we propose Regularized Best-of-N (RBoN), a variant of BoN that aims to mitigate reward hacking by incorporating a proximity term in response selection, similar to preference learning techniques. We evaluate RBoN on the AlpacaFarm and Anthropic's hh-rlhf datasets and find that it outperforms BoN. As an application of RBoN, we use RBoN to generate a pairwise preference learning dataset. Experimental results show that a DPO model trained on a dataset generated with RBoN outperforms a DPO model generated with vanilla BoN. Our code is available at https://github.com/CyberAgentAILab/regularized-bon
♻ ☆ EasyEdit: An Easy-to-use Knowledge Editing Framework for Large Language Models ACL 2024
Large Language Models (LLMs) usually suffer from knowledge cutoff or fallacy issues, which means they are unaware of unseen events or generate text with incorrect facts owing to outdated/noisy data. To this end, many knowledge editing approaches for LLMs have emerged -- aiming to subtly inject/edit updated knowledge or adjust undesired behavior while minimizing the impact on unrelated inputs. Nevertheless, due to significant differences among various knowledge editing methods and the variations in task setups, there is no standard implementation framework available for the community, which hinders practitioners from applying knowledge editing to applications. To address these issues, we propose EasyEdit, an easy-to-use knowledge editing framework for LLMs. It supports various cutting-edge knowledge editing approaches and can be readily applied to many well-known LLMs such as T5, GPT-J, LlaMA, etc. Empirically, we report the knowledge editing results on LlaMA-2 with EasyEdit, demonstrating that knowledge editing surpasses traditional fine-tuning in terms of reliability and generalization. We have released the source code on GitHub, along with Google Colab tutorials and comprehensive documentation for beginners to get started. Besides, we present an online system for real-time knowledge editing, and a demo video.
comment: ACL 2024 System Demonstrations; Code: https://github.com/zjunlp/EasyEdit HF Demo: https://huggingface.co/spaces/zjunlp/EasyEdit Video: https://youtu.be/Gm6T0QaaskU Docs: https://zjunlp.gitbook.io/easyedit
♻ ☆ Reducing Privacy Risks in Online Self-Disclosures with Language Models ACL 2024
Self-disclosure, while being common and rewarding in social media interaction, also poses privacy risks. In this paper, we take the initiative to protect the user-side privacy associated with online self-disclosure through detection and abstraction. We develop a taxonomy of 19 self-disclosure categories and curate a large corpus consisting of 4.8K annotated disclosure spans. We then fine-tune a language model for detection, achieving over 65% partial span F$_1$. We further conduct an HCI user study, with 82% of participants viewing the model positively, highlighting its real-world applicability. Motivated by the user feedback, we introduce the task of self-disclosure abstraction, which is rephrasing disclosures into less specific terms while preserving their utility, e.g., "Im 16F" to "I'm a teenage girl". We explore various fine-tuning strategies, and our best model can generate diverse abstractions that moderately reduce privacy risks while maintaining high utility according to human evaluation. To help users in deciding which disclosures to abstract, we present a task of rating their importance for context understanding. Our fine-tuned model achieves 80% accuracy, on-par with GPT-3.5. Given safety and privacy considerations, we will only release our corpus and models to researcher who agree to the ethical guidelines outlined in Ethics Statement.
comment: Accepted at ACL 2024
♻ ☆ EasyInstruct: An Easy-to-use Instruction Processing Framework for Large Language Models ACL 2024
In recent years, instruction tuning has gained increasing attention and emerged as a crucial technique to enhance the capabilities of Large Language Models (LLMs). To construct high-quality instruction datasets, many instruction processing approaches have been proposed, aiming to achieve a delicate balance between data quantity and data quality. Nevertheless, due to inconsistencies that persist among various instruction processing methods, there is no standard open-source instruction processing implementation framework available for the community, which hinders practitioners from further developing and advancing. To facilitate instruction processing research and development, we present EasyInstruct, an easy-to-use instruction processing framework for LLMs, which modularizes instruction generation, selection, and prompting, while also considering their combination and interaction. EasyInstruct is publicly released and actively maintained at https://github.com/zjunlp/EasyInstruct, along with an online demo app and a demo video for quick-start, calling for broader research centered on instruction data and synthetic data.
comment: ACL 2024 System Demonstrations; Project website: https://zjunlp.github.io/project/EasyInstruct Code: https://github.com/zjunlp/EasyInstruct Video: https://youtu.be/rfQOWYfziFo Demo: https://huggingface.co/spaces/zjunlp/EasyInstruct
♻ ☆ ClaimVer: Explainable Claim-Level Verification and Evidence Attribution of Text Through Knowledge Graphs
In the midst of widespread misinformation and disinformation through social media and the proliferation of AI-generated texts, it has become increasingly difficult for people to validate and trust information they encounter. Many fact-checking approaches and tools have been developed, but they often lack appropriate explainability or granularity to be useful in various contexts. A text validation method that is easy to use, accessible, and can perform fine-grained evidence attribution has become crucial. More importantly, building user trust in such a method requires presenting the rationale behind each prediction, as research shows this significantly influences people's belief in automated systems. Localizing and bringing users' attention to the specific problematic content is also paramount, instead of providing simple blanket labels. In this paper, we present ClaimVer, a human-centric framework tailored to meet users' informational and verification needs by generating rich annotations and thereby reducing cognitive load. Designed to deliver comprehensive evaluations of texts, it highlights each claim, verifies it against a trusted knowledge graph (KG), presents the evidence, and provides succinct, clear explanations for each claim prediction. Finally, our framework introduces an attribution score, enhancing applicability across a wide range of downstream tasks.
♻ ☆ EFUF: Efficient Fine-grained Unlearning Framework for Mitigating Hallucinations in Multimodal Large Language Models
Multimodal large language models (MLLMs) have attracted increasing attention in the past few years, but they may still generate descriptions that include objects not present in the corresponding images, a phenomenon known as object hallucination. To eliminate hallucinations, existing methods manually annotate paired responses with and without hallucinations, and then employ various alignment algorithms to improve the alignment capability between images and text. However, they not only demand considerable computation resources during the finetuning stage but also require expensive human annotation to construct paired data needed by the alignment algorithms. To address these issues, we borrow the idea of unlearning and propose an efficient fine-grained unlearning framework (EFUF), which can eliminate hallucinations without the need for paired data. Extensive experiments show that our method consistently reduces hallucinations while preserving the generation quality with modest computational overhead. Our code and datasets will be publicly available.
♻ ☆ VeraCT Scan: Retrieval-Augmented Fake News Detection with Justifiable Reasoning
The proliferation of fake news poses a significant threat not only by disseminating misleading information but also by undermining the very foundations of democracy. The recent advance of generative artificial intelligence has further exacerbated the challenge of distinguishing genuine news from fabricated stories. In response to this challenge, we introduce VeraCT Scan, a novel retrieval-augmented system for fake news detection. This system operates by extracting the core facts from a given piece of news and subsequently conducting an internet-wide search to identify corroborating or conflicting reports. Then sources' credibility is leveraged for information verification. Besides determining the veracity of news, we also provide transparent evidence and reasoning to support its conclusions, resulting in the interpretability and trust in the results. In addition to GPT-4 Turbo, Llama-2 13B is also fine-tuned for news content understanding, information verification, and reasoning. Both implementations have demonstrated state-of-the-art accuracy in the realm of fake news detection.
♻ ☆ MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions ICML 2024
Image retrieval, i.e., finding desired images given a reference image, inherently encompasses rich, multi-faceted search intents that are difficult to capture solely using image-based measures. Recent works leverage text instructions to allow users to more freely express their search intents. However, they primarily focus on image pairs that are visually similar and/or can be characterized by a small set of pre-defined relations. The core thesis of this paper is that text instructions can enable retrieving images with richer relations beyond visual similarity. To show this, we introduce MagicLens, a series of self-supervised image retrieval models that support open-ended instructions. MagicLens is built on a key novel insight: image pairs that naturally occur on the same web pages contain a wide range of implicit relations (e.g., inside view of), and we can bring those implicit relations explicit by synthesizing instructions via foundation models. Trained on 36.7M (query image, instruction, target image) triplets with rich semantic relations mined from the web, MagicLens achieves results comparable with or better than prior best on eight benchmarks of various image retrieval tasks, while maintaining high parameter efficiency with a significantly smaller model size. Additional human analyses on a 1.4M-image unseen corpus further demonstrate the diversity of search intents supported by MagicLens. Code and models are publicly available at https://open-vision-language.github.io/MagicLens/.
comment: ICML 2024 (Oral); Project Website: https://open-vision-language.github.io/MagicLens/
♻ ☆ Annotation alignment: Comparing LLM and human annotations of conversational safety
To what extent do LLMs align with human perceptions of safety? We study this question via *annotation alignment*, the extent to which LLMs and humans agree when annotating the safety of user-chatbot conversations. We leverage the recent DICES dataset (Aroyo et al., 2023), in which 350 conversations are each rated for safety by 112 annotators spanning 10 race-gender groups. GPT-4 achieves a Pearson correlation of $r = 0.59$ with the average annotator rating, higher than the median annotator's correlation with the average ($r=0.51$). We show that larger datasets are needed to resolve whether GPT-4 exhibits disparities in how well it correlates with demographic groups. Also, there is substantial idiosyncratic variation in correlation *within* groups, suggesting that race & gender do not fully capture differences in alignment. Finally, we find that GPT-4 cannot predict when one demographic group finds a conversation more unsafe than another.
comment: Working draft, short paper. Main text is 5 pages, 1 figure. (v3 corrects minor typo)
♻ ☆ Cross-Care: Assessing the Healthcare Implications of Pre-training Data on Language Model Bias
Large language models (LLMs) are increasingly essential in processing natural languages, yet their application is frequently compromised by biases and inaccuracies originating in their training data. In this study, we introduce Cross-Care, the first benchmark framework dedicated to assessing biases and real world knowledge in LLMs, specifically focusing on the representation of disease prevalence across diverse demographic groups. We systematically evaluate how demographic biases embedded in pre-training corpora like $ThePile$ influence the outputs of LLMs. We expose and quantify discrepancies by juxtaposing these biases against actual disease prevalences in various U.S. demographic groups. Our results highlight substantial misalignment between LLM representation of disease prevalence and real disease prevalence rates across demographic subgroups, indicating a pronounced risk of bias propagation and a lack of real-world grounding for medical applications of LLMs. Furthermore, we observe that various alignment methods minimally resolve inconsistencies in the models' representation of disease prevalence across different languages. For further exploration and analysis, we make all data and a data visualization tool available at: www.crosscare.net.
comment: Submitted for review, data visualization tool available at: www.crosscare.net
♻ ☆ Prompting Explicit and Implicit Knowledge for Multi-hop Question Answering Based on Human Reading Process COLING 2024
Pre-trained language models (PLMs) leverage chains-of-thought (CoT) to simulate human reasoning and inference processes, achieving proficient performance in multi-hop QA. However, a gap persists between PLMs' reasoning abilities and those of humans when tackling complex problems. Psychological studies suggest a vital connection between explicit information in passages and human prior knowledge during reading. Nevertheless, current research has given insufficient attention to linking input passages and PLMs' pre-training-based knowledge from the perspective of human cognition studies. In this study, we introduce a Prompting Explicit and Implicit knowledge (PEI) framework, which uses prompts to connect explicit and implicit knowledge, aligning with human reading process for multi-hop QA. We consider the input passages as explicit knowledge, employing them to elicit implicit knowledge through unified prompt reasoning. Furthermore, our model incorporates type-specific reasoning via prompts, a form of implicit knowledge. Experimental results show that PEI performs comparably to the state-of-the-art on HotpotQA. Ablation studies confirm the efficacy of our model in bridging and integrating explicit and implicit knowledge.
comment: This paper has been accepted at COLING 2024
♻ ☆ Chain-of-Instructions: Compositional Instruction Tuning on Large Language Models
Fine-tuning large language models (LLMs) with a collection of large and diverse instructions has improved the model's generalization to different tasks, even for unseen tasks. However, most existing instruction datasets include only single instructions, and they struggle to follow complex instructions composed of multiple subtasks. In this work, we propose a novel concept of compositional instructions called chain-of-instructions (CoI), where the output of one instruction becomes an input for the next like a chain. Unlike the conventional practice of solving single instruction tasks, our proposed method encourages a model to solve each subtask step by step until the final answer is reached. CoI-tuning (i.e., fine-tuning with CoI instructions) improves the model's ability to handle instructions composed of multiple subtasks as well as unseen composite tasks such as multilingual summarization. Overall, our study find that simple CoI tuning of existing instruction data can provide consistent generalization to solve more complex, unseen, and longer chains of instructions.
♻ ☆ LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models NAACL 2024
Today's large language models (LLMs) typically train on short text segments (e.g., <4K tokens) due to the quadratic complexity of their Transformer architectures. As a result, their performance suffers drastically on inputs longer than those encountered during training, substantially limiting their applications in real-world tasks involving long contexts such as encoding scientific articles, code repositories, or long dialogues. Through theoretical analysis and empirical investigation, this work identifies three major factors contributing to this length generalization failure. Our theoretical analysis further reveals that commonly used techniques like truncating the attention window or relative positional encodings are inadequate to address them. Answering these challenges, we propose LM-Infinite, a simple and effective method for enhancing LLMs' capabilities of handling long contexts. LM-Infinite is highly flexible and can be used with most modern LLMs off-the-shelf. Without any parameter updates, it allows LLMs pre-trained with 2K or 4K-long segments to generalize to up to 200M length inputs while retaining perplexity. It also improves performance on downstream tasks such as Passkey Retrieval and Qasper in the zero-shot setting. LM-Infinite brings substantial efficiency improvements: it achieves 2.7x decoding speed up and 7.5x memory saving over the original model. Our codes are released at \url{https://github.com/Glaciohound/LM-Infinite}.
comment: NAACL 2024 Outstanding paper, 9 pages, 6 figures
♻ ☆ LinkTransformer: A Unified Package for Record Linkage with Transformer Language Models
Linking information across sources is fundamental to a variety of analyses in social science, business, and government. While large language models (LLMs) offer enormous promise for improving record linkage in noisy datasets, in many domains approximate string matching packages in popular softwares such as R and Stata remain predominant. These packages have clean, simple interfaces and can be easily extended to a diversity of languages. Our open-source package LinkTransformer aims to extend the familiarity and ease-of-use of popular string matching methods to deep learning. It is a general purpose package for record linkage with transformer LLMs that treats record linkage as a text retrieval problem. At its core is an off-the-shelf toolkit for applying transformer models to record linkage with four lines of code. LinkTransformer contains a rich repository of pre-trained transformer semantic similarity models for multiple languages and supports easy integration of any transformer language model from Hugging Face or OpenAI. It supports standard functionality such as blocking and linking on multiple noisy fields. LinkTransformer APIs also perform other common text data processing tasks, e.g., aggregation, noisy de-duplication, and translation-free cross-lingual linkage. Importantly, LinkTransformer also contains comprehensive tools for efficient model tuning, to facilitate different levels of customization when off-the-shelf models do not provide the required accuracy. Finally, to promote reusability, reproducibility, and extensibility, LinkTransformer makes it easy for users to contribute their custom-trained models to its model hub. By combining transformer language models with intuitive APIs that will be familiar to many users of popular string matching packages, LinkTransformer aims to democratize the benefits of LLMs among those who may be less familiar with deep learning frameworks.
♻ ☆ COFFEE: A Contrastive Oracle-Free Framework for Event Extraction
Event extraction is a complex information extraction task that involves extracting events from unstructured text. Prior classification-based methods require comprehensive entity annotations for joint training, while newer generation-based methods rely on heuristic templates containing oracle information such as event type, which is often unavailable in real-world scenarios. In this study, we consider a more realistic setting of this task, namely the Oracle-Free Event Extraction (OFEE) task, where only the input context is given without any oracle information, including event type, event ontology and trigger word. To solve this task, we propose a new framework, called COFFEE, which extracts the events solely based on the document context without referring to any oracle information. In particular, a contrastive selection model is introduced in COFFEE to rectify the generated triggers and handle multi-event instances. The proposed COFFEE outperforms state-of-the-art approaches under the oracle-free setting of the event extraction task, as evaluated on a public event extraction benchmark ACE05.
♻ ☆ SOUL: Unlocking the Power of Second-Order Optimization for LLM Unlearning
Large Language Models (LLMs) have highlighted the necessity of effective unlearning mechanisms to comply with data regulations and ethical AI practices. LLM unlearning aims at removing undesired data influences and associated model capabilities without compromising utility beyond the scope of unlearning. While interest in studying LLM unlearning is growing, the impact of the optimizer choice for LLM unlearning remains unexplored. In this work, we shed light on the significance of optimizer selection in LLM unlearning for the first time, establishing a clear connection between second-order optimization and influence unlearning (a classical approach using influence functions to update the model for data influence removal). This insight propels us to develop a second-order optimization-based LLM unlearning framework, termed Second-Order UnLearning (SOUL), which extends the static, one-shot model update using influence unlearning to a dynamic, iterative unlearning process. Our extensive experiments show that SOUL consistently outperforms conventional first-order methods across various unlearning tasks, models, and metrics, indicating that second-order optimization offers an effective and broadly applicable solution for LLM unlearning. Codes are available at https://github.com/OPTML-Group/SOUL.
♻ ☆ Fine-Grained Detection of Solidarity for Women and Migrants in 155 Years of German Parliamentary Debates
Solidarity is a crucial concept to understand social relations in societies. In this paper, we explore fine-grained solidarity frames to study solidarity towards women and migrants in German parliamentary debates between 1867 and 2022. Using 2,864 manually annotated text snippets (with a cost exceeding 18k Euro), we evaluate large language models (LLMs) like Llama 3, GPT-3.5, and GPT-4. We find that GPT-4 outperforms other LLMs, approaching human annotation quality. Using GPT-4, we automatically annotate more than 18k further instances (with a cost of around 500 Euro) across 155 years and find that solidarity with migrants outweighs anti-solidarity but that frequencies and solidarity types shift over time. Most importantly, group-based notions of (anti-)solidarity fade in favor of compassionate solidarity, focusing on the vulnerability of migrant groups, and exchange-based anti-solidarity, focusing on the lack of (economic) contribution. Our study highlights the interplay of historical events, socio-economic needs, and political ideologies in shaping migration discourse and social cohesion. We also show that powerful LLMs, if carefully prompted, can be cost-effective alternatives to human annotation for hard social scientific tasks.
comment: Note title and author changes
♻ ☆ Aligner: Efficient Alignment by Learning to Correct
With the rapid development of large language models (LLMs) and ever-evolving practical requirements, finding an efficient and effective alignment method has never been more critical. However, the tension between the complexity of current alignment methods and the need for rapid iteration in deployment scenarios necessitates the development of a model-agnostic alignment approach that can operate under these constraints. In this paper, we introduce Aligner, a novel and simple alignment paradigm that learns the correctional residuals between preferred and dispreferred answers using a small model. Designed as a model-agnostic, plug-and-play module, Aligner can be directly applied to various open-source and API-based models with only one-off training, making it suitable for rapid iteration. Notably, Aligner can be applied to any powerful, large-scale upstream models. Moreover, it can even iteratively bootstrap the upstream models using corrected responses as synthetic human preference data, breaking through the model's performance ceiling. Our experiments demonstrate performance improvements by deploying the same Aligner model across 11 different LLMs, evaluated on the 3H dimensions (helpfulness, harmlessness, and honesty). Specifically, Aligner-7B has achieved an average improvement of 68.9% in helpfulness and 23.8% in harmlessness across the tested LLMs while also effectively reducing hallucination. In the Alpaca-Eval leaderboard, stacking Aligner-2B on GPT-4 Turbo improved its LC Win Rate from 55.0% to 58.3%, surpassing GPT-4 Omni's 57.5% Win Rate (community report).
♻ ☆ RAG-RLRC-LaySum at BioLaySumm: Integrating Retrieval-Augmented Generation and Readability Control for Layman Summarization of Biomedical Texts
This paper introduces the RAG-RLRC-LaySum framework, designed to make complex biomedical research understandable to laymen through advanced Natural Language Processing (NLP) techniques. Our Retrieval Augmented Generation (RAG) solution, enhanced by a reranking method, utilizes multiple knowledge sources to ensure the precision and pertinence of lay summaries. Additionally, our Reinforcement Learning for Readability Control (RLRC) strategy improves readability, making scientific content comprehensible to non-specialists. Evaluations using the publicly accessible PLOS and eLife datasets show that our methods surpass Plain Gemini model, demonstrating a 20% increase in readability scores, a 15% improvement in ROUGE-2 relevance scores, and a 10% enhancement in factual accuracy. The RAG-RLRC-LaySum framework effectively democratizes scientific knowledge, enhancing public engagement with biomedical discoveries.
Machine Learning 150
☆ EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
Inference with modern Large Language Models (LLMs) is expensive and time-consuming, and speculative sampling has proven to be an effective solution. Most speculative sampling methods such as EAGLE use a static draft tree, implicitly assuming that the acceptance rate of draft tokens depends only on their position. Interestingly, we found that the acceptance rate of draft tokens is also context-dependent. In this paper, building upon EAGLE, we propose EAGLE-2, which introduces a new technique of context-aware dynamic draft tree into drafting modeling. This improvement leverages the fact that the draft model of EAGLE is well-calibrated: the confidence scores from the draft model approximate acceptance rates with small errors. We conducted extensive evaluations on three series of LLMs and six tasks, with EAGLE-2 achieving speedup ratios 3.05x-4.26x, which is 20%-40% faster than EAGLE-1. EAGLE-2 also ensures that the distribution of the generated text remains unchanged, making it a lossless acceleration algorithm.
☆ GeoMFormer: A General Architecture for Geometric Molecular Representation Learning ICML 2024
Molecular modeling, a central topic in quantum mechanics, aims to accurately calculate the properties and simulate the behaviors of molecular systems. The molecular model is governed by physical laws, which impose geometric constraints such as invariance and equivariance to coordinate rotation and translation. While numerous deep learning approaches have been developed to learn molecular representations under these constraints, most of them are built upon heuristic and costly modules. We argue that there is a strong need for a general and flexible framework for learning both invariant and equivariant features. In this work, we introduce a novel Transformer-based molecular model called GeoMFormer to achieve this goal. Using the standard Transformer modules, two separate streams are developed to maintain and learn invariant and equivariant representations. Carefully designed cross-attention modules bridge the two streams, allowing information fusion and enhancing geometric modeling in each stream. As a general and flexible architecture, we show that many previous architectures can be viewed as special instantiations of GeoMFormer. Extensive experiments are conducted to demonstrate the power of GeoMFormer. All empirical results show that GeoMFormer achieves strong performance on both invariant and equivariant tasks of different types and scales. Code and models will be made publicly available at https://github.com/c-tl/GeoMFormer.
comment: 25 pages, 13 tables, l figure; ICML 2024 camera ready version
☆ Data Debiasing with Datamodels (D3M): Improving Subgroup Robustness via Data Selection
Machine learning models can fail on subgroups that are underrepresented during training. While techniques such as dataset balancing can improve performance on underperforming groups, they require access to training group annotations and can end up removing large portions of the dataset. In this paper, we introduce Data Debiasing with Datamodels (D3M), a debiasing approach which isolates and removes specific training examples that drive the model's failures on minority groups. Our approach enables us to efficiently train debiased classifiers while removing only a small number of examples, and does not require training group annotations or additional hyperparameter tuning.
☆ From Decoding to Meta-Generation: Inference-time Algorithms for Large Language Models
One of the most striking findings in modern research on large language models (LLMs) is that scaling up compute during training leads to better results. However, less attention has been given to the benefits of scaling compute during inference. This survey focuses on these inference-time approaches. We explore three areas under a unified mathematical formalism: token-level generation algorithms, meta-generation algorithms, and efficient generation. Token-level generation algorithms, often called decoding algorithms, operate by sampling a single token at a time or constructing a token-level search space and then selecting an output. These methods typically assume access to a language model's logits, next-token distributions, or probability scores. Meta-generation algorithms work on partial or full sequences, incorporating domain knowledge, enabling backtracking, and integrating external information. Efficient generation methods aim to reduce token costs and improve the speed of generation. Our survey unifies perspectives from three research communities: traditional natural language processing, modern LLMs, and machine learning systems.
☆ Concentration Inequalities for $(f,Γ)$-GANs
Generative adversarial networks (GANs) are unsupervised learning methods for training a generator distribution to produce samples that approximate those drawn from a target distribution. Many such methods can be formulated as minimization of a metric or divergence. Recent works have proven the statistical consistency of GANs that are based on integral probability metrics (IPMs), e.g., WGAN which is based on the 1-Wasserstein metric. IPMs are defined by optimizing a linear functional (difference of expectations) over a space of discriminators. A much larger class of GANs, which allow for the use of nonlinear objective functionals, can be constructed using $(f,\Gamma)$-divergences; these generalize and interpolate between IPMs and $f$-divergences (e.g., KL or $\alpha$-divergences). Instances of $(f,\Gamma)$-GANs have been shown to exhibit improved performance in a number of applications. In this work we study the statistical consistency of $(f,\Gamma)$-GANs for general $f$ and $\Gamma$. Specifically, we derive finite-sample concentration inequalities. These derivations require novel arguments due to nonlinearity of the objective functional. We demonstrate that our new results reduce to the known results for IPM-GANs in the appropriate limit while also significantly extending the domain of applicability of this theory.
comment: 21 pages
☆ USDC: A Dataset of $\underline{U}$ser $\underline{S}$tance and $\underline{D}$ogmatism in Long $\underline{C}$onversations
Identifying user's opinions and stances in long conversation threads on various topics can be extremely critical for enhanced personalization, market research, political campaigns, customer service, conflict resolution, targeted advertising, and content moderation. Hence, training language models to automate this task is critical. However, to train such models, gathering manual annotations has multiple challenges: 1) It is time-consuming and costly; 2) Conversation threads could be very long, increasing chances of noisy annotations; and 3) Interpreting instances where a user changes their opinion within a conversation is difficult because often such transitions are subtle and not expressed explicitly. Inspired by the recent success of large language models (LLMs) for complex natural language processing (NLP) tasks, we leverage Mistral Large and GPT-4 to automate the human annotation process on the following two tasks while also providing reasoning: i) User Stance classification, which involves labeling a user's stance of a post in a conversation on a five-point scale; ii) User Dogmatism classification, which deals with labeling a user's overall opinion in the conversation on a four-point scale. The majority voting on zero-shot, one-shot, and few-shot annotations from these two LLMs on 764 multi-user Reddit conversations helps us curate the USDC dataset. USDC is then used to finetune and instruction-tune multiple deployable small language models for the 5-class stance and 4-class dogmatism classification tasks. We make the code and dataset publicly available [https://anonymous.4open.science/r/USDC-0F7F].
comment: 32 pages, 18 figures
☆ Understanding and Mitigating Tokenization Bias in Language Models
State-of-the-art language models are autoregressive and operate on subword units known as tokens. Specifically, one must encode the conditioning string into a list of tokens before passing to the language models for next-token prediction. We show that, for encoding schemes such as maximum prefix matching, tokenization induces a sampling bias that cannot be mitigated with more training or data. To counter this universal problem, we propose a novel algorithm to obtain unbiased estimates from a model that was trained on tokenized data. Our method does not require finetuning the model, and its complexity, defined as the number of model runs, scales linearly with the sequence length. As a consequence, we show that one can simulate token-free behavior from a tokenized language model. We empirically verify the correctness of our method through a Markov-chain setup, where it accurately recovers the transition probabilities, as opposed to the conventional method of directly prompting tokens into the language model.
☆ General Binding Affinity Guidance for Diffusion Models in Structure-Based Drug Design
Structure-Based Drug Design (SBDD) focuses on generating valid ligands that strongly and specifically bind to a designated protein pocket. Several methods use machine learning for SBDD to generate these ligands in 3D space, conditioned on the structure of a desired protein pocket. Recently, diffusion models have shown success here by modeling the underlying distributions of atomic positions and types. While these methods are effective in considering the structural details of the protein pocket, they often fail to explicitly consider the binding affinity. Binding affinity characterizes how tightly the ligand binds to the protein pocket, and is measured by the change in free energy associated with the binding process. It is one of the most crucial metrics for benchmarking the effectiveness of the interaction between a ligand and protein pocket. To address this, we propose BADGER: Binding Affinity Diffusion Guidance with Enhanced Refinement. BADGER is a general guidance method to steer the diffusion sampling process towards improved protein-ligand binding, allowing us to adjust the distribution of the binding affinity between ligands and proteins. Our method is enabled by using a neural network (NN) to model the energy function, which is commonly approximated by AutoDock Vina (ADV). ADV's energy function is non-differentiable, and estimates the affinity based on the interactions between a ligand and target protein receptor. By using a NN as a differentiable energy function proxy, we utilize the gradient of our learned energy function as a guidance method on top of any trained diffusion model. We show that our method improves the binding affinity of generated ligands to their protein receptors by up to 60\%, significantly surpassing previous machine learning methods. We also show that our guidance method is flexible and can be easily applied to other diffusion-based SBDD frameworks.
☆ PISTOL: Dataset Compilation Pipeline for Structural Unlearning of LLMs
Recently, machine unlearning, which seeks to erase specific data stored in the pre-trained or fine-tuned models, has emerged as a crucial protective measure for LLMs. However, unlearning approaches for LLMs that have been considered thus far have focused on the removal of independent data points and have not taken into account that the stored facts are logically connected to one another and form an implicit knowledge graph. To facilitate the development of structural unlearning methods, which are essential for the practical application of unlearning, we propose PISTOL, a pipeline for compiling multi-scenario datasets for benchmarking structural LLM unlearning. Additionally, leveraging sample datasets synthesized using PISTOL, we conducted benchmarks with four distinct unlearning methods on both Llama2-7B and Mistral-7B models. This analysis helps to illustrate the prevailing challenges in effectively and robustly removing highly inter-connected data, batched data, or data skewed towards a specific domain. It also highlights the choice of pre-trained model can impact unlearning performance. This work not only advances our understandings on the limitation of current LLMs unlearning methods and proposes future research directions, but also provides a replicable framework for ongoing exploration and validation in the field.
☆ Beyond Thumbs Up/Down: Untangling Challenges of Fine-Grained Feedback for Text-to-Image Generation
Human feedback plays a critical role in learning and refining reward models for text-to-image generation, but the optimal form the feedback should take for learning an accurate reward function has not been conclusively established. This paper investigates the effectiveness of fine-grained feedback which captures nuanced distinctions in image quality and prompt-alignment, compared to traditional coarse-grained feedback (for example, thumbs up/down or ranking between a set of options). While fine-grained feedback holds promise, particularly for systems catering to diverse societal preferences, we show that demonstrating its superiority to coarse-grained feedback is not automatic. Through experiments on real and synthetic preference data, we surface the complexities of building effective models due to the interplay of model choice, feedback type, and the alignment between human judgment and computational interpretation. We identify key challenges in eliciting and utilizing fine-grained feedback, prompting a reassessment of its assumed benefits and practicality. Our findings -- e.g., that fine-grained feedback can lead to worse models for a fixed budget, in some settings; however, in controlled settings with known attributes, fine grained rewards can indeed be more helpful -- call for careful consideration of feedback attributes and potentially beckon novel modeling approaches to appropriately unlock the potential value of fine-grained feedback in-the-wild.
☆ Improved Regret Bounds for Bandits with Expert Advice
In this research note, we revisit the bandits with expert advice problem. Under a restricted feedback model, we prove a lower bound of order $\sqrt{K T \ln(N/K)}$ for the worst-case regret, where $K$ is the number of actions, $N>K$ the number of experts, and $T$ the time horizon. This matches a previously known upper bound of the same order and improves upon the best available lower bound of $\sqrt{K T (\ln N) / (\ln K)}$. For the standard feedback model, we prove a new instance-based upper bound that depends on the agreement between the experts and provides a logarithmic improvement compared to prior results.
☆ Adam-mini: Use Fewer Learning Rates To Gain More
We propose Adam-mini, an optimizer that achieves on-par or better performance than AdamW with 45% to 50% less memory footprint. Adam-mini reduces memory by cutting down the number of learning rates in Adam: Instead of assigning an individual learning rate for each parameter using $1/\sqrt{v}$, Adam-mini uses the average of $v$ within a pre-defined parameter block as the learning rate for that block. Such a design is inspired by two empirical findings. First, the Hessian of Transformers exhibits a near-block diagonal structure with different sizes of dense sub-blocks. Second, for each of these dense sub-blocks, there exists a single high-quality learning rate that can outperform Adam, provided that sufficient resources are available to search it out. Adam-mini provides one cost-effective way to find these good learning rates and manage to cut down $\geq 90% v$ in Adam. Empirically, we verify that Adam-mini performs on par or better than AdamW on various language models sized from 125M to 7B for pre-training, supervised fine-tuning, and RLHF. The reduced memory footprint of Adam-mini also alleviates communication overheads among GPUs and CPUs, thereby increasing throughput. For instance, Adam-mini achieves 49.6% higher throughput than AdamW when pre-training Llama2-7B on 2x A800-80GB GPUs, which saves 33% wall-clock time for pre-training.
☆ Enabling more efficient and cost-effective AI/ML systems with Collective Mind, virtualized MLOps, MLPerf, Collective Knowledge Playground and reproducible optimization tournaments
In this white paper, I present my community effort to automatically co-design cheaper, faster and more energy-efficient software and hardware for AI, ML and other popular workloads with the help of the Collective Mind framework (CM), virtualized MLOps, MLPerf benchmarks and reproducible optimization tournaments. I developed CM to modularize, automate and virtualize the tedious process of building, running, profiling and optimizing complex applications across rapidly evolving open-source and proprietary AI/ML models, datasets, software and hardware. I achieved that with the help of portable, reusable and technology-agnostic automation recipes (ResearchOps) for MLOps and DevOps (CM4MLOps) discovered in close collaboration with academia and industry when reproducing more than 150 research papers and organizing the 1st mass-scale community benchmarking of ML and AI systems using CM and MLPerf. I donated CM and CM4MLOps to MLCommons to help connect academia and industry to learn how to build and run AI and other emerging workloads in the most efficient and cost-effective way using a common and technology-agnostic automation, virtualization and reproducibility framework while unifying knowledge exchange, protecting everyone's intellectual property, enabling portable skills, and accelerating transfer of the state-of-the-art research to production. My long-term vision is to make AI accessible to everyone by making it a commodity automatically produced from the most suitable open-source and proprietary components from different vendors based on user demand, requirements and constraints such as cost, latency, throughput, accuracy, energy, size and other important characteristics.
☆ M2Lingual: Enhancing Multilingual, Multi-Turn Instruction Alignment in Large Language Models
Instruction finetuning (IFT) is critical for aligning Large Language Models (LLMs) to follow instructions. Numerous effective IFT datasets have been proposed in the recent past, but most focus on high resource languages such as English. In this work, we propose a fully synthetic, novel taxonomy (Evol) guided Multilingual, Multi-turn instruction finetuning dataset, called M2Lingual, to better align LLMs on a diverse set of languages and tasks. M2Lingual contains a total of 182K IFT pairs that are built upon diverse seeds, covering 70 languages, 17 NLP tasks and general instruction-response pairs. LLMs finetuned with M2Lingual substantially outperform the majority of existing multilingual IFT datasets. Importantly, LLMs trained with M2Lingual consistently achieve competitive results across a wide variety of evaluation benchmarks compared to existing multilingual IFT datasets. Specifically, LLMs finetuned with M2Lingual achieve strong performance on our translated multilingual, multi-turn evaluation benchmark as well as a wide variety of multilingual tasks. Thus we contribute, and the 2 step Evol taxonomy used for its creation. M2Lingual repository - https://huggingface.co/datasets/ServiceNow-AI/M2Lingual
comment: 39 pages
☆ Confidence Aware Inverse Constrained Reinforcement Learning ICML 2024
In coming up with solutions to real-world problems, humans implicitly adhere to constraints that are too numerous and complex to be specified completely. However, reinforcement learning (RL) agents need these constraints to learn the correct optimal policy in these settings. The field of Inverse Constraint Reinforcement Learning (ICRL) deals with this problem and provides algorithms that aim to estimate the constraints from expert demonstrations collected offline. Practitioners prefer to know a measure of confidence in the estimated constraints, before deciding to use these constraints, which allows them to only use the constraints that satisfy a desired level of confidence. However, prior works do not allow users to provide the desired level of confidence for the inferred constraints. This work provides a principled ICRL method that can take a confidence level with a set of expert demonstrations and outputs a constraint that is at least as constraining as the true underlying constraint with the desired level of confidence. Further, unlike previous methods, this method allows a user to know if the number of expert trajectories is insufficient to learn a constraint with a desired level of confidence, and therefore collect more expert trajectories as required to simultaneously learn constraints with the desired level of confidence and a policy that achieves the desired level of performance.
comment: Paper to appear in ICML 2024
☆ WARP: On the Benefits of Weight Averaged Rewarded Policies
Reinforcement learning from human feedback (RLHF) aligns large language models (LLMs) by encouraging their generations to have high rewards, using a reward model trained on human preferences. To prevent the forgetting of pre-trained knowledge, RLHF usually incorporates a KL regularization; this forces the policy to remain close to its supervised fine-tuned initialization, though it hinders the reward optimization. To tackle the trade-off between KL and reward, in this paper we introduce a novel alignment strategy named Weight Averaged Rewarded Policies (WARP). WARP merges policies in the weight space at three distinct stages. First, it uses the exponential moving average of the policy as a dynamic anchor in the KL regularization. Second, it applies spherical interpolation to merge independently fine-tuned policies into a new enhanced one. Third, it linearly interpolates between this merged model and the initialization, to recover features from pre-training. This procedure is then applied iteratively, with each iteration's final model used as an advanced initialization for the next, progressively refining the KL-reward Pareto front, achieving superior rewards at fixed KL. Experiments with GEMMA policies validate that WARP improves their quality and alignment, outperforming other open-source LLMs.
comment: 11 main pages (34 pages with Appendix)
☆ Conformal time series decomposition with component-wise exchangeability
Conformal prediction offers a practical framework for distribution-free uncertainty quantification, providing finite-sample coverage guarantees under relatively mild assumptions on data exchangeability. However, these assumptions cease to hold for time series due to their temporally correlated nature. In this work, we present a novel use of conformal prediction for time series forecasting that incorporates time series decomposition. This approach allows us to model different temporal components individually. By applying specific conformal algorithms to each component and then merging the obtained prediction intervals, we customize our methods to account for the different exchangeability regimes underlying each component. Our decomposition-based approach is thoroughly discussed and empirically evaluated on synthetic and real-world data. We find that the method provides promising results on well-structured time series, but can be limited by factors such as the decomposition step for more complex data.
comment: Accepted at COPA 2024; 34 pages, 14 figures, 8 tables (incl. appendix)
☆ Addressing Polarization and Unfairness in Performative Prediction
When machine learning (ML) models are used in applications that involve humans (e.g., online recommendation, school admission, hiring, lending), the model itself may trigger changes in the distribution of targeted data it aims to predict. Performative prediction (PP) is a framework that explicitly considers such model-dependent distribution shifts when learning ML models. While significant efforts have been devoted to finding performative stable (PS) solutions in PP for system robustness, their societal implications are less explored and it is unclear whether PS solutions are aligned with social norms such as fairness. In this paper, we set out to examine the fairness property of PS solutions in performative prediction. We first show that PS solutions can incur severe polarization effects and group-wise loss disparity. Although existing fairness mechanisms commonly used in literature can help mitigate unfairness, they may fail and disrupt the stability under model-dependent distribution shifts. We thus propose novel fairness intervention mechanisms that can simultaneously achieve both stability and fairness in PP settings. Both theoretical analysis and experiments are provided to validate the proposed method.
☆ The MRI Scanner as a Diagnostic: Image-less Active Sampling MICCAI 2024
Despite the high diagnostic accuracy of Magnetic Resonance Imaging (MRI), using MRI as a Point-of-Care (POC) disease identification tool poses significant accessibility challenges due to the use of high magnetic field strength and lengthy acquisition times. We ask a simple question: Can we dynamically optimise acquired samples, at the patient level, according to an (automated) downstream decision task, while discounting image reconstruction? We propose an ML-based framework that learns an active sampling strategy, via reinforcement learning, at a patient-level to directly infer disease from undersampled k-space. We validate our approach by inferring Meniscus Tear in undersampled knee MRI data, where we achieve diagnostic performance comparable with ML-based diagnosis, using fully sampled k-space data. We analyse task-specific sampling policies, showcasing the adaptability of our active sampling approach. The introduced frugal sampling strategies have the potential to reduce high field strength requirements that in turn strengthen the viability of MRI-based POC disease identification and associated preliminary screening tools.
comment: Accepted in MICCAI 2024
☆ Inferring stochastic low-rank recurrent neural networks from neural data
A central aim in computational neuroscience is to relate the activity of large populations of neurons to an underlying dynamical system. Models of these neural dynamics should ideally be both interpretable and fit the observed data well. Low-rank recurrent neural networks (RNNs) exhibit such interpretability by having tractable dynamics. However, it is unclear how to best fit low-rank RNNs to data consisting of noisy observations of an underlying stochastic system. Here, we propose to fit stochastic low-rank RNNs with variational sequential Monte Carlo methods. We validate our method on several datasets consisting of both continuous and spiking neural data, where we obtain lower dimensional latent dynamics than current state of the art methods. Additionally, for low-rank models with piecewise linear nonlinearities, we show how to efficiently identify all fixed points in polynomial rather than exponential cost in the number of units, making analysis of the inferred dynamics tractable for large RNNs. Our method both elucidates the dynamical systems underlying experimental recordings and provides a generative model whose trajectories match observed trial-to-trial variability.
☆ OCALM: Object-Centric Assessment with Language Models
Properly defining a reward signal to efficiently train a reinforcement learning (RL) agent is a challenging task. Designing balanced objective functions from which a desired behavior can emerge requires expert knowledge, especially for complex environments. Learning rewards from human feedback or using large language models (LLMs) to directly provide rewards are promising alternatives, allowing non-experts to specify goals for the agent. However, black-box reward models make it difficult to debug the reward. In this work, we propose Object-Centric Assessment with Language Models (OCALM) to derive inherently interpretable reward functions for RL agents from natural language task descriptions. OCALM uses the extensive world-knowledge of LLMs while leveraging the object-centric nature common to many environments to derive reward functions focused on relational concepts, providing RL agents with the ability to derive policies from task descriptions.
comment: Accepted at the RLBRew Workshop at RLC 2024
☆ Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers
Accommodating long sequences efficiently in autoregressive Transformers, especially within an extended context window, poses significant challenges due to the quadratic computational complexity and substantial KV memory requirements inherent in self-attention mechanisms. In this work, we introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome these computational and memory obstacles while maintaining performance. Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query, thereby enabling gradient-based optimization. As a result, SPARSEK Attention offers linear time complexity and constant memory footprint during generation. Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods and provides significant speed improvements during both training and inference, particularly in language modeling and downstream tasks. Furthermore, our method can be seamlessly integrated into pre-trained Large Language Models (LLMs) with minimal fine-tuning, offering a practical solution for effectively managing long-range dependencies in diverse applications.
comment: preprint
☆ The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources
Foundation model development attracts a rapidly expanding body of contributors, scientists, and applications. To help shape responsible development practices, we introduce the Foundation Model Development Cheatsheet: a growing collection of 250+ tools and resources spanning text, vision, and speech modalities. We draw on a large body of prior work to survey resources (e.g. software, documentation, frameworks, guides, and practical tools) that support informed data selection, processing, and understanding, precise and limitation-aware artifact documentation, efficient model training, advance awareness of the environmental impact from training, careful model evaluation of capabilities, risks, and claims, as well as responsible model release, licensing and deployment practices. We hope this curated collection of resources helps guide more responsible development. The process of curating this list, enabled us to review the AI development ecosystem, revealing what tools are critically missing, misused, or over-used in existing practices. We find that (i) tools for data sourcing, model evaluation, and monitoring are critically under-serving ethical and real-world needs, (ii) evaluations for model safety, capabilities, and environmental impact all lack reproducibility and transparency, (iii) text and particularly English-centric analyses continue to dominate over multilingual and multi-modal analyses, and (iv) evaluation of systems, rather than just models, is needed so that capabilities and impact are assessed in context.
☆ Bandits with Preference Feedback: A Stackelberg Game Perspective
Bandits with preference feedback present a powerful tool for optimizing unknown target functions when only pairwise comparisons are allowed instead of direct value queries. This model allows for incorporating human feedback into online inference and optimization and has been employed in systems for fine-tuning large language models. The problem is well understood in simplified settings with linear target functions or over finite small domains that limit practical interest. Taking the next step, we consider infinite domains and nonlinear (kernelized) rewards. In this setting, selecting a pair of actions is quite challenging and requires balancing exploration and exploitation at two levels: within the pair, and along the iterations of the algorithm. We propose MAXMINLCB, which emulates this trade-off as a zero-sum Stackelberg game, and chooses action pairs that are informative and yield favorable rewards. MAXMINLCB consistently outperforms existing algorithms and satisfies an anytime-valid rate-optimal regret guarantee. This is due to our novel preference-based confidence sequences for kernelized logistic estimators.
comment: 30 pages, 8 figures
☆ Learning the boundary-to-domain mapping using Lifting Product Fourier Neural Operators for partial differential equations ICML 2024
Neural operators such as the Fourier Neural Operator (FNO) have been shown to provide resolution-independent deep learning models that can learn mappings between function spaces. For example, an initial condition can be mapped to the solution of a partial differential equation (PDE) at a future time-step using a neural operator. Despite the popularity of neural operators, their use to predict solution functions over a domain given only data over the boundary (such as a spatially varying Dirichlet boundary condition) remains unexplored. In this paper, we refer to such problems as boundary-to-domain problems; they have a wide range of applications in areas such as fluid mechanics, solid mechanics, heat transfer etc. We present a novel FNO-based architecture, named Lifting Product FNO (or LP-FNO) which can map arbitrary boundary functions defined on the lower-dimensional boundary to a solution in the entire domain. Specifically, two FNOs defined on the lower-dimensional boundary are lifted into the higher dimensional domain using our proposed lifting product layer. We demonstrate the efficacy and resolution independence of the proposed LP-FNO for the 2D Poisson equation.
comment: Accepted by ICML 2024 AI for Science Workshop
☆ Inducing Group Fairness in LLM-Based Decisions
Prompting Large Language Models (LLMs) has created new and interesting means for classifying textual data. While evaluating and remediating group fairness is a well-studied problem in classifier fairness literature, some classical approaches (e.g., regularization) do not carry over, and some new opportunities arise (e.g., prompt-based remediation). We measure fairness of LLM-based classifiers on a toxicity classification task, and empirically show that prompt-based classifiers may lead to unfair decisions. We introduce several remediation techniques and benchmark their fairness and performance trade-offs. We hope our work encourages more research on group fairness in LLM-based classifiers.
☆ GC-Bench: A Benchmark Framework for Graph Condensation with New Insights
Graph condensation (GC) is an emerging technique designed to learn a significantly smaller graph that retains the essential information of the original graph. This condensed graph has shown promise in accelerating graph neural networks while preserving performance comparable to those achieved with the original, larger graphs. Additionally, this technique facilitates downstream applications such as neural architecture search and enhances our understanding of redundancy in large graphs. Despite the rapid development of GC methods, a systematic evaluation framework remains absent, which is necessary to clarify the critical designs for particular evaluative aspects. Furthermore, several meaningful questions have not been investigated, such as whether GC inherently preserves certain graph properties and offers robustness even without targeted design efforts. In this paper, we introduce GC-Bench, a comprehensive framework to evaluate recent GC methods across multiple dimensions and to generate new insights. Our experimental findings provide a deeper insights into the GC process and the characteristics of condensed graphs, guiding future efforts in enhancing performance and exploring new applications. Our code is available at \url{https://github.com/Emory-Melody/GraphSlim/tree/main/benchmark}.
comment: 9 pages
☆ AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models
Although Large Language Models (LLMs) are becoming increasingly powerful, they still exhibit significant but subtle weaknesses, such as mistakes in instruction-following or coding tasks. As these unexpected errors could lead to severe consequences in practical deployments, it is crucial to investigate the limitations within LLMs systematically. Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies, while manual inspections are costly and not scalable. In this paper, we introduce a unified framework, AutoDetect, to automatically expose weaknesses in LLMs across various tasks. Inspired by the educational assessment process that measures students' learning outcomes, AutoDetect consists of three LLM-powered agents: Examiner, Questioner, and Assessor. The collaboration among these three agents is designed to realize comprehensive and in-depth weakness identification. Our framework demonstrates significant success in uncovering flaws, with an identification success rate exceeding 30% in prominent models such as ChatGPT and Claude. More importantly, these identified weaknesses can guide specific model improvements, proving more effective than untargeted data augmentation methods like Self-Instruct. Our approach has led to substantial enhancements in popular LLMs, including the Llama series and Mistral-7b, boosting their performance by over 10% across several benchmarks. Code and data are publicly available at https://github.com/thu-coai/AutoDetect.
☆ CausalFormer: An Interpretable Transformer for Temporal Causal Discovery
Temporal causal discovery is a crucial task aimed at uncovering the causal relations within time series data. The latest temporal causal discovery methods usually train deep learning models on prediction tasks to uncover the causality between time series. They capture causal relations by analyzing the parameters of some components of the trained models, e.g., attention weights and convolution weights. However, this is an incomplete mapping process from the model parameters to the causality and fails to investigate the other components, e.g., fully connected layers and activation functions, that are also significant for causal discovery. To facilitate the utilization of the whole deep learning models in temporal causal discovery, we proposed an interpretable transformer-based causal discovery model termed CausalFormer, which consists of the causality-aware transformer and the decomposition-based causality detector. The causality-aware transformer learns the causal representation of time series data using a prediction task with the designed multi-kernel causal convolution which aggregates each input time series along the temporal dimension under the temporal priority constraint. Then, the decomposition-based causality detector interprets the global structure of the trained causality-aware transformer with the proposed regression relevance propagation to identify potential causal relations and finally construct the causal graph. Experiments on synthetic, simulated, and real datasets demonstrate the state-of-the-art performance of CausalFormer on discovering temporal causality. Our code is available at https://github.com/lingbai-kong/CausalFormer.
☆ Probabilistic Subgoal Representations for Hierarchical Reinforcement learning
In goal-conditioned hierarchical reinforcement learning (HRL), a high-level policy specifies a subgoal for the low-level policy to reach. Effective HRL hinges on a suitable subgoal represen tation function, abstracting state space into latent subgoal space and inducing varied low-level behaviors. Existing methods adopt a subgoal representation that provides a deterministic mapping from state space to latent subgoal space. Instead, this paper utilizes Gaussian Processes (GPs) for the first probabilistic subgoal representation. Our method employs a GP prior on the latent subgoal space to learn a posterior distribution over the subgoal representation functions while exploiting the long-range correlation in the state space through learnable kernels. This enables an adaptive memory that integrates long-range subgoal information from prior planning steps allowing to cope with stochastic uncertainties. Furthermore, we propose a novel learning objective to facilitate the simultaneous learning of probabilistic subgoal representations and policies within a unified framework. In experiments, our approach outperforms state-of-the-art baselines in standard benchmarks but also in environments with stochastic elements and under diverse reward conditions. Additionally, our model shows promising capabilities in transferring low-level policies across different tasks.
☆ Learning Interpretable Fair Representations
Numerous approaches have been recently proposed for learning fair representations that mitigate unfair outcomes in prediction tasks. A key motivation for these methods is that the representations can be used by third parties with unknown objectives. However, because current fair representations are generally not interpretable, the third party cannot use these fair representations for exploration, or to obtain any additional insights, besides the pre-contracted prediction tasks. Thus, to increase data utility beyond prediction tasks, we argue that the representations need to be fair, yet interpretable. We propose a general framework for learning interpretable fair representations by introducing an interpretable "prior knowledge" during the representation learning process. We implement this idea and conduct experiments with ColorMNIST and Dsprite datasets. The results indicate that in addition to being interpretable, our representations attain slightly higher accuracy and fairer outcomes in a downstream classification task compared to state-of-the-art fair representations.
☆ Coding schemes in neural networks learning classification tasks
Neural networks posses the crucial ability to generate meaningful representations of task-dependent features. Indeed, with appropriate scaling, supervised learning in neural networks can result in strong, task-dependent feature learning. However, the nature of the emergent representations, which we call the `coding scheme', is still unclear. To understand the emergent coding scheme, we investigate fully-connected, wide neural networks learning classification tasks using the Bayesian framework where learning shapes the posterior distribution of the network weights. Consistent with previous findings, our analysis of the feature learning regime (also known as `non-lazy', `rich', or `mean-field' regime) shows that the networks acquire strong, data-dependent features. Surprisingly, the nature of the internal representations depends crucially on the neuronal nonlinearity. In linear networks, an analog coding scheme of the task emerges. Despite the strong representations, the mean predictor is identical to the lazy case. In nonlinear networks, spontaneous symmetry breaking leads to either redundant or sparse coding schemes. Our findings highlight how network properties such as scaling of weights and neuronal nonlinearity can profoundly influence the emergent representations.
☆ Link Prediction with Untrained Message Passing Layers
Message passing neural networks (MPNNs) operate on graphs by exchanging information between neigbouring nodes. MPNNs have been successfully applied to various node-, edge-, and graph-level tasks in areas like molecular science, computer vision, natural language processing, and combinatorial optimization. However, most MPNNs require training on large amounts of labeled data, which can be costly and time-consuming. In this work, we explore the use of various untrained message passing layers in graph neural networks, i.e. variants of popular message passing architecture where we remove all trainable parameters that are used to transform node features in the message passing step. Focusing on link prediction, we find that untrained message passing layers can lead to competitive and even superior performance compared to fully trained MPNNs, especially in the presence of high-dimensional features. We provide a theoretical analysis of untrained message passing by relating the inner products of features implicitly produced by untrained message passing layers to path-based topological node similarity measures. As such, untrained message passing architectures can be viewed as a highly efficient and interpretable approach to link prediction.
☆ Repulsive Score Distillation for Diverse Sampling of Diffusion Models
Score distillation sampling has been pivotal for integrating diffusion models into generation of complex visuals. Despite impressive results it suffers from mode collapse and lack of diversity. To cope with this challenge, we leverage the gradient flow interpretation of score distillation to propose Repulsive Score Distillation (RSD). In particular, we propose a variational framework based on repulsion of an ensemble of particles that promotes diversity. Using a variational approximation that incorporates a coupling among particles, the repulsion appears as a simple regularization that allows interaction of particles based on their relative pairwise similarity, measured e.g., via radial basis kernels. We design RSD for both unconstrained and constrained sampling scenarios. For constrained sampling we focus on inverse problems in the latent space that leads to an augmented variational formulation, that strikes a good balance between compute, quality and diversity. Our extensive experiments for text-to-image generation, and inverse problems demonstrate that RSD achieves a superior trade-off between diversity and quality compared with state-of-the-art alternatives.
☆ A Comprehensive Review of Emerging Approaches in Machine Learning for De Novo PROTAC Design
Targeted protein degradation (TPD) is a rapidly growing field in modern drug discovery that aims to regulate the intracellular levels of proteins by harnessing the cell's innate degradation pathways to selectively target and degrade disease-related proteins. This strategy creates new opportunities for therapeutic intervention in cases where occupancy-based inhibitors have not been successful. Proteolysis-targeting chimeras (PROTACs) are at the heart of TPD strategies, which leverage the ubiquitin-proteasome system for the selective targeting and proteasomal degradation of pathogenic proteins. As the field evolves, it becomes increasingly apparent that the traditional methodologies for designing such complex molecules have limitations. This has led to the use of machine learning (ML) and generative modeling to improve and accelerate the development process. In this review, we explore the impact of ML on de novo PROTAC design $-$ an aspect of molecular design that has not been comprehensively reviewed despite its significance. We delve into the distinct characteristics of PROTAC linker design, underscoring the complexities required to create effective bifunctional molecules capable of TPD. We then examine how ML in the context of fragment-based drug design (FBDD), honed in the realm of small-molecule drug discovery, is paving the way for PROTAC linker design. Our review provides a critical evaluation of the limitations inherent in applying this method to the complex field of PROTAC development. Moreover, we review existing ML works applied to PROTAC design, highlighting pioneering efforts and, importantly, the limitations these studies face. By offering insights into the current state of PROTAC development and the integral role of ML in PROTAC design, we aim to provide valuable perspectives for researchers in their pursuit of better design strategies for this new modality.
☆ Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation
Segmenting text into sentences plays an early and crucial role in many NLP systems. This is commonly achieved by using rule-based or statistical methods relying on lexical features such as punctuation. Although some recent works no longer exclusively rely on punctuation, we find that no prior method achieves all of (i) robustness to missing punctuation, (ii) effective adaptability to new domains, and (iii) high efficiency. We introduce a new model - Segment any Text (SaT) - to solve this problem. To enhance robustness, we propose a new pretraining scheme that ensures less reliance on punctuation. To address adaptability, we introduce an extra stage of parameter-efficient fine-tuning, establishing state-of-the-art performance in distinct domains such as verses from lyrics and legal documents. Along the way, we introduce architectural modifications that result in a threefold gain in speed over the previous state of the art and solve spurious reliance on context far in the future. Finally, we introduce a variant of our model with fine-tuning on a diverse, multilingual mixture of sentence-segmented data, acting as a drop-in replacement and enhancement for existing segmentation tools. Overall, our contributions provide a universal approach for segmenting any text. Our method outperforms all baselines - including strong LLMs - across 8 corpora spanning diverse domains and languages, especially in practically relevant situations where text is poorly formatted. Our models and code, including documentation, are available at https://huggingface.co/segment-any-text under the MIT license.
☆ Cubic regularized subspace Newton for non-convex optimization
This paper addresses the optimization problem of minimizing non-convex continuous functions, which is relevant in the context of high-dimensional machine learning applications characterized by over-parametrization. We analyze a randomized coordinate second-order method named SSCN which can be interpreted as applying cubic regularization in random subspaces. This approach effectively reduces the computational complexity associated with utilizing second-order information, rendering it applicable in higher-dimensional scenarios. Theoretically, we establish convergence guarantees for non-convex functions, with interpolating rates for arbitrary subspace sizes and allowing inexact curvature estimation. When increasing subspace size, our complexity matches $\mathcal{O}(\epsilon^{-3/2})$ of the cubic regularization (CR) rate. Additionally, we propose an adaptive sampling scheme ensuring exact convergence rate of $\mathcal{O}(\epsilon^{-3/2}, \epsilon^{-3})$ to a second-order stationary point, even without sampling all coordinates. Experimental results demonstrate substantial speed-ups achieved by SSCN compared to conventional first-order methods.
☆ Data-driven Modeling in Metrology -- A Short Introduction, Current Developments and Future Perspectives
Mathematical models are vital to the field of metrology, playing a key role in the derivation of measurement results and the calculation of uncertainties from measurement data, informed by an understanding of the measurement process. These models generally represent the correlation between the quantity being measured and all other pertinent quantities. Such relationships are used to construct measurement systems that can interpret measurement data to generate conclusions and predictions about the measurement system itself. Classic models are typically analytical, built on fundamental physical principles. However, the rise of digital technology, expansive sensor networks, and high-performance computing hardware have led to a growing shift towards data-driven methodologies. This trend is especially prominent when dealing with large, intricate networked sensor systems in situations where there is limited expert understanding of the frequently changing real-world contexts. Here, we demonstrate the variety of opportunities that data-driven modeling presents, and how they have been already implemented in various real-world applications.
comment: 31 pages, Preprint
☆ ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models
The high power consumption and latency-sensitive deployments of large language models (LLMs) have motivated techniques like quantization and sparsity. Contextual sparsity, where the sparsity pattern is input-dependent, is crucial in LLMs because the permanent removal of attention heads or neurons from LLMs can significantly degrade accuracy. Prior work has attempted to model contextual sparsity using neural networks trained to predict activation magnitudes, which can be used to dynamically prune structures with low predicted activation magnitude. In this paper, we look beyond magnitude-based pruning criteria to assess attention head and neuron importance in LLMs. We developed a novel predictor called ShadowLLM, which can shadow the LLM behavior and enforce better sparsity patterns, resulting in over 15% improvement in end-to-end accuracy without increasing latency compared to previous methods. ShadowLLM achieves up to a 20\% speed-up over the state-of-the-art DejaVu framework. These enhancements are validated on models with up to 30 billion parameters. Our code is available at \href{https://github.com/abdelfattah-lab/shadow_llm/}{ShadowLLM}.
☆ No More Sliding-Windows: Dynamic Functional Connectivity Based On Random Convolutions Without Learning
In the field of dynamic functional connectivity, the sliding-window method is widely used and its stability is generally recognized. However, the sliding-window method's data processing within the window is overly simplistic, which to some extent limits its effectiveness. This study proposes a feature expansion method based on random convolution, which achieves better and more noise-resistant results than the sliding-window method without requiring training. Experiments on simulated data show that the dynamic functional connectivity matrix and time series obtained using the random convolution method have a higher degree of fit (95.59\%) with the standard answers within shorter time windows, compared to the sliding-window method (45.99\%). Gender difference studies on real data also reveal that the random convolution method uncovers more gender differences than the sliding-window method. Through theoretical analysis, we propose a more comprehensive convolutional functional connectivity computation model, with the sliding-window method being a special case of this model, thereby opening up vast potential for research methods in dynamic functional connectivity.
☆ Evaluating the Robustness of Deep-Learning Algorithm-Selection Models by Evolving Adversarial Instances PPSN 2024
Deep neural networks (DNN) are increasingly being used to perform algorithm-selection in combinatorial optimisation domains, particularly as they accommodate input representations which avoid designing and calculating features. Mounting evidence from domains that use images as input shows that deep convolutional networks are vulnerable to adversarial samples, in which a small perturbation of an instance can cause the DNN to misclassify. However, it remains unknown as to whether deep recurrent networks (DRN) which have recently been shown promise as algorithm-selectors in the bin-packing domain are equally vulnerable. We use an evolutionary algorithm (EA) to find perturbations of instances from two existing benchmarks for online bin packing that cause trained DRNs to misclassify: adversarial samples are successfully generated from up to 56% of the original instances depending on the dataset. Analysis of the new misclassified instances sheds light on the `fragility' of some training instances, i.e. instances where it is trivial to find a small perturbation that results in a misclassification and the factors that influence this. Finally, the method generates a large number of new instances misclassified with a wide variation in confidence, providing a rich new source of training data to create more robust models.
comment: To appear in the proceedings of the 18th International Conference on Parallel Problem Solving from Nature (PPSN 2024)
☆ When Invariant Representation Learning Meets Label Shift: Insufficiency and Theoretical Insights
As a crucial step toward real-world learning scenarios with changing environments, dataset shift theory and invariant representation learning algorithm have been extensively studied to relax the identical distribution assumption in classical learning setting. Among the different assumptions on the essential of shifting distributions, generalized label shift (GLS) is the latest developed one which shows great potential to deal with the complex factors within the shift. In this paper, we aim to explore the limitations of current dataset shift theory and algorithm, and further provide new insights by presenting a comprehensive understanding of GLS. From theoretical aspect, two informative generalization bounds are derived, and the GLS learner is proved to be sufficiently close to optimal target model from the Bayesian perspective. The main results show the insufficiency of invariant representation learning, and prove the sufficiency and necessity of GLS correction for generalization, which provide theoretical supports and innovations for exploring generalizable model under dataset shift. From methodological aspect, we provide a unified view of existing shift correction frameworks, and propose a kernel embedding-based correction algorithm (KECA) to minimize the generalization error and achieve successful knowledge transfer. Both theoretical results and extensive experiment evaluations demonstrate the sufficiency and necessity of GLS correction for addressing dataset shift and the superiority of proposed algorithm.
comment: Accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
☆ Cherry on the Cake: Fairness is NOT an Optimization Problem
Fair cake-cutting is a mathematical subfield that studies the problem of fairly dividing a resource among a number of participants. The so-called ``cake,'' as an object, represents any resource that can be distributed among players. This concept is connected to supervised multi-label classification: any dataset can be thought of as a cake that needs to be distributed, where each label is a player that receives its share of the dataset. In particular, any efficient cake-cutting solution for the dataset is equivalent to an optimal decision function. Although we are not the first to demonstrate this connection, the important ramifications of this parallel seem to have been partially forgotten. We revisit these classical results and demonstrate how this connection can be prolifically used for fairness in machine learning problems. Understanding the set of achievable fair decisions is a fundamental step in finding optimal fair solutions and satisfying fairness requirements. By employing the tools of cake-cutting theory, we have been able to describe the behavior of optimal fair decisions, which, counterintuitively, often exhibit quite unfair properties. Specifically, in order to satisfy fairness constraints, it is sometimes preferable, in the name of optimality, to purposefully make mistakes and deny giving the positive label to deserving individuals in a community in favor of less worthy individuals within the same community. This practice is known in the literature as cherry-picking and has been described as ``blatantly unfair.''
☆ CLEAR: Can Language Models Really Understand Causal Graphs?
Causal reasoning is a cornerstone of how humans interpret the world. To model and reason about causality, causal graphs offer a concise yet effective solution. Given the impressive advancements in language models, a crucial question arises: can they really understand causal graphs? To this end, we pioneer an investigation into language models' understanding of causal graphs. Specifically, we develop a framework to define causal graph understanding, by assessing language models' behaviors through four practical criteria derived from diverse disciplines (e.g., philosophy and psychology). We then develop CLEAR, a novel benchmark that defines three complexity levels and encompasses 20 causal graph-based tasks across these levels. Finally, based on our framework and benchmark, we conduct extensive experiments on six leading language models and summarize five empirical findings. Our results indicate that while language models demonstrate a preliminary understanding of causal graphs, significant potential for improvement remains. Our project website is at https://github.com/OpenCausaLab/CLEAR.
☆ Measuring the Recyclability of Electronic Components to Assist Automatic Disassembly and Sorting Waste Printed Circuit Boards
The waste of electrical and electronic equipment has been increased due to the fast evolution of technology products and competition of many IT sectors. Every year millions of tons of electronic waste are thrown into the environment which causes high consequences for human health. Therefore, it is crucial to control this waste flow using technology, especially using Artificial Intelligence but also reclamation of critical raw materials for new production processes. In this paper, we focused on the measurement of recyclability of waste electronic components (WECs) from waste printed circuit boards (WPCBs) using mathematical innovation model. This innovative approach evaluates both the recyclability and recycling difficulties of WECs, integrating an AI model for improved disassembly and sorting. Assessing the recyclability of individual electronic components present on WPCBs provides insight into the recovery potential of valuable materials and indicates the level of complexity involved in recycling in terms of economic worth and production utility. This novel measurement approach helps AI models in accurately determining the number of classes to be identified and sorted during the automated disassembly of discarded PCBs. It also facilitates the model in iterative training and validation of individual electronic components.
comment: 15 pages, 6 figures
☆ Forecasting with Deep Learning: Beyond Average of Average of Average Performance
Accurate evaluation of forecasting models is essential for ensuring reliable predictions. Current practices for evaluating and comparing forecasting models focus on summarising performance into a single score, using metrics such as SMAPE. We hypothesize that averaging performance over all samples dilutes relevant information about the relative performance of models. Particularly, conditions in which this relative performance is different than the overall accuracy. We address this limitation by proposing a novel framework for evaluating univariate time series forecasting models from multiple perspectives, such as one-step ahead forecasting versus multi-step ahead forecasting. We show the advantages of this framework by comparing a state-of-the-art deep learning approach with classical forecasting techniques. While classical methods (e.g. ARIMA) are long-standing approaches to forecasting, deep neural networks (e.g. NHITS) have recently shown state-of-the-art forecasting performance in benchmark datasets. We conducted extensive experiments that show NHITS generally performs best, but its superiority varies with forecasting conditions. For instance, concerning the forecasting horizon, NHITS only outperforms classical approaches for multi-step ahead forecasting. Another relevant insight is that, when dealing with anomalies, NHITS is outperformed by methods such as Theta. These findings highlight the importance of aspect-based model evaluation.
☆ Personalized federated learning based on feature fusion
Federated learning enables distributed clients to collaborate on training while storing their data locally to protect client privacy. However, due to the heterogeneity of data, models, and devices, the final global model may need to perform better for tasks on each client. Communication bottlenecks, data heterogeneity, and model heterogeneity have been common challenges in federated learning. In this work, we considered a label distribution skew problem, a type of data heterogeneity easily overlooked. In the context of classification, we propose a personalized federated learning approach called pFedPM. In our process, we replace traditional gradient uploading with feature uploading, which helps reduce communication costs and allows for heterogeneous client models. These feature representations play a role in preserving privacy to some extent. We use a hyperparameter $a$ to mix local and global features, which enables us to control the degree of personalization. We also introduced a relation network as an additional decision layer, which provides a non-linear learnable classifier to predict labels. Experimental results show that, with an appropriate setting of $a$, our scheme outperforms several recent FL methods on MNIST, FEMNIST, and CRIFAR10 datasets and achieves fewer communications.
☆ Differentiable Distributionally Robust Optimization Layers
In recent years, there has been a growing research interest in decision-focused learning, which embeds optimization problems as a layer in learning pipelines and demonstrates a superior performance than the prediction-focused approach. However, for distributionally robust optimization (DRO), a popular paradigm for decision-making under uncertainty, it is still unknown how to embed it as a layer, i.e., how to differentiate decisions with respect to an ambiguity set. In this paper, we develop such differentiable DRO layers for generic mixed-integer DRO problems with parameterized second-order conic ambiguity sets and discuss its extension to Wasserstein ambiguity sets. To differentiate the mixed-integer decisions, we propose a novel dual-view methodology by handling continuous and discrete parts of decisions via different principles. Specifically, we construct a differentiable energy-based surrogate to implement the dual-view methodology and use importance sampling to estimate its gradient. We further prove that such a surrogate enjoys the asymptotic convergency under regularization. As an application of the proposed differentiable DRO layers, we develop a novel decision-focused learning pipeline for contextual distributionally robust decision-making tasks and compare it with the prediction-focused approach in experiments.
comment: In Forty-first International Conference on Machine Learning (2024)
☆ Noisy Neighbors: Efficient membership inference attacks against LLMs
The potential of transformer-based LLMs risks being hindered by privacy concerns due to their reliance on extensive datasets, possibly including sensitive information. Regulatory measures like GDPR and CCPA call for using robust auditing tools to address potential privacy issues, with Membership Inference Attacks (MIA) being the primary method for assessing LLMs' privacy risks. Differently from traditional MIA approaches, often requiring computationally intensive training of additional models, this paper introduces an efficient methodology that generates \textit{noisy neighbors} for a target sample by adding stochastic noise in the embedding space, requiring operating the target model in inference mode only. Our findings demonstrate that this approach closely matches the effectiveness of employing shadow models, showing its usability in practical privacy auditing scenarios.
☆ Efficient k-means with Individual Fairness via Exponential Tilting
In location-based resource allocation scenarios, the distances between each individual and the facility are desired to be approximately equal, thereby ensuring fairness. Individually fair clustering is often employed to achieve the principle of treating all points equally, which can be applied in these scenarios. This paper proposes a novel algorithm, tilted k-means (TKM), aiming to achieve individual fairness in clustering. We integrate the exponential tilting into the sum of squared errors (SSE) to formulate a novel objective function called tilted SSE. We demonstrate that the tilted SSE can generalize to SSE and employ the coordinate descent and first-order gradient method for optimization. We propose a novel fairness metric, the variance of the distances within each cluster, which can alleviate the Matthew Effect typically caused by existing fairness metrics. Our theoretical analysis demonstrates that the well-known k-means++ incurs a multiplicative error of O(k log k), and we establish the convergence of TKM under mild conditions. In terms of fairness, we prove that the variance generated by TKM decreases with a scaled hyperparameter. In terms of efficiency, we demonstrate the time complexity is linear with the dataset size. Our experiments demonstrate that TKM outperforms state-of-the-art methods in effectiveness, fairness, and efficiency.
☆ Inference of Sequential Patterns for Neural Message Passing in Temporal Graphs
The modelling of temporal patterns in dynamic graphs is an important current research issue in the development of time-aware GNNs. Whether or not a specific sequence of events in a temporal graph constitutes a temporal pattern not only depends on the frequency of its occurrence. We consider whether it deviates from what is expected in a temporal graph where timestamps are randomly shuffled. While accounting for such a random baseline is important to model temporal patterns, it has mostly been ignored by current temporal graph neural networks. To address this issue we propose HYPA-DBGNN, a novel two-step approach that combines (i) the inference of anomalous sequential patterns in time series data on graphs based on a statistically principled null model, with (ii) a neural message passing approach that utilizes a higher-order De Bruijn graph whose edges capture overrepresented sequential patterns. Our method leverages hypergeometric graph ensembles to identify anomalous edges within both first- and higher-order De Bruijn graphs, which encode the temporal ordering of events. The model introduces an inductive bias that enhances model interpretability. We evaluate our approach for static node classification using benchmark datasets and a synthetic dataset that showcases its ability to incorporate the observed inductive bias regarding over- and under-represented temporal edges. We demonstrate the framework's effectiveness in detecting similar patterns within empirical datasets, resulting in superior performance compared to baseline methods in node classification tasks. To the best of our knowledge, our work is the first to introduce statistically informed GNNs that leverage temporal and causal sequence anomalies. HYPA-DBGNN represents a path for bridging the gap between statistical graph inference and neural graph representation learning, with potential applications to static GNNs.
☆ Improving robustness to corruptions with multiplicative weight perturbations
Deep neural networks (DNNs) excel on clean images but struggle with corrupted ones. Incorporating specific corruptions into the data augmentation pipeline can improve robustness to those corruptions but may harm performance on clean images and other types of distortion. In this paper, we introduce an alternative approach that improves the robustness of DNNs to a wide range of corruptions without compromising accuracy on clean images. We first demonstrate that input perturbations can be mimicked by multiplicative perturbations in the weight space. Leveraging this, we propose Data Augmentation via Multiplicative Perturbation (DAMP), a training method that optimizes DNNs under random multiplicative weight perturbations. We also examine the recently proposed Adaptive Sharpness-Aware Minimization (ASAM) and show that it optimizes DNNs under adversarial multiplicative weight perturbations. Experiments on image classification datasets (CIFAR-10/100, TinyImageNet and ImageNet) and neural network architectures (ResNet50, ViT-S/16) show that DAMP enhances model generalization performance in the presence of corruptions across different settings. Notably, DAMP is able to train a ViT-S/16 on ImageNet from scratch, reaching the top-1 error of 23.7% which is comparable to ResNet50 without extensive data augmentations.
comment: Under review
☆ Token-based Decision Criteria Are Suboptimal in In-context Learning
In-Context Learning (ICL) typically utilizes classification criteria from probabilities of manually selected label tokens. However, we argue that such token-based classification criteria lead to suboptimal decision boundaries, despite delicate calibrations through translation and constrained rotation. To address this problem, we propose Hidden Calibration, which renounces token probabilities and uses the nearest centroid classifier on the LM's last hidden states. In detail, we use the nearest centroid classification on the hidden states, assigning the category of the nearest centroid previously observed from a few-shot calibration set to the test sample as the predicted label. Our experiments on 3 models and 10 classification datasets indicate that Hidden Calibration consistently outperforms current token-based calibrations by about 20%. Our further analysis demonstrates that Hidden Calibration finds better classification criteria with less inter-categories overlap, and LMs provide linearly separable intra-category clusters with the help of demonstrations, which supports Hidden Calibration and gives new insights into the conventional ICL.
comment: 21 pages, 14 figures, 8 tables
☆ Conditional Bayesian Quadrature
We propose a novel approach for estimating conditional or parametric expectations in the setting where obtaining samples or evaluating integrands is costly. Through the framework of probabilistic numerical methods (such as Bayesian quadrature), our novel approach allows to incorporates prior information about the integrands especially the prior smoothness knowledge about the integrands and the conditional expectation. As a result, our approach provides a way of quantifying uncertainty and leads to a fast convergence rate, which is confirmed both theoretically and empirically on challenging tasks in Bayesian sensitivity analysis, computational finance and decision making under uncertainty.
☆ SyROCCo: Enhancing Systematic Reviews using Machine Learning
The sheer number of research outputs published every year makes systematic reviewing increasingly time- and resource-intensive. This paper explores the use of machine learning techniques to help navigate the systematic review process. ML has previously been used to reliably 'screen' articles for review - that is, identify relevant articles based on reviewers' inclusion criteria. The application of ML techniques to subsequent stages of a review, however, such as data extraction and evidence mapping, is in its infancy. We therefore set out to develop a series of tools that would assist in the profiling and analysis of 1,952 publications on the theme of 'outcomes-based contracting'. Tools were developed for the following tasks: assign publications into 'policy area' categories; identify and extract key information for evidence mapping, such as organisations, laws, and geographical information; connect the evidence base to an existing dataset on the same topic; and identify subgroups of articles that may share thematic content. An interactive tool using these techniques and a public dataset with their outputs have been released. Our results demonstrate the utility of ML techniques to enhance evidence accessibility and analysis within the systematic review processes. These efforts show promise in potentially yielding substantial efficiencies for future systematic reviewing and for broadening their analytical scope. Our work suggests that there may be implications for the ease with which policymakers and practitioners can access evidence. While ML techniques seem poised to play a significant role in bridging the gap between research and policy by offering innovative ways of gathering, accessing, and analysing data from systematic reviews, we also highlight their current limitations and the need to exercise caution in their application, particularly given the potential for errors and biases.
comment: 28 pages, 5 figures. To appear in Data & Policy journal
☆ OAML: Outlier Aware Metric Learning for OOD Detection Enhancement
Out-of-distribution (OOD) detection methods have been developed to identify objects that a model has not seen during training. The Outlier Exposure (OE) methods use auxiliary datasets to train OOD detectors directly. However, the collection and learning of representative OOD samples may pose challenges. To tackle these issues, we propose the Outlier Aware Metric Learning (OAML) framework. The main idea of our method is to use the k-NN algorithm and Stable Diffusion model to generate outliers for training at the feature level without making any distributional assumptions. To increase feature discrepancies in the semantic space, we develop a mutual information-based contrastive learning approach for learning from OOD data effectively. Both theoretical and empirical results confirm the effectiveness of this contrastive learning technique. Furthermore, we incorporate knowledge distillation into our learning framework to prevent degradation of in-distribution classification accuracy. The combination of contrastive learning and knowledge distillation algorithms significantly enhances the performance of OOD detection. Experimental results across various datasets show that our method significantly outperforms previous OE methods.
☆ UNICAD: A Unified Approach for Attack Detection, Noise Reduction and Novel Class Identification
As the use of Deep Neural Networks (DNNs) becomes pervasive, their vulnerability to adversarial attacks and limitations in handling unseen classes poses significant challenges. The state-of-the-art offers discrete solutions aimed to tackle individual issues covering specific adversarial attack scenarios, classification or evolving learning. However, real-world systems need to be able to detect and recover from a wide range of adversarial attacks without sacrificing classification accuracy and to flexibly act in {\bf unseen} scenarios. In this paper, UNICAD, is proposed as a novel framework that integrates a variety of techniques to provide an adaptive solution. For the targeted image classification, UNICAD achieves accurate image classification, detects unseen classes, and recovers from adversarial attacks using Prototype and Similarity-based DNNs with denoising autoencoders. Our experiments performed on the CIFAR-10 dataset highlight UNICAD's effectiveness in adversarial mitigation and unseen class classification, outperforming traditional models.
☆ Robust prediction under missingness shifts
Prediction becomes more challenging with missing covariates. What method is chosen to handle missingness can greatly affect how models perform. In many real-world problems, the best prediction performance is achieved by models that can leverage the informative nature of a value being missing. Yet, the reasons why a covariate goes missing can change once a model is deployed in practice. If such a missingness shift occurs, the conditional probability of a value being missing differs in the target data. Prediction performance in the source data may no longer be a good selection criterion, and approaches that do not rely on informative missingness may be preferable. However, we show that the Bayes predictor remains unchanged by ignorable shifts for which the probability of missingness only depends on observed data. Any consistent estimator of the Bayes predictor may therefore result in robust prediction under those conditions, although we show empirically that different methods appear robust to different types of shifts. If the missingness shift is non-ignorable, the Bayes predictor may change due to the shift. While neither approach recovers the Bayes predictor in this case, we found empirically that disregarding missingness was most beneficial when it was highly informative.
☆ Improving Quaternion Neural Networks with Quaternionic Activation Functions
In this paper, we propose novel quaternion activation functions where we modify either the quaternion magnitude or the phase, as an alternative to the commonly used split activation functions. We define criteria that are relevant for quaternion activation functions, and subsequently we propose our novel activation functions based on this analysis. Instead of applying a known activation function like the ReLU or Tanh on the quaternion elements separately, these activation functions consider the quaternion properties and respect the quaternion space $\mathbb{H}$. In particular, all quaternion components are utilized to calculate all output components, carrying out the benefit of the Hamilton product in e.g. the quaternion convolution to the activation functions. The proposed activation functions can be incorporated in arbitrary quaternion valued neural networks trained with gradient descent techniques. We further discuss the derivatives of the proposed activation functions where we observe beneficial properties for the activation functions affecting the phase. Specifically, they prove to be sensitive on basically the whole input range, thus improved gradient flow can be expected. We provide an elaborate experimental evaluation of our proposed quaternion activation functions including comparison with the split ReLU and split Tanh on two image classification tasks using the CIFAR-10 and SVHN dataset. There, especially the quaternion activation functions affecting the phase consistently prove to provide better performance.
☆ The Hidden Pitfalls of the Cosine Similarity Loss
We show that the gradient of the cosine similarity between two points goes to zero in two under-explored settings: (1) if a point has large magnitude or (2) if the points are on opposite ends of the latent space. Counterintuitively, we prove that optimizing the cosine similarity between points forces them to grow in magnitude. Thus, (1) is unavoidable in practice. We then observe that these derivations are extremely general -- they hold across deep learning architectures and for many of the standard self-supervised learning (SSL) loss functions. This leads us to propose cut-initialization: a simple change to network initialization that helps all studied SSL methods converge faster.
☆ SLOctolyzer: Fully automatic analysis toolkit for segmentation and feature extracting in scanning laser ophthalmoscopy images
Purpose: To describe SLOctolyzer: an open-source analysis toolkit for en face retinal vessels appearing in infrared reflectance scanning laser ophthalmoscopy (SLO) images. Methods: SLOctolyzer includes two main modules: segmentation and measurement. The segmentation module use deep learning methods to delineate retinal anatomy, while the measurement module quantifies key retinal vascular features such as vessel complexity, density, tortuosity, and calibre. We evaluate the segmentation module using unseen data and measure its reproducibility. Results: SLOctolyzer's segmentation module performed well against unseen internal test data (Dice for all-vessels, 0.9097; arteries, 0.8376; veins, 0.8525; optic disc, 0.9430; fovea, 0.8837). External validation against severe retinal pathology showed decreased performance (Dice for arteries, 0.7180; veins, 0.7470; optic disc, 0.9032). SLOctolyzer had good reproducibility (mean difference for fractal dimension, -0.0007; vessel density, -0.0003; vessel calibre, -0.3154 $\mu$m; tortuosity density, 0.0013). SLOctolyzer can process a macula-centred SLO image in under 20 seconds and a disc-centred SLO image in under 30 seconds using a standard laptop CPU. Conclusions: To our knowledge, SLOctolyzer is the first open-source tool to convert raw SLO images into reproducible and clinically meaningful retinal vascular parameters. SLO images are captured simultaneous to optical coherence tomography (OCT), and we believe our software will be useful for extracting retinal vascular measurements from large OCT image sets and linking them to ocular or systemic diseases. It requires no specialist knowledge or proprietary software, and allows manual correction of segmentations and re-computing of vascular metrics. SLOctolyzer is freely available at https://github.com/jaburke166/SLOctolyzer.
comment: 10 pages, 5 figures, 6 tables + Supplementary (7 pages, 10 figures, 4 tables). Submitted for peer review at Translational Vision Science and Technology
☆ Automated Privacy-Preserving Techniques via Meta-Learning
Sharing private data for learning tasks is pivotal for transparent and secure machine learning applications. Many privacy-preserving techniques have been proposed for this task aiming to transform the data while ensuring the privacy of individuals. Some of these techniques have been incorporated into tools, whereas others are accessed through various online platforms. However, such tools require manual configuration, which can be complex and time-consuming. Moreover, they require substantial expertise, potentially restricting their use to those with advanced technical knowledge. In this paper, we propose AUTOPRIV, the first automated privacy-preservation method, that eliminates the need for any manual configuration. AUTOPRIV employs meta-learning to automate the de-identification process, facilitating the secure release of data for machine learning tasks. The main goal is to anticipate the predictive performance and privacy risk of a large set of privacy configurations. We provide a ranked list of the most promising solutions, which are likely to achieve an optimal approximation within a new domain. AUTOPRIV is highly effective as it reduces computational complexity and energy consumption considerably.
comment: 12 pages, 6 figures, 3 tables
☆ Theory on Mixture-of-Experts in Continual Learning
Continual learning (CL) has garnered significant attention because of its ability to adapt to new tasks that arrive over time. Catastrophic forgetting (of old tasks) has been identified as a major issue in CL, as the model adapts to new tasks. The Mixture-of-Experts (MoE) model has recently been shown to effectively mitigate catastrophic forgetting in CL, by employing a gating network to sparsify and distribute diverse tasks among multiple experts. However, there is a lack of theoretical analysis of MoE and its impact on the learning performance in CL. This paper provides the first theoretical results to characterize the impact of MoE in CL via the lens of overparameterized linear regression tasks. We establish the benefit of MoE over a single expert by proving that the MoE model can diversify its experts to specialize in different tasks, while its router learns to select the right expert for each task and balance the loads across all experts. Our study further suggests an intriguing fact that the MoE in CL needs to terminate the update of the gating network after sufficient training rounds to attain system convergence, which is not needed in the existing MoE studies that do not consider the continual task arrival. Furthermore, we provide explicit expressions for the expected forgetting and overall generalization error to characterize the benefit of MoE in the learning performance in CL. Interestingly, adding more experts requires additional rounds before convergence, which may not enhance the learning performance. Finally, we conduct experiments on both synthetic and real datasets to extend these insights from linear models to deep neural networks (DNNs), which also shed light on the practical algorithm design for MoE in CL.
☆ Fault Detection for agents on power grid topology optimization: A Comprehensive analysis
The topology optimization of transmission networks using Deep Reinforcement Learning (DRL) has increasingly come into focus. Various researchers have proposed different DRL agents, which are often benchmarked on the Grid2Op environment from the Learning to Run a Power Network (L2RPN) challenges. The environments have many advantages with their realistic chronics and underlying power flow backends. However, the interpretation of agent survival or failure is not always clear, as there are a variety of potential causes. In this work, we focus on the failures of the power grid to identify patterns and detect them a priori. We collect the failed chronics of three different agents on the WCCI 2022 L2RPN environment, totaling about 40k data points. By clustering, we are able to detect five distinct clusters, identifying different failure types. Further, we propose a multi-class prediction approach to detect failures beforehand and evaluate five different models. Here, the Light Gradient-Boosting Machine (LightGBM) shows the best performance, with an accuracy of 86%. It also correctly identifies in 91% of the time failure and survival observations. Finally, we provide a detailed feature importance analysis that identifies critical features and regions in the grid.
comment: 11 Pages plus references and appendix. The appendix consist of additional material of the paper and is not included in the initial submission
☆ Memory-Enhanced Neural Solvers for Efficient Adaptation in Combinatorial Optimization
Combinatorial Optimization is crucial to numerous real-world applications, yet still presents challenges due to its (NP-)hard nature. Amongst existing approaches, heuristics often offer the best trade-off between quality and scalability, making them suitable for industrial use. While Reinforcement Learning (RL) offers a flexible framework for designing heuristics, its adoption over handcrafted heuristics remains incomplete within industrial solvers. Existing learned methods still lack the ability to adapt to specific instances and fully leverage the available computational budget. The current best methods either rely on a collection of pre-trained policies, or on data-inefficient fine-tuning; hence failing to fully utilize newly available information within the constraints of the budget. In response, we present MEMENTO, an RL approach that leverages memory to improve the adaptation of neural solvers at inference time. MEMENTO enables updating the action distribution dynamically based on the outcome of previous decisions. We validate its effectiveness on benchmark problems, in particular Traveling Salesman and Capacitated Vehicle Routing, demonstrating it can successfully be combined with standard methods to boost their performance under a given budget, both in and out-of-distribution, improving their performance on all 12 evaluated tasks.
☆ Towards Lightweight Graph Neural Network Search with Curriculum Graph Sparsification KDD 2024
Graph Neural Architecture Search (GNAS) has achieved superior performance on various graph-structured tasks. However, existing GNAS studies overlook the applications of GNAS in resource-constraint scenarios. This paper proposes to design a joint graph data and architecture mechanism, which identifies important sub-architectures via the valuable graph data. To search for optimal lightweight Graph Neural Networks (GNNs), we propose a Lightweight Graph Neural Architecture Search with Graph SparsIfication and Network Pruning (GASSIP) method. In particular, GASSIP comprises an operation-pruned architecture search module to enable efficient lightweight GNN search. Meanwhile, we design a novel curriculum graph data sparsification module with an architecture-aware edge-removing difficulty measurement to help select optimal sub-architectures. With the aid of two differentiable masks, we iteratively optimize these two modules to efficiently search for the optimal lightweight architecture. Extensive experiments on five benchmarks demonstrate the effectiveness of GASSIP. Particularly, our method achieves on-par or even higher node classification performance with half or fewer model parameters of searched GNNs and a sparser graph.
comment: Accepted by KDD 2024. The two first authors made equal contributions
☆ Compact Model Parameter Extraction via Derivative-Free Optimization
In this paper, we address the problem of compact model parameter extraction to simultaneously extract tens of parameters via derivative-free optimization. Traditionally, parameter extraction is performed manually by dividing the complete set of parameters into smaller subsets, each targeting different operational regions of the device, a process that can take several days or even weeks. Our approach streamlines this process by employing derivative-free optimization to identify a good parameter set that best fits the compact model without performing an exhaustive number of simulations. We further enhance the optimization process to address critical issues in device modeling by carefully choosing a loss function that evaluates model performance consistently across varying magnitudes by focusing on relative errors (as opposed to absolute errors), prioritizing accuracy in key operational regions of the device above a certain threshold, and reducing sensitivity to outliers. Furthermore, we utilize the concept of train-test split to assess the model fit and avoid overfitting. This is done by fitting 80% of the data and testing the model efficacy with the remaining 20%. We demonstrate the effectiveness of our methodology by successfully modeling two semiconductor devices: a diamond Schottky diode and a GaN-on-SiC HEMT, with the latter involving the ASM-HEMT DC model, which requires simultaneously extracting 35 model parameters to fit the model to the measured data. These examples demonstrate the effectiveness of our approach and showcase the practical benefits of derivative-free optimization in device modeling.
☆ METRIK: Measurement-Efficient Randomized Controlled Trials using Transformers with Input Masking
Clinical randomized controlled trials (RCTs) collect hundreds of measurements spanning various metric types (e.g., laboratory tests, cognitive/motor assessments, etc.) across 100s-1000s of subjects to evaluate the effect of a treatment, but do so at the cost of significant trial expense. To reduce the number of measurements, trial protocols can be revised to remove metrics extraneous to the study's objective, but doing so requires additional human labor and limits the set of hypotheses that can be studied with the collected data. In contrast, a planned missing design (PMD) can reduce the amount of data collected without removing any metric by imputing the unsampled data. Standard PMDs randomly sample data to leverage statistical properties of imputation algorithms, but are ad hoc, hence suboptimal. Methods that learn PMDs produce more sample-efficient PMDs, but are not suitable for RCTs because they require ample prior data (150+ subjects) to model the data distribution. Therefore, we introduce a framework called Measurement EfficienT Randomized Controlled Trials using Transformers with Input MasKing (METRIK), which, for the first time, calculates a PMD specific to the RCT from a modest amount of prior data (e.g., 60 subjects). Specifically, METRIK models the PMD as a learnable input masking layer that is optimized with a state-of-the-art imputer based on the Transformer architecture. METRIK implements a novel sampling and selection algorithm to generate a PMD that satisfies the trial designer's objective, i.e., whether to maximize sampling efficiency or imputation performance for a given sampling budget. Evaluated across five real-world clinical RCT datasets, METRIK increases the sampling efficiency of and imputation performance under the generated PMD by leveraging correlations over time and across metrics, thereby removing the need to manually remove metrics from the RCT.
comment: 18 pages, 11 figures
☆ AnnotatedTables: A Large Tabular Dataset with Language Model Annotations
Tabular data is ubiquitous in real-world applications and abundant on the web, yet its annotation has traditionally required human labor, posing a significant scalability bottleneck for tabular machine learning. Our methodology can successfully annotate a large amount of tabular data and can be flexibly steered to generate various types of annotations based on specific research objectives, as we demonstrate with SQL annotation and input-target column annotation as examples. As a result, we release AnnotatedTables, a collection of 32,119 databases with LLM-generated annotations. The dataset includes 405,616 valid SQL programs, making it the largest SQL dataset with associated tabular data that supports query execution. To further demonstrate the value of our methodology and dataset, we perform two follow-up research studies. 1) We investigate whether LLMs can translate SQL programs to Rel programs, a database language previously unknown to LLMs, while obtaining the same execution results. Using our Incremental Prompt Engineering methods based on execution feedback, we show that LLMs can produce adequate translations with few-shot learning. 2) We evaluate the performance of TabPFN, a recent neural tabular classifier trained on Bayesian priors, on 2,720 tables with input-target columns identified and annotated by LLMs. On average, TabPFN performs on par with the baseline AutoML method, though the relative performance can vary significantly from one data table to another, making both models viable for practical applications depending on the situation. Our findings underscore the potential of LLMs in automating the annotation of large volumes of diverse tabular data.
☆ Lesion-Aware Cross-Phase Attention Network for Renal Tumor Subtype Classification on Multi-Phase CT Scans
Multi-phase computed tomography (CT) has been widely used for the preoperative diagnosis of kidney cancer due to its non-invasive nature and ability to characterize renal lesions. However, since enhancement patterns of renal lesions across CT phases are different even for the same lesion type, the visual assessment by radiologists suffers from inter-observer variability in clinical practice. Although deep learning-based approaches have been recently explored for differential diagnosis of kidney cancer, they do not explicitly model the relationships between CT phases in the network design, limiting the diagnostic performance. In this paper, we propose a novel lesion-aware cross-phase attention network (LACPANet) that can effectively capture temporal dependencies of renal lesions across CT phases to accurately classify the lesions into five major pathological subtypes from time-series multi-phase CT images. We introduce a 3D inter-phase lesion-aware attention mechanism to learn effective 3D lesion features that are used to estimate attention weights describing the inter-phase relations of the enhancement patterns. We also present a multi-scale attention scheme to capture and aggregate temporal patterns of lesion features at different spatial scales for further improvement. Extensive experiments on multi-phase CT scans of kidney cancer patients from the collected dataset demonstrate that our LACPANet outperforms state-of-the-art approaches in diagnostic accuracy.
comment: This article has been accepted for publication in Computers in Biology and Medicine
☆ Multimodal Graph Benchmark
Associating unstructured data with structured information is crucial for real-world tasks that require relevance search. However, existing graph learning benchmarks often overlook the rich semantic information associate with each node. To bridge such gap, we introduce the Multimodal Graph Benchmark (MM-GRAPH), the first comprehensive multi-modal graph benchmark that incorporates both textual and visual information. MM-GRAPH surpasses previous efforts, which have primarily focused on text-attributed graphs with various connectivity patterns. MM-GRAPH consists of five graph learning datasets of various scales that are appropriate for different learning tasks. Their multimodal node features, enabling a more comprehensive evaluation of graph learning algorithms in real-world scenarios. To facilitate research on multimodal graph learning, we further provide an extensive study on the performance of various graph neural networks in the presence of features from various modalities. MM-GRAPH aims to foster research on multimodal graph learning and drive the development of more advanced and robust graph learning algorithms. By providing a diverse set of datasets and benchmarks, MM-GRAPH enables researchers to evaluate and compare their models in realistic settings, ultimately leading to improved performance on real-world applications that rely on multimodal graph data.
comment: https://mm-graph-benchmark.github.io/
☆ Does Cross-Cultural Alignment Change the Commonsense Morality of Language Models? ACL 2024
Alignment of the language model with human preferences is a common approach to making a language model useful to end users. However, most alignment work is done in English, and human preference datasets are dominated by English, reflecting only the preferences of English-speaking annotators. Nevertheless, it is common practice to use the English preference data, either directly or by translating it into the target language, when aligning a multilingual language model. The question is whether such an alignment strategy marginalizes the preference of non-English speaking users. To this end, we investigate the effect of aligning Japanese language models with (mostly) English resources. In particular, we focus on evaluating whether the commonsense morality of the resulting fine-tuned models is aligned with Japanese culture using the JCommonsenseMorality (JCM) and ETHICS datasets. The experimental results show that the fine-tuned model outperforms the SFT model. However, it does not demonstrate the same level of improvement as a model fine-tuned using the JCM, suggesting that while some aspects of commonsense morality are transferable, others may not be.
comment: The 2nd Workshop on Cross-Cultural Considerations in NLP (C3NLP) at ACL 2024
☆ Anomaly Detection of Tabular Data Using LLMs
Large language models (LLMs) have shown their potential in long-context understanding and mathematical reasoning. In this paper, we study the problem of using LLMs to detect tabular anomalies and show that pre-trained LLMs are zero-shot batch-level anomaly detectors. That is, without extra distribution-specific model fitting, they can discover hidden outliers in a batch of data, demonstrating their ability to identify low-density data regions. For LLMs that are not well aligned with anomaly detection and frequently output factual errors, we apply simple yet effective data-generating processes to simulate synthetic batch-level anomaly detection datasets and propose an end-to-end fine-tuning strategy to bring out the potential of LLMs in detecting real anomalies. Experiments on a large anomaly detection benchmark (ODDS) showcase i) GPT-4 has on-par performance with the state-of-the-art transductive learning-based anomaly detection methods and ii) the efficacy of our synthetic dataset and fine-tuning strategy in aligning LLMs to this task.
comment: accepted at the Anomaly Detection with Foundation Models workshop
☆ Cascade Reward Sampling for Efficient Decoding-Time Alignment
Aligning large language models (LLMs) with human preferences is critical for their deployment. Recently, decoding-time alignment has emerged as an effective plug-and-play technique that requires no fine-tuning of model parameters. However, generating text that achieves both high reward and high likelihood remains a significant challenge. Existing methods often fail to generate high-reward text or incur substantial computational costs. In this paper, we propose Cascade Reward Sampling (CARDS) to address both issues, guaranteeing the generation of high-reward and high-likelihood text with significantly low costs. Based on our analysis of reward models (RMs) on incomplete text and our observation that high-reward prefixes induce high-reward complete text, we use rejection sampling to iteratively generate small semantic segments to form such prefixes. The segment length is dynamically determined by the predictive uncertainty of LLMs. This strategy guarantees desirable prefixes for subsequent generations and significantly reduces wasteful token re-generations and the number of reward model scoring. Our experiments demonstrate substantial gains in both generation efficiency and alignment ratings compared to the baselines, achieving five times faster text generation and 99\% win-ties in GPT-4/Claude-3 helpfulness evaluation.
☆ Landscaping Linear Mode Connectivity ICML 2024
The presence of linear paths in parameter space between two different network solutions in certain cases, i.e., linear mode connectivity (LMC), has garnered interest from both theoretical and practical fronts. There has been significant research that either practically designs algorithms catered for connecting networks by adjusting for the permutation symmetries as well as some others that more theoretically construct paths through which networks can be connected. Yet, the core reasons for the occurrence of LMC, when in fact it does occur, in the highly non-convex loss landscapes of neural networks are far from clear. In this work, we take a step towards understanding it by providing a model of how the loss landscape needs to behave topographically for LMC (or the lack thereof) to manifest. Concretely, we present a `mountainside and ridge' perspective that helps to neatly tie together different geometric features that can be spotted in the loss landscape along the training runs. We also complement this perspective by providing a theoretical analysis of the barrier height, for which we provide empirical support, and which additionally extends as a faithful predictor of layer-wise LMC. We close with a toy example that provides further intuition on how barriers arise in the first place, all in all, showcasing the larger aim of the work -- to provide a working model of the landscape and its topography for the occurrence of LMC.
comment: ICML 2024 HiLD workshop paper
☆ Relaxing Continuous Constraints of Equivariant Graph Neural Networks for Physical Dynamics Learning
Incorporating Euclidean symmetries (e.g. rotation equivariance) as inductive biases into graph neural networks has improved their generalization ability and data efficiency in unbounded physical dynamics modeling. However, in various scientific and engineering applications, the symmetries of dynamics are frequently discrete due to the boundary conditions. Thus, existing GNNs either overlook necessary symmetry, resulting in suboptimal representation ability, or impose excessive equivariance, which fails to generalize to unobserved symmetric dynamics. In this work, we propose a general Discrete Equivariant Graph Neural Network (DEGNN) that guarantees equivariance to a given discrete point group. Specifically, we show that such discrete equivariant message passing could be constructed by transforming geometric features into permutation-invariant embeddings. Through relaxing continuous equivariant constraints, DEGNN can employ more geometric feature combinations to approximate unobserved physical object interaction functions. Two implementation approaches of DEGNN are proposed based on ranking or pooling permutation-invariant functions. We apply DEGNN to various physical dynamics, ranging from particle, molecular, crowd to vehicle dynamics. In twenty scenarios, DEGNN significantly outperforms existing state-of-the-art approaches. Moreover, we show that DEGNN is data efficient, learning with less data, and can generalize across scenarios such as unobserved orientation.
♻ ☆ Improving physics-informed DeepONets with hard constraints
Current physics-informed (standard or deep operator) neural networks still rely on accurately learning the initial and/or boundary conditions of the system of differential equations they are solving. In contrast, standard numerical methods involve such conditions in computations without needing to learn them. In this study, we propose to improve current physics-informed deep learning strategies such that initial and/or boundary conditions do not need to be learned and are represented exactly in the predicted solution. Moreover, this method guarantees that when a deep operator network is applied multiple times to time-step a solution of an initial value problem, the resulting function is at least continuous.
comment: 26 pages, 8 figures, 6 tables; extended version
♻ ☆ Scaling and renormalization in high-dimensional regression
This paper presents a succinct derivation of the training and generalization performance of a variety of high-dimensional ridge regression models using the basic tools of random matrix theory and free probability. We provide an introduction and review of recent results on these topics, aimed at readers with backgrounds in physics and deep learning. Analytic formulas for the training and generalization errors are obtained in a few lines of algebra directly from the properties of the $S$-transform of free probability. This allows for a straightforward identification of the sources of power-law scaling in model performance. We compute the generalization error of a broad class of random feature models. We find that in all models, the $S$-transform corresponds to the train-test generalization gap, and yields an analogue of the generalized-cross-validation estimator. Using these techniques, we derive fine-grained bias-variance decompositions for a very general class of random feature models with structured covariates. These novel results allow us to discover a scaling regime for random feature models where the variance due to the features limits performance in the overparameterized setting. We also demonstrate how anisotropic weight structure in random feature models can limit performance and lead to nontrivial exponents for finite-width corrections in the overparameterized setting. Our results extend and provide a unifying perspective on earlier models of neural scaling laws.
comment: 68 pages, 17 figures
♻ ☆ Generative Fractional Diffusion Models
We introduce the first continuous-time score-based generative model that leverages fractional diffusion processes for its underlying dynamics. Although diffusion models have excelled at capturing data distributions, they still suffer from various limitations such as slow convergence, mode-collapse on imbalanced data, and lack of diversity. These issues are partially linked to the use of light-tailed Brownian motion (BM) with independent increments. In this paper, we replace BM with an approximation of its non-Markovian counterpart, fractional Brownian motion (fBM), characterized by correlated increments and Hurst index $H \in (0,1)$, where $H=1/2$ recovers the classical BM. To ensure tractable inference and learning, we employ a recently popularized Markov approximation of fBM (MA-fBM) and derive its reverse time model, resulting in generative fractional diffusion models (GFDMs). We characterize the forward dynamics using a continuous reparameterization trick and propose an augmented score matching loss to efficiently learn the score-function, which is partly known in closed form, at minimal added cost. The ability to drive our diffusion model via fBM provides flexibility and control. $H \leq 1/2$ enters the regime of rough paths whereas $H>1/2$ regularizes diffusion paths and invokes long-term memory as well as a heavy-tailed behaviour (super-diffusion). The Markov approximation allows added control by varying the number of Markov processes linearly combined to approximate fBM. Our evaluations on real image datasets demonstrate that GFDM achieves greater pixel-wise diversity and enhanced image quality, as indicated by a lower FID, offering a promising alternative to traditional diffusion models.
♻ ☆ Why Transformers Need Adam: A Hessian Perspective
SGD performs worse than Adam by a significant margin on Transformers, but the reason remains unclear. In this work, we provide an explanation through the lens of Hessian: (i) Transformers are "heterogeneous": the Hessian spectrum across parameter blocks vary dramatically, a phenomenon we call "block heterogeneity"; (ii) Heterogeneity hampers SGD: SGD performs worse than Adam on problems with block heterogeneity. To validate (i) and (ii), we check various Transformers, CNNs, MLPs, and quadratic problems, and find that SGD can perform on par with Adam on problems without block heterogeneity, but performs worse than Adam when the heterogeneity exists. Our initial theoretical analysis indicates that SGD performs worse because it applies one single learning rate to all blocks, which cannot handle the heterogeneity among blocks. This limitation could be ameliorated if we use coordinate-wise learning rates, as designed in Adam.
♻ ☆ Deep Reinforcement Learning: A Convex Optimization Approach
In this paper, we consider reinforcement learning of nonlinear systems with continuous state and action spaces. We present an episodic learning algorithm, where we for each episode use convex optimization to find a two-layer neural network approximation of the optimal $Q$-function. The convex optimization approach guarantees that the weights calculated at each episode are optimal, with respect to the given sampled states and actions of the current episode. For stable nonlinear systems, we show that the algorithm converges and that the converging parameters of the trained neural network can be made arbitrarily close to the optimal neural network parameters. In particular, if the regularization parameter in the training phase is given by $\rho$, then the parameters of the trained neural network converge to $w$, where the distance between $w$ and the optimal parameters $w^\star$ is bounded by $\mathcal{O}(\rho)$. That is, when the number of episodes goes to infinity, there exists a constant $C$ such that \[ \|w-w^\star\| \le C\rho. \] In particular, our algorithm converges arbitrarily close to the optimal neural network parameters as the regularization parameter goes to zero. As a consequence, our algorithm converges fast due to the polynomial-time convergence of convex optimization algorithms.
♻ ☆ Pandora's White-Box: Precise Training Data Detection and Extraction in Large Language Models
In this paper we develop state-of-the-art privacy attacks against Large Language Models (LLMs), where an adversary with some access to the model tries to learn something about the underlying training data. Our headline results are new membership inference attacks (MIAs) against pretrained LLMs that perform hundreds of times better than baseline attacks, and a pipeline showing that over 50% (!) of the fine-tuning dataset can be extracted from a fine-tuned LLM in natural settings. We consider varying degrees of access to the underlying model, pretraining and fine-tuning data, and both MIAs and training data extraction. For pretraining data, we propose two new MIAs: a supervised neural network classifier that predicts training data membership on the basis of (dimensionality-reduced) model gradients, as well as a variant of this attack that only requires logit access to the model by leveraging recent model-stealing work on LLMs. To our knowledge this is the first MIA that explicitly incorporates model-stealing information. Both attacks outperform existing black-box baselines, and our supervised attack closes the gap between MIA attack success against LLMs and the strongest known attacks for other machine learning models. In fine-tuning, we find that a simple attack based on the ratio of the loss between the base and fine-tuned models is able to achieve near-perfect MIA performance; we then leverage our MIA to extract a large fraction of the fine-tuning dataset from fine-tuned Pythia and Llama models. Our code is available at github.com/safr-ai-lab/pandora-llm.
♻ ☆ An Experimental Study on the Rashomon Effect of Balancing Methods in Imbalanced Classification
Predictive models may generate biased predictions when classifying imbalanced datasets. This happens when the model favors the majority class, leading to low performance in accurately predicting the minority class. To address this issue, balancing or resampling methods are critical pre-processing steps in the modeling process. However, there have been debates and questioning of the functionality of these methods in recent years. In particular, many candidate models may exhibit very similar predictive performance, which is called the Rashomon effect, in model selection. Selecting one of them without considering predictive multiplicity which is the case of yielding conflicting models' predictions for any sample may lead to a loss of using another model. In this study, in addition to the existing debates, the impact of balancing methods on predictive multiplicity is examined through the Rashomon effect. It is important because the blind model selection is risky from a set of approximately equally accurate models. This may lead to serious problems in model selection, validation, and explanation. To tackle this matter, we conducted real dataset experiments to observe the impact of balancing methods on predictive multiplicity through the Rashomon effect. Our findings showed that balancing methods inflate the predictive multiplicity, and they yield varying results. To monitor the trade-off between performance and predictive multiplicity for conducting the modeling process responsibly, we proposed using the extended performance-gain plot for the Rashomon effect.
comment: 16 pages, 6 figures
♻ ☆ Positive concave deep equilibrium models
Deep equilibrium (DEQ) models are widely recognized as a memory efficient alternative to standard neural networks, achieving state-of-the-art performance in language modeling and computer vision tasks. These models solve a fixed point equation instead of explicitly computing the output, which sets them apart from standard neural networks. However, existing DEQ models often lack formal guarantees of the existence and uniqueness of the fixed point, and the convergence of the numerical scheme used for computing the fixed point is not formally established. As a result, DEQ models are potentially unstable in practice. To address these drawbacks, we introduce a novel class of DEQ models called positive concave deep equilibrium (pcDEQ) models. Our approach, which is based on nonlinear Perron-Frobenius theory, enforces nonnegative weights and activation functions that are concave on the positive orthant. By imposing these constraints, we can easily ensure the existence and uniqueness of the fixed point without relying on additional complex assumptions commonly found in the DEQ literature, such as those based on monotone operator theory in convex analysis. Furthermore, the fixed point can be computed with the standard fixed point algorithm, and we provide theoretical guarantees of its geometric convergence, which, in particular, simplifies the training process. Experiments demonstrate the competitiveness of our pcDEQ models against other implicit models.
♻ ☆ Feature learning as alignment: a structural property of gradient descent in non-linear neural networks
Understanding the mechanisms through which neural networks extract statistics from input-label pairs through feature learning is one of the most important unsolved problems in supervised learning. Prior works demonstrated that the gram matrices of the weights (the neural feature matrices, NFM) and the average gradient outer products (AGOP) become correlated during training, in a statement known as the neural feature ansatz (NFA). Through the NFA, the authors introduce mapping with the AGOP as a general mechanism for neural feature learning. However, these works do not provide a theoretical explanation for this correlation or its origins. In this work, we further clarify the nature of this correlation, and explain its emergence. We show that this correlation is equivalent to alignment between the left singular structure of the weight matrices and the newly defined pre-activation tangent features at each layer. We further establish that the alignment is driven by the interaction of weight changes induced by SGD with the pre-activation features, and analyze the resulting dynamics analytically at early times in terms of simple statistics of the inputs and labels. Finally, motivated by the observation that the NFA is driven by this centered correlation, we introduce a simple optimization rule that dramatically increases the NFA correlations at any given layer and improves the quality of features learned.
♻ ☆ EGTR: Extracting Graph from Transformer for Scene Graph Generation CVPR 2024
Scene Graph Generation (SGG) is a challenging task of detecting objects and predicting relationships between objects. After DETR was developed, one-stage SGG models based on a one-stage object detector have been actively studied. However, complex modeling is used to predict the relationship between objects, and the inherent relationship between object queries learned in the multi-head self-attention of the object detector has been neglected. We propose a lightweight one-stage SGG model that extracts the relation graph from the various relationships learned in the multi-head self-attention layers of the DETR decoder. By fully utilizing the self-attention by-products, the relation graph can be extracted effectively with a shallow relation extraction head. Considering the dependency of the relation extraction task on the object detection task, we propose a novel relation smoothing technique that adjusts the relation label adaptively according to the quality of the detected objects. By the relation smoothing, the model is trained according to the continuous curriculum that focuses on object detection task at the beginning of training and performs multi-task learning as the object detection performance gradually improves. Furthermore, we propose a connectivity prediction task that predicts whether a relation exists between object pairs as an auxiliary task of the relation extraction. We demonstrate the effectiveness and efficiency of our method for the Visual Genome and Open Image V6 datasets. Our code is publicly available at https://github.com/naver-ai/egtr.
comment: CVPR 2024 (Best paper award candidate)
♻ ☆ Attribute Diversity Determines the Systematicity Gap in VQA
The degree to which neural networks can generalize to new combinations of familiar concepts, and the conditions under which they are able to do so, has long been an open question. In this work, we study the systematicity gap in visual question answering: the performance difference between reasoning on previously seen and unseen combinations of object attributes. To test, we introduce a novel diagnostic dataset, CLEVR-HOPE. We find that while increased quantity of training data does not reduce the systematicity gap, increased training data diversity of the attributes in the unseen combination does. In all, our experiments suggest that the more distinct attribute type combinations are seen during training, the more systematic we can expect the resulting model to be.
comment: 33 pages, 20 figures
♻ ☆ A practical existence theorem for reduced order models based on convolutional autoencoders
In recent years, deep learning has gained increasing popularity in the fields of Partial Differential Equations (PDEs) and Reduced Order Modeling (ROM), providing domain practitioners with new powerful data-driven techniques such as Physics-Informed Neural Networks (PINNs), Neural Operators, Deep Operator Networks (DeepONets) and Deep-Learning based ROMs (DL-ROMs). In this context, deep autoencoders based on Convolutional Neural Networks (CNNs) have proven extremely effective, outperforming established techniques, such as the reduced basis method, when dealing with complex nonlinear problems. However, despite the empirical success of CNN-based autoencoders, there are only a few theoretical results supporting these architectures, usually stated in the form of universal approximation theorems. In particular, although the existing literature provides users with guidelines for designing convolutional autoencoders, the subsequent challenge of learning the latent features has been barely investigated. Furthermore, many practical questions remain unanswered, e.g., the number of snapshots needed for convergence or the neural network training strategy. In this work, using recent techniques from sparse high-dimensional function approximation, we fill some of these gaps by providing a new practical existence theorem for CNN-based autoencoders when the parameter-to-solution map is holomorphic. This regularity assumption arises in many relevant classes of parametric PDEs, such as the parametric diffusion equation, for which we discuss an explicit application of our general theory.
♻ ☆ Fusion of Movement and Naive Predictions for Point Forecasting in Univariate Random Walks
Traditional methods for point forecasting in univariate random walks often fail to surpass naive benchmarks due to data unpredictability. This study introduces a novel forecasting method that fuses movement prediction (binary classification) with naive forecasts for accurate one-step-ahead point forecasting. The method's efficacy is demonstrated through theoretical analysis, simulations, and real-world data experiments. It reliably exceeds naive forecasts with movement prediction accuracies as low as 0.55, outperforming baseline models like ARIMA, linear regression, MLP, and LSTM networks in forecasting the S\&P 500 index and Bitcoin prices. This method is particularly advantageous when accurate point predictions are challenging but accurate movement predictions are attainable, translating movement predictions into point forecasts in random walk contexts.
♻ ☆ LatentExplainer: Explaining Latent Representations in Deep Generative Models with Multi-modal Foundation Models
Deep generative models like VAEs and diffusion models have advanced various generation tasks by leveraging latent variables to learn data distributions and generate high-quality samples. Despite the field of explainable AI making strides in interpreting machine learning models, understanding latent variables in generative models remains challenging. This paper introduces LatentExplainer, a framework for automatically generating semantically meaningful explanations of latent variables in deep generative models. LatentExplainer tackles three main challenges: inferring the meaning of latent variables, aligning explanations with inductive biases, and handling varying degrees of explainability. By perturbing latent variables and interpreting changes in generated data, the framework provides a systematic approach to understanding and controlling the data generation process, enhancing the transparency and interpretability of deep generative models. We evaluate our proposed method on several real-world and synthetic datasets, and the results demonstrate superior performance in generating high-quality explanations of latent variables.
♻ ☆ FT-AED: Benchmark Dataset for Early Freeway Traffic Anomalous Event Detection
Early and accurate detection of anomalous events on the freeway, such as accidents, can improve emergency response and clearance. However, existing delays and errors in event identification and reporting make it a difficult problem to solve. Current large-scale freeway traffic datasets are not designed for anomaly detection and ignore these challenges. In this paper, we introduce the first large-scale lane-level freeway traffic dataset for anomaly detection. Our dataset consists of a month of weekday radar detection sensor data collected in 4 lanes along an 18-mile stretch of Interstate 24 heading toward Nashville, TN, comprising over 3.7 million sensor measurements. We also collect official crash reports from the Nashville Traffic Management Center and manually label all other potential anomalies in the dataset. To show the potential for our dataset to be used in future machine learning and traffic research, we benchmark numerous deep learning anomaly detection models on our dataset. We find that unsupervised graph neural network autoencoders are a promising solution for this problem and that ignoring spatial relationships leads to decreased performance. We demonstrate that our methods can reduce reporting delays by over 10 minutes on average while detecting 75% of crashes. Our dataset and all preprocessing code needed to get started are publicly released at https://vu.edu/ft-aed/ to facilitate future research.
♻ ☆ State Representation Learning Using an Unbalanced Atlas
The manifold hypothesis posits that high-dimensional data often lies on a lower-dimensional manifold and that utilizing this manifold as the target space yields more efficient representations. While numerous traditional manifold-based techniques exist for dimensionality reduction, their application in self-supervised learning has witnessed slow progress. The recent MSimCLR method combines manifold encoding with SimCLR but requires extremely low target encoding dimensions to outperform SimCLR, limiting its applicability. This paper introduces a novel learning paradigm using an unbalanced atlas (UA), capable of surpassing state-of-the-art self-supervised learning approaches. We investigated and engineered the DeepInfomax with an unbalanced atlas (DIM-UA) method by adapting the Spatiotemporal DeepInfomax (ST-DIM) framework to align with our proposed UA paradigm. The efficacy of DIM-UA is demonstrated through training and evaluation on the Atari Annotated RAM Interface (AtariARI) benchmark, a modified version of the Atari 2600 framework that produces annotated image samples for representation learning. The UA paradigm improves existing algorithms significantly as the number of target encoding dimensions grows. For instance, the mean F1 score averaged over categories of DIM-UA is ~75% compared to ~70% of ST-DIM when using 16384 hidden units.
♻ ☆ Incorporating temporal dynamics of mutations to enhance the prediction capability of antiretroviral therapy's outcome for HIV-1
Motivation: In predicting HIV therapy outcomes, a critical clinical question is whether using historical information can enhance predictive capabilities compared with current or latest available data analysis. This study analyses whether historical knowledge, which includes viral mutations detected in all genotypic tests before therapy, their temporal occurrence, and concomitant viral load measurements, can bring improvements. We introduce a method to weigh mutations, considering the previously enumerated factors and the reference mutation-drug Stanford resistance tables. We compare a model encompassing history (H) with one not using it (NH). Results: The H-model demonstrates superior discriminative ability, with a higher ROC-AUC score (76.34%) than the NH-model (74.98%). Significant Wilcoxon test results confirm that incorporating historical information improves consistently predictive accuracy for treatment outcomes. The better performance of the H-model might be attributed to its consideration of latent HIV reservoirs, probably obtained when leveraging historical information. The findings emphasize the importance of temporal dynamics in mutations, offering insights into HIV infection complexities. However, our result also shows that prediction accuracy remains relatively high even when no historical information is available. Supplementary information: Supplementary material is available.
comment: 16 pages, 6 figures
♻ ☆ Deep Reinforcement Learning with Swin Transformers
Transformers are neural network models that utilize multiple layers of self-attention heads and have exhibited enormous potential in natural language processing tasks. Meanwhile, there have been efforts to adapt transformers to visual tasks of machine learning, including Vision Transformers and Swin Transformers. Although some researchers use Vision Transformers for reinforcement learning tasks, their experiments remain at a small scale due to the high computational cost. This article presents the first online reinforcement learning scheme that is based on Swin Transformers: Swin DQN. In contrast to existing research, our novel approach demonstrate the superior performance with experiments on 49 games in the Arcade Learning Environment. The results show that our approach achieves significantly higher maximal evaluation scores than the baseline method in 45 of all the 49 games (92%), and higher mean evaluation scores than the baseline method in 40 of all the 49 games (82%).
♻ ☆ Watermark Stealing in Large Language Models ICML 2024
LLM watermarking has attracted attention as a promising way to detect AI-generated content, with some works suggesting that current schemes may already be fit for deployment. In this work we dispute this claim, identifying watermark stealing (WS) as a fundamental vulnerability of these schemes. We show that querying the API of the watermarked LLM to approximately reverse-engineer a watermark enables practical spoofing attacks, as hypothesized in prior work, but also greatly boosts scrubbing attacks, which was previously unnoticed. We are the first to propose an automated WS algorithm and use it in the first comprehensive study of spoofing and scrubbing in realistic settings. We show that for under $50 an attacker can both spoof and scrub state-of-the-art schemes previously considered safe, with average success rate of over 80%. Our findings challenge common beliefs about LLM watermarking, stressing the need for more robust schemes. We make all our code and additional examples available at https://watermark-stealing.org.
comment: ICML 2024
♻ ☆ Provable Adaptivity of Adam under Non-uniform Smoothness KDD 2024
Adam is widely adopted in practical applications due to its fast convergence. However, its theoretical analysis is still far from satisfactory. Existing convergence analyses for Adam rely on the bounded smoothness assumption, referred to as the \emph{L-smooth condition}. Unfortunately, this assumption does not hold for many deep learning tasks. Moreover, we believe that this assumption obscures the true benefit of Adam, as the algorithm can adapt its update magnitude according to local smoothness. This important feature of Adam becomes irrelevant when assuming globally bounded smoothness. This paper studies the convergence of randomly reshuffled Adam (RR Adam) with diminishing learning rate, which is the major version of Adam adopted in deep learning tasks. We present the first convergence analysis of RR Adam without the bounded smoothness assumption. We demonstrate that RR Adam can maintain its convergence properties when smoothness is linearly bounded by the gradient norm, referred to as the \emph{$(L_0, L_1)$-smooth condition. We further compare Adam to SGD when both methods use diminishing learning rate. We refine the existing lower bound of SGD and show that SGD can be slower than Adam. To our knowledge, this is the first time that Adam and SGD are rigorously compared in the same setting and the advantage of Adam is revealed.
comment: KDD 2024
♻ ☆ Towards Theoretical Understandings of Self-Consuming Generative Models ICML 2024
This paper tackles the emerging challenge of training generative models within a self-consuming loop, wherein successive generations of models are recursively trained on mixtures of real and synthetic data from previous generations. We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models, including parametric and non-parametric models. Specifically, we derive bounds on the total variation (TV) distance between the synthetic data distributions produced by future models and the original real data distribution under various mixed training scenarios for diffusion models with a one-hidden-layer neural network score function. Our analysis demonstrates that this distance can be effectively controlled under the condition that mixed training dataset sizes or proportions of real data are large enough. Interestingly, we further unveil a phase transition induced by expanding synthetic data amounts, proving theoretically that while the TV distance exhibits an initial ascent, it declines beyond a threshold point. Finally, we present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
comment: Accepted at ICML 2024
♻ ☆ Adaptively Perturbed Mirror Descent for Learning in Games ICML 2024
This paper proposes a payoff perturbation technique for the Mirror Descent (MD) algorithm in games where the gradient of the payoff functions is monotone in the strategy profile space, potentially containing additive noise. The optimistic family of learning algorithms, exemplified by optimistic MD, successfully achieves {\it last-iterate} convergence in scenarios devoid of noise, leading the dynamics to a Nash equilibrium. A recent re-emerging trend underscores the promise of the perturbation approach, where payoff functions are perturbed based on the distance from an anchoring, or {\it slingshot}, strategy. In response, we propose {\it Adaptively Perturbed MD} (APMD), which adjusts the magnitude of the perturbation by repeatedly updating the slingshot strategy at a predefined interval. This innovation empowers us to find a Nash equilibrium of the underlying game with guaranteed rates. Empirical demonstrations affirm that our algorithm exhibits significantly accelerated convergence.
comment: Accepted at ICML 2024
♻ ☆ Hyperbolic Random Forests
Hyperbolic space is becoming a popular choice for representing data due to the hierarchical structure - whether implicit or explicit - of many real-world datasets. Along with it comes a need for algorithms capable of solving fundamental tasks, such as classification, in hyperbolic space. Recently, multiple papers have investigated hyperbolic alternatives to hyperplane-based classifiers, such as logistic regression and SVMs. While effective, these approaches struggle with more complex hierarchical data. We, therefore, propose to generalize the well-known random forests to hyperbolic space. We do this by redefining the notion of a split using horospheres. Since finding the globally optimal split is computationally intractable, we find candidate horospheres through a large-margin classifier. To make hyperbolic random forests work on multi-class data and imbalanced experiments, we furthermore outline a new method for combining classes based on their lowest common ancestor and a class-balanced version of the large-margin loss. Experiments on standard and new benchmarks show that our approach outperforms both conventional random forest algorithms and recent hyperbolic classifiers.
comment: Accepted at TMLR. Code available at https://github.com/LarsDoorenbos/HoroRF
♻ ☆ CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training
Selecting high-quality data for pre-training is crucial in shaping the downstream task performance of language models. A major challenge lies in identifying this optimal subset, a problem generally considered intractable, thus necessitating scalable and effective heuristics. In this work, we propose a data selection method, CoLoR-Filter (Conditional Loss Reduction Filtering), which leverages an empirical Bayes-inspired approach to derive a simple and computationally efficient selection criterion based on the relative loss values of two auxiliary models. In addition to the modeling rationale, we evaluate CoLoR-Filter empirically on two language modeling tasks: (1) selecting data from C4 for domain adaptation to evaluation on Books and (2) selecting data from C4 for a suite of downstream multiple-choice question answering tasks. We demonstrate favorable scaling both as we subselect more aggressively and using small auxiliary models to select data for large target models. As one headline result, CoLoR-Filter data selected using a pair of 150m parameter auxiliary models can train a 1.2b parameter target model to match a 1.2b parameter model trained on 25b randomly selected tokens with 25x less data for Books and 11x less data for the downstream tasks. Code: https://github.com/davidbrandfonbrener/color-filter-olmo Filtered data: https://huggingface.co/datasets/davidbrandfonbrener/color-filtered-c4
♻ ☆ Validation of ML-UQ calibration statistics using simulated reference values: a sensitivity analysis
Some popular Machine Learning Uncertainty Quantification (ML-UQ) calibration statistics do not have predefined reference values and are mostly used in comparative studies. In consequence, calibration is almost never validated and the diagnostic is left to the appreciation of the reader. Simulated reference values, based on synthetic calibrated datasets derived from actual uncertainties, have been proposed to palliate this problem. As the generative probability distribution for the simulation of synthetic errors is often not constrained, the sensitivity of simulated reference values to the choice of generative distribution might be problematic, shedding a doubt on the calibration diagnostic. This study explores various facets of this problem, and shows that some statistics are excessively sensitive to the choice of generative distribution to be used for validation when the generative distribution is unknown. This is the case, for instance, of the correlation coefficient between absolute errors and uncertainties (CC) and of the expected normalized calibration error (ENCE). A robust validation workflow to deal with simulated reference values is proposed.
♻ ☆ Detach-ROCKET: Sequential feature selection for time series classification with random convolutional kernels
Time Series Classification (TSC) is essential in fields like medicine, environmental science, and finance, enabling tasks such as disease diagnosis, anomaly detection, and stock price analysis. While machine learning models like Recurrent Neural Networks and InceptionTime are successful in numerous applications, they can face scalability issues due to computational requirements. Recently, ROCKET has emerged as an efficient alternative, achieving state-of-the-art performance and simplifying training by utilizing a large number of randomly generated features from the time series data. However, many of these features are redundant or non-informative, increasing computational load and compromising generalization. Here we introduce Sequential Feature Detachment (SFD) to identify and prune non-essential features in ROCKET-based models, such as ROCKET, MiniRocket, and MultiRocket. SFD estimates feature importance using model coefficients and can handle large feature sets without complex hyperparameter tuning. Testing on the UCR archive shows that SFD can produce models with better test accuracy using only 10\% of the original features. We named these pruned models Detach-ROCKET. We also present an end-to-end procedure for determining an optimal balance between the number of features and model accuracy. On the largest binary UCR dataset, Detach-ROCKET improves test accuracy by 0.6\% while reducing features by 98.9\%. By enabling a significant reduction in model size without sacrificing accuracy, our methodology improves computational efficiency and contributes to model interpretability. We believe that Detach-ROCKET will be a valuable tool for researchers and practitioners working with time series data, who can find a user-friendly implementation of the model at \url{https://github.com/gon-uri/detach_rocket}.
comment: 18 pages, 5 figures, 3 tables
♻ ☆ Make Large Language Model a Better Ranker
Large Language Models (LLMs) demonstrate robust capabilities across various fields, leading to a paradigm shift in LLM-enhanced Recommender System (RS). Research to date focuses on point-wise and pair-wise recommendation paradigms, which are inefficient for LLM-based recommenders due to high computational costs. However, existing list-wise approaches also fall short in ranking tasks due to misalignment between ranking objectives and next-token prediction. Moreover, these LLM-based methods struggle to effectively address the order relation among candidates, particularly given the scale of ratings. To address these challenges, this paper introduces the large language model framework with Aligned Listwise Ranking Objectives (ALRO). ALRO is designed to bridge the gap between the capabilities of LLMs and the nuanced requirements of ranking tasks. Specifically, ALRO employs explicit feedback in a listwise manner by introducing soft lambda loss, a customized adaptation of lambda loss designed for optimizing order relations. This mechanism provides more accurate optimization goals, enhancing the ranking process. Additionally, ALRO incorporates a permutation-sensitive learning mechanism that addresses position bias, a prevalent issue in generative models, without imposing additional computational burdens during inference. Our evaluative studies reveal that ALRO outperforms both existing embedding-based recommendation methods and LLM-based recommendation baselines.
comment: 12 pages, 5 figures
♻ ☆ Learning Action-based Representations Using Invariance
Robust reinforcement learning agents using high-dimensional observations must be able to identify relevant state features amidst many exogeneous distractors. A representation that captures controllability identifies these state elements by determining what affects agent control. While methods such as inverse dynamics and mutual information capture controllability for a limited number of timesteps, capturing long-horizon elements remains a challenging problem. Myopic controllability can capture the moment right before an agent crashes into a wall, but not the control-relevance of the wall while the agent is still some distance away. To address this we introduce action-bisimulation encoding, a method inspired by the bisimulation invariance pseudometric, that extends single-step controllability with a recursive invariance constraint. By doing this, action-bisimulation learns a multi-step controllability metric that smoothly discounts distant state features that are relevant for control. We demonstrate that action-bisimulation pretraining on reward-free, uniformly random data improves sample efficiency in several environments, including a photorealistic 3D simulation domain, Habitat. Additionally, we provide theoretical analysis and qualitative results demonstrating the information captured by action-bisimulation.
comment: Published at the Reinforcement Learning Conference 2024
♻ ☆ Deep Prompt Multi-task Network for Abuse Language Detection ICPR
The detection of abusive language remains a long-standing challenge with the extensive use of social networks. The detection task of abusive language suffers from limited accuracy. We argue that the existing detection methods utilize the fine-tuning technique of the pre-trained language models (PLMs) to handle downstream tasks. Hence, these methods fail to stimulate the general knowledge of the PLMs. To address the problem, we propose a novel Deep Prompt Multi-task Network (DPMN) for abuse language detection. Specifically, DPMN first attempts to design two forms of deep prompt tuning and light prompt tuning for the PLMs. The effects of different prompt lengths, tuning strategies, and prompt initialization methods on detecting abusive language are studied. In addition, we propose a Task Head based on Bi-LSTM and FFN, which can be used as a short text classifier. Eventually, DPMN utilizes multi-task learning to improve detection metrics further. The multi-task network has the function of transferring effective knowledge. The proposed DPMN is evaluated against eight typical methods on three public datasets: OLID, SOLID, and AbuseAnalyzer. The experimental results show that our DPMN outperforms the state-of-the-art methods.
comment: Accepted by the International Conference on Pattern Recognition (ICPR) 2024
♻ ☆ Improved Dynamic Regret for Online Frank-Wolfe COLT2023
To deal with non-stationary online problems with complex constraints, we investigate the dynamic regret of online Frank-Wolfe (OFW), which is an efficient projection-free algorithm for online convex optimization. It is well-known that in the setting of offline optimization, the smoothness of functions and the strong convexity of functions accompanying specific properties of constraint sets can be utilized to achieve fast convergence rates for the Frank-Wolfe (FW) algorithm. However, for OFW, previous studies only establish a dynamic regret bound of $O(\sqrt{T}(V_T+\sqrt{D_T}+1))$ by utilizing the convexity of problems, where $T$ is the number of rounds, $V_T$ is the function variation, and $D_T$ is the gradient variation. In this paper, we derive improved dynamic regret bounds for OFW by extending the fast convergence rates of FW from offline optimization to online optimization. The key technique for this extension is to set the step size of OFW with a line search rule. In this way, we first show that the dynamic regret bound of OFW can be improved to $O(\sqrt{T(V_T+1)})$ for smooth functions. Second, we achieve a better dynamic regret bound of $O(T^{1/3}(V_T+1)^{2/3})$ when functions are smooth and strongly convex, and the constraint set is strongly convex. Finally, for smooth and strongly convex functions with minimizers in the interior of the constraint set, we demonstrate that the dynamic regret of OFW reduces to $O(V_T+1)$, and can be further strengthened to $O(\min\{P_T^\ast,S_T^\ast,V_T\}+1)$ by performing a constant number of FW iterations per round, where $P_T^\ast$ and $S_T^\ast$ denote the path length and squared path length of minimizers, respectively.
comment: v2 matches the camera-ready version for COLT2023 better
♻ ☆ Compact Proofs of Model Performance via Mechanistic Interpretability
In this work, we propose using mechanistic interpretability -- techniques for reverse engineering model weights into human-interpretable algorithms -- to derive and compactly prove formal guarantees on model performance. We prototype this approach by formally proving lower bounds on the accuracy of 151 small transformers trained on a Max-of-$K$ task. We create 102 different computer-assisted proof strategies and assess their length and tightness of bound on each of our models. Using quantitative metrics, we find that shorter proofs seem to require and provide more mechanistic understanding. Moreover, we find that more faithful mechanistic understanding leads to tighter performance bounds. We confirm these connections by qualitatively examining a subset of our proofs. Finally, we identify compounding structureless noise as a key challenge for using mechanistic interpretability to generate compact proofs on model performance.
♻ ☆ Expert with Clustering: Hierarchical Online Preference Learning Framework
Emerging mobility systems are increasingly capable of recommending options to mobility users, to guide them towards personalized yet sustainable system outcomes. Even more so than the typical recommendation system, it is crucial to minimize regret, because 1) the mobility options directly affect the lives of the users, and 2) the system sustainability relies on sufficient user participation. In this study, we consider accelerating user preference learning by exploiting a low-dimensional latent space that captures the mobility preferences of users. We introduce a hierarchical contextual bandit framework named Expert with Clustering (EWC), which integrates clustering techniques and prediction with expert advice. EWC efficiently utilizes hierarchical user information and incorporates a novel Loss-guided Distance metric. This metric is instrumental in generating more representative cluster centroids. In a recommendation scenario with $N$ users, $T$ rounds per user, and $K$ options, our algorithm achieves a regret bound of $O(N\sqrt{T\log K} + NT)$. This bound consists of two parts: the first term is the regret from the Hedge algorithm, and the second term depends on the average loss from clustering. To the best of the authors knowledge, this is the first work to analyze the regret of an integrated expert algorithm with k-Means clustering. This regret bound underscores the theoretical and experimental efficacy of EWC, particularly in scenarios that demand rapid learning and adaptation. Experimental results highlight that EWC can substantially reduce regret by 27.57% compared to the LinUCB baseline. Our work offers a data-efficient approach to capturing both individual and collective behaviors, making it highly applicable to contexts with hierarchical structures. We expect the algorithm to be applicable to other settings with layered nuances of user preferences and information.
♻ ☆ Causal Fair Machine Learning via Rank-Preserving Interventional Distributions
A decision can be defined as fair if equal individuals are treated equally and unequals unequally. Adopting this definition, the task of designing machine learning (ML) models that mitigate unfairness in automated decision-making systems must include causal thinking when introducing protected attributes: Following a recent proposal, we define individuals as being normatively equal if they are equal in a fictitious, normatively desired (FiND) world, where the protected attributes have no (direct or indirect) causal effect on the target. We propose rank-preserving interventional distributions to define a specific FiND world in which this holds and a warping method for estimation. Evaluation criteria for both the method and the resulting ML model are presented and validated through simulations. Experiments on empirical data showcase the practical application of our method and compare results with "fairadapt" (Ple\v{c}ko and Meinshausen, 2020), a different approach for mitigating unfairness by causally preprocessing data that uses quantile regression forests. With this, we show that our warping approach effectively identifies the most discriminated individuals and mitigates unfairness.
♻ ☆ CLUE: A Clinical Language Understanding Evaluation for LLMs
Large Language Models (LLMs) are expected to significantly contribute to patient care, diagnostics, and administrative processes. Emerging biomedical LLMs aim to address healthcare-specific challenges, including privacy demands and computational constraints. Assessing the models' suitability for this sensitive application area is of the utmost importance. However, evaluation has primarily been limited to non-clinical tasks, which do not reflect the complexity of practical clinical applications. To fill this gap, we present the Clinical Language Understanding Evaluation (CLUE), a benchmark tailored to evaluate LLMs on clinical tasks. CLUE includes six tasks to test the practical applicability of LLMs in complex healthcare settings. Our evaluation includes a total of $25$ LLMs. In contrast to previous evaluations, CLUE shows a decrease in performance for nine out of twelve biomedical models. Our benchmark represents a step towards a standardized approach to evaluating and developing LLMs in healthcare to align future model development with the real-world needs of clinical application. We open-source all evaluation scripts and datasets for future research at https://github.com/TIO-IKIM/CLUE.
♻ ☆ Enhancing Dropout-based Bayesian Neural Networks with Multi-Exit on FPGA
Reliable uncertainty estimation plays a crucial role in various safety-critical applications such as medical diagnosis and autonomous driving. In recent years, Bayesian neural networks (BayesNNs) have gained substantial research and industrial interests due to their capability to make accurate predictions with reliable uncertainty estimation. However, the algorithmic complexity and the resulting hardware performance of BayesNNs hinder their adoption in real-life applications. To bridge this gap, this paper proposes an algorithm and hardware co-design framework that can generate field-programmable gate array (FPGA)-based accelerators for efficient BayesNNs. At the algorithm level, we propose novel multi-exit dropout-based BayesNNs with reduced computational and memory overheads while achieving high accuracy and quality of uncertainty estimation. At the hardware level, this paper introduces a transformation framework that can generate FPGA-based accelerators for the proposed efficient multi-exit BayesNNs. Several optimization techniques such as the mix of spatial and temporal mappings are introduced to reduce resource consumption and improve the overall hardware performance. Comprehensive experiments demonstrate that our approach can achieve higher energy efficiency compared to CPU, GPU, and other state-of-the-art hardware implementations. To support the future development of this research, we have open-sourced our code at: https://github.com/os-hxfan/MCME_FPGA_Acc.git
comment: arXiv admin note: text overlap with arXiv:2308.06849
♻ ☆ AdaTreeFormer: Few Shot Domain Adaptation for Tree Counting from a Single High-Resolution Image
The process of estimating and counting tree density using only a single aerial or satellite image is a difficult task in the fields of photogrammetry and remote sensing. However, it plays a crucial role in the management of forests. The huge variety of trees in varied topography severely hinders tree counting models to perform well. The purpose of this paper is to propose a framework that is learnt from the source domain with sufficient labeled trees and is adapted to the target domain with only a limited number of labeled trees. Our method, termed as AdaTreeFormer, contains one shared encoder with a hierarchical feature extraction scheme to extract robust features from the source and target domains. It also consists of three subnets: two for extracting self-domain attention maps from source and target domains respectively and one for extracting cross-domain attention maps. For the latter, an attention-to-adapt mechanism is introduced to distill relevant information from different domains while generating tree density maps; a hierarchical cross-domain feature alignment scheme is proposed that progressively aligns the features from the source and target domains. We also adopt adversarial learning into the framework to further reduce the gap between source and target domains. Our AdaTreeFormer is evaluated on six designed domain adaptation tasks using three tree counting datasets, \ie Jiangsu, Yosemite, and London. Experimental results show that AdaTreeFormer significantly surpasses the state of the art, \eg in the cross domain from the Yosemite to Jiangsu dataset, it achieves a reduction of 15.9 points in terms of the absolute counting errors and an increase of 10.8\% in the accuracy of the detected trees' locations. The codes and datasets are available at \emph{\color{magenta}{https://github.com/HAAClassic/AdaTreeFormer}}.
♻ ☆ Knowledge Accumulation in Continually Learned Representations and the Issue of Feature Forgetting
Continual learning research has shown that neural networks suffer from catastrophic forgetting "at the output level", but it is debated whether this is also the case at the level of learned representations. Multiple recent studies ascribe representations a certain level of innate robustness against forgetting -- that they only forget minimally in comparison with forgetting at the output level. We revisit and expand upon the experiments that revealed this difference in forgetting and illustrate the coexistence of two phenomena that affect the quality of continually learned representations: knowledge accumulation and feature forgetting. Taking both aspects into account, we show that, even though forgetting in the representation (i.e. feature forgetting) can be small in absolute terms, when measuring relative to how much was learned during a task, forgetting in the representation tends to be just as catastrophic as forgetting at the output level. Next we show that this feature forgetting is problematic as it substantially slows down the incremental learning of good general representations (i.e. knowledge accumulation). Finally, we study how feature forgetting and knowledge accumulation are affected by different types of continual learning methods.
comment: TMLR 2024
♻ ☆ Local primordial non-Gaussianity from the large-scale clustering of photometric DESI luminous red galaxies
We use angular clustering of luminous red galaxies from the Dark Energy Spectroscopic Instrument (DESI) imaging surveys to constrain the local primordial non-Gaussianity parameter $\fnl$. Our sample comprises over 12 million targets, covering 14,000 square degrees of the sky, with redshifts in the range $0.2< z < 1.35$. We identify Galactic extinction, survey depth, and astronomical seeing as the primary sources of systematic error, and employ linear regression and artificial neural networks to alleviate non-cosmological excess clustering on large scales. Our methods are tested against simulations with and without $\fnl$ and systematics, showing superior performance of the neural network treatment. The neural network with a set of nine imaging property maps passes our systematic null test criteria, and is chosen as the fiducial treatment. Assuming the universality relation, we find $\fnl = 34^{+24(+50)}_{-44(-73)}$ at 68\%(95\%) confidence. We apply a series of robustness tests (e.g., cuts on imaging, declination, or scales used) that show consistency in the obtained constraints. We study how the regression method biases the measured angular power-spectrum and degrades the $\fnl$ constraining power. The use of the nine maps more than doubles the uncertainty compared to using only the three primary maps in the regression. Our results thus motivate the development of more efficient methods that avoid over-correction, protect large-scale clustering information, and preserve constraining power. Additionally, our results encourage further studies of $\fnl$ with DESI spectroscopic samples, where the inclusion of 3D clustering modes should help separate imaging systematics and lessen the degradation in the $\fnl$ uncertainty.
comment: 21 pages, 17 figures, 7 tables (Appendix excluded). Published in MNRAS
♻ ☆ Sigma-point Kalman Filter with Nonlinear Unknown Input Estimation via Optimization and Data-driven Approach for Dynamic Systems
Most works on joint state and unknown input (UI) estimation require the assumption that the UIs are linear; this is potentially restrictive as it does not hold in many intelligent autonomous systems. To overcome this restriction and circumvent the need to linearize the system, we propose a derivative-free Unknown Input Sigma-point Kalman Filter (SPKF-nUI) where the SPKF is interconnected with a general nonlinear UI estimator that can be implemented via nonlinear optimization and data-driven approaches. The nonlinear UI estimator uses the posterior state estimate which is less susceptible to state prediction error. In addition, we introduce a joint sigma-point transformation scheme to incorporate both the state and UI uncertainties in the estimation of SPKF-nUI. An in-depth stochastic stability analysis proves that the proposed SPKF-nUI yields exponentially converging estimation error bounds under reasonable assumptions. Finally, two case studies are carried out on a simulation-based rigid robot and a physical soft robot, i.e., robots made of soft materials with complex dynamics to validate effectiveness of the proposed filter on nonlinear dynamic systems. Our results demonstrate that the proposed SPKF-nUI achieves the lowest state and UI estimation errors when compared to the existing nonlinear state-UI filters.
♻ ☆ A Non-autoregressive Multi-Horizon Flight Trajectory Prediction Framework with Gray Code Representation AAAI
Flight Trajectory Prediction (FTP) is an essential task in Air Traffic Control (ATC), which can assist air traffic controllers in managing airspace more safely and efficiently. Existing approaches generally perform multi-horizon FTP tasks in an autoregressive manner, thereby suffering from error accumulation and low-efficiency problems. In this paper, a novel framework, called FlightBERT++, is proposed to i) forecast multi-horizon flight trajectories directly in a non-autoregressive way, and ii) improve the limitation of the binary encoding (BE) representation in the FlightBERT framework. Specifically, the proposed framework is implemented by a generalized encoder-decoder architecture, in which the encoder learns the temporal-spatial patterns from historical observations and the decoder predicts the flight status for the future horizons. Compared to conventional architecture, an innovative horizon-aware contexts generator is dedicatedly designed to consider the prior horizon information, which further enables non-autoregressive multi-horizon prediction. Additionally, the Gray code representation and the differential prediction paradigm are designed to cope with the high-bit misclassifications of the BE representation, which significantly reduces the outliers in the predictions. Moreover, a differential prompted decoder is proposed to enhance the capability of the differential predictions by leveraging the stationarity of the differential sequence. Extensive experiments are conducted to validate the proposed framework on a real-world flight trajectory dataset. The experimental results demonstrated that the proposed framework outperformed the competitive baselines in both FTP performance and computational efficiency.
comment: An extend version based on the AAAI version
♻ ☆ A Survey of Large Language Models for Graphs KDD'24
Graphs are an essential data structure utilized to represent relationships in real-world scenarios. Prior research has established that Graph Neural Networks (GNNs) deliver impressive outcomes in graph-centric tasks, such as link prediction and node classification. Despite these advancements, challenges like data sparsity and limited generalization capabilities continue to persist. Recently, Large Language Models (LLMs) have gained attention in natural language processing. They excel in language comprehension and summarization. Integrating LLMs with graph learning techniques has attracted interest as a way to enhance performance in graph learning tasks. In this survey, we conduct an in-depth review of the latest state-of-the-art LLMs applied in graph learning and introduce a novel taxonomy to categorize existing methods based on their framework design. We detail four unique designs: i) GNNs as Prefix, ii) LLMs as Prefix, iii) LLMs-Graphs Integration, and iv) LLMs-Only, highlighting key methodologies within each category. We explore the strengths and limitations of each framework, and emphasize potential avenues for future research, including overcoming current integration challenges between LLMs and graph learning techniques, and venturing into new application areas. This survey aims to serve as a valuable resource for researchers and practitioners eager to leverage large language models in graph learning, and to inspire continued progress in this dynamic field. We consistently maintain the related open-source materials at \url{https://github.com/HKUDS/Awesome-LLM4Graph-Papers}.
comment: Published as a KDD'24 survey paper
♻ ☆ An Embarrassingly Simple Approach to Enhance Transformer Performance in Genomic Selection for Crop Breeding IJCAI2024
Genomic selection (GS), as a critical crop breeding strategy, plays a key role in enhancing food production and addressing the global hunger crisis. The predominant approaches in GS currently revolve around employing statistical methods for prediction. However, statistical methods often come with two main limitations: strong statistical priors and linear assumptions. A recent trend is to capture the non-linear relationships between markers by deep learning. However, as crop datasets are commonly long sequences with limited samples, the robustness of deep learning models, especially Transformers, remains a challenge. In this work, to unleash the unexplored potential of attention mechanism for the task of interest, we propose a simple yet effective Transformer-based framework that enables end-to-end training of the whole sequence. Via experiments on rice3k and wheat3k datasets, we show that, with simple tricks such as k-mer tokenization and random masking, Transformer can achieve overall superior performance against seminal methods on GS tasks of interest.
comment: Accepted by IJCAI2024. Code is available at https://github.com/RenqiChen/Genomic-Selection
♻ ☆ Of Mice and Mates: Automated Classification and Modelling of Mouse Behaviour in Groups using a Single Model across Cages
Behavioural experiments often happen in specialised arenas, but this may confound the analysis. To address this issue, we provide tools to study mice in the home-cage environment, equipping biologists with the possibility to capture the temporal aspect of the individual's behaviour and model the interaction and interdependence between cage-mates with minimal human intervention. Our main contribution is the novel Group Behaviour Model (GBM) which summarises the joint behaviour of groups of mice across cages, using a permutation matrix to match the mouse identities in each cage to the model. In support of the above, we also (a) developed the Activity Labelling Module (ALM) to automatically classify mouse behaviour from video, and (b) released two datasets, ABODe for training behaviour classifiers and IMADGE for modelling behaviour.
comment: International Journal of Computer Vision (2024)
♻ ☆ Machine Learning Applications of Quantum Computing: A Review
At the intersection of quantum computing and machine learning, this review paper explores the transformative impact these technologies are having on the capabilities of data processing and analysis, far surpassing the bounds of traditional computational methods. Drawing upon an in-depth analysis of 32 seminal papers, this review delves into the interplay between quantum computing and machine learning, focusing on transcending the limitations of classical computing in advanced data processing and applications. This review emphasizes the potential of quantum-enhanced methods in enhancing cybersecurity, a critical sector that stands to benefit significantly from these advancements. The literature review, primarily leveraging Science Direct as an academic database, delves into the transformative effects of quantum technologies on machine learning, drawing insights from a diverse collection of studies and scholarly articles. While the focus is primarily on the growing significance of quantum computing in cybersecurity, the review also acknowledges the promising implications for other sectors as the field matures. Our systematic approach categorizes sources based on quantum machine learning algorithms, applications, challenges, and potential future developments, uncovering that quantum computing is increasingly being implemented in practical machine learning scenarios. The review highlights advancements in quantum-enhanced machine learning algorithms and their potential applications in sectors such as cybersecurity, emphasizing the need for industry-specific solutions while considering ethical and security concerns. By presenting an overview of the current state and projecting future directions, the paper sets a foundation for ongoing research and strategic advancement in quantum machine learning.
comment: Proceedings of the 23rd European Conference on Cyber Warfare and Security (ECCWS 2024)
♻ ☆ MANO: Exploiting Matrix Norm for Unsupervised Accuracy Estimation Under Distribution Shifts
Leveraging the models' outputs, specifically the logits, is a common approach to estimating the test accuracy of a pre-trained neural network on out-of-distribution (OOD) samples without requiring access to the corresponding ground truth labels. Despite their ease of implementation and computational efficiency, current logit-based methods are vulnerable to overconfidence issues, leading to prediction bias, especially under the natural shift. In this work, we first study the relationship between logits and generalization performance from the view of low-density separation assumption. Our findings motivate our proposed method MaNo which (1) applies a data-dependent normalization on the logits to reduce prediction bias, and (2) takes the $L_p$ norm of the matrix of normalized logits as the estimation score. Our theoretical analysis highlights the connection between the provided score and the model's uncertainty. We conduct an extensive empirical study on common unsupervised accuracy estimation benchmarks and demonstrate that MaNo achieves state-of-the-art performance across various architectures in the presence of synthetic, natural, or subpopulation shifts.
comment: The three first authors contributed equally
♻ ☆ ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming
When building Large Language Models (LLMs), it is paramount to bear safety in mind and protect them with guardrails. Indeed, LLMs should never generate content promoting or normalizing harmful, illegal, or unethical behavior that may contribute to harm to individuals or society. This principle applies to both normal and adversarial use. In response, we introduce ALERT, a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy. It is designed to evaluate the safety of LLMs through red teaming methodologies and consists of more than 45k instructions categorized using our novel taxonomy. By subjecting LLMs to adversarial testing scenarios, ALERT aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models. Furthermore, the fine-grained taxonomy enables researchers to perform an in-depth evaluation that also helps one to assess the alignment with various policies. In our experiments, we extensively evaluate 10 popular open- and closed-source LLMs and demonstrate that many of them still struggle to attain reasonable levels of safety.
comment: 17 pages, preprint
♻ ☆ Prompting with Divide-and-Conquer Program Makes Large Language Models Discerning to Hallucination and Deception
Foundation models, such as Large language Models (LLMs), have attracted significant amount of interest due to their large number of applications. However, when handling tasks involving repetitive sub-tasks and/or deceptive contents, such as arithmetic calculation and article-level fake news detection, simple instructional prompts suffer from inaccurate responses. Existing works show that more complicated prompting strategies, such as Chain-of-Thoughts and Least-to-Most, can unlock LLM's powerful capacity in diverse areas. Recent researches reveal that simple divide-and-conquer prompting strategy, i.e. simply dividing the input sequence to multiple sub-inputs, can also substantially improve LLM's performance in some specific tasks such as misinformation detection. In this paper, we aim at examining the utility of divide-and-conquer prompting strategy and answer on which kind of tasks this strategy gets advantages. Specifically, we provide a theoretic analysis to divide-and-conquer prompting strategy and help us identify the specific tasks where DaC prompting can bring performance boost with theoretic guarantee. We then present two cases (large integer arithmetic and fact verification) where experimental results aligns with our theoretic analysis.
comment: Preprint
♻ ☆ Node-like as a Whole: Structure-aware Searching and Coarsening for Graph Classification
Graph Transformers (GTs) have made remarkable achievements in graph-level tasks. However, most existing works regard graph structures as a form of guidance or bias for enhancing node representations, which focuses on node-central perspectives and lacks explicit representations of edges and structures. One natural question is, can we treat graph structures node-like as a whole to learn high-level features? Through experimental analysis, we explore the feasibility of this assumption. Based on our findings, we propose a novel multi-view graph representation learning model via structure-aware searching and coarsening (GRLsc) on GT architecture for graph classification. Specifically, we build three unique views, original, coarsening, and conversion, to learn a thorough structural representation. We compress loops and cliques via hierarchical heuristic graph coarsening and restrict them with well-designed constraints, which builds the coarsening view to learn high-level interactions between structures. We also introduce line graphs for edge embeddings and switch to edge-central perspective to construct the conversion view. Experiments on eight real-world datasets demonstrate the improvements of GRLsc over 28 baselines from various architectures.
comment: 22 pages
♻ ☆ Towards Bayesian Data Selection ICML 2024
A wide range of machine learning algorithms iteratively add data to the training sample. Examples include semi-supervised learning, active learning, multi-armed bandits, and Bayesian optimization. We embed this kind of data addition into decision theory by framing data selection as a decision problem. This paves the way for finding Bayes-optimal selections of data. For the illustrative case of self-training in semi-supervised learning, we derive the respective Bayes criterion. We further show that deploying this criterion mitigates the issue of confirmation bias by empirically assessing our method for generalized linear models, semi-parametric generalized additive models, and Bayesian neural networks on simulated and real-world data.
comment: 5th Workshop on Data-Centric Machine Learning Research (DMLR) at ICML 2024
♻ ☆ On-Device Soft Sensors: Real-Time Fluid Flow Estimation from Level Sensor Data
Soft sensors are crucial in bridging autonomous systems' physical and digital realms, enhancing sensor fusion and perception. Instead of deploying soft sensors on the Cloud, this study shift towards employing on-device soft sensors, promising heightened efficiency and bolstering data security. Our approach substantially improves energy efficiency by deploying Artificial Intelligence (AI) directly on devices within a wireless sensor network. Furthermore, the synergistic integration of the Microcontroller Unit and Field-Programmable Gate Array (FPGA) leverages the rapid AI inference capabilities of the latter. Empirical evidence from our real-world use case demonstrates that FPGA-based soft sensors achieve inference times ranging remarkably from 1.04 to 12.04 microseconds. These compelling results highlight the considerable potential of our innovative approach for executing real-time inference tasks efficiently, thereby presenting a feasible alternative that effectively addresses the latency challenges intrinsic to Cloud-based deployments.
comment: 8 pages, 6 figures, 1 Table, Accepted by the 1st AUTONOMOUS UBIQUITOUS SYSTEMS (AUTOQUITOUS) WORKSHOP of EAI MobiQuitous 2023 - 20th EAI International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services
♻ ☆ SWAP-NAS: Sample-Wise Activation Patterns for Ultra-fast NAS ICLR2024
Training-free metrics (a.k.a. zero-cost proxies) are widely used to avoid resource-intensive neural network training, especially in Neural Architecture Search (NAS). Recent studies show that existing training-free metrics have several limitations, such as limited correlation and poor generalisation across different search spaces and tasks. Hence, we propose Sample-Wise Activation Patterns and its derivative, SWAP-Score, a novel high-performance training-free metric. It measures the expressivity of networks over a batch of input samples. The SWAP-Score is strongly correlated with ground-truth performance across various search spaces and tasks, outperforming 15 existing training-free metrics on NAS-Bench-101/201/301 and TransNAS-Bench-101. The SWAP-Score can be further enhanced by regularisation, which leads to even higher correlations in cell-based search space and enables model size control during the search. For example, Spearman's rank correlation coefficient between regularised SWAP-Score and CIFAR-100 validation accuracies on NAS-Bench-201 networks is 0.90, significantly higher than 0.80 from the second-best metric, NWOT. When integrated with an evolutionary algorithm for NAS, our SWAP-NAS achieves competitive performance on CIFAR-10 and ImageNet in approximately 6 minutes and 9 minutes of GPU time respectively.
comment: ICLR2024 Spotlight
♻ ☆ EEGEncoder: Advancing BCI with Transformer-Based Motor Imagery Classification
Brain-computer interfaces (BCIs) harness electroencephalographic signals for direct neural control of devices, offering a significant benefit for individuals with motor impairments. Traditional machine learning methods for EEG-based motor imagery (MI) classification encounter challenges such as manual feature extraction and susceptibility to noise.This paper introduces EEGEncoder, a deep learning framework that employs modified transformers and TCNs to surmount these limitations. We innovatively propose a fusion architecture, namely Dual-Stream Temporal-Spatial Block (DSTS), to capture temporal and spatial features, improving the accuracy of Motor Imagery classification task. Additionally, we use multiple parallel structures to enhance the performance of the model. When tested on the BCI Competition IV-2a dataset, our model results outperform current state-of-the-art techniques.
♻ ☆ Comprehensive Reassessment of Large-Scale Evaluation Outcomes in LLMs: A Multifaceted Statistical Approach
Amidst the rapid evolution of LLMs, the significance of evaluation in comprehending and propelling these models forward is increasingly paramount. Evaluations have revealed that factors such as scaling, training types, architectures and other factors profoundly impact the performance of LLMs. However, the extent and nature of these impacts continue to be subjects of debate because most assessments have been restricted to a limited number of models and data points. Clarifying the effects of these factors on performance scores can be more effectively achieved through a statistical lens. Our study embarks on a thorough re-examination of these LLMs, targeting the inadequacies in current evaluation methods. With the advent of a uniform evaluation framework, our research leverages an expansive dataset of evaluation results, introducing a comprehensive statistical methodology. This includes the application of ANOVA, Tukey HSD tests, GAMM, and clustering technique, offering a robust and transparent approach to deciphering LLM performance data. Contrary to prevailing findings, our results challenge assumptions about emergent abilities and the influence of given training types and architectures in LLMs. These findings furnish new perspectives on the characteristics, intrinsic nature, and developmental trajectories of LLMs. By providing straightforward and reliable methods to scrutinize and reassess LLM performance data, this study contributes a nuanced perspective on LLM efficiency and potentials.
♻ ☆ Tempora-Fusion: Time-Lock Puzzle with Efficient Verifiable Homomorphic Linear Combination
To securely transmit sensitive information into the future, Time-Lock Puzzles (TLPs) have been developed. Their applications include scheduled payments, timed commitments, e-voting, and sealed-bid auctions. Homomorphic TLP is a key variant of TLP that enables computation on puzzles from different clients. This allows a solver/server to tackle only a single puzzle encoding the computation's result. However, existing homomorphic TLPs lack support for verifying the correctness of the computation results. We address this limitation by introducing Tempora-Fusion, a TLP that allows a server to perform homomorphic linear combinations of puzzles from different clients while ensuring verification of computation correctness. This scheme avoids asymmetric-key cryptography for verification, thus paving the way for efficient implementations. We discuss our scheme's application in various domains, such as federated learning, scheduled payments in online banking, and e-voting.
♻ ☆ CaLM: Contrasting Large and Small Language Models to Verify Grounded Generation ACL 2024
Grounded generation aims to equip language models (LMs) with the ability to produce more credible and accountable responses by accurately citing verifiable sources. However, existing methods, by either feeding LMs with raw or preprocessed materials, remain prone to errors. To address this, we introduce CaLM, a novel verification framework. CaLM leverages the insight that a robust grounded response should be consistent with information derived solely from its cited sources. Our framework empowers smaller LMs, which rely less on parametric memory and excel at processing relevant information given a query, to validate the output of larger LMs. Larger LM responses that closely align with the smaller LMs' output, which relies exclusively on cited documents, are verified. Responses showing discrepancies are iteratively refined through a feedback loop. Experiments on three open-domain question-answering datasets demonstrate significant performance gains of 1.5% to 7% absolute average without any required model fine-tuning.
comment: ACL 2024 Camera Ready Version
♻ ☆ Bad Habits: Policy Confounding and Out-of-Trajectory Generalization in RL
Reinforcement learning agents tend to develop habits that are effective only under specific policies. Following an initial exploration phase where agents try out different actions, they eventually converge onto a particular policy. As this occurs, the distribution over state-action trajectories becomes narrower, leading agents to repeatedly experience the same transitions. This repetitive exposure fosters spurious correlations between certain observations and rewards. Agents may then pick up on these correlations and develop simplistic habits tailored to the specific set of trajectories dictated by their policy. The problem is that these habits may yield incorrect outcomes when agents are forced to deviate from their typical trajectories, prompted by changes in the environment. This paper presents a mathematical characterization of this phenomenon, termed policy confounding, and illustrates, through a series of examples, the circumstances under which it occurs.
VCR: Visual Caption Restoration
We introduce Visual Caption Restoration (VCR), a novel vision-language task that challenges models to accurately restore partially obscured texts using pixel-level hints within images. This task stems from the observation that text embedded in images is intrinsically different from common visual elements and natural language due to the need to align the modalities of vision, text, and text embedded in images. While numerous works have integrated text embedded in images into visual question-answering tasks, approaches to these tasks generally rely on optical character recognition or masked language modeling, thus reducing the task to mainly text-based processing. However, text-based processing becomes ineffective in VCR as accurate text restoration depends on the combined information from provided images, context, and subtle cues from the tiny exposed areas of masked texts. We develop a pipeline to generate synthetic images for the VCR task using image-caption pairs, with adjustable caption visibility to control the task difficulty. With this pipeline, we construct a dataset for VCR called VCR-Wiki using images with captions from Wikipedia, comprising 2.11M English and 346K Chinese entities in both easy and hard split variants. Our results reveal that current vision language models significantly lag behind human performance in the VCR task, and merely fine-tuning the models on our dataset does not lead to notable improvements. We release VCR-Wiki and the data construction code to facilitate future research.
comment: 17 pages, 2 figures
♻ ☆ Continuous-time Autoencoders for Regular and Irregular Time Series Imputation WSDM'24
Time series imputation is one of the most fundamental tasks for time series. Real-world time series datasets are frequently incomplete (or irregular with missing observations), in which case imputation is strongly required. Many different time series imputation methods have been proposed. Recent self-attention-based methods show the state-of-the-art imputation performance. However, it has been overlooked for a long time to design an imputation method based on continuous-time recurrent neural networks (RNNs), i.e., neural controlled differential equations (NCDEs). To this end, we redesign time series (variational) autoencoders based on NCDEs. Our method, called continuous-time autoencoder (CTA), encodes an input time series sample into a continuous hidden path (rather than a hidden vector) and decodes it to reconstruct and impute the input. In our experiments with 4 datasets and 19 baselines, our method shows the best imputation performance in almost all cases.
comment: Published as a WSDM'24 full paper (oral presentation)
♻ ☆ ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation
Diffusion models are instrumental in text-to-audio (TTA) generation. Unfortunately, they suffer from slow inference due to an excessive number of queries to the underlying denoising network per generation. To address this bottleneck, we introduce ConsistencyTTA, a framework requiring only a single non-autoregressive network query, thereby accelerating TTA by hundreds of times. We achieve so by proposing "CFG-aware latent consistency model," which adapts consistency generation into a latent space and incorporates classifier-free guidance (CFG) into model training. Moreover, unlike diffusion models, ConsistencyTTA can be finetuned closed-loop with audio-space text-aware metrics, such as CLAP score, to further enhance the generations. Our objective and subjective evaluation on the AudioCaps dataset shows that compared to diffusion-based counterparts, ConsistencyTTA reduces inference computation by 400x while retaining generation quality and diversity.
♻ ☆ In-context Pretraining: Language Modeling Beyond Document Boundaries
Large language models (LMs) are currently trained to predict tokens given document prefixes, enabling them to directly perform long-form generation and prompting-style tasks which can be reduced to document completion. Existing pretraining pipelines train LMs by concatenating random sets of short documents to create input contexts but the prior documents provide no signal for predicting the next document. We instead present In-Context Pretraining, a new approach where language models are pretrained on a sequence of related documents, thereby explicitly encouraging them to read and reason across document boundaries. We can do In-Context Pretraining by simply changing the document ordering so that each context contains related documents, and directly applying existing pretraining pipelines. However, this document sorting problem is challenging. There are billions of documents and we would like the sort to maximize contextual similarity for every document without repeating any data. To do this, we introduce approximate algorithms for finding related documents with efficient nearest neighbor search and constructing coherent input contexts with a graph traversal algorithm. Our experiments show In-Context Pretraining offers a simple and scalable approach to significantly enhance LMs'performance: we see notable improvements in tasks that require more complex contextual reasoning, including in-context learning (+8%), reading comprehension (+15%), faithfulness to previous contexts (+16%), long-context reasoning (+5%), and retrieval augmentation (+9%).
♻ ☆ PromptKD: Distilling Student-Friendly Knowledge for Generative Language Models via Prompt Tuning
Recent advancements in large language models (LLMs) have raised concerns about inference costs, increasing the need for research into model compression. While knowledge distillation (KD) is a prominent method for this, research on KD for generative language models like LLMs is relatively sparse, and the approach of distilling student-friendly knowledge, which has shown promising performance in KD for classification models, remains unexplored in generative language models. To explore this approach, we propose PromptKD, a simple yet effective method that utilizes prompt tuning - for the first time in KD - to enable generative language models to transfer student-friendly knowledge. Unlike previous works in classification that require fine-tuning the entire teacher model for extracting student-friendly knowledge, PromptKD achieves similar effects by adding a small number of prompt tokens and tuning only the prompt with student guidance. Extensive experiments on instruction-following datasets show that PromptKD achieves state-of-the-art performance while adding only 0.0007% of the teacher's parameters as prompts. Further analysis suggests that distilling student-friendly knowledge alleviates exposure bias effectively throughout the entire training process, leading to performance enhancements.
comment: Code: https://github.com/gmkim-ai/PromptKD
♻ ☆ On Parameter Estimation in Deviated Gaussian Mixture of Experts AISTATS 2024
We consider the parameter estimation problem in the deviated Gaussian mixture of experts in which the data are generated from $(1 - \lambda^{\ast}) g_0(Y| X)+ \lambda^{\ast} \sum_{i = 1}^{k_{\ast}} p_{i}^{\ast} f(Y|(a_{i}^{\ast})^{\top}X+b_i^{\ast},\sigma_{i}^{\ast})$, where $X, Y$ are respectively a covariate vector and a response variable, $g_{0}(Y|X)$ is a known function, $\lambda^{\ast} \in [0, 1]$ is true but unknown mixing proportion, and $(p_{i}^{\ast}, a_{i}^{\ast}, b_{i}^{\ast}, \sigma_{i}^{\ast})$ for $1 \leq i \leq k^{\ast}$ are unknown parameters of the Gaussian mixture of experts. This problem arises from the goodness-of-fit test when we would like to test whether the data are generated from $g_{0}(Y|X)$ (null hypothesis) or they are generated from the whole mixture (alternative hypothesis). Based on the algebraic structure of the expert functions and the distinguishability between $g_0$ and the mixture part, we construct novel Voronoi-based loss functions to capture the convergence rates of maximum likelihood estimation (MLE) for our models. We further demonstrate that our proposed loss functions characterize the local convergence rates of parameter estimation more accurately than the generalized Wasserstein, a loss function being commonly used for estimating parameters in the Gaussian mixture of experts.
comment: Accepted to AISTATS 2024, 32 pages, 2 figures, 1 table
♻ ☆ Testing the Limits of Jailbreaking Defenses with the Purple Problem
The rise of "jailbreak" attacks on language models has led to a flurry of defenses aimed at preventing undesirable responses. We critically examine the two stages of the defense pipeline: (i) defining what constitutes unsafe outputs, and (ii) enforcing the definition via methods such as input processing or fine-tuning. To test the efficacy of existing enforcement mechanisms, we consider a simple and well-specified definition of unsafe outputs--outputs that contain the word "purple". Surprisingly, existing fine-tuning and input defenses fail on this simple problem, casting doubt on whether enforcement algorithms can be robust for more complicated definitions. We find that real safety benchmarks similarly test enforcement for a fixed definition. We hope that future research can lead to effective/fast enforcement as well as high quality definitions used for enforcement and evaluation.
♻ ☆ Hallmarks of Optimization Trajectories in Neural Networks: Directional Exploration and Redundancy
We propose a fresh take on understanding the mechanisms of neural networks by analyzing the rich directional structure of optimization trajectories, represented by their pointwise parameters. Towards this end, we introduce some natural notions of the complexity of optimization trajectories, both qualitative and quantitative, which hallmark the directional nature of optimization in neural networks: when is there redundancy, and when exploration. We use them to reveal the inherent nuance and interplay involved between various optimization choices, such as momentum and weight decay. Further, the trajectory perspective helps us see the effect of scale on regularizing the directional nature of trajectories, and as a by-product, we also observe an intriguing heterogeneity of Q,K,V dynamics in the middle attention layers in LLMs and which is homogenized by scale. Importantly, we put the significant directional redundancy observed to the test by demonstrating that training only scalar batchnorm parameters some while into training matches the performance of training the entire network, which thus exhibits the potential of hybrid optimization schemes that are geared towards efficiency.
comment: Preprint, 57 pages
♻ ☆ A General Theory for Softmax Gating Multinomial Logistic Mixture of Experts ICML 2024
Mixture-of-experts (MoE) model incorporates the power of multiple submodels via gating functions to achieve greater performance in numerous regression and classification applications. From a theoretical perspective, while there have been previous attempts to comprehend the behavior of that model under the regression settings through the convergence analysis of maximum likelihood estimation in the Gaussian MoE model, such analysis under the setting of a classification problem has remained missing in the literature. We close this gap by establishing the convergence rates of density estimation and parameter estimation in the softmax gating multinomial logistic MoE model. Notably, when part of the expert parameters vanish, these rates are shown to be slower than polynomial rates owing to an inherent interaction between the softmax gating and expert functions via partial differential equations. To address this issue, we propose using a novel class of modified softmax gating functions which transform the input before delivering them to the gating functions. As a result, the previous interaction disappears and the parameter estimation rates are significantly improved.
comment: Accepted to ICML 2024, 32 pages, 3 figures, 3 tables
♻ ☆ Achieving Dimension-Free Communication in Federated Learning via Zeroth-Order Optimization
Federated Learning (FL) offers a promising framework for collaborative and privacy-preserving machine learning across distributed data sources. However, the substantial communication costs associated with FL pose a significant challenge to its efficiency. Specifically, in each communication round, the communication costs scale linearly with the model's dimension, which presents a formidable obstacle, especially in large model scenarios. Despite various communication efficient strategies, the intrinsic dimension-dependent communication cost remains a major bottleneck for current FL implementations. In this paper, we introduce a novel dimension-free communication strategy for FL, leveraging zero-order optimization techniques. We propose a new algorithm, FedDisco, which facilitates the transmission of only a constant number of scalar values between clients and the server in each communication round, thereby reducing the communication cost from $\mathscr{O}(d)$ to $\mathscr{O}(1)$, where $d$ is the dimension of the model parameters. Theoretically, in non-convex functions, we prove that our algorithm achieves state-of-the-art rates, which show a linear speedup of the number of clients and local steps under standard assumptions and dimension-free rate for low effective rank scenarios. Empirical evaluations through classic deep learning training and large language model fine-tuning substantiate significant reductions in communication overhead compared to traditional FL approaches. Our code is available at https://github.com/ZidongLiu/FedDisco.
♻ ☆ LLM-Assisted Content Conditional Debiasing for Fair Text Embedding
Mitigating biases in machine learning models has become an increasing concern in Natural Language Processing (NLP), particularly in developing fair text embeddings, which are crucial yet challenging for real-world applications like search engines. In response, this paper proposes a novel method for learning fair text embeddings. First, we define a novel content-conditional equal distance (CCED) fairness for text embeddings, ensuring content-conditional independence between sensitive attributes and text embeddings. Building on CCED, we introduce a content-conditional debiasing (CCD) loss to ensure that embeddings of texts with different sensitive attributes but identical content maintain the same distance from the embedding of their corresponding neutral text. Additionally, we tackle the issue of insufficient training data by using Large Language Models (LLMs) with instructions to fairly augment texts into different sensitive groups. Our extensive evaluations show that our approach effectively enhances fairness while maintaining the utility of embeddings. Furthermore, our augmented dataset, combined with the CCED metric, serves as an new benchmark for evaluating fairness.
♻ ☆ Is Temperature Sample Efficient for Softmax Gaussian Mixture of Experts? ICML 2024
Dense-to-sparse gating mixture of experts (MoE) has recently become an effective alternative to a well-known sparse MoE. Rather than fixing the number of activated experts as in the latter model, which could limit the investigation of potential experts, the former model utilizes the temperature to control the softmax weight distribution and the sparsity of the MoE during training in order to stabilize the expert specialization. Nevertheless, while there are previous attempts to theoretically comprehend the sparse MoE, a comprehensive analysis of the dense-to-sparse gating MoE has remained elusive. Therefore, we aim to explore the impacts of the dense-to-sparse gate on the maximum likelihood estimation under the Gaussian MoE in this paper. We demonstrate that due to interactions between the temperature and other model parameters via some partial differential equations, the convergence rates of parameter estimations are slower than any polynomial rates, and could be as slow as $\mathcal{O}(1/\log(n))$, where $n$ denotes the sample size. To address this issue, we propose using a novel activation dense-to-sparse gate, which routes the output of a linear layer to an activation function before delivering them to the softmax function. By imposing linearly independence conditions on the activation function and its derivatives, we show that the parameter estimation rates are significantly improved to polynomial rates. Finally, we conduct a simulation study to empirically validate our theoretical results.
comment: Accepted to ICML 2024, 47 pages, 2 figures, 2 tables
♻ ☆ Multi-Fidelity Residual Neural Processes for Scalable Surrogate Modeling
Multi-fidelity surrogate modeling aims to learn an accurate surrogate at the highest fidelity level by combining data from multiple sources. Traditional methods relying on Gaussian processes can hardly scale to high-dimensional data. Deep learning approaches utilize neural network based encoders and decoders to improve scalability. These approaches share encoded representations across fidelities without including corresponding decoder parameters. This hinders inference performance, especially in out-of-distribution scenarios when the highest fidelity data has limited domain coverage. To address these limitations, we propose Multi-fidelity Residual Neural Processes (MFRNP), a novel multi-fidelity surrogate modeling framework. MFRNP explicitly models the residual between the aggregated output from lower fidelities and ground truth at the highest fidelity. The aggregation introduces decoders into the information sharing step and optimizes lower fidelity decoders to accurately capture both in-fidelity and cross-fidelity information. We show that MFRNP significantly outperforms state-of-the-art in learning partial differential equations and a real-world climate modeling task. Our code is published at: https://github.com/Rose-STL-Lab/MFRNP
comment: A novel probabilistic inference approach for scalable multi-fidelity surrogate modeling
♻ ☆ On Least Square Estimation in Softmax Gating Mixture of Experts ICML 2024
Mixture of experts (MoE) model is a statistical machine learning design that aggregates multiple expert networks using a softmax gating function in order to form a more intricate and expressive model. Despite being commonly used in several applications owing to their scalability, the mathematical and statistical properties of MoE models are complex and difficult to analyze. As a result, previous theoretical works have primarily focused on probabilistic MoE models by imposing the impractical assumption that the data are generated from a Gaussian MoE model. In this work, we investigate the performance of the least squares estimators (LSE) under a deterministic MoE model where the data are sampled according to a regression model, a setting that has remained largely unexplored. We establish a condition called strong identifiability to characterize the convergence behavior of various types of expert functions. We demonstrate that the rates for estimating strongly identifiable experts, namely the widely used feed-forward networks with activation functions $\mathrm{sigmoid}(\cdot)$ and $\tanh(\cdot)$, are substantially faster than those of polynomial experts, which we show to exhibit a surprising slow estimation rate. Our findings have important practical implications for expert selection.
comment: Accepted to ICML 2024, 29 pages, 2 figures, 2 tables
♻ ☆ Federated Learning for Estimating Heterogeneous Treatment Effects
Machine learning methods for estimating heterogeneous treatment effects (HTE) facilitate large-scale personalized decision-making across various domains such as healthcare, policy making, education, and more. Current machine learning approaches for HTE require access to substantial amounts of data per treatment, and the high costs associated with interventions makes centrally collecting so much data for each intervention a formidable challenge. To overcome this obstacle, in this work, we propose a novel framework for collaborative learning of HTE estimators across institutions via Federated Learning. We show that even under a diversity of interventions and subject populations across clients, one can jointly learn a common feature representation, while concurrently and privately learning the specific predictive functions for outcomes under distinct interventions across institutions. Our framework and the associated algorithm are based on this insight, and leverage tabular transformers to map multiple input data to feature representations which are then used for outcome prediction via multi-task learning. We also propose a novel way of federated training of personalised transformers that can work with heterogeneous input feature spaces. Experimental results on real-world clinical trial data demonstrate the effectiveness of our method.
♻ ☆ FightLadder: A Benchmark for Competitive Multi-Agent Reinforcement Learning ICML 2024
Recent advances in reinforcement learning (RL) heavily rely on a variety of well-designed benchmarks, which provide environmental platforms and consistent criteria to evaluate existing and novel algorithms. Specifically, in multi-agent RL (MARL), a plethora of benchmarks based on cooperative games have spurred the development of algorithms that improve the scalability of cooperative multi-agent systems. However, for the competitive setting, a lightweight and open-sourced benchmark with challenging gaming dynamics and visual inputs has not yet been established. In this work, we present FightLadder, a real-time fighting game platform, to empower competitive MARL research. Along with the platform, we provide implementations of state-of-the-art MARL algorithms for competitive games, as well as a set of evaluation metrics to characterize the performance and exploitability of agents. We demonstrate the feasibility of this platform by training a general agent that consistently defeats 12 built-in characters in single-player mode, and expose the difficulty of training a non-exploitable agent without human knowledge and demonstrations in two-player mode. FightLadder provides meticulously designed environments to address critical challenges in competitive MARL research, aiming to catalyze a new era of discovery and advancement in the field. Videos and code at https://sites.google.com/view/fightladder/home.
comment: ICML 2024
♻ ☆ ZeroG: Investigating Cross-dataset Zero-shot Transferability in Graphs KDD 2024
With the development of foundation models such as large language models, zero-shot transfer learning has become increasingly significant. This is highlighted by the generative capabilities of NLP models like GPT-4, and the retrieval-based approaches of CV models like CLIP, both of which effectively bridge the gap between seen and unseen data. In the realm of graph learning, the continuous emergence of new graphs and the challenges of human labeling also amplify the necessity for zero-shot transfer learning, driving the exploration of approaches that can generalize across diverse graph data without necessitating dataset-specific and label-specific fine-tuning. In this study, we extend such paradigms to zero-shot transferability in graphs by introducing ZeroG, a new framework tailored to enable cross-dataset generalization. Addressing the inherent challenges such as feature misalignment, mismatched label spaces, and negative transfer, we leverage a language model to encode both node attributes and class semantics, ensuring consistent feature dimensions across datasets. We also propose a prompt-based subgraph sampling module that enriches the semantic information and structure information of extracted subgraphs using prompting nodes and neighborhood aggregation, respectively. We further adopt a lightweight fine-tuning strategy that reduces the risk of overfitting and maintains the zero-shot learning efficacy of the language model. The results underscore the effectiveness of our model in achieving significant cross-dataset zero-shot transferability, opening pathways for the development of graph foundation models. Codes and data are available at https://github.com/NineAbyss/ZeroG.
comment: Accepted by SIGKDD 2024, research track
♻ ☆ Accurately Classifying Out-Of-Distribution Data in Facial Recognition
Standard classification theory assumes that the distribution of images in the test and training sets are identical. Unfortunately, real-life scenarios typically feature unseen data ("out-of-distribution data") which is different from data in the training distribution("in-distribution"). This issue is most prevalent in social justice problems where data from under-represented groups may appear in the test data without representing an equal proportion of the training data. This may result in a model returning confidently wrong decisions and predictions. We are interested in the following question: Can the performance of a neural network improve on facial images of out-of-distribution data when it is trained simultaneously on multiple datasets of in-distribution data? We approach this problem by incorporating the Outlier Exposure model and investigate how the model's performance changes when other datasets of facial images were implemented. We observe that the accuracy and other metrics of the model can be increased by applying Outlier Exposure, incorporating a trainable weight parameter to increase the machine's emphasis on outlier images, and by re-weighting the importance of different class labels. We also experimented with whether sorting the images and determining outliers via image features would have more of an effect on the metrics than sorting by average pixel value. Our goal was to make models not only more accurate but also more fair by scanning a more expanded range of images. We also tested the datasets in reverse order to see whether a more fair dataset with balanced features has an effect on the model's accuracy.
comment: 18 pages, 6 tables, 6 figures
Systems and Control 26
☆ Fuel-Optimal Formation Reconfiguration by Means of Unidirectional Low-Thrust Propulsion System
The use of electric low-thrust propulsion systems for orbit maneuvers is becoming a popular choice among satellite manufacturers due to their inherent merits over their chemical counterparts. Many designers choose to incorporate multiple of such thrusters to insure omnidirectional orbit maneuverability, while others choose to equip their satellite with only one thruster nozzle, aiming to reduce the required power, weight, and size of the orbit control system. This paper proposes guidance and control schemes to address the problem of autonomous optimal relative orbit reconfiguration for a formation of two satellites, one of which utilizes a single low-thrust throttleable nozzle. Such under-actuated orbit control system requires the controlled spacecraft to constantly slew to direct the nozzle to the desired thrust direction. These redirection attitude maneuvers are treated within the guidance layer by accommodating recurrent no-thrust periods during which slew maneuvers take place. The control loop is then closed with an MPC-like scheme. The main motivation of this article is to support the future missions of LuxSpace's flagship satellite, Triton-X. Since the proposed guidance and control schemes are meant to answer realistic market needs, they are designed to have some specific qualities that makes them attractive from the practical point of view. Namely, they require minimal computational loads, besides being able to accommodate operational time constraints, e.g. no thrusting during eclipse or during ground contact, within the guidance layer.
☆ Koopman Operator-based Detection-Isolation of Cyberattack: A Case Study on Electric Vehicle Charging
One of the key challenges towards the reliable operation of cyber-physical systems (CPS) is the threat of cyberattacks on system actuation signals and measurements. In recent years, system theoretic research has focused on effectively detecting and isolating these cyberattacks to ensure proper restorative measures. Although both model-based and model-free approaches have been used in this context, the latter are increasingly becoming more popular as complexities and model uncertainties in CPS increases. Thus, in this paper we propose a Koopman operator-based model-free cyberattack detection-isolation scheme for CPS. The algorithm uses limited system measurements for its training and generates real-time detection-isolation flags. Furthermore, we present a simulation case study to detect and isolate actuation and sensor attacks in a Lithium-ion battery system of a plug-in electric vehicle during charging.
comment: 10 pages, 4 figures, to be published in 2024 American Control Conference
☆ Backstepping control for the sterile mosquitoes technique: stabilization of extinction equilibrium
The control of a mosquito population using the sterile insect technique is considered. Building on a model-based approach, where the control input is the release rate of sterilized males, we propose a non-negative backstepping control law capable of globally stabilizing the extinction equilibrium of the system. A simulation study supports and validates the theoretical findings, showing the efficacy of the approach both on a reduced model, used for control design, and on a complete model of the mosquito population dynamics.
☆ Generalized Modal Analysis in Power System with High CIG Penetration: Concept and Quantitative Assessment
This paper presents a Generalized Modal Analysis (GMA) concept for the small-signal stability analysis of power systems with high penetration of Converter-Interfaced Generation (CIG). GMA quantitatively assesses interactions between various elements in the power system, offering intuitive and transparent physical interpretations. The method's versatility in selecting physical quantities at different input and output ports makes it broadly applicable. Based on the concept of GMA, the study further defines interaction quantification indices by selecting voltage ports, examining the impact of grid disturbances on power sources and the support from the power sources to the grid at connection points. Numerical simulations on modified 14-bus and 68-bus systems validate GMA's effectiveness in capturing the coupling of the dynamic characteristics between grid elements. This research provides a theoretical foundation and analytical framework for future analyses of power system stability with diverse power sources.
comment: submitted to IEEE Transactions on Power Systems for peer-review
☆ Tuning a Cascaded Online Feedback Optimization Controller for Provision of Distributed Flexibility
Coordinating a high number of flexibility providing units (e.g. to provide ancillary services for the transmission system) across various grid layers requires new control concepts. A flexibility request at a point of common coupling can be met by utilizing a cascaded control structure based on online feedback optimization. In this paper the influence of the parameterization of the individual controllers on the performance of the hierarchical flexibility provision is studied on a three-level test system. The results show a high interdependency between the choice of control parameters of one controller and the behavior of other controllers as well as a significant impact on the accuracy and speed of flexibility provision. With a careful tuning, a cascaded structure based on online feedback optimization can achieve efficient vertical coordination of flexibility providing units.
☆ Switching Controller Synthesis for Hybrid Systems Against STL Formulas
Switching controllers play a pivotal role in directing hybrid systems (HSs) towards the desired objective, embodying a ``correct-by-construction'' approach to HS design. Identifying these objectives is thus crucial for the synthesis of effective switching controllers. While most of existing works focus on safety and liveness, few of them consider timing constraints. In this paper, we delves into the synthesis of switching controllers for HSs that meet system objectives given by a fragment of STL, which essentially corresponds to a reach-avoid problem with timing constraints. Our approach involves iteratively computing the state sets that can be driven to satisfy the reach-avoid specification with timing constraints. This technique supports to create switching controllers for both constant and non-constant HSs. We validate our method's soundness, and confirm its relative completeness for a certain subclass of HSs. Experiment results affirms the efficacy of our approach.
☆ Differentiable Distributionally Robust Optimization Layers
In recent years, there has been a growing research interest in decision-focused learning, which embeds optimization problems as a layer in learning pipelines and demonstrates a superior performance than the prediction-focused approach. However, for distributionally robust optimization (DRO), a popular paradigm for decision-making under uncertainty, it is still unknown how to embed it as a layer, i.e., how to differentiate decisions with respect to an ambiguity set. In this paper, we develop such differentiable DRO layers for generic mixed-integer DRO problems with parameterized second-order conic ambiguity sets and discuss its extension to Wasserstein ambiguity sets. To differentiate the mixed-integer decisions, we propose a novel dual-view methodology by handling continuous and discrete parts of decisions via different principles. Specifically, we construct a differentiable energy-based surrogate to implement the dual-view methodology and use importance sampling to estimate its gradient. We further prove that such a surrogate enjoys the asymptotic convergency under regularization. As an application of the proposed differentiable DRO layers, we develop a novel decision-focused learning pipeline for contextual distributionally robust decision-making tasks and compare it with the prediction-focused approach in experiments.
comment: In Forty-first International Conference on Machine Learning (2024)
☆ Recent advancements on MPC for tracking: periodic and harmonic formulations
The main benefit of model predictive control (MPC) is its ability to steer the system to a given reference without violating the constraints while minimizing some objective. Furthermore, a suitably designed MPC controller guarantees asymptotic stability of the closed-loop system to the given reference as long as its optimization problem is feasible at the initial state of the system. Therefore, one of the limitations of classical MPC is that changing the reference may lead to an unfeasible MPC problem. Furthermore, due to a lack of deep knowledge of the system, it is possible for the user to provide a desired reference that is unfeasible or non-attainable for the MPC controller, leading to the same problem. This chapter summarizes MPC formulations recently proposed that have been designed to address these issues. In particular, thanks to the addition of an artificial reference as decision variable, the formulations achieve asymptotic stability and recursive feasibility guarantees regardless of the reference provided by the user, even if it is changed online or if it violates the system constraints. We show a recent formulation which extends this idea, achieving better performance and larger domains of attraction when working with small prediction horizons. Additional benefits of these formulations, when compared to classical MPC, are also discussed and highlighted with illustrative examples.
comment: (21 pages, 8 figures)
☆ Fault Detection for agents on power grid topology optimization: A Comprehensive analysis
The topology optimization of transmission networks using Deep Reinforcement Learning (DRL) has increasingly come into focus. Various researchers have proposed different DRL agents, which are often benchmarked on the Grid2Op environment from the Learning to Run a Power Network (L2RPN) challenges. The environments have many advantages with their realistic chronics and underlying power flow backends. However, the interpretation of agent survival or failure is not always clear, as there are a variety of potential causes. In this work, we focus on the failures of the power grid to identify patterns and detect them a priori. We collect the failed chronics of three different agents on the WCCI 2022 L2RPN environment, totaling about 40k data points. By clustering, we are able to detect five distinct clusters, identifying different failure types. Further, we propose a multi-class prediction approach to detect failures beforehand and evaluate five different models. Here, the Light Gradient-Boosting Machine (LightGBM) shows the best performance, with an accuracy of 86%. It also correctly identifies in 91% of the time failure and survival observations. Finally, we provide a detailed feature importance analysis that identifies critical features and regions in the grid.
comment: 11 Pages plus references and appendix. The appendix consist of additional material of the paper and is not included in the initial submission
☆ Testing Topological Data Analysis for Condition Monitoring of Wind Turbines
We present an investigation of how topological data analysis (TDA) can be applied to condition-based monitoring (CBM) of wind turbines for energy generation. TDA is a branch of data analysis focusing on extracting meaningful information from complex datasets by analyzing their structure in state space and computing their underlying topological features. By representing data in a high-dimensional state space, TDA enables the identification of patterns, anomalies, and trends in the data that may not be apparent through traditional signal processing methods. For this study, wind turbine data was acquired from a wind park in Norway via standard vibration sensors at different locations of the turbine's gearbox. Both the vibration acceleration data and its frequency spectra were recorded at infrequent intervals for a few seconds at high frequency and failure events were labelled as either gear-tooth or ball-bearing failures. The data processing and analysis are based on a pipeline where the time series data is first split into intervals and then transformed into multi-dimensional point clouds via a time-delay embedding. The shape of the point cloud is analyzed with topological methods such as persistent homology to generate topology-based key health indicators based on Betti numbers, information entropy and signal persistence. Such indicators are tested for CBM and diagnosis (fault detection) to identify faults in wind turbines and classify them accordingly. Topological indicators are shown to be an interesting alternative for failure identification and diagnosis of operational failures in wind turbines.
comment: To be published in the proceedings of the Annual Conference of the Prognostics and Health Management Society Style 2024
☆ Compact Model Parameter Extraction via Derivative-Free Optimization
In this paper, we address the problem of compact model parameter extraction to simultaneously extract tens of parameters via derivative-free optimization. Traditionally, parameter extraction is performed manually by dividing the complete set of parameters into smaller subsets, each targeting different operational regions of the device, a process that can take several days or even weeks. Our approach streamlines this process by employing derivative-free optimization to identify a good parameter set that best fits the compact model without performing an exhaustive number of simulations. We further enhance the optimization process to address critical issues in device modeling by carefully choosing a loss function that evaluates model performance consistently across varying magnitudes by focusing on relative errors (as opposed to absolute errors), prioritizing accuracy in key operational regions of the device above a certain threshold, and reducing sensitivity to outliers. Furthermore, we utilize the concept of train-test split to assess the model fit and avoid overfitting. This is done by fitting 80% of the data and testing the model efficacy with the remaining 20%. We demonstrate the effectiveness of our methodology by successfully modeling two semiconductor devices: a diamond Schottky diode and a GaN-on-SiC HEMT, with the latter involving the ASM-HEMT DC model, which requires simultaneously extracting 35 model parameters to fit the model to the measured data. These examples demonstrate the effectiveness of our approach and showcase the practical benefits of derivative-free optimization in device modeling.
☆ Discrete-time Integral Resonant Control of Negative Imaginary Systems: Application to a High-speed Nanopositioner
We propose a discrete-time integral resonant control (IRC) approach for negative imaginary (NI) systems, which overcomes several limitations of continuous-time IRC. We show that a discrete-time IRC has a step-advanced negative imaginary property. A zero-order hold-sampled NI system can be asymptotically stabilized using a discrete-time IRC with suitable parameters. A hardware experiment is conducted where a high-speed flexure-guided nanopositioner is efficiently damped using the proposed discrete-time IRC with the discrete-time controller being implemented in FPGA hardware at the sampling rate of 1.25 MHz.
comment: 10 pages, 10 figures
☆ Perturbed Decision-Focused Learning for Modeling Strategic Energy Storage
This paper presents a novel decision-focused framework integrating the physical energy storage model into machine learning pipelines. Motivated by the model predictive control for energy storage, our end-to-end method incorporates the prior knowledge of the storage model and infers the hidden reward that incentivizes energy storage decisions. This is achieved through a dual-layer framework, combining a prediction layer with an optimization layer. We introduce the perturbation idea into the designed decision-focused loss function to ensure the differentiability over linear storage models, supported by a theoretical analysis of the perturbed loss function. We also develop a hybrid loss function for effective model training. We provide two challenging applications for our proposed framework: energy storage arbitrage, and energy storage behavior prediction. The numerical experiments on real price data demonstrate that our arbitrage approach achieves the highest profit against existing methods. The numerical experiments on synthetic and real-world energy storage data show that our approach achieves the best behavior prediction performance against existing benchmark methods, which shows the effectiveness of our method.
☆ Tolerance of Reinforcement Learning Controllers against Deviations in Cyber Physical Systems
Cyber-physical systems (CPS) with reinforcement learning (RL)-based controllers are increasingly being deployed in complex physical environments such as autonomous vehicles, the Internet-of-Things(IoT), and smart cities. An important property of a CPS is tolerance; i.e., its ability to function safely under possible disturbances and uncertainties in the actual operation. In this paper, we introduce a new, expressive notion of tolerance that describes how well a controller is capable of satisfying a desired system requirement, specified using Signal Temporal Logic (STL), under possible deviations in the system. Based on this definition, we propose a novel analysis problem, called the tolerance falsification problem, which involves finding small deviations that result in a violation of the given requirement. We present a novel, two-layer simulation-based analysis framework and a novel search heuristic for finding small tolerance violations. To evaluate our approach, we construct a set of benchmark problems where system parameters can be configured to represent different types of uncertainties and disturbancesin the system. Our evaluation shows that our falsification approach and heuristic can effectively find small tolerance violations.
comment: arXiv admin note: text overlap with arXiv:2311.07462
♻ ☆ An Approach to Evaluate Modeling Adequacy for Small-Signal Stability Analysis of IBR-related SSOs in Multimachine Systems
Time-varying phasor-based analysis of subsynchronous oscillations (SSOs) involving grid-following converters (GFLCs) and its benchmarking with electromagnetic transient (EMT) models have so far been restricted to highly simplified grid models with constant voltage sources behind series R-L circuits. In this paper, modeling adequacy of bulk power systems with synchronous generators (SGs), transmission systems, loads, and GFLCs are considered. To this end, we revisit the notions of time-varying phasor calculus, highlighting the distinction between space-phasor-calculus (SPC) and two often interchangeably used frameworks namely baseband-abc and generalized averaging. We present the models of grids in SPC framework that include transmission line dynamics, load dynamics, and SG stator transients. Next, we propose a generic approach to study modeling adequacy in small-signal sense by (a) identifying critical modes through eigenvalue and singular value analysis followed by (b) using weighted maximum singular value error magnitudes as metrics, and (c) further cross-validation. Using a modified 4-machine IEEE benchmark model with up to 3 GFLCs we show that SPC framework can be used for analysis of SSOs. Further, we consider the quasistationary phasor calculus (QPC) framework that neglects transmission line, load, and SG stator dynamics to show its adequacy in SSO modeling and analysis. Time-domain and frequency-domain results with EMT models are also presented.
comment: 10 pages, 21 figures
♻ ☆ Feedback linearization through the lens of data
Controlling nonlinear systems, especially when data are being used to offset uncertainties in the model, is hard. A natural approach when dealing with the challenges of nonlinear control is to reduce the system to a linear one via change of coordinates and feedback, an approach commonly known as feedback linearization. Here we consider the feedback linearization problem of an unknown system when the solution must be found using experimental data. We propose a new method that learns the change of coordinates and the linearizing controller from a library (a dictionary) of candidate functions with a simple algebraic procedure - the computation of the null space of a data-dependent matrix. Remarkably, we show that the solution is valid over the entire state space of interest and not just on the dataset used to determine the solution.
comment: This is the extended version of a paper that appeared in abridged form in the Proceedings of the 62nd IEEE Conference on Decision and Control, 13-15 December 2023, Singapore
♻ ☆ Who Plays First? Optimizing the Order of Play in Stackelberg Games with Many Robots
We consider the multi-agent spatial navigation problem of computing the socially optimal order of play, i.e., the sequence in which the agents commit to their decisions, and its associated equilibrium in an N-player Stackelberg trajectory game. We model this problem as a mixed-integer optimization problem over the space of all possible Stackelberg games associated with the order of play's permutations. To solve the problem, we introduce Branch and Play (B&P), an efficient and exact algorithm that provably converges to a socially optimal order of play and its Stackelberg equilibrium. As a subroutine for B&P, we employ and extend sequential trajectory planning, i.e., a popular multi-agent control approach, to scalably compute valid local Stackelberg equilibria for any given order of play. We demonstrate the practical utility of B&P to coordinate air traffic control, swarm formation, and delivery vehicle fleets. We find that B&P consistently outperforms various baselines, and computes the socially optimal equilibrium.
comment: Robotics: Science and Systems (RSS) 2024
♻ ☆ Interconnection of (Q,S,R)-Dissipative Systems in Discrete Time
Discrete-time systems cannot be passive unless there is a direct feedthrough from the input to the output. For passivity-based control to be exploited nevertheless, some authors introduce virtual outputs, while others rely on continuous-time passivity and then apply discretization techniques that preserve passivity in discrete time. Here we argue that quadratic supply rates incorporate and extend the effect of virtual outputs, allowing one to exploit dissipativity properties directly in discrete time. We derive decentralized (Q,S,R)-dissipativity conditions for a set of nonlinear systems interconnected with arbitrary topology, so that the overall network is guaranteed to be stable. For linear systems, we develop dissipative control conditions that are linear in the supply rate matrices. To demonstrate the validity of our methods, we provide numerical examples in the context of islanded microgrids.
♻ ☆ Sigma-point Kalman Filter with Nonlinear Unknown Input Estimation via Optimization and Data-driven Approach for Dynamic Systems
Most works on joint state and unknown input (UI) estimation require the assumption that the UIs are linear; this is potentially restrictive as it does not hold in many intelligent autonomous systems. To overcome this restriction and circumvent the need to linearize the system, we propose a derivative-free Unknown Input Sigma-point Kalman Filter (SPKF-nUI) where the SPKF is interconnected with a general nonlinear UI estimator that can be implemented via nonlinear optimization and data-driven approaches. The nonlinear UI estimator uses the posterior state estimate which is less susceptible to state prediction error. In addition, we introduce a joint sigma-point transformation scheme to incorporate both the state and UI uncertainties in the estimation of SPKF-nUI. An in-depth stochastic stability analysis proves that the proposed SPKF-nUI yields exponentially converging estimation error bounds under reasonable assumptions. Finally, two case studies are carried out on a simulation-based rigid robot and a physical soft robot, i.e., robots made of soft materials with complex dynamics to validate effectiveness of the proposed filter on nonlinear dynamic systems. Our results demonstrate that the proposed SPKF-nUI achieves the lowest state and UI estimation errors when compared to the existing nonlinear state-UI filters.
♻ ☆ Input-output linearization and decoupling of mechanical control systems
In this work, we present a problem of simultaneous input-output feedback linearization and decoupling (non-interacting) for mechanical control systems with outputs. We show that the natural requirement of preserving mechanical structure of the system and of transformations imposes supplementary conditions when compared to the classical solution of the same problem for general control systems. These conditions can be expressed using objects on the configuration space only. We illustrate our results with several examples of mechanical control systems.
♻ ☆ Coordinate-based neural representations for computational adaptive optics in widefield microscopy
Widefield microscopy is widely used for non-invasive imaging of biological structures at subcellular resolution. When applied to complex specimen, its image quality is degraded by sample-induced optical aberration. Adaptive optics can correct wavefront distortion and restore diffraction-limited resolution but require wavefront sensing and corrective devices, increasing system complexity and cost. Here, we describe a self-supervised machine learning algorithm, CoCoA, that performs joint wavefront estimation and three-dimensional structural information extraction from a single input 3D image stack without the need for external training dataset. We implemented CoCoA for widefield imaging of mouse brain tissues and validated its performance with direct-wavefront-sensing-based adaptive optics. Importantly, we systematically explored and quantitatively characterized the limiting factors of CoCoA's performance. Using CoCoA, we demonstrated the first in vivo widefield mouse brain imaging using machine-learning-based adaptive optics. Incorporating coordinate-based neural representations and a forward physics model, the self-supervised scheme of CoCoA should be applicable to microscopy modalities in general.
comment: 60 pages, 20 figures, 2 tables. Nat Mach Intell (2024)
♻ ☆ Proactive Service Assurance in 5G and B5G Networks: A Closed-Loop Algorithm for End-to-End Network Slicing
The customization of services in Fifth-generation (5G) and Beyond 5G (B5G) networks relies heavily on network slicing, which creates multiple virtual networks on a shared physical infrastructure, tailored to meet specific requirements of distinct applications, using Software Defined Networking (SDN) and Network Function Virtualization (NFV). It is imperative to ensure that network services meet the performance and reliability requirements of various applications and users, thus, service assurance is one of the critical components in network slicing. One of the key functionalities of network slicing is the ability to scale Virtualized Network Functions (VNFs) in response to changing resource demand and to meet Customer Service Level agreements (SLAs). In this paper, we introduce a proactive closed-loop algorithm for end-to-end network orchestration, designed to provide service assurance in 5G and B5G networks. We focus on dynamically scaling resources to meet key performance indicators (KPIs) specific to each network slice and operate in parallel across multiple slices, making it scalable and capable of managing completely automatically real-time service assurance. Through our experiments, we demonstrate that the proposed algorithm effectively fulfills service assurance requirements for different network slice types, thereby minimizing network resource utilization and reducing the over-provisioning of spare resources.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
♻ ☆ Hamilton-Jacobi Reachability Analysis for Hybrid Systems with Controlled and Forced Transitions
Hybrid dynamical systems with nonlinear dynamics are one of the most general modeling tools for representing robotic systems, especially contact-rich systems. However, providing guarantees regarding the safety or performance of nonlinear hybrid systems remains a challenging problem because it requires simultaneous reasoning about continuous state evolution and discrete mode switching. In this work, we address this problem by extending classical Hamilton-Jacobi (HJ) reachability analysis, a formal verification method for continuous-time nonlinear dynamical systems, to hybrid dynamical systems. We characterize the reachable sets for hybrid systems through a generalized value function defined over discrete and continuous states of the hybrid system. We also provide a numerical algorithm to compute this value function and obtain the reachable set. Our framework can compute reachable sets for hybrid systems consisting of multiple discrete modes, each with its own set of nonlinear continuous dynamics, discrete transitions that can be directly commanded or forced by a discrete control input, while still accounting for control bounds and adversarial disturbances in the state evolution. Along with the reachable set, the proposed framework also provides an optimal continuous and discrete controller to ensure system safety. We demonstrate our framework in several simulation case studies, as well as on a real-world testbed to solve the optimal mode planning problem for a quadruped with multiple gaits.
♻ ☆ Interpolatory $\mathcal{H}_2$-optimality Conditions for Structured Linear Time-invariant Systems
Interpolatory necessary optimality conditions for $\mathcal{H}_2$-optimal reduced-order modeling of unstructured linear time-invariant (LTI) systems are well-known. Based on previous work on $\mathcal{L}_2$-optimal reduced-order modeling of stationary parametric problems, in this paper we develop and investigate optimality conditions for $\mathcal{H}_2$-optimal reduced-order modeling of structured LTI systems, in particular, for second-order, port-Hamiltonian, and time-delay systems. Under certain diagonalizability assumptions, we show that across all these different structured settings, bitangential Hermite interpolation is the common form for optimality, thus proving a unifying optimality framework for structured reduced-order modeling.
comment: 23 pages
♻ ☆ Tight Remainder-Form Decomposition Functions with Applications to Constrained Reachability and Guaranteed State Estimation
This paper proposes a tractable family of remainder-form mixed-monotone decomposition functions that are useful for over-approximating the image set of nonlinear mappings in reachability and estimation problems. Our approach applies to a new class of nonsmooth, discontinuous nonlinear systems that we call either-sided locally Lipschitz semicontinuous (ELLS) systems, which we show to be a strict superset of locally Lipschitz continuous (LLC) systems, thus expanding the set of systems that are formally known to be mixed-monotone. In addition, we derive lower and upper bounds for the over-approximation error and show that the lower bound is achieved with our proposed approach, i.e., our approach constructs the tightest, tractable remainder-form mixed-monotone decomposition function. Moreover, we introduce a set inversion algorithm that along with the proposed decomposition functions, can be used for constrained reachability analysis and guaranteed state estimation for continuous- and discrete-time systems with bounded noise.
comment: 17 pages, published in IEEE Transactions on Automatic Control, Examples 1 and 2 slightly revised on Arxiv
♻ ☆ A Stochastic Operator Framework for Optimization and Learning with Sub-Weibull Errors
This paper proposes a framework to study the convergence of stochastic optimization and learning algorithms. The framework is modeled over the different challenges that these algorithms pose, such as (i) the presence of random additive errors (e.g. due to stochastic gradients), and (ii) random coordinate updates (e.g. due to asynchrony in distributed set-ups). The paper covers both convex and strongly convex problems, and it also analyzes online scenarios, involving changes in the data and costs. The paper relies on interpreting stochastic algorithms as the iterated application of stochastic operators, thus allowing us to use the powerful tools of operator theory. In particular, we consider operators characterized by additive errors with sub-Weibull distribution (which parameterize a broad class of errors by their tail probability), and random updates. In this framework we derive convergence results in mean and in high probability, by providing bounds to the distance of the current iteration from a solution of the optimization or learning problem. The contributions are discussed in light of federated learning applications.
comment: To appear in IEEE Transactions on Automatic Control
Multiagent Systems 5
☆ Towards Physically Talented Aerial Robots with Tactically Smart Swarm Behavior thereof: An Efficient Co-design Approach
The collective performance or capacity of collaborative autonomous systems such as a swarm of robots is jointly influenced by the morphology and the behavior of individual systems in that collective. In that context, this paper explores how morphology impacts the learned tactical behavior of unmanned aerial/ground robots performing reconnaissance and search & rescue. This is achieved by presenting a computationally efficient framework to solve this otherwise challenging problem of jointly optimizing the morphology and tactical behavior of swarm robots. Key novel developments to this end include the use of physical talent metrics and modification of graph reinforcement learning architectures to allow joint learning of the swarm tactical policy and the talent metrics (search speed, flight range, and cruising speed) that constrain mobility and object/victim search capabilities of the aerial robots executing these tactics. Implementation of this co-design approach is supported by advancements to an open-source Pybullet-based swarm simulator that allows the use of variable aerial asset capabilities. The results of the co-design are observed to outperform those of tactics learning with a fixed Pareto design, when compared in terms of mission performance metrics. Significant differences in morphology and learned behavior are also observed by comparing the baseline design and the co-design outcomes.
comment: Accepted for presentation in proceedings of ASME IDETC-CIE 2024
♻ ☆ MAP-NBV: Multi-agent Prediction-guided Next-Best-View Planning for Active 3D Object Reconstruction IROS 2024
Next-Best View (NBV) planning is a long-standing problem of determining where to obtain the next best view of an object from, by a robot that is viewing the object. There are a number of methods for choosing NBV based on the observed part of the object. In this paper, we investigate how predicting the unobserved part helps with the efficiency of reconstructing the object. We present, Multi-Agent Prediction-Guided NBV (MAP-NBV), a decentralized coordination algorithm for active 3D reconstruction with multi-agent systems. Prediction-based approaches have shown great improvement in active perception tasks by learning the cues about structures in the environment from data. However, these methods primarily focus on single-agent systems. We design a decentralized next-best-view approach that utilizes geometric measures over the predictions and jointly optimizes the information gain and control effort for efficient collaborative 3D reconstruction of the object. Our method achieves 19% improvement over the non-predictive multi-agent approach in simulations using AirSim and ShapeNet. We make our code publicly available through our project website: http://raaslab.org/projects/MAPNBV/.
comment: 8 pages, 7 figures, 1 table. Submitted to IROS 2024
♻ ☆ Generation of Asset Administration Shell with Large Language Model Agents: Toward Semantic Interoperability in Digital Twins in the Context of Industry 4.0
This research introduces a novel approach for achieving semantic interoperability in digital twins and assisting the creation of Asset Administration Shell (AAS) as digital twin model within the context of Industry 4.0. The foundational idea of our research is that the communication based on semantics and the generation of meaningful textual data are directly linked, and we posit that these processes are equivalent if the exchanged information can be serialized in text form. Based on this, we construct a "semantic node" data structure in our research to capture the semantic essence of textual data. Then, a system powered by large language models is designed and implemented to process the "semantic node" and generate standardized digital twin models from raw textual data collected from datasheets describing technical assets. Our evaluation demonstrates an effective generation rate of 62-79%, indicating a substantial proportion of the information from the source text can be translated error-free to the target digital twin instance model with the generative capability of large language models. This result has a direct application in the context of Industry 4.0, and the designed system is implemented as a data model generation tool for reducing the manual effort in creating AAS model. In our evaluation, a comparative analysis of different LLMs and an in-depth ablation study of Retrieval-Augmented Generation (RAG) mechanisms provide insights into the effectiveness of LLM systems for interpreting technical concepts and translating data. Our findings emphasize LLMs' capability to automate AAS instance creation and contribute to the broader field of semantic interoperability for digital twins in industrial applications. The prototype implementation and evaluation results are presented on our GitHub Repository: https://github.com/YuchenXia/AASbyLLM.
comment: Published in IEEE Access
♻ ☆ Navigating simplicity and complexity of social-ecological systems through a dialog between dynamical systems and agent-based models
Social-ecological systems research aims to understand the nature of social-ecological phenomena, to find ways to foster or manage conditions under which desired phenomena occur or to reduce the negative consequences of undesirable phenomena. Such challenges are often addressed using dynamical systems models (DSM) or agent-based models (ABM). Here we develop an iterative procedure for combining DSM and ABM to leverage their strengths and gain insights that surpass insights obtained by each approach separately. The procedure uses results of an ABM as inputs for a DSM development. In the following steps, results of the DSM analyses guide future analysis of the ABM and vice versa. This dialogue, more than having a tight connection between the models, enables pushing the research frontier, expanding the set of research questions and insights. We illustrate our method with the example of poverty traps and innovation in agricultural systems, but our conclusions are general and can be applied to other DSM-ABM combinations.
comment: 16 pages, 8 figures
♻ ☆ FightLadder: A Benchmark for Competitive Multi-Agent Reinforcement Learning ICML 2024
Recent advances in reinforcement learning (RL) heavily rely on a variety of well-designed benchmarks, which provide environmental platforms and consistent criteria to evaluate existing and novel algorithms. Specifically, in multi-agent RL (MARL), a plethora of benchmarks based on cooperative games have spurred the development of algorithms that improve the scalability of cooperative multi-agent systems. However, for the competitive setting, a lightweight and open-sourced benchmark with challenging gaming dynamics and visual inputs has not yet been established. In this work, we present FightLadder, a real-time fighting game platform, to empower competitive MARL research. Along with the platform, we provide implementations of state-of-the-art MARL algorithms for competitive games, as well as a set of evaluation metrics to characterize the performance and exploitability of agents. We demonstrate the feasibility of this platform by training a general agent that consistently defeats 12 built-in characters in single-player mode, and expose the difficulty of training a non-exploitable agent without human knowledge and demonstrations in two-player mode. FightLadder provides meticulously designed environments to address critical challenges in competitive MARL research, aiming to catalyze a new era of discovery and advancement in the field. Videos and code at https://sites.google.com/view/fightladder/home.
comment: ICML 2024
Artificial Intelligence 172
☆ StableNormal: Reducing Diffusion Variance for Stable and Sharp Normal
This work addresses the challenge of high-quality surface normal estimation from monocular colored inputs (i.e., images and videos), a field which has recently been revolutionized by repurposing diffusion priors. However, previous attempts still struggle with stochastic inference, conflicting with the deterministic nature of the Image2Normal task, and costly ensembling step, which slows down the estimation process. Our method, StableNormal, mitigates the stochasticity of the diffusion process by reducing inference variance, thus producing "Stable-and-Sharp" normal estimates without any additional ensembling process. StableNormal works robustly under challenging imaging conditions, such as extreme lighting, blurring, and low quality. It is also robust against transparent and reflective surfaces, as well as cluttered scenes with numerous objects. Specifically, StableNormal employs a coarse-to-fine strategy, which starts with a one-step normal estimator (YOSO) to derive an initial normal guess, that is relatively coarse but reliable, then followed by a semantic-guided refinement process (SG-DRN) that refines the normals to recover geometric details. The effectiveness of StableNormal is demonstrated through competitive performance in standard datasets such as DIODE-indoor, iBims, ScannetV2 and NYUv2, and also in various downstream tasks, such as surface reconstruction and normal enhancement. These results evidence that StableNormal retains both the "stability" and "sharpness" for accurate normal estimation. StableNormal represents a baby attempt to repurpose diffusion priors for deterministic estimation. To democratize this, code and models have been publicly available in hf.co/Stable-X
comment: HF Demo: hf.co/Stable-X, Video: https://www.youtube.com/watch?v=sylXTxG_U2U
☆ GeoMFormer: A General Architecture for Geometric Molecular Representation Learning ICML 2024
Molecular modeling, a central topic in quantum mechanics, aims to accurately calculate the properties and simulate the behaviors of molecular systems. The molecular model is governed by physical laws, which impose geometric constraints such as invariance and equivariance to coordinate rotation and translation. While numerous deep learning approaches have been developed to learn molecular representations under these constraints, most of them are built upon heuristic and costly modules. We argue that there is a strong need for a general and flexible framework for learning both invariant and equivariant features. In this work, we introduce a novel Transformer-based molecular model called GeoMFormer to achieve this goal. Using the standard Transformer modules, two separate streams are developed to maintain and learn invariant and equivariant representations. Carefully designed cross-attention modules bridge the two streams, allowing information fusion and enhancing geometric modeling in each stream. As a general and flexible architecture, we show that many previous architectures can be viewed as special instantiations of GeoMFormer. Extensive experiments are conducted to demonstrate the power of GeoMFormer. All empirical results show that GeoMFormer achieves strong performance on both invariant and equivariant tasks of different types and scales. Code and models will be made publicly available at https://github.com/c-tl/GeoMFormer.
comment: 25 pages, 13 tables, l figure; ICML 2024 camera ready version
☆ Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts
We present LoCoVQA, a dynamic benchmark generator for evaluating long-context extractive reasoning in vision language models (VLMs). LoCoVQA augments test examples for mathematical reasoning, VQA, and character recognition tasks with increasingly long visual contexts composed of both in-distribution and out-of-distribution distractor images. Across these tasks, a diverse set of VLMs rapidly lose performance as the visual context length grows, often exhibiting a striking exponential decay trend. This test assesses how well VLMs can ignore irrelevant information when answering queries -- a task that is quite easy for language models (LMs) in the text domain -- demonstrating that current state-of-the-art VLMs lack this essential capability for many long-context applications.
comment: Under review
☆ USDC: A Dataset of $\underline{U}$ser $\underline{S}$tance and $\underline{D}$ogmatism in Long $\underline{C}$onversations
Identifying user's opinions and stances in long conversation threads on various topics can be extremely critical for enhanced personalization, market research, political campaigns, customer service, conflict resolution, targeted advertising, and content moderation. Hence, training language models to automate this task is critical. However, to train such models, gathering manual annotations has multiple challenges: 1) It is time-consuming and costly; 2) Conversation threads could be very long, increasing chances of noisy annotations; and 3) Interpreting instances where a user changes their opinion within a conversation is difficult because often such transitions are subtle and not expressed explicitly. Inspired by the recent success of large language models (LLMs) for complex natural language processing (NLP) tasks, we leverage Mistral Large and GPT-4 to automate the human annotation process on the following two tasks while also providing reasoning: i) User Stance classification, which involves labeling a user's stance of a post in a conversation on a five-point scale; ii) User Dogmatism classification, which deals with labeling a user's overall opinion in the conversation on a four-point scale. The majority voting on zero-shot, one-shot, and few-shot annotations from these two LLMs on 764 multi-user Reddit conversations helps us curate the USDC dataset. USDC is then used to finetune and instruction-tune multiple deployable small language models for the 5-class stance and 4-class dogmatism classification tasks. We make the code and dataset publicly available [https://anonymous.4open.science/r/USDC-0F7F].
comment: 32 pages, 18 figures
☆ Understanding and Mitigating Tokenization Bias in Language Models
State-of-the-art language models are autoregressive and operate on subword units known as tokens. Specifically, one must encode the conditioning string into a list of tokens before passing to the language models for next-token prediction. We show that, for encoding schemes such as maximum prefix matching, tokenization induces a sampling bias that cannot be mitigated with more training or data. To counter this universal problem, we propose a novel algorithm to obtain unbiased estimates from a model that was trained on tokenized data. Our method does not require finetuning the model, and its complexity, defined as the number of model runs, scales linearly with the sequence length. As a consequence, we show that one can simulate token-free behavior from a tokenized language model. We empirically verify the correctness of our method through a Markov-chain setup, where it accurately recovers the transition probabilities, as opposed to the conventional method of directly prompting tokens into the language model.
☆ Ragnarök: A Reusable RAG Framework and Baselines for TREC 2024 Retrieval-Augmented Generation Track
Did you try out the new Bing Search? Or maybe you fiddled around with Google AI~Overviews? These might sound familiar because the modern-day search stack has recently evolved to include retrieval-augmented generation (RAG) systems. They allow searching and incorporating real-time data into large language models (LLMs) to provide a well-informed, attributed, concise summary in contrast to the traditional search paradigm that relies on displaying a ranked list of documents. Therefore, given these recent advancements, it is crucial to have an arena to build, test, visualize, and systematically evaluate RAG-based search systems. With this in mind, we propose the TREC 2024 RAG Track to foster innovation in evaluating RAG systems. In our work, we lay out the steps we've made towards making this track a reality -- we describe the details of our reusable framework, Ragnar\"ok, explain the curation of the new MS MARCO V2.1 collection choice, release the development topics for the track, and standardize the I/O definitions which assist the end user. Next, using Ragnar\"ok, we identify and provide key industrial baselines such as OpenAI's GPT-4o or Cohere's Command R+. Further, we introduce a web-based user interface for an interactive arena allowing benchmarking pairwise RAG systems by crowdsourcing. We open-source our Ragnar\"ok framework and baselines to achieve a unified standard for future RAG systems.
☆ General Binding Affinity Guidance for Diffusion Models in Structure-Based Drug Design
Structure-Based Drug Design (SBDD) focuses on generating valid ligands that strongly and specifically bind to a designated protein pocket. Several methods use machine learning for SBDD to generate these ligands in 3D space, conditioned on the structure of a desired protein pocket. Recently, diffusion models have shown success here by modeling the underlying distributions of atomic positions and types. While these methods are effective in considering the structural details of the protein pocket, they often fail to explicitly consider the binding affinity. Binding affinity characterizes how tightly the ligand binds to the protein pocket, and is measured by the change in free energy associated with the binding process. It is one of the most crucial metrics for benchmarking the effectiveness of the interaction between a ligand and protein pocket. To address this, we propose BADGER: Binding Affinity Diffusion Guidance with Enhanced Refinement. BADGER is a general guidance method to steer the diffusion sampling process towards improved protein-ligand binding, allowing us to adjust the distribution of the binding affinity between ligands and proteins. Our method is enabled by using a neural network (NN) to model the energy function, which is commonly approximated by AutoDock Vina (ADV). ADV's energy function is non-differentiable, and estimates the affinity based on the interactions between a ligand and target protein receptor. By using a NN as a differentiable energy function proxy, we utilize the gradient of our learned energy function as a guidance method on top of any trained diffusion model. We show that our method improves the binding affinity of generated ligands to their protein receptors by up to 60\%, significantly surpassing previous machine learning methods. We also show that our guidance method is flexible and can be easily applied to other diffusion-based SBDD frameworks.
☆ PISTOL: Dataset Compilation Pipeline for Structural Unlearning of LLMs
Recently, machine unlearning, which seeks to erase specific data stored in the pre-trained or fine-tuned models, has emerged as a crucial protective measure for LLMs. However, unlearning approaches for LLMs that have been considered thus far have focused on the removal of independent data points and have not taken into account that the stored facts are logically connected to one another and form an implicit knowledge graph. To facilitate the development of structural unlearning methods, which are essential for the practical application of unlearning, we propose PISTOL, a pipeline for compiling multi-scenario datasets for benchmarking structural LLM unlearning. Additionally, leveraging sample datasets synthesized using PISTOL, we conducted benchmarks with four distinct unlearning methods on both Llama2-7B and Mistral-7B models. This analysis helps to illustrate the prevailing challenges in effectively and robustly removing highly inter-connected data, batched data, or data skewed towards a specific domain. It also highlights the choice of pre-trained model can impact unlearning performance. This work not only advances our understandings on the limitation of current LLMs unlearning methods and proposes future research directions, but also provides a replicable framework for ongoing exploration and validation in the field.
☆ Lottery Ticket Adaptation: Mitigating Destructive Interference in LLMs
Existing methods for adapting large language models (LLMs) to new tasks are not suited to multi-task adaptation because they modify all the model weights -- causing destructive interference between tasks. The resulting effects, such as catastrophic forgetting of earlier tasks, make it challenging to obtain good performance on multiple tasks at the same time. To mitigate this, we propose Lottery Ticket Adaptation (LoTA), a sparse adaptation method that identifies and optimizes only a sparse subnetwork of the model. We evaluate LoTA on a wide range of challenging tasks such as instruction following, reasoning, math, and summarization. LoTA obtains better performance than full fine-tuning and low-rank adaptation (LoRA), and maintains good performance even after training on other tasks -- thus, avoiding catastrophic forgetting. By extracting and fine-tuning over \emph{lottery tickets} (or \emph{sparse task vectors}), LoTA also enables model merging over highly dissimilar tasks.
☆ Adam-mini: Use Fewer Learning Rates To Gain More
We propose Adam-mini, an optimizer that achieves on-par or better performance than AdamW with 45% to 50% less memory footprint. Adam-mini reduces memory by cutting down the number of learning rates in Adam: Instead of assigning an individual learning rate for each parameter using $1/\sqrt{v}$, Adam-mini uses the average of $v$ within a pre-defined parameter block as the learning rate for that block. Such a design is inspired by two empirical findings. First, the Hessian of Transformers exhibits a near-block diagonal structure with different sizes of dense sub-blocks. Second, for each of these dense sub-blocks, there exists a single high-quality learning rate that can outperform Adam, provided that sufficient resources are available to search it out. Adam-mini provides one cost-effective way to find these good learning rates and manage to cut down $\geq 90% v$ in Adam. Empirically, we verify that Adam-mini performs on par or better than AdamW on various language models sized from 125M to 7B for pre-training, supervised fine-tuning, and RLHF. The reduced memory footprint of Adam-mini also alleviates communication overheads among GPUs and CPUs, thereby increasing throughput. For instance, Adam-mini achieves 49.6% higher throughput than AdamW when pre-training Llama2-7B on 2x A800-80GB GPUs, which saves 33% wall-clock time for pre-training.
☆ The Progression of Transformers from Language to Vision to MOT: A Literature Review on Multi-Object Tracking with Transformers
The transformer neural network architecture allows for autoregressive sequence-to-sequence modeling through the use of attention layers. It was originally created with the application of machine translation but has revolutionized natural language processing. Recently, transformers have also been applied across a wide variety of pattern recognition tasks, particularly in computer vision. In this literature review, we describe major advances in computer vision utilizing transformers. We then focus specifically on Multi-Object Tracking (MOT) and discuss how transformers are increasingly becoming competitive in state-of-the-art MOT works, yet still lag behind traditional deep learning methods.
comment: This report was written in November 2022, and may not contain more recent works since then
☆ M2Lingual: Enhancing Multilingual, Multi-Turn Instruction Alignment in Large Language Models
Instruction finetuning (IFT) is critical for aligning Large Language Models (LLMs) to follow instructions. Numerous effective IFT datasets have been proposed in the recent past, but most focus on high resource languages such as English. In this work, we propose a fully synthetic, novel taxonomy (Evol) guided Multilingual, Multi-turn instruction finetuning dataset, called M2Lingual, to better align LLMs on a diverse set of languages and tasks. M2Lingual contains a total of 182K IFT pairs that are built upon diverse seeds, covering 70 languages, 17 NLP tasks and general instruction-response pairs. LLMs finetuned with M2Lingual substantially outperform the majority of existing multilingual IFT datasets. Importantly, LLMs trained with M2Lingual consistently achieve competitive results across a wide variety of evaluation benchmarks compared to existing multilingual IFT datasets. Specifically, LLMs finetuned with M2Lingual achieve strong performance on our translated multilingual, multi-turn evaluation benchmark as well as a wide variety of multilingual tasks. Thus we contribute, and the 2 step Evol taxonomy used for its creation. M2Lingual repository - https://huggingface.co/datasets/ServiceNow-AI/M2Lingual
comment: 39 pages
☆ Blending LLMs into Cascaded Speech Translation: KIT's Offline Speech Translation System for IWSLT 2024
Large Language Models (LLMs) are currently under exploration for various tasks, including Automatic Speech Recognition (ASR), Machine Translation (MT), and even End-to-End Speech Translation (ST). In this paper, we present KIT's offline submission in the constrained + LLM track by incorporating recently proposed techniques that can be added to any cascaded speech translation. Specifically, we integrate Mistral-7B\footnote{mistralai/Mistral-7B-Instruct-v0.1} into our system to enhance it in two ways. Firstly, we refine the ASR outputs by utilizing the N-best lists generated by our system and fine-tuning the LLM to predict the transcript accurately. Secondly, we refine the MT outputs at the document level by fine-tuning the LLM, leveraging both ASR and MT predictions to improve translation quality. We find that integrating the LLM into the ASR and MT systems results in an absolute improvement of $0.3\%$ in Word Error Rate and $0.65\%$ in COMET for tst2019 test set. In challenging test sets with overlapping speakers and background noise, we find that integrating LLM is not beneficial due to poor ASR performance. Here, we use ASR with chunked long-form decoding to improve context usage that may be unavailable when transcribing with Voice Activity Detection segmentation alone.
☆ OlympicArena Medal Ranks: Who Is the Most Intelligent AI So Far?
In this report, we pose the following question: Who is the most intelligent AI model to date, as measured by the OlympicArena (an Olympic-level, multi-discipline, multi-modal benchmark for superintelligent AI)? We specifically focus on the most recently released models: Claude-3.5-Sonnet, Gemini-1.5-Pro, and GPT-4o. For the first time, we propose using an Olympic medal Table approach to rank AI models based on their comprehensive performance across various disciplines. Empirical results reveal: (1) Claude-3.5-Sonnet shows highly competitive overall performance over GPT-4o, even surpassing GPT-4o on a few subjects (i.e., Physics, Chemistry, and Biology). (2) Gemini-1.5-Pro and GPT-4V are ranked consecutively just behind GPT-4o and Claude-3.5-Sonnet, but with a clear performance gap between them. (3) The performance of AI models from the open-source community significantly lags behind these proprietary models. (4) The performance of these models on this benchmark has been less than satisfactory, indicating that we still have a long way to go before achieving superintelligence. We remain committed to continuously tracking and evaluating the performance of the latest powerful models on this benchmark (available at https://github.com/GAIR-NLP/OlympicArena).
comment: 10 pages
☆ WARP: On the Benefits of Weight Averaged Rewarded Policies
Reinforcement learning from human feedback (RLHF) aligns large language models (LLMs) by encouraging their generations to have high rewards, using a reward model trained on human preferences. To prevent the forgetting of pre-trained knowledge, RLHF usually incorporates a KL regularization; this forces the policy to remain close to its supervised fine-tuned initialization, though it hinders the reward optimization. To tackle the trade-off between KL and reward, in this paper we introduce a novel alignment strategy named Weight Averaged Rewarded Policies (WARP). WARP merges policies in the weight space at three distinct stages. First, it uses the exponential moving average of the policy as a dynamic anchor in the KL regularization. Second, it applies spherical interpolation to merge independently fine-tuned policies into a new enhanced one. Third, it linearly interpolates between this merged model and the initialization, to recover features from pre-training. This procedure is then applied iteratively, with each iteration's final model used as an advanced initialization for the next, progressively refining the KL-reward Pareto front, achieving superior rewards at fixed KL. Experiments with GEMMA policies validate that WARP improves their quality and alignment, outperforming other open-source LLMs.
comment: 11 main pages (34 pages with Appendix)
☆ Addressing Polarization and Unfairness in Performative Prediction
When machine learning (ML) models are used in applications that involve humans (e.g., online recommendation, school admission, hiring, lending), the model itself may trigger changes in the distribution of targeted data it aims to predict. Performative prediction (PP) is a framework that explicitly considers such model-dependent distribution shifts when learning ML models. While significant efforts have been devoted to finding performative stable (PS) solutions in PP for system robustness, their societal implications are less explored and it is unclear whether PS solutions are aligned with social norms such as fairness. In this paper, we set out to examine the fairness property of PS solutions in performative prediction. We first show that PS solutions can incur severe polarization effects and group-wise loss disparity. Although existing fairness mechanisms commonly used in literature can help mitigate unfairness, they may fail and disrupt the stability under model-dependent distribution shifts. We thus propose novel fairness intervention mechanisms that can simultaneously achieve both stability and fairness in PP settings. Both theoretical analysis and experiments are provided to validate the proposed method.
☆ The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources
Foundation model development attracts a rapidly expanding body of contributors, scientists, and applications. To help shape responsible development practices, we introduce the Foundation Model Development Cheatsheet: a growing collection of 250+ tools and resources spanning text, vision, and speech modalities. We draw on a large body of prior work to survey resources (e.g. software, documentation, frameworks, guides, and practical tools) that support informed data selection, processing, and understanding, precise and limitation-aware artifact documentation, efficient model training, advance awareness of the environmental impact from training, careful model evaluation of capabilities, risks, and claims, as well as responsible model release, licensing and deployment practices. We hope this curated collection of resources helps guide more responsible development. The process of curating this list, enabled us to review the AI development ecosystem, revealing what tools are critically missing, misused, or over-used in existing practices. We find that (i) tools for data sourcing, model evaluation, and monitoring are critically under-serving ethical and real-world needs, (ii) evaluations for model safety, capabilities, and environmental impact all lack reproducibility and transparency, (iii) text and particularly English-centric analyses continue to dominate over multilingual and multi-modal analyses, and (iv) evaluation of systems, rather than just models, is needed so that capabilities and impact are assessed in context.
☆ Bandits with Preference Feedback: A Stackelberg Game Perspective
Bandits with preference feedback present a powerful tool for optimizing unknown target functions when only pairwise comparisons are allowed instead of direct value queries. This model allows for incorporating human feedback into online inference and optimization and has been employed in systems for fine-tuning large language models. The problem is well understood in simplified settings with linear target functions or over finite small domains that limit practical interest. Taking the next step, we consider infinite domains and nonlinear (kernelized) rewards. In this setting, selecting a pair of actions is quite challenging and requires balancing exploration and exploitation at two levels: within the pair, and along the iterations of the algorithm. We propose MAXMINLCB, which emulates this trade-off as a zero-sum Stackelberg game, and chooses action pairs that are informative and yield favorable rewards. MAXMINLCB consistently outperforms existing algorithms and satisfies an anytime-valid rate-optimal regret guarantee. This is due to our novel preference-based confidence sequences for kernelized logistic estimators.
comment: 30 pages, 8 figures
☆ Extracting thin film structures of energy materials using transformers
Neutron-Transformer Reflectometry and Advanced Computation Engine (N-TRACE ), a neural network model using transformer architecture, is introduced for neutron reflectometry data analysis. It offers fast, accurate initial parameter estimations and efficient refinements, improving efficiency and precision for real-time data analysis of lithium-mediated nitrogen reduction for electrochemical ammonia synthesis, with relevance to other chemical transformations and batteries. Despite limitations in generalizing across systems, it shows promises for the use of transformers as the basis for models that could replace trial-and-error approaches to modeling reflectometry data.
comment: 11 pages, 7 figures
☆ Inducing Group Fairness in LLM-Based Decisions
Prompting Large Language Models (LLMs) has created new and interesting means for classifying textual data. While evaluating and remediating group fairness is a well-studied problem in classifier fairness literature, some classical approaches (e.g., regularization) do not carry over, and some new opportunities arise (e.g., prompt-based remediation). We measure fairness of LLM-based classifiers on a toxicity classification task, and empirically show that prompt-based classifiers may lead to unfair decisions. We introduce several remediation techniques and benchmark their fairness and performance trade-offs. We hope our work encourages more research on group fairness in LLM-based classifiers.
☆ Convolutional neural network for Lyman break galaxies classification and redshift regression in DESI (Dark Energy Spectroscopic Instrument)
DESI is a groundbreaking international project to observe more than 40 million quasars and galaxies over a 5-year period to create a 3D map of the sky. This map will enable us to probe multiple aspects of cosmology, from dark energy to neutrino mass. We are focusing here on one type of object observed by DESI, the Lyman Break Galaxies (LBGs). The aim is to use their spectra to determine whether they are indeed LBGs, and if so, to determine their distance from the Earth using a phenomenon called redshift. This will enable us to place these galaxies on the DESI 3D map. The aim is therefore to develop a convolutional neural network (CNN) inspired by QuasarNET (See arXiv:1808.09955), performing simultaneously a classification (LBG type or not) and a regression task (determine the redshift of the LBGs). Initially, data augmentation techniques such as shifting the spectra in wavelengths, adding noise to the spectra, or adding synthetic spectra were used to increase the model training dataset from 3,019 data to over 66,000. In a second phase, modifications to the QuasarNET architecture, notably through transfer learning and hyperparameter tuning with Bayesian optimization, boosted model performance. Gains of up to 26% were achieved on the Purity/Efficiency curve, which is used to evaluate model performance, particularly in areas with interesting redshifts, at low (around 2) and high (around 4) redshifts. The best model obtained an average score of 94%, compared with 75% for the initial model.
☆ CausalMMM: Learning Causal Structure for Marketing Mix Modeling WSDM 2024
In online advertising, marketing mix modeling (MMM) is employed to predict the gross merchandise volume (GMV) of brand shops and help decision-makers to adjust the budget allocation of various advertising channels. Traditional MMM methods leveraging regression techniques can fail in handling the complexity of marketing. Although some efforts try to encode the causal structures for better prediction, they have the strict restriction that causal structures are prior-known and unchangeable. In this paper, we define a new causal MMM problem that automatically discovers the interpretable causal structures from data and yields better GMV predictions. To achieve causal MMM, two essential challenges should be addressed: (1) Causal Heterogeneity. The causal structures of different kinds of shops vary a lot. (2) Marketing Response Patterns. Various marketing response patterns i.e., carryover effect and shape effect, have been validated in practice. We argue that causal MMM needs dynamically discover specific causal structures for different shops and the predictions should comply with the prior known marketing response patterns. Thus, we propose CausalMMM that integrates Granger causality in a variational inference framework to measure the causal relationships between different channels and predict the GMV with the regularization of both temporal and saturation marketing response patterns. Extensive experiments show that CausalMMM can not only achieve superior performance of causal structure learning on synthetic datasets with improvements of 5.7%\sim 7.1%, but also enhance the GMV prediction results on a representative E-commerce platform.
comment: WSDM 2024, full version
☆ AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models
Although Large Language Models (LLMs) are becoming increasingly powerful, they still exhibit significant but subtle weaknesses, such as mistakes in instruction-following or coding tasks. As these unexpected errors could lead to severe consequences in practical deployments, it is crucial to investigate the limitations within LLMs systematically. Traditional benchmarking approaches cannot thoroughly pinpoint specific model deficiencies, while manual inspections are costly and not scalable. In this paper, we introduce a unified framework, AutoDetect, to automatically expose weaknesses in LLMs across various tasks. Inspired by the educational assessment process that measures students' learning outcomes, AutoDetect consists of three LLM-powered agents: Examiner, Questioner, and Assessor. The collaboration among these three agents is designed to realize comprehensive and in-depth weakness identification. Our framework demonstrates significant success in uncovering flaws, with an identification success rate exceeding 30% in prominent models such as ChatGPT and Claude. More importantly, these identified weaknesses can guide specific model improvements, proving more effective than untargeted data augmentation methods like Self-Instruct. Our approach has led to substantial enhancements in popular LLMs, including the Llama series and Mistral-7b, boosting their performance by over 10% across several benchmarks. Code and data are publicly available at https://github.com/thu-coai/AutoDetect.
☆ Probabilistic Subgoal Representations for Hierarchical Reinforcement learning
In goal-conditioned hierarchical reinforcement learning (HRL), a high-level policy specifies a subgoal for the low-level policy to reach. Effective HRL hinges on a suitable subgoal represen tation function, abstracting state space into latent subgoal space and inducing varied low-level behaviors. Existing methods adopt a subgoal representation that provides a deterministic mapping from state space to latent subgoal space. Instead, this paper utilizes Gaussian Processes (GPs) for the first probabilistic subgoal representation. Our method employs a GP prior on the latent subgoal space to learn a posterior distribution over the subgoal representation functions while exploiting the long-range correlation in the state space through learnable kernels. This enables an adaptive memory that integrates long-range subgoal information from prior planning steps allowing to cope with stochastic uncertainties. Furthermore, we propose a novel learning objective to facilitate the simultaneous learning of probabilistic subgoal representations and policies within a unified framework. In experiments, our approach outperforms state-of-the-art baselines in standard benchmarks but also in environments with stochastic elements and under diverse reward conditions. Additionally, our model shows promising capabilities in transferring low-level policies across different tasks.
☆ Expected Runtime Comparisons Between Breadth-First Search and Constant-Depth Restarting Random Walks ICAPS 2024
When greedy search algorithms encounter a local minima or plateau, the search typically devolves into a breadth-first search (BrFS), or a local search technique is used in an attempt to find a way out. In this work, we formally analyze the performance of BrFS and constant-depth restarting random walks (RRW) -- two methods often used for finding exits to a plateau/local minima -- to better understand when each is best suited. In particular, we formally derive the expected runtime for BrFS in the case of a uniformly distributed set of goals at a given goal depth. We then prove RRW will be faster than BrFS on trees if there are enough goals at that goal depth. We refer to this threshold as the crossover point. Our bound shows that the crossover point grows linearly with the branching factor of the tree, the goal depth, and the error in the random walk depth, while the size of the tree grows exponentially in branching factor and goal depth. Finally, we discuss the practical implications and applicability of this bound.
comment: ICAPS 2024 Heuristics and Search for Domain-Independent Planning Workshop, 5 pages, 1 figure
☆ Public Constitutional AI
We are increasingly subjected to the power of AI authorities. As AI decisions become inescapable, entering domains such as healthcare, education, and law, we must confront a vital question: how can we ensure AI systems have the legitimacy necessary for effective governance? This essay argues that to secure AI legitimacy, we need methods that engage the public in designing and constraining AI systems, ensuring these technologies reflect the community's shared values. Constitutional AI, proposed by Anthropic, represents a step towards this goal, offering a model for democratic control of AI. However, while Constitutional AI's commitment to hardcoding explicit principles into AI models enhances transparency and accountability, it falls short in two crucial aspects: addressing the opacity of individual AI decisions and fostering genuine democratic legitimacy. To overcome these limitations, this essay proposes "Public Constitutional AI." This approach envisions a participatory process where diverse stakeholders, including ordinary citizens, deliberate on the principles guiding AI development. The resulting "AI Constitution" would carry the legitimacy of popular authorship, grounding AI governance in the public will. Furthermore, the essay proposes "AI Courts" to develop "AI case law," providing concrete examples for operationalizing constitutional principles in AI training. This evolving combination of constitutional principles and case law aims to make AI governance more responsive to public values. By grounding AI governance in deliberative democratic processes, Public Constitutional AI offers a path to imbue automated authorities with genuine democratic legitimacy, addressing the unique challenges posed by increasingly powerful AI systems while ensuring their alignment with the public interest.
☆ Link Prediction with Untrained Message Passing Layers
Message passing neural networks (MPNNs) operate on graphs by exchanging information between neigbouring nodes. MPNNs have been successfully applied to various node-, edge-, and graph-level tasks in areas like molecular science, computer vision, natural language processing, and combinatorial optimization. However, most MPNNs require training on large amounts of labeled data, which can be costly and time-consuming. In this work, we explore the use of various untrained message passing layers in graph neural networks, i.e. variants of popular message passing architecture where we remove all trainable parameters that are used to transform node features in the message passing step. Focusing on link prediction, we find that untrained message passing layers can lead to competitive and even superior performance compared to fully trained MPNNs, especially in the presence of high-dimensional features. We provide a theoretical analysis of untrained message passing by relating the inner products of features implicitly produced by untrained message passing layers to path-based topological node similarity measures. As such, untrained message passing architectures can be viewed as a highly efficient and interpretable approach to link prediction.
☆ Segment Any Text: A Universal Approach for Robust, Efficient and Adaptable Sentence Segmentation
Segmenting text into sentences plays an early and crucial role in many NLP systems. This is commonly achieved by using rule-based or statistical methods relying on lexical features such as punctuation. Although some recent works no longer exclusively rely on punctuation, we find that no prior method achieves all of (i) robustness to missing punctuation, (ii) effective adaptability to new domains, and (iii) high efficiency. We introduce a new model - Segment any Text (SaT) - to solve this problem. To enhance robustness, we propose a new pretraining scheme that ensures less reliance on punctuation. To address adaptability, we introduce an extra stage of parameter-efficient fine-tuning, establishing state-of-the-art performance in distinct domains such as verses from lyrics and legal documents. Along the way, we introduce architectural modifications that result in a threefold gain in speed over the previous state of the art and solve spurious reliance on context far in the future. Finally, we introduce a variant of our model with fine-tuning on a diverse, multilingual mixture of sentence-segmented data, acting as a drop-in replacement and enhancement for existing segmentation tools. Overall, our contributions provide a universal approach for segmenting any text. Our method outperforms all baselines - including strong LLMs - across 8 corpora spanning diverse domains and languages, especially in practically relevant situations where text is poorly formatted. Our models and code, including documentation, are available at https://huggingface.co/segment-any-text under the MIT license.
☆ CAVE: Controllable Authorship Verification Explanations
Authorship Verification (AV) (do two documents have the same author?) is essential for many sensitive real-life applications. AV is often used in proprietary domains that require a private, offline model, making SOTA online models like ChatGPT undesirable. Other SOTA systems use methods, e.g. Siamese Networks, that are uninterpretable, and hence cannot be trusted in high-stakes applications. In this work, we take the first step to address the above challenges with our model CAVE (Controllable Authorship Verification Explanations): CAVE generates free-text AV explanations that are controlled to be 1) structured (can be decomposed into sub-explanations with respect to relevant linguistic features), and 2) easily verified for explanation-label consistency (via intermediate labels in sub-explanations). In this work, we train a Llama-3-8B as CAVE; since there are no human-written corpora for AV explanations, we sample silver-standard explanations from GPT-4-TURBO and distill them into a pretrained Llama-3-8B. Results on three difficult AV datasets IMdB2, Blog-Auth, and FanFiction show that CAVE generates high quality explanations (as measured by automatic and human evaluation) as well as competitive task accuracies.
☆ Vision-Language Consistency Guided Multi-modal Prompt Learning for Blind AI Generated Image Quality Assessment
Recently, textual prompt tuning has shown inspirational performance in adapting Contrastive Language-Image Pre-training (CLIP) models to natural image quality assessment. However, such uni-modal prompt learning method only tunes the language branch of CLIP models. This is not enough for adapting CLIP models to AI generated image quality assessment (AGIQA) since AGIs visually differ from natural images. In addition, the consistency between AGIs and user input text prompts, which correlates with the perceptual quality of AGIs, is not investigated to guide AGIQA. In this letter, we propose vision-language consistency guided multi-modal prompt learning for blind AGIQA, dubbed CLIP-AGIQA. Specifically, we introduce learnable textual and visual prompts in language and vision branches of CLIP models, respectively. Moreover, we design a text-to-image alignment quality prediction task, whose learned vision-language consistency knowledge is used to guide the optimization of the above multi-modal prompts. Experimental results on two public AGIQA datasets demonstrate that the proposed method outperforms state-of-the-art quality assessment models. The source code is available at https://github.com/JunFu1995/CLIP-AGIQA.
comment: Accepted by IEEE Signal Processing Letter
☆ Feature Fusion for Human Activity Recognition using Parameter-Optimized Multi-Stage Graph Convolutional Network and Transformer Models
Human activity recognition (HAR) is a crucial area of research that involves understanding human movements using computer and machine vision technology. Deep learning has emerged as a powerful tool for this task, with models such as Convolutional Neural Networks (CNNs) and Transformers being employed to capture various aspects of human motion. One of the key contributions of this work is the demonstration of the effectiveness of feature fusion in improving HAR accuracy by capturing spatial and temporal features, which has important implications for the development of more accurate and robust activity recognition systems. The study uses sensory data from HuGaDB, PKU-MMD, LARa, and TUG datasets. Two model, the PO-MS-GCN and a Transformer were trained and evaluated, with PO-MS-GCN outperforming state-of-the-art models. HuGaDB and TUG achieved high accuracies and f1-scores, while LARa and PKU-MMD had lower scores. Feature fusion improved results across datasets.
comment: 7 pages, 1 figure, conference
☆ ShadowLLM: Predictor-based Contextual Sparsity for Large Language Models
The high power consumption and latency-sensitive deployments of large language models (LLMs) have motivated techniques like quantization and sparsity. Contextual sparsity, where the sparsity pattern is input-dependent, is crucial in LLMs because the permanent removal of attention heads or neurons from LLMs can significantly degrade accuracy. Prior work has attempted to model contextual sparsity using neural networks trained to predict activation magnitudes, which can be used to dynamically prune structures with low predicted activation magnitude. In this paper, we look beyond magnitude-based pruning criteria to assess attention head and neuron importance in LLMs. We developed a novel predictor called ShadowLLM, which can shadow the LLM behavior and enforce better sparsity patterns, resulting in over 15% improvement in end-to-end accuracy without increasing latency compared to previous methods. ShadowLLM achieves up to a 20\% speed-up over the state-of-the-art DejaVu framework. These enhancements are validated on models with up to 30 billion parameters. Our code is available at \href{https://github.com/abdelfattah-lab/shadow_llm/}{ShadowLLM}.
☆ Hacking a surrogate model approach to XAI
In recent years, the number of new applications for highly complex AI systems has risen significantly. Algorithmic decision-making systems (ADMs) are one of such applications, where an AI system replaces the decision-making process of a human expert. As one approach to ensure fairness and transparency of such systems, explainable AI (XAI) has become more important. One variant to achieve explainability are surrogate models, i.e., the idea to train a new simpler machine learning model based on the input-output-relationship of a black box model. The simpler machine learning model could, for example, be a decision tree, which is thought to be intuitively understandable by humans. However, there is not much insight into how well the surrogate model approximates the black box. Our main assumption is that a good surrogate model approach should be able to bring such a discriminating behavior to the attention of humans; prior to our research we assumed that a surrogate decision tree would identify such a pattern on one of its first levels. However, in this article we show that even if the discriminated subgroup - while otherwise being the same in all categories - does not get a single positive decision from the black box ADM system, the corresponding question of group membership can be pushed down onto a level as low as wanted by the operator of the system. We then generalize this finding to pinpoint the exact level of the tree on which the discriminating question is asked and show that in a more realistic scenario, where discrimination only occurs to some fraction of the disadvantaged group, it is even more feasible to hide such discrimination. Our approach can be generalized easily to other surrogate models.
comment: 24 pages, 7 figures
☆ Evaluation of Language Models in the Medical Context Under Resource-Constrained Settings
Since the emergence of the Transformer architecture, language model development has increased, driven by their promising potential. However, releasing these models into production requires properly understanding their behavior, particularly in sensitive domains such as medicine. Despite this need, the medical literature still lacks technical assessments of pre-trained language models, which are especially valuable in resource-constrained settings in terms of computational power or limited budget. To address this gap, we provide a comprehensive survey of language models in the medical domain. In addition, we selected a subset of these models for thorough evaluation, focusing on classification and text generation tasks. Our subset encompasses 53 models, ranging from 110 million to 13 billion parameters, spanning the three families of Transformer-based models and from diverse knowledge domains. This study employs a series of approaches for text classification together with zero-shot prompting instead of model training or fine-tuning, which closely resembles the limited resource setting in which many users of language models find themselves. Encouragingly, our findings reveal remarkable performance across various tasks and datasets, underscoring the latent potential of certain models to contain medical knowledge, even without domain specialization. Consequently, our study advocates for further exploration of model applications in medical contexts, particularly in resource-constrained settings. The code is available on https://github.com/anpoc/Language-models-in-medicine.
☆ Evaluating the Robustness of Deep-Learning Algorithm-Selection Models by Evolving Adversarial Instances PPSN 2024
Deep neural networks (DNN) are increasingly being used to perform algorithm-selection in combinatorial optimisation domains, particularly as they accommodate input representations which avoid designing and calculating features. Mounting evidence from domains that use images as input shows that deep convolutional networks are vulnerable to adversarial samples, in which a small perturbation of an instance can cause the DNN to misclassify. However, it remains unknown as to whether deep recurrent networks (DRN) which have recently been shown promise as algorithm-selectors in the bin-packing domain are equally vulnerable. We use an evolutionary algorithm (EA) to find perturbations of instances from two existing benchmarks for online bin packing that cause trained DRNs to misclassify: adversarial samples are successfully generated from up to 56% of the original instances depending on the dataset. Analysis of the new misclassified instances sheds light on the `fragility' of some training instances, i.e. instances where it is trivial to find a small perturbation that results in a misclassification and the factors that influence this. Finally, the method generates a large number of new instances misclassified with a wide variation in confidence, providing a rich new source of training data to create more robust models.
comment: To appear in the proceedings of the 18th International Conference on Parallel Problem Solving from Nature (PPSN 2024)
☆ CLEAR: Can Language Models Really Understand Causal Graphs?
Causal reasoning is a cornerstone of how humans interpret the world. To model and reason about causality, causal graphs offer a concise yet effective solution. Given the impressive advancements in language models, a crucial question arises: can they really understand causal graphs? To this end, we pioneer an investigation into language models' understanding of causal graphs. Specifically, we develop a framework to define causal graph understanding, by assessing language models' behaviors through four practical criteria derived from diverse disciplines (e.g., philosophy and psychology). We then develop CLEAR, a novel benchmark that defines three complexity levels and encompasses 20 causal graph-based tasks across these levels. Finally, based on our framework and benchmark, we conduct extensive experiments on six leading language models and summarize five empirical findings. Our results indicate that while language models demonstrate a preliminary understanding of causal graphs, significant potential for improvement remains. Our project website is at https://github.com/OpenCausaLab/CLEAR.
☆ QuadrupedGPT: Towards a Versatile Quadruped Agent in Open-ended Worlds
While pets offer companionship, their limited intelligence restricts advanced reasoning and autonomous interaction with humans. Considering this, we propose QuadrupedGPT, a versatile agent designed to master a broad range of complex tasks with agility comparable to that of a pet. To achieve this goal, the primary challenges include: i) effectively leveraging multimodal observations for decision-making; ii) mastering agile control of locomotion and path planning; iii) developing advanced cognition to execute long-term objectives. QuadrupedGPT processes human command and environmental contexts using a large multimodal model (LMM). Empowered by its extensive knowledge base, our agent autonomously assigns appropriate parameters for adaptive locomotion policies and guides the agent in planning a safe but efficient path towards the goal, utilizing semantic-aware terrain analysis. Moreover, QuadrupedGPT is equipped with problem-solving capabilities that enable it to decompose long-term goals into a sequence of executable subgoals through high-level reasoning. Extensive experiments across various benchmarks confirm that QuadrupedGPT can adeptly handle multiple tasks with intricate instructions, demonstrating a significant step towards the versatile quadruped agents in open-ended worlds. Our website and codes can be found at https://quadruped-hub.github.io/Quadruped-GPT/.
comment: Under review
☆ Differentiable Distributionally Robust Optimization Layers
In recent years, there has been a growing research interest in decision-focused learning, which embeds optimization problems as a layer in learning pipelines and demonstrates a superior performance than the prediction-focused approach. However, for distributionally robust optimization (DRO), a popular paradigm for decision-making under uncertainty, it is still unknown how to embed it as a layer, i.e., how to differentiate decisions with respect to an ambiguity set. In this paper, we develop such differentiable DRO layers for generic mixed-integer DRO problems with parameterized second-order conic ambiguity sets and discuss its extension to Wasserstein ambiguity sets. To differentiate the mixed-integer decisions, we propose a novel dual-view methodology by handling continuous and discrete parts of decisions via different principles. Specifically, we construct a differentiable energy-based surrogate to implement the dual-view methodology and use importance sampling to estimate its gradient. We further prove that such a surrogate enjoys the asymptotic convergency under regularization. As an application of the proposed differentiable DRO layers, we develop a novel decision-focused learning pipeline for contextual distributionally robust decision-making tasks and compare it with the prediction-focused approach in experiments.
comment: In Forty-first International Conference on Machine Learning (2024)
☆ Homomorphisms and Embeddings of STRIPS Planning Models
Determining whether two STRIPS planning instances are isomorphic is the simplest form of comparison between planning instances. It is also a particular case of the problem concerned with finding an isomorphism between a planning instance $P$ and a sub-instance of another instance $P_0$ . One application of such a mapping is to efficiently produce a compiled form containing all solutions to P from a compiled form containing all solutions to $P_0$. We also introduce the notion of embedding from an instance $P$ to another instance $P_0$, which allows us to deduce that $P_0$ has no solution-plan if $P$ is unsolvable. In this paper, we study the complexity of these problems. We show that the first is GI-complete, and can thus be solved, in theory, in quasi-polynomial time. While we prove the remaining problems to be NP-complete, we propose an algorithm to build an isomorphism, when possible. We report extensive experimental trials on benchmark problems which demonstrate conclusively that applying constraint propagation in preprocessing can greatly improve the efficiency of a SAT solver.
☆ Inference of Sequential Patterns for Neural Message Passing in Temporal Graphs
The modelling of temporal patterns in dynamic graphs is an important current research issue in the development of time-aware GNNs. Whether or not a specific sequence of events in a temporal graph constitutes a temporal pattern not only depends on the frequency of its occurrence. We consider whether it deviates from what is expected in a temporal graph where timestamps are randomly shuffled. While accounting for such a random baseline is important to model temporal patterns, it has mostly been ignored by current temporal graph neural networks. To address this issue we propose HYPA-DBGNN, a novel two-step approach that combines (i) the inference of anomalous sequential patterns in time series data on graphs based on a statistically principled null model, with (ii) a neural message passing approach that utilizes a higher-order De Bruijn graph whose edges capture overrepresented sequential patterns. Our method leverages hypergeometric graph ensembles to identify anomalous edges within both first- and higher-order De Bruijn graphs, which encode the temporal ordering of events. The model introduces an inductive bias that enhances model interpretability. We evaluate our approach for static node classification using benchmark datasets and a synthetic dataset that showcases its ability to incorporate the observed inductive bias regarding over- and under-represented temporal edges. We demonstrate the framework's effectiveness in detecting similar patterns within empirical datasets, resulting in superior performance compared to baseline methods in node classification tasks. To the best of our knowledge, our work is the first to introduce statistically informed GNNs that leverage temporal and causal sequence anomalies. HYPA-DBGNN represents a path for bridging the gap between statistical graph inference and neural graph representation learning, with potential applications to static GNNs.
☆ Character-Adapter: Prompt-Guided Region Control for High-Fidelity Character Customization
Customized image generation, which seeks to synthesize images with consistent characters, holds significant relevance for applications such as storytelling, portrait generation, and character design. However, previous approaches have encountered challenges in preserving characters with high-fidelity consistency due to inadequate feature extraction and concept confusion of reference characters. Therefore, we propose Character-Adapter, a plug-and-play framework designed to generate images that preserve the details of reference characters, ensuring high-fidelity consistency. Character-Adapter employs prompt-guided segmentation to ensure fine-grained regional features of reference characters and dynamic region-level adapters to mitigate concept confusion. Extensive experiments are conducted to validate the effectiveness of Character-Adapter. Both quantitative and qualitative results demonstrate that Character-Adapter achieves the state-of-the-art performance of consistent character generation, with an improvement of 24.8% compared with other methods
☆ Token-based Decision Criteria Are Suboptimal in In-context Learning
In-Context Learning (ICL) typically utilizes classification criteria from probabilities of manually selected label tokens. However, we argue that such token-based classification criteria lead to suboptimal decision boundaries, despite delicate calibrations through translation and constrained rotation. To address this problem, we propose Hidden Calibration, which renounces token probabilities and uses the nearest centroid classifier on the LM's last hidden states. In detail, we use the nearest centroid classification on the hidden states, assigning the category of the nearest centroid previously observed from a few-shot calibration set to the test sample as the predicted label. Our experiments on 3 models and 10 classification datasets indicate that Hidden Calibration consistently outperforms current token-based calibrations by about 20%. Our further analysis demonstrates that Hidden Calibration finds better classification criteria with less inter-categories overlap, and LMs provide linearly separable intra-category clusters with the help of demonstrations, which supports Hidden Calibration and gives new insights into the conventional ICL.
comment: 21 pages, 14 figures, 8 tables
☆ NARRepair: Non-Autoregressive Code Generation Model for Automatic Program Repair
With the advancement of deep learning techniques, the performance of Automatic Program Repair(APR) techniques has reached a new level. Previous deep learning-based APR techniques essentially modified program sentences in the Autoregressive(AR) manner, which predicts future values based on past values. Due to the manner of word-by-word generation, the AR-based APR technique has a huge time delay. This negative consequence overshadows the widespread adoption of APR techniques in real-life software development. To address the issue, we aim to apply the Non-Autoregressive(NAR) method to the APR task, which can output target code in a parallel manner to avoid huge inference delays. To effectively adapt the NAR manner for the APR task, we in this paper propose NARRepair, the first customized NAR code generation model for the APR task. The NARRepair features three major novelties, including 1) using repair actions to alleviate the over-correction issue, 2) extracting dependency information from AST to alleviate the issue of lacking inter-word dependency information, 3) employing two-stage decoding to alleviate the issue of lacking contextual information. We evaluated NARRepair on three widely used datasets in the APR community, and the results show that our technique can significantly improve the inference speed while maintaining high repair accuracy.
☆ Carrot and Stick: Inducing Self-Motivation with Positive & Negative Feedback
Positive thinking is thought to be an important component of self-motivation in various practical fields such as education and the workplace. Previous work, including sentiment transfer and positive reframing, has focused on the positive side of language. However, self-motivation that drives people to reach their goals has not yet been studied from a computational perspective. Moreover, negative feedback has not yet been explored, even though positive and negative feedback are both necessary to grow self-motivation. To facilitate self-motivation, we propose CArrot and STICk (CASTIC) dataset, consisting of 12,590 sentences with 5 different strategies for enhancing self-motivation. Our data and code are publicly available at here.
comment: 10 pages, 8 figures
☆ $\text{Alpha}^2$: Discovering Logical Formulaic Alphas using Deep Reinforcement Learning
Alphas are pivotal in providing signals for quantitative trading. The industry highly values the discovery of formulaic alphas for their interpretability and ease of analysis, compared with the expressive yet overfitting-prone black-box alphas. In this work, we focus on discovering formulaic alphas. Prior studies on automatically generating a collection of formulaic alphas were mostly based on genetic programming (GP), which is known to suffer from the problems of being sensitive to the initial population, converting to local optima, and slow computation speed. Recent efforts employing deep reinforcement learning (DRL) for alpha discovery have not fully addressed key practical considerations such as alpha correlations and validity, which are crucial for their effectiveness. In this work, we propose a novel framework for alpha discovery using DRL by formulating the alpha discovery process as program construction. Our agent, $\text{Alpha}^2$, assembles an alpha program optimized for an evaluation metric. A search algorithm guided by DRL navigates through the search space based on value estimates for potential alpha outcomes. The evaluation metric encourages both the performance and the diversity of alphas for a better final trading strategy. Our formulation of searching alphas also brings the advantage of pre-calculation dimensional analysis, ensuring the logical soundness of alphas, and pruning the vast search space to a large extent. Empirical experiments on real-world stock markets demonstrates $\text{Alpha}^2$'s capability to identify a diverse set of logical and effective alphas, which significantly improves the performance of the final trading strategy. The code of our method is available at https://github.com/x35f/alpha2.
☆ UNICAD: A Unified Approach for Attack Detection, Noise Reduction and Novel Class Identification
As the use of Deep Neural Networks (DNNs) becomes pervasive, their vulnerability to adversarial attacks and limitations in handling unseen classes poses significant challenges. The state-of-the-art offers discrete solutions aimed to tackle individual issues covering specific adversarial attack scenarios, classification or evolving learning. However, real-world systems need to be able to detect and recover from a wide range of adversarial attacks without sacrificing classification accuracy and to flexibly act in {\bf unseen} scenarios. In this paper, UNICAD, is proposed as a novel framework that integrates a variety of techniques to provide an adaptive solution. For the targeted image classification, UNICAD achieves accurate image classification, detects unseen classes, and recovers from adversarial attacks using Prototype and Similarity-based DNNs with denoising autoencoders. Our experiments performed on the CIFAR-10 dataset highlight UNICAD's effectiveness in adversarial mitigation and unseen class classification, outperforming traditional models.
☆ OTCE: Hybrid SSM and Attention with Cross Domain Mixture of Experts to construct Observer-Thinker-Conceiver-Expresser
Recent research has shown that combining Mamba with Transformer architecture, which has selective state space and quadratic self-attention mechanism, outperforms using Mamba or Transformer architecture alone in language modeling tasks. The quadratic self-attention mechanism effectively alleviates the shortcomings of selective state space in handling long-term dependencies of any element in the sequence. We propose a position information injection method that connects the selective state space model with the quadratic attention, and integrates these two architectures with hybrid experts with cross-sharing domains, so that we can enjoy the advantages of both. We design a new architecture with a more biomimetic idea: Observer-Thinker-Conceiver-Expresser (OTCE), which can compete with well-known medium-scale open-source language models on a small scale in language modeling tasks.
☆ Cross-domain Transfer of Valence Preferences via a Meta-optimization Approach
Cross-domain recommendation offers a potential avenue for alleviating data sparsity and cold-start problems. Embedding and mapping, as a classic cross-domain research genre, aims to identify a common mapping function to perform representation transformation between two domains. Nevertheless, previous coarse-grained preference representations, non-personalized mapping functions, and excessive reliance on overlapping users limit their performance, especially in scenarios where overlapping users are sparse. To address aforementioned challenges, we propose a novel cross-domain approach, namely CVPM. CVPM formalizes cross-domain interest transfer as a hybrid architecture of parametric meta-learning and self-supervised learning, which not only transfers user preferences at a finer level, but also enables signal enhancement with the knowledge of non-overlapping users. Specifically, with deep insights into user preferences and valence preference theory, we believe that there exists significant difference between users' positive preferences and negative behaviors, and thus employ differentiated encoders to learn their distributions. In particular, we further utilize the pre-trained model and item popularity to sample pseudo-interaction items to ensure the integrity of both distributions. To guarantee the personalization of preference transfer, we treat each user's mapping as two parts, the common transformation and the personalized bias, where the network used to generate the personalized bias is output by a meta-learner. Furthermore, in addition to the supervised loss for overlapping users, we design contrastive tasks for non-overlapping users from both group and individual-levels to avoid model skew and enhance the semantics of representations. Exhaustive data analysis and extensive experimental results demonstrate the effectiveness and advancement of our proposed framework.
☆ Towards Comprehensive Preference Data Collection for Reward Modeling
Reinforcement Learning from Human Feedback (RLHF) facilitates the alignment of large language models (LLMs) with human preferences, thereby enhancing the quality of responses generated. A critical component of RLHF is the reward model, which is trained on preference data and outputs a scalar reward during the inference stage. However, the collection of preference data still lacks thorough investigation. Recent studies indicate that preference data is collected either by AI or humans, where chosen and rejected instances are identified among pairwise responses. We question whether this process effectively filters out noise and ensures sufficient diversity in collected data. To address these concerns, for the first time, we propose a comprehensive framework for preference data collection, decomposing the process into four incremental steps: Prompt Generation, Response Generation, Response Filtering, and Human Labeling. This structured approach ensures the collection of high-quality preferences while reducing reliance on human labor. We conducted comprehensive experiments based on the data collected at different stages, demonstrating the effectiveness of the proposed data collection method.
☆ Emerging NeoHebbian Dynamics in Forward-Forward Learning: Implications for Neuromorphic Computing
Advances in neural computation have predominantly relied on the gradient backpropagation algorithm (BP). However, the recent shift towards non-stationary data modeling has highlighted the limitations of this heuristic, exposing that its adaptation capabilities are far from those seen in biological brains. Unlike BP, where weight updates are computed through a reverse error propagation path, Hebbian learning dynamics provide synaptic updates using only information within the layer itself. This has spurred interest in biologically plausible learning algorithms, hypothesized to overcome BP's shortcomings. In this context, Hinton recently introduced the Forward-Forward Algorithm (FFA), which employs local learning rules for each layer and has empirically proven its efficacy in multiple data modeling tasks. In this work we argue that when employing a squared Euclidean norm as a goodness function driving the local learning, the resulting FFA is equivalent to a neo-Hebbian Learning Rule. To verify this result, we compare the training behavior of FFA in analog networks with its Hebbian adaptation in spiking neural networks. Our experiments demonstrate that both versions of FFA produce similar accuracy and latent distributions. The findings herein reported provide empirical evidence linking biological learning rules with currently used training algorithms, thus paving the way towards extrapolating the positive outcomes from FFA to Hebbian learning rules. Simultaneously, our results imply that analog networks trained under FFA could be directly applied to neuromorphic computing, leading to reduced energy usage and increased computational speed.
☆ Seeking Certainty In Uncertainty: Dual-Stage Unified Framework Solving Uncertainty in Dynamic Facial Expression Recognition
The contemporary state-of-the-art of Dynamic Facial Expression Recognition (DFER) technology facilitates remarkable progress by deriving emotional mappings of facial expressions from video content, underpinned by training on voluminous datasets. Yet, the DFER datasets encompass a substantial volume of noise data. Noise arises from low-quality captures that defy logical labeling, and instances that suffer from mislabeling due to annotation bias, engendering two principal types of uncertainty: the uncertainty regarding data usability and the uncertainty concerning label reliability. Addressing the two types of uncertainty, we have meticulously crafted a two-stage framework aiming at \textbf{S}eeking \textbf{C}ertain data \textbf{I}n extensive \textbf{U}ncertain data (SCIU). This initiative aims to purge the DFER datasets of these uncertainties, thereby ensuring that only clean, verified data is employed in training processes. To mitigate the issue of low-quality samples, we introduce the Coarse-Grained Pruning (CGP) stage, which assesses sample weights and prunes those deemed unusable due to their low weight. For samples with incorrect annotations, the Fine-Grained Correction (FGC) stage evaluates prediction stability to rectify mislabeled data. Moreover, SCIU is conceived as a universally compatible, plug-and-play framework, tailored to integrate seamlessly with prevailing DFER methodologies. Rigorous experiments across prevalent DFER datasets and against numerous benchmark methods substantiates SCIU's capacity to markedly elevate performance metrics.
☆ InterCLIP-MEP: Interactive CLIP and Memory-Enhanced Predictor for Multi-modal Sarcasm Detection
The prevalence of sarcasm in social media, conveyed through text-image combinations, presents significant challenges for sentiment analysis and intention mining. Current multi-modal sarcasm detection methods have been proven to struggle with biases from spurious cues, leading to a superficial understanding of the complex interactions between text and image. To address these issues, we propose InterCLIP-MEP, a robust framework for multi-modal sarcasm detection. InterCLIP-MEP introduces a refined variant of CLIP, Interactive CLIP (InterCLIP), as the backbone, enhancing sample representations by embedding cross-modality information in each encoder. Furthermore, a novel training strategy is designed to adapt InterCLIP for a Memory-Enhanced Predictor (MEP). MEP uses dynamic dual-channel memory to store valuable historical knowledge of test samples and then leverages this memory as a non-parametric classifier to derive the final prediction. By using InterCLIP to encode text-image interactions more effectively and incorporating MEP, InterCLIP-MEP offers a more robust recognition of multi-modal sarcasm. Experiments demonstrate that InterCLIP-MEP achieves state-of-the-art performance on the MMSD2.0 benchmark. Code and data are available at [https://github.com/CoderChen01/InterCLIP-MEP](https://github.com/CoderChen01/InterCLIP-MEP).
comment: 8 pages, 6 figures, 6 tables
☆ Guardrails for avoiding harmful medical product recommendations and off-label promotion in generative AI models CVPR 2024
Generative AI (GenAI) models have demonstrated remarkable capabilities in a wide variety of medical tasks. However, as these models are trained using generalist datasets with very limited human oversight, they can learn uses of medical products that have not been adequately evaluated for safety and efficacy, nor approved by regulatory agencies. Given the scale at which GenAI may reach users, unvetted recommendations pose a public health risk. In this work, we propose an approach to identify potentially harmful product recommendations, and demonstrate it using a recent multimodal large language model.
comment: CVPR 2024 Responsible Generative AI (ReGenAI) workshop
☆ Learning in Wilson-Cowan model for metapopulation
The Wilson-Cowan model for metapopulation, a Neural Mass Network Model, treats different subcortical regions of the brain as connected nodes, with connections representing various types of structural, functional, or effective neuronal connectivity between these regions. Each region comprises interacting populations of excitatory and inhibitory cells, consistent with the standard Wilson-Cowan model. By incorporating stable attractors into such a metapopulation model's dynamics, we transform it into a learning algorithm capable of achieving high image and text classification accuracy. We test it on MNIST and Fashion MNIST, in combination with convolutional neural networks, on CIFAR-10 and TF-FLOWERS, and, in combination with a transformer architecture (BERT), on IMDB, always showing high classification accuracy. These numerical evaluations illustrate that minimal modifications to the Wilson-Cowan model for metapopulation can reveal unique and previously unobserved dynamics.
☆ Theory on Mixture-of-Experts in Continual Learning
Continual learning (CL) has garnered significant attention because of its ability to adapt to new tasks that arrive over time. Catastrophic forgetting (of old tasks) has been identified as a major issue in CL, as the model adapts to new tasks. The Mixture-of-Experts (MoE) model has recently been shown to effectively mitigate catastrophic forgetting in CL, by employing a gating network to sparsify and distribute diverse tasks among multiple experts. However, there is a lack of theoretical analysis of MoE and its impact on the learning performance in CL. This paper provides the first theoretical results to characterize the impact of MoE in CL via the lens of overparameterized linear regression tasks. We establish the benefit of MoE over a single expert by proving that the MoE model can diversify its experts to specialize in different tasks, while its router learns to select the right expert for each task and balance the loads across all experts. Our study further suggests an intriguing fact that the MoE in CL needs to terminate the update of the gating network after sufficient training rounds to attain system convergence, which is not needed in the existing MoE studies that do not consider the continual task arrival. Furthermore, we provide explicit expressions for the expected forgetting and overall generalization error to characterize the benefit of MoE in the learning performance in CL. Interestingly, adding more experts requires additional rounds before convergence, which may not enhance the learning performance. Finally, we conduct experiments on both synthetic and real datasets to extend these insights from linear models to deep neural networks (DNNs), which also shed light on the practical algorithm design for MoE in CL.
☆ Dynamic Pseudo Label Optimization in Point-Supervised Nuclei Segmentation MICCAI2024
Deep learning has achieved impressive results in nuclei segmentation, but the massive requirement for pixel-wise labels remains a significant challenge. To alleviate the annotation burden, existing methods generate pseudo masks for model training using point labels. However, the generated masks are inevitably different from the ground truth, and these dissimilarities are not handled reasonably during the network training, resulting in the subpar performance of the segmentation model. To tackle this issue, we propose a framework named DoNuSeg, enabling \textbf{D}ynamic pseudo label \textbf{O}ptimization in point-supervised \textbf{Nu}clei \textbf{Seg}mentation. Specifically, DoNuSeg takes advantage of class activation maps (CAMs) to adaptively capture regions with semantics similar to annotated points. To leverage semantic diversity in the hierarchical feature levels, we design a dynamic selection module to choose the optimal one among CAMs from different encoder blocks as pseudo masks. Meanwhile, a CAM-guided contrastive module is proposed to further enhance the accuracy of pseudo masks. In addition to exploiting the semantic information provided by CAMs, we consider location priors inherent to point labels, developing a task-decoupled structure for effectively differentiating nuclei. Extensive experiments demonstrate that DoNuSeg outperforms state-of-the-art point-supervised methods. The code is available at https://github.com/shinning0821/MICCAI24-DoNuSeg.
comment: early accepted by MICCAI2024
☆ Fault Detection for agents on power grid topology optimization: A Comprehensive analysis
The topology optimization of transmission networks using Deep Reinforcement Learning (DRL) has increasingly come into focus. Various researchers have proposed different DRL agents, which are often benchmarked on the Grid2Op environment from the Learning to Run a Power Network (L2RPN) challenges. The environments have many advantages with their realistic chronics and underlying power flow backends. However, the interpretation of agent survival or failure is not always clear, as there are a variety of potential causes. In this work, we focus on the failures of the power grid to identify patterns and detect them a priori. We collect the failed chronics of three different agents on the WCCI 2022 L2RPN environment, totaling about 40k data points. By clustering, we are able to detect five distinct clusters, identifying different failure types. Further, we propose a multi-class prediction approach to detect failures beforehand and evaluate five different models. Here, the Light Gradient-Boosting Machine (LightGBM) shows the best performance, with an accuracy of 86%. It also correctly identifies in 91% of the time failure and survival observations. Finally, we provide a detailed feature importance analysis that identifies critical features and regions in the grid.
comment: 11 Pages plus references and appendix. The appendix consist of additional material of the paper and is not included in the initial submission
☆ Memory-Enhanced Neural Solvers for Efficient Adaptation in Combinatorial Optimization
Combinatorial Optimization is crucial to numerous real-world applications, yet still presents challenges due to its (NP-)hard nature. Amongst existing approaches, heuristics often offer the best trade-off between quality and scalability, making them suitable for industrial use. While Reinforcement Learning (RL) offers a flexible framework for designing heuristics, its adoption over handcrafted heuristics remains incomplete within industrial solvers. Existing learned methods still lack the ability to adapt to specific instances and fully leverage the available computational budget. The current best methods either rely on a collection of pre-trained policies, or on data-inefficient fine-tuning; hence failing to fully utilize newly available information within the constraints of the budget. In response, we present MEMENTO, an RL approach that leverages memory to improve the adaptation of neural solvers at inference time. MEMENTO enables updating the action distribution dynamically based on the outcome of previous decisions. We validate its effectiveness on benchmark problems, in particular Traveling Salesman and Capacitated Vehicle Routing, demonstrating it can successfully be combined with standard methods to boost their performance under a given budget, both in and out-of-distribution, improving their performance on all 12 evaluated tasks.
☆ Exploring Cross-Domain Few-Shot Classification via Frequency-Aware Prompting
Cross-Domain Few-Shot Learning has witnessed great stride with the development of meta-learning. However, most existing methods pay more attention to learning domain-adaptive inductive bias (meta-knowledge) through feature-wise manipulation or task diversity improvement while neglecting the phenomenon that deep networks tend to rely more on high-frequency cues to make the classification decision, which thus degenerates the robustness of learned inductive bias since high-frequency information is vulnerable and easy to be disturbed by noisy information. Hence in this paper, we make one of the first attempts to propose a Frequency-Aware Prompting method with mutual attention for Cross-Domain Few-Shot classification, which can let networks simulate the human visual perception of selecting different frequency cues when facing new recognition tasks. Specifically, a frequency-aware prompting mechanism is first proposed, in which high-frequency components of the decomposed source image are switched either with normal distribution sampling or zeroing to get frequency-aware augment samples. Then, a mutual attention module is designed to learn generalizable inductive bias under CD-FSL settings. More importantly, the proposed method is a plug-and-play module that can be directly applied to most off-the-shelf CD-FLS methods. Experimental results on CD-FSL benchmarks demonstrate the effectiveness of our proposed method as well as robustly improve the performance of existing CD-FLS methods. Resources at https://github.com/tinkez/FAP_CDFSC.
☆ PenSLR: Persian end-to-end Sign Language Recognition Using Ensembling
Sign Language Recognition (SLR) is a fast-growing field that aims to fill the communication gaps between the hearing-impaired and people without hearing loss. Existing solutions for Persian Sign Language (PSL) are limited to word-level interpretations, underscoring the need for more advanced and comprehensive solutions. Moreover, previous work on other languages mainly focuses on manipulating the neural network architectures or hardware configurations instead of benefiting from the aggregated results of multiple models. In this paper, we introduce PenSLR, a glove-based sign language system consisting of an Inertial Measurement Unit (IMU) and five flexible sensors powered by a deep learning framework capable of predicting variable-length sequences. We achieve this in an end-to-end manner by leveraging the Connectionist Temporal Classification (CTC) loss function, eliminating the need for segmentation of input signals. To further enhance its capabilities, we propose a novel ensembling technique by leveraging a multiple sequence alignment algorithm known as Star Alignment. Furthermore, we introduce a new PSL dataset, including 16 PSL signs with more than 3000 time-series samples in total. We utilize this dataset to evaluate the performance of our system based on four word-level and sentence-level metrics. Our evaluations show that PenSLR achieves a remarkable word accuracy of 94.58% and 96.70% in subject-independent and subject-dependent setups, respectively. These achievements are attributable to our ensembling algorithm, which not only boosts the word-level performance by 0.51% and 1.32% in the respective scenarios but also yields significant enhancements of 1.46% and 4.00%, respectively, in sentence-level accuracy.
☆ Automatically Generating UI Code from Screenshot: A Divide-and-Conquer-Based Approach
Websites are critical in today's digital world, with over 1.11 billion currently active and approximately 252,000 new sites launched daily. Converting website layout design into functional UI code is a time-consuming yet indispensable step of website development. Manual methods of converting visual designs into functional code present significant challenges, especially for non-experts. To explore automatic design-to-code solutions, we first conduct a motivating study on GPT-4o and identify three types of issues in generating UI code: element omission, element distortion, and element misarrangement. We further reveal that a focus on smaller visual segments can help multimodal large language models (MLLMs) mitigate these failures in the generation process. In this paper, we propose DCGen, a divide-and-conquer-based approach to automate the translation of webpage design to UI code. DCGen starts by dividing screenshots into manageable segments, generating descriptions for each segment, and then reassembling them into complete UI code for the entire screenshot. We conduct extensive testing with a dataset comprised of real-world websites and various MLLMs and demonstrate that DCGen achieves up to a 14% improvement in visual similarity over competing methods. To the best of our knowledge, DCGen is the first segment-aware prompt-based approach for generating UI code directly from screenshots.
☆ On the Transformations across Reward Model, Parameter Update, and In-Context Prompt
Despite the general capabilities of pre-trained large language models (LLMs), they still need further adaptation to better serve practical applications. In this paper, we demonstrate the interchangeability of three popular and distinct adaptation tools: parameter updating, reward modeling, and in-context prompting. This interchangeability establishes a triangular framework with six transformation directions, each of which facilitates a variety of applications. Our work offers a holistic view that unifies numerous existing studies and suggests potential research directions. We envision our work as a useful roadmap for future research on LLMs.
☆ Towards Lightweight Graph Neural Network Search with Curriculum Graph Sparsification KDD 2024
Graph Neural Architecture Search (GNAS) has achieved superior performance on various graph-structured tasks. However, existing GNAS studies overlook the applications of GNAS in resource-constraint scenarios. This paper proposes to design a joint graph data and architecture mechanism, which identifies important sub-architectures via the valuable graph data. To search for optimal lightweight Graph Neural Networks (GNNs), we propose a Lightweight Graph Neural Architecture Search with Graph SparsIfication and Network Pruning (GASSIP) method. In particular, GASSIP comprises an operation-pruned architecture search module to enable efficient lightweight GNN search. Meanwhile, we design a novel curriculum graph data sparsification module with an architecture-aware edge-removing difficulty measurement to help select optimal sub-architectures. With the aid of two differentiable masks, we iteratively optimize these two modules to efficiently search for the optimal lightweight architecture. Extensive experiments on five benchmarks demonstrate the effectiveness of GASSIP. Particularly, our method achieves on-par or even higher node classification performance with half or fewer model parameters of searched GNNs and a sparser graph.
comment: Accepted by KDD 2024. The two first authors made equal contributions
☆ Directed Domain Fine-Tuning: Tailoring Separate Modalities for Specific Training Tasks
Large language models (LLMs) and large visual language models (LVLMs) have been at the forefront of the artificial intelligence field, particularly for tasks like text generation, video captioning, and question-answering. Typically, it is more applicable to train these models on broader knowledge bases or datasets to increase generalizability, learn relationships between topics, and recognize patterns. Instead, we propose to provide instructional datasets specific to the task of each modality within a distinct domain and then fine-tune the parameters of the model using LORA. With our approach, we can eliminate all noise irrelevant to the given task while also ensuring that the model generates with enhanced precision. For this work, we use Video-LLaVA to generate recipes given cooking videos without transcripts. Video-LLaVA's multimodal architecture allows us to provide cooking images to its image encoder, cooking videos to its video encoder, and general cooking questions to its text encoder. Thus, we aim to remove all noise unrelated to cooking while improving our model's capabilities to generate specific ingredient lists and detailed instructions. As a result, our approach to fine-tuning Video-LLaVA leads to gains over the baseline Video-LLaVA by 2% on the YouCook2 dataset. While this may seem like a marginal increase, our model trains on an image instruction dataset 2.5% the size of Video-LLaVA's and a video instruction dataset 23.76% of Video-LLaVA's.
Prompt-Consistency Image Generation (PCIG): A Unified Framework Integrating LLMs, Knowledge Graphs, and Controllable Diffusion Models
The rapid advancement of Text-to-Image(T2I) generative models has enabled the synthesis of high-quality images guided by textual descriptions. Despite this significant progress, these models are often susceptible in generating contents that contradict the input text, which poses a challenge to their reliability and practical deployment. To address this problem, we introduce a novel diffusion-based framework to significantly enhance the alignment of generated images with their corresponding descriptions, addressing the inconsistency between visual output and textual input. Our framework is built upon a comprehensive analysis of inconsistency phenomena, categorizing them based on their manifestation in the image. Leveraging a state-of-the-art large language module, we first extract objects and construct a knowledge graph to predict the locations of these objects in potentially generated images. We then integrate a state-of-the-art controllable image generation model with a visual text generation module to generate an image that is consistent with the original prompt, guided by the predicted object locations. Through extensive experiments on an advanced multimodal hallucination benchmark, we demonstrate the efficacy of our approach in accurately generating the images without the inconsistency with the original prompt. The code can be accessed via https://github.com/TruthAI-Lab/PCIG.
☆ Pruning via Merging: Compressing LLMs via Manifold Alignment Based Layer Merging
While large language models (LLMs) excel in many domains, their complexity and scale challenge deployment in resource-limited environments. Current compression techniques, such as parameter pruning, often fail to effectively utilize the knowledge from pruned parameters. To address these challenges, we propose Manifold-Based Knowledge Alignment and Layer Merging Compression (MKA), a novel approach that uses manifold learning and the Normalized Pairwise Information Bottleneck (NPIB) measure to merge similar layers, reducing model size while preserving essential performance. We evaluate MKA on multiple benchmark datasets and various LLMs. Our findings show that MKA not only preserves model performance but also achieves substantial compression ratios, outperforming traditional pruning methods. Moreover, when coupled with quantization, MKA delivers even greater compression. Specifically, on the MMLU dataset using the Llama3-8B model, MKA achieves a compression ratio of 43.75% with a minimal performance decrease of only 2.82\%. The proposed MKA method offers a resource-efficient and performance-preserving model compression technique for LLMs.
☆ Multimodal Graph Benchmark
Associating unstructured data with structured information is crucial for real-world tasks that require relevance search. However, existing graph learning benchmarks often overlook the rich semantic information associate with each node. To bridge such gap, we introduce the Multimodal Graph Benchmark (MM-GRAPH), the first comprehensive multi-modal graph benchmark that incorporates both textual and visual information. MM-GRAPH surpasses previous efforts, which have primarily focused on text-attributed graphs with various connectivity patterns. MM-GRAPH consists of five graph learning datasets of various scales that are appropriate for different learning tasks. Their multimodal node features, enabling a more comprehensive evaluation of graph learning algorithms in real-world scenarios. To facilitate research on multimodal graph learning, we further provide an extensive study on the performance of various graph neural networks in the presence of features from various modalities. MM-GRAPH aims to foster research on multimodal graph learning and drive the development of more advanced and robust graph learning algorithms. By providing a diverse set of datasets and benchmarks, MM-GRAPH enables researchers to evaluate and compare their models in realistic settings, ultimately leading to improved performance on real-world applications that rely on multimodal graph data.
comment: https://mm-graph-benchmark.github.io/
☆ Does Cross-Cultural Alignment Change the Commonsense Morality of Language Models? ACL 2024
Alignment of the language model with human preferences is a common approach to making a language model useful to end users. However, most alignment work is done in English, and human preference datasets are dominated by English, reflecting only the preferences of English-speaking annotators. Nevertheless, it is common practice to use the English preference data, either directly or by translating it into the target language, when aligning a multilingual language model. The question is whether such an alignment strategy marginalizes the preference of non-English speaking users. To this end, we investigate the effect of aligning Japanese language models with (mostly) English resources. In particular, we focus on evaluating whether the commonsense morality of the resulting fine-tuned models is aligned with Japanese culture using the JCommonsenseMorality (JCM) and ETHICS datasets. The experimental results show that the fine-tuned model outperforms the SFT model. However, it does not demonstrate the same level of improvement as a model fine-tuned using the JCM, suggesting that while some aspects of commonsense morality are transferable, others may not be.
comment: The 2nd Workshop on Cross-Cultural Considerations in NLP (C3NLP) at ACL 2024
☆ Anomaly Detection of Tabular Data Using LLMs
Large language models (LLMs) have shown their potential in long-context understanding and mathematical reasoning. In this paper, we study the problem of using LLMs to detect tabular anomalies and show that pre-trained LLMs are zero-shot batch-level anomaly detectors. That is, without extra distribution-specific model fitting, they can discover hidden outliers in a batch of data, demonstrating their ability to identify low-density data regions. For LLMs that are not well aligned with anomaly detection and frequently output factual errors, we apply simple yet effective data-generating processes to simulate synthetic batch-level anomaly detection datasets and propose an end-to-end fine-tuning strategy to bring out the potential of LLMs in detecting real anomalies. Experiments on a large anomaly detection benchmark (ODDS) showcase i) GPT-4 has on-par performance with the state-of-the-art transductive learning-based anomaly detection methods and ii) the efficacy of our synthetic dataset and fine-tuning strategy in aligning LLMs to this task.
comment: accepted at the Anomaly Detection with Foundation Models workshop
☆ UBiSS: A Unified Framework for Bimodal Semantic Summarization of Videos ICMR'24
With the surge in the amount of video data, video summarization techniques, including visual-modal(VM) and textual-modal(TM) summarization, are attracting more and more attention. However, unimodal summarization inevitably loses the rich semantics of the video. In this paper, we focus on a more comprehensive video summarization task named Bimodal Semantic Summarization of Videos (BiSSV). Specifically, we first construct a large-scale dataset, BIDS, in (video, VM-Summary, TM-Summary) triplet format. Unlike traditional processing methods, our construction procedure contains a VM-Summary extraction algorithm aiming to preserve the most salient content within long videos. Based on BIDS, we propose a Unified framework UBiSS for the BiSSV task, which models the saliency information in the video and generates a TM-summary and VM-summary simultaneously. We further optimize our model with a list-wise ranking-based objective to improve its capacity to capture highlights. Lastly, we propose a metric, $NDCG_{MS}$, to provide a joint evaluation of the bimodal summary. Experiments show that our unified framework achieves better performance than multi-stage summarization pipelines. Code and data are available at https://github.com/MeiYutingg/UBiSS.
comment: Accepted by ACM International Conference on Multimedia Retrieval (ICMR'24)
☆ Compensate Quantization Errors: Make Weights Hierarchical to Compensate Each Other
Emergent Large Language Models (LLMs) use their extraordinary performance and powerful deduction capacity to discern from traditional language models. However, the expenses of computational resources and storage for these LLMs are stunning, quantization then arises as a trending conversation. To address accuracy decay caused by quantization, two streams of works in post-training quantization methods stand out. One uses other weights to compensate existing quantization error, while the other transfers the quantization difficulty to other parts in the model. Combining both merits, we introduce Learnable Singular value Increment (LSI) as an advanced solution. LSI uses Singular Value Decomposition to extract singular values of the weights and make them learnable to help weights compensate each other conditioned on activation. Incorporating LSI with existing techniques, we achieve state-of-the-art performance in diverse quantization settings, no matter in weight-only, weight-activation or extremely low bit scenarios. By unleashing the potential of LSI, efficient finetuning on quantized model is no longer a prohibitive problem.
comment: Efficient quantization method
☆ Relaxing Continuous Constraints of Equivariant Graph Neural Networks for Physical Dynamics Learning
Incorporating Euclidean symmetries (e.g. rotation equivariance) as inductive biases into graph neural networks has improved their generalization ability and data efficiency in unbounded physical dynamics modeling. However, in various scientific and engineering applications, the symmetries of dynamics are frequently discrete due to the boundary conditions. Thus, existing GNNs either overlook necessary symmetry, resulting in suboptimal representation ability, or impose excessive equivariance, which fails to generalize to unobserved symmetric dynamics. In this work, we propose a general Discrete Equivariant Graph Neural Network (DEGNN) that guarantees equivariance to a given discrete point group. Specifically, we show that such discrete equivariant message passing could be constructed by transforming geometric features into permutation-invariant embeddings. Through relaxing continuous equivariant constraints, DEGNN can employ more geometric feature combinations to approximate unobserved physical object interaction functions. Two implementation approaches of DEGNN are proposed based on ranking or pooling permutation-invariant functions. We apply DEGNN to various physical dynamics, ranging from particle, molecular, crowd to vehicle dynamics. In twenty scenarios, DEGNN significantly outperforms existing state-of-the-art approaches. Moreover, we show that DEGNN is data efficient, learning with less data, and can generalize across scenarios such as unobserved orientation.
☆ LangSuitE: Planning, Controlling and Interacting with Large Language Models in Embodied Text Environments
Recent advances in Large Language Models (LLMs) have shown inspiring achievements in constructing autonomous agents that rely on language descriptions as inputs. However, it remains unclear how well LLMs can function as few-shot or zero-shot embodied agents in dynamic interactive environments. To address this gap, we introduce LangSuitE, a versatile and simulation-free testbed featuring 6 representative embodied tasks in textual embodied worlds. Compared with previous LLM-based testbeds, LangSuitE (i) offers adaptability to diverse environments without multiple simulation engines, (ii) evaluates agents' capacity to develop ``internalized world knowledge'' with embodied observations, and (iii) allows easy customization of communication and action strategies. To address the embodiment challenge, we devise a novel chain-of-thought (CoT) schema, EmMem, which summarizes embodied states w.r.t. history information. Comprehensive benchmark results illustrate challenges and insights of embodied planning. LangSuitE represents a significant step toward building embodied generalists in the context of language models.
☆ Combining Supervised Learning and Reinforcement Learning for Multi-Label Classification Tasks with Partial Labels
Traditional supervised learning heavily relies on human-annotated datasets, especially in data-hungry neural approaches. However, various tasks, especially multi-label tasks like document-level relation extraction, pose challenges in fully manual annotation due to the specific domain knowledge and large class sets. Therefore, we address the multi-label positive-unlabelled learning (MLPUL) problem, where only a subset of positive classes is annotated. We propose Mixture Learner for Partially Annotated Classification (MLPAC), an RL-based framework combining the exploration ability of reinforcement learning and the exploitation ability of supervised learning. Experimental results across various tasks, including document-level relation extraction, multi-label image classification, and binary PU learning, demonstrate the generalization and effectiveness of our framework.
☆ Reducing Fine-Tuning Memory Overhead by Approximate and Memory-Sharing Backpropagation ICML 2024
Fine-tuning pretrained large models to downstream tasks is an important problem, which however suffers from huge memory overhead due to large-scale parameters. This work strives to reduce memory overhead in fine-tuning from perspectives of activation function and layer normalization. To this end, we propose the Approximate Backpropagation (Approx-BP) theory, which provides the theoretical feasibility of decoupling the forward and backward passes. We apply our Approx-BP theory to backpropagation training and derive memory-efficient alternatives of GELU and SiLU activation functions, which use derivative functions of ReLUs in the backward pass while keeping their forward pass unchanged. In addition, we introduce a Memory-Sharing Backpropagation strategy, which enables the activation memory to be shared by two adjacent layers, thereby removing activation memory usage redundancy. Our method neither induces extra computation nor reduces training efficiency. We conduct extensive experiments with pretrained vision and language models, and the results demonstrate that our proposal can reduce up to $\sim$$30\%$ of the peak memory usage. Our code is released at https://github.com/yyyyychen/LowMemoryBP.
comment: 25 pages, ICML 2024 Accepted
☆ Repairing Catastrophic-Neglect in Text-to-Image Diffusion Models via Attention-Guided Feature Enhancement
Text-to-Image Diffusion Models (T2I DMs) have garnered significant attention for their ability to generate high-quality images from textual descriptions. However, these models often produce images that do not fully align with the input prompts, resulting in semantic inconsistencies. The most prominent issue among these semantic inconsistencies is catastrophic-neglect, where the images generated by T2I DMs miss key objects mentioned in the prompt. We first conduct an empirical study on this issue, exploring the prevalence of catastrophic-neglect, potential mitigation strategies with feature enhancement, and the insights gained. Guided by the empirical findings, we propose an automated repair approach named Patcher to address catastrophic-neglect in T2I DMs. Specifically, Patcher first determines whether there are any neglected objects in the prompt, and then applies attention-guided feature enhancement to these neglected objects, resulting in a repaired prompt. Experimental results on three versions of Stable Diffusion demonstrate that Patcher effectively repairs the issue of catastrophic-neglect, achieving 10.1%-16.3% higher Correct Rate in image generation compared to baselines.
comment: 11 pages, 3 figures
☆ One Thousand and One Pairs: A "novel" challenge for long-context language models
Synthetic long-context LLM benchmarks (e.g., "needle-in-the-haystack") test only surface-level retrieval capabilities, but how well can long-context LLMs retrieve, synthesize, and reason over information across book-length inputs? We address this question by creating NoCha, a dataset of 1,001 minimally different pairs of true and false claims about 67 recently-published English fictional books, written by human readers of those books. In contrast to existing long-context benchmarks, our annotators confirm that the largest share of pairs in NoCha require global reasoning over the entire book to verify. Our experiments show that while human readers easily perform this task, it is enormously challenging for all ten long-context LLMs that we evaluate: no open-weight model performs above random chance (despite their strong performance on synthetic benchmarks), while GPT-4o achieves the highest accuracy at 55.8%. Further analysis reveals that (1) on average, models perform much better on pairs that require only sentence-level retrieval vs. global reasoning; (2) model-generated explanations for their decisions are often inaccurate even for correctly-labeled claims; and (3) models perform substantially worse on speculative fiction books that contain extensive world-building. The methodology proposed in NoCha allows for the evolution of the benchmark dataset and the easy analysis of future models.
comment: preprint, 29 pages
☆ Video-Infinity: Distributed Long Video Generation
Diffusion models have recently achieved remarkable results for video generation. Despite the encouraging performances, the generated videos are typically constrained to a small number of frames, resulting in clips lasting merely a few seconds. The primary challenges in producing longer videos include the substantial memory requirements and the extended processing time required on a single GPU. A straightforward solution would be to split the workload across multiple GPUs, which, however, leads to two issues: (1) ensuring all GPUs communicate effectively to share timing and context information, and (2) modifying existing video diffusion models, which are usually trained on short sequences, to create longer videos without additional training. To tackle these, in this paper we introduce Video-Infinity, a distributed inference pipeline that enables parallel processing across multiple GPUs for long-form video generation. Specifically, we propose two coherent mechanisms: Clip parallelism and Dual-scope attention. Clip parallelism optimizes the gathering and sharing of context information across GPUs which minimizes communication overhead, while Dual-scope attention modulates the temporal self-attention to balance local and global contexts efficiently across the devices. Together, the two mechanisms join forces to distribute the workload and enable the fast generation of long videos. Under an 8 x Nvidia 6000 Ada GPU (48G) setup, our method generates videos up to 2,300 frames in approximately 5 minutes, enabling long video generation at a speed 100 times faster than the prior methods.
☆ User Story Tutor (UST) to Support Agile Software Developers
User Stories record what must be built in projects that use agile practices. User Stories serve both to estimate effort, generally measured in Story Points, and to plan what should be done in a Sprint. Therefore, it is essential to train software engineers on how to create simple, easily readable, and comprehensive User Stories. For that reason, we designed, implemented, applied, and evaluated a web application called User Story Tutor (UST). UST checks the description of a given User Story for readability, and if needed, recommends appropriate practices for improvement. UST also estimates a User Story effort in Story Points using Machine Learning techniques. As such UST may support the continuing education of agile development teams when writing and reviewing User Stories. UST's ease of use was evaluated by 40 agile practitioners according to the Technology Acceptance Model (TAM) and AttrakDiff. The TAM evaluation averages were good in almost all considered variables. Application of the AttrakDiff evaluation framework produced similar good results. Apparently, UST can be used with good reliability. Applying UST to assist in the construction of User Stories is a viable technique that, at the very least, can be used by agile developments to complement and enhance current User Story creation.
☆ MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention
Aligning robot behavior with human preferences is crucial for deploying embodied AI agents in human-centered environments. A promising solution is interactive imitation learning from human intervention, where a human expert observes the policy's execution and provides interventions as feedback. However, existing methods often fail to utilize the prior policy efficiently to facilitate learning, thus hindering sample efficiency. In this work, we introduce MEReQ (Maximum-Entropy Residual-Q Inverse Reinforcement Learning), designed for sample-efficient alignment from human intervention. Instead of inferring the complete human behavior characteristics, MEReQ infers a residual reward function that captures the discrepancy between the human expert's and the prior policy's underlying reward functions. It then employs Residual Q-Learning (RQL) to align the policy with human preferences using this residual reward function. Extensive evaluations on simulated and real-world tasks demonstrate that MEReQ achieves sample-efficient policy alignment from human intervention.
☆ Uncertainty-Aware Reward-Free Exploration with General Function Approximation ICML 2024
Mastering multiple tasks through exploration and learning in an environment poses a significant challenge in reinforcement learning (RL). Unsupervised RL has been introduced to address this challenge by training policies with intrinsic rewards rather than extrinsic rewards. However, current intrinsic reward designs and unsupervised RL algorithms often overlook the heterogeneous nature of collected samples, thereby diminishing their sample efficiency. To overcome this limitation, in this paper, we propose a reward-free RL algorithm called \alg. The key idea behind our algorithm is an uncertainty-aware intrinsic reward for exploring the environment and an uncertainty-weighted learning process to handle heterogeneous uncertainty in different samples. Theoretically, we show that in order to find an $\epsilon$-optimal policy, GFA-RFE needs to collect $\tilde{O} (H^2 \log N_{\mathcal F} (\epsilon) \mathrm{dim} (\mathcal F) / \epsilon^2 )$ number of episodes, where $\mathcal F$ is the value function class with covering number $N_{\mathcal F} (\epsilon)$ and generalized eluder dimension $\mathrm{dim} (\mathcal F)$. Such a result outperforms all existing reward-free RL algorithms. We further implement and evaluate GFA-RFE across various domains and tasks in the DeepMind Control Suite. Experiment results show that GFA-RFE outperforms or is comparable to the performance of state-of-the-art unsupervised RL algorithms.
comment: 32 pages, 5 figures, 4 tables, accepted by ICML 2024
☆ Confidence Regulation Neurons in Language Models
Despite their widespread use, the mechanisms by which large language models (LLMs) represent and regulate uncertainty in next-token predictions remain largely unexplored. This study investigates two critical components believed to influence this uncertainty: the recently discovered entropy neurons and a new set of components that we term token frequency neurons. Entropy neurons are characterized by an unusually high weight norm and influence the final layer normalization (LayerNorm) scale to effectively scale down the logits. Our work shows that entropy neurons operate by writing onto an unembedding null space, allowing them to impact the residual stream norm with minimal direct effect on the logits themselves. We observe the presence of entropy neurons across a range of models, up to 7 billion parameters. On the other hand, token frequency neurons, which we discover and describe here for the first time, boost or suppress each token's logit proportionally to its log frequency, thereby shifting the output distribution towards or away from the unigram distribution. Finally, we present a detailed case study where entropy neurons actively manage confidence in the setting of induction, i.e. detecting and continuing repeated subsequences.
comment: 25 pages, 14 figures
☆ Graph-Augmented LLMs for Personalized Health Insights: A Case Study in Sleep Analysis
Health monitoring systems have revolutionized modern healthcare by enabling the continuous capture of physiological and behavioral data, essential for preventive measures and early health intervention. While integrating this data with Large Language Models (LLMs) has shown promise in delivering interactive health advice, traditional methods like Retrieval-Augmented Generation (RAG) and fine-tuning often fail to fully utilize the complex, multi-dimensional, and temporally relevant data from wearable devices. These conventional approaches typically provide limited actionable and personalized health insights due to their inadequate capacity to dynamically integrate and interpret diverse health data streams. In response, this paper introduces a graph-augmented LLM framework designed to significantly enhance the personalization and clarity of health insights. Utilizing a hierarchical graph structure, the framework captures inter and intra-patient relationships, enriching LLM prompts with dynamic feature importance scores derived from a Random Forest Model. The effectiveness of this approach is demonstrated through a sleep analysis case study involving 20 college students during the COVID-19 lockdown, highlighting the potential of our model to generate actionable and personalized health insights efficiently. We leverage another LLM to evaluate the insights for relevance, comprehensiveness, actionability, and personalization, addressing the critical need for models that process and interpret complex health data effectively. Our findings show that augmenting prompts with our framework yields significant improvements in all 4 criteria. Through our framework, we can elicit well-crafted, more thoughtful responses tailored to a specific patient.
☆ Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models
As Large Language Models (LLMs) continue to exhibit remarkable performance in natural language understanding tasks, there is a crucial need to measure their ability for human-like multi-step logical reasoning. Existing logical reasoning evaluation benchmarks often focus primarily on simplistic single-step or multi-step reasoning with a limited set of inference rules. Furthermore, the lack of datasets for evaluating non-monotonic reasoning represents a crucial gap since it aligns more closely with human-like reasoning. To address these limitations, we propose Multi-LogiEval, a comprehensive evaluation dataset encompassing multi-step logical reasoning with various inference rules and depths. Multi-LogiEval covers three logic types--propositional, first-order, and non-monotonic--consisting of more than 30 inference rules and more than 60 of their combinations with various depths. Leveraging this dataset, we conduct evaluations on a range of LLMs including GPT-4, ChatGPT, Gemini-Pro, Yi, Orca, and Mistral, employing a zero-shot chain-of-thought. Experimental results show that there is a significant drop in the performance of LLMs as the reasoning steps/depth increases (average accuracy of ~68% at depth-1 to ~43% at depth-5). We further conduct a thorough investigation of reasoning chains generated by LLMs which reveals several important findings. We believe that Multi-LogiEval facilitates future research for evaluating and enhancing the logical reasoning ability of LLMs. Data is available at https://github.com/Mihir3009/Multi-LogiEval.
comment: 23 Pages
☆ Reinforcement Learning via Auxiliary Task Distillation
We present Reinforcement Learning via Auxiliary Task Distillation (AuxDistill), a new method that enables reinforcement learning (RL) to perform long-horizon robot control problems by distilling behaviors from auxiliary RL tasks. AuxDistill achieves this by concurrently carrying out multi-task RL with auxiliary tasks, which are easier to learn and relevant to the main task. A weighted distillation loss transfers behaviors from these auxiliary tasks to solve the main task. We demonstrate that AuxDistill can learn a pixels-to-actions policy for a challenging multi-stage embodied object rearrangement task from the environment reward without demonstrations, a learning curriculum, or pre-trained skills. AuxDistill achieves $2.3 \times$ higher success than the previous state-of-the-art baseline in the Habitat Object Rearrangement benchmark and outperforms methods that use pre-trained skills and expert demonstrations.
☆ Paraphrase and Aggregate with Large Language Models for Minimizing Intent Classification Errors SIGIR 2024
Large language models (LLM) have achieved remarkable success in natural language generation but lesser focus has been given to their applicability in decision making tasks such as classification. We show that LLMs like LLaMa can achieve high performance on large multi-class classification tasks but still make classification errors and worse, generate out-of-vocabulary class labels. To address these critical issues, we introduce Paraphrase and AGgregate (PAG)-LLM approach wherein an LLM generates multiple paraphrases of the input query (parallel queries), performs multi-class classification for the original query and each paraphrase, and at the end aggregate all the classification labels based on their confidence scores. We evaluate PAG-LLM on two large multi-class classication datasets: CLINC, and Banking and show 22.7% and 15.1% error reduction. We show that PAG-LLM is especially effective for hard examples where LLM is uncertain, and reduces the critical misclassification and hallucinated label generation errors
comment: Accepted at SIGIR 2024
☆ Virtual Mines -- Component-level recycling of printed circuit boards using deep learning
This contribution gives an overview of an ongoing project using machine learning and computer vision components for improving the electronic waste recycling process. In circular economy, the "virtual mines" concept refers to production cycles where interesting raw materials are reclaimed in an efficient and cost-effective manner from end-of-life items. In particular, the growth of e-waste, due to the increasingly shorter life cycle of hi-tech goods, is a global problem. In this paper, we describe a pipeline based on deep learning model to recycle printed circuit boards at the component level. A pre-trained YOLOv5 model is used to analyze the results of the locally developed dataset. With a different distribution of class instances, YOLOv5 managed to achieve satisfactory precision and recall, with the ability to optimize with large component instances.
comment: 10 pages, 5 figures
☆ Peirce in the Machine: How Mixture of Experts Models Perform Hypothesis Construction
Mixture of experts is a prediction aggregation method in machine learning that aggregates the predictions of specialized experts. This method often outperforms Bayesian methods despite the Bayesian having stronger inductive guarantees. We argue that this is due to the greater functional capacity of mixture of experts. We prove that in a limiting case of mixture of experts will have greater capacity than equivalent Bayesian methods, which we vouchsafe through experiments on non-limiting cases. Finally, we conclude that mixture of experts is a type of abductive reasoning in the Peircian sense of hypothesis construction.
comment: 31 pages
☆ Quantifying Heterogeneous Ecosystem Services With Multi-Label Soft Classification
Understanding and quantifying ecosystem services are crucial for sustainable environmental management, conservation efforts, and policy-making. The advancement of remote sensing technology and machine learning techniques has greatly facilitated this process. Yet, ground truth labels, such as biodiversity, are very difficult and expensive to measure. In addition, more easily obtainable proxy labels, such as land use, often fail to capture the complex heterogeneity of the ecosystem. In this paper, we demonstrate how land use proxy labels can be implemented with a soft, multi-label classifier to predict ecosystem services with complex heterogeneity.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
☆ GraphPipe: Improving Performance and Scalability of DNN Training with Graph Pipeline Parallelism
Deep neural networks (DNNs) continue to grow rapidly in size, making them infeasible to train on a single device. Pipeline parallelism is commonly used in existing DNN systems to support large-scale DNN training by partitioning a DNN into multiple stages, which concurrently perform DNN training for different micro-batches in a pipeline fashion. However, existing pipeline-parallel approaches only consider sequential pipeline stages and thus ignore the topology of a DNN, resulting in missed model-parallel opportunities. This paper presents graph pipeline parallelism (GPP), a new pipeline-parallel scheme that partitions a DNN into pipeline stages whose dependencies are identified by a directed acyclic graph. GPP generalizes existing sequential pipeline parallelism and preserves the inherent topology of a DNN to enable concurrent execution of computationally-independent operators, resulting in reduced memory requirement and improved GPU performance. In addition, we develop GraphPipe, a distributed system that exploits GPP strategies to enable performant and scalable DNN training. GraphPipe partitions a DNN into a graph of stages, optimizes micro-batch schedules for these stages, and parallelizes DNN training using the discovered GPP strategies. Evaluation on a variety of DNNs shows that GraphPipe outperforms existing pipeline-parallel systems such as PipeDream and Piper by up to 1.6X. GraphPipe also reduces the search time by 9-21X compared to PipeDream and Piper.
☆ Evaluating the Quality of Hallucination Benchmarks for Large Vision-Language Models
Despite the rapid progress and outstanding performance of Large Vision-Language Models (LVLMs) in recent years, LVLMs have been plagued by the issue of hallucination, i.e., LVLMs tend to generate responses that are inconsistent with the corresponding visual inputs. To evaluate the degree of hallucination in LVLMs, previous works have proposed a series of benchmarks featuring different types of tasks and evaluation metrics. However, we find that the quality of the existing hallucination benchmarks varies, with some suffering from problems, e.g., inconsistent evaluation results under repeated tests, and misalignment with human evaluation. To this end, we propose a Hallucination benchmark Quality Measurement framework (HQM), which leverages various indicators to assess the reliability and validity of existing hallucination benchmarks separately. Specifically, for reliability we explore test-retest reliability and parallel-forms reliability, while for validity we examine criterion validity and coverage of hallucination types. Furthermore, based on the results of our quality measurement, we construct a High-Quality Hallucination Benchmark (HQH) for LVLMs. We conduct an extensive evaluation of over 10 representative LVLMs, including GPT-4o and Gemini-Vision-Pro, to provide an in-depth analysis of the hallucination issues in existing models. Our benchmark is publicly available at https://github.com/HQHBench/HQHBench.
☆ Learning Temporal Distances: Contrastive Successor Features Can Provide a Metric Structure for Decision-Making ICML 2024
Temporal distances lie at the heart of many algorithms for planning, control, and reinforcement learning that involve reaching goals, allowing one to estimate the transit time between two states. However, prior attempts to define such temporal distances in stochastic settings have been stymied by an important limitation: these prior approaches do not satisfy the triangle inequality. This is not merely a definitional concern, but translates to an inability to generalize and find shortest paths. In this paper, we build on prior work in contrastive learning and quasimetrics to show how successor features learned by contrastive learning (after a change of variables) form a temporal distance that does satisfy the triangle inequality, even in stochastic settings. Importantly, this temporal distance is computationally efficient to estimate, even in high-dimensional and stochastic settings. Experiments in controlled settings and benchmark suites demonstrate that an RL algorithm based on these new temporal distances exhibits combinatorial generalization (i.e., "stitching") and can sometimes learn more quickly than prior methods, including those based on quasimetrics.
comment: Proceedings of the 41st International Conference on Machine Learning (ICML 2024)
☆ Model-Free Robust Reinforcement Learning with Sample Complexity Analysis UAI 2024
Distributionally Robust Reinforcement Learning (DR-RL) aims to derive a policy optimizing the worst-case performance within a predefined uncertainty set. Despite extensive research, previous DR-RL algorithms have predominantly favored model-based approaches, with limited availability of model-free methods offering convergence guarantees or sample complexities. This paper proposes a model-free DR-RL algorithm leveraging the Multi-level Monte Carlo (MLMC) technique to close such a gap. Our innovative approach integrates a threshold mechanism that ensures finite sample requirements for algorithmic implementation, a significant improvement than previous model-free algorithms. We develop algorithms for uncertainty sets defined by total variation, Chi-square divergence, and KL divergence, and provide finite sample analyses under all three cases. Remarkably, our algorithms represent the first model-free DR-RL approach featuring finite sample complexity for total variation and Chi-square divergence uncertainty sets, while also offering an improved sample complexity and broader applicability compared to existing model-free DR-RL algorithms for the KL divergence model. The complexities of our method establish the tightest results for all three uncertainty models in model-free DR-RL, underscoring the effectiveness and efficiency of our algorithm, and highlighting its potential for practical applications.
comment: UAI 2024
☆ BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models
Safety backdoor attacks in large language models (LLMs) enable the stealthy triggering of unsafe behaviors while evading detection during normal interactions. The high dimensionality of potential triggers in the token space and the diverse range of malicious behaviors make this a critical challenge. We present BEEAR, a mitigation approach leveraging the insight that backdoor triggers induce relatively uniform drifts in the model's embedding space. Our bi-level optimization method identifies universal embedding perturbations that elicit unwanted behaviors and adjusts the model parameters to reinforce safe behaviors against these perturbations. Experiments show BEEAR reduces the success rate of RLHF time backdoor attacks from >95% to <1% and from 47% to 0% for instruction-tuning time backdoors targeting malicious code generation, without compromising model utility. Requiring only defender-defined safe and unwanted behaviors, BEEAR represents a step towards practical defenses against safety backdoors in LLMs, providing a foundation for further advancements in AI safety and security.
☆ Exploring Biomarker Relationships in Both Type 1 and Type 2 Diabetes Mellitus Through a Bayesian Network Analysis Approach
Understanding the complex relationships of biomarkers in diabetes is pivotal for advancing treatment strategies, a pressing need in diabetes research. This study applies Bayesian network structure learning to analyze the Shanghai Type 1 and Type 2 diabetes mellitus datasets, revealing complex relationships among key diabetes-related biomarkers. The constructed Bayesian network presented notable predictive accuracy, particularly for Type 2 diabetes mellitus, with root mean squared error (RMSE) of 18.23 mg/dL, as validated through leave-one-domain experiments and Clarke error grid analysis. This study not only elucidates the intricate dynamics of diabetes through a deeper understanding of biomarker interplay but also underscores the significant potential of integrating data-driven and knowledge-driven methodologies in the realm of personalized diabetes management. Such an approach paves the way for more custom and effective treatment strategies, marking a notable advancement in the field.
comment: Paper is accepted by EMBC 2024
☆ Meta-GCN: A Dynamically Weighted Loss Minimization Method for Dealing with the Data Imbalance in Graph Neural Networks
Although many real-world applications, such as disease prediction, and fault detection suffer from class imbalance, most existing graph-based classification methods ignore the skewness of the distribution of classes; therefore, tend to be biased towards the majority class(es). Conventional methods typically tackle this problem through the assignment of weights to each one of the class samples based on a function of their loss, which can lead to over-fitting on outliers. In this paper, we propose a meta-learning algorithm, named Meta-GCN, for adaptively learning the example weights by simultaneously minimizing the unbiased meta-data set loss and optimizing the model weights through the use of a small unbiased meta-data set. Through experiments, we have shown that Meta-GCN outperforms state-of-the-art frameworks and other baselines in terms of accuracy, the area under the receiver operating characteristic (AUC-ROC) curve, and macro F1-Score for classification tasks on two different datasets.
☆ Tolerance of Reinforcement Learning Controllers against Deviations in Cyber Physical Systems
Cyber-physical systems (CPS) with reinforcement learning (RL)-based controllers are increasingly being deployed in complex physical environments such as autonomous vehicles, the Internet-of-Things(IoT), and smart cities. An important property of a CPS is tolerance; i.e., its ability to function safely under possible disturbances and uncertainties in the actual operation. In this paper, we introduce a new, expressive notion of tolerance that describes how well a controller is capable of satisfying a desired system requirement, specified using Signal Temporal Logic (STL), under possible deviations in the system. Based on this definition, we propose a novel analysis problem, called the tolerance falsification problem, which involves finding small deviations that result in a violation of the given requirement. We present a novel, two-layer simulation-based analysis framework and a novel search heuristic for finding small tolerance violations. To evaluate our approach, we construct a set of benchmark problems where system parameters can be configured to represent different types of uncertainties and disturbancesin the system. Our evaluation shows that our falsification approach and heuristic can effectively find small tolerance violations.
comment: arXiv admin note: text overlap with arXiv:2311.07462
♻ ☆ Low-Resource Multi-Granularity Academic Function Recognition Based on Multiple Prompt Knowledge
Fine-tuning pre-trained language models (PLMs), e.g., SciBERT, generally requires large numbers of annotated data to achieve state-of-the-art performance on a range of NLP tasks in the scientific domain. However, obtaining the fine-tune data for scientific NLP task is still challenging and expensive. Inspired by recent advancement in prompt learning, in this paper, we propose the Mix Prompt Tuning (MPT), which is a semi-supervised method to alleviate the dependence on annotated data and improve the performance of multi-granularity academic function recognition tasks with a small number of labeled examples. Specifically, the proposed method provides multi-perspective representations by combining manual prompt templates with automatically learned continuous prompt templates to help the given academic function recognition task take full advantage of knowledge in PLMs. Based on these prompt templates and the fine-tuned PLM, a large number of pseudo labels are assigned to the unlabeled examples. Finally, we fine-tune the PLM using the pseudo training set. We evaluate our method on three academic function recognition tasks of different granularity including the citation function, the abstract sentence function, and the keyword function, with datasets from computer science domain and biomedical domain. Extensive experiments demonstrate the effectiveness of our method and statistically significant improvements against strong baselines. In particular, it achieves an average increase of 5% in Macro-F1 score compared with fine-tuning, and 6% in Macro-F1 score compared with other semi-supervised method under low-resource settings. In addition, MPT is a general method that can be easily applied to other low-resource scientific classification tasks.
comment: This article has been accepted by The Electronic Library and the full article is now available on Emerald Insight
♻ ☆ Why Transformers Need Adam: A Hessian Perspective
SGD performs worse than Adam by a significant margin on Transformers, but the reason remains unclear. In this work, we provide an explanation through the lens of Hessian: (i) Transformers are "heterogeneous": the Hessian spectrum across parameter blocks vary dramatically, a phenomenon we call "block heterogeneity"; (ii) Heterogeneity hampers SGD: SGD performs worse than Adam on problems with block heterogeneity. To validate (i) and (ii), we check various Transformers, CNNs, MLPs, and quadratic problems, and find that SGD can perform on par with Adam on problems without block heterogeneity, but performs worse than Adam when the heterogeneity exists. Our initial theoretical analysis indicates that SGD performs worse because it applies one single learning rate to all blocks, which cannot handle the heterogeneity among blocks. This limitation could be ameliorated if we use coordinate-wise learning rates, as designed in Adam.
♻ ☆ VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation
The recent years have witnessed great advances in video generation. However, the development of automatic video metrics is lagging significantly behind. None of the existing metric is able to provide reliable scores over generated videos. The main barrier is the lack of large-scale human-annotated dataset. In this paper, we release VideoFeedback, the first large-scale dataset containing human-provided multi-aspect score over 37.6K synthesized videos from 11 existing video generative models. We train VideoScore (initialized from Mantis) based on VideoFeedback to enable automatic video quality assessment. Experiments show that the Spearman correlation between VideoScore and humans can reach 77.1 on VideoFeedback-test, beating the prior best metrics by about 50 points. Further result on other held-out EvalCrafter, GenAI-Bench, and VBench show that VideoScore has consistently much higher correlation with human judges than other metrics. Due to these results, we believe VideoScore can serve as a great proxy for human raters to (1) rate different video models to track progress (2) simulate fine-grained human feedback in Reinforcement Learning with Human Feedback (RLHF) to improve current video generation models.
♻ ☆ Pandora's White-Box: Precise Training Data Detection and Extraction in Large Language Models
In this paper we develop state-of-the-art privacy attacks against Large Language Models (LLMs), where an adversary with some access to the model tries to learn something about the underlying training data. Our headline results are new membership inference attacks (MIAs) against pretrained LLMs that perform hundreds of times better than baseline attacks, and a pipeline showing that over 50% (!) of the fine-tuning dataset can be extracted from a fine-tuned LLM in natural settings. We consider varying degrees of access to the underlying model, pretraining and fine-tuning data, and both MIAs and training data extraction. For pretraining data, we propose two new MIAs: a supervised neural network classifier that predicts training data membership on the basis of (dimensionality-reduced) model gradients, as well as a variant of this attack that only requires logit access to the model by leveraging recent model-stealing work on LLMs. To our knowledge this is the first MIA that explicitly incorporates model-stealing information. Both attacks outperform existing black-box baselines, and our supervised attack closes the gap between MIA attack success against LLMs and the strongest known attacks for other machine learning models. In fine-tuning, we find that a simple attack based on the ratio of the loss between the base and fine-tuned models is able to achieve near-perfect MIA performance; we then leverage our MIA to extract a large fraction of the fine-tuning dataset from fine-tuned Pythia and Llama models. Our code is available at github.com/safr-ai-lab/pandora-llm.
♻ ☆ Who Plays First? Optimizing the Order of Play in Stackelberg Games with Many Robots
We consider the multi-agent spatial navigation problem of computing the socially optimal order of play, i.e., the sequence in which the agents commit to their decisions, and its associated equilibrium in an N-player Stackelberg trajectory game. We model this problem as a mixed-integer optimization problem over the space of all possible Stackelberg games associated with the order of play's permutations. To solve the problem, we introduce Branch and Play (B&P), an efficient and exact algorithm that provably converges to a socially optimal order of play and its Stackelberg equilibrium. As a subroutine for B&P, we employ and extend sequential trajectory planning, i.e., a popular multi-agent control approach, to scalably compute valid local Stackelberg equilibria for any given order of play. We demonstrate the practical utility of B&P to coordinate air traffic control, swarm formation, and delivery vehicle fleets. We find that B&P consistently outperforms various baselines, and computes the socially optimal equilibrium.
comment: Robotics: Science and Systems (RSS) 2024
♻ ☆ Feature learning as alignment: a structural property of gradient descent in non-linear neural networks
Understanding the mechanisms through which neural networks extract statistics from input-label pairs through feature learning is one of the most important unsolved problems in supervised learning. Prior works demonstrated that the gram matrices of the weights (the neural feature matrices, NFM) and the average gradient outer products (AGOP) become correlated during training, in a statement known as the neural feature ansatz (NFA). Through the NFA, the authors introduce mapping with the AGOP as a general mechanism for neural feature learning. However, these works do not provide a theoretical explanation for this correlation or its origins. In this work, we further clarify the nature of this correlation, and explain its emergence. We show that this correlation is equivalent to alignment between the left singular structure of the weight matrices and the newly defined pre-activation tangent features at each layer. We further establish that the alignment is driven by the interaction of weight changes induced by SGD with the pre-activation features, and analyze the resulting dynamics analytically at early times in terms of simple statistics of the inputs and labels. Finally, motivated by the observation that the NFA is driven by this centered correlation, we introduce a simple optimization rule that dramatically increases the NFA correlations at any given layer and improves the quality of features learned.
♻ ☆ Flow of Reasoning: Efficient Training of LLM Policy with Divergent Thinking
Divergent thinking, the cognitive process of generating diverse solutions, is a hallmark of human creativity and problem-solving. For machines, sampling diverse solution trajectories in complex reasoning problems is crucial for robust outcomes, data augmentation, and enhanced model generalization. Large language models (LLMs) often struggle with generating high-quality, diverse reasoning. While supervised fine-tuning helps with quality, it requires extensive supervision data to capture the full diversity of solutions. Alternatively, reinforcement learning methods like PPO aim to find limited highest-reward solutions while neglecting the solution diversity, akin to convergent thinking. To address these limitations, we propose Flow of Reasoning (FoR) -- an efficient LLM training approach enabling diverse reasoning with minimal data. FoR formulates multi-step LLM reasoning as a Markovian flow from an initial state to terminal states. The formulation allows to adapt principled GFlowNet approaches to train the LLM as a policy, which is able to sample multiple reasoning paths with probabilities proportional to the unnormalized reward. Empirical results show that, with limited training data (e.g., 15 examples), FoR can discover diverse high-quality solutions that excel greatly beyond current state-of-the-art methods across three tasks, including embodied reasoning (BlocksWorld), math puzzle solving (Game24), and logical reasoning (PrOntoQA). Code is available at https://github.com/Yu-Fangxu/FoR.
♻ ☆ A practical existence theorem for reduced order models based on convolutional autoencoders
In recent years, deep learning has gained increasing popularity in the fields of Partial Differential Equations (PDEs) and Reduced Order Modeling (ROM), providing domain practitioners with new powerful data-driven techniques such as Physics-Informed Neural Networks (PINNs), Neural Operators, Deep Operator Networks (DeepONets) and Deep-Learning based ROMs (DL-ROMs). In this context, deep autoencoders based on Convolutional Neural Networks (CNNs) have proven extremely effective, outperforming established techniques, such as the reduced basis method, when dealing with complex nonlinear problems. However, despite the empirical success of CNN-based autoencoders, there are only a few theoretical results supporting these architectures, usually stated in the form of universal approximation theorems. In particular, although the existing literature provides users with guidelines for designing convolutional autoencoders, the subsequent challenge of learning the latent features has been barely investigated. Furthermore, many practical questions remain unanswered, e.g., the number of snapshots needed for convergence or the neural network training strategy. In this work, using recent techniques from sparse high-dimensional function approximation, we fill some of these gaps by providing a new practical existence theorem for CNN-based autoencoders when the parameter-to-solution map is holomorphic. This regularity assumption arises in many relevant classes of parametric PDEs, such as the parametric diffusion equation, for which we discuss an explicit application of our general theory.
♻ ☆ Fusion of Movement and Naive Predictions for Point Forecasting in Univariate Random Walks
Traditional methods for point forecasting in univariate random walks often fail to surpass naive benchmarks due to data unpredictability. This study introduces a novel forecasting method that fuses movement prediction (binary classification) with naive forecasts for accurate one-step-ahead point forecasting. The method's efficacy is demonstrated through theoretical analysis, simulations, and real-world data experiments. It reliably exceeds naive forecasts with movement prediction accuracies as low as 0.55, outperforming baseline models like ARIMA, linear regression, MLP, and LSTM networks in forecasting the S\&P 500 index and Bitcoin prices. This method is particularly advantageous when accurate point predictions are challenging but accurate movement predictions are attainable, translating movement predictions into point forecasts in random walk contexts.
♻ ☆ FairytaleQA Translated: Enabling Educational Question and Answer Generation in Less-Resourced Languages
Question Answering (QA) datasets are crucial in assessing reading comprehension skills for both machines and humans. While numerous datasets have been developed in English for this purpose, a noticeable void exists in less-resourced languages. To alleviate this gap, our paper introduces machine-translated versions of FairytaleQA, a renowned QA dataset designed to assess and enhance narrative comprehension skills in young children. By employing fine-tuned, modest-scale models, we establish benchmarks for both Question Generation (QG) and QA tasks within the translated datasets. In addition, we present a case study proposing a model for generating question-answer pairs, with an evaluation incorporating quality metrics such as question well-formedness, answerability, relevance, and children suitability. Our evaluation prioritizes quantifying and describing error cases, along with providing directions for future work. This paper contributes to the advancement of QA and QG research in less-resourced languages, promoting accessibility and inclusivity in the development of these models for reading comprehension. The code and data is publicly available at github.com/bernardoleite/fairytaleqa-translated.
comment: Preprint - Accepted for publication at ECTEL 2024
♻ ☆ GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts
Large language models (LLMs) have recently experienced tremendous popularity and are widely used from casual conversations to AI-driven programming. However, despite their considerable success, LLMs are not entirely reliable and can give detailed guidance on how to conduct harmful or illegal activities. While safety measures can reduce the risk of such outputs, adversarial jailbreak attacks can still exploit LLMs to produce harmful content. These jailbreak templates are typically manually crafted, making large-scale testing challenging. In this paper, we introduce GPTFuzz, a novel black-box jailbreak fuzzing framework inspired by the AFL fuzzing framework. Instead of manual engineering, GPTFuzz automates the generation of jailbreak templates for red-teaming LLMs. At its core, GPTFuzz starts with human-written templates as initial seeds, then mutates them to produce new templates. We detail three key components of GPTFuzz: a seed selection strategy for balancing efficiency and variability, mutate operators for creating semantically equivalent or similar sentences, and a judgment model to assess the success of a jailbreak attack. We evaluate GPTFuzz against various commercial and open-source LLMs, including ChatGPT, LLaMa-2, and Vicuna, under diverse attack scenarios. Our results indicate that GPTFuzz consistently produces jailbreak templates with a high success rate, surpassing human-crafted templates. Remarkably, GPTFuzz achieves over 90% attack success rates against ChatGPT and Llama-2 models, even with suboptimal initial seed templates. We anticipate that GPTFuzz will be instrumental for researchers and practitioners in examining LLM robustness and will encourage further exploration into enhancing LLM safety.
♻ ☆ A Systematic Review of Few-Shot Learning in Medical Imaging
The lack of annotated medical images limits the performance of deep learning models, which usually need large-scale labelled datasets. Few-shot learning techniques can reduce data scarcity issues and enhance medical image analysis, especially with meta-learning. This systematic review gives a comprehensive overview of few-shot learning in medical imaging. We searched the literature systematically and selected 80 relevant articles published from 2018 to 2023. We clustered the articles based on medical outcomes, such as tumour segmentation, disease classification, and image registration; anatomical structure investigated (i.e. heart, lung, etc.); and the meta-learning method used. For each cluster, we examined the papers' distributions and the results provided by the state-of-the-art. In addition, we identified a generic pipeline shared among all the studies. The review shows that few-shot learning can overcome data scarcity in most outcomes and that meta-learning is a popular choice to perform few-shot learning because it can adapt to new tasks with few labelled samples. In addition, following meta-learning, supervised learning and semi-supervised learning stand out as the predominant techniques employed to tackle few-shot learning challenges in medical imaging and also best performing. Lastly, we observed that the primary application areas predominantly encompass cardiac, pulmonary, and abdominal domains. This systematic review aims to inspire further research to improve medical image analysis and patient care.
comment: 48 pages, 29 figures, 10 tables, submitted to Elsevier on 19 Sep 2023
♻ ☆ GIEBench: Towards Holistic Evaluation of Group Identity-based Empathy for Large Language Models
As large language models (LLMs) continue to develop and gain widespread application, the ability of LLMs to exhibit empathy towards diverse group identities and understand their perspectives is increasingly recognized as critical. Most existing benchmarks for empathy evaluation of LLMs focus primarily on universal human emotions, such as sadness and pain, often overlooking the context of individuals' group identities. To address this gap, we introduce GIEBench, a comprehensive benchmark that includes 11 identity dimensions, covering 97 group identities with a total of 999 single-choice questions related to specific group identities. GIEBench is designed to evaluate the empathy of LLMs when presented with specific group identities such as gender, age, occupation, and race, emphasizing their ability to respond from the standpoint of the identified group. This supports the ongoing development of empathetic LLM applications tailored to users with different identities. Our evaluation of 23 LLMs revealed that while these LLMs understand different identity standpoints, they fail to consistently exhibit equal empathy across these identities without explicit instructions to adopt those perspectives. This highlights the need for improved alignment of LLMs with diverse values to better accommodate the multifaceted nature of human identities. Our datasets are available at https://github.com/GIEBench/GIEBench.
♻ ☆ Watermark Stealing in Large Language Models ICML 2024
LLM watermarking has attracted attention as a promising way to detect AI-generated content, with some works suggesting that current schemes may already be fit for deployment. In this work we dispute this claim, identifying watermark stealing (WS) as a fundamental vulnerability of these schemes. We show that querying the API of the watermarked LLM to approximately reverse-engineer a watermark enables practical spoofing attacks, as hypothesized in prior work, but also greatly boosts scrubbing attacks, which was previously unnoticed. We are the first to propose an automated WS algorithm and use it in the first comprehensive study of spoofing and scrubbing in realistic settings. We show that for under $50 an attacker can both spoof and scrub state-of-the-art schemes previously considered safe, with average success rate of over 80%. Our findings challenge common beliefs about LLM watermarking, stressing the need for more robust schemes. We make all our code and additional examples available at https://watermark-stealing.org.
comment: ICML 2024
♻ ☆ Towards Theoretical Understandings of Self-Consuming Generative Models ICML 2024
This paper tackles the emerging challenge of training generative models within a self-consuming loop, wherein successive generations of models are recursively trained on mixtures of real and synthetic data from previous generations. We construct a theoretical framework to rigorously evaluate how this training procedure impacts the data distributions learned by future models, including parametric and non-parametric models. Specifically, we derive bounds on the total variation (TV) distance between the synthetic data distributions produced by future models and the original real data distribution under various mixed training scenarios for diffusion models with a one-hidden-layer neural network score function. Our analysis demonstrates that this distance can be effectively controlled under the condition that mixed training dataset sizes or proportions of real data are large enough. Interestingly, we further unveil a phase transition induced by expanding synthetic data amounts, proving theoretically that while the TV distance exhibits an initial ascent, it declines beyond a threshold point. Finally, we present results for kernel density estimation, delivering nuanced insights such as the impact of mixed data training on error propagation.
comment: Accepted at ICML 2024
♻ ☆ Hyperbolic Random Forests
Hyperbolic space is becoming a popular choice for representing data due to the hierarchical structure - whether implicit or explicit - of many real-world datasets. Along with it comes a need for algorithms capable of solving fundamental tasks, such as classification, in hyperbolic space. Recently, multiple papers have investigated hyperbolic alternatives to hyperplane-based classifiers, such as logistic regression and SVMs. While effective, these approaches struggle with more complex hierarchical data. We, therefore, propose to generalize the well-known random forests to hyperbolic space. We do this by redefining the notion of a split using horospheres. Since finding the globally optimal split is computationally intractable, we find candidate horospheres through a large-margin classifier. To make hyperbolic random forests work on multi-class data and imbalanced experiments, we furthermore outline a new method for combining classes based on their lowest common ancestor and a class-balanced version of the large-margin loss. Experiments on standard and new benchmarks show that our approach outperforms both conventional random forest algorithms and recent hyperbolic classifiers.
comment: Accepted at TMLR. Code available at https://github.com/LarsDoorenbos/HoroRF
♻ ☆ CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training
Selecting high-quality data for pre-training is crucial in shaping the downstream task performance of language models. A major challenge lies in identifying this optimal subset, a problem generally considered intractable, thus necessitating scalable and effective heuristics. In this work, we propose a data selection method, CoLoR-Filter (Conditional Loss Reduction Filtering), which leverages an empirical Bayes-inspired approach to derive a simple and computationally efficient selection criterion based on the relative loss values of two auxiliary models. In addition to the modeling rationale, we evaluate CoLoR-Filter empirically on two language modeling tasks: (1) selecting data from C4 for domain adaptation to evaluation on Books and (2) selecting data from C4 for a suite of downstream multiple-choice question answering tasks. We demonstrate favorable scaling both as we subselect more aggressively and using small auxiliary models to select data for large target models. As one headline result, CoLoR-Filter data selected using a pair of 150m parameter auxiliary models can train a 1.2b parameter target model to match a 1.2b parameter model trained on 25b randomly selected tokens with 25x less data for Books and 11x less data for the downstream tasks. Code: https://github.com/davidbrandfonbrener/color-filter-olmo Filtered data: https://huggingface.co/datasets/davidbrandfonbrener/color-filtered-c4
♻ ☆ A Comprehensive Survey on Relation Extraction: Recent Advances and New Frontiers
Relation extraction (RE) involves identifying the relations between entities from underlying content. RE serves as the foundation for many natural language processing (NLP) and information retrieval applications, such as knowledge graph completion and question answering. In recent years, deep neural networks have dominated the field of RE and made noticeable progress. Subsequently, the large pre-trained language models have taken the state-of-the-art RE to a new level. This survey provides a comprehensive review of existing deep learning techniques for RE. First, we introduce RE resources, including datasets and evaluation metrics. Second, we propose a new taxonomy to categorize existing works from three perspectives, i.e., text representation, context encoding, and triplet prediction. Third, we discuss several important challenges faced by RE and summarize potential techniques to tackle these challenges. Finally, we outline some promising future directions and prospects in this field. This survey is expected to facilitate researchers' collaborative efforts to address the challenges of real-world RE systems.
♻ ☆ Learning Action-based Representations Using Invariance
Robust reinforcement learning agents using high-dimensional observations must be able to identify relevant state features amidst many exogeneous distractors. A representation that captures controllability identifies these state elements by determining what affects agent control. While methods such as inverse dynamics and mutual information capture controllability for a limited number of timesteps, capturing long-horizon elements remains a challenging problem. Myopic controllability can capture the moment right before an agent crashes into a wall, but not the control-relevance of the wall while the agent is still some distance away. To address this we introduce action-bisimulation encoding, a method inspired by the bisimulation invariance pseudometric, that extends single-step controllability with a recursive invariance constraint. By doing this, action-bisimulation learns a multi-step controllability metric that smoothly discounts distant state features that are relevant for control. We demonstrate that action-bisimulation pretraining on reward-free, uniformly random data improves sample efficiency in several environments, including a photorealistic 3D simulation domain, Habitat. Additionally, we provide theoretical analysis and qualitative results demonstrating the information captured by action-bisimulation.
comment: Published at the Reinforcement Learning Conference 2024
♻ ☆ LLMs Are Few-Shot In-Context Low-Resource Language Learners
In-context learning (ICL) empowers large language models (LLMs) to perform diverse tasks in underrepresented languages using only short in-context information, offering a crucial avenue for narrowing the gap between high-resource and low-resource languages. Nonetheless, there is only a handful of works explored ICL for low-resource languages with most of them focusing on relatively high-resource languages, such as French and Spanish. In this work, we extensively study ICL and its cross-lingual variation (X-ICL) on 25 low-resource and 7 relatively higher-resource languages. Our study not only assesses the effectiveness of ICL with LLMs in low-resource languages but also identifies the shortcomings of in-context label alignment, and introduces a more effective alternative: query alignment. Moreover, we provide valuable insights into various facets of ICL for low-resource languages. Our study concludes the significance of few-shot in-context information on enhancing the low-resource understanding quality of LLMs through semantically relevant information by closing the language gap in the target language and aligning the semantics between the targeted low-resource and the high-resource language that the model is proficient in. Our work highlights the importance of advancing ICL research, particularly for low-resource languages. Our code is publicly released at https://github.com/SamuelCahyawijaya/in-context-alignment
♻ ☆ CLUE: A Clinical Language Understanding Evaluation for LLMs
Large Language Models (LLMs) are expected to significantly contribute to patient care, diagnostics, and administrative processes. Emerging biomedical LLMs aim to address healthcare-specific challenges, including privacy demands and computational constraints. Assessing the models' suitability for this sensitive application area is of the utmost importance. However, evaluation has primarily been limited to non-clinical tasks, which do not reflect the complexity of practical clinical applications. To fill this gap, we present the Clinical Language Understanding Evaluation (CLUE), a benchmark tailored to evaluate LLMs on clinical tasks. CLUE includes six tasks to test the practical applicability of LLMs in complex healthcare settings. Our evaluation includes a total of $25$ LLMs. In contrast to previous evaluations, CLUE shows a decrease in performance for nine out of twelve biomedical models. Our benchmark represents a step towards a standardized approach to evaluating and developing LLMs in healthcare to align future model development with the real-world needs of clinical application. We open-source all evaluation scripts and datasets for future research at https://github.com/TIO-IKIM/CLUE.
♻ ☆ A Survey on Neural Topic Models: Methods, Applications, and Challenges
Topic models have been prevalent for decades to discover latent topics and infer topic proportions of documents in an unsupervised fashion. They have been widely used in various applications like text analysis and context recommendation. Recently, the rise of neural networks has facilitated the emergence of a new research field -- Neural Topic Models (NTMs). Different from conventional topic models, NTMs directly optimize parameters without requiring model-specific derivations. This endows NTMs with better scalability and flexibility, resulting in significant research attention and plentiful new methods and applications. In this paper, we present a comprehensive survey on neural topic models concerning methods, applications, and challenges. Specifically, we systematically organize current NTM methods according to their network structures and introduce the NTMs for various scenarios like short texts and bilingual documents. We also discuss a wide range of popular applications built on NTMs. Finally, we highlight the challenges confronted by NTMs to inspire future research. We accompany this survey with a repository for easier access to the mentioned paper resources: https://github.com/bobxwu/Paper-Neural-Topic-Models.
comment: Accepted to Artificial Intelligence Review. See https://doi.org/10.1007/s10462-023-10661-7 and a paper list at https://github.com/BobXWu/Paper-Neural-Topic-Models
♻ ☆ Generation of Asset Administration Shell with Large Language Model Agents: Toward Semantic Interoperability in Digital Twins in the Context of Industry 4.0
This research introduces a novel approach for achieving semantic interoperability in digital twins and assisting the creation of Asset Administration Shell (AAS) as digital twin model within the context of Industry 4.0. The foundational idea of our research is that the communication based on semantics and the generation of meaningful textual data are directly linked, and we posit that these processes are equivalent if the exchanged information can be serialized in text form. Based on this, we construct a "semantic node" data structure in our research to capture the semantic essence of textual data. Then, a system powered by large language models is designed and implemented to process the "semantic node" and generate standardized digital twin models from raw textual data collected from datasheets describing technical assets. Our evaluation demonstrates an effective generation rate of 62-79%, indicating a substantial proportion of the information from the source text can be translated error-free to the target digital twin instance model with the generative capability of large language models. This result has a direct application in the context of Industry 4.0, and the designed system is implemented as a data model generation tool for reducing the manual effort in creating AAS model. In our evaluation, a comparative analysis of different LLMs and an in-depth ablation study of Retrieval-Augmented Generation (RAG) mechanisms provide insights into the effectiveness of LLM systems for interpreting technical concepts and translating data. Our findings emphasize LLMs' capability to automate AAS instance creation and contribute to the broader field of semantic interoperability for digital twins in industrial applications. The prototype implementation and evaluation results are presented on our GitHub Repository: https://github.com/YuchenXia/AASbyLLM.
comment: Published in IEEE Access
♻ ☆ A Non-autoregressive Multi-Horizon Flight Trajectory Prediction Framework with Gray Code Representation AAAI
Flight Trajectory Prediction (FTP) is an essential task in Air Traffic Control (ATC), which can assist air traffic controllers in managing airspace more safely and efficiently. Existing approaches generally perform multi-horizon FTP tasks in an autoregressive manner, thereby suffering from error accumulation and low-efficiency problems. In this paper, a novel framework, called FlightBERT++, is proposed to i) forecast multi-horizon flight trajectories directly in a non-autoregressive way, and ii) improve the limitation of the binary encoding (BE) representation in the FlightBERT framework. Specifically, the proposed framework is implemented by a generalized encoder-decoder architecture, in which the encoder learns the temporal-spatial patterns from historical observations and the decoder predicts the flight status for the future horizons. Compared to conventional architecture, an innovative horizon-aware contexts generator is dedicatedly designed to consider the prior horizon information, which further enables non-autoregressive multi-horizon prediction. Additionally, the Gray code representation and the differential prediction paradigm are designed to cope with the high-bit misclassifications of the BE representation, which significantly reduces the outliers in the predictions. Moreover, a differential prompted decoder is proposed to enhance the capability of the differential predictions by leveraging the stationarity of the differential sequence. Extensive experiments are conducted to validate the proposed framework on a real-world flight trajectory dataset. The experimental results demonstrated that the proposed framework outperformed the competitive baselines in both FTP performance and computational efficiency.
comment: An extend version based on the AAAI version
♻ ☆ A Survey of Large Language Models for Graphs KDD'24
Graphs are an essential data structure utilized to represent relationships in real-world scenarios. Prior research has established that Graph Neural Networks (GNNs) deliver impressive outcomes in graph-centric tasks, such as link prediction and node classification. Despite these advancements, challenges like data sparsity and limited generalization capabilities continue to persist. Recently, Large Language Models (LLMs) have gained attention in natural language processing. They excel in language comprehension and summarization. Integrating LLMs with graph learning techniques has attracted interest as a way to enhance performance in graph learning tasks. In this survey, we conduct an in-depth review of the latest state-of-the-art LLMs applied in graph learning and introduce a novel taxonomy to categorize existing methods based on their framework design. We detail four unique designs: i) GNNs as Prefix, ii) LLMs as Prefix, iii) LLMs-Graphs Integration, and iv) LLMs-Only, highlighting key methodologies within each category. We explore the strengths and limitations of each framework, and emphasize potential avenues for future research, including overcoming current integration challenges between LLMs and graph learning techniques, and venturing into new application areas. This survey aims to serve as a valuable resource for researchers and practitioners eager to leverage large language models in graph learning, and to inspire continued progress in this dynamic field. We consistently maintain the related open-source materials at \url{https://github.com/HKUDS/Awesome-LLM4Graph-Papers}.
comment: Published as a KDD'24 survey paper
♻ ☆ Mirror: A Multiple-perspective Self-Reflection Method for Knowledge-rich Reasoning ACL24
While Large language models (LLMs) have the capability to iteratively reflect on their own outputs, recent studies have observed their struggles with knowledge-rich problems without access to external resources. In addition to the inefficiency of LLMs in self-assessment, we also observe that LLMs struggle to revisit their predictions despite receiving explicit negative feedback. Therefore, We propose Mirror, a Multiple-perspective self-reflection method for knowledge-rich reasoning, to avoid getting stuck at a particular reflection iteration. Mirror enables LLMs to reflect from multiple-perspective clues, achieved through a heuristic interaction between a Navigator and a Reasoner. It guides agents toward diverse yet plausibly reliable reasoning trajectory without access to ground truth by encouraging (1) diversity of directions generated by Navigator and (2) agreement among strategically induced perturbations in responses generated by the Reasoner. The experiments on five reasoning datasets demonstrate that Mirror's superiority over several contemporary self-reflection approaches. Additionally, the ablation study studies clearly indicate that our strategies alleviate the aforementioned challenges.
comment: ACL24, Main Conference, long paper. Code is available at https://github.com/hanqi-qi/Mirror.git
♻ ☆ An Embarrassingly Simple Approach to Enhance Transformer Performance in Genomic Selection for Crop Breeding IJCAI2024
Genomic selection (GS), as a critical crop breeding strategy, plays a key role in enhancing food production and addressing the global hunger crisis. The predominant approaches in GS currently revolve around employing statistical methods for prediction. However, statistical methods often come with two main limitations: strong statistical priors and linear assumptions. A recent trend is to capture the non-linear relationships between markers by deep learning. However, as crop datasets are commonly long sequences with limited samples, the robustness of deep learning models, especially Transformers, remains a challenge. In this work, to unleash the unexplored potential of attention mechanism for the task of interest, we propose a simple yet effective Transformer-based framework that enables end-to-end training of the whole sequence. Via experiments on rice3k and wheat3k datasets, we show that, with simple tricks such as k-mer tokenization and random masking, Transformer can achieve overall superior performance against seminal methods on GS tasks of interest.
comment: Accepted by IJCAI2024. Code is available at https://github.com/RenqiChen/Genomic-Selection
♻ ☆ Interpretable Catastrophic Forgetting of Large Language Model Fine-tuning via Instruction Vector
Fine-tuning large language models (LLMs) can cause them to lose their general capabilities. However, the intrinsic mechanisms behind such forgetting remain unexplored. In this paper, we begin by examining this phenomenon by focusing on knowledge understanding and instruction following, with the latter identified as the main contributor to forgetting during fine-tuning. Consequently, we propose the Instruction Vector (IV) framework to capture model representations highly related to specific instruction-following capabilities, thereby making it possible to understand model-intrinsic forgetting. Through the analysis of IV dynamics pre and post-training, we suggest that fine-tuning mostly adds specialized reasoning patterns instead of erasing previous skills, which may appear as forgetting. Building on this insight, we develop IV-guided training, which aims to preserve original computation graph, thereby mitigating catastrophic forgetting. Empirical tests on three benchmarks confirm the efficacy of this new approach, supporting the relationship between IVs and forgetting. Our code will be made available soon.
♻ ☆ ClickPrompt: CTR Models are Strong Prompt Generators for Adapting Language Models to CTR Prediction WWW 2024
Click-through rate (CTR) prediction has become increasingly indispensable for various Internet applications. Traditional CTR models convert the multi-field categorical data into ID features via one-hot encoding, and extract the collaborative signals among features. Such a paradigm suffers from the problem of semantic information loss. Another line of research explores the potential of pretrained language models (PLMs) for CTR prediction by converting input data into textual sentences through hard prompt templates. Although semantic signals are preserved, they generally fail to capture the collaborative information (e.g., feature interactions, pure ID features), not to mention the unacceptable inference overhead brought by the huge model size. In this paper, we aim to model both the semantic knowledge and collaborative knowledge for accurate CTR estimation, and meanwhile address the inference inefficiency issue. To benefit from both worlds and close their gaps, we propose a novel model-agnostic framework (i.e., ClickPrompt), where we incorporate CTR models to generate interaction-aware soft prompts for PLMs. We design a prompt-augmented masked language modeling (PA-MLM) pretraining task, where PLM has to recover the masked tokens based on the language context, as well as the soft prompts generated by CTR model. The collaborative and semantic knowledge from ID and textual features would be explicitly aligned and interacted via the prompt interface. Then, we can either tune the CTR model with PLM for superior performance, or solely tune the CTR model without PLM for inference efficiency. Experiments on four real-world datasets validate the effectiveness of ClickPrompt compared with existing baselines.
comment: Accepted by WWW 2024
♻ ☆ Deep-Learning Approach for Tissue Classification using Acoustic Waves during Ablation with an Er:YAG Laser (Updated)
Today's mechanical tools for bone cutting (osteotomy) cause mechanical trauma that prolongs the healing process. Medical device manufacturers aim to minimize this trauma, with minimally invasive surgery using laser cutting as one innovation. This method ablates tissue using laser light instead of mechanical tools, reducing post-surgery healing time. A reliable feedback system is crucial during laser surgery to prevent damage to surrounding tissues. We propose a tissue classification method analyzing acoustic waves generated during laser ablation, demonstrating its applicability in an ex-vivo experiment. The ablation process with a microsecond pulsed Er:YAG laser produces acoustic waves, acquired with an air-coupled transducer. These waves were used to classify five porcine tissue types: hard bone, soft bone, muscle, fat, and skin. For automated tissue classification, we compared five Neural Network (NN) approaches: a one-dimensional Convolutional Neural Network (CNN) with time-dependent input, a Fully-connected Neural Network (FcNN) with either the frequency spectrum or principal components of the frequency spectrum as input, and a combination of a CNN and an FcNN with time-dependent data and its frequency spectrum as input. Consecutive acoustic waves were used to improve classification accuracy. Grad-Cam identified the activation map of the frequencies, showing low frequencies as the most important for this task. Our results indicated that combining time-dependent data with its frequency spectrum achieved the highest classification accuracy (65.5%-75.5%). We also found that using the frequency spectrum alone was sufficient, with no additional benefit from applying Principal Components Analysis (PCA).
comment: This paper is an updated version of Deep-Learning Approach for Tissue Classification using Acoustic Waves during Ablation with an Er:YAG Laser originally published in DOI:10.1109/ACCESS.2021.3113055. This update addresses several issues and incorporates corrections as outlined in DOI:10.1109/ACCESS.2024.3395071. We provide here a detailed description of our experiments and the new models we used
♻ ☆ Advancing Surgical VQA with Scene Graph Knowledge
Modern operating room is becoming increasingly complex, requiring innovative intra-operative support systems. While the focus of surgical data science has largely been on video analysis, integrating surgical computer vision with language capabilities is emerging as a necessity. Our work aims to advance Visual Question Answering (VQA) in the surgical context with scene graph knowledge, addressing two main challenges in the current surgical VQA systems: removing question-condition bias in the surgical VQA dataset and incorporating scene-aware reasoning in the surgical VQA model design. First, we propose a Surgical Scene Graph-based dataset, SSG-QA, generated by employing segmentation and detection models on publicly available datasets. We build surgical scene graphs using spatial and action information of instruments and anatomies. These graphs are fed into a question engine, generating diverse QA pairs. Our SSG-QA dataset provides a more complex, diverse, geometrically grounded, unbiased, and surgical action-oriented dataset compared to existing surgical VQA datasets. We then propose SSG-QA-Net, a novel surgical VQA model incorporating a lightweight Scene-embedded Interaction Module (SIM), which integrates geometric scene knowledge in the VQA model design by employing cross-attention between the textual and the scene features. Our comprehensive analysis of the SSG-QA dataset shows that SSG-QA-Net outperforms existing methods across different question types and complexities. We highlight that the primary limitation in the current surgical VQA systems is the lack of scene knowledge to answer complex queries. We present a novel surgical VQA dataset and model and show that results can be significantly improved by incorporating geometric scene features in the VQA model design. The source code and the dataset will be made publicly available at: https://github.com/CAMMA-public/SSG-QA
comment: IPCAI 2024, Int J CARS (2024)
♻ ☆ VulDetectBench: Evaluating the Deep Capability of Vulnerability Detection with Large Language Models
Large Language Models (LLMs) have training corpora containing large amounts of program code, greatly improving the model's code comprehension and generation capabilities. However, sound comprehensive research on detecting program vulnerabilities, a more specific task related to code, and evaluating the performance of LLMs in this more specialized scenario is still lacking. To address common challenges in vulnerability analysis, our study introduces a new benchmark, VulDetectBench, specifically designed to assess the vulnerability detection capabilities of LLMs. The benchmark comprehensively evaluates LLM's ability to identify, classify, and locate vulnerabilities through five tasks of increasing difficulty. We evaluate the performance of 17 models (both open- and closed-source) and find that while existing models can achieve over 80% accuracy on tasks related to vulnerability identification and classification, they still fall short on specific, more detailed vulnerability analysis tasks, with less than 30% accuracy, making it difficult to provide valuable auxiliary information for professional vulnerability mining. Our benchmark effectively evaluates the capabilities of various LLMs at different levels in the specific task of vulnerability detection, providing a foundation for future research and improvements in this critical area of code security. VulDetectBench is publicly available at https://github.com/Sweetaroo/VulDetectBench.
♻ ☆ Prompting with Divide-and-Conquer Program Makes Large Language Models Discerning to Hallucination and Deception
Foundation models, such as Large language Models (LLMs), have attracted significant amount of interest due to their large number of applications. However, when handling tasks involving repetitive sub-tasks and/or deceptive contents, such as arithmetic calculation and article-level fake news detection, simple instructional prompts suffer from inaccurate responses. Existing works show that more complicated prompting strategies, such as Chain-of-Thoughts and Least-to-Most, can unlock LLM's powerful capacity in diverse areas. Recent researches reveal that simple divide-and-conquer prompting strategy, i.e. simply dividing the input sequence to multiple sub-inputs, can also substantially improve LLM's performance in some specific tasks such as misinformation detection. In this paper, we aim at examining the utility of divide-and-conquer prompting strategy and answer on which kind of tasks this strategy gets advantages. Specifically, we provide a theoretic analysis to divide-and-conquer prompting strategy and help us identify the specific tasks where DaC prompting can bring performance boost with theoretic guarantee. We then present two cases (large integer arithmetic and fact verification) where experimental results aligns with our theoretic analysis.
comment: Preprint
♻ ☆ Towards Bayesian Data Selection ICML 2024
A wide range of machine learning algorithms iteratively add data to the training sample. Examples include semi-supervised learning, active learning, multi-armed bandits, and Bayesian optimization. We embed this kind of data addition into decision theory by framing data selection as a decision problem. This paves the way for finding Bayes-optimal selections of data. For the illustrative case of self-training in semi-supervised learning, we derive the respective Bayes criterion. We further show that deploying this criterion mitigates the issue of confirmation bias by empirically assessing our method for generalized linear models, semi-parametric generalized additive models, and Bayesian neural networks on simulated and real-world data.
comment: 5th Workshop on Data-Centric Machine Learning Research (DMLR) at ICML 2024
♻ ☆ On-Device Soft Sensors: Real-Time Fluid Flow Estimation from Level Sensor Data
Soft sensors are crucial in bridging autonomous systems' physical and digital realms, enhancing sensor fusion and perception. Instead of deploying soft sensors on the Cloud, this study shift towards employing on-device soft sensors, promising heightened efficiency and bolstering data security. Our approach substantially improves energy efficiency by deploying Artificial Intelligence (AI) directly on devices within a wireless sensor network. Furthermore, the synergistic integration of the Microcontroller Unit and Field-Programmable Gate Array (FPGA) leverages the rapid AI inference capabilities of the latter. Empirical evidence from our real-world use case demonstrates that FPGA-based soft sensors achieve inference times ranging remarkably from 1.04 to 12.04 microseconds. These compelling results highlight the considerable potential of our innovative approach for executing real-time inference tasks efficiently, thereby presenting a feasible alternative that effectively addresses the latency challenges intrinsic to Cloud-based deployments.
comment: 8 pages, 6 figures, 1 Table, Accepted by the 1st AUTONOMOUS UBIQUITOUS SYSTEMS (AUTOQUITOUS) WORKSHOP of EAI MobiQuitous 2023 - 20th EAI International Conference on Mobile and Ubiquitous Systems: Computing, Networking and Services
♻ ☆ Surgical Triplet Recognition via Diffusion Model
Surgical triplet recognition is an essential building block to enable next-generation context-aware operating rooms. The goal is to identify the combinations of instruments, verbs, and targets presented in surgical video frames. In this paper, we propose DiffTriplet, a new generative framework for surgical triplet recognition employing the diffusion model, which predicts surgical triplets via iterative denoising. To handle the challenge of triplet association, two unique designs are proposed in our diffusion framework, i.e., association learning and association guidance. During training, we optimize the model in the joint space of triplets and individual components to capture the dependencies among them. At inference, we integrate association constraints into each update of the iterative denoising process, which refines the triplet prediction using the information of individual components. Experiments on the CholecT45 and CholecT50 datasets show the superiority of the proposed method in achieving a new state-of-the-art performance for surgical triplet recognition. Our codes will be released.
♻ ☆ LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models ACL 2024
Efficient fine-tuning is vital for adapting large language models (LLMs) to downstream tasks. However, it requires non-trivial efforts to implement these methods on different models. We present LlamaFactory, a unified framework that integrates a suite of cutting-edge efficient training methods. It provides a solution for flexibly customizing the fine-tuning of 100+ LLMs without the need for coding through the built-in web UI LlamaBoard. We empirically validate the efficiency and effectiveness of our framework on language modeling and text generation tasks. It has been released at https://github.com/hiyouga/LLaMA-Factory and received over 24,000 stars and 3,000 forks.
comment: 13 pages, accepted to ACL 2024 System Demonstration Track
♻ ☆ Investigating the impact of 2D gesture representation on co-speech gesture generation
Co-speech gestures play a crucial role in the interactions between humans and embodied conversational agents (ECA). Recent deep learning methods enable the generation of realistic, natural co-speech gestures synchronized with speech, but such approaches require large amounts of training data. "In-the-wild" datasets, which compile videos from sources such as YouTube through human pose detection models, offer a solution by providing 2D skeleton sequences that are paired with speech. Concurrently, innovative lifting models have emerged, capable of transforming these 2D pose sequences into their 3D counterparts, leading to large and diverse datasets of 3D gestures. However, the derived 3D pose estimation is essentially a pseudo-ground truth, with the actual ground truth being the 2D motion data. This distinction raises questions about the impact of gesture representation dimensionality on the quality of generated motions, a topic that, to our knowledge, remains largely unexplored. In this work, we evaluate the impact of the dimensionality of the training data, 2D or 3D joint coordinates, on the performance of a multimodal speech-to-gesture deep generative model. We use a lifting model to convert 2D-generated sequences of body pose to 3D. Then, we compare the sequence of gestures generated directly in 3D to the gestures generated in 2D and lifted to 3D as post-processing.
comment: 8 pages. Paper accepted at WACAI 2024
♻ ☆ This actually looks like that: Proto-BagNets for local and global interpretability-by-design
Interpretability is a key requirement for the use of machine learning models in high-stakes applications, including medical diagnosis. Explaining black-box models mostly relies on post-hoc methods that do not faithfully reflect the model's behavior. As a remedy, prototype-based networks have been proposed, but their interpretability is limited as they have been shown to provide coarse, unreliable, and imprecise explanations. In this work, we introduce Proto-BagNets, an interpretable-by-design prototype-based model that combines the advantages of bag-of-local feature models and prototype learning to provide meaningful, coherent, and relevant prototypical parts needed for accurate and interpretable image classification tasks. We evaluated the Proto-BagNet for drusen detection on publicly available retinal OCT data. The Proto-BagNet performed comparably to the state-of-the-art interpretable and non-interpretable models while providing faithful, accurate, and clinically meaningful local and global explanations. The code is available at https://github.com/kdjoumessi/Proto-BagNets.
♻ ☆ Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning
Large Language Models (LLMs) have demonstrated impressive capability in many natural language tasks. However, the auto-regressive generation process makes LLMs prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning. In this paper, by casting multi-step reasoning of LLMs as a heuristic search problem, we aim to alleviate the pathology by introducing Q*, a general, versatile and agile framework for guiding LLMs decoding process with deliberative planning. By learning a plug-and-play Q-value model as heuristic function for estimating expected future rewards, our Q* can effectively guide LLMs to select the most promising next reasoning step without fine-tuning LLMs for the current task, which avoids the significant computational overhead and potential risk of performance degeneration on other tasks. Extensive experiments on GSM8K, MATH and MBPP demonstrate the superiority of our method, contributing to improving the reasoning performance of existing open-source LLMs.
♻ ☆ Comprehensive Reassessment of Large-Scale Evaluation Outcomes in LLMs: A Multifaceted Statistical Approach
Amidst the rapid evolution of LLMs, the significance of evaluation in comprehending and propelling these models forward is increasingly paramount. Evaluations have revealed that factors such as scaling, training types, architectures and other factors profoundly impact the performance of LLMs. However, the extent and nature of these impacts continue to be subjects of debate because most assessments have been restricted to a limited number of models and data points. Clarifying the effects of these factors on performance scores can be more effectively achieved through a statistical lens. Our study embarks on a thorough re-examination of these LLMs, targeting the inadequacies in current evaluation methods. With the advent of a uniform evaluation framework, our research leverages an expansive dataset of evaluation results, introducing a comprehensive statistical methodology. This includes the application of ANOVA, Tukey HSD tests, GAMM, and clustering technique, offering a robust and transparent approach to deciphering LLM performance data. Contrary to prevailing findings, our results challenge assumptions about emergent abilities and the influence of given training types and architectures in LLMs. These findings furnish new perspectives on the characteristics, intrinsic nature, and developmental trajectories of LLMs. By providing straightforward and reliable methods to scrutinize and reassess LLM performance data, this study contributes a nuanced perspective on LLM efficiency and potentials.
♻ ☆ CaLM: Contrasting Large and Small Language Models to Verify Grounded Generation ACL 2024
Grounded generation aims to equip language models (LMs) with the ability to produce more credible and accountable responses by accurately citing verifiable sources. However, existing methods, by either feeding LMs with raw or preprocessed materials, remain prone to errors. To address this, we introduce CaLM, a novel verification framework. CaLM leverages the insight that a robust grounded response should be consistent with information derived solely from its cited sources. Our framework empowers smaller LMs, which rely less on parametric memory and excel at processing relevant information given a query, to validate the output of larger LMs. Larger LM responses that closely align with the smaller LMs' output, which relies exclusively on cited documents, are verified. Responses showing discrepancies are iteratively refined through a feedback loop. Experiments on three open-domain question-answering datasets demonstrate significant performance gains of 1.5% to 7% absolute average without any required model fine-tuning.
comment: ACL 2024 Camera Ready Version
♻ ☆ The Uli Dataset: An Exercise in Experience Led Annotation of oGBV
Online gender based violence has grown concomitantly with adoption of the internet and social media. Its effects are worse in the Global majority where many users use social media in languages other than English. The scale and volume of conversations on the internet has necessitated the need for automated detection of hate speech, and more specifically gendered abuse. There is, however, a lack of language specific and contextual data to build such automated tools. In this paper we present a dataset on gendered abuse in three languages- Hindi, Tamil and Indian English. The dataset comprises of tweets annotated along three questions pertaining to the experience of gender abuse, by experts who identify as women or a member of the LGBTQIA community in South Asia. Through this dataset we demonstrate a participatory approach to creating datasets that drive AI systems.
♻ ☆ Relation Extraction with Fine-Tuned Large Language Models in Retrieval Augmented Generation Frameworks
Information Extraction (IE) is crucial for converting unstructured data into structured formats like Knowledge Graphs (KGs). A key task within IE is Relation Extraction (RE), which identifies relationships between entities in text. Various RE methods exist, including supervised, unsupervised, weakly supervised, and rule-based approaches. Recent studies leveraging pre-trained language models (PLMs) have shown significant success in this area. In the current era dominated by Large Language Models (LLMs), fine-tuning these models can overcome limitations associated with zero-shot LLM prompting-based RE methods, especially regarding domain adaptation challenges and identifying implicit relations between entities in sentences. These implicit relations, which cannot be easily extracted from a sentence's dependency tree, require logical inference for accurate identification. This work explores the performance of fine-tuned LLMs and their integration into the Retrieval Augmented-based (RAG) RE approach to address the challenges of identifying implicit relations at the sentence level, particularly when LLMs act as generators within the RAG framework. Empirical evaluations on the TACRED, TACRED-Revisited (TACREV), Re-TACRED, and SemEVAL datasets show significant performance improvements with fine-tuned LLMs, including Llama2-7B, Mistral-7B, and T5 (Large). Notably, our approach achieves substantial gains on SemEVAL, where implicit relations are common, surpassing previous results on this dataset. Additionally, our method outperforms previous works on TACRED, TACREV, and Re-TACRED, demonstrating exceptional performance across diverse evaluation scenarios.
comment: preprint
♻ ☆ Agent Design Pattern Catalogue: A Collection of Architectural Patterns for Foundation Model based Agents
Foundation model-enabled generative artificial intelligence facilitates the development and implementation of agents, which can leverage distinguished reasoning and language processing capabilities to takes a proactive, autonomous role to pursue users' goals. Nevertheless, there is a lack of systematic knowledge to guide practitioners in designing the agents considering challenges of goal-seeking (including generating instrumental goals and plans), such as hallucinations inherent in foundation models, explainability of reasoning process, complex accountability, etc. To address this issue, we have performed a systematic literature review to understand the state-of-the-art foundation model-based agents and the broader ecosystem. In this paper, we present a pattern catalogue consisting of 17 architectural patterns with analyses of the context, forces, and trade-offs as the outcomes from the previous literature review. The proposed catalogue can provide holistic guidance for the effective use of patterns, and support the architecture design of foundation model-based agents by facilitating goal-seeking and plan generation.
♻ ☆ In-context Pretraining: Language Modeling Beyond Document Boundaries
Large language models (LMs) are currently trained to predict tokens given document prefixes, enabling them to directly perform long-form generation and prompting-style tasks which can be reduced to document completion. Existing pretraining pipelines train LMs by concatenating random sets of short documents to create input contexts but the prior documents provide no signal for predicting the next document. We instead present In-Context Pretraining, a new approach where language models are pretrained on a sequence of related documents, thereby explicitly encouraging them to read and reason across document boundaries. We can do In-Context Pretraining by simply changing the document ordering so that each context contains related documents, and directly applying existing pretraining pipelines. However, this document sorting problem is challenging. There are billions of documents and we would like the sort to maximize contextual similarity for every document without repeating any data. To do this, we introduce approximate algorithms for finding related documents with efficient nearest neighbor search and constructing coherent input contexts with a graph traversal algorithm. Our experiments show In-Context Pretraining offers a simple and scalable approach to significantly enhance LMs'performance: we see notable improvements in tasks that require more complex contextual reasoning, including in-context learning (+8%), reading comprehension (+15%), faithfulness to previous contexts (+16%), long-context reasoning (+5%), and retrieval augmentation (+9%).
♻ ☆ PromptKD: Distilling Student-Friendly Knowledge for Generative Language Models via Prompt Tuning
Recent advancements in large language models (LLMs) have raised concerns about inference costs, increasing the need for research into model compression. While knowledge distillation (KD) is a prominent method for this, research on KD for generative language models like LLMs is relatively sparse, and the approach of distilling student-friendly knowledge, which has shown promising performance in KD for classification models, remains unexplored in generative language models. To explore this approach, we propose PromptKD, a simple yet effective method that utilizes prompt tuning - for the first time in KD - to enable generative language models to transfer student-friendly knowledge. Unlike previous works in classification that require fine-tuning the entire teacher model for extracting student-friendly knowledge, PromptKD achieves similar effects by adding a small number of prompt tokens and tuning only the prompt with student guidance. Extensive experiments on instruction-following datasets show that PromptKD achieves state-of-the-art performance while adding only 0.0007% of the teacher's parameters as prompts. Further analysis suggests that distilling student-friendly knowledge alleviates exposure bias effectively throughout the entire training process, leading to performance enhancements.
comment: Code: https://github.com/gmkim-ai/PromptKD
♻ ☆ LLM-Assisted Content Conditional Debiasing for Fair Text Embedding
Mitigating biases in machine learning models has become an increasing concern in Natural Language Processing (NLP), particularly in developing fair text embeddings, which are crucial yet challenging for real-world applications like search engines. In response, this paper proposes a novel method for learning fair text embeddings. First, we define a novel content-conditional equal distance (CCED) fairness for text embeddings, ensuring content-conditional independence between sensitive attributes and text embeddings. Building on CCED, we introduce a content-conditional debiasing (CCD) loss to ensure that embeddings of texts with different sensitive attributes but identical content maintain the same distance from the embedding of their corresponding neutral text. Additionally, we tackle the issue of insufficient training data by using Large Language Models (LLMs) with instructions to fairly augment texts into different sensitive groups. Our extensive evaluations show that our approach effectively enhances fairness while maintaining the utility of embeddings. Furthermore, our augmented dataset, combined with the CCED metric, serves as an new benchmark for evaluating fairness.
♻ ☆ CTIBench: A Benchmark for Evaluating LLMs in Cyber Threat Intelligence
Cyber threat intelligence (CTI) is crucial in today's cybersecurity landscape, providing essential insights to understand and mitigate the ever-evolving cyber threats. The recent rise of Large Language Models (LLMs) have shown potential in this domain, but concerns about their reliability, accuracy, and hallucinations persist. While existing benchmarks provide general evaluations of LLMs, there are no benchmarks that address the practical and applied aspects of CTI-specific tasks. To bridge this gap, we introduce CTIBench, a benchmark designed to assess LLMs' performance in CTI applications. CTIBench includes multiple datasets focused on evaluating knowledge acquired by LLMs in the cyber-threat landscape. Our evaluation of several state-of-the-art models on these tasks provides insights into their strengths and weaknesses in CTI contexts, contributing to a better understanding of LLM capabilities in CTI.
♻ ☆ WeatherQA: Can Multimodal Language Models Reason about Severe Weather?
Severe convective weather events, such as hail, tornadoes, and thunderstorms, often occur quickly yet cause significant damage, costing billions of dollars every year. This highlights the importance of forecasting severe weather threats hours in advance to better prepare meteorologists and residents in at-risk areas. Can modern large foundation models perform such forecasting? Existing weather benchmarks typically focus only on predicting time-series changes in certain weather parameters (e.g., temperature, moisture) with text-only features. In this work, we introduce WeatherQA, the first multimodal dataset designed for machines to reason about complex combinations of weather parameters (a.k.a., ingredients) and predict severe weather in real-world scenarios. The dataset includes over 8,000 (multi-images, text) pairs for diverse severe weather events. Each pair contains rich information crucial for forecasting -- the images describe the ingredients capturing environmental instability, surface observations, and radar reflectivity, and the text contains forecast analyses written by human experts. With WeatherQA, we evaluate state-of-the-art vision language models, including GPT4, Claude3.5, Gemini-1.5, and a fine-tuned Llama3-based VLM, by designing two challenging tasks: (1) multi-choice QA for predicting affected area and (2) classification of the development potential of severe convection. These tasks require deep understanding of domain knowledge (e.g., atmospheric dynamics) and complex reasoning over multimodal data (e.g., interactions between weather parameters). We show a substantial gap between the strongest VLM, GPT4o, and human reasoning. Our comprehensive case study with meteorologists further reveals the weaknesses of the models, suggesting that better training and data integration are necessary to bridge this gap. WeatherQA link: https://github.com/chengqianma/WeatherQA.
comment: 26 pages, 9 figures
♻ ☆ FightLadder: A Benchmark for Competitive Multi-Agent Reinforcement Learning ICML 2024
Recent advances in reinforcement learning (RL) heavily rely on a variety of well-designed benchmarks, which provide environmental platforms and consistent criteria to evaluate existing and novel algorithms. Specifically, in multi-agent RL (MARL), a plethora of benchmarks based on cooperative games have spurred the development of algorithms that improve the scalability of cooperative multi-agent systems. However, for the competitive setting, a lightweight and open-sourced benchmark with challenging gaming dynamics and visual inputs has not yet been established. In this work, we present FightLadder, a real-time fighting game platform, to empower competitive MARL research. Along with the platform, we provide implementations of state-of-the-art MARL algorithms for competitive games, as well as a set of evaluation metrics to characterize the performance and exploitability of agents. We demonstrate the feasibility of this platform by training a general agent that consistently defeats 12 built-in characters in single-player mode, and expose the difficulty of training a non-exploitable agent without human knowledge and demonstrations in two-player mode. FightLadder provides meticulously designed environments to address critical challenges in competitive MARL research, aiming to catalyze a new era of discovery and advancement in the field. Videos and code at https://sites.google.com/view/fightladder/home.
comment: ICML 2024
♻ ☆ ReLLa: Retrieval-enhanced Large Language Models for Lifelong Sequential Behavior Comprehension in Recommendation WWW 2024
With large language models (LLMs) achieving remarkable breakthroughs in natural language processing (NLP) domains, LLM-enhanced recommender systems have received much attention and have been actively explored currently. In this paper, we focus on adapting and empowering a pure large language model for zero-shot and few-shot recommendation tasks. First and foremost, we identify and formulate the lifelong sequential behavior incomprehension problem for LLMs in recommendation domains, i.e., LLMs fail to extract useful information from a textual context of long user behavior sequence, even if the length of context is far from reaching the context limitation of LLMs. To address such an issue and improve the recommendation performance of LLMs, we propose a novel framework, namely Retrieval-enhanced Large Language models (ReLLa) for recommendation tasks in both zero-shot and few-shot settings. For zero-shot recommendation, we perform semantic user behavior retrieval (SUBR) to improve the data quality of testing samples, which greatly reduces the difficulty for LLMs to extract the essential knowledge from user behavior sequences. As for few-shot recommendation, we further design retrieval-enhanced instruction tuning (ReiT) by adopting SUBR as a data augmentation technique for training samples. Specifically, we develop a mixed training dataset consisting of both the original data samples and their retrieval-enhanced counterparts. We conduct extensive experiments on three real-world public datasets to demonstrate the superiority of ReLLa compared with existing baseline models, as well as its capability for lifelong sequential behavior comprehension. To be highlighted, with only less than 10% training samples, few-shot ReLLa can outperform traditional CTR models that are trained on the entire training set (e.g., DCNv2, DIN, SIM). The code is available \url{https://github.com/LaVieEnRose365/ReLLa}.
comment: Accepted by WWW 2024. Full and More Readable Version
♻ ☆ Regularized Best-of-N Sampling to Mitigate Reward Hacking for Language Model Alignment
Best-of-N (BoN) sampling with a reward model has been shown to be an effective strategy for aligning Large Language Models (LLMs) to human preferences at the time of decoding. BoN sampling is susceptible to a problem known as reward hacking. Because the reward model is an imperfect proxy for the true objective, over-optimizing its value can compromise its performance on the true objective. A common solution to prevent reward hacking in preference learning techniques is to optimize a reward using proximity regularization (e.g., KL regularization), which ensures that the language model remains close to the reference model. In this research, we propose Regularized Best-of-N (RBoN), a variant of BoN that aims to mitigate reward hacking by incorporating a proximity term in response selection, similar to preference learning techniques. We evaluate RBoN on the AlpacaFarm and Anthropic's hh-rlhf datasets and find that it outperforms BoN. As an application of RBoN, we use RBoN to generate a pairwise preference learning dataset. Experimental results show that a DPO model trained on a dataset generated with RBoN outperforms a DPO model generated with vanilla BoN. Our code is available at https://github.com/CyberAgentAILab/regularized-bon
♻ ☆ EasyEdit: An Easy-to-use Knowledge Editing Framework for Large Language Models ACL 2024
Large Language Models (LLMs) usually suffer from knowledge cutoff or fallacy issues, which means they are unaware of unseen events or generate text with incorrect facts owing to outdated/noisy data. To this end, many knowledge editing approaches for LLMs have emerged -- aiming to subtly inject/edit updated knowledge or adjust undesired behavior while minimizing the impact on unrelated inputs. Nevertheless, due to significant differences among various knowledge editing methods and the variations in task setups, there is no standard implementation framework available for the community, which hinders practitioners from applying knowledge editing to applications. To address these issues, we propose EasyEdit, an easy-to-use knowledge editing framework for LLMs. It supports various cutting-edge knowledge editing approaches and can be readily applied to many well-known LLMs such as T5, GPT-J, LlaMA, etc. Empirically, we report the knowledge editing results on LlaMA-2 with EasyEdit, demonstrating that knowledge editing surpasses traditional fine-tuning in terms of reliability and generalization. We have released the source code on GitHub, along with Google Colab tutorials and comprehensive documentation for beginners to get started. Besides, we present an online system for real-time knowledge editing, and a demo video.
comment: ACL 2024 System Demonstrations; Code: https://github.com/zjunlp/EasyEdit HF Demo: https://huggingface.co/spaces/zjunlp/EasyEdit Video: https://youtu.be/Gm6T0QaaskU Docs: https://zjunlp.gitbook.io/easyedit
♻ ☆ EasyInstruct: An Easy-to-use Instruction Processing Framework for Large Language Models ACL 2024
In recent years, instruction tuning has gained increasing attention and emerged as a crucial technique to enhance the capabilities of Large Language Models (LLMs). To construct high-quality instruction datasets, many instruction processing approaches have been proposed, aiming to achieve a delicate balance between data quantity and data quality. Nevertheless, due to inconsistencies that persist among various instruction processing methods, there is no standard open-source instruction processing implementation framework available for the community, which hinders practitioners from further developing and advancing. To facilitate instruction processing research and development, we present EasyInstruct, an easy-to-use instruction processing framework for LLMs, which modularizes instruction generation, selection, and prompting, while also considering their combination and interaction. EasyInstruct is publicly released and actively maintained at https://github.com/zjunlp/EasyInstruct, along with an online demo app and a demo video for quick-start, calling for broader research centered on instruction data and synthetic data.
comment: ACL 2024 System Demonstrations; Project website: https://zjunlp.github.io/project/EasyInstruct Code: https://github.com/zjunlp/EasyInstruct Video: https://youtu.be/rfQOWYfziFo Demo: https://huggingface.co/spaces/zjunlp/EasyInstruct
♻ ☆ Intrinsic LoRA: A Generalist Approach for Discovering Knowledge in Generative Models
Generative models excel at creating images that closely mimic real scenes, suggesting they inherently encode scene representations. We introduce Intrinsic LoRA (I-LoRA), a general approach that uses Low-Rank Adaptation (LoRA) to discover scene intrinsics such as normals, depth, albedo, and shading from a wide array of generative models. I-LoRA is lightweight, adding minimally to the model's parameters and requiring very small datasets for this knowledge discovery. Our approach, applicable to Diffusion models, GANs, and Autoregressive models alike, generates intrinsics using the same output head as the original images. Through control experiments, we establish a correlation between the generative model's quality and the extracted intrinsics' accuracy. Finally, scene intrinsics obtained by our method with just hundreds to thousands of labeled images, perform on par with those from supervised methods trained on millions of labeled examples.
comment: https://intrinsic-lora.github.io/
♻ ☆ Non-rigid Structure-from-Motion: Temporally-smooth Procrustean Alignment and Spatially-variant Deformation Modeling CVPR 2024
Even though Non-rigid Structure-from-Motion (NRSfM) has been extensively studied and great progress has been made, there are still key challenges that hinder their broad real-world applications: 1) the inherent motion/rotation ambiguity requires either explicit camera motion recovery with extra constraint or complex Procrustean Alignment; 2) existing low-rank modeling of the global shape can over-penalize drastic deformations in the 3D shape sequence. This paper proposes to resolve the above issues from a spatial-temporal modeling perspective. First, we propose a novel Temporally-smooth Procrustean Alignment module that estimates 3D deforming shapes and adjusts the camera motion by aligning the 3D shape sequence consecutively. Our new alignment module remedies the requirement of complex reference 3D shape during alignment, which is more conductive to non-isotropic deformation modeling. Second, we propose a spatial-weighted approach to enforce the low-rank constraint adaptively at different locations to accommodate drastic spatially-variant deformation reconstruction better. Our modeling outperform existing low-rank based methods, and extensive experiments across different datasets validate the effectiveness of our method.
comment: Accepted by CVPR 2024; V2 adds new experiments
♻ ☆ Source-Free Domain Adaptation with Diffusion-Guided Source Data Generation
This paper introduces a novel approach to leverage the generalizability of Diffusion Models for Source-Free Domain Adaptation (DM-SFDA). Our proposed DMSFDA method involves fine-tuning a pre-trained text-to-image diffusion model to generate source domain images using features from the target images to guide the diffusion process. Specifically, the pre-trained diffusion model is fine-tuned to generate source samples that minimize entropy and maximize confidence for the pre-trained source model. We then use a diffusion model-based image mixup strategy to bridge the domain gap between the source and target domains. We validate our approach through comprehensive experiments across a range of datasets, including Office-31 [39], Office-Home [48], and VisDA [35]. The results demonstrate significant improvements in SFDA performance, highlighting the potential of diffusion models in generating contextually relevant, domain-specific images.
comment: arXiv admin note: substantial text overlap with arXiv:2310.01701
♻ ☆ VeraCT Scan: Retrieval-Augmented Fake News Detection with Justifiable Reasoning
The proliferation of fake news poses a significant threat not only by disseminating misleading information but also by undermining the very foundations of democracy. The recent advance of generative artificial intelligence has further exacerbated the challenge of distinguishing genuine news from fabricated stories. In response to this challenge, we introduce VeraCT Scan, a novel retrieval-augmented system for fake news detection. This system operates by extracting the core facts from a given piece of news and subsequently conducting an internet-wide search to identify corroborating or conflicting reports. Then sources' credibility is leveraged for information verification. Besides determining the veracity of news, we also provide transparent evidence and reasoning to support its conclusions, resulting in the interpretability and trust in the results. In addition to GPT-4 Turbo, Llama-2 13B is also fine-tuned for news content understanding, information verification, and reasoning. Both implementations have demonstrated state-of-the-art accuracy in the realm of fake news detection.
♻ ☆ Can Differentiable Decision Trees Enable Interpretable Reward Learning from Human Feedback?
Reinforcement Learning from Human Feedback (RLHF) has emerged as a popular paradigm for capturing human intent to alleviate the challenges of hand-crafting the reward values. Despite the increasing interest in RLHF, most works learn black box reward functions that while expressive are difficult to interpret and often require running the whole costly process of RL before we can even decipher if these frameworks are actually aligned with human preferences. We propose and evaluate a novel approach for learning expressive and interpretable reward functions from preferences using Differentiable Decision Trees (DDTs). Our experiments across several domains, including CartPole, Visual Gridworld environments and Atari games, provide evidence that the tree structure of our learned reward function is useful in determining the extent to which the reward function is aligned with human preferences. We also provide experimental evidence that not only shows that reward DDTs can often achieve competitive RL performance when compared with larger capacity deep neural network reward functions but also demonstrates the diagnostic utility of our framework in checking alignment of learned reward functions. We also observe that the choice between soft and hard (argmax) output of reward DDT reveals a tension between wanting highly shaped rewards to ensure good RL performance, while also wanting simpler, more interpretable rewards. Videos and code, are available at: https://sites.google.com/view/ddt-rlhf
comment: Accepted at RLC 2024
♻ ☆ MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions ICML 2024
Image retrieval, i.e., finding desired images given a reference image, inherently encompasses rich, multi-faceted search intents that are difficult to capture solely using image-based measures. Recent works leverage text instructions to allow users to more freely express their search intents. However, they primarily focus on image pairs that are visually similar and/or can be characterized by a small set of pre-defined relations. The core thesis of this paper is that text instructions can enable retrieving images with richer relations beyond visual similarity. To show this, we introduce MagicLens, a series of self-supervised image retrieval models that support open-ended instructions. MagicLens is built on a key novel insight: image pairs that naturally occur on the same web pages contain a wide range of implicit relations (e.g., inside view of), and we can bring those implicit relations explicit by synthesizing instructions via foundation models. Trained on 36.7M (query image, instruction, target image) triplets with rich semantic relations mined from the web, MagicLens achieves results comparable with or better than prior best on eight benchmarks of various image retrieval tasks, while maintaining high parameter efficiency with a significantly smaller model size. Additional human analyses on a 1.4M-image unseen corpus further demonstrate the diversity of search intents supported by MagicLens. Code and models are publicly available at https://open-vision-language.github.io/MagicLens/.
comment: ICML 2024 (Oral); Project Website: https://open-vision-language.github.io/MagicLens/
♻ ☆ Latent Diffusion Model-Enabled Real-Time Semantic Communication Considering Semantic Ambiguities and Channel Noises
Semantic communication (SemCom) has emerged as a new paradigm for 6G communication, with deep learning (DL) models being one of the key drives to shift from the accuracy of bit/symbol to the semantics and pragmatics of data. Nevertheless, DL-based SemCom systems often face performance bottlenecks due to overfitting, poor generalization, and sensitivity to outliers. Furthermore, the varying-fading gains and noises with uncertain signal-to-noise ratios (SNRs) commonly present in wireless channels usually restrict the accuracy of semantic information transmission. Consequently, this paper constructs a latent diffusion model-enabled SemCom system, and proposes three improvements compared to existing works: i) To handle potential outliers in the source data, semantic errors obtained by projected gradient descent based on the vulnerabilities of DL models, are utilized to update the parameters and obtain an outlier-robust encoder. ii) A lightweight single-layer latent space transformation adapter completes one-shot learning at the transmitter and is placed before the decoder at the receiver, enabling adaptation for out-of-distribution data and enhancing human-perceptual quality. iii) An end-to-end consistency distillation (EECD) strategy is used to distill the diffusion models trained in latent space, enabling deterministic single or few-step real-time denoising in various noisy channels while maintaining high semantic quality. Extensive numerical experiments across different datasets demonstrate the superiority of the proposed SemCom system, consistently proving its robustness to outliers, the capability to transmit data with unknown distributions, and the ability to perform real-time channel denoising tasks while preserving high human perceptual quality, outperforming the existing denoising approaches in semantic metrics.
♻ ☆ Laplacian Convolutional Representation for Traffic Time Series Imputation
Spatiotemporal traffic data imputation is of great significance in intelligent transportation systems and data-driven decision-making processes. To perform efficient learning and accurate reconstruction from partially observed traffic data, we assert the importance of characterizing both global and local trends in time series. In the literature, substantial works have demonstrated the effectiveness of utilizing the low-rank property of traffic data by matrix/tensor completion models. In this study, we first introduce a Laplacian kernel to temporal regularization for characterizing local trends in traffic time series, which can be formulated as a circular convolution. Then, we develop a low-rank Laplacian convolutional representation (LCR) model by putting the circulant matrix nuclear norm and the Laplacian kernelized temporal regularization together, which is proved to meet a unified framework that has a fast Fourier transform (FFT) solution in log-linear time complexity. Through extensive experiments on several traffic datasets, we demonstrate the superiority of LCR over several baseline models for imputing traffic time series of various time series behaviors (e.g., data noises and strong/weak periodicity) and reconstructing sparse speed fields of vehicular traffic flow. The proposed LCR model is also an efficient solution to large-scale traffic data imputation over the existing imputation models.
comment: 13 pages, 9 figures
♻ ☆ Supervised learning of spatial features with STDP and homeostasis using Spiking Neural Networks on SpiNNaker
Artificial Neural Networks (ANN) have gained significant popularity thanks to their ability to learn using the well-known backpropagation algorithm. Conversely, Spiking Neural Networks (SNNs), despite having broader capabilities than ANNs, have always posed challenges in the training phase. This paper shows a new method to perform supervised learning on SNNs, using Spike Timing Dependent Plasticity (STDP) and homeostasis, aiming at training the network to identify spatial patterns. Spatial patterns refer to spike patterns without a time component, where all spike events occur simultaneously. The method is tested using the SpiNNaker digital architecture. A SNN is trained to recognise one or multiple patterns and performance metrics are extracted to measure the performance of the network. Some considerations are drawn from the results showing that, in the case of a single trained pattern, the network behaves as the ideal detector, with 100% accuracy in detecting the trained pattern. However, as the number of trained patterns on a single network increases, the accuracy of identification is linked to the similarities between these patterns. This method of training an SNN to detect spatial patterns may be applied to pattern recognition in static images or traffic analysis in computer networks, where each network packet represents a spatial pattern. It will be stipulated that the homeostatic factor may enable the network to detect patterns with some degree of similarity, rather than only perfectly matching patterns.The principles outlined in this article serve as the fundamental building blocks for more complex systems that utilise both spatial and temporal patterns by converting specific features of input signals into spikes.One example of such a system is a computer network packet classifier, tasked with real-time identification of packet streams based on features within the packet content
comment: 14 pages, 6 figures (figure 6 has 9 sub-figures) for a total of 14 images, 10 tables, submitted to the Journal of Neural Networks
♻ ☆ Sparse Expansion and Neuronal Disentanglement
We show how to improve the inference efficiency of an LLM by expanding it into a mixture of sparse experts, where each expert is a copy of the original weights, one-shot pruned for a specific cluster of input values. We call this approach $\textit{Sparse Expansion}$. We show that, for models such as Llama 2 70B, as we increase the number of sparse experts, Sparse Expansion outperforms all other one-shot sparsification approaches for the same inference FLOP budget per token, and that this gap grows as sparsity increases, leading to inference speedups. But why? To answer this, we provide strong evidence that the mixture of sparse experts is effectively $\textit{disentangling}$ the input-output relationship of every individual neuron across clusters of inputs. Specifically, sparse experts approximate the dense neuron output distribution with fewer weights by decomposing the distribution into a collection of simpler ones, each with a separate sparse dot product covering it. Interestingly, we show that the Wasserstein distance between a neuron's output distribution and a Gaussian distribution is an indicator of its entanglement level and contribution to the accuracy of the model. Every layer of an LLM has a fraction of highly entangled Wasserstein neurons, and model performance suffers more when these are sparsified as opposed to others. The code for Sparse Expansion is available at: https://github.com/Shavit-Lab/Sparse-Expansion .
comment: 10 pages, 8 figures
♻ ☆ Hessian Aware Low-Rank Perturbation for Order-Robust Continual Learning
Continual learning aims to learn a series of tasks sequentially without forgetting the knowledge acquired from the previous ones. In this work, we propose the Hessian Aware Low-Rank Perturbation algorithm for continual learning. By modeling the parameter transitions along the sequential tasks with the weight matrix transformation, we propose to apply the low-rank approximation on the task-adaptive parameters in each layer of the neural networks. Specifically, we theoretically demonstrate the quantitative relationship between the Hessian and the proposed low-rank approximation. The approximation ranks are then globally determined according to the marginal increment of the empirical loss estimated by the layer-specific gradient and low-rank approximation error. Furthermore, we control the model capacity by pruning less important parameters to diminish the parameter growth. We conduct extensive experiments on various benchmarks, including a dataset with large-scale tasks, and compare our method against some recent state-of-the-art methods to demonstrate the effectiveness and scalability of our proposed method. Empirical results show that our method performs better on different benchmarks, especially in achieving task order robustness and handling the forgetting issue. The source code is at https://github.com/lijiaqi/HALRP.
comment: Accepted by IEEE Transactions on Knowledge and Data Engineering (TKDE)
♻ ☆ Demystifying the Compression of Mixture-of-Experts Through a Unified Framework
Scaling large language models has revolutionized the performance across diverse domains, yet the continual growth in model size poses significant challenges for real-world deployment. The Mixture of Experts (MoE) approach addresses this by dynamically selecting and activating only a subset of experts, significantly reducing computational costs while maintaining high performance. However, MoE introduces potential redundancy (e.g., parameters) and extra costs (e.g., communication overhead). Despite numerous compression techniques developed for mitigating the redundancy in dense models, the compression of MoE remains under-explored. We first bridge this gap with a cutting-edge unified framework that not only seamlessly integrates mainstream compression methods but also helps systematically understand MoE compression. This framework approaches compression from two perspectives: Expert Slimming which compresses individual experts and Expert Trimming which removes structured modules. Within this framework, we explore the optimization space unexplored by existing methods,and further introduce aggressive Expert Trimming techniques, i.e., Layer Drop and Block Drop, to eliminate redundancy at larger scales. Based on these insights,we present a comprehensive recipe to guide practitioners in compressing MoE effectively. Extensive experimental results demonstrate the effectiveness of the compression methods under our framework and the proposed recipe, achieving a 6.05x speedup and only 20.0GB memory usage while maintaining over 92% of performance on Mixtral-8x7B. Code is released at \url{https://github.com/DaizeDong/Unified-MoE-Compression}.
comment: 20 pages, 15 figures, 5 tables
♻ ☆ LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models NAACL 2024
Today's large language models (LLMs) typically train on short text segments (e.g., <4K tokens) due to the quadratic complexity of their Transformer architectures. As a result, their performance suffers drastically on inputs longer than those encountered during training, substantially limiting their applications in real-world tasks involving long contexts such as encoding scientific articles, code repositories, or long dialogues. Through theoretical analysis and empirical investigation, this work identifies three major factors contributing to this length generalization failure. Our theoretical analysis further reveals that commonly used techniques like truncating the attention window or relative positional encodings are inadequate to address them. Answering these challenges, we propose LM-Infinite, a simple and effective method for enhancing LLMs' capabilities of handling long contexts. LM-Infinite is highly flexible and can be used with most modern LLMs off-the-shelf. Without any parameter updates, it allows LLMs pre-trained with 2K or 4K-long segments to generalize to up to 200M length inputs while retaining perplexity. It also improves performance on downstream tasks such as Passkey Retrieval and Qasper in the zero-shot setting. LM-Infinite brings substantial efficiency improvements: it achieves 2.7x decoding speed up and 7.5x memory saving over the original model. Our codes are released at \url{https://github.com/Glaciohound/LM-Infinite}.
comment: NAACL 2024 Outstanding paper, 9 pages, 6 figures
♻ ☆ Multi-task learning for molecular electronic structure approaching coupled-cluster accuracy
Machine learning (ML) plays an important role in quantum chemistry, providing fast-to-evaluate predictive models for various properties of molecules. However, most existing ML models for molecular electronic properties use density functional theory (DFT) databases as ground truth in training, and their prediction accuracy cannot surpass that of DFT. In this work, we developed a unified ML method for electronic structures of organic molecules using the gold-standard CCSD(T) calculations as training data. Tested on hydrocarbon molecules, our model outperforms DFT with the widely-used hybrid and double hybrid functionals in computational costs and prediction accuracy of various quantum chemical properties. As case studies, we apply the model to aromatic compounds and semiconducting polymers on both ground state and excited state properties, demonstrating its accuracy and generalization capability to complex systems that are hard to calculate using CCSD(T)-level methods.
♻ ☆ Sphere Neural-Networks for Rational Reasoning
The success of Large Language Models (LLMs), e.g., ChatGPT, is witnessed by their planetary popularity, their capability of human-like communication, and also by their steadily improved reasoning performance. However, it remains unclear whether LLMs reason. It is an open problem how traditional neural networks can be qualitatively extended to go beyond the statistic paradigm and achieve high-level cognition. Here, we present a novel qualitative extension by generalising computational building blocks from vectors to spheres. We propose Sphere Neural Networks (SphNNs) for human-like reasoning through model construction and inspection, and develop SphNN for syllogistic reasoning, a microcosm of human rationality. SphNN is a hierarchical neuro-symbolic Kolmogorov-Arnold geometric GNN, and uses a neuro-symbolic transition map of neighbourhood spatial relations to transform the current sphere configuration towards the target. SphNN is the first neural model that can determine the validity of long-chained syllogistic reasoning in one epoch without training data, with the worst computational complexity of O(N). SphNN can evolve into various types of reasoning, such as spatio-temporal reasoning, logical reasoning with negation and disjunction, event reasoning, neuro-symbolic unification, and humour understanding (the highest level of cognition). All these suggest a new kind of Herbert A. Simon's scissors with two neural blades. SphNNs will tremendously enhance interdisciplinary collaborations to develop the two neural blades and realise deterministic neural reasoning and human-bounded rationality and elevate LLMs to reliable psychological AI. This work suggests that the non-zero radii of spheres are the missing components that prevent traditional deep-learning systems from reaching the realm of rational reasoning and cause LLMs to be trapped in the swamp of hallucination.
♻ ☆ Enabling Accelerators for Graph Computing
The advent of Graph Neural Networks (GNNs) has revolutionized the field of machine learning, offering a novel paradigm for learning on graph-structured data. Unlike traditional neural networks, GNNs are capable of capturing complex relationships and dependencies inherent in graph data, making them particularly suited for a wide range of applications including social network analysis, molecular chemistry, and network security. GNNs, with their unique structure and operation, present new computational challenges compared to conventional neural networks. This requires comprehensive benchmarking and a thorough characterization of GNNs to obtain insight into their computational requirements and to identify potential performance bottlenecks. In this thesis, we aim to develop a better understanding of how GNNs interact with the underlying hardware and will leverage this knowledge as we design specialized accelerators and develop new optimizations, leading to more efficient and faster GNN computations. A pivotal component within GNNs is the Sparse General Matrix-Matrix Multiplication (SpGEMM) kernel, known for its computational intensity and irregular memory access patterns. In this thesis, we address the challenges posed by SpGEMM by implementing a highly optimized hashing-based SpGEMM kernel tailored for a custom accelerator. Synthesizing these insights and optimizations, we design state-of-the-art hardware accelerators capable of efficiently handling various GNN workloads. Our accelerator architectures are built on our characterization of GNN computational demands, providing clear motivation for our approaches. This exploration into novel models underlines our comprehensive approach, as we strive to enable accelerators that are not just performant, but also versatile, able to adapt to the evolving landscape of graph computing.
comment: Northeastern University Doctoral Dissertation
♻ ☆ Improving the Explain-Any-Concept by Introducing Nonlinearity to the Trainable Surrogate Model
In the evolving field of Explainable AI (XAI), interpreting the decisions of deep neural networks (DNNs) in computer vision tasks is an important process. While pixel-based XAI methods focus on identifying significant pixels, existing concept-based XAI methods use pre-defined or human-annotated concepts. The recently proposed Segment Anything Model (SAM) achieved a significant step forward to prepare automatic concept sets via comprehensive instance segmentation. Building upon this, the Explain Any Concept (EAC) model emerged as a flexible method for explaining DNN decisions. EAC model is based on using a surrogate model which has one trainable linear layer to simulate the target model. In this paper, by introducing an additional nonlinear layer to the original surrogate model, we show that we can improve the performance of the EAC model. We compare our proposed approach to the original EAC model and report improvements obtained on both ImageNet and MS COCO datasets.
comment: This paper is accepted for publication at IEEE SIU conference, 2024
♻ ☆ Aligner: Efficient Alignment by Learning to Correct
With the rapid development of large language models (LLMs) and ever-evolving practical requirements, finding an efficient and effective alignment method has never been more critical. However, the tension between the complexity of current alignment methods and the need for rapid iteration in deployment scenarios necessitates the development of a model-agnostic alignment approach that can operate under these constraints. In this paper, we introduce Aligner, a novel and simple alignment paradigm that learns the correctional residuals between preferred and dispreferred answers using a small model. Designed as a model-agnostic, plug-and-play module, Aligner can be directly applied to various open-source and API-based models with only one-off training, making it suitable for rapid iteration. Notably, Aligner can be applied to any powerful, large-scale upstream models. Moreover, it can even iteratively bootstrap the upstream models using corrected responses as synthetic human preference data, breaking through the model's performance ceiling. Our experiments demonstrate performance improvements by deploying the same Aligner model across 11 different LLMs, evaluated on the 3H dimensions (helpfulness, harmlessness, and honesty). Specifically, Aligner-7B has achieved an average improvement of 68.9% in helpfulness and 23.8% in harmlessness across the tested LLMs while also effectively reducing hallucination. In the Alpaca-Eval leaderboard, stacking Aligner-2B on GPT-4 Turbo improved its LC Win Rate from 55.0% to 58.3%, surpassing GPT-4 Omni's 57.5% Win Rate (community report).
♻ ☆ Perceiver-based CDF Modeling for Time Series Forecasting
Transformers have demonstrated remarkable efficacy in forecasting time series data. However, their extensive dependence on self-attention mechanisms demands significant computational resources, thereby limiting their practical applicability across diverse tasks, especially in multimodal problems. In this work, we propose a new architecture, called perceiver-CDF, for modeling cumulative distribution functions (CDF) of time series data. Our approach combines the perceiver architecture with a copula-based attention mechanism tailored for multimodal time series prediction. By leveraging the perceiver, our model efficiently transforms high-dimensional and multimodal data into a compact latent space, thereby significantly reducing computational demands. Subsequently, we implement a copula-based attention mechanism to construct the joint distribution of missing data for prediction. Further, we propose an output variance testing mechanism to effectively mitigate error propagation during prediction. To enhance efficiency and reduce complexity, we introduce midpoint inference for the local attention mechanism. This enables the model to efficiently capture dependencies within nearby imputed samples without considering all previous samples. The experiments on the unimodal and multimodal benchmarks consistently demonstrate a 20% improvement over state-of-the-art methods while utilizing less than half of the computational resources.
comment: Accepted in Winter Simulation Conference 2024
♻ ☆ Safe Task Planning for Language-Instructed Multi-Robot Systems using Conformal Prediction
This paper addresses task planning problems for language-instructed robot teams. Tasks are expressed in natural language (NL), requiring the robots to apply their capabilities at various locations and semantic objects. Several recent works have addressed similar planning problems by leveraging pre-trained Large Language Models (LLMs) to design effective multi-robot plans. However, these approaches lack mission completion guarantees. To address this challenge, we introduce a new decentralized LLM-based planner, called S-ATLAS for Safe plAnning for Teams of Language-instructed AgentS, that is capable of achieving user-defined mission success rates. This is accomplished by leveraging conformal prediction (CP), a distribution-free uncertainty quantification tool in black-box models. CP allows the proposed multi-robot planner to reason about its inherent uncertainty in a decentralized fashion, enabling robots to make individual decisions when they are sufficiently certain and seek help otherwise. We show, both theoretically and empirically, that the proposed planner can achieve user-specified task success rates while minimizing the overall number of help requests. We provide comparative experiments against related works showing that our method is significantly more computational efficient and achieves lower help rates. The advantage of our algorithm over baselines becomes more pronounced with increasing robot team size.
Information Retrieval 21
☆ Ragnarök: A Reusable RAG Framework and Baselines for TREC 2024 Retrieval-Augmented Generation Track
Did you try out the new Bing Search? Or maybe you fiddled around with Google AI~Overviews? These might sound familiar because the modern-day search stack has recently evolved to include retrieval-augmented generation (RAG) systems. They allow searching and incorporating real-time data into large language models (LLMs) to provide a well-informed, attributed, concise summary in contrast to the traditional search paradigm that relies on displaying a ranked list of documents. Therefore, given these recent advancements, it is crucial to have an arena to build, test, visualize, and systematically evaluate RAG-based search systems. With this in mind, we propose the TREC 2024 RAG Track to foster innovation in evaluating RAG systems. In our work, we lay out the steps we've made towards making this track a reality -- we describe the details of our reusable framework, Ragnar\"ok, explain the curation of the new MS MARCO V2.1 collection choice, release the development topics for the track, and standardize the I/O definitions which assist the end user. Next, using Ragnar\"ok, we identify and provide key industrial baselines such as OpenAI's GPT-4o or Cohere's Command R+. Further, we introduce a web-based user interface for an interactive arena allowing benchmarking pairwise RAG systems by crowdsourcing. We open-source our Ragnar\"ok framework and baselines to achieve a unified standard for future RAG systems.
☆ Meta-experiments: Improving experimentation through experimentation
A/B testing is widexly used in the industry to optimize customer facing websites. Many companies employ experimentation specialists to facilitate and improve the process of A/B testing. Here, we present the application of A/B testing to this improvement effort itself, by running experiments on the experimentation process, which we call 'meta-experiments'. We discuss the challenges of this approach using the example of one of our meta-experiments, which helped experimenters to run more sufficiently powered A/B tests. We also point out the benefits of 'dog fooding' for the experimentation specialists when running their own experiments.
comment: 6 pages, 2 figures, 1 table
☆ Star+: A New Multi-Domain Model for CTR Prediction
In this paper, we introduce Star+, a novel multi-domain model for click-through rate (CTR) prediction inspired by the Star model. Traditional single-domain approaches and existing multi-task learning techniques face challenges in multi-domain environments due to their inability to capture domain-specific data distributions and complex inter-domain relationships. Star+ addresses these limitations by enhancing the interaction between shared and domain-specific information through various fusion strategies, such as add, adaptive add, concatenation, and gating fusions, to find the optimal balance between domain-specific and shared information. We also investigate the impact of different normalization techniques, including layer normalization, batch normalization, and partition normalization, on the performance of our model. Our extensive experiments on both industrial and public datasets demonstrate that Star+ significantly improves prediction accuracy and efficiency. This work contributes to the advancement of recommendation systems by providing a robust, scalable, and adaptive solution for multi-domain environments.
☆ Cross-domain Transfer of Valence Preferences via a Meta-optimization Approach
Cross-domain recommendation offers a potential avenue for alleviating data sparsity and cold-start problems. Embedding and mapping, as a classic cross-domain research genre, aims to identify a common mapping function to perform representation transformation between two domains. Nevertheless, previous coarse-grained preference representations, non-personalized mapping functions, and excessive reliance on overlapping users limit their performance, especially in scenarios where overlapping users are sparse. To address aforementioned challenges, we propose a novel cross-domain approach, namely CVPM. CVPM formalizes cross-domain interest transfer as a hybrid architecture of parametric meta-learning and self-supervised learning, which not only transfers user preferences at a finer level, but also enables signal enhancement with the knowledge of non-overlapping users. Specifically, with deep insights into user preferences and valence preference theory, we believe that there exists significant difference between users' positive preferences and negative behaviors, and thus employ differentiated encoders to learn their distributions. In particular, we further utilize the pre-trained model and item popularity to sample pseudo-interaction items to ensure the integrity of both distributions. To guarantee the personalization of preference transfer, we treat each user's mapping as two parts, the common transformation and the personalized bias, where the network used to generate the personalized bias is output by a meta-learner. Furthermore, in addition to the supervised loss for overlapping users, we design contrastive tasks for non-overlapping users from both group and individual-levels to avoid model skew and enhance the semantics of representations. Exhaustive data analysis and extensive experimental results demonstrate the effectiveness and advancement of our proposed framework.
☆ Context-augmented Retrieval: A Novel Framework for Fast Information Retrieval based Response Generation using Large Language Model
Generating high-quality answers consistently by providing contextual information embedded in the prompt passed to the Large Language Model (LLM) is dependent on the quality of information retrieval. As the corpus of contextual information grows, the answer/inference quality of Retrieval Augmented Generation (RAG) based Question Answering (QA) systems declines. This work solves this problem by combining classical text classification with the Large Language Model (LLM) to enable quick information retrieval from the vector store and ensure the relevancy of retrieved information. For the same, this work proposes a new approach Context Augmented retrieval (CAR), where partitioning of vector database by real-time classification of information flowing into the corpus is done. CAR demonstrates good quality answer generation along with significant reduction in information retrieval and answer generation time.
☆ On the Role of Long-tail Knowledge in Retrieval Augmented Large Language Models
Retrieval augmented generation (RAG) exhibits outstanding performance in promoting the knowledge capabilities of large language models (LLMs) with retrieved documents related to user queries. However, RAG only focuses on improving the response quality of LLMs via enhancing queries indiscriminately with retrieved information, paying little attention to what type of knowledge LLMs really need to answer original queries more accurately. In this paper, we suggest that long-tail knowledge is crucial for RAG as LLMs have already remembered common world knowledge during large-scale pre-training. Based on our observation, we propose a simple but effective long-tail knowledge detection method for LLMs. Specifically, the novel Generative Expected Calibration Error (GECE) metric is derived to measure the ``long-tailness'' of knowledge based on both statistics and semantics. Hence, we retrieve relevant documents and infuse them into the model for patching knowledge loopholes only when the input query relates to long-tail knowledge. Experiments show that, compared to existing RAG pipelines, our method achieves over 4x speedup in average inference time and consistent performance improvement in downstream tasks.
☆ A Survey on Intent-aware Recommender Systems
Many modern online services feature personalized recommendations. A central challenge when providing such recommendations is that the reason why an individual user accesses the service may change from visit to visit or even during an ongoing usage session. To be effective, a recommender system should therefore aim to take the users' probable intent of using the service at a certain point in time into account. In recent years, researchers have thus started to address this challenge by incorporating intent-awareness into recommender systems. Correspondingly, a number of technical approaches were put forward, including diversification techniques, intent prediction models or latent intent modeling approaches. In this paper, we survey and categorize existing approaches to building the next generation of Intent-Aware Recommender Systems (IARS). Based on an analysis of current evaluation practices, we outline open gaps and possible future directions in this area, which in particular include the consideration of additional interaction signals and contextual information to further improve the effectiveness of such systems.
☆ DemoRank: Selecting Effective Demonstrations for Large Language Models in Ranking Task
Recently, there has been increasing interest in applying large language models (LLMs) as zero-shot passage rankers. However, few studies have explored how to select appropriate in-context demonstrations for the passage ranking task, which is the focus of this paper. Previous studies mainly apply a demonstration retriever to retrieve demonstrations and use top-$k$ demonstrations for in-context learning (ICL). Although effective, this approach overlooks the dependencies between demonstrations, leading to inferior performance of few-shot ICL in the passage ranking task. In this paper, we formulate the demonstration selection as a \textit{retrieve-then-rerank} process and introduce the DemoRank framework. In this framework, we first use LLM feedback to train a demonstration retriever and construct a novel dependency-aware training samples to train a demonstration reranker to improve few-shot ICL. The construction of such training samples not only considers demonstration dependencies but also performs in an efficient way. Extensive experiments demonstrate DemoRank's effectiveness in in-domain scenarios and strong generalization to out-of-domain scenarios. Our codes are available at~\url{https://github.com/8421BCD/DemoRank}.
☆ Debiased Recommendation with Noisy Feedback KDD 24
Ratings of a user to most items in recommender systems are usually missing not at random (MNAR), largely because users are free to choose which items to rate. To achieve unbiased learning of the prediction model under MNAR data, three typical solutions have been proposed, including error-imputation-based (EIB), inverse-propensity-scoring (IPS), and doubly robust (DR) methods. However, these methods ignore an alternative form of bias caused by the inconsistency between the observed ratings and the users' true preferences, also known as noisy feedback or outcome measurement errors (OME), e.g., due to public opinion or low-quality data collection process. In this work, we study intersectional threats to the unbiased learning of the prediction model from data MNAR and OME in the collected data. First, we design OME-EIB, OME-IPS, and OME-DR estimators, which largely extend the existing estimators to combat OME in real-world recommendation scenarios. Next, we theoretically prove the unbiasedness and generalization bound of the proposed estimators. We further propose an alternate denoising training approach to achieve unbiased learning of the prediction model under MNAR data with OME. Extensive experiments are conducted on three real-world datasets and one semi-synthetic dataset to show the effectiveness of our proposed approaches. The code is available at https://github.com/haoxuanli-pku/KDD24-OME-DR.
comment: KDD 24 Research Track Paper
☆ DEXTER: A Benchmark for open-domain Complex Question Answering using LLMs
Open-domain complex Question Answering (QA) is a difficult task with challenges in evidence retrieval and reasoning. The complexity of such questions could stem from questions being compositional, hybrid evidence, or ambiguity in questions. While retrieval performance for classical QA tasks is well explored, their capabilities for heterogeneous complex retrieval tasks, especially in an open-domain setting, and the impact on downstream QA performance, are relatively unexplored. To address this, in this work, we propose a benchmark composing diverse complex QA tasks and provide a toolkit to evaluate state-of-the-art pre-trained dense and sparse retrieval models in an open-domain setting. We observe that late interaction models and surprisingly lexical models like BM25 perform well compared to other pre-trained dense retrieval models. In addition, since context-based reasoning is critical for solving complex QA tasks, we also evaluate the reasoning capabilities of LLMs and the impact of retrieval performance on their reasoning capabilities. Through experiments, we observe that much progress is to be made in retrieval for complex QA to improve downstream QA performance. Our software and related data can be accessed at https://github.com/VenkteshV/DEXTER
comment: under submission, 22 pages
♻ ☆ Make Large Language Model a Better Ranker
Large Language Models (LLMs) demonstrate robust capabilities across various fields, leading to a paradigm shift in LLM-enhanced Recommender System (RS). Research to date focuses on point-wise and pair-wise recommendation paradigms, which are inefficient for LLM-based recommenders due to high computational costs. However, existing list-wise approaches also fall short in ranking tasks due to misalignment between ranking objectives and next-token prediction. Moreover, these LLM-based methods struggle to effectively address the order relation among candidates, particularly given the scale of ratings. To address these challenges, this paper introduces the large language model framework with Aligned Listwise Ranking Objectives (ALRO). ALRO is designed to bridge the gap between the capabilities of LLMs and the nuanced requirements of ranking tasks. Specifically, ALRO employs explicit feedback in a listwise manner by introducing soft lambda loss, a customized adaptation of lambda loss designed for optimizing order relations. This mechanism provides more accurate optimization goals, enhancing the ranking process. Additionally, ALRO incorporates a permutation-sensitive learning mechanism that addresses position bias, a prevalent issue in generative models, without imposing additional computational burdens during inference. Our evaluative studies reveal that ALRO outperforms both existing embedding-based recommendation methods and LLM-based recommendation baselines.
comment: 12 pages, 5 figures
♻ ☆ A Survey on Neural Topic Models: Methods, Applications, and Challenges
Topic models have been prevalent for decades to discover latent topics and infer topic proportions of documents in an unsupervised fashion. They have been widely used in various applications like text analysis and context recommendation. Recently, the rise of neural networks has facilitated the emergence of a new research field -- Neural Topic Models (NTMs). Different from conventional topic models, NTMs directly optimize parameters without requiring model-specific derivations. This endows NTMs with better scalability and flexibility, resulting in significant research attention and plentiful new methods and applications. In this paper, we present a comprehensive survey on neural topic models concerning methods, applications, and challenges. Specifically, we systematically organize current NTM methods according to their network structures and introduce the NTMs for various scenarios like short texts and bilingual documents. We also discuss a wide range of popular applications built on NTMs. Finally, we highlight the challenges confronted by NTMs to inspire future research. We accompany this survey with a repository for easier access to the mentioned paper resources: https://github.com/bobxwu/Paper-Neural-Topic-Models.
comment: Accepted to Artificial Intelligence Review. See https://doi.org/10.1007/s10462-023-10661-7 and a paper list at https://github.com/BobXWu/Paper-Neural-Topic-Models
♻ ☆ Generation of Asset Administration Shell with Large Language Model Agents: Toward Semantic Interoperability in Digital Twins in the Context of Industry 4.0
This research introduces a novel approach for achieving semantic interoperability in digital twins and assisting the creation of Asset Administration Shell (AAS) as digital twin model within the context of Industry 4.0. The foundational idea of our research is that the communication based on semantics and the generation of meaningful textual data are directly linked, and we posit that these processes are equivalent if the exchanged information can be serialized in text form. Based on this, we construct a "semantic node" data structure in our research to capture the semantic essence of textual data. Then, a system powered by large language models is designed and implemented to process the "semantic node" and generate standardized digital twin models from raw textual data collected from datasheets describing technical assets. Our evaluation demonstrates an effective generation rate of 62-79%, indicating a substantial proportion of the information from the source text can be translated error-free to the target digital twin instance model with the generative capability of large language models. This result has a direct application in the context of Industry 4.0, and the designed system is implemented as a data model generation tool for reducing the manual effort in creating AAS model. In our evaluation, a comparative analysis of different LLMs and an in-depth ablation study of Retrieval-Augmented Generation (RAG) mechanisms provide insights into the effectiveness of LLM systems for interpreting technical concepts and translating data. Our findings emphasize LLMs' capability to automate AAS instance creation and contribute to the broader field of semantic interoperability for digital twins in industrial applications. The prototype implementation and evaluation results are presented on our GitHub Repository: https://github.com/YuchenXia/AASbyLLM.
comment: Published in IEEE Access
♻ ☆ ClickPrompt: CTR Models are Strong Prompt Generators for Adapting Language Models to CTR Prediction WWW 2024
Click-through rate (CTR) prediction has become increasingly indispensable for various Internet applications. Traditional CTR models convert the multi-field categorical data into ID features via one-hot encoding, and extract the collaborative signals among features. Such a paradigm suffers from the problem of semantic information loss. Another line of research explores the potential of pretrained language models (PLMs) for CTR prediction by converting input data into textual sentences through hard prompt templates. Although semantic signals are preserved, they generally fail to capture the collaborative information (e.g., feature interactions, pure ID features), not to mention the unacceptable inference overhead brought by the huge model size. In this paper, we aim to model both the semantic knowledge and collaborative knowledge for accurate CTR estimation, and meanwhile address the inference inefficiency issue. To benefit from both worlds and close their gaps, we propose a novel model-agnostic framework (i.e., ClickPrompt), where we incorporate CTR models to generate interaction-aware soft prompts for PLMs. We design a prompt-augmented masked language modeling (PA-MLM) pretraining task, where PLM has to recover the masked tokens based on the language context, as well as the soft prompts generated by CTR model. The collaborative and semantic knowledge from ID and textual features would be explicitly aligned and interacted via the prompt interface. Then, we can either tune the CTR model with PLM for superior performance, or solely tune the CTR model without PLM for inference efficiency. Experiments on four real-world datasets validate the effectiveness of ClickPrompt compared with existing baselines.
comment: Accepted by WWW 2024
♻ ☆ Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions ACL 2024
This paper introduces Fundus, a user-friendly news scraper that enables users to obtain millions of high-quality news articles with just a few lines of code. Unlike existing news scrapers, we use manually crafted, bespoke content extractors that are specifically tailored to the formatting guidelines of each supported online newspaper. This allows us to optimize our scraping for quality such that retrieved news articles are textually complete and without HTML artifacts. Further, our framework combines both crawling (retrieving HTML from the web or large web archives) and content extraction into a single pipeline. By providing a unified interface for a predefined collection of newspapers, we aim to make Fundus broadly usable even for non-technical users. This paper gives an overview of the framework, discusses our design choices, and presents a comparative evaluation against other popular news scrapers. Our evaluation shows that Fundus yields significantly higher quality extractions (complete and artifact-free news articles) than prior work. The framework is available on GitHub under https://github.com/flairNLP/fundus and can be simply installed using pip.
comment: 10 pages, 4 figures, ACL 2024, for a screencast see https://www.youtube.com/watch?v=9GJExMelhdI
♻ ☆ Continuous-time Autoencoders for Regular and Irregular Time Series Imputation WSDM'24
Time series imputation is one of the most fundamental tasks for time series. Real-world time series datasets are frequently incomplete (or irregular with missing observations), in which case imputation is strongly required. Many different time series imputation methods have been proposed. Recent self-attention-based methods show the state-of-the-art imputation performance. However, it has been overlooked for a long time to design an imputation method based on continuous-time recurrent neural networks (RNNs), i.e., neural controlled differential equations (NCDEs). To this end, we redesign time series (variational) autoencoders based on NCDEs. Our method, called continuous-time autoencoder (CTA), encodes an input time series sample into a continuous hidden path (rather than a hidden vector) and decodes it to reconstruct and impute the input. In our experiments with 4 datasets and 19 baselines, our method shows the best imputation performance in almost all cases.
comment: Published as a WSDM'24 full paper (oral presentation)
♻ ☆ ReLLa: Retrieval-enhanced Large Language Models for Lifelong Sequential Behavior Comprehension in Recommendation WWW 2024
With large language models (LLMs) achieving remarkable breakthroughs in natural language processing (NLP) domains, LLM-enhanced recommender systems have received much attention and have been actively explored currently. In this paper, we focus on adapting and empowering a pure large language model for zero-shot and few-shot recommendation tasks. First and foremost, we identify and formulate the lifelong sequential behavior incomprehension problem for LLMs in recommendation domains, i.e., LLMs fail to extract useful information from a textual context of long user behavior sequence, even if the length of context is far from reaching the context limitation of LLMs. To address such an issue and improve the recommendation performance of LLMs, we propose a novel framework, namely Retrieval-enhanced Large Language models (ReLLa) for recommendation tasks in both zero-shot and few-shot settings. For zero-shot recommendation, we perform semantic user behavior retrieval (SUBR) to improve the data quality of testing samples, which greatly reduces the difficulty for LLMs to extract the essential knowledge from user behavior sequences. As for few-shot recommendation, we further design retrieval-enhanced instruction tuning (ReiT) by adopting SUBR as a data augmentation technique for training samples. Specifically, we develop a mixed training dataset consisting of both the original data samples and their retrieval-enhanced counterparts. We conduct extensive experiments on three real-world public datasets to demonstrate the superiority of ReLLa compared with existing baseline models, as well as its capability for lifelong sequential behavior comprehension. To be highlighted, with only less than 10% training samples, few-shot ReLLa can outperform traditional CTR models that are trained on the entire training set (e.g., DCNv2, DIN, SIM). The code is available \url{https://github.com/LaVieEnRose365/ReLLa}.
comment: Accepted by WWW 2024. Full and More Readable Version
♻ ☆ EasyEdit: An Easy-to-use Knowledge Editing Framework for Large Language Models ACL 2024
Large Language Models (LLMs) usually suffer from knowledge cutoff or fallacy issues, which means they are unaware of unseen events or generate text with incorrect facts owing to outdated/noisy data. To this end, many knowledge editing approaches for LLMs have emerged -- aiming to subtly inject/edit updated knowledge or adjust undesired behavior while minimizing the impact on unrelated inputs. Nevertheless, due to significant differences among various knowledge editing methods and the variations in task setups, there is no standard implementation framework available for the community, which hinders practitioners from applying knowledge editing to applications. To address these issues, we propose EasyEdit, an easy-to-use knowledge editing framework for LLMs. It supports various cutting-edge knowledge editing approaches and can be readily applied to many well-known LLMs such as T5, GPT-J, LlaMA, etc. Empirically, we report the knowledge editing results on LlaMA-2 with EasyEdit, demonstrating that knowledge editing surpasses traditional fine-tuning in terms of reliability and generalization. We have released the source code on GitHub, along with Google Colab tutorials and comprehensive documentation for beginners to get started. Besides, we present an online system for real-time knowledge editing, and a demo video.
comment: ACL 2024 System Demonstrations; Code: https://github.com/zjunlp/EasyEdit HF Demo: https://huggingface.co/spaces/zjunlp/EasyEdit Video: https://youtu.be/Gm6T0QaaskU Docs: https://zjunlp.gitbook.io/easyedit
♻ ☆ EasyInstruct: An Easy-to-use Instruction Processing Framework for Large Language Models ACL 2024
In recent years, instruction tuning has gained increasing attention and emerged as a crucial technique to enhance the capabilities of Large Language Models (LLMs). To construct high-quality instruction datasets, many instruction processing approaches have been proposed, aiming to achieve a delicate balance between data quantity and data quality. Nevertheless, due to inconsistencies that persist among various instruction processing methods, there is no standard open-source instruction processing implementation framework available for the community, which hinders practitioners from further developing and advancing. To facilitate instruction processing research and development, we present EasyInstruct, an easy-to-use instruction processing framework for LLMs, which modularizes instruction generation, selection, and prompting, while also considering their combination and interaction. EasyInstruct is publicly released and actively maintained at https://github.com/zjunlp/EasyInstruct, along with an online demo app and a demo video for quick-start, calling for broader research centered on instruction data and synthetic data.
comment: ACL 2024 System Demonstrations; Project website: https://zjunlp.github.io/project/EasyInstruct Code: https://github.com/zjunlp/EasyInstruct Video: https://youtu.be/rfQOWYfziFo Demo: https://huggingface.co/spaces/zjunlp/EasyInstruct
♻ ☆ VeraCT Scan: Retrieval-Augmented Fake News Detection with Justifiable Reasoning
The proliferation of fake news poses a significant threat not only by disseminating misleading information but also by undermining the very foundations of democracy. The recent advance of generative artificial intelligence has further exacerbated the challenge of distinguishing genuine news from fabricated stories. In response to this challenge, we introduce VeraCT Scan, a novel retrieval-augmented system for fake news detection. This system operates by extracting the core facts from a given piece of news and subsequently conducting an internet-wide search to identify corroborating or conflicting reports. Then sources' credibility is leveraged for information verification. Besides determining the veracity of news, we also provide transparent evidence and reasoning to support its conclusions, resulting in the interpretability and trust in the results. In addition to GPT-4 Turbo, Llama-2 13B is also fine-tuned for news content understanding, information verification, and reasoning. Both implementations have demonstrated state-of-the-art accuracy in the realm of fake news detection.
♻ ☆ MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions ICML 2024
Image retrieval, i.e., finding desired images given a reference image, inherently encompasses rich, multi-faceted search intents that are difficult to capture solely using image-based measures. Recent works leverage text instructions to allow users to more freely express their search intents. However, they primarily focus on image pairs that are visually similar and/or can be characterized by a small set of pre-defined relations. The core thesis of this paper is that text instructions can enable retrieving images with richer relations beyond visual similarity. To show this, we introduce MagicLens, a series of self-supervised image retrieval models that support open-ended instructions. MagicLens is built on a key novel insight: image pairs that naturally occur on the same web pages contain a wide range of implicit relations (e.g., inside view of), and we can bring those implicit relations explicit by synthesizing instructions via foundation models. Trained on 36.7M (query image, instruction, target image) triplets with rich semantic relations mined from the web, MagicLens achieves results comparable with or better than prior best on eight benchmarks of various image retrieval tasks, while maintaining high parameter efficiency with a significantly smaller model size. Additional human analyses on a 1.4M-image unseen corpus further demonstrate the diversity of search intents supported by MagicLens. Code and models are publicly available at https://open-vision-language.github.io/MagicLens/.
comment: ICML 2024 (Oral); Project Website: https://open-vision-language.github.io/MagicLens/
Multimedia 4
☆ UBiSS: A Unified Framework for Bimodal Semantic Summarization of Videos ICMR'24
With the surge in the amount of video data, video summarization techniques, including visual-modal(VM) and textual-modal(TM) summarization, are attracting more and more attention. However, unimodal summarization inevitably loses the rich semantics of the video. In this paper, we focus on a more comprehensive video summarization task named Bimodal Semantic Summarization of Videos (BiSSV). Specifically, we first construct a large-scale dataset, BIDS, in (video, VM-Summary, TM-Summary) triplet format. Unlike traditional processing methods, our construction procedure contains a VM-Summary extraction algorithm aiming to preserve the most salient content within long videos. Based on BIDS, we propose a Unified framework UBiSS for the BiSSV task, which models the saliency information in the video and generates a TM-summary and VM-summary simultaneously. We further optimize our model with a list-wise ranking-based objective to improve its capacity to capture highlights. Lastly, we propose a metric, $NDCG_{MS}$, to provide a joint evaluation of the bimodal summary. Experiments show that our unified framework achieves better performance than multi-stage summarization pipelines. Code and data are available at https://github.com/MeiYutingg/UBiSS.
comment: Accepted by ACM International Conference on Multimedia Retrieval (ICMR'24)
☆ Exploring compressibility of transformer based text-to-music (TTM) models INTERSPEECH 2024
State-of-the art Text-To-Music (TTM) generative AI models are large and require desktop or server class compute, making them infeasible for deployment on mobile phones. This paper presents an analysis of trade-offs between model compression and generation performance of TTM models. We study compression through knowledge distillation and specific modifications that enable applicability over the various components of the TTM model (encoder, generative model and the decoder). Leveraging these methods we create TinyTTM (89.2M params) that achieves a FAD of 3.66 and KL of 1.32 on MusicBench dataset, better than MusicGen-Small (557.6M params) but not lower than MusicGen-small fine-tuned on MusicBench.
comment: Proceedings of INTERSPEECH 2024
♻ ☆ ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation
Diffusion models are instrumental in text-to-audio (TTA) generation. Unfortunately, they suffer from slow inference due to an excessive number of queries to the underlying denoising network per generation. To address this bottleneck, we introduce ConsistencyTTA, a framework requiring only a single non-autoregressive network query, thereby accelerating TTA by hundreds of times. We achieve so by proposing "CFG-aware latent consistency model," which adapts consistency generation into a latent space and incorporates classifier-free guidance (CFG) into model training. Moreover, unlike diffusion models, ConsistencyTTA can be finetuned closed-loop with audio-space text-aware metrics, such as CLAP score, to further enhance the generations. Our objective and subjective evaluation on the AudioCaps dataset shows that compared to diffusion-based counterparts, ConsistencyTTA reduces inference computation by 400x while retaining generation quality and diversity.
♻ ☆ MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions ICML 2024
Image retrieval, i.e., finding desired images given a reference image, inherently encompasses rich, multi-faceted search intents that are difficult to capture solely using image-based measures. Recent works leverage text instructions to allow users to more freely express their search intents. However, they primarily focus on image pairs that are visually similar and/or can be characterized by a small set of pre-defined relations. The core thesis of this paper is that text instructions can enable retrieving images with richer relations beyond visual similarity. To show this, we introduce MagicLens, a series of self-supervised image retrieval models that support open-ended instructions. MagicLens is built on a key novel insight: image pairs that naturally occur on the same web pages contain a wide range of implicit relations (e.g., inside view of), and we can bring those implicit relations explicit by synthesizing instructions via foundation models. Trained on 36.7M (query image, instruction, target image) triplets with rich semantic relations mined from the web, MagicLens achieves results comparable with or better than prior best on eight benchmarks of various image retrieval tasks, while maintaining high parameter efficiency with a significantly smaller model size. Additional human analyses on a 1.4M-image unseen corpus further demonstrate the diversity of search intents supported by MagicLens. Code and models are publicly available at https://open-vision-language.github.io/MagicLens/.
comment: ICML 2024 (Oral); Project Website: https://open-vision-language.github.io/MagicLens/
Computer Vision and Pattern Recognition 31
☆ Gradual Divergence for Seamless Adaptation: A Novel Domain Incremental Learning Method ICML 2024
Domain incremental learning (DIL) poses a significant challenge in real-world scenarios, as models need to be sequentially trained on diverse domains over time, all the while avoiding catastrophic forgetting. Mitigating representation drift, which refers to the phenomenon of learned representations undergoing changes as the model adapts to new tasks, can help alleviate catastrophic forgetting. In this study, we propose a novel DIL method named DARE, featuring a three-stage training process: Divergence, Adaptation, and REfinement. This process gradually adapts the representations associated with new tasks into the feature space spanned by samples from previous tasks, simultaneously integrating task-specific decision boundaries. Additionally, we introduce a novel strategy for buffer sampling and demonstrate the effectiveness of our proposed method, combined with this sampling strategy, in reducing representation drift within the feature encoder. This contribution effectively alleviates catastrophic forgetting across multiple DIL benchmarks. Furthermore, our approach prevents sudden representation drift at task boundaries, resulting in a well-calibrated DIL model that maintains the performance on previous tasks.
comment: Accepted at 41st International Conference on Machine Learning (ICML 2024)
☆ Learning Run-time Safety Monitors for Machine Learning Components
For machine learning components used as part of autonomous systems (AS) in carrying out critical tasks it is crucial that assurance of the models can be maintained in the face of post-deployment changes (such as changes in the operating environment of the system). A critical part of this is to be able to monitor when the performance of the model at runtime (as a result of changes) poses a safety risk to the system. This is a particularly difficult challenge when ground truth is unavailable at runtime. In this paper we introduce a process for creating safety monitors for ML components through the use of degraded datasets and machine learning. The safety monitor that is created is deployed to the AS in parallel to the ML component to provide a prediction of the safety risk associated with the model output. We demonstrate the viability of our approach through some initial experiments using publicly available speed sign datasets.
☆ Breaking the Frame: Image Retrieval by Visual Overlap Prediction
We propose a novel visual place recognition approach, VOP, that efficiently addresses occlusions and complex scenes by shifting from traditional reliance on global image similarities and local features to image overlap prediction. The proposed method enables the identification of visible image sections without requiring expensive feature detection and matching. By focusing on obtaining patch-level embeddings by a Vision Transformer backbone and establishing patch-to-patch correspondences, our approach uses a voting mechanism to assess overlap scores for potential database images, thereby providing a nuanced image retrieval metric in challenging scenarios. VOP leads to more accurate relative pose estimation and localization results on the retrieved image pairs than state-of-the-art baselines on a number of large-scale, real-world datasets. The code is available at https://github.com/weitong8591/vop.
☆ HEST-1k: A Dataset for Spatial Transcriptomics and Histology Image Analysis
Spatial transcriptomics (ST) enables interrogating the molecular composition of tissue with ever-increasing resolution, depth, and sensitivity. However, costs, rapidly evolving technology, and lack of standards have constrained computational methods in ST to narrow tasks and small cohorts. In addition, the underlying tissue morphology as reflected by H&E-stained whole slide images (WSIs) encodes rich information often overlooked in ST studies. Here, we introduce HEST-1k, a collection of 1,108 spatial transcriptomic profiles, each linked to a WSI and metadata. HEST-1k was assembled using HEST-Library from 131 public and internal cohorts encompassing 25 organs, two species (Homo Sapiens and Mus Musculus), and 320 cancer samples from 25 cancer types. HEST-1k processing enabled the identification of 1.5 million expression--morphology pairs and 60 million nuclei. HEST-1k is tested on three use cases: (1) benchmarking foundation models for histopathology (HEST-Benchmark), (2) biomarker identification, and (3) multimodal representation learning. HEST-1k, HEST-Library, and HEST-Benchmark can be freely accessed via https://github.com/mahmoodlab/hest.
comment: Under review
☆ Fuzzy Attention-based Border Rendering Network for Lung Organ Segmentation MICCAI 2024
Automatic lung organ segmentation on CT images is crucial for lung disease diagnosis. However, the unlimited voxel values and class imbalance of lung organs can lead to false-negative/positive and leakage issues in advanced methods. Additionally, some slender lung organs are easily lost during the recycled down/up-sample procedure, e.g., bronchioles & arterioles, causing severe discontinuity issue. Inspired by these, this paper introduces an effective lung organ segmentation method called Fuzzy Attention-based Border Rendering (FABR) network. Since fuzzy logic can handle the uncertainty in feature extraction, hence the fusion of deep networks and fuzzy sets should be a viable solution for better performance. Meanwhile, unlike prior top-tier methods that operate on all regular dense points, our FABR depicts lung organ regions as cube-trees, focusing only on recycle-sampled border vulnerable points, rendering the severely discontinuous, false-negative/positive organ regions with a novel Global-Local Cube-tree Fusion (GLCF) module. All experimental results, on four challenging datasets of airway & artery, demonstrate that our method can achieve the favorable performance significantly.
comment: MICCAI 2024
☆ Evaluation and Comparison of Emotionally Evocative Image Augmentation Methods
Experiments in affective computing are based on stimulus datasets that, in the process of standardization, receive metadata describing which emotions each stimulus evokes. In this paper, we explore an approach to creating stimulus datasets for affective computing using generative adversarial networks (GANs). Traditional dataset preparation methods are costly and time consuming, prompting our investigation of alternatives. We conducted experiments with various GAN architectures, including Deep Convolutional GAN, Conditional GAN, Auxiliary Classifier GAN, Progressive Augmentation GAN, and Wasserstein GAN, alongside data augmentation and transfer learning techniques. Our findings highlight promising advances in the generation of emotionally evocative synthetic images, suggesting significant potential for future research and improvements in this domain.
☆ Intensity Confusion Matters: An Intensity-Distance Guided Loss for Bronchus Segmentation ICME
Automatic segmentation of the bronchial tree from CT imaging is important, as it provides structural information for disease diagnosis. Despite the merits of previous automatic bronchus segmentation methods, they have paied less attention to the issue we term as \textit{Intensity Confusion}, wherein the intensity values of certain background voxels approach those of the foreground voxels within bronchi. Conversely, the intensity values of some foreground voxels are nearly identical to those of background voxels. This proximity in intensity values introduces significant challenges to neural network methodologies. To address the issue, we introduce a novel Intensity-Distance Guided loss function, which assigns adaptive weights to different image voxels for mining hard samples that cause the intensity confusion. The proposed loss estimates the voxel-level hardness of samples, on the basis of the following intensity and distance priors. We regard a voxel as a hard sample if it is in: (1) the background and has an intensity value close to the bronchus region; (2) the bronchus region and is of higher intensity than most voxels inside the bronchus; (3) the background region and at a short distance from the bronchus. Extensive experiments not only show the superiority of our method compared with the state-of-the-art methods, but also verify that tackling the intensity confusion issue helps to significantly improve bronchus segmentation. Project page: https://github.com/lhaof/ICM.
comment: IEEE International Conference on Multimedia & Expo (ICME) 2024
Review of Zero-Shot and Few-Shot AI Algorithms in The Medical Domain
In this paper, different techniques of few-shot, zero-shot, and regular object detection have been investigated. The need for few-shot learning and zero-shot learning techniques is crucial and arises from the limitations and challenges in traditional machine learning, deep learning, and computer vision methods where they require large amounts of data, plus the poor generalization of those traditional methods. Those techniques can give us prominent results by using only a few training sets reducing the required amounts of data and improving the generalization. This survey will highlight the recent papers of the last three years that introduce the usage of few-shot learning and zero-shot learning techniques in addressing the challenges mentioned earlier. In this paper we reviewed the Zero-shot, few-shot and regular object detection methods and categorized them in an understandable manner. Based on the comparison made within each category. It been found that the approaches are quite impressive. This integrated review of diverse papers on few-shot, zero-shot, and regular object detection reveals a shared focus on advancing the field through novel frameworks and techniques. A noteworthy observation is the scarcity of detailed discussions regarding the difficulties encountered during the development phase. Contributions include the introduction of innovative models, such as ZSD-YOLO and GTNet, often showcasing improvements with various metrics such as mean average precision (mAP),Recall@100 (RE@100), the area under the receiver operating characteristic curve (AUROC) and precision. These findings underscore a collective move towards leveraging vision-language models for versatile applications, with potential areas for future research including a more thorough exploration of limitations and domain-specific adaptations.
☆ Multimodal Multilabel Classification by CLIP
Multimodal multilabel classification (MMC) is a challenging task that aims to design a learning algorithm to handle two data sources, the image and text, and learn a comprehensive semantic feature presentation across the modalities. In this task, we review the extensive number of state-of-the-art approaches in MMC and leverage a novel technique that utilises the Contrastive Language-Image Pre-training (CLIP) as the feature extractor and fine-tune the model by exploring different classification heads, fusion methods and loss functions. Finally, our best result achieved more than 90% F_1 score in the public Kaggle competition leaderboard. This paper provides detailed descriptions of novel training methods and quantitative analysis through the experimental results.
☆ MLPHand: Real Time Multi-View 3D Hand Mesh Reconstruction via MLP Modeling
Multi-view hand mesh reconstruction is a critical task for applications in virtual reality and human-computer interaction, but it remains a formidable challenge. Although existing multi-view hand reconstruction methods achieve remarkable accuracy, they typically come with an intensive computational burden that hinders real-time inference. To this end, we propose MLPHand, a novel method designed for real-time multi-view single hand reconstruction. MLP Hand consists of two primary modules: (1) a lightweight MLP-based Skeleton2Mesh model that efficiently recovers hand meshes from hand skeletons, and (2) a multi-view geometry feature fusion prediction module that enhances the Skeleton2Mesh model with detailed geometric information from multiple views. Experiments on three widely used datasets demonstrate that MLPHand can reduce computational complexity by 90% while achieving comparable reconstruction accuracy to existing state-of-the-art baselines.
☆ UDHF2-Net: An Uncertainty-diffusion-model-based High-Frequency TransFormer Network for High-accuracy Interpretation of Remotely Sensed Imagery
Remotely sensed image high-accuracy interpretation (RSIHI), including tasks such as semantic segmentation and change detection, faces the three major problems: (1) complementarity problem of spatially stationary-and-non-stationary frequency; (2) edge uncertainty problem caused by down-sampling in the encoder step and intrinsic edge noises; and (3) false detection problem caused by imagery registration error in change detection. To solve the aforementioned problems, an uncertainty-diffusion-model-based high-Frequency TransFormer network (UDHF2-Net) is the proposed for RSIHI, the superiority of which is as following: (1) a spatially-stationary-and-non-stationary high-frequency connection paradigm (SHCP) is proposed to enhance the interaction of spatially stationary and non-stationary frequency features to yield high-fidelity edge extraction result. Inspired by HRFormer, SHCP remains the high-frequency stream through the whole encoder-decoder process with parallel high-to-low frequency streams and reduces the edge loss by a downsampling operation; (2) a mask-and-geo-knowledge-based uncertainty diffusion module (MUDM) is proposed to improve the robustness and edge noise resistance. MUDM could further optimize the uncertain region to improve edge extraction result by gradually removing the multiple geo-knowledge-based noises; (3) a semi-pseudo-Siamese UDHF2-Net for change detection task is proposed to reduce the pseudo change by registration error. It adopts semi-pseudo-Siamese architecture to extract above complemental frequency features for adaptively reducing registration differencing, and MUDM to recover the uncertain region by gradually reducing the registration error besides above edge noises. Comprehensive experiments were performed to demonstrate the superiority of UDHF2-Net. Especially ablation experiments indicate the effectiveness of UDHF2-Net.
☆ Multi-Scale Temporal Difference Transformer for Video-Text Retrieval
Currently, in the field of video-text retrieval, there are many transformer-based methods. Most of them usually stack frame features and regrade frames as tokens, then use transformers for video temporal modeling. However, they commonly neglect the inferior ability of the transformer modeling local temporal information. To tackle this problem, we propose a transformer variant named Multi-Scale Temporal Difference Transformer (MSTDT). MSTDT mainly addresses the defects of the traditional transformer which has limited ability to capture local temporal information. Besides, in order to better model the detailed dynamic information, we make use of the difference feature between frames, which practically reflects the dynamic movement of a video. We extract the inter-frame difference feature and integrate the difference and frame feature by the multi-scale temporal transformer. In general, our proposed MSTDT consists of a short-term multi-scale temporal difference transformer and a long-term temporal transformer. The former focuses on modeling local temporal information, the latter aims at modeling global temporal information. At last, we propose a new loss to narrow the distance of similar samples. Extensive experiments show that backbone, such as CLIP, with MSTDT has attained a new state-of-the-art result.
☆ X-ray2CTPA: Generating 3D CTPA scans from 2D X-ray conditioning
Chest X-rays or chest radiography (CXR), commonly used for medical diagnostics, typically enables limited imaging compared to computed tomography (CT) scans, which offer more detailed and accurate three-dimensional data, particularly contrast-enhanced scans like CT Pulmonary Angiography (CTPA). However, CT scans entail higher costs, greater radiation exposure, and are less accessible than CXRs. In this work we explore cross-modal translation from a 2D low contrast-resolution X-ray input to a 3D high contrast and spatial-resolution CTPA scan. Driven by recent advances in generative AI, we introduce a novel diffusion-based approach to this task. We evaluate the models performance using both quantitative metrics and qualitative feedback from radiologists, ensuring diagnostic relevance of the generated images. Furthermore, we employ the synthesized 3D images in a classification framework and show improved AUC in a PE categorization task, using the initial CXR input. The proposed method is generalizable and capable of performing additional cross-modality translations in medical imaging. It may pave the way for more accessible and cost-effective advanced diagnostic tools. The code for this project is available: https://github.com/NoaCahan/X-ray2CTPA .
comment: preprint, project code: https://github.com/NoaCahan/X-ray2CTPA
☆ Towards Natural Language-Driven Assembly Using Foundation Models
Large Language Models (LLMs) and strong vision models have enabled rapid research and development in the field of Vision-Language-Action models that enable robotic control. The main objective of these methods is to develop a generalist policy that can control robots with various embodiments. However, in industrial robotic applications such as automated assembly and disassembly, some tasks, such as insertion, demand greater accuracy and involve intricate factors like contact engagement, friction handling, and refined motor skills. Implementing these skills using a generalist policy is challenging because these policies might integrate further sensory data, including force or torque measurements, for enhanced precision. In our method, we present a global control policy based on LLMs that can transfer the control policy to a finite set of skills that are specifically trained to perform high-precision tasks through dynamic context switching. The integration of LLMs into this framework underscores their significance in not only interpreting and processing language inputs but also in enriching the control mechanisms for diverse and intricate robotic operations.
☆ Imperative Learning: A Self-supervised Neural-Symbolic Learning Framework for Robot Autonomy
Data-driven methods such as reinforcement and imitation learning have achieved remarkable success in robot autonomy. However, their data-centric nature still hinders them from generalizing well to ever-changing environments. Moreover, collecting large datasets for robotic tasks is often impractical and expensive. To overcome these challenges, we introduce a new self-supervised neural-symbolic (NeSy) computational framework, imperative learning (IL), for robot autonomy, leveraging the generalization abilities of symbolic reasoning. The framework of IL consists of three primary components: a neural module, a reasoning engine, and a memory system. We formulate IL as a special bilevel optimization (BLO), which enables reciprocal learning over the three modules. This overcomes the label-intensive obstacles associated with data-driven approaches and takes advantage of symbolic reasoning concerning logical reasoning, physical principles, geometric analysis, etc. We discuss several optimization techniques for IL and verify their effectiveness in five distinct robot autonomy tasks including path planning, rule induction, optimal control, visual odometry, and multi-robot routing. Through various experiments, we show that IL can significantly enhance robot autonomy capabilities and we anticipate that it will catalyze further research across diverse domains.
☆ A Simple Framework for Open-Vocabulary Zero-Shot Segmentation
Zero-shot classification capabilities naturally arise in models trained within a vision-language contrastive framework. Despite their classification prowess, these models struggle in dense tasks like zero-shot open-vocabulary segmentation. This deficiency is often attributed to the absence of localization cues in captions and the intertwined nature of the learning process, which encompasses both image representation learning and cross-modality alignment. To tackle these issues, we propose SimZSS, a Simple framework for open-vocabulary Zero-Shot Segmentation. The method is founded on two key principles: i) leveraging frozen vision-only models that exhibit spatial awareness while exclusively aligning the text encoder and ii) exploiting the discrete nature of text and linguistic knowledge to pinpoint local concepts within captions. By capitalizing on the quality of the visual representations, our method requires only image-caption pairs datasets and adapts to both small curated and large-scale noisy datasets. When trained on COCO Captions across 8 GPUs, SimZSS achieves state-of-the-art results on 7 out of 8 benchmark datasets in less than 15 minutes.
☆ Mamba-based Light Field Super-Resolution with Efficient Subspace Scanning
Transformer-based methods have demonstrated impressive performance in 4D light field (LF) super-resolution by effectively modeling long-range spatial-angular correlations, but their quadratic complexity hinders the efficient processing of high resolution 4D inputs, resulting in slow inference speed and high memory cost. As a compromise, most prior work adopts a patch-based strategy, which fails to leverage the full information from the entire input LFs. The recently proposed selective state-space model, Mamba, has gained popularity for its efficient long-range sequence modeling. In this paper, we propose a Mamba-based Light Field Super-Resolution method, named MLFSR, by designing an efficient subspace scanning strategy. Specifically, we tokenize 4D LFs into subspace sequences and conduct bi-directional scanning on each subspace. Based on our scanning strategy, we then design the Mamba-based Global Interaction (MGI) module to capture global information and the local Spatial- Angular Modulator (SAM) to complement local details. Additionally, we introduce a Transformer-to-Mamba (T2M) loss to further enhance overall performance. Extensive experiments on public benchmarks demonstrate that MLFSR surpasses CNN-based models and rivals Transformer-based methods in performance while maintaining higher efficiency. With quicker inference speed and reduced memory demand, MLFSR facilitates full-image processing of high-resolution 4D LFs with enhanced performance.
comment: 17 pages,7 figures
☆ Detecting Abnormal Operations in Concentrated Solar Power Plants from Irregular Sequences of Thermal Images KDD 2024
Concentrated Solar Power (CSP) plants store energy by heating a storage medium with an array of mirrors that focus sunlight onto solar receivers atop a central tower. Operating at high temperatures these receivers face risks such as freezing, deformation, and corrosion, leading to operational failures, downtime, or costly equipment damage. We study the problem of anomaly detection (AD) in sequences of thermal images collected over a year from an operational CSP plant. These images are captured at irregular intervals ranging from one to five minutes throughout the day by infrared cameras mounted on solar receivers. Our goal is to develop a method to extract useful representations from high-dimensional thermal images for AD. It should be able to handle temporal features of the data, which include irregularity, temporal dependency between images and non-stationarity due to a strong daily seasonal pattern. The co-occurrence of low-temperature anomalies that resemble normal images from the start and the end of the operational cycle with high-temperature anomalies poses an additional challenge. We first evaluate state-of-the-art deep image-based AD methods, which have been shown to be effective in deriving meaningful image representations for the detection of anomalies. Then, we introduce a forecasting-based AD method that predicts future thermal images from past sequences and timestamps via a deep sequence model. This method effectively captures specific temporal data features and distinguishes between difficult-to-detect temperature-based anomalies. Our experiments demonstrate the effectiveness of our approach compared to multiple SOTA baselines across multiple evaluation metrics. We have also successfully deployed our solution on five months of unseen data, providing critical insights for the maintenance of the CSP plant. Our code is available at: https://tinyurl.com/ForecastAD
comment: Accepted in KDD 2024
☆ CAVM: Conditional Autoregressive Vision Model for Contrast-Enhanced Brain Tumor MRI Synthesis MICCAI 2024
Contrast-enhanced magnetic resonance imaging (MRI) is pivotal in the pipeline of brain tumor segmentation and analysis. Gadolinium-based contrast agents, as the most commonly used contrast agents, are expensive and may have potential side effects, and it is desired to obtain contrast-enhanced brain tumor MRI scans without the actual use of contrast agents. Deep learning methods have been applied to synthesize virtual contrast-enhanced MRI scans from non-contrast images. However, as this synthesis problem is inherently ill-posed, these methods fall short in producing high-quality results. In this work, we propose Conditional Autoregressive Vision Model (CAVM) for improving the synthesis of contrast-enhanced brain tumor MRI. As the enhancement of image intensity grows with a higher dose of contrast agents, we assume that it is less challenging to synthesize a virtual image with a lower dose, where the difference between the contrast-enhanced and non-contrast images is smaller. Thus, CAVM gradually increases the contrast agent dosage and produces higher-dose images based on previous lower-dose ones until the final desired dose is achieved. Inspired by the resemblance between the gradual dose increase and the Chain-of-Thought approach in natural language processing, CAVM uses an autoregressive strategy with a decomposition tokenizer and a decoder. Specifically, the tokenizer is applied to obtain a more compact image representation for computational efficiency, and it decomposes the image into dose-variant and dose-invariant tokens. Then, a masked self-attention mechanism is developed for autoregression that gradually increases the dose of the virtual image based on the dose-variant tokens. Finally, the updated dose-variant tokens corresponding to the desired dose are decoded together with dose-invariant tokens to produce the final contrast-enhanced MRI.
comment: The work has been accepted by MICCAI 2024
♻ ☆ TokenCompose: Text-to-Image Diffusion with Token-level Supervision CVPR 2024
We present TokenCompose, a Latent Diffusion Model for text-to-image generation that achieves enhanced consistency between user-specified text prompts and model-generated images. Despite its tremendous success, the standard denoising process in the Latent Diffusion Model takes text prompts as conditions only, absent explicit constraint for the consistency between the text prompts and the image contents, leading to unsatisfactory results for composing multiple object categories. TokenCompose aims to improve multi-category instance composition by introducing the token-wise consistency terms between the image content and object segmentation maps in the finetuning stage. TokenCompose can be applied directly to the existing training pipeline of text-conditioned diffusion models without extra human labeling information. By finetuning Stable Diffusion, the model exhibits significant improvements in multi-category instance composition and enhanced photorealism for its generated images. Project link: https://mlpc-ucsd.github.io/TokenCompose
comment: CVPR 2024, 21 pages, 17 figures
♻ ☆ Step-by-Step Diffusion: An Elementary Tutorial
We present an accessible first course on diffusion models and flow matching for machine learning, aimed at a technical audience with no diffusion experience. We try to simplify the mathematical details as much as possible (sometimes heuristically), while retaining enough precision to derive correct algorithms.
comment: 35 pages, 11 figures
♻ ☆ Back to the Color: Learning Depth to Specific Color Transformation for Unsupervised Depth Estimation
Virtual engines have the capability to generate dense depth maps for various synthetic scenes, making them invaluable for training depth estimation models. However, synthetic colors often exhibit significant discrepancies compared to real-world colors, thereby posing challenges for depth estimation in real-world scenes, particularly in complex and uncertain environments encountered in unsupervised monocular depth estimation tasks. To address this issue, we propose Back2Color, a framework that predicts realistic colors from depth utilizing a model trained on real-world data, thus facilitating the transformation of synthetic colors into real-world counterparts. Additionally, by employing the Syn-Real CutMix method for joint training with both real-world unsupervised and synthetic supervised depth samples, we achieve improved performance in monocular depth estimation for real-world scenes. Moreover, to comprehensively address the impact of non-rigid motions on depth estimation, we propose an auto-learning uncertainty temporal-spatial fusion method (Auto-UTSF), which integrates the benefits of unsupervised learning in both temporal and spatial dimensions. Furthermore, we design a depth estimation network (VADepth) based on the Vision Attention Network. Our Back2Color framework demonstrates state-of-the-art performance, as evidenced by improvements in performance metrics and the production of fine-grained details in our predictions, particularly on challenging datasets such as Cityscapes for unsupervised depth estimation.
♻ ☆ Unleashing the Power of Meta-tuning for Few-shot Generalization Through Sparse Interpolated Experts
Recent successes suggest that parameter-efficient fine-tuning of foundation models as the state-of-the-art method for transfer learning in vision, replacing the rich literature of alternatives such as meta-learning. In trying to harness the best of both worlds, meta-tuning introduces a subsequent optimization stage of foundation models but has so far only shown limited success and crucially tends to underperform on out-of-distribution (OOD) tasks. In this paper, we introduce Sparse MetA-Tuning (SMAT), a method inspired by sparse mixture-of-experts approaches and trained to isolate subsets of pre-trained parameters automatically for meta-tuning on each task. SMAT successfully overcomes OOD sensitivity and delivers on the promise of enhancing the transfer abilities of vision foundation models beyond parameter-efficient fine-tuning. We establish new state-of-the-art results on a challenging combination of Meta-Dataset augmented with additional OOD tasks in both zero-shot and gradient-based adaptation settings. In addition, we provide a thorough analysis of the superiority of learned over hand-designed sparsity patterns for sparse expert methods and the pivotal importance of the sparsity level in balancing between in-distribution and out-of-distribution generalization. Our code is publicly available.
comment: The Forty-first International Conference on Machine Learning, 2024
♻ ☆ Diffusion Models, Image Super-Resolution And Everything: A Survey
Diffusion Models (DMs) have disrupted the image Super-Resolution (SR) field and further closed the gap between image quality and human perceptual preferences. They are easy to train and can produce very high-quality samples that exceed the realism of those produced by previous generative methods. Despite their promising results, they also come with new challenges that need further research: high computational demands, comparability, lack of explainability, color shifts, and more. Unfortunately, entry into this field is overwhelming because of the abundance of publications. To address this, we provide a unified recount of the theoretical foundations underlying DMs applied to image SR and offer a detailed analysis that underscores the unique characteristics and methodologies within this domain, distinct from broader existing reviews in the field. This survey articulates a cohesive understanding of DM principles and explores current research avenues, including alternative input domains, conditioning techniques, guidance mechanisms, corruption spaces, and zero-shot learning approaches. By offering a detailed examination of the evolution and current trends in image SR through the lens of DMs, this survey sheds light on the existing challenges and charts potential future directions, aiming to inspire further innovation in this rapidly advancing area.
♻ ☆ GroPrompt: Efficient Grounded Prompting and Adaptation for Referring Video Object Segmentation CVPR
Referring Video Object Segmentation (RVOS) aims to segment the object referred to by the query sentence throughout the entire video. Most existing methods require end-to-end training with dense mask annotations, which could be computation-consuming and less scalable. In this work, we aim to efficiently adapt foundation segmentation models for addressing RVOS from weak supervision with the proposed Grounded Prompting (GroPrompt) framework. More specifically, we propose Text-Aware Prompt Contrastive Learning (TAP-CL) to enhance the association between the position prompts and the referring sentences with only box supervisions, including Text-Contrastive Prompt Learning (TextCon) and Modality-Contrastive Prompt Learning (ModalCon) at frame level and video level, respectively. With the proposed TAP-CL, our GroPrompt framework can generate temporal-consistent yet text-aware position prompts describing locations and movements for the referred object from the video. The experimental results in the standard RVOS benchmarks (Ref-YouTube-VOS, Ref-DAVIS17, A2D-Sentences, and JHMDB-Sentences) demonstrate the competitive performance of our proposed GroPrompt framework given only bounding box weak supervisions.
comment: CVPR Workshop (CVinW) 2024. Project page: https://jack24658735.github.io/groprompt/
♻ ☆ DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
A primary hurdle of autonomous driving in urban environments is understanding complex and long-tail scenarios, such as challenging road conditions and delicate human behaviors. We introduce DriveVLM, an autonomous driving system leveraging Vision-Language Models (VLMs) for enhanced scene understanding and planning capabilities. DriveVLM integrates a unique combination of reasoning modules for scene description, scene analysis, and hierarchical planning. Furthermore, recognizing the limitations of VLMs in spatial reasoning and heavy computational requirements, we propose DriveVLM-Dual, a hybrid system that synergizes the strengths of DriveVLM with the traditional autonomous driving pipeline. Experiments on both the nuScenes dataset and our SUP-AD dataset demonstrate the efficacy of DriveVLM and DriveVLM-Dual in handling complex and unpredictable driving conditions. Finally, we deploy the DriveVLM-Dual on a production vehicle, verifying it is effective in real-world autonomous driving environments.
comment: Project Page: https://tsinghua-mars-lab.github.io/DriveVLM/
♻ ☆ MODIPHY: Multimodal Obscured Detection for IoT using PHantom Convolution-Enabled Faster YOLO ICIP
Low-light conditions and occluded scenarios impede object detection in real-world Internet of Things (IoT) applications like autonomous vehicles and security systems. While advanced machine learning models strive for accuracy, their computational demands clash with the limitations of resource-constrained devices, hampering real-time performance. In our current research, we tackle this challenge, by introducing ``YOLO Phantom", one of the smallest YOLO models ever conceived. YOLO Phantom utilizes the novel Phantom Convolution block, achieving comparable accuracy to the latest YOLOv8n model while simultaneously reducing both parameters and model size by 43\%, resulting in a significant 19\% reduction in Giga Floating-Point Operations (GFLOPs). YOLO Phantom leverages transfer learning on our multimodal RGB-infrared dataset to address low-light and occlusion issues, equipping it with robust vision under adverse conditions. Its real-world efficacy is demonstrated on an IoT platform with advanced low-light and RGB cameras, seamlessly connecting to an AWS-based notification endpoint for efficient real-time object detection. Benchmarks reveal a substantial boost of 17\% and 14\% in frames per second (FPS) for thermal and RGB detection, respectively, compared to the baseline YOLOv8n model. For community contribution, both the code and the multimodal dataset are available on GitHub.
comment: This paper has been accepted for publication at the IEEE International Conference on Image Processing (ICIP) 2024
♻ ☆ Multicam-SLAM: Non-overlapping Multi-camera SLAM for Indirect Visual Localization and Navigation
This paper presents a novel approach to visual simultaneous localization and mapping (SLAM) using multiple RGB-D cameras. The proposed method, Multicam-SLAM, significantly enhances the robustness and accuracy of SLAM systems by capturing more comprehensive spatial information from various perspectives. This method enables the accurate determination of pose relationships among multiple cameras without the need for overlapping fields of view. The proposed Muticam-SLAM includes a unique multi-camera model, a multi-keyframes structure, and several parallel SLAM threads. The multi-camera model allows for the integration of data from multiple cameras, while the multi-keyframes and parallel SLAM threads ensure efficient and accurate pose estimation and mapping. Extensive experiments in various environments demonstrate the superior accuracy and robustness of the proposed method compared to conventional single-camera SLAM systems. The results highlight the potential of the proposed Multicam-SLAM for more complex and challenging applications. Code is available at \url{https://github.com/AlterPang/Multi_ORB_SLAM}.
♻ ☆ Leveraging Latents for Efficient Thermography Classification and Segmentation
Breast cancer is a prominent health concern worldwide, currently being the secondmost common and second-deadliest type of cancer in women. While current breast cancer diagnosis mainly relies on mammography imaging, in recent years the use of thermography for breast cancer imaging has been garnering growing popularity. Thermographic imaging relies on infrared cameras to capture body-emitted heat distributions. While these heat signatures have proven useful for computer-vision systems for accurate breast cancer segmentation and classification, prior work often relies on handcrafted feature engineering or complex architectures, potentially limiting the comparability and applicability of these methods. In this work, we present a novel algorithm for both breast cancer classification and segmentation. Rather than focusing efforts on manual feature and architecture engineering, our algorithm focuses on leveraging an informative, learned feature space, thus making our solution simpler to use and extend to other frameworks and downstream tasks, as well as more applicable to data-scarce settings. Our classification produces SOTA results, while we are the first work to produce segmentation regions studied in this paper.
♻ ☆ LSKNet: A Foundation Lightweight Backbone for Remote Sensing
Remote sensing images pose distinct challenges for downstream tasks due to their inherent complexity. While a considerable amount of research has been dedicated to remote sensing classification, object detection and semantic segmentation, most of these studies have overlooked the valuable prior knowledge embedded within remote sensing scenarios. Such prior knowledge can be useful because remote sensing objects may be mistakenly recognized without referencing a sufficiently long-range context, which can vary for different objects. This paper considers these priors and proposes a lightweight Large Selective Kernel Network (LSKNet) backbone. LSKNet can dynamically adjust its large spatial receptive field to better model the ranging context of various objects in remote sensing scenarios. To our knowledge, large and selective kernel mechanisms have not been previously explored in remote sensing images. Without bells and whistles, our lightweight LSKNet sets new state-of-the-art scores on standard remote sensing classification, object detection and semantic segmentation benchmarks. Our comprehensive analysis further validated the significance of the identified priors and the effectiveness of LSKNet. The code is available at https://github.com/zcablii/LSKNet.
comment: arXiv admin note: substantial text overlap with arXiv:2303.09030
♻ ☆ AIC MLLM: Autonomous Interactive Correction MLLM for Robust Robotic Manipulation
The ability to reflect on and correct failures is crucial for robotic systems to interact stably with real-life objects.Observing the generalization and reasoning capabilities of Multimodal Large Language Models (MLLMs), previous approaches have aimed to utilize these models to enhance robotic systems accordingly.However, these methods typically focus on high-level planning corrections using an additional MLLM, with limited utilization of failed samples to correct low-level contact poses. To address this gap, we propose an Autonomous Interactive Correction (AIC) MLLM, which makes use of previous low-level interaction experiences to correct SE(3) pose predictions. Specifically, AIC MLLM is initially fine-tuned to acquire both pose prediction and feedback prompt comprehension abilities.We carefully design two types of prompt instructions through interactions with objects: 1) visual masks to highlight unmovable parts for position correction, and 2)textual descriptions to indicate potential directions for rotation correction.During inference, a Feedback Information Extraction module is introduced to recognize the failure cause, allowing AIC MLLM to adaptively correct the pose prediction using the corresponding prompts. To further enhance manipulation stability, we devise a Test Time Adaptation strategy that enables AIC MLLM to better adapt to the current scene configuration.Finally, extensive experiments are conducted in both simulated and real-world environments to evaluate the proposed method. The results demonstrate that our AIC MLLM can efficiently correct failure samples by leveraging interaction experience prompts.Real-world demonstration can be found at https://sites.google.com/view/aic-mllm
Computation and Language 19
☆ Preference Tuning For Toxicity Mitigation Generalizes Across Languages
Detoxifying multilingual Large Language Models (LLMs) has become crucial due to their increasing global use. In this work, we explore zero-shot cross-lingual generalization of preference tuning in detoxifying LLMs. Unlike previous studies that show limited cross-lingual generalization for other safety tasks, we demonstrate that Direct Preference Optimization (DPO) training with only English data can significantly reduce toxicity in multilingual open-ended generations. For example, the probability of mGPT-1.3B generating toxic continuations drops from 46.8% to 3.9% across 17 different languages after training. Our results also extend to other multilingual LLMs, such as BLOOM, Llama3, and Aya-23. Using mechanistic interpretability tools like causal intervention and activation analysis, we identified the dual multilinguality property of MLP layers in LLMs, which explains the cross-lingual generalization of DPO. Finally, we show that bilingual sentence retrieval can predict the cross-lingual transferability of DPO preference tuning.
☆ Multi-Objective Linguistic Control of Large Language Models
Large language models (LLMs), despite their breakthroughs on many challenging benchmark tasks, lean to generate verbose responses and lack the controllability of output complexity, which is usually preferred by human users in practice. In this paper, we study how to precisely control multiple linguistic complexities of LLM output by finetuning using off-the-shelf data. To this end, we propose multi-control tuning (MCTune), which includes multiple linguistic complexity values of ground-truth responses as controls in the input for instruction tuning. We finetune LLaMA2-7B on Alpaca-GPT4 and WizardLM datasets. Evaluations on widely used benchmarks demonstrate that our method does not only improve LLMs' multi-complexity controllability substantially but also retains or even enhances the quality of the responses as a side benefit.
☆ Continuous Output Personality Detection Models via Mixed Strategy Training
The traditional personality models only yield binary results. This paper presents a novel approach for training personality detection models that produce continuous output values, using mixed strategies. By leveraging the PANDORA dataset, which includes extensive personality labeling of Reddit comments, we developed models that predict the Big Five personality traits with high accuracy. Our approach involves fine-tuning a RoBERTa-base model with various strategies such as Multi-Layer Perceptron (MLP) integration, and hyperparameter tuning. The results demonstrate that our models significantly outperform traditional binary classification methods, offering precise continuous outputs for personality traits, thus enhancing applications in AI, psychology, human resources, marketing and health care fields.
☆ LLMs' Classification Performance is Overclaimed
In many classification tasks designed for AI or human to solve, gold labels are typically included within the label space by default, often posed as "which of the following is correct?" This standard setup has traditionally highlighted the strong performance of advanced AI, particularly top-performing Large Language Models (LLMs), in routine classification tasks. However, when the gold label is intentionally excluded from the label space, it becomes evident that LLMs still attempt to select from the available label candidates, even when none are correct. This raises a pivotal question: Do LLMs truly demonstrate their intelligence in understanding the essence of classification tasks? In this study, we evaluate both closed-source and open-source LLMs across representative classification tasks, arguing that the perceived performance of LLMs is overstated due to their inability to exhibit the expected comprehension of the task. This paper makes a threefold contribution: i) To our knowledge, this is the first work to identify the limitations of LLMs in classification tasks when gold labels are absent. We define this task as Classify-w/o-Gold and propose it as a new testbed for LLMs. ii) We introduce a benchmark, Know-No, comprising two existing classification tasks and one new task, to evaluate Classify-w/o-Gold. iii) This work defines and advocates for a new evaluation metric, OmniAccuracy, which assesses LLMs' performance in classification tasks both when gold labels are present and absent.
☆ Blind Baselines Beat Membership Inference Attacks for Foundation Models
Membership inference (MI) attacks try to determine if a data sample was used to train a machine learning model. For foundation models trained on unknown Web data, MI attacks can be used to detect copyrighted training materials, measure test set contamination, or audit machine unlearning. Unfortunately, we find that evaluations of MI attacks for foundation models are flawed, because they sample members and non-members from different distributions. For 8 published MI evaluation datasets, we show that blind attacks -- that distinguish the member and non-member distributions without looking at any trained model -- outperform state-of-the-art MI attacks. Existing evaluations thus tell us nothing about membership leakage of a foundation model's training data.
☆ GraphEval2000: Benchmarking and Improving Large Language Models on Graph Datasets
Large language models (LLMs) have achieved remarkable success in natural language processing (NLP), demonstrating significant capabilities in processing and understanding text data. However, recent studies have identified limitations in LLMs' ability to reason about graph-structured data. To address this gap, we introduce GraphEval2000, the first comprehensive graph dataset, comprising 40 graph data structure problems along with 2000 test cases. Additionally, we introduce an evaluation framework based on GraphEval2000, designed to assess the graph reasoning abilities of LLMs through coding challenges. Our dataset categorizes test cases into four primary and four sub-categories, ensuring a comprehensive evaluation. We evaluate eight popular LLMs on GraphEval2000, revealing that LLMs exhibit a better understanding of directed graphs compared to undirected ones. While private LLMs consistently outperform open-source models, the performance gap is narrowing. Furthermore, to improve the usability of our evaluation framework, we propose Structured Symbolic Decomposition (SSD), an instruction-based method designed to enhance LLM performance on GraphEval2000. Results show that SSD improves the performance of GPT-3.5, GPT-4, and GPT-4o on complex graph problems, with an increase of 11.11\%, 33.37\%, and 33.37\%, respectively.
comment: Submitted to NeurIPs 2024 Dataset and Benchmark track, under review
☆ FS-RAG: A Frame Semantics Based Approach for Improved Factual Accuracy in Large Language Models
We present a novel extension to Retrieval Augmented Generation with the goal of mitigating factual inaccuracies in the output of large language models. Specifically, our method draws on the cognitive linguistic theory of frame semantics for the indexing and retrieval of factual information relevant to helping large language models answer queries. We conduct experiments to demonstrate the effectiveness of this method both in terms of retrieval effectiveness and in terms of the relevance of the frames and frame relations automatically generated. Our results show that this novel mechanism of Frame Semantic-based retrieval, designed to improve Retrieval Augmented Generation (FS-RAG), is effective and offers potential for providing data-driven insights into frame semantics theory. We provide open access to our program code and prompts.
comment: program code and prompts available at https://github.com/H-TayyarMadabushi/A-Frame-Semantics-based-approach-for-Improved-Factual-Accuracy-in-Large-Language-Models
☆ Towards Region-aware Bias Evaluation Metrics
When exposed to human-generated data, language models are known to learn and amplify societal biases. While previous works introduced benchmarks that can be used to assess the bias in these models, they rely on assumptions that may not be universally true. For instance, a gender bias dimension commonly used by these metrics is that of family--career, but this may not be the only common bias in certain regions of the world. In this paper, we identify topical differences in gender bias across different regions and propose a region-aware bottom-up approach for bias assessment. Our proposed approach uses gender-aligned topics for a given region and identifies gender bias dimensions in the form of topic pairs that are likely to capture gender societal biases. Several of our proposed bias topic pairs are on par with human perception of gender biases in these regions in comparison to the existing ones, and we also identify new pairs that are more aligned than the existing ones. In addition, we use our region-aware bias topic pairs in a Word Embedding Association Test (WEAT)-based evaluation metric to test for gender biases across different regions in different data domains. We also find that LLMs have a higher alignment to bias pairs for highly-represented regions showing the importance of region-aware bias evaluation metric.
☆ Chain-of-Probe: Examing the Necessity and Accuracy of CoT Step-by-Step
Current research found the issue of Early Answering in large language models (LLMs), where the models already have an answer before generating the Chain-of-Thought (CoT). This phenomenon suggests a potential lack of necessary dependency between the predicted answer and the reasoning process. Consequently, two important questions arise: (1) Is CoT still necessary if the model already has an answer? (2) Can the correctness of the answer serve as valid evidence for the correctness of CoT? To address these questions, we propose a method, namely Chain-of-Probe (CoP), to probe changes in the mind during the model's reasoning. The probing results show that in a significant number of question-answer cases, CoT appears to be unnecessary, and this necessity correlates with the simplicity of the task, defined by reasoning steps required. Furthermore, by analyzing patterns in mind change, we examine the correctness of the model's reasoning. Our validation reveals that many responses, although correct in their final answer, contain errors in their reasoning process. To this end, we propose a strategic approach based on CoP to prioritize answers with correct reasoning among multiple candidates, thereby bolstering the reliability of the model's reasoning.
☆ Crosslingual Capabilities and Knowledge Barriers in Multilingual Large Language Models
Large language models (LLMs) are typically multilingual due to pretraining on diverse multilingual corpora. But can these models relate corresponding concepts across languages, effectively being crosslingual? This study evaluates six state-of-the-art LLMs on inherently crosslingual tasks. We observe that while these models show promising surface-level crosslingual abilities on machine translation and embedding space analyses, they struggle with deeper crosslingual knowledge transfer, revealing a crosslingual knowledge barrier in both general (MMLU benchmark) and domain-specific (Harry Potter quiz) contexts. We observe that simple inference-time mitigation methods offer only limited improvement. On the other hand, we propose fine-tuning of LLMs on mixed-language data, which effectively reduces these gaps, even when using out-of-domain datasets like WikiText. Our findings suggest the need for explicit optimization to unlock the full crosslingual potential of LLMs. Our code is publicly available at https://github.com/google-research/crosslingual-knowledge-barriers.
♻ ☆ Use Your INSTINCT: INSTruction optimization for LLMs usIng Neural bandits Coupled with Transformers ICML 2024
Large language models (LLMs) have shown remarkable instruction-following capabilities and achieved impressive performances in various applications. However, the performances of LLMs depend heavily on the instructions given to them, which are typically manually tuned with substantial human efforts. Recent work has used the query-efficient Bayesian optimization (BO) algorithm to automatically optimize the instructions given to black-box LLMs. However, BO usually falls short when optimizing highly sophisticated (e.g., high-dimensional) objective functions, such as the functions mapping an instruction to the performance of an LLM. This is mainly due to the limited expressive power of the Gaussian process (GP) which is used by BO as a surrogate to model the objective function. Meanwhile, it has been repeatedly shown that neural networks (NNs), especially pre-trained transformers, possess strong expressive power and can model highly complex functions. So, we adopt a neural bandit algorithm which replaces the GP in BO by an NN surrogate to optimize instructions for black-box LLMs. More importantly, the neural bandit algorithm allows us to naturally couple the NN surrogate with the hidden representation learned by a pre-trained transformer (i.e., an open-source LLM), which significantly boosts its performance. These motivate us to propose our INSTruction optimization usIng Neural bandits Coupled with Transformers (INSTINCT) algorithm. We perform instruction optimization for ChatGPT and use extensive experiments to show that INSTINCT consistently outperforms baselines in different tasks, e.g., various instruction induction tasks and the task of improving zero-shot chain-of-thought instructions. Our code is available at https://github.com/xqlin98/INSTINCT.
comment: Accepted to ICML 2024
♻ ☆ I2EDL: Interactive Instruction Error Detection and Localization
In the Vision-and-Language Navigation in Continuous Environments (VLN-CE) task, the human user guides an autonomous agent to reach a target goal via a series of low-level actions following a textual instruction in natural language. However, most existing methods do not address the likely case where users may make mistakes when providing such instruction (e.g. "turn left" instead of "turn right"). In this work, we address a novel task of Interactive VLN in Continuous Environments (IVLN-CE), which allows the agent to interact with the user during the VLN-CE navigation to verify any doubts regarding the instruction errors. We propose an Interactive Instruction Error Detector and Localizer (I2EDL) that triggers the user-agent interaction upon the detection of instruction errors during the navigation. We leverage a pre-trained module to detect instruction errors and pinpoint them in the instruction by cross-referencing the textual input and past observations. In such way, the agent is able to query the user for a timely correction, without demanding the user's cognitive load, as we locate the probable errors to a precise part of the instruction. We evaluate the proposed I2EDL on a dataset of instructions containing errors, and further devise a novel metric, the Success weighted by Interaction Number (SIN), to reflect both the navigation performance and the interaction effectiveness. We show how the proposed method can ask focused requests for corrections to the user, which in turn increases the navigation success, while minimizing the interactions.
comment: Accepted at IEEE RO-MAN 2024
♻ ☆ Large Language Models for Data Annotation: A Survey
Data annotation generally refers to the labeling or generating of raw data with relevant information, which could be used for improving the efficacy of machine learning models. The process, however, is labor-intensive and costly. The emergence of advanced Large Language Models (LLMs), exemplified by GPT-4, presents an unprecedented opportunity to automate the complicated process of data annotation. While existing surveys have extensively covered LLM architecture, training, and general applications, we uniquely focus on their specific utility for data annotation. This survey contributes to three core aspects: LLM-Based Annotation Generation, LLM-Generated Annotations Assessment, and LLM-Generated Annotations Utilization. Furthermore, this survey includes an in-depth taxonomy of data types that LLMs can annotate, a comprehensive review of learning strategies for models utilizing LLM-generated annotations, and a detailed discussion of the primary challenges and limitations associated with using LLMs for data annotation. Serving as a key guide, this survey aims to assist researchers and practitioners in exploring the potential of the latest LLMs for data annotation, thereby fostering future advancements in this critical field.
♻ ☆ Open Models, Closed Minds? On Agents Capabilities in Mimicking Human Personalities through Open Large Language Models
The emergence of unveiling human-like behaviors in Large Language Models (LLMs) has led to a closer connection between NLP and human psychology. Scholars have been studying the inherent personalities exhibited by LLMs and attempting to incorporate human traits and behaviors into them. However, these efforts have primarily focused on commercially-licensed LLMs, neglecting the widespread use and notable advancements seen in Open LLMs. This work aims to address this gap by employing a set of 12 LLM Agents based on the most representative Open models and subject them to a series of assessments concerning the Myers-Briggs Type Indicator (MBTI) test and the Big Five Inventory (BFI) test. Our approach involves evaluating the intrinsic personality traits of Open LLM agents and determining the extent to which these agents can mimic human personalities when conditioned by specific personalities and roles. Our findings unveil that $(i)$ each Open LLM agent showcases distinct human personalities; $(ii)$ personality-conditioned prompting produces varying effects on the agents, with only few successfully mirroring the imposed personality, while most of them being ``closed-minded'' (i.e., they retain their intrinsic traits); and $(iii)$ combining role and personality conditioning can enhance the agents' ability to mimic human personalities. Our work represents a step up in understanding the dense relationship between NLP and human psychology through the lens of Open LLMs.
comment: Enhanced methodology and evaluation based on BFI in addition to MBTI, with expanded set of LLM agents. Author list changed w.r.t. the previous version (v1), see Acknowledgements
♻ ☆ Large Language Models Can Self-Correct with Minimal Effort
Intrinsic self-correct was a method that instructed large language models (LLMs) to verify and correct their responses without external feedback. Unfortunately, the study concluded that the LLMs could not self-correct reasoning yet. We find that a simple yet effective verification method can unleash inherent capabilities of the LLMs. That is to mask a key condition in the question, add the current response to construct a verification question, and predict the condition to verify the response. The condition can be an entity in an open-domain question or a numeric value in a math question, which requires minimal effort (via prompting) to identify. We propose an iterative verify-then-correct framework to progressively identify and correct (probably) false responses, named ProCo. We conduct experiments on three reasoning tasks. On average, ProCo, with GPT-3.5-Turbo as the backend LLM, yields $+6.8$ exact match on four open-domain question answering datasets, $+14.1$ accuracy on three arithmetic reasoning datasets, and $+9.6$ accuracy on a commonsense reasoning dataset, compared to Self-Correct.
comment: Work in Progress
♻ ☆ LLMs in the Loop: Leveraging Large Language Model Annotations for Active Learning in Low-Resource Languages ECML
Low-resource languages face significant barriers in AI development due to limited linguistic resources and expertise for data labeling, rendering them rare and costly. The scarcity of data and the absence of preexisting tools exacerbate these challenges, especially since these languages may not be adequately represented in various NLP datasets. To address this gap, we propose leveraging the potential of LLMs in the active learning loop for data annotation. Initially, we conduct evaluations to assess inter-annotator agreement and consistency, facilitating the selection of a suitable LLM annotator. The chosen annotator is then integrated into a training loop for a classifier using an active learning paradigm, minimizing the amount of queried data required. Empirical evaluations, notably employing GPT-4-Turbo, demonstrate near-state-of-the-art performance with significantly reduced data requirements, as indicated by estimated potential cost savings of at least 42.45 times compared to human annotation. Our proposed solution shows promising potential to substantially reduce both the monetary and computational costs associated with automation in low-resource settings. By bridging the gap between low-resource languages and AI, this approach fosters broader inclusion and shows the potential to enable automation across diverse linguistic landscapes.
comment: 20 pages, 6 tables. The source code related to this paper is available at https://github.com/mkandai/llms-in-the-loop. This paper has been accepted for publication at ECML PKDD 2024
♻ ☆ Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation
The training process of large language models (LLMs) often involves varying degrees of test data contamination. Although current LLMs are achieving increasingly better performance on various benchmarks, their performance in practical applications does not always match their benchmark results. Leakage of benchmarks can prevent the accurate assessment of LLMs' true performance. However, constructing new benchmarks is costly, labor-intensive and still carries the risk of leakage. Therefore, in this paper, we ask the question, Can we reuse these leaked benchmarks for LLM evaluation? We propose Inference-Time Decontamination (ITD) to address this issue by detecting and rewriting leaked samples without altering their difficulties. ITD can mitigate performance inflation caused by memorizing leaked benchmarks. Our proof-of-concept experiments demonstrate that ITD reduces inflated accuracy by 22.9% on GSM8K and 19.0% on MMLU. On MMLU, using Inference-time Decontamination can lead to a decrease in the results of Phi3 and Mistral by 6.7% and 3.6% respectively. We hope that ITD can provide more truthful evaluation results for large language models.
♻ ☆ MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
In the age of large-scale language models, benchmarks like the Massive Multitask Language Understanding (MMLU) have been pivotal in pushing the boundaries of what AI can achieve in language comprehension and reasoning across diverse domains. However, as models continue to improve, their performance on these benchmarks has begun to plateau, making it increasingly difficult to discern differences in model capabilities. This paper introduces MMLU-Pro, an enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark by integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. Additionally, MMLU-Pro eliminates the trivial and noisy questions in MMLU. Our experimental results show that MMLU-Pro not only raises the challenge, causing a significant drop in accuracy by 16% to 33% compared to MMLU but also demonstrates greater stability under varying prompts. With 24 different prompt styles tested, the sensitivity of model scores to prompt variations decreased from 4-5% in MMLU to just 2% in MMLU-Pro. Additionally, we found that models utilizing Chain of Thought (CoT) reasoning achieved better performance on MMLU-Pro compared to direct answering, which is in stark contrast to the findings on the original MMLU, indicating that MMLU-Pro includes more complex reasoning questions. Our assessments confirm that MMLU-Pro is a more discriminative benchmark to better track progress in the field.
♻ ☆ Extending Token Computation for LLM Reasoning
Large Language Models (LLMs) are pivotal in advancing natural language processing but often struggle with complex reasoning tasks due to inefficient attention distributions. In this paper, we explore the effect of increased computed tokens on LLM performance and introduce a novel method for extending computed tokens in the Chain-of-Thought (CoT) process, utilizing attention mechanism optimization. By fine-tuning an LLM on a domain-specific, highly structured dataset, we analyze attention patterns across layers, identifying inefficiencies caused by non-semantic tokens with outlier high attention scores. To address this, we propose an algorithm that emulates early layer attention patterns across downstream layers to re-balance skewed attention distributions and enhance knowledge abstraction. Our findings demonstrate that our approach not only facilitates a deeper understanding of the internal dynamics of LLMs but also significantly improves their reasoning capabilities, particularly in non-STEM domains. Our study lays the groundwork for further innovations in LLM design, aiming to create more powerful, versatile, and responsible models capable of tackling a broad range of real-world applications.
Systems and Control 9
☆ Received Power Maximization Using Nonuniform Discrete Phase Shifts for RISs With a Limited Phase Range
To maximize the received power at a user equipment, the problem of optimizing a reconfigurable intelligent surface (RIS) with a limited phase range R and nonuniform discrete phase shifts with adjustable gains is addressed. Necessary and sufficient conditions to achieve this maximization are given. These conditions are employed in two algorithms to achieve the global optimum in linear time for R {\ge} {\pi} and R < {\pi}, where R is the RIS phase range. With a total number of N(2K +1) complex vector additions, it is shown for R {\ge} {\pi} and R < {\pi} that the global optimality is achieved in NK or fewer and N(K +1) or fewer steps, respectively, where N is the number of RIS elements and K is the number of discrete phase shifts which may be placed nonuniformly over the phase range R. In addition, we define an intuitive quantization algorithm that we call the nonuniform polar quantization (NPQ) algorithm. With NPQ, we provide a closed form solution for the approximation ratio with which an arbitrary set of nonuniform discrete phase shifts can approximate the continuous solution. We also show that with a phase range limitation, equal separation among the nonuniform discrete phase shifts maximizes the normalized performance. Furthermore, we show that the gain of using K {\ge} 3 with R < {\pi}/2 and K {\ge} 4 with R < {\pi} is only marginal. Finally, we reveal that when R < 2{\pi}/3, ON/OFF selection for the RIS elements brings significant performance compared to the case when the RIS elements are strictly ON.
comment: 27 pages, 19 figures
☆ Frequency Constrained MPC for Efficient Grid Side Operation of Wind Power Conversion Systems
Model predictive control (MPC) has proven its applicability in power conversion control with its fast dynamic response to reference changes while ensuring critical system constraints are satisfied. Even then, the computational burden still remains a challenge for many MPC variants. In this regard, this paper formulates an indirect MPC scheme for grid-side wind converters. A quadratic program with linear constraints is solved in a receding horizon fashion with a subsequent PWM modulator. To facilitate its solution within a few hundreds of microseconds, its decision variables (modulating signals) are restricted to a specific frequency content. This approach limits the increase in problem size due to horizon length. In case studies, the proposed MPC exhibits fast response in faults and operates the converter within its safety limits.
☆ Monte Carlo Planning for Stochastic Control on Constrained Markov Decision Processes
In the world of stochastic control, especially in economics and engineering, Markov Decision Processes (MDPs) can effectively model various stochastic decision processes, from asset management to transportation optimization. These underlying MDPs, upon closer examination, often reveal a specifically constrained causal structure concerning the transition and reward dynamics. By exploiting this structure, we can obtain a reduction in the causal representation of the problem setting, allowing us to solve of the optimal value function more efficiently. This work defines an MDP framework, the \texttt{SD-MDP}, where we disentangle the causal structure of MDPs' transition and reward dynamics, providing distinct partitions on the temporal causal graph. With this stochastic reduction, the \texttt{SD-MDP} reflects a general class of resource allocation problems. This disentanglement further enables us to derive theoretical guarantees on the estimation error of the value function under an optimal policy by allowing independent value estimation from Monte Carlo sampling. Subsequently, by integrating this estimator into well-known Monte Carlo planning algorithms, such as Monte Carlo Tree Search (MCTS), we derive bounds on the simple regret of the algorithm. Finally, we quantify the policy improvement of MCTS under the \texttt{SD-MDP} framework by demonstrating that the MCTS planning algorithm achieves higher expected reward (lower costs) under a constant simulation budget, on a tangible economic example based on maritime refuelling.
comment: Working manuscript
☆ The Analysis and the Performance of the Parallel-Partial Reset Control System
Reset controllers have demonstrated their effectiveness in enhancing performance in precision motion systems. To further exploiting the potential of reset controllers, this study introduces a parallel-partial reset control structure. Frequency response analysis is effective for the design and fine-tuning of controllers in industries. However, conducting frequency response analysis for reset control systems poses challenges due to their nonlinearities. We develop frequency response analysis methods for both the open-loop and closed-loop parallel-partial reset systems. Simulation results validate the accuracy of the analysis methods, showcasing precision enhancements exceeding 100% compared to the traditional describing function method. Furthermore, we design a parallel-partial reset controller within the Proportional-Integral-Derivative (PID) control structure for a mass-spring-damper system. The frequency response analysis of the designed system indicates that, while maintaining the same bandwidth and phase margin of the first-order harmonics, the new system exhibits lower magnitudes of higher-order harmonics, compared to the traditional reset system. Moreover, simulation results demonstrate that the new system achieves lower overshoot and quicker settling time compared to both the traditional reset and linear systems.
comment: Submitted to The European Control Conference 2024
☆ Proper Implicit Discretization of the Super-Twisting Controller -- without and with Actuator Saturation
The discrete-time implementation of the super-twisting sliding mode controller for a plant with disturbances with bounded slope, zero-order hold actuation, and actuator constraints is considered. Motivated by restrictions of existing implicit or semi-implicit discretization variants, a new proper implicit discretization for the super-twisting controller is proposed. This discretization is then extended to the conditioned super-twisting controller, which mitigates windup in presence of actuator constraints by means of the conditioning technique. It is proven that the proposed controllers achieve best possible worst-case performance subject to similarily simple stability conditions as their continuous-time counterparts. Numerical simulations and comparisons demonstrate and illustrate the results.
☆ Characterization of Lightweight GPS Disciplined Oscillators for Distributed UAV Measurement Applications
With an increasing variety of measurement applications using sensing nodes on unmanned aerial vehicles (UAVs), global positioning system (GPS) and global navigation satellite system (GNSS) disciplined oscillators (GPSDOs, GNSSDOs) are an appealing solution for precise wireless inter-device synchronization. Typically evaluated under laboratory conditions by analyzing the 10 MHz and 1 pulse per second (PPS) reference signal stability, these test methods overlook airborne oscillator performance. This paper characterizes lightweight GNSSDO models for flight suitability using a measurement system based on software defined radios (SDRs). We analyze reference signal stability under controlled GNSS signal impairments to predict performance and measurement precision loss in dynamic operational modes. Additionally, we assess behavior under the impacts caused by operating the UAV, as well as typical flight vibrations and accelerations outlined in the standard for payload devices.
☆ State-Compensation-Linearization-Based Stability Margin Analysis for a Class of Nonlinear Systems: A Data-Driven Method
The classical stability margin analysis based on the linearized model is widely used in practice even in nonlinear systems. Although linear analysis techniques are relatively standard and have simple implementation structures, they are prone to misbehavior and failure when the system is performing an off-nominal operation. To avoid the drawbacks and exploit the advantages of linear analysis methods and frequency-domain stability margin analysis while tackling system nonlinearity, a state-compensation-linearization-based stability margin analysis method is studied in the paper. Based on the state-compensation-linearization-based stabilizing control, the definition and measurement of the stability margin are given. The l2 gain margin and l2 time-delay margin for the closed-loop nonlinear system with state-compensation-linearization-based stabilizing control are defined and derived approximatively by the small-gain theorem in theory. The stability margin measurement can be carried out by the frequency-sweep method in practice. The proposed method is a data-driven method for obtaining the stability margin of nonlinear systems, which is practical and can be applied to practical systems directly. Finally, three numerical examples are given to illustrate the effectiveness of the proposed method.
☆ Deep-MPC: A DAGGER-Driven Imitation Learning Strategy for Optimal Constrained Battery Charging
In the realm of battery charging, several complex aspects demand meticulous attention, including thermal management, capacity degradation, and the need for rapid charging while maintaining safety and battery lifespan. By employing the imitation learning paradigm, this manuscript introduces an innovative solution to confront the inherent challenges often associated with conventional predictive control strategies for constrained battery charging. A significant contribution of this study lies in the adaptation of the Dataset Aggregation (DAGGER) algorithm to address scenarios where battery parameters are uncertain, and internal states are unobservable. Results drawn from a practical battery simulator that incorporates an electrochemical model highlight substantial improvements in battery charging performance, particularly in meeting all safety constraints and outperforming traditional strategies in computational processing.
comment: 7 pages, 4 figures, submitted to American Control Conference 2024 (ACC2024)
♻ ☆ Wholesale Market Participation of DERA: Competitive DER Aggregation
We consider the aggregation of distributed energy resources (DERs) by a profit-seeking DER aggregator (DERA), participating directly in the wholesale electricity market and constrained by the distribution network access. We propose a competitive DER aggregation that maximizes the DERA's profit subject to that each customer of the DERA gains no less surplus and pays no higher energy cost than that under the regulated retail tariff. The DERA participates in the wholesale electricity market as virtual storage with optimized generation offers and consumption bids derived from the DERA's competitive aggregation. Also derived are DERA's bid curves for the distribution network access and DERA's profitability when competing with the regulated retail tariff. We show that, with the same distribution network access, the proposed DERA's wholesale market participation achieves the same welfare-maximizing outcome as when its customers participate directly in the wholesale market. We empirically evaluate how many DERAs can survive in the long-run equilibrium. Numerical studies also compare the proposed DERA with existing benchmarks on surpluses of DERA's customers and DERA profits as functions of the DER adoption level and the distribution network access.
comment: 15 pages, 8 figures
Multiagent Systems 1
☆ TornadoDrone: Bio-inspired DRL-based Drone Landing on 6D Platform with Wind Force Disturbances
Autonomous drone navigation faces a critical challenge in achieving accurate landings on dynamic platforms, especially under unpredictable conditions such as wind turbulence. Our research introduces TornadoDrone, a novel Deep Reinforcement Learning (DRL) model that adopts bio-inspired mechanisms to adapt to wind forces, mirroring the natural adaptability seen in birds. This model, unlike traditional approaches, derives its adaptability from indirect cues such as changes in position and velocity, rather than direct wind force measurements. TornadoDrone was rigorously trained in the gym-pybullet-drone simulator, which closely replicates the complexities of wind dynamics in the real world. Through extensive testing with Crazyflie 2.1 drones in both simulated and real windy conditions, TornadoDrone demonstrated a high performance in maintaining high-precision landing accuracy on moving platforms, surpassing conventional control methods such as PID controllers with Extended Kalman Filters. The study not only highlights the potential of DRL to tackle complex aerodynamic challenges but also paves the way for advanced autonomous systems that can adapt to environmental changes in real-time. The success of TornadoDrone signifies a leap forward in drone technology, particularly for critical applications such as surveillance and emergency response, where reliability and precision are paramount.
comment: arXiv admin note: substantial text overlap with arXiv:2403.06572
Artificial Intelligence 9
☆ Preference Tuning For Toxicity Mitigation Generalizes Across Languages
Detoxifying multilingual Large Language Models (LLMs) has become crucial due to their increasing global use. In this work, we explore zero-shot cross-lingual generalization of preference tuning in detoxifying LLMs. Unlike previous studies that show limited cross-lingual generalization for other safety tasks, we demonstrate that Direct Preference Optimization (DPO) training with only English data can significantly reduce toxicity in multilingual open-ended generations. For example, the probability of mGPT-1.3B generating toxic continuations drops from 46.8% to 3.9% across 17 different languages after training. Our results also extend to other multilingual LLMs, such as BLOOM, Llama3, and Aya-23. Using mechanistic interpretability tools like causal intervention and activation analysis, we identified the dual multilinguality property of MLP layers in LLMs, which explains the cross-lingual generalization of DPO. Finally, we show that bilingual sentence retrieval can predict the cross-lingual transferability of DPO preference tuning.
☆ Jacobian Descent for Multi-Objective Optimization
Many optimization problems are inherently multi-objective. To address them, we formalize Jacobian descent (JD), a direct generalization of gradient descent for vector-valued functions. Each step of this algorithm relies on a Jacobian matrix consisting of one gradient per objective. The aggregator, responsible for reducing this matrix into an update vector, characterizes JD. While the multi-task learning literature already contains a variety of aggregators, they often lack some natural properties. In particular, the update should not conflict with any objective and should scale proportionally to the norm of each gradient. We propose a new aggregator specifically designed to satisfy this. Emphasizing conflict between objectives, we then highlight direct applications for our methods. Most notably, we introduce instance-wise risk minimization (IWRM), a learning paradigm in which the loss of each training example is considered a separate objective. On simple image classification tasks, IWRM exhibits promising results compared to the direct minimization of the average loss. The performance of our aggregator in those experiments also corroborates our theoretical findings. Lastly, as speed is the main limitation of JD, we provide a path towards a more efficient implementation.
comment: 39 pages, 10 figures, conference
☆ Gradual Divergence for Seamless Adaptation: A Novel Domain Incremental Learning Method ICML 2024
Domain incremental learning (DIL) poses a significant challenge in real-world scenarios, as models need to be sequentially trained on diverse domains over time, all the while avoiding catastrophic forgetting. Mitigating representation drift, which refers to the phenomenon of learned representations undergoing changes as the model adapts to new tasks, can help alleviate catastrophic forgetting. In this study, we propose a novel DIL method named DARE, featuring a three-stage training process: Divergence, Adaptation, and REfinement. This process gradually adapts the representations associated with new tasks into the feature space spanned by samples from previous tasks, simultaneously integrating task-specific decision boundaries. Additionally, we introduce a novel strategy for buffer sampling and demonstrate the effectiveness of our proposed method, combined with this sampling strategy, in reducing representation drift within the feature encoder. This contribution effectively alleviates catastrophic forgetting across multiple DIL benchmarks. Furthermore, our approach prevents sudden representation drift at task boundaries, resulting in a well-calibrated DIL model that maintains the performance on previous tasks.
comment: Accepted at 41st International Conference on Machine Learning (ICML 2024)
☆ From Text to Test: AI-Generated Control Software for Materials Science Instruments
Large language models (LLMs) are transforming the landscape of chemistry and materials science. Recent examples of LLM-accelerated experimental research include virtual assistants for parsing synthesis recipes from the literature, or using the extracted knowledge to guide synthesis and characterization. Despite these advancements, their application is constrained to labs with automated instruments and control software, leaving much of materials science reliant on manual processes. Here, we demonstrate the rapid deployment of a Python-based control module for a Keithley 2400 electrical source measure unit using ChatGPT-4. Through iterative refinement, we achieved effective instrument management with minimal human intervention. Additionally, a user-friendly graphical user interface (GUI) was created, effectively linking all instrument controls to interactive screen elements. Finally, we integrated this AI-crafted instrument control software with a high-performance stochastic optimization algorithm to facilitate rapid and automated extraction of electronic device parameters related to semiconductor charge transport mechanisms from current-voltage (IV) measurement data. This integration resulted in a comprehensive open-source toolkit for semiconductor device characterization and analysis using IV curve measurements. We demonstrate the application of these tools by acquiring, analyzing, and parameterizing IV data from a Pt/Cr2O3/\b{eta}-Ga2O3 heterojunction diode, a novel stack for high-power and high-temperature electronic devices. This approach underscores the powerful synergy between LLMs and the development of instruments for scientific inquiry, showcasing a path for further acceleration in materials science.
☆ Continuous Output Personality Detection Models via Mixed Strategy Training
The traditional personality models only yield binary results. This paper presents a novel approach for training personality detection models that produce continuous output values, using mixed strategies. By leveraging the PANDORA dataset, which includes extensive personality labeling of Reddit comments, we developed models that predict the Big Five personality traits with high accuracy. Our approach involves fine-tuning a RoBERTa-base model with various strategies such as Multi-Layer Perceptron (MLP) integration, and hyperparameter tuning. The results demonstrate that our models significantly outperform traditional binary classification methods, offering precise continuous outputs for personality traits, thus enhancing applications in AI, psychology, human resources, marketing and health care fields.
☆ F-FOMAML: GNN-Enhanced Meta-Learning for Peak Period Demand Forecasting with Proxy Data
Demand prediction is a crucial task for e-commerce and physical retail businesses, especially during high-stake sales events. However, the limited availability of historical data from these peak periods poses a significant challenge for traditional forecasting methods. In this paper, we propose a novel approach that leverages strategically chosen proxy data reflective of potential sales patterns from similar entities during non-peak periods, enriched by features learned from a graph neural networks (GNNs)-based forecasting model, to predict demand during peak events. We formulate the demand prediction as a meta-learning problem and develop the Feature-based First-Order Model-Agnostic Meta-Learning (F-FOMAML) algorithm that leverages proxy data from non-peak periods and GNN-generated relational metadata to learn feature-specific layer parameters, thereby adapting to demand forecasts for peak events. Theoretically, we show that by considering domain similarities through task-specific metadata, our model achieves improved generalization, where the excess risk decreases as the number of training tasks increases. Empirical evaluations on large-scale industrial datasets demonstrate the superiority of our approach. Compared to existing state-of-the-art models, our method demonstrates a notable improvement in demand prediction accuracy, reducing the Mean Absolute Error by 26.24% on an internal vending machine dataset and by 1.04% on the publicly accessible JD.com dataset.
♻ ☆ Use Your INSTINCT: INSTruction optimization for LLMs usIng Neural bandits Coupled with Transformers ICML 2024
Large language models (LLMs) have shown remarkable instruction-following capabilities and achieved impressive performances in various applications. However, the performances of LLMs depend heavily on the instructions given to them, which are typically manually tuned with substantial human efforts. Recent work has used the query-efficient Bayesian optimization (BO) algorithm to automatically optimize the instructions given to black-box LLMs. However, BO usually falls short when optimizing highly sophisticated (e.g., high-dimensional) objective functions, such as the functions mapping an instruction to the performance of an LLM. This is mainly due to the limited expressive power of the Gaussian process (GP) which is used by BO as a surrogate to model the objective function. Meanwhile, it has been repeatedly shown that neural networks (NNs), especially pre-trained transformers, possess strong expressive power and can model highly complex functions. So, we adopt a neural bandit algorithm which replaces the GP in BO by an NN surrogate to optimize instructions for black-box LLMs. More importantly, the neural bandit algorithm allows us to naturally couple the NN surrogate with the hidden representation learned by a pre-trained transformer (i.e., an open-source LLM), which significantly boosts its performance. These motivate us to propose our INSTruction optimization usIng Neural bandits Coupled with Transformers (INSTINCT) algorithm. We perform instruction optimization for ChatGPT and use extensive experiments to show that INSTINCT consistently outperforms baselines in different tasks, e.g., various instruction induction tasks and the task of improving zero-shot chain-of-thought instructions. Our code is available at https://github.com/xqlin98/INSTINCT.
comment: Accepted to ICML 2024
♻ ☆ Step-by-Step Diffusion: An Elementary Tutorial
We present an accessible first course on diffusion models and flow matching for machine learning, aimed at a technical audience with no diffusion experience. We try to simplify the mathematical details as much as possible (sometimes heuristically), while retaining enough precision to derive correct algorithms.
comment: 35 pages, 11 figures
♻ ☆ I2EDL: Interactive Instruction Error Detection and Localization
In the Vision-and-Language Navigation in Continuous Environments (VLN-CE) task, the human user guides an autonomous agent to reach a target goal via a series of low-level actions following a textual instruction in natural language. However, most existing methods do not address the likely case where users may make mistakes when providing such instruction (e.g. "turn left" instead of "turn right"). In this work, we address a novel task of Interactive VLN in Continuous Environments (IVLN-CE), which allows the agent to interact with the user during the VLN-CE navigation to verify any doubts regarding the instruction errors. We propose an Interactive Instruction Error Detector and Localizer (I2EDL) that triggers the user-agent interaction upon the detection of instruction errors during the navigation. We leverage a pre-trained module to detect instruction errors and pinpoint them in the instruction by cross-referencing the textual input and past observations. In such way, the agent is able to query the user for a timely correction, without demanding the user's cognitive load, as we locate the probable errors to a precise part of the instruction. We evaluate the proposed I2EDL on a dataset of instructions containing errors, and further devise a novel metric, the Success weighted by Interaction Number (SIN), to reflect both the navigation performance and the interaction effectiveness. We show how the proposed method can ask focused requests for corrections to the user, which in turn increases the navigation success, while minimizing the interactions.
comment: Accepted at IEEE RO-MAN 2024
Information Retrieval 12
☆ A Mechanism for Optimizing Media Recommender Systems
A mechanism is described that addresses the fundamental trade off between media producers who want to increase reach and consumers who provide attention based on the rate of utility received, and where overreach negatively impacts that rate. An optimal solution can be achieved when the media source considers the impact of overreach in a cost function used in determining the optimal distribution of content to maximize individual consumer utility and participation. The result is a Nash equilibrium between producer and consumer that is also Pareto efficient. Comparison with the literature on Recommender systems highlights the advantages of the mechanism.The review suggests advancements over that literature including identifying an optimal content volume for the consumer and improvements for handling multiple objectives A practical algorithm to generate the optimal distribution for each consumer is provided.
comment: Main Paper: 20 pages, Appendix with proofs and additional material: 26 pages
☆ SimCE: Simplifying Cross-Entropy Loss for Collaborative Filtering
The learning objective is integral to collaborative filtering systems, where the Bayesian Personalized Ranking (BPR) loss is widely used for learning informative backbones. However, BPR often experiences slow convergence and suboptimal local optima, partially because it only considers one negative item for each positive item, neglecting the potential impacts of other unobserved items. To address this issue, the recently proposed Sampled Softmax Cross-Entropy (SSM) compares one positive sample with multiple negative samples, leading to better performance. Our comprehensive experiments confirm that recommender systems consistently benefit from multiple negative samples during training. Furthermore, we introduce a \underline{Sim}plified Sampled Softmax \underline{C}ross-\underline{E}ntropy Loss (SimCE), which simplifies the SSM using its upper bound. Our validation on 12 benchmark datasets, using both MF and LightGCN backbones, shows that SimCE significantly outperforms both BPR and SSM.
☆ Evaluating Ensemble Methods for News Recommender Systems
News recommendation is crucial for facilitating individuals' access to articles, particularly amid the increasingly digital landscape of news consumption. Consequently, extensive research is dedicated to News Recommender Systems (NRS) with increasingly sophisticated algorithms. Despite this sustained scholarly inquiry, there exists a notable research gap regarding the potential synergy achievable by amalgamating these algorithms to yield superior outcomes. This paper endeavours to address this gap by demonstrating how ensemble methods can be used to combine many diverse state-of-the-art algorithms to achieve superior results on the Microsoft News dataset (MIND). Additionally, we identify scenarios where ensemble methods fail to improve results and offer explanations for this occurrence. Our findings demonstrate that a combination of NRS algorithms can outperform individual algorithms, provided that the base learners are sufficiently diverse, with improvements of up to 5\% observed for an ensemble consisting of a content-based BERT approach and the collaborative filtering LSTUR algorithm. Additionally, our results demonstrate the absence of any improvement when combining insufficiently distinct methods. These findings provide insight into successful approaches of ensemble methods in NRS and advocates for the development of better systems through appropriate ensemble solutions.
☆ Evaluating D-MERIT of Partial-annotation on Information Retrieval
Retrieval models are often evaluated on partially-annotated datasets. Each query is mapped to a few relevant texts and the remaining corpus is assumed to be irrelevant. As a result, models that successfully retrieve false negatives are punished in evaluation. Unfortunately, completely annotating all texts for every query is not resource efficient. In this work, we show that using partially-annotated datasets in evaluation can paint a distorted picture. We curate D-MERIT, a passage retrieval evaluation set from Wikipedia, aspiring to contain all relevant passages for each query. Queries describe a group (e.g., ``journals about linguistics'') and relevant passages are evidence that entities belong to the group (e.g., a passage indicating that Language is a journal about linguistics). We show that evaluating on a dataset containing annotations for only a subset of the relevant passages might result in misleading ranking of the retrieval systems and that as more relevant texts are included in the evaluation set, the rankings converge. We propose our dataset as a resource for evaluation and our study as a recommendation for balance between resource-efficiency and reliable evaluation when annotating evaluation sets for text retrieval.
comment: Our dataset can be downloaded from https://D-MERIT.github.io
☆ Database-Augmented Query Representation for Information Retrieval
Information retrieval models that aim to search for the documents relevant to the given query have shown many successes, which have been applied to diverse tasks. However, the query provided by the user is oftentimes very short, which challenges the retrievers to correctly fetch relevant documents. To tackle this, existing studies have proposed expanding the query with a couple of additional (user-related) features related to the query. Yet, they may be suboptimal to effectively augment the query, though there is plenty of information available to augment it in a relational database. Motivated by this, we present a novel retrieval framework called Database-Augmented Query representation (DAQu), which augments the original query with various (query-related) metadata across multiple tables. In addition, as the number of features in the metadata can be very large and there is no order among them, we encode them with our graph-based set encoding strategy, which considers hierarchies of features in the database without order. We validate DAQu in diverse retrieval scenarios that can incorporate metadata from the relational database, demonstrating that ours significantly enhances overall retrieval performance, compared to existing query augmentation methods.
☆ Learning k-Determinantal Point Processes for Personalized Ranking ICDE 2024
The key to personalized recommendation is to predict a personalized ranking on a catalog of items by modeling the user's preferences. There are many personalized ranking approaches for item recommendation from implicit feedback like Bayesian Personalized Ranking (BPR) and listwise ranking. Despite these methods have shown performance benefits, there are still limitations affecting recommendation performance. First, none of them directly optimize ranking of sets, causing inadequate exploitation of correlations among multiple items. Second, the diversity aspect of recommendations is insufficiently addressed compared to relevance. In this work, we present a new optimization criterion LkP based on set probability comparison for personalized ranking that moves beyond traditional ranking-based methods. It formalizes set-level relevance and diversity ranking comparisons through a Determinantal Point Process (DPP) kernel decomposition. To confer ranking interpretability to the DPP set probabilities and prioritize the practicality of LkP, we condition the standard DPP on the cardinality k of the DPP-distributed set, known as k-DPP, a less-explored extension of DPP. The generic stochastic gradient descent based technique can be directly applied to optimizing models that employ LkP. We implement LkP in the context of both Matrix Factorization (MF) and neural networks approaches, on three real-world datasets, obtaining improved relevance and diversity performances. LkP is broadly applicable, and when applied to existing recommendation models it also yields strong performance improvements, suggesting that LkP holds significant value to the field of recommender systems.
comment: 14 pages, accepted at ICDE 2024 (40th IEEE International Conference on Data Engineering)
♻ ☆ Multi-Margin Loss: Proposal and Application in Recommender Systems
Recommender systems guide users through vast amounts of information by suggesting items based on their predicted preferences. Collaborative filtering-based deep learning techniques have regained popularity due to their simplicity, using only user-item interactions. Typically, these systems consist of three main components: an interaction module, a loss function, and a negative sampling strategy. Initially, researchers focused on enhancing performance by developing complex interaction modules with techniques like multi-layer perceptrons, transformers, or graph neural networks. However, there has been a recent shift toward refining loss functions and negative sampling strategies. This shift has increased interest in contrastive learning, which pulls similar pairs closer while pushing dissimilar ones apart. Contrastive learning involves key practices such as heavy data augmentation, large batch sizes, and hard-negative sampling, but these also bring challenges like high memory demands and under-utilization of some negative samples. The proposed Multi-Margin Loss (MML) addresses these challenges by introducing multiple margins and varying weights for negative samples. MML efficiently utilizes not only the hardest negatives but also other non-trivial negatives, offering a simpler yet effective loss function that outperforms more complex methods, especially when resources are limited. Experiments on two well-known datasets showed MML achieved up to a 20\% performance improvement compared to a baseline contrastive loss function with fewer negative samples.
♻ ☆ LLMs in the Loop: Leveraging Large Language Model Annotations for Active Learning in Low-Resource Languages ECML
Low-resource languages face significant barriers in AI development due to limited linguistic resources and expertise for data labeling, rendering them rare and costly. The scarcity of data and the absence of preexisting tools exacerbate these challenges, especially since these languages may not be adequately represented in various NLP datasets. To address this gap, we propose leveraging the potential of LLMs in the active learning loop for data annotation. Initially, we conduct evaluations to assess inter-annotator agreement and consistency, facilitating the selection of a suitable LLM annotator. The chosen annotator is then integrated into a training loop for a classifier using an active learning paradigm, minimizing the amount of queried data required. Empirical evaluations, notably employing GPT-4-Turbo, demonstrate near-state-of-the-art performance with significantly reduced data requirements, as indicated by estimated potential cost savings of at least 42.45 times compared to human annotation. Our proposed solution shows promising potential to substantially reduce both the monetary and computational costs associated with automation in low-resource settings. By bridging the gap between low-resource languages and AI, this approach fosters broader inclusion and shows the potential to enable automation across diverse linguistic landscapes.
comment: 20 pages, 6 tables. The source code related to this paper is available at https://github.com/mkandai/llms-in-the-loop. This paper has been accepted for publication at ECML PKDD 2024
♻ ☆ Spiral of Silence: How is Large Language Model Killing Information Retrieval? -- A Case Study on Open Domain Question Answering ACL2024
The practice of Retrieval-Augmented Generation (RAG), which integrates Large Language Models (LLMs) with retrieval systems, has become increasingly prevalent. However, the repercussions of LLM-derived content infiltrating the web and influencing the retrieval-generation feedback loop are largely uncharted territories. In this study, we construct and iteratively run a simulation pipeline to deeply investigate the short-term and long-term effects of LLM text on RAG systems. Taking the trending Open Domain Question Answering (ODQA) task as a point of entry, our findings reveal a potential digital "Spiral of Silence" effect, with LLM-generated text consistently outperforming human-authored content in search rankings, thereby diminishing the presence and impact of human contributions online. This trend risks creating an imbalanced information ecosystem, where the unchecked proliferation of erroneous LLM-generated content may result in the marginalization of accurate information. We urge the academic community to take heed of this potential issue, ensuring a diverse and authentic digital information landscape.
comment: Accepted to ACL2024
♻ ☆ Async Learned User Embeddings for Ads Delivery Optimization SIGIR 2024
In recommendation systems, high-quality user embeddings can capture subtle preferences, enable precise similarity calculations, and adapt to changing preferences over time to maintain relevance. The effectiveness of recommendation systems depends on the quality of user embedding. We propose to asynchronously learn high fidelity user embeddings for billions of users each day from sequence based multimodal user activities through a Transformer-like large scale feature learning module. The async learned user representations embeddings (ALURE) are further converted to user similarity graphs through graph learning and then combined with user realtime activities to retrieval highly related ads candidates for the ads delivery system. Our method shows significant gains in both offline and online experiments.
comment: Accepted by workshop on Multimodal Representation and Retrieval at SIGIR 2024, Washington DC
♻ ☆ Beyond Collaborative Filtering: A Relook at Task Formulation in Recommender Systems
Recommender Systems (RecSys) have become indispensable in numerous applications, profoundly influencing our everyday experiences. Despite their practical significance, academic research in RecSys often abstracts the formulation of research tasks from real-world contexts, aiming for a clean problem formulation and more generalizable findings. However, it is observed that there is a lack of collective understanding in RecSys academic research. The root of this issue may lie in the simplification of research task definitions, and an overemphasis on modeling the decision outcomes rather than the decision-making process. That is, we often conceptualize RecSys as the task of predicting missing values in a static user-item interaction matrix, rather than predicting a user's decision on the next interaction within a dynamic, changing, and application-specific context. There exists a mismatch between the inputs accessible to a model and the information available to users during their decision-making process, yet the model is tasked to predict users' decisions. While collaborative filtering is effective in learning general preferences from historical records, it is crucial to also consider the dynamic contextual factors in practical settings. Defining research tasks based on application scenarios using domain-specific datasets may lead to more insightful findings. Accordingly, viable solutions and effective evaluations can emerge for different application scenarios.
comment: Published in ACM SIGWEB Newsletter, Spring 2024: https://dl.acm.org/doi/10.1145/3663752.3663756
♻ ☆ Towards Group-aware Search Success
Traditional measures of search success often overlook the varying information needs of different demographic groups. To address this gap, we introduce a novel metric, named Group-aware Search Success (GA-SS). GA-SS redefines search success to ensure that all demographic groups achieve satisfaction from search outcomes. We introduce a comprehensive mathematical framework to calculate GA-SS, incorporating both static and stochastic ranking policies and integrating user browsing models for a more accurate assessment. In addition, we have proposed Group-aware Most Popular Completion (gMPC) ranking model to account for demographic variances in user intent, aligning more closely with the diverse needs of all user groups. We empirically validate our metric and approach with two real-world datasets: one focusing on query auto-completion and the other on movie recommendations, where the results highlight the impact of stochasticity and the complex interplay among various search success metrics. Our findings advocate for a more inclusive approach in measuring search success, as well as inspiring future investigations into the quality of service of search.
Multimedia 2
☆ Towards Real-Time Neural Volumetric Rendering on Mobile Devices: A Measurement Study
Neural Radiance Fields (NeRF) is an emerging technique to synthesize 3D objects from 2D images with a wide range of potential applications. However, rendering existing NeRF models is extremely computation intensive, making it challenging to support real-time interaction on mobile devices. In this paper, we take the first initiative to examine the state-of-the-art real-time NeRF rendering technique from a system perspective. We first define the entire working pipeline of the NeRF serving system. We then identify possible control knobs that are critical to the system from the communication, computation, and visual performance perspective. Furthermore, an extensive measurement study is conducted to reveal the effects of these control knobs on system performance. Our measurement results reveal that different control knobs contribute differently towards improving the system performance, with the mesh granularity being the most effective knob and the quantization being the least effective knob. In addition, diverse hardware device settings and network conditions have to be considered to fully unleash the benefit of operating under the appropriate knobs
comment: This paper is accepted by ACM SIGCOMM Workshop on Emerging Multimedia Systems 2024
♻ ☆ Diffusion Models, Image Super-Resolution And Everything: A Survey
Diffusion Models (DMs) have disrupted the image Super-Resolution (SR) field and further closed the gap between image quality and human perceptual preferences. They are easy to train and can produce very high-quality samples that exceed the realism of those produced by previous generative methods. Despite their promising results, they also come with new challenges that need further research: high computational demands, comparability, lack of explainability, color shifts, and more. Unfortunately, entry into this field is overwhelming because of the abundance of publications. To address this, we provide a unified recount of the theoretical foundations underlying DMs applied to image SR and offer a detailed analysis that underscores the unique characteristics and methodologies within this domain, distinct from broader existing reviews in the field. This survey articulates a cohesive understanding of DM principles and explores current research avenues, including alternative input domains, conditioning techniques, guidance mechanisms, corruption spaces, and zero-shot learning approaches. By offering a detailed examination of the evolution and current trends in image SR through the lens of DMs, this survey sheds light on the existing challenges and charts potential future directions, aiming to inspire further innovation in this rapidly advancing area.
Systems and Control 8
☆ Voltage enhancement and loss minimization in a radial network through optimal capacitor sizing and placement based on Crow Search Algorithm
In power systems, the distribution network is pivotal for consumers because it is the final stage in the delivery of electricity from the generation plants to the end users. Reactive power demand from consumers can result in challenges like low power factor, diminished voltage, and heightened power losses. The primary challenge encountered when utilizing a radial system as a distribution network is the voltage drop, which leads to distortion in the voltage profile across the entire network. This research focuses on optimizing network performance by appropriately sizing and placing capacitors based on the Crow Search Algorithm (CSA). The results from this approach were compared with Particle Swarm Optimization (PSO), Artificial bee colony, Cultural Algorithm, Firefly Algorithm, Genetic Algorithm, Invasive Weed Optimization, and Teacher Learner Based Optimization methods illustrated in the study where CSA's effectiveness, demonstrated a notable 30.41% and 29.33% overall reduction in active and reactive power losses compared to the base case. The lowest bus voltage, notably at bus 18 which happens to be the longest in the IEEE 33 Radial Distribution Network (RDN) topology, improved from 0.8820 in the base case to 0.908 with CSA, showcasing a 32.9% enhancement in voltage deviation. This optimization led to a significant 3.841pu reduction in capacitor cost, highlighting CSA's efficacy in enhancing RDN performance and cost efficiency.
comment: Kindly find our paper
☆ Multistep Criticality Search and Power Shaping in Microreactors with Reinforcement Learning
Reducing operation and maintenance costs is a key objective for advanced reactors in general and microreactors in particular. To achieve this reduction, developing robust autonomous control algorithms is essential to ensure safe and autonomous reactor operation. Recently, artificial intelligence and machine learning algorithms, specifically reinforcement learning (RL) algorithms, have seen rapid increased application to control problems, such as plasma control in fusion tokamaks and building energy management. In this work, we introduce the use of RL for intelligent control in nuclear microreactors. The RL agent is trained using proximal policy optimization (PPO) and advantage actor-critic (A2C), cutting-edge deep RL techniques, based on a high-fidelity simulation of a microreactor design inspired by the Westinghouse eVinci\textsuperscript{TM} design. We utilized a Serpent model to generate data on drum positions, core criticality, and core power distribution for training a feedforward neural network surrogate model. This surrogate model was then used to guide a PPO and A2C control policies in determining the optimal drum position across various reactor burnup states, ensuring critical core conditions and symmetrical power distribution across all six core portions. The results demonstrate the excellent performance of PPO in identifying optimal drum positions, achieving a hextant power tilt ratio of approximately 1.002 (within the limit of $<$ 1.02) and maintaining criticality within a 10 pcm range. A2C did not provide as competitive of a performance as PPO in terms of performance metrics for all burnup steps considered in the cycle. Additionally, the results highlight the capability of well-trained RL control policies to quickly identify control actions, suggesting a promising approach for enabling real-time autonomous control through digital twins.
comment: 15 pages, 3 figures, and 2 tables
☆ On Physics-Informed Neural Network Control for Power Electronics
Considering the growing necessity for precise modeling of power electronics amidst operational and environmental uncertainties, this paper introduces an innovative methodology that ingeniously combines model-driven and data-driven approaches to enhance the stability of power electronics interacting with grid-forming microgrids. By employing the physics-informed neural network (PINN) as a foundation, this strategy merges robust data-fitting capabilities with fundamental physical principles, thereby constructing an accurate system model. By this means, it significantly enhances the ability to understand and replicate the dynamics of power electronics systems under complex working conditions. Moreover, by incorporating advanced learning-based control methods, the proposed method is enabled to make precise predictions and implement the satisfactory control laws even under serious uncertain conditions. Experimental validation demonstrates the effectiveness and robustness of the proposed approach, highlighting its substantial potential in addressing prevalent uncertainties in controlling modern power electronics systems.
☆ Fault-tolerant control of random switching topology multi-agent system based on event triggering
In this paper, the formation control of multi-agent systems in random switching communication topology is studied, and the problem of excessive bandwidth and low control efficiency among multi-agents is solved. For nonlinear multi-agent systems, a sliding mode formation control algorithm with event-triggered random switching communication topology is proposed. Firstly, a fault-tolerant control strategy based on stochastic system model is designed to solve the problem of low controller efficiency and controller failure during the formation of multi-agent systems. Compared with the traditional multi-agent system dynamic model, the stochastic system model has stronger universality, which improves the efficiency of the controller. Secondly, in order to deal with the problem of high communication load and frequency between agents during formation in random switching communication topology, the sample-based event triggering controller is introduced into the controller, which effectively reduces the communication frequency and reduces the impact of communication delay. Finally, the stability and robustness of event-triggered random switching topology formation control are verified by computer simulation.
☆ Modeling Unknown Stochastic Dynamical System Subject to External Excitation
We present a numerical method for learning unknown nonautonomous stochastic dynamical system, i.e., stochastic system subject to time dependent excitation or control signals. Our basic assumption is that the governing equations for the stochastic system are unavailable. However, short bursts of input/output (I/O) data consisting of certain known excitation signals and their corresponding system responses are available. When a sufficient amount of such I/O data are available, our method is capable of learning the unknown dynamics and producing an accurate predictive model for the stochastic responses of the system subject to arbitrary excitation signals not in the training data. Our method has two key components: (1) a local approximation of the training I/O data to transfer the learning into a parameterized form; and (2) a generative model to approximate the underlying unknown stochastic flow map in distribution. After presenting the method in detail, we present a comprehensive set of numerical examples to demonstrate the performance of the proposed method, especially for long-term system predictions.
☆ Remaining useful life prediction of rolling bearings based on refined composite multi-scale attention entropy and dispersion entropy
Remaining useful life (RUL) prediction based on vibration signals is crucial for ensuring the safe operation and effective health management of rotating machinery. Existing studies often extract health indicators (HI) from time domain and frequency domain features to analyze complex vibration signals, but these features may not accurately capture the degradation process. In this study, we propose a degradation feature extraction method called Fusion of Multi-Modal Multi-Scale Entropy (FMME), which utilizes multi-modal Refined Composite Multi-scale Attention Entropy (RCMATE) and Fluctuation Dispersion Entropy (RCMFDE), to solve the problem that the existing degradation features cannot accurately reflect the degradation process. Firstly, the Empirical Mode Decomposition (EMD) is employed to decompose the dual-channel vibration signals of bearings into multiple modals. The main modals are then selected for further analysis. The subsequent step involves the extraction of RCMATE and RCMFDE from each modal, followed by wavelet denoising. Next, a novel metric is proposed to evaluate the quality of degradation features. The attention entropy and dispersion entropy of the optimal scales under different modals are fused using Laplacian Eigenmap (LE) to obtain the health indicators. Finally, RUL prediction is performed through the similarity of health indicators between fault samples and bearings to be predicted. Experimental results demonstrate that the proposed method yields favorable outcomes across diverse operating conditions.
comment: 12pages, 9 figures
♻ ☆ Human-AI Safety: A Descendant of Generative AI and Control Systems Safety
Artificial intelligence (AI) is interacting with people at an unprecedented scale, offering new avenues for immense positive impact, but also raising widespread concerns around the potential for individual and societal harm. Today, the predominant paradigm for human--AI safety focuses on fine-tuning the generative model's outputs to better agree with human-provided examples or feedback. In reality, however, the consequences of an AI model's outputs cannot be determined in isolation: they are tightly entangled with the responses and behavior of human users over time. In this paper, we distill key complementary lessons from AI safety and control systems safety, highlighting open challenges as well as key synergies between both fields. We then argue that meaningful safety assurances for advanced AI technologies require reasoning about how the feedback loop formed by AI outputs and human behavior may drive the interaction towards different outcomes. To this end, we introduce a unifying formalism to capture dynamic, safety-critical human--AI interactions and propose a concrete technical roadmap towards next-generation human-centered AI safety.
comment: Revised version with refined exposition and technical details. 12 pages + references, 5 figures
♻ ☆ Convex MPC and Thrust Allocation with Deadband for Spacecraft Rendezvous
This paper delves into a rendezvous scenario involving a chaser and a target spacecraft, focusing on the application of Model Predictive Control (MPC) to design a controller capable of guiding the chaser toward the target. The operational principle of spacecraft thrusters, requiring a minimum activation time that leads to the existence of a control deadband, introduces mixed-integer constraints into the optimization, posing a considerable computational challenge due to the exponential complexity on the number of integer constraints. We address this complexity by presenting two solver algorithms that efficiently approximate the optimal solution in significantly less time than standard solvers, making them well-suited for real-time applications.
comment: Published in IEEE Control Systems Letters
Multiagent Systems 3
☆ Decentralized Transformers with Centralized Aggregation are Sample-Efficient Multi-Agent World Models
Learning a world model for model-free Reinforcement Learning (RL) agents can significantly improve the sample efficiency by learning policies in imagination. However, building a world model for Multi-Agent RL (MARL) can be particularly challenging due to the scalability issue in a centralized architecture arising from a large number of agents, and also the non-stationarity issue in a decentralized architecture stemming from the inter-dependency among agents. To address both challenges, we propose a novel world model for MARL that learns decentralized local dynamics for scalability, combined with a centralized representation aggregation from all agents. We cast the dynamics learning as an auto-regressive sequence modeling problem over discrete tokens by leveraging the expressive Transformer architecture, in order to model complex local dynamics across different agents and provide accurate and consistent long-term imaginations. As the first pioneering Transformer-based world model for multi-agent systems, we introduce a Perceiver Transformer as an effective solution to enable centralized representation aggregation within this context. Results on Starcraft Multi-Agent Challenge (SMAC) show that it outperforms strong model-free approaches and existing model-based methods in both sample efficiency and overall performance.
♻ ☆ Human-compatible driving partners through data-regularized self-play reinforcement learning
A central challenge for autonomous vehicles is coordinating with humans. Therefore, incorporating realistic human agents is essential for scalable training and evaluation of autonomous driving systems in simulation. Simulation agents are typically developed by imitating large-scale, high-quality datasets of human driving. However, pure imitation learning agents empirically have high collision rates when executed in a multi-agent closed-loop setting. To build agents that are realistic and effective in closed-loop settings, we propose Human-Regularized PPO (HR-PPO), a multi-agent algorithm where agents are trained through self-play with a small penalty for deviating from a human reference policy. In contrast to prior work, our approach is RL-first and only uses 30 minutes of imperfect human demonstrations. We evaluate agents in a large set of multi-agent traffic scenes. Results show our HR-PPO agents are highly effective in achieving goals, with a success rate of 93%, an off-road rate of 3.5%, and a collision rate of 3%. At the same time, the agents drive in a human-like manner, as measured by their similarity to existing human driving logs. We also find that HR-PPO agents show considerable improvements on proxy measures for coordination with human driving, particularly in highly interactive scenarios. We open-source our code and trained agents at https://github.com/Emerge-Lab/nocturne_lab and provide demonstrations of agent behaviors at https://sites.google.com/view/driving-partners.
♻ ☆ Emergent Dominance Hierarchies in Reinforcement Learning Agents
Modern Reinforcement Learning (RL) algorithms are able to outperform humans in a wide variety of tasks. Multi-agent reinforcement learning (MARL) settings present additional challenges, and successful cooperation in mixed-motive groups of agents depends on a delicate balancing act between individual and group objectives. Social conventions and norms, often inspired by human institutions, are used as tools for striking this balance. In this paper, we examine a fundamental, well-studied social convention that underlies cooperation in both animal and human societies: dominance hierarchies. We adapt the ethological theory of dominance hierarchies to artificial agents, borrowing the established terminology and definitions with as few amendments as possible. We demonstrate that populations of RL agents, operating without explicit programming or intrinsic rewards, can invent, learn, enforce, and transmit a dominance hierarchy to new populations. The dominance hierarchies that emerge have a similar structure to those studied in chickens, mice, fish, and other species.
Information Retrieval 4
☆ LLM-Powered Explanations: Unraveling Recommendations Through Subgraph Reasoning
Recommender systems are pivotal in enhancing user experiences across various web applications by analyzing the complicated relationships between users and items. Knowledge graphs(KGs) have been widely used to enhance the performance of recommender systems. However, KGs are known to be noisy and incomplete, which are hard to provide reliable explanations for recommendation results. An explainable recommender system is crucial for the product development and subsequent decision-making. To address these challenges, we introduce a novel recommender that synergies Large Language Models (LLMs) and KGs to enhance the recommendation and provide interpretable results. Specifically, we first harness the power of LLMs to augment KG reconstruction. LLMs comprehend and decompose user reviews into new triples that are added into KG. In this way, we can enrich KGs with explainable paths that express user preferences. To enhance the recommendation on augmented KGs, we introduce a novel subgraph reasoning module that effectively measures the importance of nodes and discovers reasoning for recommendation. Finally, these reasoning paths are fed into the LLMs to generate interpretable explanations of the recommendation results. Our approach significantly enhances both the effectiveness and interpretability of recommender systems, especially in cross-selling scenarios where traditional methods falter. The effectiveness of our approach has been rigorously tested on four open real-world datasets, with our methods demonstrating a superior performance over contemporary state-of-the-art techniques by an average improvement of 12%. The application of our model in a multinational engineering and technology company cross-selling recommendation system further underscores its practical utility and potential to redefine recommendation practices through improved accuracy and user trust.
☆ Understanding the Role of User Profile in the Personalization of Large Language Models
Utilizing user profiles to personalize Large Language Models (LLMs) has been shown to enhance the performance on a wide range of tasks. However, the precise role of user profiles and their effect mechanism on LLMs remains unclear. This study first confirms that the effectiveness of user profiles is primarily due to personalization information rather than semantic information. Furthermore, we investigate how user profiles affect the personalization of LLMs. Within the user profile, we reveal that it is the historical personalized response produced or approved by users that plays a pivotal role in personalizing LLMs. This discovery unlocks the potential of LLMs to incorporate a greater number of user profiles within the constraints of limited input length. As for the position of user profiles, we observe that user profiles integrated into different positions of the input context do not contribute equally to personalization. Instead, where the user profile that is closer to the beginning affects more on the personalization of LLMs. Our findings reveal the role of user profiles for the personalization of LLMs, and showcase how incorporating user profiles impacts performance providing insight to leverage user profiles effectively.
♻ ☆ RecExplainer: Aligning Large Language Models for Explaining Recommendation Models
Recommender systems are widely used in online services, with embedding-based models being particularly popular due to their expressiveness in representing complex signals. However, these models often function as a black box, making them less transparent and reliable for both users and developers. Recently, large language models (LLMs) have demonstrated remarkable intelligence in understanding, reasoning, and instruction following. This paper presents the initial exploration of using LLMs as surrogate models to explaining black-box recommender models. The primary concept involves training LLMs to comprehend and emulate the behavior of target recommender models. By leveraging LLMs' own extensive world knowledge and multi-step reasoning abilities, these aligned LLMs can serve as advanced surrogates, capable of reasoning about observations. Moreover, employing natural language as an interface allows for the creation of customizable explanations that can be adapted to individual user preferences. To facilitate an effective alignment, we introduce three methods: behavior alignment, intention alignment, and hybrid alignment. Behavior alignment operates in the language space, representing user preferences and item information as text to mimic the target model's behavior; intention alignment works in the latent space of the recommendation model, using user and item representations to understand the model's behavior; hybrid alignment combines both language and latent spaces. Comprehensive experiments conducted on three public datasets show that our approach yields promising results in understanding and mimicking target models, producing high-quality, high-fidelity, and distinct explanations. Our code is available at https://github.com/microsoft/RecAI.
comment: 12 pages, 9 figures, 5 tables
♻ ☆ Knowledge Conflicts for LLMs: A Survey
This survey provides an in-depth analysis of knowledge conflicts for large language models (LLMs), highlighting the complex challenges they encounter when blending contextual and parametric knowledge. Our focus is on three categories of knowledge conflicts: context-memory, inter-context, and intra-memory conflict. These conflicts can significantly impact the trustworthiness and performance of LLMs, especially in real-world applications where noise and misinformation are common. By categorizing these conflicts, exploring the causes, examining the behaviors of LLMs under such conflicts, and reviewing available solutions, this survey aims to shed light on strategies for improving the robustness of LLMs, thereby serving as a valuable resource for advancing research in this evolving area.
comment: Our GitHub repo is available at https://github.com/pillowsofwind/Knowledge-Conflicts-Survey
Multimedia 2
♻ ☆ Proceedings of The second international workshop on eXplainable AI for the Arts (XAIxArts)
This second international workshop on explainable AI for the Arts (XAIxArts) brought together a community of researchers in HCI, Interaction Design, AI, explainable AI (XAI), and digital arts to explore the role of XAI for the Arts. Workshop held at the 16th ACM Conference on Creativity and Cognition (C&C 2024), Chicago, USA.
♻ ☆ Archiving Body Movements: Collective Generation of Chinese Calligraphy
As a communication channel, body movements have been widely explored in behavioral studies and kinesics. Performing and visual arts share the same interests but focus on documenting and representing human body movements, such as for dance notation and visual work creation. This paper investigates body movements in oriental calligraphy and how to apply calligraphy principles to stimulate and archive body movements. Through an artwork (Wushu), the authors experiment with an interactive and generative approach to engage the audience's bodily participation and archive the body movements as a compendium of generated calligraphy. The audience assumes the role of both writers and readers; creating ("writing") and appreciating ("reading") the generated calligraphy becomes a cyclical process within this infinite "Book," which can motivate further attention and discussions concerning Chinese characters and calligraphy.
comment: 8 pages, 8 figures
Computer Vision and Pattern Recognition 104
☆ NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking
Benchmarking vision-based driving policies is challenging. On one hand, open-loop evaluation with real data is easy, but these results do not reflect closed-loop performance. On the other, closed-loop evaluation is possible in simulation, but is hard to scale due to its significant computational demands. Further, the simulators available today exhibit a large domain gap to real data. This has resulted in an inability to draw clear conclusions from the rapidly growing body of research on end-to-end autonomous driving. In this paper, we present NAVSIM, a middle ground between these evaluation paradigms, where we use large datasets in combination with a non-reactive simulator to enable large-scale real-world benchmarking. Specifically, we gather simulation-based metrics, such as progress and time to collision, by unrolling bird's eye view abstractions of the test scenes for a short simulation horizon. Our simulation is non-reactive, i.e., the evaluated policy and environment do not influence each other. As we demonstrate empirically, this decoupling allows open-loop metric computation while being better aligned with closed-loop evaluations than traditional displacement errors. NAVSIM enabled a new competition held at CVPR 2024, where 143 teams submitted 463 entries, resulting in several new insights. On a large set of challenging scenarios, we observe that simple methods with moderate compute requirements such as TransFuser can match recent large-scale end-to-end driving architectures such as UniAD. Our modular framework can potentially be extended with new datasets, data curation strategies, and metrics, and will be continually maintained to host future challenges. Our code is available at https://github.com/autonomousvision/navsim.
☆ Full-Scale Indexing and Semantic Annotation of CT Imaging: Boosting FAIRness
Background: The integration of artificial intelligence into medicine has led to significant advances, particularly in diagnostics and treatment planning. However, the reliability of AI models is highly dependent on the quality of the training data, especially in medical imaging, where varying patient data and evolving medical knowledge pose a challenge to the accuracy and generalizability of given datasets. Results: The proposed approach focuses on the integration and enhancement of clinical computed tomography (CT) image series for better findability, accessibility, interoperability, and reusability. Through an automated indexing process, CT image series are semantically enhanced using the TotalSegmentator framework for segmentation and resulting SNOMED CT annotations. The metadata is standardized with HL7 FHIR resources to enable efficient data recognition and data exchange between research projects. Conclusions: The study successfully integrates a robust process within the UKSH MeDIC, leading to the semantic enrichment of over 230,000 CT image series and over 8 million SNOMED CT annotations. The standardized representation using HL7 FHIR resources improves discoverability and facilitates interoperability, providing a foundation for the FAIRness of medical imaging data. However, developing automated annotation methods that can keep pace with growing clinical datasets remains a challenge to ensure continued progress in large-scale integration and indexing of medical imaging for advanced healthcare AI applications.
☆ Image Conductor: Precision Control for Interactive Video Synthesis
Filmmaking and animation production often require sophisticated techniques for coordinating camera transitions and object movements, typically involving labor-intensive real-world capturing. Despite advancements in generative AI for video creation, achieving precise control over motion for interactive video asset generation remains challenging. To this end, we propose Image Conductor, a method for precise control of camera transitions and object movements to generate video assets from a single image. An well-cultivated training strategy is proposed to separate distinct camera and object motion by camera LoRA weights and object LoRA weights. To further address cinematographic variations from ill-posed trajectories, we introduce a camera-free guidance technique during inference, enhancing object movements while eliminating camera transitions. Additionally, we develop a trajectory-oriented video motion data curation pipeline for training. Quantitative and qualitative experiments demonstrate our method's precision and fine-grained control in generating motion-controllable videos from images, advancing the practical application of interactive video synthesis. Project webpage available at https://liyaowei-stu.github.io/project/ImageConductor/
comment: Project webpage available at https://liyaowei-stu.github.io/project/ImageConductor/
☆ Keystroke Dynamics Against Academic Dishonesty in the Age of LLMs
The transition to online examinations and assignments raises significant concerns about academic integrity. Traditional plagiarism detection systems often struggle to identify instances of intelligent cheating, particularly when students utilize advanced generative AI tools to craft their responses. This study proposes a keystroke dynamics-based method to differentiate between bona fide and assisted writing within academic contexts. To facilitate this, a dataset was developed to capture the keystroke patterns of individuals engaged in writing tasks, both with and without the assistance of generative AI. The detector, trained using a modified TypeNet architecture, achieved accuracies ranging from 74.98% to 85.72% in condition-specific scenarios and from 52.24% to 80.54% in condition-agnostic scenarios. The findings highlight significant differences in keystroke dynamics between genuine and assisted writing. The outcomes of this study enhance our understanding of how users interact with generative AI and have implications for improving the reliability of digital educational platforms.
comment: Accepted for publication at The IEEE International Joint Conference on Biometrics (IJCB2024), contains 9 pages, 3 figures, 3 tables
☆ Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning
The recent success of interleaved Large Multimodal Models (LMMs) in few-shot learning suggests that in-context learning (ICL) with many examples can be promising for learning new tasks. However, this many-shot multimodal ICL setting has one crucial problem: it is fundamentally limited by the model's context length set at pretraining. The problem is especially prominent in the multimodal domain, which processes both text and images, requiring additional tokens. This motivates the need for a multimodal method to compress many shots into fewer tokens without finetuning. In this work, we enable LMMs to perform multimodal, many-shot in-context learning by leveraging Multimodal Task Vectors (MTV)--compact implicit representations of in-context examples compressed in the model's attention heads. Specifically, we first demonstrate the existence of such MTV in LMMs and then leverage these extracted MTV to enable many-shot in-context learning for various vision-and-language tasks. Our experiments suggest that MTV can scale in performance with the number of compressed shots and generalize to similar out-of-domain tasks without additional context length for inference.
☆ GeoLRM: Geometry-Aware Large Reconstruction Model for High-Quality 3D Gaussian Generation
In this work, we introduce the Geometry-Aware Large Reconstruction Model (GeoLRM), an approach which can predict high-quality assets with 512k Gaussians and 21 input images in only 11 GB GPU memory. Previous works neglect the inherent sparsity of 3D structure and do not utilize explicit geometric relationships between 3D and 2D images. This limits these methods to a low-resolution representation and makes it difficult to scale up to the dense views for better quality. GeoLRM tackles these issues by incorporating a novel 3D-aware transformer structure that directly processes 3D points and uses deformable cross-attention mechanisms to effectively integrate image features into 3D representations. We implement this solution through a two-stage pipeline: initially, a lightweight proposal network generates a sparse set of 3D anchor points from the posed image inputs; subsequently, a specialized reconstruction transformer refines the geometry and retrieves textural details. Extensive experimental results demonstrate that GeoLRM significantly outperforms existing models, especially for dense view inputs. We also demonstrate the practical applicability of our model with 3D generation tasks, showcasing its versatility and potential for broader adoption in real-world applications.
comment: The code is available at https://github.com/alibaba-yuanjing-aigclab/GeoLRM
☆ Masked Extended Attention for Zero-Shot Virtual Try-On In The Wild
Virtual Try-On (VTON) is a highly active line of research, with increasing demand. It aims to replace a piece of garment in an image with one from another, while preserving person and garment characteristics as well as image fidelity. Current literature takes a supervised approach for the task, impairing generalization and imposing heavy computation. In this paper, we present a novel zero-shot training-free method for inpainting a clothing garment by reference. Our approach employs the prior of a diffusion model with no additional training, fully leveraging its native generalization capabilities. The method employs extended attention to transfer image information from reference to target images, overcoming two significant challenges. We first initially warp the reference garment over the target human using deep features, alleviating "texture sticking". We then leverage the extended attention mechanism with careful masking, eliminating leakage of reference background and unwanted influence. Through a user study, qualitative, and quantitative comparison to state-of-the-art approaches, we demonstrate superior image quality and garment preservation compared unseen clothing pieces or human figures.
comment: Project page available at https://nadavorzech.github.io/max4zero.github.io/
☆ An End-to-End, Segmentation-Free, Arabic Handwritten Recognition Model on KHATT
An end-to-end, segmentation-free, deep learning model trained from scratch is proposed, leveraging DCNN for feature extraction, alongside Bidirectional Long-Short Term Memory (BLSTM) for sequence recognition and Connectionist Temporal Classification (CTC) loss function on the KHATT database. The training phase yields remarkable results 84% recognition rate on the test dataset at the character level and 71% on the word level, establishing an image-based sequence recognition framework that operates without segmentation only at the line level. The analysis and preprocessing of the KFUPM Handwritten Arabic TexT (KHATT) database are also presented. Finally, advanced image processing techniques, including filtering, transformation, and line segmentation are implemented. The importance of this work is highlighted by its wide-ranging applications. Including digitizing, documentation, archiving, and text translation in fields such as banking. Moreover, AHR serves as a pivotal tool for making images searchable, enhancing information retrieval capabilities, and enabling effortless editing. This functionality significantly reduces the time and effort required for tasks such as Arabic data organization and manipulation.
☆ Rethinking Remote Sensing Change Detection With A Mask View
Remote sensing change detection aims to compare two or more images recorded for the same area but taken at different time stamps to quantitatively and qualitatively assess changes in geographical entities and environmental factors. Mainstream models usually built on pixel-by-pixel change detection paradigms, which cannot tolerate the diversity of changes due to complex scenes and variation in imaging conditions. To address this shortcoming, this paper rethinks the change detection with the mask view, and further proposes the corresponding: 1) meta-architecture CDMask and 2) instance network CDMaskFormer. Components of CDMask include Siamese backbone, change extractor, pixel decoder, transformer decoder and normalized detector, which ensures the proper functioning of the mask detection paradigm. Since the change query can be adaptively updated based on the bi-temporal feature content, the proposed CDMask can adapt to different latent data distributions, thus accurately identifying regions of interest changes in complex scenarios. Consequently, we further propose the instance network CDMaskFormer customized for the change detection task, which includes: (i) a Spatial-temporal convolutional attention-based instantiated change extractor to capture spatio-temporal context simultaneously with lightweight operations; and (ii) a scene-guided axial attention-instantiated transformer decoder to extract more spatial details. State-of-the-art performance of CDMaskFormer is achieved on five benchmark datasets with a satisfactory efficiency-accuracy trade-off. Code is available at https://github.com/xwmaxwma/rschange.
comment: Under review
☆ Learning Spatio-Temporal Patterns of Polar Ice Layers With Physics-Informed Graph Neural Network
Learning spatio-temporal patterns of polar ice layers is crucial for monitoring the change in ice sheet balance and evaluating ice dynamic processes. While a few researchers focus on learning ice layer patterns from echogram images captured by airborne snow radar sensors via different convolutional neural networks, the noise in the echogram images proves to be a major obstacle. Instead, we focus on geometric deep learning based on graph neural networks to learn the spatio-temporal patterns from thickness information of shallow ice layers and make predictions for deep layers. In this paper, we propose a physics-informed hybrid graph neural network that combines the GraphSAGE framework for graph feature learning with the long short-term memory (LSTM) structure for learning temporal changes, and introduce measurements of physical ice properties from Model Atmospheric Regional (MAR) weather model as physical node features. We found that our proposed network can consistently outperform the current non-inductive or non-physical model in predicting deep ice layer thickness.
☆ You Only Acquire Sparse-channel (YOAS): A Unified Framework for Dense-channel EEG Generation
High-precision acquisition of dense-channel electroencephalogram (EEG) signals is often impeded by the costliness and lack of portability of equipment. In contrast, generating dense-channel EEG signals effectively from sparse channels shows promise and economic viability. However, sparse-channel EEG poses challenges such as reduced spatial resolution, information loss, signal mixing, and heightened susceptibility to noise and interference. To address these challenges, we first theoretically formulate the dense-channel EEG generation problem as by optimizing a set of cross-channel EEG signal generation problems. Then, we propose the YOAS framework for generating dense-channel data from sparse-channel EEG signals. The YOAS totally consists of four sequential stages: Data Preparation, Data Preprocessing, Biased-EEG Generation, and Synthetic EEG Generation. Data Preparation and Preprocessing carefully consider the distribution of EEG electrodes and low signal-to-noise ratio problem of EEG signals. Biased-EEG Generation includes sub-modules of BiasEEGGanFormer and BiasEEGDiffFormer, which facilitate long-term feature extraction with attention and generate signals by combining electrode position alignment with diffusion model, respectively. Synthetic EEG Generation synthesizes the final signals, employing a deduction paradigm for multi-channel EEG generation. Extensive experiments confirmed YOAS's feasibility, efficiency, and theoretical validity, even remarkably enhancing data discernibility. This breakthrough in dense-channel EEG signal generation from sparse-channel data opens new avenues for exploration in EEG signal processing and application.
☆ Fingerprint Membership and Identity Inference Against Generative Adversarial Networks
Generative models are gaining significant attention as potential catalysts for a novel industrial revolution. Since automated sample generation can be useful to solve privacy and data scarcity issues that usually affect learned biometric models, such technologies became widely spread in this field. In this paper, we assess the vulnerabilities of generative machine learning models concerning identity protection by designing and testing an identity inference attack on fingerprint datasets created by means of a generative adversarial network. Experimental results show that the proposed solution proves to be effective under different configurations and easily extendable to other biometric measurements.
comment: Paper submitted at "Pattern Recognition Letters", 9 pages, 6 images
☆ MantisScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation
The recent years have witnessed great advances in video generation. However, the development of automatic video metrics is lagging significantly behind. None of the existing metric is able to provide reliable scores over generated videos. The main barrier is the lack of large-scale human-annotated dataset. In this paper, we release VideoFeedback, the first large-scale dataset containing human-provided multi-aspect score over 37.6K synthesized videos from 11 existing video generative models. We train MantisScore (initialized from Mantis) based on VideoFeedback to enable automatic video quality assessment. Experiments show that the Spearman correlation between MantisScore and humans can reach 77.1 on VideoFeedback-test, beating the prior best metrics by about 50 points. Further result on other held-out EvalCrafter, GenAI-Bench, and VBench show that MantisScore has consistently much higher correlation with human judges than other metrics. Due to these results, we believe MantisScore can serve as a great proxy for human raters to (1) rate different video models to track progress (2) simulate fine-grained human feedback in Reinforcement Learning with Human Feedback (RLHF) to improve current video generation models.
☆ Landscape More Secure Than Portrait? Zooming Into the Directionality of Digital Images With Security Implications
The orientation in which a source image is captured can affect the resulting security in downstream applications. One reason for this is that many state-of-the-art methods in media security assume that image statistics are similar in the horizontal and vertical directions, allowing them to reduce the number of features (or trainable weights) by merging coefficients. We show that this artificial symmetrization tends to suppress important properties of natural images and common processing operations, causing a loss of performance. We also observe the opposite problem, where unaddressed directionality causes learning-based methods to overfit to a single orientation. These are vulnerable to manipulation if an adversary chooses inputs with the less common orientation. This paper takes a comprehensive approach, identifies and systematizes causes of directionality at several stages of a typical acquisition pipeline, measures their effect, and demonstrates for three selected security applications (steganalysis, forensic source identification, and the detection of synthetic images) how the performance of state-of-the-art methods can be improved by properly accounting for directionality.
☆ DiffExplainer: Unveiling Black Box Models Via Counterfactual Generation MICCAI 2024
In the field of medical imaging, particularly in tasks related to early disease detection and prognosis, understanding the reasoning behind AI model predictions is imperative for assessing their reliability. Conventional explanation methods encounter challenges in identifying decisive features in medical image classifications, especially when discriminative features are subtle or not immediately evident. To address this limitation, we propose an agent model capable of generating counterfactual images that prompt different decisions when plugged into a black box model. By employing this agent model, we can uncover influential image patterns that impact the black model's final predictions. Through our methodology, we efficiently identify features that influence decisions of the deep black box. We validated our approach in the rigorous domain of medical prognosis tasks, showcasing its efficacy and potential to enhance the reliability of deep learning models in medical image classification compared to existing interpretation methods. The code will be publicly available at https://github.com/ayanglab/DiffExplainer.
comment: MICCAI 2024
☆ Multimodal Deformable Image Registration for Long-COVID Analysis Based on Progressive Alignment and Multi-perspective Loss
Long COVID is characterized by persistent symptoms, particularly pulmonary impairment, which necessitates advanced imaging for accurate diagnosis. Hyperpolarised Xenon-129 MRI (XeMRI) offers a promising avenue by visualising lung ventilation, perfusion, as well as gas transfer. Integrating functional data from XeMRI with structural data from Computed Tomography (CT) is crucial for comprehensive analysis and effective treatment strategies in long COVID, requiring precise data alignment from those complementary imaging modalities. To this end, CT-MRI registration is an essential intermediate step, given the significant challenges posed by the direct alignment of CT and Xe-MRI. Therefore, we proposed an end-to-end multimodal deformable image registration method that achieves superior performance for aligning long-COVID lung CT and proton density MRI (pMRI) data. Moreover, our method incorporates a novel Multi-perspective Loss (MPL) function, enhancing state-of-the-art deep learning methods for monomodal registration by making them adaptable for multimodal tasks. The registration results achieve a Dice coefficient score of 0.913, indicating a substantial improvement over the state-of-the-art multimodal image registration techniques. Since the XeMRI and pMRI images are acquired in the same sessions and can be roughly aligned, our results facilitate subsequent registration between XeMRI and CT, thereby potentially enhancing clinical decision-making for long COVID management.
☆ Stochastic Optimisation Framework using the Core Imaging Library and Synergistic Image Reconstruction Framework for PET Reconstruction
We introduce a stochastic framework into the open--source Core Imaging Library (CIL) which enables easy development of stochastic algorithms. Five such algorithms from the literature are developed, Stochastic Gradient Descent, Stochastic Average Gradient (-Am\'elior\'e), (Loopless) Stochastic Variance Reduced Gradient. We showcase the functionality of the framework with a comparative study against a deterministic algorithm on a simulated 2D PET dataset, with the use of the open-source Synergistic Image Reconstruction Framework. We observe that stochastic optimisation methods can converge in fewer passes of the data than a standard deterministic algorithm.
☆ Gaussian Splatting to Real World Flight Navigation Transfer with Liquid Networks
Simulators are powerful tools for autonomous robot learning as they offer scalable data generation, flexible design, and optimization of trajectories. However, transferring behavior learned from simulation data into the real world proves to be difficult, usually mitigated with compute-heavy domain randomization methods or further model fine-tuning. We present a method to improve generalization and robustness to distribution shifts in sim-to-real visual quadrotor navigation tasks. To this end, we first build a simulator by integrating Gaussian Splatting with quadrotor flight dynamics, and then, train robust navigation policies using Liquid neural networks. In this way, we obtain a full-stack imitation learning protocol that combines advances in 3D Gaussian splatting radiance field rendering, crafty programming of expert demonstration training data, and the task understanding capabilities of Liquid networks. Through a series of quantitative flight tests, we demonstrate the robust transfer of navigation skills learned in a single simulation scene directly to the real world. We further show the ability to maintain performance beyond the training environment under drastic distribution and physical environment changes. Our learned Liquid policies, trained on single target manoeuvres curated from a photorealistic simulated indoor flight only, generalize to multi-step hikes onboard a real hardware platform outdoors.
☆ A Wavelet Guided Attention Module for Skin Cancer Classification with Gradient-based Feature Fusion
Skin cancer is a highly dangerous type of cancer that requires an accurate diagnosis from experienced physicians. To help physicians diagnose skin cancer more efficiently, a computer-aided diagnosis (CAD) system can be very helpful. In this paper, we propose a novel model, which uses a novel attention mechanism to pinpoint the differences in features across the spatial dimensions and symmetry of the lesion, thereby focusing on the dissimilarities of various classes based on symmetry, uniformity in texture and color, etc. Additionally, to take into account the variations in the boundaries of the lesions for different classes, we employ a gradient-based fusion of wavelet and soft attention-aided features to extract boundary information of skin lesions. We have tested our model on the multi-class and highly class-imbalanced dataset, called HAM10000, and achieved promising results, with a 91.17\% F1-score and 90.75\% accuracy. The code is made available at: https://github.com/AyushRoy2001/WAGF-Fusion.
☆ High Resolution Surface Reconstruction of Cultural Heritage Objects Using Shape from Polarization Method
Nowadays, three-dimensional reconstruction is used in various fields like computer vision, computer graphics, mixed reality and digital twin. The three-dimensional reconstruction of cultural heritage objects is one of the most important applications in this area which is usually accomplished by close range photogrammetry. The problem here is that the images are often noisy, and the dense image matching method has significant limitations to reconstruct the geometric details of cultural heritage objects in practice. Therefore, displaying high-level details in three-dimensional models, especially for cultural heritage objects, is a severe challenge in this field. In this paper, the shape from polarization method has been investigated, a passive method with no drawbacks of active methods. In this method, the resolution of the depth maps can be dramatically increased using the information obtained from the polarization light by rotating a linear polarizing filter in front of a digital camera. Through these polarized images, the surface details of the object can be reconstructed locally with high accuracy. The fusion of polarization and photogrammetric methods is an appropriate solution for achieving high resolution three-dimensional reconstruction. The surface reconstruction assessments have been performed visually and quantitatively. The evaluations showed that the proposed method could significantly reconstruct the surfaces' details in the three-dimensional model compared to the photogrammetric method with 10 times higher depth resolution.
☆ Surface Normal Reconstruction Using Polarization-Unet
Today, three-dimensional reconstruction of objects has many applications in various fields, and therefore, choosing a suitable method for high resolution three-dimensional reconstruction is an important issue and displaying high-level details in three-dimensional models is a serious challenge in this field. Until now, active methods have been used for high-resolution three-dimensional reconstruction. But the problem of active three-dimensional reconstruction methods is that they require a light source close to the object. Shape from polarization (SfP) is one of the best solutions for high-resolution three-dimensional reconstruction of objects, which is a passive method and does not have the drawbacks of active methods. The changes in polarization of the reflected light from an object can be analyzed by using a polarization camera or locating polarizing filter in front of the digital camera and rotating the filter. Using this information, the surface normal can be reconstructed with high accuracy, which will lead to local reconstruction of the surface details. In this paper, an end-to-end deep learning approach has been presented to produce the surface normal of objects. In this method a benchmark dataset has been used to train the neural network and evaluate the results. The results have been evaluated quantitatively and qualitatively by other methods and under different lighting conditions. The MAE value (Mean-Angular-Error) has been used for results evaluation. The evaluations showed that the proposed method could accurately reconstruct the surface normal of objects with the lowest MAE value which is equal to 18.06 degree on the whole dataset, in comparison to previous physics-based methods which are between 41.44 and 49.03 degree.
☆ FA-Net: A Fuzzy Attention-aided Deep Neural Network for Pneumonia Detection in Chest X-Rays
Pneumonia is a respiratory infection caused by bacteria, fungi, or viruses. It affects many people, particularly those in developing or underdeveloped nations with high pollution levels, unhygienic living conditions, overcrowding, and insufficient medical infrastructure. Pneumonia can cause pleural effusion, where fluids fill the lungs, leading to respiratory difficulty. Early diagnosis is crucial to ensure effective treatment and increase survival rates. Chest X-ray imaging is the most commonly used method for diagnosing pneumonia. However, visual examination of chest X-rays can be difficult and subjective. In this study, we have developed a computer-aided diagnosis system for automatic pneumonia detection using chest X-ray images. We have used DenseNet-121 and ResNet50 as the backbone for the binary class (pneumonia and normal) and multi-class (bacterial pneumonia, viral pneumonia, and normal) classification tasks, respectively. We have also implemented a channel-specific spatial attention mechanism, called Fuzzy Channel Selective Spatial Attention Module (FCSSAM), to highlight the specific spatial regions of relevant channels while removing the irrelevant channels of the extracted features by the backbone. We evaluated the proposed approach on a publicly available chest X-ray dataset, using binary and multi-class classification setups. Our proposed method achieves accuracy rates of 97.15\% and 79.79\% for the binary and multi-class classification setups, respectively. The results of our proposed method are superior to state-of-the-art (SOTA) methods. The code of the proposed model will be available at: https://github.com/AyushRoy2001/FA-Net.
☆ A Dual Attention-aided DenseNet-121 for Classification of Glaucoma from Fundus Images
Deep learning and computer vision methods are nowadays predominantly used in the field of ophthalmology. In this paper, we present an attention-aided DenseNet-121 for classifying normal and glaucomatous eyes from fundus images. It involves the convolutional block attention module to highlight relevant spatial and channel features extracted by DenseNet-121. The channel recalibration module further enriches the features by utilizing edge information along with the statistical features of the spatial dimension. For the experiments, two standard datasets, namely RIM-ONE and ACRIMA, have been used. Our method has shown superior results than state-of-the-art models. An ablation study has also been conducted to show the effectiveness of each of the components. The code of the proposed work is available at: https://github.com/Soham2004GitHub/DADGC.
☆ Investigating the impact of 2D gesture representation on co-speech gesture generation
Co-speech gestures play a crucial role in the interactions between humans and embodied conversational agents (ECA). Recent deep learning methods enable the generation of realistic, natural co-speech gestures synchronized with speech, but such approaches require large amounts of training data. "In-the-wild" datasets, which compile videos from sources such as YouTube through human pose detection models, offer a solution by providing 2D skeleton sequences that are paired with speech. Concurrently, innovative lifting models have emerged, capable of transforming these 2D pose sequences into their 3D counterparts, leading to large and diverse datasets of 3D gestures. However, the derived 3D pose estimation is essentially a pseudo-ground truth, with the actual ground truth being the 2D motion data. This distinction raises questions about the impact of gesture representation dimensionality on the quality of generated motions, a topic that, to our knowledge, remains largely unexplored. In this work, we evaluate the impact of the dimensionality of the training data, 2D or 3D joint coordinates, on the performance of a multimodal speech-to-gesture deep generative model. We use a lifting model to convert 2D-generated sequences of body pose to 3D. Then, we compare the sequence of gestures generated directly in 3D to the gestures generated in 2D and lifted to 3D as post-processing.
comment: 8 pages. Paper accepted at WACAI 2024
☆ Deciphering the Definition of Adversarial Robustness for post-hoc OOD Detectors
Detecting out-of-distribution (OOD) inputs is critical for safely deploying deep learning models in real-world scenarios. In recent years, many OOD detectors have been developed, and even the benchmarking has been standardized, i.e. OpenOOD. The number of post-hoc detectors is growing fast and showing an option to protect a pre-trained classifier against natural distribution shifts, claiming to be ready for real-world scenarios. However, its efficacy in handling adversarial examples has been neglected in the majority of studies. This paper investigates the adversarial robustness of the 16 post-hoc detectors on several evasion attacks and discuss a roadmap towards adversarial defense in OOD detectors.
☆ HLQ: Fast and Efficient Backpropagation via Hadamard Low-rank Quantization
With the rapid increase in model size and the growing importance of various fine-tuning applications, lightweight training has become crucial. Since the backward pass is twice as expensive as the forward pass, optimizing backpropagation is particularly important. However, modifications to this process can lead to suboptimal convergence, so training optimization should minimize perturbations, which is a highly challenging task. In this study, we introduce a novel optimization strategy called Hadamard Low-rank Quantization (HLQ), focusing on reducing the cost of backpropagation in convolutional and linear layers. We first analyze the sensitivity of gradient computation with respect to activation and weight, and judiciously design the HLQ pipeline to apply 4-bit Hadamard quantization to the activation gradient and Hadamard low-rank approximation to the weight gradient. This combination was found to be the best for maximizing benefits, and our extensive experiments demonstrate the outstanding performance of HLQ in both training from scratch and fine-tuning, achieving significant memory savings and acceleration on real GPUs with negligible quality degradation.
☆ ECLIPSE: Expunging Clean-label Indiscriminate Poisons via Sparse Diffusion Purification ESORICS 2024
Clean-label indiscriminate poisoning attacks add invisible perturbations to correctly labeled training images, thus dramatically reducing the generalization capability of the victim models. Recently, some defense mechanisms have been proposed such as adversarial training, image transformation techniques, and image purification. However, these schemes are either susceptible to adaptive attacks, built on unrealistic assumptions, or only effective against specific poison types, limiting their universal applicability. In this research, we propose a more universally effective, practical, and robust defense scheme called ECLIPSE. We first investigate the impact of Gaussian noise on the poisons and theoretically prove that any kind of poison will be largely assimilated when imposing sufficient random noise. In light of this, we assume the victim has access to an extremely limited number of clean images (a more practical scene) and subsequently enlarge this sparse set for training a denoising probabilistic model (a universal denoising tool). We then begin by introducing Gaussian noise to absorb the poisons and then apply the model for denoising, resulting in a roughly purified dataset. Finally, to address the trade-off of the inconsistency in the assimilation sensitivity of different poisons by Gaussian noise, we propose a lightweight corruption compensation module to effectively eliminate residual poisons, providing a more universal defense approach. Extensive experiments demonstrate that our defense approach outperforms 10 state-of-the-art defenses. We also propose an adaptive attack against ECLIPSE and verify the robustness of our defense scheme. Our code is available at https://github.com/CGCL-codes/ECLIPSE.
comment: Accepted by ESORICS 2024
☆ Tri-VQA: Triangular Reasoning Medical Visual Question Answering for Multi-Attribute Analysis
The intersection of medical Visual Question Answering (Med-VQA) is a challenging research topic with advantages including patient engagement and clinical expert involvement for second opinions. However, existing Med-VQA methods based on joint embedding fail to explain whether their provided results are based on correct reasoning or coincidental answers, which undermines the credibility of VQA answers. In this paper, we investigate the construction of a more cohesive and stable Med-VQA structure. Motivated by causal effect, we propose a novel Triangular Reasoning VQA (Tri-VQA) framework, which constructs reverse causal questions from the perspective of "Why this answer?" to elucidate the source of the answer and stimulate more reasonable forward reasoning processes. We evaluate our method on the Endoscopic Ultrasound (EUS) multi-attribute annotated dataset from five centers, and test it on medical VQA datasets. Experimental results demonstrate the superiority of our approach over existing methods. Our codes and pre-trained models are available at https://anonymous.4open.science/r/Tri_VQA.
☆ Improving Interpretability and Robustness for the Detection of AI-Generated Images
With growing abilities of generative models, artificial content detection becomes an increasingly important and difficult task. However, all popular approaches to this problem suffer from poor generalization across domains and generative models. In this work, we focus on the robustness of AI-generated image (AIGI) detectors. We analyze existing state-of-the-art AIGI detection methods based on frozen CLIP embeddings and show how to interpret them, shedding light on how images produced by various AI generators differ from real ones. Next we propose two ways to improve robustness: based on removing harmful components of the embedding vector and based on selecting the best performing attention heads in the image encoder model. Our methods increase the mean out-of-distribution (OOD) classification score by up to 6% for cross-model transfer. We also propose a new dataset for AIGI detection and use it in our evaluation; we believe this dataset will help boost further research. The dataset and code are provided as a supplement.
☆ SVFormer: A Direct Training Spiking Transformer for Efficient Video Action Recognition IJCAI 2024
Video action recognition (VAR) plays crucial roles in various domains such as surveillance, healthcare, and industrial automation, making it highly significant for the society. Consequently, it has long been a research spot in the computer vision field. As artificial neural networks (ANNs) are flourishing, convolution neural networks (CNNs), including 2D-CNNs and 3D-CNNs, as well as variants of the vision transformer (ViT), have shown impressive performance on VAR. However, they usually demand huge computational cost due to the large data volume and heavy information redundancy introduced by the temporal dimension. To address this challenge, some researchers have turned to brain-inspired spiking neural networks (SNNs), such as recurrent SNNs and ANN-converted SNNs, leveraging their inherent temporal dynamics and energy efficiency. Yet, current SNNs for VAR also encounter limitations, such as nontrivial input preprocessing, intricate network construction/training, and the need for repetitive processing of the same video clip, hindering their practical deployment. In this study, we innovatively propose the directly trained SVFormer (Spiking Video transFormer) for VAR. SVFormer integrates local feature extraction, global self-attention, and the intrinsic dynamics, sparsity, and spike-driven nature of SNNs, to efficiently and effectively extract spatio-temporal features. We evaluate SVFormer on two RGB datasets (UCF101, NTU-RGBD60) and one neuromorphic dataset (DVS128-Gesture), demonstrating comparable performance to the mainstream models in a more efficient way. Notably, SVFormer achieves a top-1 accuracy of 84.03% with ultra-low power consumption (21 mJ/video) on UCF101, which is state-of-the-art among directly trained deep SNNs, showcasing significant advantages over prior models.
comment: Accepted by IJCAI 2024 workshop - Human Brain and Artificial Intelligence
☆ A3D: Does Diffusion Dream about 3D Alignment?
We tackle the problem of text-driven 3D generation from a geometry alignment perspective. We aim at the generation of multiple objects which are consistent in terms of semantics and geometry. Recent methods based on Score Distillation have succeeded in distilling the knowledge from 2D diffusion models to high-quality objects represented by 3D neural radiance fields. These methods handle multiple text queries separately, and therefore, the resulting objects have a high variability in object pose and structure. However, in some applications such as geometry editing, it is desirable to obtain aligned objects. In order to achieve alignment, we propose to optimize the continuous trajectories between the aligned objects, by modeling a space of linear pairwise interpolations of the textual embeddings with a single NeRF representation. We demonstrate that similar objects, consisting of semantically corresponding parts, can be well aligned in 3D space without costly modifications to the generation process. We provide several practical scenarios including mesh editing and object hybridization that benefit from geometry alignment and experimentally demonstrate the efficiency of our method. https://voyleg.github.io/a3d/
☆ Real-Time Hand Gesture Recognition: Integrating Skeleton-Based Data Fusion and Multi-Stream CNN
This study focuses on Hand Gesture Recognition (HGR), which is vital for perceptual computing across various real-world contexts. The primary challenge in the HGR domain lies in dealing with the individual variations inherent in human hand morphology. To tackle this challenge, we introduce an innovative HGR framework that combines data-level fusion and an Ensemble Tuner Multi-stream CNN architecture. This approach effectively encodes spatiotemporal gesture information from the skeleton modality into RGB images, thereby minimizing noise while improving semantic gesture comprehension. Our framework operates in real-time, significantly reducing hardware requirements and computational complexity while maintaining competitive performance on benchmark datasets such as SHREC2017, DHG1428, FPHA, LMDHG and CNR. This improvement in HGR demonstrates robustness and paves the way for practical, real-time applications that leverage resource-limited devices for human-machine interaction and ambient intelligence.
comment: 13 pages. 7 figures. Code available at https://github.com/Outsiders17711/e2eET-Skeleton-Based-HGR-Using-Data-Level-Fusion
☆ Benchmarking Retinal Blood Vessel Segmentation Models for Cross-Dataset and Cross-Disease Generalization
Retinal blood vessel segmentation can extract clinically relevant information from fundus images. As manual tracing is cumbersome, algorithms based on Convolution Neural Networks have been developed. Such studies have used small publicly available datasets for training and measuring performance, running the risk of overfitting. Here, we provide a rigorous benchmark for various architectural and training choices commonly used in the literature on the largest dataset published to date. We train and evaluate five published models on the publicly available FIVES fundus image dataset, which exceeds previous ones in size and quality and which contains also images from common ophthalmological conditions (diabetic retinopathy, age-related macular degeneration, glaucoma). We compare the performance of different model architectures across different loss functions, levels of image qualitiy and ophthalmological conditions and assess their ability to perform well in the face of disease-induced domain shifts. Given sufficient training data, basic architectures such as U-Net perform just as well as more advanced ones, and transfer across disease-induced domain shifts typically works well for most architectures. However, we find that image quality is a key factor determining segmentation outcomes. When optimizing for segmentation performance, investing into a well curated dataset to train a standard architecture yields better results than tuning a sophisticated architecture on a smaller dataset or one with lower image quality. We distilled the utility of architectural advances in terms of their clinical relevance therefore providing practical guidance for model choices depending on the circumstances of the clinical setting
comment: 12 pages, 4 figures
☆ Disability Representations: Finding Biases in Automatic Image Generation CVPR 2024
Recent advancements in image generation technology have enabled widespread access to AI-generated imagery, prominently used in advertising, entertainment, and progressively in every form of visual content. However, these technologies often perpetuate societal biases. This study investigates the representation biases in popular image generation models towards people with disabilities (PWD). Through a comprehensive experiment involving several popular text-to-image models, we analyzed the depiction of disability. The results indicate a significant bias, with most generated images portraying disabled individuals as old, sad, and predominantly using manual wheelchairs. These findings highlight the urgent need for more inclusive AI development, ensuring diverse and accurate representation of PWD in generated images. This research underscores the importance of addressing and mitigating biases in AI models to foster equitable and realistic representations.
comment: Presented at AVA Workshop of CVPR 2024
☆ E2GS: Event Enhanced Gaussian Splatting
Event cameras, known for their high dynamic range, absence of motion blur, and low energy usage, have recently found a wide range of applications thanks to these attributes. In the past few years, the field of event-based 3D reconstruction saw remarkable progress, with the Neural Radiance Field (NeRF) based approach demonstrating photorealistic view synthesis results. However, the volume rendering paradigm of NeRF necessitates extensive training and rendering times. In this paper, we introduce Event Enhanced Gaussian Splatting (E2GS), a novel method that incorporates event data into Gaussian Splatting, which has recently made significant advances in the field of novel view synthesis. Our E2GS effectively utilizes both blurry images and event data, significantly improving image deblurring and producing high-quality novel view synthesis. Our comprehensive experiments on both synthetic and real-world datasets demonstrate our E2GS can generate visually appealing renderings while offering faster training and rendering speed (140 FPS). Our code is available at https://github.com/deguchihiroyuki/E2GS.
comment: 7pages,
☆ CoCPF: Coordinate-based Continuous Projection Field for Ill-Posed Inverse Problem in Imaging
Sparse-view computed tomography (SVCT) reconstruction aims to acquire CT images based on sparsely-sampled measurements. It allows the subjects exposed to less ionizing radiation, reducing the lifetime risk of developing cancers. Recent researches employ implicit neural representation (INR) techniques to reconstruct CT images from a single SV sinogram. However, due to ill-posedness, these INR-based methods may leave considerable ``holes'' (i.e., unmodeled spaces) in their fields, leading to sub-optimal results. In this paper, we propose the Coordinate-based Continuous Projection Field (CoCPF), which aims to build hole-free representation fields for SVCT reconstruction, achieving better reconstruction quality. Specifically, to fill the holes, CoCPF first employs the stripe-based volume sampling module to broaden the sampling regions of Radon transformation from rays (1D space) to stripes (2D space), which can well cover the internal regions between SV projections. Then, by feeding the sampling regions into the proposed differentiable rendering modules, the holes can be jointly optimized during training, reducing the ill-posed levels. As a result, CoCPF can accurately estimate the internal measurements between SV projections (i.e., DV sinograms), producing high-quality CT images after re-projection. Extensive experiments on simulated and real projection datasets demonstrate that CoCPF outperforms state-of-the-art methods for 2D and 3D SVCT reconstructions under various projection numbers and geometries, yielding fine-grained details and fewer artifacts. Our code will be publicly available.
☆ LU2Net: A Lightweight Network for Real-time Underwater Image Enhancement
Computer vision techniques have empowered underwater robots to effectively undertake a multitude of tasks, including object tracking and path planning. However, underwater optical factors like light refraction and absorption present challenges to underwater vision, which cause degradation of underwater images. A variety of underwater image enhancement methods have been proposed to improve the effectiveness of underwater vision perception. Nevertheless, for real-time vision tasks on underwater robots, it is necessary to overcome the challenges associated with algorithmic efficiency and real-time capabilities. In this paper, we introduce Lightweight Underwater Unet (LU2Net), a novel U-shape network designed specifically for real-time enhancement of underwater images. The proposed model incorporates axial depthwise convolution and the channel attention module, enabling it to significantly reduce computational demands and model parameters, thereby improving processing speed. The extensive experiments conducted on the dataset and real-world underwater robots demonstrate the exceptional performance and speed of proposed model. It is capable of providing well-enhanced underwater images at a speed 8 times faster than the current state-of-the-art underwater image enhancement method. Moreover, LU2Net is able to handle real-time underwater video enhancement.
☆ VividDreamer: Towards High-Fidelity and Efficient Text-to-3D Generation
Text-to-3D generation aims to create 3D assets from text-to-image diffusion models. However, existing methods face an inherent bottleneck in generation quality because the widely-used objectives such as Score Distillation Sampling (SDS) inappropriately omit U-Net jacobians for swift generation, leading to significant bias compared to the "true" gradient obtained by full denoising sampling. This bias brings inconsistent updating direction, resulting in implausible 3D generation e.g., color deviation, Janus problem, and semantically inconsistent details). In this work, we propose Pose-dependent Consistency Distillation Sampling (PCDS), a novel yet efficient objective for diffusion-based 3D generation tasks. Specifically, PCDS builds the pose-dependent consistency function within diffusion trajectories, allowing to approximate true gradients through minimal sampling steps (1-3). Compared to SDS, PCDS can acquire a more accurate updating direction with the same sampling time (1 sampling step), while enabling few-step (2-3) sampling to trade compute for higher generation quality. For efficient generation, we propose a coarse-to-fine optimization strategy, which first utilizes 1-step PCDS to create the basic structure of 3D objects, and then gradually increases PCDS steps to generate fine-grained details. Extensive experiments demonstrate that our approach outperforms the state-of-the-art in generation quality and training efficiency, conspicuously alleviating the implausible 3D generation issues caused by the deviated updating direction. Moreover, it can be simply applied to many 3D generative applications to yield impressive 3D assets, please see our project page: https://narcissusex.github.io/VividDreamer.
☆ Contextual Interaction via Primitive-based Adversarial Training For Compositional Zero-shot Learning
Compositional Zero-shot Learning (CZSL) aims to identify novel compositions via known attribute-object pairs. The primary challenge in CZSL tasks lies in the significant discrepancies introduced by the complex interaction between the visual primitives of attribute and object, consequently decreasing the classification performance towards novel compositions. Previous remarkable works primarily addressed this issue by focusing on disentangling strategy or utilizing object-based conditional probabilities to constrain the selection space of attributes. Unfortunately, few studies have explored the problem from the perspective of modeling the mechanism of visual primitive interactions. Inspired by the success of vanilla adversarial learning in Cross-Domain Few-Shot Learning, we take a step further and devise a model-agnostic and Primitive-Based Adversarial training (PBadv) method to deal with this problem. Besides, the latest studies highlight the weakness of the perception of hard compositions even under data-balanced conditions. To this end, we propose a novel over-sampling strategy with object-similarity guidance to augment target compositional training data. We performed detailed quantitative analysis and retrieval experiments on well-established datasets, such as UT-Zappos50K, MIT-States, and C-GQA, to validate the effectiveness of our proposed method, and the state-of-the-art (SOTA) performance demonstrates the superiority of our approach. The code is available at https://github.com/lisuyi/PBadv_czsl.
☆ Skip and Skip: Segmenting Medical Images with Prompts
Most medical image lesion segmentation methods rely on hand-crafted accurate annotations of the original image for supervised learning. Recently, a series of weakly supervised or unsupervised methods have been proposed to reduce the dependence on pixel-level annotations. However, these methods are essentially based on pixel-level annotation, ignoring the image-level diagnostic results of the current massive medical images. In this paper, we propose a dual U-shaped two-stage framework that utilizes image-level labels to prompt the segmentation. In the first stage, we pre-train a classification network with image-level labels, which is used to obtain the hierarchical pyramid features and guide the learning of downstream branches. In the second stage, we feed the hierarchical features obtained from the classification branch into the downstream branch through short-skip and long-skip and get the lesion masks under the supervised learning of pixel-level labels. Experiments show that our framework achieves better results than networks simply using pixel-level annotations.
comment: Work in progress
☆ A Unified Framework for Synthesizing Multisequence Brain MRI via Hybrid Fusion
Multisequence Magnetic Resonance Imaging (MRI) provides a reliable diagnosis in clinical applications through complementary information within sequences. However, in practice, the absence of certain MR sequences is a common problem that can lead to inconsistent analysis results. In this work, we propose a novel unified framework for synthesizing multisequence MR images, called Hybrid Fusion GAN (HF-GAN). We introduce a hybrid fusion encoder designed to ensure the disentangled extraction of complementary and modality-specific information, along with a channel attention-based feature fusion module that integrates the features into a common latent space handling the complexity from combinations of accessible MR sequences. Common feature representations are transformed into a target latent space via the modality infuser to synthesize missing MR sequences. We have performed experiments on multisequence brain MRI datasets from healthy individuals and patients diagnosed with brain tumors. Experimental results show that our method outperforms state-of-the-art methods in both quantitative and qualitative comparisons. In addition, a detailed analysis of our framework demonstrates the superiority of our designed modules and their effectiveness for use in data imputation tasks.
comment: 11 pages, 7 figures
☆ Deep Imbalanced Regression to Estimate Vascular Age from PPG Data: a Novel Digital Biomarker for Cardiovascular Health
Photoplethysmography (PPG) is emerging as a crucial tool for monitoring human hemodynamics, with recent studies highlighting its potential in assessing vascular aging through deep learning. However, real-world age distributions are often imbalanced, posing significant challenges for deep learning models. In this paper, we introduce a novel, simple, and effective loss function named the Dist Loss to address deep imbalanced regression tasks. We trained a one-dimensional convolutional neural network (Net1D) incorporating the Dist Loss on the extensive UK Biobank dataset (n=502,389) to estimate vascular age from PPG signals and validate its efficacy in characterizing cardiovascular health. The model's performance was validated on a 40% held-out test set, achieving state-of-the-art results, especially in regions with small sample sizes. Furthermore, we divided the population into three subgroups based on the difference between predicted vascular age and chronological age: less than -10 years, between -10 and 10 years, and greater than 10 years. We analyzed the relationship between predicted vascular age and several cardiovascular events over a follow-up period of up to 10 years, including death, coronary heart disease, and heart failure. Our results indicate that the predicted vascular age has significant potential to reflect an individual's cardiovascular health status. Our code will be available at https://github.com/Ngk03/AI-vascular-age.
☆ Brightearth roads: Towards fully automatic road network extraction from satellite imagery
The modern road network topology comprises intricately designed structures that introduce complexity when automatically reconstructing road networks. While open resources like OpenStreetMap (OSM) offer road networks with well-defined topology, they may not always be up to date worldwide. In this paper, we propose a fully automated pipeline for extracting road networks from very-high-resolution (VHR) satellite imagery. Our approach directly generates road line-strings that are seamlessly connected and precisely positioned. The process involves three key modules: a CNN-based neural network for road segmentation, a graph optimization algorithm to convert road predictions into vector line-strings, and a machine learning model for classifying road materials. Compared to OSM data, our results demonstrate significant potential for providing the latest road layouts and precise positions of road segments.
☆ Gaussian-Informed Continuum for Physical Property Identification and Simulation
This paper studies the problem of estimating physical properties (system identification) through visual observations. To facilitate geometry-aware guidance in physical property estimation, we introduce a novel hybrid framework that leverages 3D Gaussian representation to not only capture explicit shapes but also enable the simulated continuum to deduce implicit shapes during training. We propose a new dynamic 3D Gaussian framework based on motion factorization to recover the object as 3D Gaussian point sets across different time states. Furthermore, we develop a coarse-to-fine filling strategy to generate the density fields of the object from the Gaussian reconstruction, allowing for the extraction of object continuums along with their surfaces and the integration of Gaussian attributes into these continuums. In addition to the extracted object surfaces, the Gaussian-informed continuum also enables the rendering of object masks during simulations, serving as implicit shape guidance for physical property estimation. Extensive experimental evaluations demonstrate that our pipeline achieves state-of-the-art performance across multiple benchmarks and metrics. Additionally, we illustrate the effectiveness of the proposed method through real-world demonstrations, showcasing its practical utility. Our project page is at https://jukgei.github.io/project/gic.
comment: 19 pages, 8 figures
☆ DiPEx: Dispersing Prompt Expansion for Class-Agnostic Object Detection
Class-agnostic object detection (OD) can be a cornerstone or a bottleneck for many downstream vision tasks. Despite considerable advancements in bottom-up and multi-object discovery methods that leverage basic visual cues to identify salient objects, consistently achieving a high recall rate remains difficult due to the diversity of object types and their contextual complexity. In this work, we investigate using vision-language models (VLMs) to enhance object detection via a self-supervised prompt learning strategy. Our initial findings indicate that manually crafted text queries often result in undetected objects, primarily because detection confidence diminishes when the query words exhibit semantic overlap. To address this, we propose a Dispersing Prompt Expansion (DiPEx) approach. DiPEx progressively learns to expand a set of distinct, non-overlapping hyperspherical prompts to enhance recall rates, thereby improving performance in downstream tasks such as out-of-distribution OD. Specifically, DiPEx initiates the process by self-training generic parent prompts and selecting the one with the highest semantic uncertainty for further expansion. The resulting child prompts are expected to inherit semantics from their parent prompts while capturing more fine-grained semantics. We apply dispersion losses to ensure high inter-class discrepancy among child prompts while preserving semantic consistency between parent-child prompt pairs. To prevent excessive growth of the prompt sets, we utilize the maximum angular coverage (MAC) of the semantic space as a criterion for early termination. We demonstrate the effectiveness of DiPEx through extensive class-agnostic OD and OOD-OD experiments on MS-COCO and LVIS, surpassing other prompting methods by up to 20.1% in AR and achieving a 21.3% AP improvement over SAM. The code is available at https://github.com/jason-lim26/DiPEx.
comment: 19 pages
☆ LLM2FEA: Discover Novel Designs with Generative Evolutionary Multitasking
The rapid research and development of generative artificial intelligence has enabled the generation of high-quality images, text, and 3D models from text prompts. This advancement impels an inquiry into whether these models can be leveraged to create digital artifacts for both creative and engineering applications. Drawing on innovative designs from other domains may be one answer to this question, much like the historical practice of ``bionics", where humans have sought inspiration from nature's exemplary designs. This raises the intriguing possibility of using generative models to simultaneously tackle design tasks across multiple domains, facilitating cross-domain learning and resulting in a series of innovative design solutions. In this paper, we propose LLM2FEA as the first attempt to discover novel designs in generative models by transferring knowledge across multiple domains. By utilizing a multi-factorial evolutionary algorithm (MFEA) to drive a large language model, LLM2FEA integrates knowledge from various fields to generate prompts that guide the generative model in discovering novel and practical objects. Experimental results in the context of 3D aerodynamic design verify the discovery capabilities of the proposed LLM2FEA. The designs generated by LLM2FEA not only satisfy practicality requirements to a certain degree but also feature novel and aesthetically pleasing shapes, demonstrating the potential applications of LLM2FEA in discovery tasks.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
☆ Demonstrating the Efficacy of Kolmogorov-Arnold Networks in Vision Tasks
In the realm of deep learning, the Kolmogorov-Arnold Network (KAN) has emerged as a potential alternative to multilayer projections (MLPs). However, its applicability to vision tasks has not been extensively validated. In our study, we demonstrated the effectiveness of KAN for vision tasks through multiple trials on the MNIST, CIFAR10, and CIFAR100 datasets, using a training batch size of 32. Our results showed that while KAN outperformed the original MLP-Mixer on CIFAR10 and CIFAR100, it performed slightly worse than the state-of-the-art ResNet-18. These findings suggest that KAN holds significant promise for vision tasks, and further modifications could enhance its performance in future evaluations.Our contributions are threefold: first, we showcase the efficiency of KAN-based algorithms for visual tasks; second, we provide extensive empirical assessments across various vision benchmarks, comparing KAN's performance with MLP-Mixer, CNNs, and Vision Transformers (ViT); and third, we pioneer the use of natural KAN layers in visual tasks, addressing a gap in previous research. This paper lays the foundation for future studies on KANs, highlighting their potential as a reliable alternative for image classification tasks.
☆ FC3DNet: A Fully Connected Encoder-Decoder for Efficient Demoir'eing ICIP2024
Moir\'e patterns are commonly seen when taking photos of screens. Camera devices usually have limited hardware performance but take high-resolution photos. However, users are sensitive to the photo processing time, which presents a hardly considered challenge of efficiency for demoir\'eing methods. To balance the network speed and quality of results, we propose a \textbf{F}ully \textbf{C}onnected en\textbf{C}oder-de\textbf{C}oder based \textbf{D}emoir\'eing \textbf{Net}work (FC3DNet). FC3DNet utilizes features with multiple scales in each stage of the decoder for comprehensive information, which contains long-range patterns as well as various local moir\'e styles that both are crucial aspects in demoir\'eing. Besides, to make full use of multiple features, we design a Multi-Feature Multi-Attention Fusion (MFMAF) module to weigh the importance of each feature and compress them for efficiency. These designs enable our network to achieve performance comparable to state-of-the-art (SOTA) methods in real-world datasets while utilizing only a fraction of parameters, FLOPs, and runtime.
comment: Accepted by ICIP2024
☆ SelfReg-UNet: Self-Regularized UNet for Medical Image Segmentation MICCAI
Since its introduction, UNet has been leading a variety of medical image segmentation tasks. Although numerous follow-up studies have also been dedicated to improving the performance of standard UNet, few have conducted in-depth analyses of the underlying interest pattern of UNet in medical image segmentation. In this paper, we explore the patterns learned in a UNet and observe two important factors that potentially affect its performance: (i) irrelative feature learned caused by asymmetric supervision; (ii) feature redundancy in the feature map. To this end, we propose to balance the supervision between encoder and decoder and reduce the redundant information in the UNet. Specifically, we use the feature map that contains the most semantic information (i.e., the last layer of the decoder) to provide additional supervision to other blocks to provide additional supervision and reduce feature redundancy by leveraging feature distillation. The proposed method can be easily integrated into existing UNet architecture in a plug-and-play fashion with negligible computational cost. The experimental results suggest that the proposed method consistently improves the performance of standard UNets on four medical image segmentation datasets. The code is available at \url{https://github.com/ChongQingNoSubway/SelfReg-UNet}
comment: Accepted as a conference paper to 2024 MICCAI
☆ MOS: Model Synergy for Test-Time Adaptation on LiDAR-Based 3D Object Detection
LiDAR-based 3D object detection is pivotal across many applications, yet the performance of such detection systems often degrades after deployment, especially when faced with unseen test point clouds originating from diverse locations or subjected to corruption. In this work, we introduce a new online adaptation framework for detectors named Model Synergy (MOS). Specifically, MOS dynamically assembles best-fit supermodels for each test batch from a bank of historical checkpoints, leveraging long-term knowledge to guide model updates without forgetting. The model assembly is directed by the proposed synergy weights (SW), employed for weighted averaging of the selected checkpoints to minimize redundancy in the composite supermodel. These weights are calculated by evaluating the similarity of predicted bounding boxes on test data and the feature independence among model pairs in the bank. To maintain an informative yet compact model bank, we pop out checkpoints with the lowest average SW scores and insert newly updated model weights. Our method was rigorously tested against prior test-time domain adaptation strategies on three datasets and under eight types of corruptions, demonstrating its superior adaptability to changing scenes and conditions. Remarkably, our approach achieved a 67.3% increase in performance in a complex "cross-corruption" scenario, which involves cross-dataset inconsistencies and real-world scene corruptions, providing a more realistic testbed of adaptation capabilities. The code is available at https://github.com/zhuoxiao-chen/MOS.
☆ TraceNet: Segment one thing efficiently
Efficient single instance segmentation is essential for unlocking features in the mobile imaging applications, such as capture or editing. Existing on-the-fly mobile imaging applications scope the segmentation task to portraits or the salient subject due to the computational constraints. Instance segmentation, despite its recent developments towards efficient networks, is still heavy due to the cost of computation on the entire image to identify all instances. To address this, we propose and formulate a one tap driven single instance segmentation task that segments a single instance selected by a user via a positive tap. This task, in contrast to the broader task of segmenting anything as suggested in the Segment Anything Model \cite{sam}, focuses on efficient segmentation of a single instance specified by the user. To solve this problem, we present TraceNet, which explicitly locates the selected instance by way of receptive field tracing. TraceNet identifies image regions that are related to the user tap and heavy computations are only performed on selected regions of the image. Therefore overall computation cost and memory consumption are reduced during inference. We evaluate the performance of TraceNet on instance IoU average over taps and the proportion of the region that a user tap can fall into for a high-quality single-instance mask. Experimental results on MS-COCO and LVIS demonstrate the effectiveness and efficiency of the proposed approach. TraceNet can jointly achieve the efficiency and interactivity, filling in the gap between needs for efficient mobile inference and recent research trend towards multimodal and interactive segmentation models.
☆ LatentExplainer: Explaining Latent Representations in Deep Generative Models with Multi-modal Foundation Models
Deep generative models like VAEs and diffusion models have advanced various generation tasks by leveraging latent variables to learn data distributions and generate high-quality samples. Despite the field of explainable AI making strides in interpreting machine learning models, understanding latent variables in generative models remains challenging. This paper introduces LatentExplainer, a framework for automatically generating semantically meaningful explanations of latent variables in deep generative models. LatentExplainer tackles three main challenges: inferring the meaning of latent variables, aligning explanations with inductive biases, and handling varying degrees of explainability. By perturbing latent variables and interpreting changes in generated data, the framework provides a systematic approach to understanding and controlling the data generation process, enhancing the transparency and interpretability of deep generative models. We evaluate our proposed method on several real-world and synthetic datasets, and the results demonstrate superior performance in generating high-quality explanations of latent variables.
☆ Accessible, At-Home Detection of Parkinson's Disease via Multi-task Video Analysis
Limited access to neurological care leads to missed diagnoses of Parkinson's disease (PD), leaving many individuals unidentified and untreated. We trained a novel neural network-based fusion architecture to detect Parkinson's disease (PD) by analyzing features extracted from webcam recordings of three tasks: finger tapping, facial expression (smiling), and speech (uttering a sentence containing all letters of the alphabet). Additionally, the model incorporated Monte Carlo Dropout to improve prediction accuracy by considering uncertainties. The study participants (n = 845, 272 with PD) were randomly split into three sets: 60% for training, 20% for model selection (hyper-parameter tuning), and 20% for final performance evaluation. The dataset consists of 1102 sessions, each session containing videos of all three tasks. Our proposed model achieved significantly better accuracy, area under the ROC curve (AUROC), and sensitivity at non-inferior specificity compared to any single-task model. Withholding uncertain predictions further boosted the performance, achieving 88.0% (95% CI: 87.7% - 88.4%) accuracy, 93.0% (92.8% - 93.2%) AUROC, 79.3% (78.4% - 80.2%) sensitivity, and 92.6% (92.3% - 92.8%) specificity, at the expense of not being able to predict for 2.3% (2.0% - 2.6%) data. Further analysis suggests that the trained model does not exhibit any detectable bias across sex and ethnic subgroups and is most effective for individuals aged between 50 and 80. This accessible, low-cost approach requiring only an internet-enabled device with a webcam and microphone paves the way for convenient PD screening at home, particularly in regions with limited access to clinical specialists.
☆ Six-CD: Benchmarking Concept Removals for Benign Text-to-image Diffusion Models
Text-to-image (T2I) diffusion models have shown exceptional capabilities in generating images that closely correspond to textual prompts. However, the advancement of T2I diffusion models presents significant risks, as the models could be exploited for malicious purposes, such as generating images with violence or nudity, or creating unauthorized portraits of public figures in inappropriate contexts. To mitigate these risks, concept removal methods have been proposed. These methods aim to modify diffusion models to prevent the generation of malicious and unwanted concepts. Despite these efforts, existing research faces several challenges: (1) a lack of consistent comparisons on a comprehensive dataset, (2) ineffective prompts in harmful and nudity concepts, (3) overlooked evaluation of the ability to generate the benign part within prompts containing malicious concepts. To address these gaps, we propose to benchmark the concept removal methods by introducing a new dataset, Six-CD, along with a novel evaluation metric. In this benchmark, we conduct a thorough evaluation of concept removals, with the experimental observations and discussions offering valuable insights in the field.
☆ PEANO-ViT: Power-Efficient Approximations of Non-Linearities in Vision Transformers
The deployment of Vision Transformers (ViTs) on hardware platforms, specially Field-Programmable Gate Arrays (FPGAs), presents many challenges, which are mainly due to the substantial computational and power requirements of their non-linear functions, notably layer normalization, softmax, and Gaussian Error Linear Unit (GELU). These critical functions pose significant obstacles to efficient hardware implementation due to their complex mathematical operations and the inherent resource count and architectural limitations of FPGAs. PEANO-ViT offers a novel approach to streamlining the implementation of the layer normalization layer by introducing a division-free technique that simultaneously approximates the division and square root function. Additionally, PEANO-ViT provides a multi-scale division strategy to eliminate division operations in the softmax layer, aided by a Pade-based approximation for the exponential function. Finally, PEANO-ViT introduces a piece-wise linear approximation for the GELU function, carefully designed to bypass the computationally intensive operations associated with GELU. In our comprehensive evaluations, PEANO-ViT exhibits minimal accuracy degradation (<= 0.5% for DeiT-B) while significantly enhancing power efficiency, achieving improvements of 1.91x, 1.39x, 8.01x for layer normalization, softmax, and GELU, respectively. This improvement is achieved through substantial reductions in DSP, LUT, and register counts for these non-linear operations. Consequently, PEANO-ViT enables efficient deployment of Vision Transformers on resource- and power-constrained FPGA platforms.
☆ Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models
Large language models (LLMs) and vision-language models (VLMs) have demonstrated remarkable performance across a wide range of tasks and domains. Despite this promise, spatial understanding and reasoning -- a fundamental component of human cognition -- remains under-explored. We develop novel benchmarks that cover diverse aspects of spatial reasoning such as relationship understanding, navigation, and counting. We conduct a comprehensive evaluation of competitive language and vision-language models. Our findings reveal several counter-intuitive insights that have been overlooked in the literature: (1) Spatial reasoning poses significant challenges where competitive models can fall behind random guessing; (2) Despite additional visual input, VLMs often under-perform compared to their LLM counterparts; (3) When both textual and visual information is available, multi-modal language models become less reliant on visual information if sufficient textual clues are provided. Additionally, we demonstrate that leveraging redundancy between vision and text can significantly enhance model performance. We hope our study will inform the development of multimodal models to improve spatial intelligence and further close the gap with human intelligence.
☆ Fair Text to Medical Image Diffusion Model with Subgroup Distribution Aligned Tuning
The text to medical image (T2MedI) with latent diffusion model has great potential to alleviate the scarcity of medical imaging data and explore the underlying appearance distribution of lesions in a specific patient status description. However, as the text to nature image models, we show that the T2MedI model can also bias to some subgroups to overlook the minority ones in the training set. In this work, we first build a T2MedI model based on the pre-trained Imagen model, which has the fixed contrastive language-image pre-training (CLIP) text encoder, while its decoder has been fine-tuned on medical images from the Radiology Objects in COntext (ROCO) dataset. Its gender bias is analyzed qualitatively and quantitatively. Toward this issue, we propose to fine-tune the T2MedI toward the target application dataset to align their sensitive subgroups distribution probability. Specifically, the alignment loss for fine-tuning is guided by an off-the-shelf sensitivity-subgroup classifier to match the classification probability between the generated images and the expected target dataset. In addition, the image quality is maintained by a CLIP-consistency regularization term following a knowledge distillation scheme. For evaluation, we set the target dataset to be enhanced as the BraST18 dataset, and trained a brain magnetic resonance (MR) slice-based gender classifier from it. With our method, the generated MR image can markedly reduce the inconsistency with the gender proportion in the BraTS18 dataset.
☆ CLIP-Decoder : ZeroShot Multilabel Classification using Multimodal CLIP Aligned Representation ICCV
Multi-label classification is an essential task utilized in a wide variety of real-world applications. Multi-label zero-shot learning is a method for classifying images into multiple unseen categories for which no training data is available, while in general zero-shot situations, the test set may include observed classes. The CLIP-Decoder is a novel method based on the state-of-the-art ML-Decoder attention-based head. We introduce multi-modal representation learning in CLIP-Decoder, utilizing the text encoder to extract text features and the image encoder for image feature extraction. Furthermore, we minimize semantic mismatch by aligning image and word embeddings in the same dimension and comparing their respective representations using a combined loss, which comprises classification loss and CLIP loss. This strategy outperforms other methods and we achieve cutting-edge results on zero-shot multilabel classification tasks using CLIP-Decoder. Our method achieves an absolute increase of 3.9% in performance compared to existing methods for zero-shot learning multi-label classification tasks. Additionally, in the generalized zero-shot learning multi-label classification task, our method shows an impressive increase of almost 2.3%.
comment: Accepted at ICCVW- VLAR
☆ SAM-EG: Segment Anything Model with Egde Guidance framework for efficient Polyp Segmentation
Polyp segmentation, a critical concern in medical imaging, has prompted numerous proposed methods aimed at enhancing the quality of segmented masks. While current state-of-the-art techniques produce impressive results, the size and computational cost of these models pose challenges for practical industry applications. Recently, the Segment Anything Model (SAM) has been proposed as a robust foundation model, showing promise for adaptation to medical image segmentation. Inspired by this concept, we propose SAM-EG, a framework that guides small segmentation models for polyp segmentation to address the computation cost challenge. Additionally, in this study, we introduce the Edge Guiding module, which integrates edge information into image features to assist the segmentation model in addressing boundary issues from current segmentation model in this task. Through extensive experiments, our small models showcase their efficacy by achieving competitive results with state-of-the-art methods, offering a promising approach to developing compact models with high accuracy for polyp segmentation and in the broader field of medical imaging.
☆ Latent diffusion models for parameterization and data assimilation of facies-based geomodels
Geological parameterization entails the representation of a geomodel using a small set of latent variables and a mapping from these variables to grid-block properties such as porosity and permeability. Parameterization is useful for data assimilation (history matching), as it maintains geological realism while reducing the number of variables to be determined. Diffusion models are a new class of generative deep-learning procedures that have been shown to outperform previous methods, such as generative adversarial networks, for image generation tasks. Diffusion models are trained to "denoise", which enables them to generate new geological realizations from input fields characterized by random noise. Latent diffusion models, which are the specific variant considered in this study, provide dimension reduction through use of a low-dimensional latent variable. The model developed in this work includes a variational autoencoder for dimension reduction and a U-net for the denoising process. Our application involves conditional 2D three-facies (channel-levee-mud) systems. The latent diffusion model is shown to provide realizations that are visually consistent with samples from geomodeling software. Quantitative metrics involving spatial and flow-response statistics are evaluated, and general agreement between the diffusion-generated models and reference realizations is observed. Stability tests are performed to assess the smoothness of the parameterization method. The latent diffusion model is then used for ensemble-based data assimilation. Two synthetic "true" models are considered. Significant uncertainty reduction, posterior P$_{10}$-P$_{90}$ forecasts that generally bracket observed data, and consistent posterior geomodels, are achieved in both cases.
☆ Relighting Scenes with Object Insertions in Neural Radiance Fields
The insertion of objects into a scene and relighting are commonly utilized applications in augmented reality (AR). Previous methods focused on inserting virtual objects using CAD models or real objects from single-view images, resulting in highly limited AR application scenarios. We propose a novel NeRF-based pipeline for inserting object NeRFs into scene NeRFs, enabling novel view synthesis and realistic relighting, supporting physical interactions like casting shadows onto each other, from two sets of images depicting the object and scene. The lighting environment is in a hybrid representation of Spherical Harmonics and Spherical Gaussians, representing both high- and low-frequency lighting components very well, and supporting non-Lambertian surfaces. Specifically, we leverage the benefits of volume rendering and introduce an innovative approach for efficient shadow rendering by comparing the depth maps between the camera view and the light source view and generating vivid soft shadows. The proposed method achieves realistic relighting effects in extensive experimental evaluations.
comment: 14 pages
☆ Camera-Invariant Meta-Learning Network for Single-Camera-Training Person Re-identification
Single-camera-training person re-identification (SCT re-ID) aims to train a re-ID model using SCT datasets where each person appears in only one camera. The main challenge of SCT re-ID is to learn camera-invariant feature representations without cross-camera same-person (CCSP) data as supervision. Previous methods address it by assuming that the most similar person should be found in another camera. However, this assumption is not guaranteed to be correct. In this paper, we propose a Camera-Invariant Meta-Learning Network (CIMN) for SCT re-ID. CIMN assumes that the camera-invariant feature representations should be robust to camera changes. To this end, we split the training data into meta-train set and meta-test set based on camera IDs and perform a cross-camera simulation via meta-learning strategy, aiming to enforce the representations learned from the meta-train set to be robust to the meta-test set. With the cross-camera simulation, CIMN can learn camera-invariant and identity-discriminative representations even there are no CCSP data. However, this simulation also causes the separation of the meta-train set and the meta-test set, which ignores some beneficial relations between them. Thus, we introduce three losses: meta triplet loss, meta classification loss, and meta camera alignment loss, to leverage the ignored relations. The experiment results demonstrate that our method achieves comparable performance with and without CCSP data, and outperforms the state-of-the-art methods on SCT re-ID benchmarks. In addition, it is also effective in improving the domain generalization ability of the model.
♻ ☆ AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention
Despite their great success across various multimodal tasks, Large Vision-Language Models (LVLMs) are facing a prevalent problem with object hallucinations, where the generated textual responses are inconsistent with ground-truth objects in the given image. This paper investigates various LVLMs and pinpoints attention deficiency toward discriminative local image features as one root cause of object hallucinations. Specifically, LVLMs predominantly attend to prompt-independent global image features, while failing to capture prompt-relevant local features, consequently undermining the visual grounding capacity of LVLMs and leading to hallucinations. To this end, we propose Assembly of Global and Local Attention (AGLA), a training-free and plug-and-play approach that mitigates object hallucinations by exploring an ensemble of global features for response generation and local features for visual discrimination simultaneously. Our approach exhibits an image-prompt matching scheme that captures prompt-relevant local features from images, leading to an augmented view of the input image where prompt-relevant content is reserved while irrelevant distractions are masked. With the augmented view, a calibrated decoding distribution can be derived by integrating generative global features from the original image and discriminative local features from the augmented image. Extensive experiments show that AGLA consistently mitigates object hallucinations and enhances general perception capability for LVLMs across various discriminative and generative benchmarks. Our code will be released at https://github.com/Lackel/AGLA.
♻ ☆ Unfolding ADMM for Enhanced Subspace Clustering of Hyperspectral Images
Deep subspace clustering methods are now prominent in clustering, typically using fully connected networks and a self-representation loss function. However, these methods often struggle with overfitting and lack interpretability. In this paper, we explore an alternative clustering approach based on deep unfolding. By unfolding iterative optimization methods into neural networks, this approach offers enhanced interpretability and reliability compared to data-driven deep learning methods, and greater adaptability and generalization than model-based approaches. Hence, unfolding has become widely used in inverse imaging problems, such as image restoration, reconstruction, and super-resolution, but has not been sufficiently explored yet in the context of clustering. In this work, we introduce an innovative clustering architecture for hyperspectral images (HSI) by unfolding an iterative solver based on the Alternating Direction Method of Multipliers (ADMM) for sparse subspace clustering. To our knowledge, this is the first attempt to apply unfolding ADMM for computing the self-representation matrix in subspace clustering. Moreover, our approach captures well the structural characteristics of HSI data by employing the K nearest neighbors algorithm as part of a structure preservation module. Experimental evaluation of three established HSI datasets shows clearly the potential of the unfolding approach in HSI clustering and even demonstrates superior performance compared to state-of-the-art techniques.
♻ ☆ Directly Fine-Tuning Diffusion Models on Differentiable Rewards ICLR 2024
We present Direct Reward Fine-Tuning (DRaFT), a simple and effective method for fine-tuning diffusion models to maximize differentiable reward functions, such as scores from human preference models. We first show that it is possible to backpropagate the reward function gradient through the full sampling procedure, and that doing so achieves strong performance on a variety of rewards, outperforming reinforcement learning-based approaches. We then propose more efficient variants of DRaFT: DRaFT-K, which truncates backpropagation to only the last K steps of sampling, and DRaFT-LV, which obtains lower-variance gradient estimates for the case when K=1. We show that our methods work well for a variety of reward functions and can be used to substantially improve the aesthetic quality of images generated by Stable Diffusion 1.4. Finally, we draw connections between our approach and prior work, providing a unifying perspective on the design space of gradient-based fine-tuning algorithms.
comment: Published at ICLR 2024
♻ ☆ A Dataset and Benchmark for Copyright Infringement Unlearning from Text-to-Image Diffusion Models
Copyright law confers upon creators the exclusive rights to reproduce, distribute, and monetize their creative works. However, recent progress in text-to-image generation has introduced formidable challenges to copyright enforcement. These technologies enable the unauthorized learning and replication of copyrighted content, artistic creations, and likenesses, leading to the proliferation of unregulated content. Notably, models like stable diffusion, which excel in text-to-image synthesis, heighten the risk of copyright infringement and unauthorized distribution.Machine unlearning, which seeks to eradicate the influence of specific data or concepts from machine learning models, emerges as a promising solution by eliminating the \enquote{copyright memories} ingrained in diffusion models. Yet, the absence of comprehensive large-scale datasets and standardized benchmarks for evaluating the efficacy of unlearning techniques in the copyright protection scenarios impedes the development of more effective unlearning methods. To address this gap, we introduce a novel pipeline that harmonizes CLIP, ChatGPT, and diffusion models to curate a dataset. This dataset encompasses anchor images, associated prompts, and images synthesized by text-to-image models. Additionally, we have developed a mixed metric based on semantic and style information, validated through both human and artist assessments, to gauge the effectiveness of unlearning approaches. Our dataset, benchmark library, and evaluation metrics will be made publicly available to foster future research and practical applications (https://rmpku.github.io/CPDM-page/, website / http://149.104.22.83/unlearning.tar.gz, dataset).
comment: 20 pages, 7 figures, 3 table
♻ ☆ Multi-view Disentanglement for Reinforcement Learning with Multiple Cameras
The performance of image-based Reinforcement Learning (RL) agents can vary depending on the position of the camera used to capture the images. Training on multiple cameras simultaneously, including a first-person egocentric camera, can leverage information from different camera perspectives to improve the performance of RL. However, hardware constraints may limit the availability of multiple cameras in real-world deployment. Additionally, cameras may become damaged in the real-world preventing access to all cameras that were used during training. To overcome these hardware constraints, we propose Multi-View Disentanglement (MVD), which uses multiple cameras to learn a policy that is robust to a reduction in the number of cameras to generalise to any single camera from the training set. Our approach is a self-supervised auxiliary task for RL that learns a disentangled representation from multiple cameras, with a shared representation that is aligned across all cameras to allow generalisation to a single camera, and a private representation that is camera-specific. We show experimentally that an RL agent trained on a single third-person camera is unable to learn an optimal policy in many control tasks; but, our approach, benefiting from multiple cameras during training, is able to solve the task using only the same single third-person camera.
comment: Reinforcement Learning Conference (RLC), 2024
♻ ☆ Two Complementary Perspectives to Continual Learning: Ask Not Only What to Optimize, But Also How
Recent years have seen considerable progress in the continual training of deep neural networks, predominantly thanks to approaches that add replay or regularization terms to the loss function to approximate the joint loss over all tasks so far. However, we show that even with a perfect approximation to the joint loss, these approaches still suffer from temporary but substantial forgetting when starting to train on a new task. Motivated by this 'stability gap', we propose that continual learning strategies should focus not only on the optimization objective, but also on the way this objective is optimized. While there is some continual learning work that alters the optimization trajectory (e.g., using gradient projection techniques), this line of research is positioned as alternative to improving the optimization objective, while we argue it should be complementary. In search of empirical support for our proposition, we perform a series of pre-registered experiments combining replay-approximated joint objectives with gradient projection-based optimization routines. However, this first experimental attempt fails to show clear and consistent benefits. Nevertheless, our conceptual arguments, as well as some of our empirical results, demonstrate the distinctive importance of the optimization trajectory in continual learning, thereby opening up a new direction for continual learning research.
comment: Full paper version of pre-registered report accepted at the 1st ContinualAI Unconference. The originally submitted pre-registered proposal can be found at arXiv:2311.04898v1
♻ ☆ FreeMotion: MoCap-Free Human Motion Synthesis with Multimodal Large Language Models
Human motion synthesis is a fundamental task in computer animation. Despite recent progress in this field utilizing deep learning and motion capture data, existing methods are always limited to specific motion categories, environments, and styles. This poor generalizability can be partially attributed to the difficulty and expense of collecting large-scale and high-quality motion data. At the same time, foundation models trained with internet-scale image and text data have demonstrated surprising world knowledge and reasoning ability for various downstream tasks. Utilizing these foundation models may help with human motion synthesis, which some recent works have superficially explored. However, these methods didn't fully unveil the foundation models' potential for this task and only support several simple actions and environments. In this paper, we for the first time, without any motion data, explore open-set human motion synthesis using natural language instructions as user control signals based on MLLMs across any motion task and environment. Our framework can be split into two stages: 1) sequential keyframe generation by utilizing MLLMs as a keyframe designer and animator; 2) motion filling between keyframes through interpolation and motion tracking. Our method can achieve general human motion synthesis for many downstream tasks. The promising results demonstrate the worth of mocap-free human motion synthesis aided by MLLMs and pave the way for future research.
♻ ☆ LLIC: Large Receptive Field Transform Coding with Adaptive Weights for Learned Image Compression
The effective receptive field (ERF) plays an important role in transform coding, which determines how much redundancy can be removed during transform and how many spatial priors can be utilized to synthesize textures during inverse transform. Existing methods rely on stacks of small kernels, whose ERFs remain insufficiently large, or heavy non-local attention mechanisms, which limit the potential of high-resolution image coding. To tackle this issue, we propose Large Receptive Field Transform Coding with Adaptive Weights for Learned Image Compression (LLIC). Specifically, for the first time in the learned image compression community, we introduce a few large kernelbased depth-wise convolutions to reduce more redundancy while maintaining modest complexity. Due to the wide range of image diversity, we further propose a mechanism to augment convolution adaptability through the self-conditioned generation of weights. The large kernels cooperate with non-linear embedding and gate mechanisms for better expressiveness and lighter pointwise interactions. Our investigation extends to refined training methods that unlock the full potential of these large kernels. Moreover, to promote more dynamic inter-channel interactions, we introduce an adaptive channel-wise bit allocation strategy that autonomously generates channel importance factors in a self-conditioned manner. To demonstrate the effectiveness of the proposed transform coding, we align the entropy model to compare with existing transform methods and obtain models LLIC-STF, LLIC-ELIC, and LLIC-TCM. Extensive experiments demonstrate that our proposed LLIC models have significant improvements over the corresponding baselines and reduce the BD-Rate by 9.49%, 9.47%, 10.94% on Kodak over VTM-17.0 Intra, respectively. Our LLIC models achieve state-of-the-art performances and better trade-offs between performance and complexity.
comment: Accepted to IEEE Transactions on Multimedia 2024
♻ ☆ Benchmarking Pathology Feature Extractors for Whole Slide Image Classification
Weakly supervised whole slide image classification is a key task in computational pathology, which involves predicting a slide-level label from a set of image patches constituting the slide. Constructing models to solve this task involves multiple design choices, often made without robust empirical or conclusive theoretical justification. To address this, we conduct a comprehensive benchmarking of feature extractors to answer three critical questions: 1) Is stain normalisation still a necessary preprocessing step? 2) Which feature extractors are best for downstream slide-level classification? 3) How does magnification affect downstream performance? Our study constitutes the most comprehensive evaluation of publicly available pathology feature extractors to date, involving more than 10,000 training runs across 14 feature extractors, 9 tasks, 5 datasets, 3 downstream architectures, 2 levels of magnification, and various preprocessing setups. Our findings challenge existing assumptions: 1) We observe empirically, and by analysing the latent space, that skipping stain normalisation and image augmentations does not degrade performance, while significantly reducing memory and computational demands. 2) We develop a novel evaluation metric to compare relative downstream performance, and show that the choice of feature extractor is the most consequential factor for downstream performance. 3) We find that lower-magnification slides are sufficient for accurate slide-level classification. Contrary to previous patch-level benchmarking studies, our approach emphasises clinical relevance by focusing on slide-level biomarker prediction tasks in a weakly supervised setting with external validation cohorts. Our findings stand to streamline digital pathology workflows by minimising preprocessing needs and informing the selection of feature extractors.
comment: For the conference version see: arXiv:2311.11772v4. For the longer journal version with additional experiments see arXiv:2311.11772v5
♻ ☆ Efficient Perception, Planning, and Control Algorithm for Vision-Based Automated Vehicles
Autonomous vehicles have limited computational resources and thus require efficient control systems. The cost and size of sensors have limited the development of self-driving cars. To overcome these restrictions, this study proposes an efficient framework for the operation of vision-based automatic vehicles; the framework requires only a monocular camera and a few inexpensive radars. The proposed algorithm comprises a multi-task UNet (MTUNet) network for extracting image features and constrained iterative linear quadratic regulator (CILQR) and vision predictive control (VPC) modules for rapid motion planning and control. MTUNet is designed to simultaneously solve lane line segmentation, the ego vehicle's heading angle regression, road type classification, and traffic object detection tasks at approximately 40 FPS for 228 x 228 pixel RGB input images. The CILQR controllers then use the MTUNet outputs and radar data as inputs to produce driving commands for lateral and longitudinal vehicle guidance within only 1 ms. In particular, the VPC algorithm is included to reduce steering command latency to below actuator latency, preventing performance degradation during tight turns. The VPC algorithm uses road curvature data from MTUNet to estimate the appropriate correction for the current steering angle at a look-ahead point to adjust the turning amount. The inclusion of the VPC algorithm in a VPC-CILQR controller leads to higher performance on curvy roads than the use of CILQR alone. Our experiments demonstrate that the proposed autonomous driving system, which does not require high-definition maps, can be applied in current autonomous vehicles.
comment: 14 figures, 14 pages
♻ ☆ Advancing Fine-Grained Classification by Structure and Subject Preserving Augmentation SP
Fine-grained visual classification (FGVC) involves classifying closely related sub-classes. This task is difficult due to the subtle differences between classes and the high intra-class variance. Moreover, FGVC datasets are typically small and challenging to gather, thus highlighting a significant need for effective data augmentation. Recent advancements in text-to-image diffusion models offer new possibilities for augmenting classification datasets. While these models have been used to generate training data for classification tasks, their effectiveness in full-dataset training of FGVC models remains under-explored. Recent techniques that rely on Text2Image generation or Img2Img methods, often struggle to generate images that accurately represent the class while modifying them to a degree that significantly increases the dataset's diversity. To address these challenges, we present SaSPA: Structure and Subject Preserving Augmentation. Contrary to recent methods, our method does not use real images as guidance, thereby increasing generation flexibility and promoting greater diversity. To ensure accurate class representation, we employ conditioning mechanisms, specifically by conditioning on image edges and subject representation. We conduct extensive experiments and benchmark SaSPA against both traditional and recent generative data augmentation methods. SaSPA consistently outperforms all established baselines across multiple settings, including full dataset training, contextual bias, and few-shot classification. Additionally, our results reveal interesting patterns in using synthetic data for FGVC models; for instance, we find a relationship between the amount of real data used and the optimal proportion of synthetic data. Code is available at https://github.com/EyalMichaeli/SaSPA-Aug.
comment: Under review. Code is available at https://github.com/EyalMichaeli/SaSPA-Aug
♻ ☆ AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding
AI personal assistants deployed via robots or wearables require embodied understanding to collaborate with humans effectively. However, current Vision-Language Models (VLMs) primarily focus on third-person view videos, neglecting the richness of egocentric perceptual experience. To address this gap, we propose three key contributions. First, we introduce the Egocentric Video Understanding Dataset (EVUD) for training VLMs on video captioning and question answering tasks specific to egocentric videos. Second, we present AlanaVLM, a 7B parameter VLM trained using parameter-efficient methods on EVUD. Finally, we evaluate AlanaVLM's capabilities on OpenEQA, a challenging benchmark for embodied video question answering. Our model achieves state-of-the-art performance, outperforming open-source models including strong Socratic models using GPT-4 as a planner by 3.6%. Additionally, we outperform Claude 3 and Gemini Pro Vision 1.0 and showcase competitive results compared to Gemini Pro 1.5 and GPT-4V, even surpassing the latter in spatial reasoning. This research paves the way for building efficient VLMs that can be deployed in robots or wearables, leveraging embodied video understanding to collaborate seamlessly with humans in everyday tasks, contributing to the next generation of Embodied AI.
comment: Code available https://github.com/alanaai/EVUD
♻ ☆ The ULS23 Challenge: a Baseline Model and Benchmark Dataset for 3D Universal Lesion Segmentation in Computed Tomography
Size measurements of tumor manifestations on follow-up CT examinations are crucial for evaluating treatment outcomes in cancer patients. Efficient lesion segmentation can speed up these radiological workflows. While numerous benchmarks and challenges address lesion segmentation in specific organs like the liver, kidneys, and lungs, the larger variety of lesion types encountered in clinical practice demands a more universal approach. To address this gap, we introduced the ULS23 benchmark for 3D universal lesion segmentation in chest-abdomen-pelvis CT examinations. The ULS23 training dataset contains 38,693 lesions across this region, including challenging pancreatic, colon and bone lesions. For evaluation purposes, we curated a dataset comprising 775 lesions from 284 patients. Each of these lesions was identified as a target lesion in a clinical context, ensuring diversity and clinical relevance within this dataset. The ULS23 benchmark is publicly accessible via uls23.grand-challenge.org, enabling researchers worldwide to assess the performance of their segmentation methods. Furthermore, we have developed and publicly released our baseline semi-supervised 3D lesion segmentation model. This model achieved an average Dice coefficient of 0.703 $\pm$ 0.240 on the challenge test set. We invite ongoing submissions to advance the development of future ULS models.
♻ ☆ Graph Neural Networks in Histopathology: Emerging Trends and Future Directions
Histopathological analysis of Whole Slide Images (WSIs) has seen a surge in the utilization of deep learning methods, particularly Convolutional Neural Networks (CNNs). However, CNNs often fall short in capturing the intricate spatial dependencies inherent in WSIs. Graph Neural Networks (GNNs) present a promising alternative, adept at directly modeling pairwise interactions and effectively discerning the topological tissue and cellular structures within WSIs. Recognizing the pressing need for deep learning techniques that harness the topological structure of WSIs, the application of GNNs in histopathology has experienced rapid growth. In this comprehensive review, we survey GNNs in histopathology, discuss their applications, and explore emerging trends that pave the way for future advancements in the field. We begin by elucidating the fundamentals of GNNs and their potential applications in histopathology. Leveraging quantitative literature analysis, we identify four emerging trends: Hierarchical GNNs, Adaptive Graph Structure Learning, Multimodal GNNs, and Higher-order GNNs. Through an in-depth exploration of these trends, we offer insights into the evolving landscape of GNNs in histopathological analysis. Based on our findings, we propose future directions to propel the field forward. Our analysis serves to guide researchers and practitioners towards innovative approaches and methodologies, fostering advancements in histopathological analysis through the lens of graph neural networks.
♻ ☆ Retrieval-Augmented Generation for AI-Generated Content: A Survey
Advancements in model algorithms, the growth of foundational models, and access to high-quality datasets have propelled the evolution of Artificial Intelligence Generated Content (AIGC). Despite its notable successes, AIGC still faces hurdles such as updating knowledge, handling long-tail data, mitigating data leakage, and managing high training and inference costs. Retrieval-Augmented Generation (RAG) has recently emerged as a paradigm to address such challenges. In particular, RAG introduces the information retrieval process, which enhances the generation process by retrieving relevant objects from available data stores, leading to higher accuracy and better robustness. In this paper, we comprehensively review existing efforts that integrate RAG technique into AIGC scenarios. We first classify RAG foundations according to how the retriever augments the generator, distilling the fundamental abstractions of the augmentation methodologies for various retrievers and generators. This unified perspective encompasses all RAG scenarios, illuminating advancements and pivotal technologies that help with potential future progress. We also summarize additional enhancements methods for RAG, facilitating effective engineering and implementation of RAG systems. Then from another view, we survey on practical applications of RAG across different modalities and tasks, offering valuable references for researchers and practitioners. Furthermore, we introduce the benchmarks for RAG, discuss the limitations of current RAG systems, and suggest potential directions for future research. Github: https://github.com/PKU-DAIR/RAG-Survey.
comment: Citing 353 papers, 22 pages, 1 table, 12 figures. Project: https://github.com/PKU-DAIR/RAG-Survey
♻ ☆ Enhanced Object Detection: A Study on Vast Vocabulary Object Detection Track for V3Det Challenge 2024
In this technical report, we present our findings from the research conducted on the Vast Vocabulary Visual Detection (V3Det) dataset for Supervised Vast Vocabulary Visual Detection task. How to deal with complex categories and detection boxes has become a difficulty in this track. The original supervised detector is not suitable for this task. We have designed a series of improvements, including adjustments to the network structure, changes to the loss function, and design of training strategies. Our model has shown improvement over the baseline and achieved excellent rankings on the Leaderboard for both the Vast Vocabulary Object Detection (Supervised) track and the Open Vocabulary Object Detection (OVD) track of the V3Det Challenge 2024.
♻ ☆ Open-Vocabulary SAM3D: Understand Any 3D Scene
Open-vocabulary 3D scene understanding presents a significant challenge in the field. Recent advancements have sought to transfer knowledge embedded in vision language models from the 2D domain to 3D domain. However, these approaches often require learning prior knowledge from specific 3D scene datasets, which limits their applicability in open-world scenarios. The Segment Anything Model (SAM) has demonstrated remarkable zero-shot segmentation capabilities, prompting us to investigate its potential for comprehending 3D scenes without the need for training. In this paper, we introduce OV-SAM3D, a universal framework for open-vocabulary 3D scene understanding. This framework is designed to perform understanding tasks for any 3D scene without requiring prior knowledge of the scene. Specifically, our method is composed of two key sub-modules: First, we initiate the process by generating superpoints as the initial 3D prompts and refine these prompts using segment masks derived from SAM. Moreover, we then integrate a specially designed overlapping score table with open tags from the Recognize Anything Model (RAM) to produce final 3D instances with open-world label. Empirical evaluations conducted on the ScanNet200 and nuScenes datasets demonstrate that our approach surpasses existing open-vocabulary methods in unknown open-world environments.
comment: Project page: https://hithqd.github.io/projects/OV-SAM3D
♻ ☆ Deep-learning-based groupwise registration for motion correction of cardiac $T_1$ mapping MICCAI 2024
Quantitative $T_1$ mapping by MRI is an increasingly important tool for clinical assessment of cardiovascular diseases. The cardiac $T_1$ map is derived by fitting a known signal model to a series of baseline images, while the quality of this map can be deteriorated by involuntary respiratory and cardiac motion. To correct motion, a template image is often needed to register all baseline images, but the choice of template is nontrivial, leading to inconsistent performance sensitive to image contrast. In this work, we propose a novel deep-learning-based groupwise registration framework, which omits the need for a template, and registers all baseline images simultaneously. We design two groupwise losses for this registration framework: the first is a linear principal component analysis (PCA) loss that enforces alignment of baseline images irrespective of the intensity variation, and the second is an auxiliary relaxometry loss that enforces adherence of intensity profile to the signal model. We extensively evaluated our method, termed ``PCA-Relax'', and other baseline methods on an in-house cardiac MRI dataset including both pre- and post-contrast $T_1$ sequences. All methods were evaluated under three distinct training-and-evaluation strategies, namely, standard, one-shot, and test-time-adaptation. The proposed PCA-Relax showed further improved performance of registration and mapping over well-established baselines. The proposed groupwise framework is generic and can be adapted to applications involving multiple images.
comment: MICCAI 2024. Contents may slightly differ from the camera-ready version
♻ ☆ RoMe: Towards Large Scale Road Surface Reconstruction via Mesh Representation
In autonomous driving applications, accurate and efficient road surface reconstruction is paramount. This paper introduces RoMe, a novel framework designed for the robust reconstruction of large-scale road surfaces. Leveraging a unique mesh representation, RoMe ensures that the reconstructed road surfaces are accurate and seamlessly aligned with semantics. To address challenges in computational efficiency, we propose a waypoint sampling strategy, enabling RoMe to reconstruct vast environments by focusing on sub-areas and subsequently merging them. Furthermore, we incorporate an extrinsic optimization module to enhance the robustness against inaccuracies in extrinsic calibration. Our extensive evaluations of both public datasets and wild data underscore RoMe's superiority in terms of speed, accuracy, and robustness. For instance, it costs only 2 GPU hours to recover a road surface of 600*600 square meters from thousands of images. Notably, RoMe's capability extends beyond mere reconstruction, offering significant value for autolabeling tasks in autonomous driving applications. All related data and code are available at https://github.com/DRosemei/RoMe.
comment: Published in: IEEE Transactions on Intelligent Vehicles
♻ ☆ GridFormer: Residual Dense Transformer with Grid Structure for Image Restoration in Adverse Weather Conditions
Image restoration in adverse weather conditions is a difficult task in computer vision. In this paper, we propose a novel transformer-based framework called GridFormer which serves as a backbone for image restoration under adverse weather conditions. GridFormer is designed in a grid structure using a residual dense transformer block, and it introduces two core designs. First, it uses an enhanced attention mechanism in the transformer layer. The mechanism includes stages of the sampler and compact self-attention to improve efficiency, and a local enhancement stage to strengthen local information. Second, we introduce a residual dense transformer block (RDTB) as the final GridFormer layer. This design further improves the network's ability to learn effective features from both preceding and current local features. The GridFormer framework achieves state-of-the-art results on five diverse image restoration tasks in adverse weather conditions, including image deraining, dehazing, deraining \& dehazing, desnowing, and multi-weather restoration. The source code and pre-trained models are available at https://github.com/TaoWangzj/GridFormer.
comment: 20 pages, 15 figures, accepted by IJCV
♻ ☆ Asynchronous Large Language Model Enhanced Planner for Autonomous Driving
Despite real-time planners exhibiting remarkable performance in autonomous driving, the growing exploration of Large Language Models (LLMs) has opened avenues for enhancing the interpretability and controllability of motion planning. Nevertheless, LLM-based planners continue to encounter significant challenges, including elevated resource consumption and extended inference times, which pose substantial obstacles to practical deployment. In light of these challenges, we introduce AsyncDriver, a new asynchronous LLM-enhanced closed-loop framework designed to leverage scene-associated instruction features produced by LLM to guide real-time planners in making precise and controllable trajectory predictions. On one hand, our method highlights the prowess of LLMs in comprehending and reasoning with vectorized scene data and a series of routing instructions, demonstrating its effective assistance to real-time planners. On the other hand, the proposed framework decouples the inference processes of the LLM and real-time planners. By capitalizing on the asynchronous nature of their inference frequencies, our approach have successfully reduced the computational cost introduced by LLM, while maintaining comparable performance. Experiments show that our approach achieves superior closed-loop evaluation performance on nuPlan's challenging scenarios.
♻ ☆ TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones ICML
In recent years, multimodal large language models (MLLMs) such as GPT-4V have demonstrated remarkable advancements, excelling in a variety of vision-language tasks. Despite their prowess, the closed-source nature and computational demands of such models limit their accessibility and applicability. This study introduces TinyGPT-V, a novel open-source MLLM, designed for efficient training and inference across various vision-language tasks, including image captioning (IC) and visual question answering (VQA). Leveraging a compact yet powerful architecture, TinyGPT-V integrates the Phi-2 language model with pre-trained vision encoders, utilizing a unique mapping module for visual and linguistic information fusion. With a training regimen optimized for small backbones and employing a diverse dataset amalgam, TinyGPT-V requires significantly lower computational resources 24GB for training and as little as 8GB for inference without compromising on performance. Our experiments demonstrate that TinyGPT-V, with its language model 2.8 billion parameters, achieves comparable results in VQA and image inference tasks to its larger counterparts while being uniquely suited for deployment on resource-constrained devices through innovative quantization techniques. This work not only paves the way for more accessible and efficient MLLMs but also underscores the potential of smaller, optimized models in bridging the gap between high performance and computational efficiency in real-world applications. Additionally, this paper introduces a new approach to multimodal large language models using smaller backbones. Our code and training weights are available in the supplementary material.
comment: Accepted by ICML workshop 2024
♻ ☆ E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion
Online GUI navigation on mobile devices has driven a lot of attention recent years since it contributes to many real-world applications. With the rapid development of large language models (LLM), multimodal large language models (MLLM) have tremendous potential on this task. However, existing MLLMs need high quality data to improve its abilities of making the correct navigation decisions according to the human user inputs. In this paper, we developed a novel and highly valuable dataset, named \textbf{E-ANT}, as the first Chinese GUI navigation dataset that contains real human behaviour and high quality screenshots with annotations, containing nearly 40,000 real human traces over 5000+ different tinyAPPs. Furthermore, we evaluate various powerful MLLMs on E-ANT and show their experiments results with sufficient ablations. We believe that our proposed dataset will be beneficial for both the evaluation and development of GUI navigation and LLM/MLLM decision-making capabilities.
comment: 9 pages, 5 figures, Under review
♻ ☆ VGA: Vision GUI Assistant -- Minimizing Hallucinations through Image-Centric Fine-Tuning
Recent advances in Large Vision-Language Models (LVLMs) have significantly improve performance in image comprehension tasks, such as formatted charts and rich-content images. Yet, Graphical User Interface (GUI) pose a greater challenge due to their structured format and detailed textual information. Existing LVLMs often overly depend on internal knowledge and neglect image content, resulting in hallucinations and incorrect responses in GUI comprehension. To address these issues, we introduce VGA, a fine-tuned model designed for comprehensive GUI understanding. Our model aims to enhance the interpretation of visual data of GUI and reduce hallucinations. We first construct a Vision Question Answering (VQA) dataset of 63.8k high-quality examples with our propose Referent Method, which ensures the model's responses are highly depend on visual content within the image. We then design a two-stage fine-tuning method called Foundation and Advanced Comprehension (FAC) to enhance both the model's ability to extract information from image content and alignment with human intent. Experiments show that our approach enhances the model's ability to extract information from images and achieves state-of-the-art results in GUI understanding tasks. Our dataset and fine-tuning script will be released soon.
comment: 18 pages
♻ ☆ ZoomNeXt: A Unified Collaborative Pyramid Network for Camouflaged Object Detection
Recent camouflaged object detection (COD) attempts to segment objects visually blended into their surroundings, which is extremely complex and difficult in real-world scenarios. Apart from the high intrinsic similarity between camouflaged objects and their background, objects are usually diverse in scale, fuzzy in appearance, and even severely occluded. To this end, we propose an effective unified collaborative pyramid network that mimics human behavior when observing vague images and videos, \ie zooming in and out. Specifically, our approach employs the zooming strategy to learn discriminative mixed-scale semantics by the multi-head scale integration and rich granularity perception units, which are designed to fully explore imperceptible clues between candidate objects and background surroundings. The former's intrinsic multi-head aggregation provides more diverse visual patterns. The latter's routing mechanism can effectively propagate inter-frame differences in spatiotemporal scenarios and be adaptively deactivated and output all-zero results for static representations. They provide a solid foundation for realizing a unified architecture for static and dynamic COD. Moreover, considering the uncertainty and ambiguity derived from indistinguishable textures, we construct a simple yet effective regularization, uncertainty awareness loss, to encourage predictions with higher confidence in candidate regions. Our highly task-friendly framework consistently outperforms existing state-of-the-art methods in image and video COD benchmarks.
comment: Extensions to the conference version accepted by TPAMI 2024. arXiv admin note: substantial text overlap with arXiv:2203.02688
♻ ☆ $α$-SSC: Uncertainty-Aware Camera-based 3D Semantic Scene Completion
In the realm of autonomous vehicle (AV) perception, comprehending 3D scenes is paramount for tasks such as planning and mapping. Semantic scene completion (SSC) aims to infer scene geometry and semantics from limited observations. While camera-based SSC has gained popularity due to affordability and rich visual cues, existing methods often neglect the inherent uncertainty in models. To address this, we propose an uncertainty-aware camera-based 3D semantic scene completion method ($\alpha$-SSC). Our approach includes an uncertainty propagation framework from depth models (Depth-UP) to enhance geometry completion (up to 11.58% improvement) and semantic segmentation (up to 14.61% improvement). Additionally, we propose a hierarchical conformal prediction (HCP) method to quantify SSC uncertainty, effectively addressing high-level class imbalance in SSC datasets. On the geometry level, we present a novel KL divergence-based score function that significantly improves the occupied recall of safety-critical classes (45% improvement) with minimal performance overhead (3.4% reduction). For uncertainty quantification, we demonstrate the ability to achieve smaller prediction set sizes while maintaining a defined coverage guarantee. Compared with baselines, it achieves up to 85% reduction in set sizes. Our contributions collectively signify significant advancements in SSC accuracy and robustness, marking a noteworthy step forward in autonomous perception systems.
comment: 12 pages, 4 figures
♻ ☆ What if...?: Thinking Counterfactual Keywords Helps to Mitigate Hallucination in Large Multi-modal Models
This paper presents a way of enhancing the reliability of Large Multi-modal Models (LMMs) in addressing hallucination, where the models generate cross-modal inconsistent responses. Without additional training, we propose Counterfactual Inception, a novel method that implants counterfactual thinking into LMMs using self-generated counterfactual keywords. Our method is grounded in the concept of counterfactual thinking, a cognitive process where human considers alternative realities, enabling more extensive context exploration. Bridging the human cognition mechanism into LMMs, we aim for the models to engage with and generate responses that span a wider contextual scene understanding, mitigating hallucinatory outputs. We further introduce Plausibility Verification Process (PVP), a simple yet robust keyword constraint that effectively filters out sub-optimal keywords to enable the consistent triggering of counterfactual thinking in the model responses. Comprehensive analyses across various LMMs, including both open-source and proprietary models, corroborate that counterfactual thinking significantly reduces hallucination and helps to broaden contextual understanding based on true visual clues.
comment: Project page: https://ivy-lvlm.github.io/Counterfactual-Inception/
♻ ☆ Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models
Large language models (LLMs) based on decoder-only transformers have demonstrated superior text understanding capabilities compared to CLIP and T5-series models. However, the paradigm for utilizing current advanced LLMs in text-to-image diffusion models remains to be explored. We observed an unusual phenomenon: directly using a large language model as the prompt encoder significantly degrades the prompt-following ability in image generation. We identified two main obstacles behind this issue. One is the misalignment between the next token prediction training in LLM and the requirement for discriminative prompt features in diffusion models. The other is the intrinsic positional bias introduced by the decoder-only architecture. To deal with this issue, we propose a novel framework to fully harness the capabilities of LLMs. Through the carefully designed usage guidance, we effectively enhance the text representation capability for prompt encoding and eliminate its inherent positional bias. This allows us to integrate state-of-the-art LLMs into the text-to-image generation model flexibly. Furthermore, we also provide an effective manner to fuse multiple LLMs into our framework. Considering the excellent performance and scaling capabilities demonstrated by the transformer architecture, we further design an LLM-Infused Diffusion Transformer (LI-DiT) based on the framework. We conduct extensive experiments to validate LI-DiT across model size and data size. Benefiting from the inherent ability of the LLMs and our innovative designs, the prompt understanding performance of LI-DiT easily surpasses state-of-the-art open-source models as well as mainstream closed-source commercial models including Stable Diffusion 3, DALL-E 3, and Midjourney V6. The powerful LI-DiT-10B will be available through the online platform and API after further optimization and security checks.
♻ ☆ Leveraging Fine-Grained Information and Noise Decoupling for Remote Sensing Change Detection
Change detection aims to identify remote sense object changes by analyzing data between bitemporal image pairs. Due to the large temporal and spatial span of data collection in change detection image pairs, there are often a significant amount of task-specific and task-agnostic noise. Previous effort has focused excessively on denoising, with this goes a great deal of loss of fine-grained information. In this paper, we revisit the importance of fine-grained features in change detection and propose a series of operations for fine-grained information compensation and noise decoupling (FINO). First, the context is utilized to compensate for the fine-grained information in the feature space. Next, a shape-aware and a brightness-aware module are designed to improve the capacity for representation learning. The shape-aware module guides the backbone for more precise shape estimation, guiding the backbone network in extracting object shape features. The brightness-aware module learns a overall brightness estimation to improve the model's robustness to task-agnostic noise. Finally, a task-specific noise decoupling structure is designed as a way to improve the model's ability to separate noise interference from feature similarity. With these training schemes, our proposed method achieves new state-of-the-art (SOTA) results in multiple change detection benchmarks. The code will be made available.
♻ ☆ Vision Language Modeling of Content, Distortion and Appearance for Image Quality Assessment
The visual quality of an image is confounded by a number of intertwined factors including its semantic content, distortion characteristics and appearance properties such as brightness, contrast, sharpness, and colourfulness. Distilling high level knowledge about all these quality bearing attributes is crucial for developing objective Image Quality Assessment (IQA).While existing solutions have modeled some of these aspects, a comprehensive solution that involves all these important quality related attributes has not yet been developed. In this paper, we present a new blind IQA (BIQA) model termed Self-supervision and Vision-Language supervision Image QUality Evaluator (SLIQUE) that features a joint vision-language and visual contrastive representation learning framework for acquiring high level knowledge about the images semantic contents, distortion characteristics and appearance properties for IQA. For training SLIQUE, we have developed a systematic approach to constructing a first of its kind large image database annotated with all three categories of quality relevant texts. The Text Annotated Distortion, Appearance and Content (TADAC) database has over 1.6 million images annotated with textual descriptions of their semantic contents, distortion characteristics and appearance properties. The method for constructing TADAC and the database itself will be particularly useful for exploiting vision-language modeling for advanced IQA applications. Extensive experimental results show that SLIQUE has superior performances over state of the art, demonstrating the soundness of its design principle and the effectiveness of its implementation.
♻ ☆ Dynamic 3D Point Cloud Sequences as 2D Videos
Dynamic 3D point cloud sequences serve as one of the most common and practical representation modalities of dynamic real-world environments. However, their unstructured nature in both spatial and temporal domains poses significant challenges to effective and efficient processing. Existing deep point cloud sequence modeling approaches imitate the mature 2D video learning mechanisms by developing complex spatio-temporal point neighbor grouping and feature aggregation schemes, often resulting in methods lacking effectiveness, efficiency, and expressive power. In this paper, we propose a novel generic representation called \textit{Structured Point Cloud Videos} (SPCVs). Intuitively, by leveraging the fact that 3D geometric shapes are essentially 2D manifolds, SPCV re-organizes a point cloud sequence as a 2D video with spatial smoothness and temporal consistency, where the pixel values correspond to the 3D coordinates of points. The structured nature of our SPCV representation allows for the seamless adaptation of well-established 2D image/video techniques, enabling efficient and effective processing and analysis of 3D point cloud sequences. To achieve such re-organization, we design a self-supervised learning pipeline that is geometrically regularized and driven by self-reconstructive and deformation field learning objectives. Additionally, we construct SPCV-based frameworks for both low-level and high-level 3D point cloud sequence processing and analysis tasks, including action recognition, temporal interpolation, and compression. Extensive experiments demonstrate the versatility and superiority of the proposed SPCV, which has the potential to offer new possibilities for deep learning on unstructured 3D point cloud sequences. Code will be released at https://github.com/ZENGYIMING-EAMON/SPCV.
comment: The manuscript has been accepted by IEEE TPAMI in 2024
♻ ☆ Composite Concept Extraction through Backdooring
Learning composite concepts, such as \textquotedbl red car\textquotedbl , from individual examples -- like a white car representing the concept of \textquotedbl car\textquotedbl{} and a red strawberry representing the concept of \textquotedbl red\textquotedbl -- is inherently challenging. This paper introduces a novel method called Composite Concept Extractor (CoCE), which leverages techniques from traditional backdoor attacks to learn these composite concepts in a zero-shot setting, requiring only examples of individual concepts. By repurposing the trigger-based model backdooring mechanism, we create a strategic distortion in the manifold of the target object (e.g., \textquotedbl car\textquotedbl ) induced by example objects with the target property (e.g., \textquotedbl red\textquotedbl ) from objects \textquotedbl red strawberry\textquotedbl , ensuring the distortion selectively affects the target objects with the target property. Contrastive learning is then employed to further refine this distortion, and a method is formulated for detecting objects that are influenced by the distortion. Extensive experiments with in-depth analysis across different datasets demonstrate the utility and applicability of our proposed approach.
♻ ☆ Connect Later: Improving Fine-tuning for Robustness with Targeted Augmentations ICML 2024
Models trained on a labeled source domain (e.g., labeled images from wildlife camera traps) often generalize poorly when deployed on an out-of-distribution (OOD) target domain (e.g., images from new camera trap locations). In the domain adaptation setting where unlabeled target data is available, self-supervised pretraining (e.g., masked autoencoding or contrastive learning) is a promising method to mitigate this performance drop. Pretraining improves OOD error when the generic data augmentations used (e.g., masking or cropping) connect the source and target domains, which may be far apart in the input space. In this paper, we show on real-world tasks that standard fine-tuning after pretraining does not consistently improve OOD error over simply training from scratch on labeled source data. To better leverage pretraining for distribution shifts, we propose Connect Later: after pretraining with generic augmentations, fine-tune with targeted augmentations designed with knowledge of the distribution shift. Pretraining learns good representations within the source and target domains, while targeted augmentations connect the domains better during fine-tuning. Connect Later improves average OOD error over standard fine-tuning and supervised learning with targeted augmentations on 4 real-world datasets: Connect Later achieves the state-of-the-art on astronomical time-series classification (AstroClassification) by 2.5%, wildlife species identification (iWildCam-WILDS) with ResNet-50 by 0.9%, and tumor identification (Camelyon17-WILDS) with DenseNet121 by 1.1%; as well as best performance on a new dataset for astronomical time-series redshift prediction (Redshifts) by 0.03 RMSE (11% relative). Code and datasets are available at https://github.com/helenqu/connect-later.
comment: ICML 2024
♻ ☆ SHAN: Object-Level Privacy Detection via Inference on Scene Heterogeneous Graph
With the rise of social platforms, protecting privacy has become an important issue. Privacy object detection aims to accurately locate private objects in images. It is the foundation of safeguarding individuals' privacy rights and ensuring responsible data handling practices in the digital age. Since privacy of object is not shift-invariant, the essence of the privacy object detection task is inferring object privacy based on scene information. However, privacy object detection has long been studied as a subproblem of common object detection tasks. Therefore, existing methods suffer from serious deficiencies in accuracy, generalization, and interpretability. Moreover, creating large-scale privacy datasets is difficult due to legal constraints and existing privacy datasets lack label granularity. The granularity of existing privacy detection methods remains limited to the image level. To address the above two issues, we introduce two benchmark datasets for object-level privacy detection and propose SHAN, Scene Heterogeneous graph Attention Network, a model constructs a scene heterogeneous graph from an image and utilizes self-attention mechanisms for scene inference to obtain object privacy. Through experiments, we demonstrated that SHAN performs excellently in privacy object detection tasks, with all metrics surpassing those of the baseline model.
comment: I have already updated a new version, you can check it here: arXiv:2406.12736
♻ ☆ Image anomaly detection and prediction scheme based on SSA optimized ResNet50-BiGRU model
Image anomaly detection is a popular research direction, with many methods emerging in recent years due to rapid advancements in computing. The use of artificial intelligence for image anomaly detection has been widely studied. By analyzing images of athlete posture and movement, it is possible to predict injury status and suggest necessary adjustments. Most existing methods rely on convolutional networks to extract information from irrelevant pixel data, limiting model accuracy. This paper introduces a network combining Residual Network (ResNet) and Bidirectional Gated Recurrent Unit (BiGRU), which can predict potential injury types and provide early warnings by analyzing changes in muscle and bone poses from video images. To address the high complexity of this network, the Sparrow search algorithm was used for optimization. Experiments conducted on four datasets demonstrated that our model has the smallest error in image anomaly detection compared to other models, showing strong adaptability. This provides a new approach for anomaly detection and predictive analysis in images, contributing to the sustainable development of human health and performance.
♻ ☆ MonoBox: Tightness-free Box-supervised Polyp Segmentation using Monotonicity Constraint
We propose MonoBox, an innovative box-supervised segmentation method constrained by monotonicity to liberate its training from the user-unfriendly box-tightness assumption. In contrast to conventional box-supervised segmentation, where the box edges must precisely touch the target boundaries, MonoBox leverages imprecisely-annotated boxes to achieve robust pixel-wise segmentation. The 'linchpin' is that, within the noisy zones around box edges, MonoBox discards the traditional misguiding multiple-instance learning loss, and instead optimizes a carefully-designed objective, termed monotonicity constraint. Along directions transitioning from the foreground to background, this new constraint steers responses to adhere to a trend of monotonically decreasing values. Consequently, the originally unreliable learning within the noisy zones is transformed into a correct and effective monotonicity optimization. Moreover, an adaptive label correction is introduced, enabling MonoBox to enhance the tightness of box annotations using predicted masks from the previous epoch and dynamically shrink the noisy zones as training progresses. We verify MonoBox in the box-supervised segmentation task of polyps, where satisfying box-tightness is challenging due to the vague boundaries between the polyp and normal tissues. Experiments on both public synthetic and in-house real noisy datasets demonstrate that MonoBox exceeds other anti-noise state-of-the-arts by improving Dice by at least 5.5% and 3.3%, respectively. Codes are at https://github.com/Huster-Hq/MonoBox.
♻ ☆ GeoBench: Benchmarking and Analyzing Monocular Geometry Estimation Models
Recent advances in discriminative and generative pretraining have yielded geometry estimation models with strong generalization capabilities. While discriminative monocular geometry estimation methods rely on large-scale fine-tuning data to achieve zero-shot generalization, several generative-based paradigms show the potential of achieving impressive generalization performance on unseen scenes by leveraging pre-trained diffusion models and fine-tuning on even a small scale of synthetic training data. Frustratingly, these models are trained with different recipes on different datasets, making it hard to find out the critical factors that determine the evaluation performance. Besides, current geometry evaluation benchmarks have two main drawbacks that may prevent the development of the field, i.e., limited scene diversity and unfavorable label quality. To resolve the above issues, (1) we build fair and strong baselines in a unified codebase for evaluating and analyzing the geometry estimation models; (2) we evaluate monocular geometry estimators on more challenging benchmarks for geometry estimation task with diverse scenes and high-quality annotations. Our results reveal that pre-trained using large data, discriminative models such as DINOv2, can outperform generative counterparts with a small amount of high-quality synthetic data under the same training configuration, which suggests that fine-tuning data quality is a more important factor than the data scale and model architecture. Our observation also raises a question: if simply fine-tuning a general vision model such as DINOv2 using a small amount of synthetic depth data produces SOTA results, do we really need complex generative models for depth estimation? We believe this work can propel advancements in geometry estimation tasks as well as a wide range of downstream applications.
comment: Code and Benchmark are available at: https://github.com/aim-uofa/GeoBench
♻ ☆ Towards Enhanced Analysis of Lung Cancer Lesions in EBUS-TBNA -- A Semi-Supervised Video Object Detection Method
This study aims to establish a computer-aided diagnostic system for lung lesions using endobronchial ultrasound (EBUS) to assist physicians in identifying lesion areas. During EBUS-transbronchial needle aspiration (EBUS-TBNA) procedures, hysicians rely on grayscale ultrasound images to determine the location of lesions. However, these images often contain significant noise and can be influenced by surrounding tissues or blood vessels, making identification challenging. Previous research has lacked the application of object detection models to EBUS-TBNA, and there has been no well-defined solution for the lack of annotated data in the EBUS-TBNA dataset. In related studies on ultrasound images, although models have been successful in capturing target regions for their respective tasks, their training and predictions have been based on two-dimensional images, limiting their ability to leverage temporal features for improved predictions. This study introduces a three-dimensional video-based object detection model. It first generates a set of improved queries using a diffusion model, then captures temporal correlations through an attention mechanism. A filtering mechanism selects relevant information from previous frames to pass to the current frame. Subsequently, a teacher-student model training approach is employed to further optimize the model using unlabeled data. By incorporating various data augmentation and feature alignment, the model gains robustness against interference. Test results demonstrate that this model, which captures spatiotemporal information and employs semi-supervised learning methods, achieves an Average Precision (AP) of 48.7 on the test dataset, outperforming other models. It also achieves an Average Recall (AR) of 79.2, significantly leading over existing models.
♻ ☆ UniUSNet: A Promptable Framework for Universal Ultrasound Disease Prediction and Tissue Segmentation
Ultrasound is a widely used imaging modality in clinical practice due to its low cost, portability, and safety. Current research in general AI for healthcare focuses on large language models and general segmentation models, with insufficient attention to solutions addressing both disease prediction and tissue segmentation. In this study, we propose a novel universal framework for ultrasound, namely UniUSNet, which is a promptable framework for ultrasound image classification and segmentation. The universality of this model is derived from its versatility across various aspects. It proficiently manages any ultrasound nature, any anatomical position, any input type and excelling not only in segmentation tasks but also in classification tasks. We introduce a novel module that incorporates this information as a prompt and seamlessly embedding it within the model's learning process. To train and validate our proposed model, we curated a comprehensive ultrasound dataset from publicly accessible sources, encompassing up to 7 distinct anatomical positions with over 9.7K annotations. Experimental results demonstrate that our model achieves performance comparable to state-of-the-art models, and surpasses both a model trained on a single dataset and an ablated version of the network lacking prompt guidance. Additionally, we conducted zero-shot and fine-tuning experiments on new datasets, which proved that our model possesses strong generalization capabilities and can be effectively adapted to new data at low cost through its adapter module. We will continuously expand the dataset and optimize the task specific prompting mechanism towards the universality in medical ultrasound. Model weights, data processing workflows, and code will be open source to the public (https://github.com/Zehui-Lin/UniUSNet).
♻ ☆ Weakly Supervised YOLO Network for Surgical Instrument Localization in Endoscopic Videos ICRA 2024
In minimally invasive surgery, surgical instrument localization is a crucial task for endoscopic videos, which enables various applications for improving surgical outcomes. However, annotating the instrument localization in endoscopic videos is tedious and labor-intensive. In contrast, obtaining the category information is easy and efficient in real-world applications. To fully utilize the category information and address the localization problem, we propose a weakly supervised localization framework named WS-YOLO for surgical instruments. By leveraging the instrument category information as the weak supervision, our WS-YOLO framework adopts an unsupervised multi-round training strategy for the localization capability training. We validate our WS-YOLO framework on the Endoscopic Vision Challenge 2023 dataset, which achieves remarkable performance in the weakly supervised surgical instrument localization. The source code is available at https://github.com/Breezewrf/WS-YOLO.
comment: Accepted by ICRA 2024 Workshop on C4 Surgical Robotic Systems in the Embodied AI Era; Surgical Tool Localization in Endoscopic Videos Challenge of MICCAI2023
♻ ☆ CMTNet: Convolutional Meets Transformer Network for Hyperspectral Images Classification
Hyperspectral remote sensing (HIS) enables the detailed capture of spectral information from the Earth's surface, facilitating precise classification and identification of surface crops due to its superior spectral diagnostic capabilities. However, current convolutional neural networks (CNNs) focus on local features in hyperspectral data, leading to suboptimal performance when classifying intricate crop types and addressing imbalanced sample distributions. In contrast, the Transformer framework excels at extracting global features from hyperspectral imagery. To leverage the strengths of both approaches, this research introduces the Convolutional Meet Transformer Network (CMTNet). This innovative model includes a spectral-spatial feature extraction module for shallow feature capture, a dual-branch structure combining CNN and Transformer branches for local and global feature extraction, and a multi-output constraint module that enhances classification accuracy through multi-output loss calculations and cross constraints across local, international, and joint features. Extensive experiments conducted on three datasets (WHU-Hi-LongKou, WHU-Hi-HanChuan, and WHU-Hi-HongHu) demonstrate that CTDBNet significantly outperforms other state-of-the-art networks in classification performance, validating its effectiveness in hyperspectral crop classification.
comment: 15 pages, 11figures
♻ ☆ Gap-closing Matters: Perceptual Quality Evaluation and Optimization of Low-Light Image Enhancement
There is a growing consensus in the research community that the optimization of low-light image enhancement approaches should be guided by the visual quality perceived by end users. Despite the substantial efforts invested in the design of low-light enhancement algorithms, there has been comparatively limited focus on assessing subjective and objective quality systematically. To mitigate this gap and provide a clear path towards optimizing low-light image enhancement for better visual quality, we propose a gap-closing framework. In particular, our gap-closing framework starts with the creation of a large-scale dataset for Subjective QUality Assessment of REconstructed LOw-Light Images (SQUARE-LOL). This database serves as the foundation for studying the quality of enhanced images and conducting a comprehensive subjective user study. Subsequently, we propose an objective quality assessment measure that plays a critical role in bridging the gap between visual quality and enhancement. Finally, we demonstrate that our proposed objective quality measure can be incorporated into the process of optimizing the learning of the enhancement model toward perceptual optimality. We validate the effectiveness of our proposed framework through both the accuracy of quality prediction and the perceptual quality of image enhancement. Our database and codes are publicly available at https://github.com/Baoliang93/IACA_For_Lowlight_IQA.
comment: Basis Angle Consistency in Sec.3.2 will be revised
Computation and Language 107
☆ A SMART Mnemonic Sounds like "Glue Tonic": Mixing LLMs with Student Feedback to Make Mnemonic Learning Stick
Keyword mnemonics are memorable explanations that link new terms to simpler keywords. Prior works generate mnemonics for students, but they do not guide models toward mnemonics students prefer and aid learning. We build SMART, a mnemonic generator trained on feedback from real students learning new terms. To train SMART, we first fine-tune LLaMA-2 on a curated set of user-written mnemonics. We then use LLM alignment to enhance SMART: we deploy mnemonics generated by SMART in a flashcard app to find preferences on mnemonics students favor. We gather 2684 preferences from 45 students across two types: expressed (inferred from ratings) and observed (inferred from student learning), yielding three key findings. First, expressed and observed preferences disagree; what students think is helpful does not fully capture what is truly helpful. Second, Bayesian models can synthesize complementary data from multiple preference types into a single effectiveness signal. SMART is tuned via Direct Preference Optimization on this signal, which we show resolves ties and missing labels in the typical method of pairwise comparisons, augmenting data for LLM output quality gains. Third, mnemonic experts assess SMART as matching GPT-4, at much lower deployment costs, showing the utility of capturing diverse student feedback to align LLMs in education.
comment: In-Progress Preprint
☆ Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning
The recent success of interleaved Large Multimodal Models (LMMs) in few-shot learning suggests that in-context learning (ICL) with many examples can be promising for learning new tasks. However, this many-shot multimodal ICL setting has one crucial problem: it is fundamentally limited by the model's context length set at pretraining. The problem is especially prominent in the multimodal domain, which processes both text and images, requiring additional tokens. This motivates the need for a multimodal method to compress many shots into fewer tokens without finetuning. In this work, we enable LMMs to perform multimodal, many-shot in-context learning by leveraging Multimodal Task Vectors (MTV)--compact implicit representations of in-context examples compressed in the model's attention heads. Specifically, we first demonstrate the existence of such MTV in LMMs and then leverage these extracted MTV to enable many-shot in-context learning for various vision-and-language tasks. Our experiments suggest that MTV can scale in performance with the number of compressed shots and generalize to similar out-of-domain tasks without additional context length for inference.
☆ Gradient-Mask Tuning Elevates the Upper Limits of LLM Performance
Large language models (LLMs) have revolutionized lots of fields of research. Although it is well-known that fine-tuning is essential for enhancing the capabilities of LLMs, existing research suggests that there is potential redundancy in the fine-tuning process and therefore proposes to update only a subset of parameters. However, these methods fail to leverage the task-specific information to identify important parameters during training. Based on the insight that gradients inherently contain information on task-specific data, we propose Gradient-Mask Tuning (GMT), a method that selectively updates parameters during training based on their gradient information. Specifically, we compute the absolute values of the gradients and apply masking to those with relatively smaller magnitudes. Our empirical results across various tasks demonstrate that GMT not only outperforms traditional fine-tuning methods but also elevates the upper limits of LLM performance. Further analysis indicates that GMT exhibits insensitivity to mask ratio and possesses computational efficiency comparable to vanilla SFT.
☆ LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs
In traditional RAG framework, the basic retrieval units are normally short. The common retrievers like DPR normally work with 100-word Wikipedia paragraphs. Such a design forces the retriever to search over a large corpus to find the `needle' unit. In contrast, the readers only need to extract answers from the short retrieved units. Such an imbalanced `heavy' retriever and `light' reader design can lead to sub-optimal performance. In order to alleviate the imbalance, we propose a new framework LongRAG, consisting of a `long retriever' and a `long reader'. LongRAG processes the entire Wikipedia into 4K-token units, which is 30x longer than before. By increasing the unit size, we significantly reduce the total units from 22M to 700K. This significantly lowers the burden of retriever, which leads to a remarkable retrieval score: answer recall@1=71% on NQ (previously 52%) and answer recall@2=72% (previously 47%) on HotpotQA (full-wiki). Then we feed the top-k retrieved units ($\approx$ 30K tokens) to an existing long-context LLM to perform zero-shot answer extraction. Without requiring any training, LongRAG achieves an EM of 62.7% on NQ, which is the best known result. LongRAG also achieves 64.3% on HotpotQA (full-wiki), which is on par of the SoTA model. Our study offers insights into the future roadmap for combining RAG with long-context LLMs.
comment: Technical Report
☆ STARD: A Chinese Statute Retrieval Dataset with Real Queries Issued by Non-professionals
Statute retrieval aims to find relevant statutory articles for specific queries. This process is the basis of a wide range of legal applications such as legal advice, automated judicial decisions, legal document drafting, etc. Existing statute retrieval benchmarks focus on formal and professional queries from sources like bar exams and legal case documents, thereby neglecting non-professional queries from the general public, which often lack precise legal terminology and references. To address this gap, we introduce the STAtute Retrieval Dataset (STARD), a Chinese dataset comprising 1,543 query cases collected from real-world legal consultations and 55,348 candidate statutory articles. Unlike existing statute retrieval datasets, which primarily focus on professional legal queries, STARD captures the complexity and diversity of real queries from the general public. Through a comprehensive evaluation of various retrieval baselines, we reveal that existing retrieval approaches all fall short of these real queries issued by non-professional users. The best method only achieves a Recall@100 of 0.907, suggesting the necessity for further exploration and additional research in this area. All the codes and datasets are available at: https://github.com/oneal2000/STARD/tree/main
☆ NLP-KG: A System for Exploratory Search of Scientific Literature in Natural Language Processing ACL 2024
Scientific literature searches are often exploratory, whereby users are not yet familiar with a particular field or concept but are interested in learning more about it. However, existing systems for scientific literature search are typically tailored to keyword-based lookup searches, limiting the possibilities for exploration. We propose NLP-KG, a feature-rich system designed to support the exploration of research literature in unfamiliar natural language processing (NLP) fields. In addition to a semantic search, NLP-KG allows users to easily find survey papers that provide a quick introduction to a field of interest. Further, a Fields of Study hierarchy graph enables users to familiarize themselves with a field and its related areas. Finally, a chat interface allows users to ask questions about unfamiliar concepts or specific articles in NLP and obtain answers grounded in knowledge retrieved from scientific publications. Our system provides users with comprehensive exploration possibilities, supporting them in investigating the relationships between different fields, understanding unfamiliar concepts in NLP, and finding relevant research literature. Demo, video, and code are available at: https://github.com/NLP-Knowledge-Graph/NLP-KG-WebApp.
comment: Accepted to ACL 2024 System Demonstrations
☆ The Greek podcast corpus: Competitive speech models for low-resourced languages with weakly supervised data
The development of speech technologies for languages with limited digital representation poses significant challenges, primarily due to the scarcity of available data. This issue is exacerbated in the era of large, data-intensive models. Recent research has underscored the potential of leveraging weak supervision to augment the pool of available data. In this study, we compile an 800-hour corpus of Modern Greek from podcasts and employ Whisper large-v3 to generate silver transcriptions. This corpus is utilized to fine-tune our models, aiming to assess the efficacy of this approach in enhancing ASR performance. Our analysis spans 16 distinct podcast domains, alongside evaluations on established datasets for Modern Greek. The findings indicate consistent WER improvements, correlating with increases in both data volume and model size. Our study confirms that assembling large, weakly supervised corpora serves as a cost-effective strategy for advancing speech technologies in under-resourced languages.
comment: To be presented at Interspeech 2024
☆ Cross-Modality Safety Alignment
As Artificial General Intelligence (AGI) becomes increasingly integrated into various facets of human life, ensuring the safety and ethical alignment of such systems is paramount. Previous studies primarily focus on single-modality threats, which may not suffice given the integrated and complex nature of cross-modality interactions. We introduce a novel safety alignment challenge called Safe Inputs but Unsafe Output (SIUO) to evaluate cross-modality safety alignment. Specifically, it considers cases where single modalities are safe independently but could potentially lead to unsafe or unethical outputs when combined. To empirically investigate this problem, we developed the SIUO, a cross-modality benchmark encompassing 9 critical safety domains, such as self-harm, illegal activities, and privacy violations. Our findings reveal substantial safety vulnerabilities in both closed- and open-source LVLMs, such as GPT-4V and LLaVA, underscoring the inadequacy of current models to reliably interpret and respond to complex, real-world scenarios.
☆ Cognitive Map for Language Models: Optimal Planning via Verbally Representing the World Model
Language models have demonstrated impressive capabilities across various natural language processing tasks, yet they struggle with planning tasks requiring multi-step simulations. Inspired by human cognitive processes, this paper investigates the optimal planning power of language models that can construct a cognitive map of a given environment. Our experiments demonstrate that cognitive map significantly enhances the performance of both optimal and reachable planning generation ability in the Gridworld path planning task. We observe that our method showcases two key characteristics similar to human cognition: \textbf{generalization of its planning ability to extrapolated environments and rapid adaptation with limited training data.} We hope our findings in the Gridworld task provide insights into modeling human cognitive processes in language models, potentially leading to the development of more advanced and robust systems that better resemble human cognition.
☆ Evaluating Diversity in Automatic Poetry Generation
Natural Language Generation (NLG), and more generally generative AI, are among the currently most impactful research fields. Creative NLG, such as automatic poetry generation, is a fascinating niche in this area. While most previous research has focused on forms of the Turing test when evaluating automatic poetry generation - can humans distinguish between automatic and human generated poetry - we evaluate the diversity of automatically generated poetry, by comparing distributions of generated poetry to distributions of human poetry along structural, lexical, semantic and stylistic dimensions, assessing different model types (word vs. character-level, general purpose LLMs vs. poetry-specific models), including the very recent LLaMA3, and types of fine-tuning (conditioned vs. unconditioned). We find that current automatic poetry systems are considerably underdiverse along multiple dimensions - they often do not rhyme sufficiently, are semantically too uniform and even do not match the length distribution of human poetry. Our experiments reveal, however, that style-conditioning and character-level modeling clearly increases diversity across virtually all dimensions we explore. Our identified limitations may serve as the basis for more genuinely diverse future poetry generation models.
comment: init version
☆ Perception of Phonological Assimilation by Neural Speech Recognition Models
Human listeners effortlessly compensate for phonological changes during speech perception, often unconsciously inferring the intended sounds. For example, listeners infer the underlying /n/ when hearing an utterance such as "clea[m] pan", where [m] arises from place assimilation to the following labial [p]. This article explores how the neural speech recognition model Wav2Vec2 perceives assimilated sounds, and identifies the linguistic knowledge that is implemented by the model to compensate for assimilation during Automatic Speech Recognition (ASR). Using psycholinguistic stimuli, we systematically analyze how various linguistic context cues influence compensation patterns in the model's output. Complementing these behavioral experiments, our probing experiments indicate that the model shifts its interpretation of assimilated sounds from their acoustic form to their underlying form in its final layers. Finally, our causal intervention experiments suggest that the model relies on minimal phonological context cues to accomplish this shift. These findings represent a step towards better understanding the similarities and differences in phonological processing between neural ASR models and humans.
comment: Accepted for publication in Computational Linguistics (Special Issue on Language Learning, Representation, and Processing in Humans and Machines)
☆ Towards Fine-Grained Citation Evaluation in Generated Text: A Comparative Analysis of Faithfulness Metrics
Large language models (LLMs) often produce unsupported or unverifiable information, known as "hallucinations." To mitigate this, retrieval-augmented LLMs incorporate citations, grounding the content in verifiable sources. Despite such developments, manually assessing how well a citation supports the associated statement remains a major challenge. Previous studies use faithfulness metrics to estimate citation support automatically but are limited to binary classification, overlooking fine-grained citation support in practical scenarios. To investigate the effectiveness of faithfulness metrics in fine-grained scenarios, we propose a comparative evaluation framework that assesses the metric effectiveness in distinguishinging citations between three-category support levels: full, partial, and no support. Our framework employs correlation analysis, classification evaluation, and retrieval evaluation to measure the alignment between metric scores and human judgments comprehensively. Our results show no single metric consistently excels across all evaluations, revealing the complexity of assessing fine-grained support. Based on the findings, we provide practical recommendations for developing more effective metrics.
comment: 12 pages, 3 figures
☆ Unsupervised Morphological Tree Tokenizer
As a cornerstone in language modeling, tokenization involves segmenting text inputs into pre-defined atomic units. Conventional statistical tokenizers often disrupt constituent boundaries within words, thereby corrupting semantic information. To address this drawback, we introduce morphological structure guidance to tokenization and propose a deep model to induce character-level structures of words. Specifically, the deep model jointly encodes internal structures and representations of words with a mechanism named $\textit{MorphOverriding}$ to ensure the indecomposability of morphemes. By training the model with self-supervised objectives, our method is capable of inducing character-level structures that align with morphological rules without annotated training data. Based on the induced structures, our algorithm tokenizes words through vocabulary matching in a top-down manner. Empirical results indicate that the proposed method effectively retains complete morphemes and outperforms widely adopted methods such as BPE and WordPiece on both morphological segmentation tasks and language modeling tasks. The code will be released later.
☆ Detecting Synthetic Lyrics with Few-Shot Inference
In recent years, generated content in music has gained significant popularity, with large language models being effectively utilized to produce human-like lyrics in various styles, themes, and linguistic structures. This technological advancement supports artists in their creative processes but also raises issues of authorship infringement, consumer satisfaction and content spamming. To address these challenges, methods for detecting generated lyrics are necessary. However, existing works have not yet focused on this specific modality or on creative text in general regarding machine-generated content detection methods and datasets. In response, we have curated the first dataset of high-quality synthetic lyrics and conducted a comprehensive quantitative evaluation of various few-shot content detection approaches, testing their generalization capabilities and complementing this with a human evaluation. Our best few-shot detector, based on LLM2Vec, surpasses stylistic and statistical methods, which are shown competitive in other domains at distinguishing human-written from machine-generated content. It also shows good generalization capabilities to new artists and models, and effectively detects post-generation paraphrasing. This study emphasizes the need for further research on creative content detection, particularly in terms of generalization and scalability with larger song catalogs. All datasets, pre-processing scripts, and code are available publicly on GitHub and Hugging Face under the Apache 2.0 license.
comment: Under review
☆ A LLM-Based Ranking Method for the Evaluation of Automatic Counter-Narrative Generation
The proliferation of misinformation and harmful narratives in online discourse has underscored the critical need for effective Counter Narrative (CN) generation techniques. However, existing automatic evaluation methods often lack interpretability and fail to capture the nuanced relationship between generated CNs and human perception. Aiming to achieve a higher correlation with human judgments, this paper proposes a novel approach to asses generated CNs that consists on the use of a Large Language Model (LLM) as a evaluator. By comparing generated CNs pairwise in a tournament-style format, we establish a model ranking pipeline that achieves a correlation of $0.88$ with human preference. As an additional contribution, we leverage LLMs as zero-shot (ZS) CN generators and conduct a comparative analysis of chat, instruct, and base models, exploring their respective strengths and limitations. Through meticulous evaluation, including fine-tuning experiments, we elucidate the differences in performance and responsiveness to domain-specific data. We conclude that chat-aligned models in ZS are the best option for carrying out the task, provided they do not refuse to generate an answer due to security concerns.
☆ Unsupervised Extraction of Dialogue Policies from Conversations
Dialogue policies play a crucial role in developing task-oriented dialogue systems, yet their development and maintenance are challenging and typically require substantial effort from experts in dialogue modeling. While in many situations, large amounts of conversational data are available for the task at hand, people lack an effective solution able to extract dialogue policies from this data. In this paper, we address this gap by first illustrating how Large Language Models (LLMs) can be instrumental in extracting dialogue policies from datasets, through the conversion of conversations into a unified intermediate representation consisting of canonical forms. We then propose a novel method for generating dialogue policies utilizing a controllable and interpretable graph-based methodology. By combining canonical forms across conversations into a flow network, we find that running graph traversal algorithms helps in extracting dialogue flows. These flows are a better representation of the underlying interactions than flows extracted by prompting LLMs. Our technique focuses on giving conversation designers greater control, offering a productivity tool to improve the process of developing dialogue policies.
☆ How Effective is GPT-4 Turbo in Generating School-Level Questions from Textbooks Based on Bloom's Revised Taxonomy?
We evaluate the effectiveness of GPT-4 Turbo in generating educational questions from NCERT textbooks in zero-shot mode. Our study highlights GPT-4 Turbo's ability to generate questions that require higher-order thinking skills, especially at the "understanding" level according to Bloom's Revised Taxonomy. While we find a notable consistency between questions generated by GPT-4 Turbo and those assessed by humans in terms of complexity, there are occasional differences. Our evaluation also uncovers variations in how humans and machines evaluate question quality, with a trend inversely related to Bloom's Revised Taxonomy levels. These findings suggest that while GPT-4 Turbo is a promising tool for educational question generation, its efficacy varies across different cognitive levels, indicating a need for further refinement to fully meet educational standards.
comment: Accepted at Learnersourcing: Student-Generated Content @ Scale 2024
☆ Reward Steering with Evolutionary Heuristics for Decoding-time Alignment
The widespread applicability and increasing omnipresence of LLMs have instigated a need to align LLM responses to user and stakeholder preferences. Many preference optimization approaches have been proposed that fine-tune LLM parameters to achieve good alignment. However, such parameter tuning is known to interfere with model performance on many tasks. Moreover, keeping up with shifting user preferences is tricky in such a situation. Decoding-time alignment with reward model guidance solves these issues at the cost of increased inference time. However, most of such methods fail to strike the right balance between exploration and exploitation of reward -- often due to the conflated formulation of these two aspects - to give well-aligned responses. To remedy this we decouple these two aspects and implement them in an evolutionary fashion: exploration is enforced by decoding from mutated instructions and exploitation is represented as the periodic replacement of poorly-rewarded generations with well-rewarded ones. Empirical evidences indicate that this strategy outperforms many preference optimization and decode-time alignment approaches on two widely accepted alignment benchmarks AlpacaEval 2 and MT-Bench. Our implementation will be available at: https://darwin-alignment.github.io.
☆ Hybrid Alignment Training for Large Language Models ACL
Alignment training is crucial for enabling large language models (LLMs) to cater to human intentions and preferences. It is typically performed based on two stages with different objectives: instruction-following alignment and human-preference alignment. However, aligning LLMs with these objectives in sequence suffers from an inherent problem: the objectives may conflict, and the LLMs cannot guarantee to simultaneously align with the instructions and human preferences well. To response to these, in this work, we propose a Hybrid Alignment Training (Hbat) approach, based on alternating alignment and modified elastic weight consolidation methods. The basic idea is to alternate between different objectives during alignment training, so that better collaboration can be achieved between the two alignment tasks.We experiment with Hbat on summarization and dialogue tasks. Experimental results show that the proposed \textsc{Hbat} can significantly outperform all baselines. Notably, Hbat yields consistent performance gains over the traditional two-stage alignment training when using both proximal policy optimization and direct preference optimization.
comment: accepted by ACL (Findings) 2024
☆ Enhancing Idiomatic Representation in Multiple Languages via an Adaptive Contrastive Triplet Loss
Accurately modeling idiomatic or non-compositional language has been a longstanding challenge in Natural Language Processing (NLP). This is partly because these expressions do not derive their meanings solely from their constituent words, but also due to the scarcity of relevant data resources, and their impact on the performance of downstream tasks such as machine translation and simplification. In this paper we propose an approach to model idiomaticity effectively using a triplet loss that incorporates the asymmetric contribution of components words to an idiomatic meaning for training language models by using adaptive contrastive learning and resampling miners to build an idiomatic-aware learning objective. Our proposed method is evaluated on a SemEval challenge and outperforms previous alternatives significantly in many metrics.
☆ A Syntax-Injected Approach for Faster and More Accurate Sentiment Analysis
Sentiment Analysis (SA) is a crucial aspect of Natural Language Processing (NLP), addressing subjective assessments in textual content. Syntactic parsing is useful in SA because explicit syntactic information can improve accuracy while providing explainability, but it tends to be a computational bottleneck in practice due to the slowness of parsing algorithms. This paper addresses said bottleneck by using a SEquence Labeling Syntactic Parser (SELSP) to inject syntax into SA. By treating dependency parsing as a sequence labeling problem, we greatly enhance the speed of syntax-based SA. SELSP is trained and evaluated on a ternary polarity classification task, demonstrating its faster performance and better accuracy in polarity prediction tasks compared to conventional parsers like Stanza and to heuristic approaches that use shallow syntactic rules for SA like VADER. This increased speed and improved accuracy make SELSP particularly appealing to SA practitioners in both research and industry. In addition, we test several sentiment dictionaries on our SELSP to see which one improves the performance in polarity prediction tasks. Moreover, we compare the SELSP with Transformer-based models trained on a 5-label classification task. The results show that dictionaries that capture polarity judgment variation provide better results than dictionaries that ignore polarity judgment variation. Moreover, we show that SELSP is considerably faster than Transformer-based models in polarity prediction tasks.
☆ Assessing Good, Bad and Ugly Arguments Generated by ChatGPT: a New Dataset, its Methodology and Associated Tasks
The recent success of Large Language Models (LLMs) has sparked concerns about their potential to spread misinformation. As a result, there is a pressing need for tools to identify ``fake arguments'' generated by such models. To create these tools, examples of texts generated by LLMs are needed. This paper introduces a methodology to obtain good, bad and ugly arguments from argumentative essays produced by ChatGPT, OpenAI's LLM. We then describe a novel dataset containing a set of diverse arguments, ArGPT. We assess the effectiveness of our dataset and establish baselines for several argumentation-related tasks. Finally, we show that the artificially generated data relates well to human argumentation and thus is useful as a tool to train and test systems for the defined tasks.
☆ Investigating the impact of 2D gesture representation on co-speech gesture generation
Co-speech gestures play a crucial role in the interactions between humans and embodied conversational agents (ECA). Recent deep learning methods enable the generation of realistic, natural co-speech gestures synchronized with speech, but such approaches require large amounts of training data. "In-the-wild" datasets, which compile videos from sources such as YouTube through human pose detection models, offer a solution by providing 2D skeleton sequences that are paired with speech. Concurrently, innovative lifting models have emerged, capable of transforming these 2D pose sequences into their 3D counterparts, leading to large and diverse datasets of 3D gestures. However, the derived 3D pose estimation is essentially a pseudo-ground truth, with the actual ground truth being the 2D motion data. This distinction raises questions about the impact of gesture representation dimensionality on the quality of generated motions, a topic that, to our knowledge, remains largely unexplored. In this work, we evaluate the impact of the dimensionality of the training data, 2D or 3D joint coordinates, on the performance of a multimodal speech-to-gesture deep generative model. We use a lifting model to convert 2D-generated sequences of body pose to 3D. Then, we compare the sequence of gestures generated directly in 3D to the gestures generated in 2D and lifted to 3D as post-processing.
comment: 8 pages. Paper accepted at WACAI 2024
☆ Brain-Like Language Processing via a Shallow Untrained Multihead Attention Network
Large Language Models (LLMs) have been shown to be effective models of the human language system, with some models predicting most explainable variance of brain activity in current datasets. Even in untrained models, the representations induced by architectural priors can exhibit reasonable alignment to brain data. In this work, we investigate the key architectural components driving the surprising alignment of untrained models. To estimate LLM-to-brain similarity, we first select language-selective units within an LLM, similar to how neuroscientists identify the language network in the human brain. We then benchmark the brain alignment of these LLM units across five different brain recording datasets. By isolating critical components of the Transformer architecture, we identify tokenization strategy and multihead attention as the two major components driving brain alignment. A simple form of recurrence further improves alignment. We further demonstrate this quantitative brain alignment of our model by reproducing landmark studies in the language neuroscience field, showing that localized model units -- just like language voxels measured empirically in the human brain -- discriminate more reliably between lexical than syntactic differences, and exhibit similar response profiles under the same experimental conditions. Finally, we demonstrate the utility of our model's representations for language modeling, achieving improved sample and parameter efficiency over comparable architectures. Our model's estimates of surprisal sets a new state-of-the-art in the behavioral alignment to human reading times. Taken together, we propose a highly brain- and behaviorally-aligned model that conceptualizes the human language system as an untrained shallow feature encoder, with structural priors, combined with a trained decoder to achieve efficient and performant language processing.
comment: Preprint
☆ A Unified Framework for Input Feature Attribution Analysis
Explaining the decision-making process of machine learning models is crucial for ensuring their reliability and fairness. One popular explanation form highlights key input features, such as i) tokens (e.g., Shapley Values and Integrated Gradients), ii) interactions between tokens (e.g., Bivariate Shapley and Attention-based methods), or iii) interactions between spans of the input (e.g., Louvain Span Interactions). However, these explanation types have only been studied in isolation, making it difficult to judge their respective applicability. To bridge this gap, we propose a unified framework that facilitates a direct comparison between highlight and interactive explanations comprised of four diagnostic properties. Through extensive analysis across these three types of input feature explanations--each utilizing three different explanation techniques--across two datasets and two models, we reveal that each explanation type excels in terms of different diagnostic properties. In our experiments, highlight explanations are the most faithful to a model's prediction, and interactive explanations provide better utility for learning to simulate a model's predictions. These insights further highlight the need for future research to develop combined methods that enhance all diagnostic properties.
☆ Cross-lingual paraphrase identification
The paraphrase identification task involves measuring semantic similarity between two short sentences. It is a tricky task, and multilingual paraphrase identification is even more challenging. In this work, we train a bi-encoder model in a contrastive manner to detect hard paraphrases across multiple languages. This approach allows us to use model-produced embeddings for various tasks, such as semantic search. We evaluate our model on downstream tasks and also assess embedding space quality. Our performance is comparable to state-of-the-art cross-encoders, with only a minimal relative drop of 7-10% on the chosen dataset, while keeping decent quality of embeddings.
☆ PARIKSHA : A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data
Evaluation of multilingual Large Language Models (LLMs) is challenging due to a variety of factors -- the lack of benchmarks with sufficient linguistic diversity, contamination of popular benchmarks into LLM pre-training data and the lack of local, cultural nuances in translated benchmarks. In this work, we study human and LLM-based evaluation in a multilingual, multi-cultural setting. We evaluate 30 models across 10 Indic languages by conducting 90K human evaluations and 30K LLM-based evaluations and find that models such as GPT-4o and Llama-3 70B consistently perform best for most Indic languages. We build leaderboards for two evaluation settings - pairwise comparison and direct assessment and analyse the agreement between humans and LLMs. We find that humans and LLMs agree fairly well in the pairwise setting but the agreement drops for direct assessment evaluation especially for languages such as Bengali and Odia. We also check for various biases in human and LLM-based evaluation and find evidence of self-bias in the GPT-based evaluator. Our work presents a significant step towards scaling up multilingual evaluation of LLMs.
comment: Work in progress
☆ Tri-VQA: Triangular Reasoning Medical Visual Question Answering for Multi-Attribute Analysis
The intersection of medical Visual Question Answering (Med-VQA) is a challenging research topic with advantages including patient engagement and clinical expert involvement for second opinions. However, existing Med-VQA methods based on joint embedding fail to explain whether their provided results are based on correct reasoning or coincidental answers, which undermines the credibility of VQA answers. In this paper, we investigate the construction of a more cohesive and stable Med-VQA structure. Motivated by causal effect, we propose a novel Triangular Reasoning VQA (Tri-VQA) framework, which constructs reverse causal questions from the perspective of "Why this answer?" to elucidate the source of the answer and stimulate more reasonable forward reasoning processes. We evaluate our method on the Endoscopic Ultrasound (EUS) multi-attribute annotated dataset from five centers, and test it on medical VQA datasets. Experimental results demonstrate the superiority of our approach over existing methods. Our codes and pre-trained models are available at https://anonymous.4open.science/r/Tri_VQA.
☆ Harnessing Knowledge Retrieval with Large Language Models for Clinical Report Error Correction
This study proposes an approach for error correction in clinical radiology reports, leveraging large language models (LLMs) and retrieval-augmented generation (RAG) techniques. The proposed framework employs internal and external retrieval mechanisms to extract relevant medical entities and relations from the report and external knowledge sources. A three-stage inference process is introduced, decomposing the task into error detection, localization, and correction subtasks, which enhances the explainability and performance of the system. The effectiveness of the approach is evaluated using a benchmark dataset created by corrupting real-world radiology reports with realistic errors, guided by domain experts. Experimental results demonstrate the benefits of the proposed methods, with the combination of internal and external retrieval significantly improving the accuracy of error detection, localization, and correction across various state-of-the-art LLMs. The findings contribute to the development of more robust and reliable error correction systems for clinical documentation.
☆ Online detection and infographic explanation of spam reviews with data drift adaptation
Spam reviews are a pervasive problem on online platforms due to its significant impact on reputation. However, research into spam detection in data streams is scarce. Another concern lies in their need for transparency. Consequently, this paper addresses those problems by proposing an online solution for identifying and explaining spam reviews, incorporating data drift adaptation. It integrates (i) incremental profiling, (ii) data drift detection & adaptation, and (iii) identification of spam reviews employing Machine Learning. The explainable mechanism displays a visual and textual prediction explanation in a dashboard. The best results obtained reached up to 87 % spam F-measure.
☆ GiusBERTo: A Legal Language Model for Personal Data De-identification in Italian Court of Auditors Decisions
Recent advances in Natural Language Processing have demonstrated the effectiveness of pretrained language models like BERT for a variety of downstream tasks. We present GiusBERTo, the first BERT-based model specialized for anonymizing personal data in Italian legal documents. GiusBERTo is trained on a large dataset of Court of Auditors decisions to recognize entities to anonymize, including names, dates, locations, while retaining contextual relevance. We evaluate GiusBERTo on a held-out test set and achieve 97% token-level accuracy. GiusBERTo provides the Italian legal community with an accurate and tailored BERT model for de-identification, balancing privacy and data protection.
comment: 14 pages, 4 figures, 6 Tables
☆ MedOdyssey: A Medical Domain Benchmark for Long Context Evaluation Up to 200K Tokens
Numerous advanced Large Language Models (LLMs) now support context lengths up to 128K, and some extend to 200K. Some benchmarks in the generic domain have also followed up on evaluating long-context capabilities. In the medical domain, tasks are distinctive due to the unique contexts and need for domain expertise, necessitating further evaluation. However, despite the frequent presence of long texts in medical scenarios, evaluation benchmarks of long-context capabilities for LLMs in this field are still rare. In this paper, we propose MedOdyssey, the first medical long-context benchmark with seven length levels ranging from 4K to 200K tokens. MedOdyssey consists of two primary components: the medical-context "needles in a haystack" task and a series of tasks specific to medical applications, together comprising 10 datasets. The first component includes challenges such as counter-intuitive reasoning and novel (unknown) facts injection to mitigate knowledge leakage and data contamination of LLMs. The second component confronts the challenge of requiring professional medical expertise. Especially, we design the ``Maximum Identical Context'' principle to improve fairness by guaranteeing that different LLMs observe as many identical contexts as possible. Our experiment evaluates advanced proprietary and open-source LLMs tailored for processing long contexts and presents detailed performance analyses. This highlights that LLMs still face challenges and need for further research in this area. Our code and data are released in the repository: \url{https://github.com/JOHNNY-fans/MedOdyssey.}
☆ GraLMatch: Matching Groups of Entities with Graphs and Language Models EDBT 2025
In this paper, we present an end-to-end multi-source Entity Matching problem, which we call entity group matching, where the goal is to assign to the same group, records originating from multiple data sources but representing the same real-world entity. We focus on the effects of transitively matched records, i.e. the records connected by paths in the graph G = (V,E) whose nodes and edges represent the records and whether they are a match or not. We present a real-world instance of this problem, where the challenge is to match records of companies and financial securities originating from different data providers. We also introduce two new multi-source benchmark datasets that present similar matching challenges as real-world records. A distinctive characteristic of these records is that they are regularly updated following real-world events, but updates are not applied uniformly across data sources. This phenomenon makes the matching of certain groups of records only possible through the use of transitive information. In our experiments, we illustrate how considering transitively matched records is challenging since a limited amount of false positive pairwise match predictions can throw off the group assignment of large quantities of records. Thus, we propose GraLMatch, a method that can partially detect and remove false positive pairwise predictions through graph-based properties. Finally, we showcase how fine-tuning a Transformer-based model (DistilBERT) on a reduced number of labeled samples yields a better final entity group matching than training on more samples and/or incorporating fine-tuning optimizations, illustrating how precision becomes the deciding factor in the entity group matching of large volumes of records.
comment: 12 pages, 4 figures, accepted as research paper at EDBT 2025
☆ Unveiling the Impact of Multi-Modal Interactions on User Engagement: A Comprehensive Evaluation in AI-driven Conversations
Large Language Models (LLMs) have significantly advanced user-bot interactions, enabling more complex and coherent dialogues. However, the prevalent text-only modality might not fully exploit the potential for effective user engagement. This paper explores the impact of multi-modal interactions, which incorporate images and audio alongside text, on user engagement in chatbot conversations. We conduct a comprehensive analysis using a diverse set of chatbots and real-user interaction data, employing metrics such as retention rate and conversation length to evaluate user engagement. Our findings reveal a significant enhancement in user engagement with multi-modal interactions compared to text-only dialogues. Notably, the incorporation of a third modality significantly amplifies engagement beyond the benefits observed with just two modalities. These results suggest that multi-modal interactions optimize cognitive processing and facilitate richer information comprehension. This study underscores the importance of multi-modality in chatbot design, offering valuable insights for creating more engaging and immersive AI communication experiences and informing the broader AI community about the benefits of multi-modal interactions in enhancing user engagement.
☆ Disability Representations: Finding Biases in Automatic Image Generation CVPR 2024
Recent advancements in image generation technology have enabled widespread access to AI-generated imagery, prominently used in advertising, entertainment, and progressively in every form of visual content. However, these technologies often perpetuate societal biases. This study investigates the representation biases in popular image generation models towards people with disabilities (PWD). Through a comprehensive experiment involving several popular text-to-image models, we analyzed the depiction of disability. The results indicate a significant bias, with most generated images portraying disabled individuals as old, sad, and predominantly using manual wheelchairs. These findings highlight the urgent need for more inclusive AI development, ensuring diverse and accurate representation of PWD in generated images. This research underscores the importance of addressing and mitigating biases in AI models to foster equitable and realistic representations.
comment: Presented at AVA Workshop of CVPR 2024
☆ SpreadsheetBench: Towards Challenging Real World Spreadsheet Manipulation
We introduce SpreadsheetBench, a challenging spreadsheet manipulation benchmark exclusively derived from real-world scenarios, designed to immerse current large language models (LLMs) in the actual workflow of spreadsheet users. Unlike existing benchmarks that rely on synthesized queries and simplified spreadsheet files, SpreadsheetBench is built from 912 real questions gathered from online Excel forums, which reflect the intricate needs of users. The associated spreadsheets from the forums contain a variety of tabular data such as multiple tables, non-standard relational tables, and abundant non-textual elements. Furthermore, we propose a more reliable evaluation metric akin to online judge platforms, where multiple spreadsheet files are created as test cases for each instruction, ensuring the evaluation of robust solutions capable of handling spreadsheets with varying values. Our comprehensive evaluation of various LLMs under both single-round and multi-round inference settings reveals a substantial gap between the state-of-the-art (SOTA) models and human performance, highlighting the benchmark's difficulty.
comment: Homepage: https://spreadsheetbench.github.io/
☆ Do Large Language Models Exhibit Cognitive Dissonance? Studying the Difference Between Revealed Beliefs and Stated Answers
Prompting and Multiple Choices Questions (MCQ) have become the preferred approach to assess the capabilities of Large Language Models (LLMs), due to their ease of manipulation and evaluation. Such experimental appraisals have pointed toward the LLMs' apparent ability to perform causal reasoning or to grasp uncertainty. In this paper, we investigate whether these abilities are measurable outside of tailored prompting and MCQ by reformulating these issues as direct text completion - the foundation of LLMs. To achieve this goal, we define scenarios with multiple possible outcomes and we compare the prediction made by the LLM through prompting (their Stated Answer) to the probability distributions they compute over these outcomes during next token prediction (their Revealed Belief). Our findings suggest that the Revealed Belief of LLMs significantly differs from their Stated Answer and hint at multiple biases and misrepresentations that their beliefs may yield in many scenarios and outcomes. As text completion is at the core of LLMs, these results suggest that common evaluation methods may only provide a partial picture and that more research is needed to assess the extent and nature of their capabilities.
☆ Retrieve-Plan-Generation: An Iterative Planning and Answering Framework for Knowledge-Intensive LLM Generation
Despite the significant progress of large language models (LLMs) in various tasks, they often produce factual errors due to their limited internal knowledge. Retrieval-Augmented Generation (RAG), which enhances LLMs with external knowledge sources, offers a promising solution. However, these methods can be misled by irrelevant paragraphs in retrieved documents. Due to the inherent uncertainty in LLM generation, inputting the entire document may introduce off-topic information, causing the model to deviate from the central topic and affecting the relevance of the generated content. To address these issues, we propose the Retrieve-Plan-Generation (RPG) framework. RPG generates plan tokens to guide subsequent generation in the plan stage. In the answer stage, the model selects relevant fine-grained paragraphs based on the plan and uses them for further answer generation. This plan-answer process is repeated iteratively until completion, enhancing generation relevance by focusing on specific topics. To implement this framework efficiently, we utilize a simple but effective multi-task prompt-tuning method, enabling the existing LLMs to handle both planning and answering. We comprehensively compare RPG with baselines across 5 knowledge-intensive generation tasks, demonstrating the effectiveness of our approach.
☆ A Tale of Trust and Accuracy: Base vs. Instruct LLMs in RAG Systems
Retrieval Augmented Generation (RAG) represents a significant advancement in artificial intelligence combining a retrieval phase with a generative phase, with the latter typically being powered by large language models (LLMs). The current common practices in RAG involve using "instructed" LLMs, which are fine-tuned with supervised training to enhance their ability to follow instructions and are aligned with human preferences using state-of-the-art techniques. Contrary to popular belief, our study demonstrates that base models outperform their instructed counterparts in RAG tasks by 20% on average under our experimental settings. This finding challenges the prevailing assumptions about the superiority of instructed LLMs in RAG applications. Further investigations reveal a more nuanced situation, questioning fundamental aspects of RAG and suggesting the need for broader discussions on the topic; or, as Fromm would have it, "Seldom is a glance at the statistics enough to understand the meaning of the figures".
☆ Domain Adaptation of Llama3-70B-Instruct through Continual Pre-Training and Model Merging: A Comprehensive Evaluation
We conducted extensive experiments on domain adaptation of the Meta-Llama-3-70B-Instruct model on SEC data, exploring its performance on both general and domain-specific benchmarks. Our focus included continual pre-training (CPT) and model merging, aiming to enhance the model's domain-specific capabilities while mitigating catastrophic forgetting. Through this study, we evaluated the impact of integrating financial regulatory data into a robust language model and examined the effectiveness of our model merging techniques in preserving and improving the model's instructive abilities. The model is accessible at hugging face: https://huggingface.co/arcee-ai/Llama-3-SEC-Base, arcee-ai/Llama-3-SEC-Base. This is an intermediate checkpoint of our final model, which has seen 20B tokens so far. The full model is still in the process of training. This is a preprint technical report with thorough evaluations to understand the entire process.
comment: 8 pages, 6 figures
☆ Unlocking the Global Synergies in Low-Rank Adapters ICML2024
Low-rank Adaption (LoRA) has been the de-facto parameter-efficient fine-tuning technique for large language models. We present HeteroLoRA, a light-weight search algorithm that leverages zero-cost proxies to allocate the limited LoRA trainable parameters across the model for better fine-tuned performance. In addition to the allocation for the standard LoRA-adapted models, we also demonstrate the efficacy of HeteroLoRA by performing the allocation in a more challenging search space that includes LoRA modules and LoRA-adapted shortcut connections. Experiments show that HeteroLoRA enables improvements in model performance given the same parameter budge. For example, on MRPC, we see an improvement of 1.6% in accuracy with similar training parameter budget. We will open-source our algorithm once the paper is accepted.
comment: Accepted at ICML2024 ES-FoMo-II Workshop
☆ ICLEval: Evaluating In-Context Learning Ability of Large Language Models
In-Context Learning (ICL) is a critical capability of Large Language Models (LLMs) as it empowers them to comprehend and reason across interconnected inputs. Evaluating the ICL ability of LLMs can enhance their utilization and deepen our understanding of how this ability is acquired at the training stage. However, existing evaluation frameworks primarily focus on language abilities and knowledge, often overlooking the assessment of ICL ability. In this work, we introduce the ICLEval benchmark to evaluate the ICL abilities of LLMs, which encompasses two key sub-abilities: exact copying and rule learning. Through the ICLEval benchmark, we demonstrate that ICL ability is universally present in different LLMs, and model size is not the sole determinant of ICL efficacy. Surprisingly, we observe that ICL abilities, particularly copying, develop early in the pretraining process and stabilize afterward. Our source codes and benchmark are released at https://github.com/yiye3/ICLEval.
☆ ESC-Eval: Evaluating Emotion Support Conversations in Large Language Models
Emotion Support Conversation (ESC) is a crucial application, which aims to reduce human stress, offer emotional guidance, and ultimately enhance human mental and physical well-being. With the advancement of Large Language Models (LLMs), many researchers have employed LLMs as the ESC models. However, the evaluation of these LLM-based ESCs remains uncertain. Inspired by the awesome development of role-playing agents, we propose an ESC Evaluation framework (ESC-Eval), which uses a role-playing agent to interact with ESC models, followed by a manual evaluation of the interactive dialogues. In detail, we first re-organize 2,801 role-playing cards from seven existing datasets to define the roles of the role-playing agent. Second, we train a specific role-playing model called ESC-Role which behaves more like a confused person than GPT-4. Third, through ESC-Role and organized role cards, we systematically conduct experiments using 14 LLMs as the ESC models, including general AI-assistant LLMs (ChatGPT) and ESC-oriented LLMs (ExTES-Llama). We conduct comprehensive human annotations on interactive multi-turn dialogues of different ESC models. The results show that ESC-oriented LLMs exhibit superior ESC abilities compared to general AI-assistant LLMs, but there is still a gap behind human performance. Moreover, to automate the scoring process for future ESC models, we developed ESC-RANK, which trained on the annotated data, achieving a scoring performance surpassing 35 points of GPT-4. Our data and code are available at https://github.com/haidequanbu/ESC-Eval.
comment: Pre-print
☆ Towards Retrieval Augmented Generation over Large Video Libraries
Video content creators need efficient tools to repurpose content, a task that often requires complex manual or automated searches. Crafting a new video from large video libraries remains a challenge. In this paper we introduce the task of Video Library Question Answering (VLQA) through an interoperable architecture that applies Retrieval Augmented Generation (RAG) to video libraries. We propose a system that uses large language models (LLMs) to generate search queries, retrieving relevant video moments indexed by speech and visual metadata. An answer generation module then integrates user queries with this metadata to produce responses with specific video timestamps. This approach shows promise in multimedia content retrieval, and AI-assisted video content creation.
comment: Accepted in IEEE HSI 2024
☆ Autonomous Agents for Collaborative Task under Information Asymmetry
Large Language Model Multi-Agent Systems (LLM-MAS) have achieved great progress in solving complex tasks. It performs communication among agents within the system to collaboratively solve tasks, under the premise of shared information. However, when agents' communication is leveraged to enhance human cooperation, a new challenge arises due to information asymmetry, since each agent can only access the information of its human user. Previous MAS struggle to complete tasks under this condition. To address this, we propose a new MAS paradigm termed iAgents, which denotes Informative Multi-Agent Systems. In iAgents, the human social network is mirrored in the agent network, where agents proactively exchange human information necessary for task resolution, thereby overcoming information asymmetry. iAgents employs a novel agent reasoning mechanism, InfoNav, to navigate agents' communication towards effective information exchange. Together with InfoNav, iAgents organizes human information in a mixed memory to provide agents with accurate and comprehensive information for exchange. Additionally, we introduce InformativeBench, the first benchmark tailored for evaluating LLM agents' task-solving ability under information asymmetry. Experimental results show that iAgents can collaborate within a social network of 140 individuals and 588 relationships, autonomously communicate over 30 turns, and retrieve information from nearly 70,000 messages to complete tasks within 3 minutes.
comment: 16 pages, 8 figures, 5 tables, Work in progress
☆ LLM2FEA: Discover Novel Designs with Generative Evolutionary Multitasking
The rapid research and development of generative artificial intelligence has enabled the generation of high-quality images, text, and 3D models from text prompts. This advancement impels an inquiry into whether these models can be leveraged to create digital artifacts for both creative and engineering applications. Drawing on innovative designs from other domains may be one answer to this question, much like the historical practice of ``bionics", where humans have sought inspiration from nature's exemplary designs. This raises the intriguing possibility of using generative models to simultaneously tackle design tasks across multiple domains, facilitating cross-domain learning and resulting in a series of innovative design solutions. In this paper, we propose LLM2FEA as the first attempt to discover novel designs in generative models by transferring knowledge across multiple domains. By utilizing a multi-factorial evolutionary algorithm (MFEA) to drive a large language model, LLM2FEA integrates knowledge from various fields to generate prompts that guide the generative model in discovering novel and practical objects. Experimental results in the context of 3D aerodynamic design verify the discovery capabilities of the proposed LLM2FEA. The designs generated by LLM2FEA not only satisfy practicality requirements to a certain degree but also feature novel and aesthetically pleasing shapes, demonstrating the potential applications of LLM2FEA in discovery tasks.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
☆ MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression
Sparse attention can effectively mitigate the significant memory and throughput demands of Large Language Models (LLMs) in long contexts. Existing methods typically employ a uniform sparse attention mask, applying the same sparse pattern across different attention heads and input lengths. However, this uniform approach fails to capture the diverse attention patterns inherent in LLMs, ignoring their distinct accuracy-latency trade-offs. To address this challenge, we propose the Mixture of Attention (MoA), which automatically tailors distinct sparse attention configurations to different heads and layers. MoA constructs and navigates a search space of various attention patterns and their scaling rules relative to input sequence lengths. It profiles the model, evaluates potential configurations, and pinpoints the optimal sparse attention compression plan. MoA adapts to varying input sizes, revealing that some attention heads expand their focus to accommodate longer sequences, while other heads consistently concentrate on fixed-length local contexts. Experiments show that MoA increases the effective context length by $3.9\times$ with the same average attention span, boosting retrieval accuracy by $1.5-7.1\times$ over the uniform-attention baseline across Vicuna-7B, Vicuna-13B, and Llama3-8B models. Moreover, MoA narrows the capability gaps between sparse and dense models, reducing the maximum relative performance drop from $9\%-36\%$ to within $5\%$ across two long-context understanding benchmarks. MoA achieves a $1.2-1.4\times$ GPU memory reduction and boosts decode throughput by $5.5-6.7 \times$ for 7B and 13B dense models on a single GPU, with minimal impact on performance.
comment: 10 pages
☆ Safely Learning with Private Data: A Federated Learning Framework for Large Language Model
Private data, being larger and quality-higher than public data, can greatly improve large language models (LLM). However, due to privacy concerns, this data is often dispersed in multiple silos, making its secure utilization for LLM training a challenge. Federated learning (FL) is an ideal solution for training models with distributed private data, but traditional frameworks like FedAvg are unsuitable for LLM due to their high computational demands on clients. An alternative, split learning, offloads most training parameters to the server while training embedding and output layers locally, making it more suitable for LLM. Nonetheless, it faces significant challenges in security and efficiency. Firstly, the gradients of embeddings are prone to attacks, leading to potential reverse engineering of private data. Furthermore, the server's limitation of handle only one client's training request at a time hinders parallel training, severely impacting training efficiency. In this paper, we propose a Federated Learning framework for LLM, named FL-GLM, which prevents data leakage caused by both server-side and peer-client attacks while improving training efficiency. Specifically, we first place the input block and output block on local client to prevent embedding gradient attacks from server. Secondly, we employ key-encryption during client-server communication to prevent reverse engineering attacks from peer-clients. Lastly, we employ optimization methods like client-batching or server-hierarchical, adopting different acceleration methods based on the actual computational capabilities of the server. Experimental results on NLU and generation tasks demonstrate that FL-GLM achieves comparable metrics to centralized chatGLM model, validating the effectiveness of our federated learning framework.
☆ Talking the Talk Does Not Entail Walking the Walk: On the Limits of Large Language Models in Lexical Entailment Recognition
Verbs form the backbone of language, providing the structure and meaning to sentences. Yet, their intricate semantic nuances pose a longstanding challenge. Understanding verb relations through the concept of lexical entailment is crucial for comprehending sentence meanings and grasping verb dynamics. This work investigates the capabilities of eight Large Language Models in recognizing lexical entailment relations among verbs through differently devised prompting strategies and zero-/few-shot settings over verb pairs from two lexical databases, namely WordNet and HyperLex. Our findings unveil that the models can tackle the lexical entailment recognition task with moderately good performance, although at varying degree of effectiveness and under different conditions. Also, utilizing few-shot prompting can enhance the models' performance. However, perfectly solving the task arises as an unmet challenge for all examined LLMs, which raises an emergence for further research developments on this topic.
☆ Generate-then-Ground in Retrieval-Augmented Generation for Multi-hop Question Answering ACL 2024
Multi-Hop Question Answering (MHQA) tasks present a significant challenge for large language models (LLMs) due to the intensive knowledge required. Current solutions, like Retrieval-Augmented Generation, typically retrieve potential documents from an external corpus to read an answer. However, the performance of this retrieve-then-read paradigm is constrained by the retriever and the inevitable noise in the retrieved documents. To mitigate these challenges, we introduce a novel generate-then-ground (GenGround) framework, synergizing the parametric knowledge of LLMs and external documents to solve a multi-hop question. GenGround empowers LLMs to alternate two phases until the final answer is derived: (1) formulate a simpler, single-hop question and directly generate the answer; (2) ground the question-answer pair in retrieved documents, amending any wrong predictions in the answer. We also propose an instructional grounding distillation method to generalize our method into smaller models. Extensive experiments conducted on four datasets illustrate the superiority of our method.
comment: ACL 2024 (main conference)
☆ InterBiasing: Boost Unseen Word Recognition through Biasing Intermediate Predictions
Despite recent advances in end-to-end speech recognition methods, their output is biased to the training data's vocabulary, resulting in inaccurate recognition of unknown terms or proper nouns. To improve the recognition accuracy for a given set of such terms, we propose an adaptation parameter-free approach based on Self-conditioned CTC. Our method improves the recognition accuracy of misrecognized target keywords by substituting their intermediate CTC predictions with corrected labels, which are then passed on to the subsequent layers. First, we create pairs of correct labels and recognition error instances for a keyword list using Text-to-Speech and a recognition model. We use these pairs to replace intermediate prediction errors by the labels. Conditioning the subsequent layers of the encoder on the labels, it is possible to acoustically evaluate the target keywords. Experiments conducted in Japanese demonstrated that our method successfully improved the F1 score for unknown words.
comment: Accepted to Interspeech 2024
☆ InternLM-Law: An Open Source Chinese Legal Large Language Model
While large language models (LLMs) have showcased impressive capabilities, they struggle with addressing legal queries due to the intricate complexities and specialized expertise required in the legal field. In this paper, we introduce InternLM-Law, a specialized LLM tailored for addressing diverse legal queries related to Chinese laws, spanning from responding to standard legal questions (e.g., legal exercises in textbooks) to analyzing complex real-world legal situations. We meticulously construct a dataset in the Chinese legal domain, encompassing over 1 million queries, and implement a data filtering and processing pipeline to ensure its diversity and quality. Our training approach involves a novel two-stage process: initially fine-tuning LLMs on both legal-specific and general-purpose content to equip the models with broad knowledge, followed by exclusive fine-tuning on high-quality legal data to enhance structured output generation. InternLM-Law achieves the highest average performance on LawBench, outperforming state-of-the-art models, including GPT-4, on 13 out of 20 subtasks. We make InternLM-Law and our dataset publicly available to facilitate future research in applying LLMs within the legal domain.
comment: Our dataset, code and models will be released at https://github.com/InternLM/InternLM-Law
☆ FlowBench: Revisiting and Benchmarking Workflow-Guided Planning for LLM-based Agents
LLM-based agents have emerged as promising tools, which are crafted to fulfill complex tasks by iterative planning and action. However, these agents are susceptible to undesired planning hallucinations when lacking specific knowledge for expertise-intensive tasks. To address this, preliminary attempts are made to enhance planning reliability by incorporating external workflow-related knowledge. Despite the promise, such infused knowledge is mostly disorganized and diverse in formats, lacking rigorous formalization and comprehensive comparisons. Motivated by this, we formalize different formats of workflow knowledge and present FlowBench, the first benchmark for workflow-guided planning. FlowBench covers 51 different scenarios from 6 domains, with knowledge presented in diverse formats. To assess different LLMs on FlowBench, we design a multi-tiered evaluation framework. We evaluate the efficacy of workflow knowledge across multiple formats, and the results indicate that current LLM agents need considerable improvements for satisfactory planning. We hope that our challenging benchmark can pave the way for future agent planning research.
☆ OATH-Frames: Characterizing Online Attitudes Towards Homelessness with LLM Assistants
Warning: Contents of this paper may be upsetting. Public attitudes towards key societal issues, expressed on online media, are of immense value in policy and reform efforts, yet challenging to understand at scale. We study one such social issue: homelessness in the U.S., by leveraging the remarkable capabilities of large language models to assist social work experts in analyzing millions of posts from Twitter. We introduce a framing typology: Online Attitudes Towards Homelessness (OATH) Frames: nine hierarchical frames capturing critiques, responses and perceptions. We release annotations with varying degrees of assistance from language models, with immense benefits in scaling: 6.5x speedup in annotation time while only incurring a 3 point F1 reduction in performance with respect to the domain experts. Our experiments demonstrate the value of modeling OATH-Frames over existing sentiment and toxicity classifiers. Our large-scale analysis with predicted OATH-Frames on 2.4M posts on homelessness reveal key trends in attitudes across states, time periods and vulnerable populations, enabling new insights on the issue. Our work provides a general framework to understand nuanced public attitudes at scale, on issues beyond homelessness.
comment: Project website: https://dill-lab.github.io/oath-frames/
☆ 70B-parameter large language models in Japanese medical question-answering
Since the rise of large language models (LLMs), the domain adaptation has been one of the hot topics in various domains. Many medical LLMs trained with English medical dataset have made public recently. However, Japanese LLMs in medical domain still lack its research. Here we utilize multiple 70B-parameter LLMs for the first time and show that instruction tuning using Japanese medical question-answering dataset significantly improves the ability of Japanese LLMs to solve Japanese medical license exams, surpassing 50\% in accuracy. In particular, the Japanese-centric models exhibit a more significant leap in improvement through instruction tuning compared to their English-centric counterparts. This underscores the importance of continual pretraining and the adjustment of the tokenizer in our local language. We also examine two slightly different prompt formats, resulting in non-negligible performance improvement.
comment: 7 pages, 2 figures, 4 Tables
☆ Sports Intelligence: Assessing the Sports Understanding Capabilities of Language Models through Question Answering from Text to Video
Understanding sports is crucial for the advancement of Natural Language Processing (NLP) due to its intricate and dynamic nature. Reasoning over complex sports scenarios has posed significant challenges to current NLP technologies which require advanced cognitive capabilities. Toward addressing the limitations of existing benchmarks on sports understanding in the NLP field, we extensively evaluated mainstream large language models for various sports tasks. Our evaluation spans from simple queries on basic rules and historical facts to complex, context-specific reasoning, leveraging strategies from zero-shot to few-shot learning, and chain-of-thought techniques. In addition to unimodal analysis, we further assessed the sports reasoning capabilities of mainstream video language models to bridge the gap in multimodal sports understanding benchmarking. Our findings highlighted the critical challenges of sports understanding for NLP. We proposed a new benchmark based on a comprehensive overview of existing sports datasets and provided extensive error analysis which we hope can help identify future research priorities in this field.
☆ Direct Multi-Turn Preference Optimization for Language Agents
Adapting Large Language Models (LLMs) for agent tasks is critical in developing language agents. Direct Preference Optimization (DPO) is a promising technique for this adaptation with the alleviation of compounding errors, offering a means to directly optimize Reinforcement Learning (RL) objectives. However, applying DPO to multi-turn tasks presents challenges due to the inability to cancel the partition function. Overcoming this obstacle involves making the partition function independent of the current state and addressing length disparities between preferred and dis-preferred trajectories. In this light, we replace the policy constraint with the state-action occupancy measure constraint in the RL objective and add length normalization to the Bradley-Terry model, yielding a novel loss function named DMPO for multi-turn agent tasks with theoretical explanations. Extensive experiments on three multi-turn agent task datasets confirm the effectiveness and superiority of the DMPO loss.
☆ DistiLRR: Transferring Code Repair for Low-Resource Programming Languages
Large language models (LLMs) have shown remarkable performance on code generation tasks. A recent application of LLMs for code generation is iterative code repair, where a model fixes an incorrect program by rationalizing about errors and generating a new program. However, code repair is primarily studied on high-resource languages like Python, and the framework's efficacy is under-explored on low-resource languages. To apply code repair for low-resource languages, we propose Distilling Low-Resource Repairs (DistiLRR), an approach that transfers the reasoning and code generation ability from a teacher model to a student model. Our results show that DistiLRR consistently outperforms baselines on low-resource languages, but has similar performance on high-resource languages. To investigate this behavior, we perform a further analysis and find that the correlation between rationale quality and code correctness is weaker than previously perceived. We hypothesize this weakness is magnified in low-resource settings where base models lack deep knowledge of a programming language, leading to wavering benefits of code repair between high-resource and low-resource languages.
☆ LatentExplainer: Explaining Latent Representations in Deep Generative Models with Multi-modal Foundation Models
Deep generative models like VAEs and diffusion models have advanced various generation tasks by leveraging latent variables to learn data distributions and generate high-quality samples. Despite the field of explainable AI making strides in interpreting machine learning models, understanding latent variables in generative models remains challenging. This paper introduces LatentExplainer, a framework for automatically generating semantically meaningful explanations of latent variables in deep generative models. LatentExplainer tackles three main challenges: inferring the meaning of latent variables, aligning explanations with inductive biases, and handling varying degrees of explainability. By perturbing latent variables and interpreting changes in generated data, the framework provides a systematic approach to understanding and controlling the data generation process, enhancing the transparency and interpretability of deep generative models. We evaluate our proposed method on several real-world and synthetic datasets, and the results demonstrate superior performance in generating high-quality explanations of latent variables.
☆ From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking
The rapid development of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) has exposed vulnerabilities to various adversarial attacks. This paper provides a comprehensive overview of jailbreaking research targeting both LLMs and MLLMs, highlighting recent advancements in evaluation benchmarks, attack techniques and defense strategies. Compared to the more advanced state of unimodal jailbreaking, multimodal domain remains underexplored. We summarize the limitations and potential research directions of multimodal jailbreaking, aiming to inspire future research and further enhance the robustness and security of MLLMs.
☆ Leveraging Passage Embeddings for Efficient Listwise Reranking with Large Language Models
Recent studies have demonstrated the effectiveness of using large language language models (LLMs) in passage ranking. The listwise approaches, such as RankGPT, have become new state-of-the-art in this task. However, the efficiency of RankGPT models is limited by the maximum context length and relatively high latency of LLM inference. To address these issues, in this paper, we propose PE-Rank, leveraging the single passage embedding as a good context compression for efficient listwise passage reranking. By treating each passage as a special token, we can directly input passage embeddings into LLMs, thereby reducing input length. Additionally, we introduce an inference method that dynamically constrains the decoding space to these special tokens, accelerating the decoding process. For adapting the model to reranking, we employ listwise learning to rank loss for training. Evaluation results on multiple benchmarks demonstrate that PE-Rank significantly improves efficiency in both prefilling and decoding, while maintaining competitive ranking effectiveness. {The Code is available at \url{https://github.com/liuqi6777/pe_rank}.}
☆ ToVo: Toxicity Taxonomy via Voting
Existing toxic detection models face significant limitations, such as lack of transparency, customization, and reproducibility. These challenges stem from the closed-source nature of their training data and the paucity of explanations for their evaluation mechanism. To address these issues, we propose a dataset creation mechanism that integrates voting and chain-of-thought processes, producing a high-quality open-source dataset for toxic content detection. Our methodology ensures diverse classification metrics for each sample and includes both classification scores and explanatory reasoning for the classifications. We utilize the dataset created through our proposed mechanism to train our model, which is then compared against existing widely-used detectors. Our approach not only enhances transparency and customizability but also facilitates better fine-tuning for specific use cases. This work contributes a robust framework for developing toxic content detection models, emphasizing openness and adaptability, thus paving the way for more effective and user-specific content moderation solutions.
☆ Efficient Continual Pre-training by Mitigating the Stability Gap
Continual pre-training has increasingly become the predominant approach for adapting Large Language Models (LLMs) to new domains. This process involves updating the pre-trained LLM with a corpus from a new domain, resulting in a shift in the training distribution. To study the behavior of LLMs during this shift, we measured the model's performance throughout the continual pre-training process. we observed a temporary performance drop at the beginning, followed by a recovery phase, a phenomenon known as the "stability gap," previously noted in vision models classifying new classes. To address this issue and enhance LLM performance within a fixed compute budget, we propose three effective strategies: (1) Continually pre-training the LLM on a subset with a proper size for multiple epochs, resulting in faster performance recovery than pre-training the LLM on a large corpus in a single epoch; (2) Pre-training the LLM only on high-quality sub-corpus, which rapidly boosts domain performance; and (3) Using a data mixture similar to the pre-training data to reduce distribution gap. We conduct various experiments on Llama-family models to validate the effectiveness of our strategies in both medical continual pre-training and instruction tuning. For example, our strategies improve the average medical task performance of the OpenLlama-3B model from 36.2% to 40.7% with only 40% of the original training budget and enhance the average general task performance without causing forgetting. Furthermore, we apply our strategies to the Llama-3-8B model. The resulting model, Llama-3-Physician, achieves the best medical performance among current open-source models, and performs comparably to or even better than GPT-4 on several medical benchmarks. We release our models at \url{https://huggingface.co/YiDuo1999/Llama-3-Physician-8B-Instruct}.
☆ Is this a bad table? A Closer Look at the Evaluation of Table Generation from Text
Understanding whether a generated table is of good quality is important to be able to use it in creating or editing documents using automatic methods. In this work, we underline that existing measures for table quality evaluation fail to capture the overall semantics of the tables, and sometimes unfairly penalize good tables and reward bad ones. We propose TabEval, a novel table evaluation strategy that captures table semantics by first breaking down a table into a list of natural language atomic statements and then compares them with ground truth statements using entailment-based measures. To validate our approach, we curate a dataset comprising of text descriptions for 1,250 diverse Wikipedia tables, covering a range of topics and structures, in contrast to the limited scope of existing datasets. We compare TabEval with existing metrics using unsupervised and supervised text-to-table generation methods, demonstrating its stronger correlation with human judgments of table quality across four datasets.
☆ Word Matters: What Influences Domain Adaptation in Summarization?
Domain adaptation aims to enable Large Language Models (LLMs) to generalize domain datasets unseen effectively during the training phase. However, factors such as the size of the model parameters and the scale of training data are general influencers and do not reflect the nuances of domain adaptation performance. This paper investigates the fine-grained factors affecting domain adaptation performance, analyzing the specific impact of `words' in training data on summarization tasks. We propose quantifying dataset learning difficulty as the learning difficulty of generative summarization, which is determined by two indicators: word-based compression rate and abstraction level. Our experiments conclude that, when considering dataset learning difficulty, the cross-domain overlap and the performance gain in summarization tasks exhibit an approximate linear relationship, which is not directly related to the number of words. Based on this finding, predicting a model's performance on unknown domain datasets is possible without undergoing training.
☆ TemPrompt: Multi-Task Prompt Learning for Temporal Relation Extraction in RAG-based Crowdsourcing Systems
Temporal relation extraction (TRE) aims to grasp the evolution of events or actions, and thus shape the workflow of associated tasks, so it holds promise in helping understand task requests initiated by requesters in crowdsourcing systems. However, existing methods still struggle with limited and unevenly distributed annotated data. Therefore, inspired by the abundant global knowledge stored within pre-trained language models (PLMs), we propose a multi-task prompt learning framework for TRE (TemPrompt), incorporating prompt tuning and contrastive learning to tackle these issues. To elicit more effective prompts for PLMs, we introduce a task-oriented prompt construction approach that thoroughly takes the myriad factors of TRE into consideration for automatic prompt generation. In addition, we present temporal event reasoning as a supplement to bolster the model's focus on events and temporal cues. The experimental results demonstrate that TemPrompt outperforms all compared baselines across the majority of metrics under both standard and few-shot settings. A case study is provided to validate its effectiveness in crowdsourcing scenarios.
comment: 12 pages, 9 figures
☆ How Well Do LLMs Represent Values Across Cultures? Empirical Analysis of LLM Responses Based on Hofstede Cultural Dimensions
Large Language Models (LLMs) attempt to imitate human behavior by responding to humans in a way that pleases them, including by adhering to their values. However, humans come from diverse cultures with different values. It is critical to understand whether LLMs showcase different values to the user based on the stereotypical values of a user's known country. We prompt different LLMs with a series of advice requests based on 5 Hofstede Cultural Dimensions -- a quantifiable way of representing the values of a country. Throughout each prompt, we incorporate personas representing 36 different countries and, separately, languages predominantly tied to each country to analyze the consistency in the LLMs' cultural understanding. Through our analysis of the responses, we found that LLMs can differentiate between one side of a value and another, as well as understand that countries have differing values, but will not always uphold the values when giving advice, and fail to understand the need to answer differently based on different cultural values. Rooted in these findings, we present recommendations for training value-aligned and culturally sensitive LLMs. More importantly, the methodology and the framework developed here can help further understand and mitigate culture and language alignment issues with LLMs.
♻ ☆ MTUncertainty: Assessing the Need for Post-editing of Machine Translation Outputs by Fine-tuning OpenAI LLMs
Translation Quality Evaluation (TQE) is an essential step of the modern translation production process. TQE is critical in assessing both machine translation (MT) and human translation (HT) quality without reference translations. The ability to evaluate or even simply estimate the quality of translation automatically may open significant efficiency gains through process optimisation. This work examines whether the state-of-the-art large language models (LLMs) can be used for this purpose. We take OpenAI models as the best state-of-the-art technology and approach TQE as a binary classification task. On eight language pairs including English to Italian, German, French, Japanese, Dutch, Portuguese, Turkish, and Chinese, our experimental results show that fine-tuned gpt3.5 can demonstrate good performance on translation quality prediction tasks, i.e. whether the translation needs to be edited. Another finding is that simply increasing the sizes of LLMs does not lead to apparent better performances on this task by comparing the performance of three different versions of OpenAI models: curie, davinci, and gpt3.5 with 13B, 175B, and 175B parameters, respectively.
comment: Accepted by EAMT2024: The 25th Annual Conference of The European Association for Machine Translation
♻ ☆ AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention
Despite their great success across various multimodal tasks, Large Vision-Language Models (LVLMs) are facing a prevalent problem with object hallucinations, where the generated textual responses are inconsistent with ground-truth objects in the given image. This paper investigates various LVLMs and pinpoints attention deficiency toward discriminative local image features as one root cause of object hallucinations. Specifically, LVLMs predominantly attend to prompt-independent global image features, while failing to capture prompt-relevant local features, consequently undermining the visual grounding capacity of LVLMs and leading to hallucinations. To this end, we propose Assembly of Global and Local Attention (AGLA), a training-free and plug-and-play approach that mitigates object hallucinations by exploring an ensemble of global features for response generation and local features for visual discrimination simultaneously. Our approach exhibits an image-prompt matching scheme that captures prompt-relevant local features from images, leading to an augmented view of the input image where prompt-relevant content is reserved while irrelevant distractions are masked. With the augmented view, a calibrated decoding distribution can be derived by integrating generative global features from the original image and discriminative local features from the augmented image. Extensive experiments show that AGLA consistently mitigates object hallucinations and enhances general perception capability for LVLMs across various discriminative and generative benchmarks. Our code will be released at https://github.com/Lackel/AGLA.
♻ ☆ Getting Serious about Humor: Crafting Humor Datasets with Unfunny Large Language Models
Humor is a fundamental facet of human cognition and interaction. Yet, despite recent advances in natural language processing, humor detection remains a challenging task that is complicated by the scarcity of datasets that pair humorous texts with similar non-humorous counterparts. In our work, we investigate whether large language models (LLMs), can generate synthetic data for humor detection via editing texts. We benchmark LLMs on an existing human dataset and show that current LLMs display an impressive ability to 'unfun' jokes, as judged by humans and as measured on the downstream task of humor detection. We extend our approach to a code-mixed English-Hindi humor dataset, where we find that GPT-4's synthetic data is highly rated by bilingual annotators and provides challenging adversarial examples for humor classifiers.
♻ ☆ CREATOR: Tool Creation for Disentangling Abstract and Concrete Reasoning of Large Language Models EMNLP 2023
Large Language Models (LLMs) have made significant progress in utilizing tools, but their ability is limited by API availability and the instability of implicit reasoning, particularly when both planning and execution are involved. To overcome these limitations, we propose CREATOR, a novel framework that enables LLMs to create their own tools using documentation and code realization. CREATOR disentangles abstract tool creation and concrete decision execution, resulting in improved performance. We evaluate CREATOR on MATH and TabMWP benchmarks, respectively consisting of challenging math competition problems and diverse tabular contents. Remarkably, CREATOR outperforms existing chain-of-thought, program-of-thought, and tool-using baselines. Additionally, we introduce the Creation Challenge dataset, featuring 2K diverse questions, to emphasize the necessity and benefits of LLMs' tool creation ability. Further research demonstrates that leveraging LLMs as tool creators facilitates knowledge transfer, and LLMs exhibit varying levels of tool creation abilities, enabling them to adapt to diverse situations. The tool creation ability revolutionizes the LLM's problem-solving paradigm, driving us closer to the next frontier of artificial intelligence. All the codes and data are released.
comment: Findings of EMNLP 2023
♻ ☆ DUAL-REFLECT: Enhancing Large Language Models for Reflective Translation through Dual Learning Feedback Mechanisms ACL 2024
Recently, large language models (LLMs) enhanced by self-reflection have achieved promising performance on machine translation. The key idea is guiding LLMs to generate translation with human-like feedback. However, existing self-reflection methods lack effective feedback information, limiting the translation performance. To address this, we introduce a DUAL-REFLECT framework, leveraging the dual learning of translation tasks to provide effective feedback, thereby enhancing the models' self-reflective abilities and improving translation performance. The application of this method across various translation tasks has proven its effectiveness in improving translation accuracy and eliminating ambiguities, especially in translation tasks with low-resource language pairs.
comment: Accepted to ACL 2024 main conference
♻ ☆ Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models
Do large language models (LLMs) know the law? These models are increasingly being used to augment legal practice, education, and research, yet their revolutionary potential is threatened by the presence of hallucinations -- textual output that is not consistent with legal facts. We present the first systematic evidence of these hallucinations, documenting LLMs' varying performance across jurisdictions, courts, time periods, and cases. Our work makes four key contributions. First, we develop a typology of legal hallucinations, providing a conceptual framework for future research in this area. Second, we find that legal hallucinations are alarmingly prevalent, occurring between 58% of the time with ChatGPT 4 and 88% with Llama 2, when these models are asked specific, verifiable questions about random federal court cases. Third, we illustrate that LLMs often fail to correct a user's incorrect legal assumptions in a contra-factual question setup. Fourth, we provide evidence that LLMs cannot always predict, or do not always know, when they are producing legal hallucinations. Taken together, our findings caution against the rapid and unsupervised integration of popular LLMs into legal tasks. Even experienced lawyers must remain wary of legal hallucinations, and the risks are highest for those who stand to benefit from LLMs the most -- pro se litigants or those without access to traditional legal resources.
♻ ☆ XNLP: An Interactive Demonstration System for Universal Structured NLP ACL 2024
Structured Natural Language Processing (XNLP) is an important subset of NLP that entails understanding the underlying semantic or syntactic structure of texts, which serves as a foundational component for many downstream applications. Despite certain recent efforts to explore universal solutions for specific categories of XNLP tasks, a comprehensive and effective approach for unifying all XNLP tasks long remains underdeveloped. In the meanwhile, while XNLP demonstration systems are vital for researchers exploring various XNLP tasks, existing platforms can be limited to, e.g., supporting few XNLP tasks, lacking interactivity and universalness. To this end, we propose an advanced XNLP demonstration platform, where we propose leveraging LLM to achieve universal XNLP, with one model for all with high generalizability. Overall, our system advances in multiple aspects, including universal XNLP modeling, high performance, interpretability, scalability, and interactivity, providing a unified platform for exploring diverse XNLP tasks in the community. XNLP is online: https://xnlp.haofei.vip
comment: ACL 2024 Demonstration Paper
♻ ☆ ApiQ: Finetuning of 2-Bit Quantized Large Language Model
Memory-efficient finetuning of large language models (LLMs) has recently attracted huge attention with the increasing size of LLMs, primarily due to the constraints posed by GPU memory limitations and the effectiveness of these methods compared to full finetuning. Despite the advancements, current strategies for memory-efficient finetuning, such as QLoRA, exhibit inconsistent performance across diverse bit-width quantizations and multifaceted tasks. This inconsistency largely stems from the detrimental impact of the quantization process on preserved knowledge, leading to catastrophic forgetting and undermining the utilization of pretrained models for finetuning purposes. In this work, we introduce a novel quantization framework, ApiQ, designed to restore the lost information from quantization by concurrently initializing the LoRA components and quantizing the weights of LLMs. This approach ensures the maintenance of the original LLM's activation precision while mitigating the error propagation from shallower into deeper layers. Through comprehensive evaluations conducted on a spectrum of language tasks with various LLMs, ApiQ demonstrably minimizes activation error during quantization. Consequently, it consistently achieves superior finetuning results across various bit-widths.
comment: more benchmarks and new method, block-wise ApiQ. code: https://github.com/BaohaoLiao/ApiQ
♻ ☆ CantTalkAboutThis: Aligning Language Models to Stay on Topic in Dialogues
Recent advancements in instruction-tuning datasets have predominantly focused on specific tasks like mathematical or logical reasoning. There has been a notable gap in data designed for aligning language models to maintain topic relevance in conversations - a critical aspect for deploying chatbots to production. We introduce the CantTalkAboutThis dataset to help language models remain focused on the subject at hand during task-oriented interactions. It consists of synthetic dialogues on a wide range of conversation topics from different domains. These dialogues are interspersed with distractor turns that intentionally divert the chatbot from the predefined topic. Fine-tuning language models on this dataset helps make them resilient to deviating from the role assigned and improves their ability to maintain topical coherence compared to general-purpose instruction-tuned LLMs like GPT-4-turbo and Mixtral-Instruct. Additionally, preliminary observations suggest that training models on this dataset also enhance their performance on fine-grained instruction following tasks, including safety alignment.
♻ ☆ Speech foundation models in healthcare: Effect of layer selection on pathological speech feature prediction INTERSPEECH 2024
Accurately extracting clinical information from speech is critical to the diagnosis and treatment of many neurological conditions. As such, there is interest in leveraging AI for automatic, objective assessments of clinical speech to facilitate diagnosis and treatment of speech disorders. We explore transfer learning using foundation models, focusing on the impact of layer selection for the downstream task of predicting pathological speech features. We find that selecting an optimal layer can greatly improve performance (~15.8% increase in balanced accuracy per feature as compared to worst layer, ~13.6% increase as compared to final layer), though the best layer varies by predicted feature and does not always generalize well to unseen data. A learned weighted sum offers comparable performance to the average best layer in-distribution (only ~1.2% lower) and had strong generalization for out-of-distribution data (only 1.5% lower than the average best layer).
comment: Accepted to INTERSPEECH 2024
♻ ☆ Chain-of-Thought Unfaithfulness as Disguised Accuracy
Understanding the extent to which Chain-of-Thought (CoT) generations align with a large language model's (LLM) internal computations is critical for deciding whether to trust an LLM's output. As a proxy for CoT faithfulness, Lanham et al. (2023) propose a metric that measures a model's dependence on its CoT for producing an answer. Within a single family of proprietary models, they find that LLMs exhibit a scaling-then-inverse-scaling relationship between model size and their measure of faithfulness, and that a 13 billion parameter model exhibits increased faithfulness compared to models ranging from 810 million to 175 billion parameters in size. We evaluate whether these results generalize as a property of all LLMs. We replicate the experimental setup in their section focused on scaling experiments with three different families of models and, under specific conditions, successfully reproduce the scaling trends for CoT faithfulness they report. However, after normalizing the metric to account for a model's bias toward certain answer choices, unfaithfulness drops significantly for smaller less-capable models. This normalized faithfulness metric is also strongly correlated ($R^2$=0.74) with accuracy, raising doubts about its validity for evaluating faithfulness.
comment: TMLR accepted paper camera-ready version. First two authors contributed equally. 8 pages main, 13 pages appendix
♻ ☆ Information Guided Regularization for Fine-tuning Language Models
The pretraining-fine-tuning paradigm has been the de facto strategy for transfer learning in modern language modeling. With the understanding that task adaptation in LMs is often a function of parameters shared across tasks, we argue that a more surgical approach to regularization needs to exist for smoother transfer learning. Towards this end, we investigate how the pretraining loss landscape is affected by these task-sensitive parameters through an information-theoretic lens. We then leverage the findings from our investigations to devise a novel approach to dropout for improved model regularization and better downstream generalization. This approach, named guided dropout, is both task & architecture agnostic and adds no computational overhead to the fine-tuning process. Through empirical evaluations, we showcase that our approach to regularization yields consistently better performance, even in scenarios of data paucity, compared to standardized baselines.
♻ ☆ Critical Learning Periods: Leveraging Early Training Dynamics for Efficient Data Pruning ACL 2024
Neural Machine Translation models are extremely data and compute-hungry. However, not all data points contribute equally to model training and generalization. Data pruning to remove the low-value data points has the benefit of drastically reducing the compute budget without significant drop in model performance. In this paper, we propose a new data pruning technique: Checkpoints Across Time (CAT), that leverages early model training dynamics to identify the most relevant data points for model performance. We benchmark CAT against several data pruning techniques including COMET-QE, LASER and LaBSE. We find that CAT outperforms the benchmarks on Indo-European languages on multiple test sets. When applied to English-German, English-French and English-Swahili translation tasks, CAT achieves comparable performance to using the full dataset, while pruning up to 50% of training data. We inspect the data points that CAT selects and find that it tends to favour longer sentences and sentences with unique or rare words.
comment: Accepted to ACL 2024 Findings
♻ ☆ CORM: Cache Optimization with Recent Message for Large Language Model Inference
Large Language Models (LLMs), despite their remarkable performance across a wide range of tasks, necessitate substantial GPU memory and consume significant computational resources. Beyond the memory taken up by model weights, the memory used by the KV cache rises linearly with sequence length, becoming a primary bottleneck for inference. In this paper, we introduce an innovative method for optimizing the KV cache, which considerably minimizes its memory footprint. Upon thorough investigation, we discover that in most Transformer models, (i) there is a striking similarity between adjacent tokens' query vectors, and (ii) the attention calculation of the current query can rely exclusively on the attention information of a small fraction of preceding queries. Based on these observations, we present CORM, a KV cache eviction policy that dynamically retains essential key-value pairs for inference without the need for model fine-tuning. Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70\% with negligible performance degradation across six tasks in LongBench. Furthermore, we demonstrate that CORM is compatible with GQA for further compression rate.
♻ ☆ GestureGPT: Toward Zero-shot Interactive Gesture Understanding and Grounding with Large Language Model Agents
Current gesture interfaces typically demand users to learn and perform gestures from a predefined set, which leads to a less natural experience. Interfaces supporting user-defined gestures eliminate the learning process, but users still need to demonstrate and associate the gesture to a specific system function themselves. We introduce GestureGPT, a free-form hand gesture understanding framework that does not require users to learn, demonstrate, or associate gestures. Our framework leverages the large language model's (LLM) astute common sense and strong inference ability to understand a spontaneously performed gesture from its natural language descriptions, and automatically maps it to a function provided by the interface. More specifically, our triple-agent framework involves a Gesture Description Agent that automatically segments and formulates natural language descriptions of hand poses and movements based on hand landmark coordinates. The description is deciphered by a Gesture Inference Agent through self-reasoning and querying about the interaction context (e.g., interaction history, gaze data), which a Context Management Agent organizes and provides. Following iterative exchanges, the Gesture Inference Agent discerns user intent, grounding it to an interactive function. We validated our conceptual framework under two real-world scenarios: smart home controlling and online video streaming. The average zero-shot Top-5 grounding accuracies are 83.59% for smart home tasks and 73.44% for video streaming. We also provided an extensive discussion of our framework including model selection rationale, generated description quality, generalizability etc.
♻ ☆ AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding
AI personal assistants deployed via robots or wearables require embodied understanding to collaborate with humans effectively. However, current Vision-Language Models (VLMs) primarily focus on third-person view videos, neglecting the richness of egocentric perceptual experience. To address this gap, we propose three key contributions. First, we introduce the Egocentric Video Understanding Dataset (EVUD) for training VLMs on video captioning and question answering tasks specific to egocentric videos. Second, we present AlanaVLM, a 7B parameter VLM trained using parameter-efficient methods on EVUD. Finally, we evaluate AlanaVLM's capabilities on OpenEQA, a challenging benchmark for embodied video question answering. Our model achieves state-of-the-art performance, outperforming open-source models including strong Socratic models using GPT-4 as a planner by 3.6%. Additionally, we outperform Claude 3 and Gemini Pro Vision 1.0 and showcase competitive results compared to Gemini Pro 1.5 and GPT-4V, even surpassing the latter in spatial reasoning. This research paves the way for building efficient VLMs that can be deployed in robots or wearables, leveraging embodied video understanding to collaborate seamlessly with humans in everyday tasks, contributing to the next generation of Embodied AI.
comment: Code available https://github.com/alanaai/EVUD
♻ ☆ Accelerating Complex Disease Treatment through Network Medicine and GenAI: A Case Study on Drug Repurposing for Breast Cancer
The objective of this research is to introduce a network specialized in predicting drugs that can be repurposed by investigating real-world evidence sources, such as clinical trials and biomedical literature. Specifically, it aims to generate drug combination therapies for complex diseases (e.g., cancer, Alzheimer's). We present a multilayered network medicine approach, empowered by a highly configured ChatGPT prompt engineering system, which is constructed on the fly to extract drug mentions in clinical trials. Additionally, we introduce a novel algorithm that connects real-world evidence with disease-specific signaling pathways (e.g., KEGG database). This sheds light on the repurposability of drugs if they are found to bind with one or more protein constituents of a signaling pathway. To demonstrate, we instantiated the framework for breast cancer and found that, out of 46 breast cancer signaling pathways, the framework identified 38 pathways that were covered by at least two drugs. This evidence signals the potential for combining those drugs. Specifically, the most covered signaling pathway, ID hsa:2064, was covered by 108 drugs, some of which can be combined. Conversely, the signaling pathway ID hsa:1499 was covered by only two drugs, indicating a significant gap for further research. Our network medicine framework, empowered by GenAI, shows promise in identifying drug combinations with a high degree of specificity, knowing the exact signaling pathways and proteins that serve as targets. It is noteworthy that ChatGPT successfully accelerated the process of identifying drug mentions in clinical trials, though further investigations are required to determine the relationships among the drug mentions.
comment: 9 pages double columns, 5 figures, 3 algorithms, 3 tables, and 1 listing, Submitted to IEEE MedAI'24 Conference, to be held November 15-17, Chongqing, China
♻ ☆ Jellyfish: A Large Language Model for Data Preprocessing
This paper explores the utilization of LLMs for data preprocessing (DP), a crucial step in the data mining pipeline that transforms raw data into a clean format conducive to easy processing. Whereas the use of LLMs has sparked interest in devising universal solutions to DP, recent initiatives in this domain typically rely on GPT APIs, raising inevitable data breach concerns. Unlike these approaches, we consider instruction-tuning local LLMs (7 -- 13B models) as universal DP task solvers that operate on a local, single, and low-priced GPU, ensuring data security and enabling further customization. We select a collection of datasets across four representative DP tasks and construct instruction tuning data using data configuration, knowledge injection, and reasoning data distillation techniques tailored to DP. By tuning Mistral-7B, Llama 3-8B, and OpenOrca-Platypus2-13B, our models, namely, Jellyfish-7B/8B/13B, deliver competitiveness compared to GPT-3.5/4 models and strong generalizability to unseen tasks while barely compromising the base models' abilities in NLP tasks. Meanwhile, Jellyfish offers enhanced reasoning capabilities compared to GPT-3.5. Our models are available at: https://huggingface.co/NECOUDBFM/Jellyfish . Our instruction dataset is available at: https://huggingface.co/datasets/NECOUDBFM/Jellyfish-Instruct .
comment: a.k.a. "Jellyfish: Instruction-Tuning Local Large Language Models for Data Preprocessing''
♻ ☆ How to Build an AI Tutor that Can Adapt to Any Course and Provide Accurate Answers Using Large Language Model and Retrieval-Augmented Generation
This paper proposes a low-code solution to build an AI tutor that leverages advanced AI techniques to provide accurate and contextually relevant responses in a personalized learning environment. The OpenAI Assistants API allows AI Tutor to easily embed, store, retrieve, and manage files and chat history, enabling a low-code solution. Large Language Models (LLMs) and Retrieval-Augmented Generation (RAG) technology generate sophisticated answers based on course-specific materials. The application efficiently organizes and retrieves relevant information through vector embedding and similarity-based retrieval algorithms. The AI Tutor prototype demonstrates its ability to generate relevant, accurate answers with source citations. It represents a significant advancement in technology-enhanced tutoring systems, democratizing access to high-quality, customized educational support in higher education.
comment: 7 pages, 5 figures
♻ ☆ Efficient Data Generation for Source-grounded Information-seeking Dialogs: A Use Case for Meeting Transcripts
Automating data generation with Large Language Models (LLMs) has become increasingly popular. In this work, we investigate the feasibility and effectiveness of LLM-based data generation in the challenging setting of source-grounded information-seeking dialogs, with response attribution, over long documents. Our source texts consist of long and noisy meeting transcripts, adding to the task complexity. Since automating attribution remains difficult, we propose a semi-automatic approach: dialog queries and responses are generated with LLMs, followed by human verification and identification of attribution spans. Using this approach, we created MISeD -- Meeting Information Seeking Dialogs dataset -- a dataset of information-seeking dialogs focused on meeting transcripts. Models finetuned with MISeD demonstrate superior performance compared to off-the-shelf models, even those of larger size. Finetuning on MISeD gives comparable response generation quality to finetuning on fully manual data, while improving attribution quality and reducing time and effort.
♻ ☆ On the Worst Prompt Performance of Large Language Models
The performance of large language models (LLMs) is acutely sensitive to the phrasing of prompts, which raises significant concerns about their reliability in real-world scenarios. Existing studies often divide prompts into task-level instructions and case-level inputs and primarily focus on evaluating and improving robustness against variations in tasks-level instructions. However, this setup fails to fully address the diversity of real-world user queries and assumes the existence of task-specific datasets. To address these limitations, we introduce RobustAlpacaEval, a new benchmark that consists of semantically equivalent case-level queries and emphasizes the importance of using the worst prompt performance to gauge the lower bound of model performance. Extensive experiments on RobustAlpacaEval with ChatGPT and six open-source LLMs from the Llama, Mistral, and Gemma families uncover substantial variability in model performance; for instance, a difference of 45.48% between the worst and best performance for the Llama-2-70B-chat model, with its worst performance dipping as low as 9.38%. We further illustrate the difficulty in identifying the worst prompt from both model-agnostic and model-dependent perspectives, emphasizing the absence of a shortcut to characterize the worst prompt. We also attempt to enhance the worst prompt performance using existing prompt engineering and prompt consistency methods, but find that their impact is limited. These findings underscore the need to create more resilient LLMs that can maintain high performance across diverse prompts. Data and code are available at https://github.com/cbwbuaa/On-the-Worst-Prompt- Performance-of-LLMs.
♻ ☆ GenDistiller: Distilling Pre-trained Language Models based on an Autoregressive Generative Model
Pre-trained speech language models such as HuBERT and WavLM leverage unlabeled speech data for self-supervised learning and offer powerful representations for numerous downstream tasks. Despite the success of these models, their high requirements for memory and computing resource hinder their application on resource restricted devices. Therefore, this paper introduces GenDistiller, a novel knowledge distillation framework which generates the hidden representations of the pre-trained teacher model directly by a much smaller student network. The proposed method takes the previous hidden layer as history and implements a layer-by-layer prediction of the teacher model autoregressively. Experiments on SUPERB reveal the advantage of GenDistiller over the baseline distilling method without an autoregressive framework, with 33% fewer parameters, similar time consumption and better performance on most of the SUPERB tasks. Ultimately, the proposed GenDistiller reduces the size of WavLM by 82%.
comment: arXiv admin note: text overlap with arXiv:2310.13418
♻ ☆ Eight challenges in developing theory of intelligence
A good theory of mathematical beauty is more practical than any current observation, as new predictions of physical reality can be verified self-consistently. This belief applies to the current status of understanding deep neural networks including large language models and even the biological intelligence. Toy models provide a metaphor of physical reality, allowing mathematically formulating that reality (i.e., the so-called theory), which can be updated as more conjectures are justified or refuted. One does not need to pack all details into a model, but rather, more abstract models are constructed, as complex systems like brains or deep networks have many sloppy dimensions but much less stiff dimensions that strongly impact macroscopic observables. This kind of bottom-up mechanistic modeling is still promising in the modern era of understanding the natural or artificial intelligence. Here, we shed light on eight challenges in developing theory of intelligence following this theoretical paradigm. Theses challenges are representation learning, generalization, adversarial robustness, continual learning, causal learning, internal model of the brain, next-token prediction, and finally the mechanics of subjective experience.
comment: 24 pages, 131 references, revised version to journal
♻ ☆ From Words to Worlds: Transforming One-line Prompt into Immersive Multi-modal Digital Stories with Communicative LLM Agent
Digital storytelling, essential in entertainment, education, and marketing, faces challenges in production scalability and flexibility. The StoryAgent framework, introduced in this paper, utilizes Large Language Models and generative tools to automate and refine digital storytelling. Employing a top-down story drafting and bottom-up asset generation approach, StoryAgent tackles key issues such as manual intervention, interactive scene orchestration, and narrative consistency. This framework enables efficient production of interactive and consistent narratives across multiple modalities, democratizing content creation and enhancing engagement. Our results demonstrate the framework's capability to produce coherent digital stories without reference videos, marking a significant advancement in automated digital storytelling.
comment: 16 pages, 13 figures
♻ ☆ TEaR: Improving LLM-based Machine Translation with Systematic Self-Refinement
Large Language Models (LLMs) have achieved impressive results in Machine Translation (MT). However, careful evaluations by human reveal that the translations produced by LLMs still contain multiple errors. Importantly, feeding back such error information into the LLMs can lead to self-refinement and result in improved translation performance. Motivated by these insights, we introduce a systematic LLM-based self-refinement translation framework, named \textbf{TEaR}, which stands for \textbf{T}ranslate, \textbf{E}stimate, \textbf{a}nd \textbf{R}efine, marking a significant step forward in this direction. Our findings demonstrate that 1) our self-refinement framework successfully assists LLMs in improving their translation quality across a wide range of languages, whether it's from high-resource languages to low-resource ones or whether it's English-centric or centered around other languages; 2) TEaR exhibits superior systematicity and interpretability; 3) different estimation strategies yield varied impacts, directly affecting the effectiveness of the final corrections. Additionally, traditional neural translation models and evaluation models operate separately, often focusing on singular tasks due to their limited capabilities, while general-purpose LLMs possess the capability to undertake both tasks simultaneously. We further conduct cross-model correction experiments to investigate the potential relationship between the translation and evaluation capabilities of general-purpose LLMs. Our code and data are available at https://github.com/fzp0424/self_correct_mt
comment: Our code and data are available at https://github.com/fzp0424/self_correct_mt
♻ ☆ LLMs Are Zero-Shot Context-Aware Simultaneous Translators
The advent of transformers has fueled progress in machine translation. More recently large language models (LLMs) have come to the spotlight thanks to their generality and strong performance in a wide range of language tasks, including translation. Here we show that open-source LLMs perform on par with or better than some state-of-the-art baselines in simultaneous machine translation (SiMT) tasks, zero-shot. We also demonstrate that injection of minimal background information, which is easy with an LLM, brings further performance gains, especially on challenging technical subject-matter. This highlights LLMs' potential for building next generation of massively multilingual, context-aware and terminologically accurate SiMT systems that require no resource-intensive training or fine-tuning.
♻ ☆ Conversational SimulMT: Efficient Simultaneous Translation with Large Language Models
Simultaneous machine translation (SimulMT) presents a challenging trade-off between translation quality and latency. Recent studies have shown that LLMs can achieve good performance in SimulMT tasks. However, this often comes at the expense of high inference cost and latency. In this paper, we propose a conversational SimulMT framework to enhance the inference efficiency of LLM-based SimulMT through multi-turn-dialogue-based decoding. Our experiments with Llama2-7b-chat on two SimulMT benchmarks demonstrate the superiority of LLM in translation quality while achieving comparable computational latency to specialized SimulMT models.
♻ ☆ TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones ICML
In recent years, multimodal large language models (MLLMs) such as GPT-4V have demonstrated remarkable advancements, excelling in a variety of vision-language tasks. Despite their prowess, the closed-source nature and computational demands of such models limit their accessibility and applicability. This study introduces TinyGPT-V, a novel open-source MLLM, designed for efficient training and inference across various vision-language tasks, including image captioning (IC) and visual question answering (VQA). Leveraging a compact yet powerful architecture, TinyGPT-V integrates the Phi-2 language model with pre-trained vision encoders, utilizing a unique mapping module for visual and linguistic information fusion. With a training regimen optimized for small backbones and employing a diverse dataset amalgam, TinyGPT-V requires significantly lower computational resources 24GB for training and as little as 8GB for inference without compromising on performance. Our experiments demonstrate that TinyGPT-V, with its language model 2.8 billion parameters, achieves comparable results in VQA and image inference tasks to its larger counterparts while being uniquely suited for deployment on resource-constrained devices through innovative quantization techniques. This work not only paves the way for more accessible and efficient MLLMs but also underscores the potential of smaller, optimized models in bridging the gap between high performance and computational efficiency in real-world applications. Additionally, this paper introduces a new approach to multimodal large language models using smaller backbones. Our code and training weights are available in the supplementary material.
comment: Accepted by ICML workshop 2024
♻ ☆ What if...?: Thinking Counterfactual Keywords Helps to Mitigate Hallucination in Large Multi-modal Models
This paper presents a way of enhancing the reliability of Large Multi-modal Models (LMMs) in addressing hallucination, where the models generate cross-modal inconsistent responses. Without additional training, we propose Counterfactual Inception, a novel method that implants counterfactual thinking into LMMs using self-generated counterfactual keywords. Our method is grounded in the concept of counterfactual thinking, a cognitive process where human considers alternative realities, enabling more extensive context exploration. Bridging the human cognition mechanism into LMMs, we aim for the models to engage with and generate responses that span a wider contextual scene understanding, mitigating hallucinatory outputs. We further introduce Plausibility Verification Process (PVP), a simple yet robust keyword constraint that effectively filters out sub-optimal keywords to enable the consistent triggering of counterfactual thinking in the model responses. Comprehensive analyses across various LMMs, including both open-source and proprietary models, corroborate that counterfactual thinking significantly reduces hallucination and helps to broaden contextual understanding based on true visual clues.
comment: Project page: https://ivy-lvlm.github.io/Counterfactual-Inception/
♻ ☆ Empowering Multi-step Reasoning across Languages via Tree-of-Thoughts NAACL 2024
Reasoning methods, best exemplified by the well-known Chain-of-Thought (CoT), empower the reasoning abilities of Large Language Models (LLMs) by eliciting them to solve complex tasks in a step-by-step manner. Although they are achieving significant success, the ability to deliver multi-step reasoning remains limited to English because of the imbalance in the distribution of pre-training data, which makes other languages a barrier. In this paper, we propose Cross-lingual Tree-of-Thoughts (Cross-ToT), a method for aligning Cross-lingual CoT reasoning across languages. The proposed method, through a self-consistent cross-lingual prompting mechanism inspired by the Tree-of-Thoughts approach, provides multi-step reasoning paths in different languages that, during the steps, lead to the final solution. Experimental evaluations show that our method significantly outperforms existing prompting methods by reducing the number of interactions and achieving state-of-the-art performance.
comment: Findings of the Association for Computational Linguistics: NAACL 2024
♻ ☆ Black-Box Prompt Optimization: Aligning Large Language Models without Model Training ACL 2024
Large language models (LLMs) have shown impressive success in various applications. However, these models are often not well aligned with human intents, which calls for additional treatments on them; that is, the alignment problem. To make LLMs better follow user instructions, existing alignment methods primarily focus on further training them. However, the extra training of LLMs is usually expensive in terms of GPU computing; even worse, some LLMs are not accessible for user-demanded training, such as GPTs. In this work, we take a different perspective -- Black-Box Prompt Optimization (BPO) -- to perform alignments. The idea is to optimize user prompts to suit LLMs' input understanding, so as to best realize users' intents without updating LLMs' parameters. BPO leverages human preferences to optimize prompts, thus making it superior to LLM (e.g., ChatGPT) as a prompt engineer. Moreover, BPO is model-agnostic, and the empirical results demonstrate that the BPO-aligned ChatGPT yields a 22% increase in the win rate against its original version and 10% for GPT-4. Notably, the BPO-aligned LLMs can outperform the same models aligned by PPO and DPO, and it also brings additional performance gains when combining BPO with PPO or DPO. Code and datasets are released at https://github.com/thu-coai/BPO.
comment: Accepted to ACL 2024
♻ ☆ Linguistic Steganalysis via LLMs: Two Modes for Efficient Detection of Strongly Concealed Stego
To detect stego (steganographic text) in complex scenarios, linguistic steganalysis (LS) with various motivations has been proposed and achieved excellent performance. However, with the development of generative steganography, some stegos have strong concealment, especially after the emergence of LLMs-based steganography, the existing LS has low detection or cannot detect them. We designed a novel LS with two modes called LSGC. In the generation mode, we created an LS-task "description" and used the generation ability of LLM to explain whether texts to be detected are stegos. On this basis, we rethought the principle of LS and LLMs, and proposed the classification mode. In this mode, LSGC deleted the LS-task "description" and used the "causalLM" LLMs to extract steganographic features. The LS features can be extracted by only one pass of the model, and a linear layer with initialization weights is added to obtain the classification probability. Experiments on strongly concealed stegos show that LSGC significantly improves detection and reaches SOTA performance. Additionally, LSGC in classification mode greatly reduces training time while maintaining high performance.
♻ ☆ Annotation alignment: Comparing LLM and human annotations of conversational safety
To what extent to do LLMs align with human perceptions of safety? We study this question via *annotation alignment*, the extent to which LLMs and humans agree when annotating the safety of user-chatbot conversations. We leverage the recent DICES dataset (Aroyo et al., 2023), in which 350 conversations are each rated for safety by 112 annotators spanning 10 race-gender groups. GPT-4 achieves a Pearson correlation of $r = 0.59$ with the average annotator rating, higher than the median annotator's correlation with the average ($r=0.51$). We show that larger datasets are needed to resolve whether GPT-4 exhibits disparities in how well it correlates with demographic groups. Also, there is substantial idiosyncratic variation in correlation *within* groups, suggesting that race & gender do not fully capture differences in alignment. Finally, we find that GPT-4 cannot predict when one demographic group finds a conversation more unsafe than another.
comment: Working draft, short paper. 5 pages, 1 figure
♻ ☆ Passage-specific Prompt Tuning for Passage Reranking in Question Answering with Large Language Models SIGIR24
Effective passage retrieval and reranking methods have been widely utilized to identify suitable candidates in open-domain question answering tasks, recent studies have resorted to LLMs for reranking the retrieved passages by the log-likelihood of the question conditioned on each passage. Although these methods have demonstrated promising results, the performance is notably sensitive to the human-written prompt (or hard prompt), and fine-tuning LLMs can be computationally intensive and time-consuming. Furthermore, this approach limits the leverage of question-passage relevance pairs and passage-specific knowledge to enhance the ranking capabilities of LLMs. In this paper, we propose passage-specific prompt tuning for reranking in open-domain question answering (PSPT): a parameter-efficient method that fine-tunes learnable passage-specific soft prompts, incorporating passage-specific knowledge from a limited set of question-passage relevance pairs. The method involves ranking retrieved passages based on the log-likelihood of the model generating the question conditioned on each passage and the learned soft prompt. We conducted extensive experiments utilizing the Llama-2-chat-7B model across three publicly available open-domain question answering datasets and the results demonstrate the effectiveness of the proposed approach.
comment: Accepted at Gen-IR@SIGIR24
♻ ☆ Optimization Techniques for Unsupervised Complex Table Reasoning via Self-Training Framework
Structured tabular data is a fundamental data type in numerous fields, and the capacity to reason over tables is crucial for answering questions and validating hypotheses. However, constructing labeled data for complex reasoning tasks is labor intensive, and the quantity of annotated data remains insufficient to support the intricate demands of real-world applications. To address the insufficient annotation challenge, we present a self-training framework for unsupervised complex tabular reasoning (UCTR-ST) by generating diverse synthetic data with complex logic. Specifically, UCTR-ST incorporates several essential techniques: we aggregate diverse programs and execute them on tables based on a "Program-Management" component, and we bridge the gap between programs and text with a powerful "Program-Transformation" module that generates natural language sentences with complex logic. Furthermore, we optimize the procedure using a "Table-Text Manipulator" to handle joint table-text reasoning scenarios. The entire framework utilizes self-training techniques to leverage the unlabeled training data, which results in significant performance improvements when tested on real-world data. Experimental results demonstrate that UCTRST achieves above 90% of the supervised model performance on different tasks and domains, reducing the dependence on manual annotation. Additionally, our approach can serve as a data augmentation technique, significantly boosting the performance of supervised models in low-resourced domains.
comment: Submitted to TKDE, preprint version
♻ ☆ FVEL: Interactive Formal Verification Environment with Large Language Models via Theorem Proving
Formal verification (FV) has witnessed growing significance with current emerging program synthesis by the evolving large language models (LLMs). However, current formal verification mainly resorts to symbolic verifiers or hand-craft rules, resulting in limitations for extensive and flexible verification. On the other hand, formal languages for automated theorem proving, such as Isabelle, as another line of rigorous verification, are maintained with comprehensive rules and theorems. In this paper, we propose FVEL, an interactive Formal Verification Environment with LLMs. Specifically, FVEL transforms a given code to be verified into Isabelle, and then conducts verification via neural automated theorem proving with an LLM. The joined paradigm leverages the rigorous yet abundant formulated and organized rules in Isabelle and is also convenient for introducing and adjusting cutting-edge LLMs. To achieve this goal, we extract a large-scale FVELER3. The FVELER dataset includes code dependencies and verification processes that are formulated in Isabelle, containing 758 theories, 29,125 lemmas, and 200,646 proof steps in total with in-depth dependencies. We benchmark FVELER in the FVEL environment by first fine-tuning LLMs with FVELER and then evaluating them on Code2Inv and SV-COMP. The results show that FVEL with FVELER fine-tuned Llama3- 8B solves 17.39% (69 -> 81) more problems, and Mistral-7B 12% (75 -> 84) more problems in SV-COMP. And the proportion of proof errors is reduced. Project page: https://fveler.github.io/.
♻ ☆ Can LLMs Reason with Rules? Logic Scaffolding for Stress-Testing and Improving LLMs ACL 2024
Large language models (LLMs) have achieved impressive human-like performance across various reasoning tasks. However, their mastery of underlying inferential rules still falls short of human capabilities. To investigate this, we propose a logic scaffolding inferential rule generation framework, to construct an inferential rule base, ULogic, comprising both primitive and compositional rules across five domains. Our analysis of GPT-series models over a rule subset reveals significant gaps in LLMs' logic understanding compared to human performance, especially in compositional and structural complex rules with certain bias patterns. We further distill these rules into a smaller-scale inference engine for flexible rule generation and enhancing downstream reasoning. Through a multi-judger evaluation, our inference engine proves effective in generating accurate, complex and abstract conclusions and premises, and improve various commonsense reasoning tasks. Overall, our work sheds light on LLMs' limitations in grasping inferential rule and suggests ways to enhance their logical reasoning abilities~\footnote{Code and data are available at \url{https://github.com/SiyuanWangw/ULogic}.}.
comment: Accepted as a long paper to ACL 2024 Main
♻ ☆ Uncovering Safety Risks of Large Language Models through Concept Activation Vector
Despite careful safety alignment, current large language models (LLMs) remain vulnerable to various attacks. To further unveil the safety risks of LLMs, we introduce a Safety Concept Activation Vector (SCAV) framework, which effectively guides the attacks by accurately interpreting LLMs' safety mechanisms. We then develop an SCAV-guided attack method that can generate both attack prompts and embedding-level attacks with automatically selected perturbation hyperparameters. Both automatic and human evaluations demonstrate that our attack method significantly improves the attack success rate and response quality while requiring less training data. Additionally, we find that our generated attack prompts may be transferable to GPT-4, and the embedding-level attacks may also be transferred to other white-box LLMs whose parameters are known. Our experiments further uncover the safety risks present in current LLMs. For example, we find that six out of seven open-source LLMs that we attack consistently provide relevant answers to more than 85\% malicious instructions. Finally, we provide insights into the safety mechanism of LLMs.
♻ ☆ SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths
Speculative decoding reduces the inference latency of a target large language model via utilizing a smaller and faster draft model. Its performance depends on a hyperparameter K -- the candidate length, i.e., the number of candidate tokens for the target model to verify in each round. However, previous methods often use simple heuristics to choose K, which may result in sub-optimal performance. We study the choice of the candidate length K and formulate it as a Markov Decision Process. We theoretically show that the optimal policy of this Markov decision process takes the form of a threshold policy, i.e., the current speculation should stop and be verified when the probability of getting a rejection exceeds a threshold value. Motivated by this theory, we propose SpecDec++, an enhanced version of speculative decoding that adaptively determines the candidate length on the fly. We augment the draft model with a trained acceptance prediction head to predict the conditional acceptance probability of the candidate tokens. SpecDec++ will stop the current speculation when the predicted probability that at least one token gets rejected exceeds a threshold. We implement SpecDec++ and apply it to the llama-2-chat 7B & 70B model pair. Our adaptive method achieves a 2.04x speedup on the Alpaca dataset (an additional 7.2% improvement over the baseline speculative decoding). On the GSM8K and HumanEval datasets, our method achieves a 2.26x speedup (9.4% improvement) and 2.23x speedup (11.1% improvement), respectively.
comment: v2: fix Table 1
♻ ☆ [WIP] Jailbreak Paradox: The Achilles' Heel of LLMs
We introduce two paradoxes concerning jailbreak of foundation models: First, it is impossible to construct a perfect jailbreak classifier, and second, a weaker model cannot consistently detect whether a stronger (in a pareto-dominant sense) model is jailbroken or not. We provide formal proofs for these paradoxes and a short case study on Llama and GPT4-o to demonstrate this. We discuss broader theoretical and practical repercussions of these results.
Machine Learning 150
☆ NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking
Benchmarking vision-based driving policies is challenging. On one hand, open-loop evaluation with real data is easy, but these results do not reflect closed-loop performance. On the other, closed-loop evaluation is possible in simulation, but is hard to scale due to its significant computational demands. Further, the simulators available today exhibit a large domain gap to real data. This has resulted in an inability to draw clear conclusions from the rapidly growing body of research on end-to-end autonomous driving. In this paper, we present NAVSIM, a middle ground between these evaluation paradigms, where we use large datasets in combination with a non-reactive simulator to enable large-scale real-world benchmarking. Specifically, we gather simulation-based metrics, such as progress and time to collision, by unrolling bird's eye view abstractions of the test scenes for a short simulation horizon. Our simulation is non-reactive, i.e., the evaluated policy and environment do not influence each other. As we demonstrate empirically, this decoupling allows open-loop metric computation while being better aligned with closed-loop evaluations than traditional displacement errors. NAVSIM enabled a new competition held at CVPR 2024, where 143 teams submitted 463 entries, resulting in several new insights. On a large set of challenging scenarios, we observe that simple methods with moderate compute requirements such as TransFuser can match recent large-scale end-to-end driving architectures such as UniAD. Our modular framework can potentially be extended with new datasets, data curation strategies, and metrics, and will be continually maintained to host future challenges. Our code is available at https://github.com/autonomousvision/navsim.
☆ Privacy Preserved Blood Glucose Level Cross-Prediction: An Asynchronous Decentralized Federated Learning Approach
Newly diagnosed Type 1 Diabetes (T1D) patients often struggle to obtain effective Blood Glucose (BG) prediction models due to the lack of sufficient BG data from Continuous Glucose Monitoring (CGM), presenting a significant "cold start" problem in patient care. Utilizing population models to address this challenge is a potential solution, but collecting patient data for training population models in a privacy-conscious manner is challenging, especially given that such data is often stored on personal devices. Considering the privacy protection and addressing the "cold start" problem in diabetes care, we propose "GluADFL", blood Glucose prediction by Asynchronous Decentralized Federated Learning. We compared GluADFL with eight baseline methods using four distinct T1D datasets, comprising 298 participants, which demonstrated its superior performance in accurately predicting BG levels for cross-patient analysis. Furthermore, patients' data might be stored and shared across various communication networks in GluADFL, ranging from highly interconnected (e.g., random, performs the best among others) to more structured topologies (e.g., cluster and ring), suitable for various social networks. The asynchronous training framework supports flexible participation. By adjusting the ratios of inactive participants, we found it remains stable if less than 70% are inactive. Our results confirm that GluADFL offers a practical, privacy-preserving solution for BG prediction in T1D, significantly enhancing the quality of diabetes management.
☆ GenoTEX: A Benchmark for Evaluating LLM-Based Exploration of Gene Expression Data in Alignment with Bioinformaticians
Recent advancements in machine learning have significantly improved the identification of disease-associated genes from gene expression datasets. However, these processes often require extensive expertise and manual effort, limiting their scalability. Large Language Model (LLM)-based agents have shown promise in automating these tasks due to their increasing problem-solving abilities. To support the evaluation and development of such methods, we introduce GenoTEX, a benchmark dataset for the automatic exploration of gene expression data, involving the tasks of dataset selection, preprocessing, and statistical analysis. GenoTEX provides annotated code and results for solving a wide range of gene identification problems, in a full analysis pipeline that follows the standard of computational genomics. These annotations are curated by human bioinformaticians who carefully analyze the datasets to ensure accuracy and reliability. To provide baselines for these tasks, we present GenoAgents, a team of LLM-based agents designed with context-aware planning, iterative correction, and domain expert consultation to collaboratively explore gene datasets. Our experiments with GenoAgents demonstrate the potential of LLM-based approaches in genomics data analysis, while error analysis highlights the challenges and areas for future improvement. We propose GenoTEX as a promising resource for benchmarking and enhancing AI-driven methods for genomics data analysis. We make our benchmark publicly available at \url{https://github.com/Liu-Hy/GenoTex}.
comment: 25 pages, 3 figures
☆ Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning
The recent success of interleaved Large Multimodal Models (LMMs) in few-shot learning suggests that in-context learning (ICL) with many examples can be promising for learning new tasks. However, this many-shot multimodal ICL setting has one crucial problem: it is fundamentally limited by the model's context length set at pretraining. The problem is especially prominent in the multimodal domain, which processes both text and images, requiring additional tokens. This motivates the need for a multimodal method to compress many shots into fewer tokens without finetuning. In this work, we enable LMMs to perform multimodal, many-shot in-context learning by leveraging Multimodal Task Vectors (MTV)--compact implicit representations of in-context examples compressed in the model's attention heads. Specifically, we first demonstrate the existence of such MTV in LMMs and then leverage these extracted MTV to enable many-shot in-context learning for various vision-and-language tasks. Our experiments suggest that MTV can scale in performance with the number of compressed shots and generalize to similar out-of-domain tasks without additional context length for inference.
☆ Masked Extended Attention for Zero-Shot Virtual Try-On In The Wild
Virtual Try-On (VTON) is a highly active line of research, with increasing demand. It aims to replace a piece of garment in an image with one from another, while preserving person and garment characteristics as well as image fidelity. Current literature takes a supervised approach for the task, impairing generalization and imposing heavy computation. In this paper, we present a novel zero-shot training-free method for inpainting a clothing garment by reference. Our approach employs the prior of a diffusion model with no additional training, fully leveraging its native generalization capabilities. The method employs extended attention to transfer image information from reference to target images, overcoming two significant challenges. We first initially warp the reference garment over the target human using deep features, alleviating "texture sticking". We then leverage the extended attention mechanism with careful masking, eliminating leakage of reference background and unwanted influence. Through a user study, qualitative, and quantitative comparison to state-of-the-art approaches, we demonstrate superior image quality and garment preservation compared unseen clothing pieces or human figures.
comment: Project page available at https://nadavorzech.github.io/max4zero.github.io/
☆ Fine-grained Attention in Hierarchical Transformers for Tabular Time-series
Tabular data is ubiquitous in many real-life systems. In particular, time-dependent tabular data, where rows are chronologically related, is typically used for recording historical events, e.g., financial transactions, healthcare records, or stock history. Recently, hierarchical variants of the attention mechanism of transformer architectures have been used to model tabular time-series data. At first, rows (or columns) are encoded separately by computing attention between their fields. Subsequently, encoded rows (or columns) are attended to one another to model the entire tabular time-series. While efficient, this approach constrains the attention granularity and limits its ability to learn patterns at the field-level across separate rows, or columns. We take a first step to address this gap by proposing Fieldy, a fine-grained hierarchical model that contextualizes fields at both the row and column levels. We compare our proposal against state of the art models on regression and classification tasks using public tabular time-series datasets. Our results show that combining row-wise and column-wise attention improves performance without increasing model size. Code and data are available at https://github.com/raphaaal/fieldy.
comment: 9 pages
☆ Learning Spatio-Temporal Patterns of Polar Ice Layers With Physics-Informed Graph Neural Network
Learning spatio-temporal patterns of polar ice layers is crucial for monitoring the change in ice sheet balance and evaluating ice dynamic processes. While a few researchers focus on learning ice layer patterns from echogram images captured by airborne snow radar sensors via different convolutional neural networks, the noise in the echogram images proves to be a major obstacle. Instead, we focus on geometric deep learning based on graph neural networks to learn the spatio-temporal patterns from thickness information of shallow ice layers and make predictions for deep layers. In this paper, we propose a physics-informed hybrid graph neural network that combines the GraphSAGE framework for graph feature learning with the long short-term memory (LSTM) structure for learning temporal changes, and introduce measurements of physical ice properties from Model Atmospheric Regional (MAR) weather model as physical node features. We found that our proposed network can consistently outperform the current non-inductive or non-physical model in predicting deep ice layer thickness.
☆ Pessimistic asynchronous sampling in high-cost Bayesian optimization
Asynchronous Bayesian optimization is a recently implemented technique that allows for parallel operation of experimental systems and disjointed workflows. Contrasting with serial Bayesian optimization which individually selects experiments one at a time after conducting a measurement for each experiment, asynchronous policies sequentially assign multiple experiments before measurements can be taken and evaluate new measurements continuously as they are made available. This technique allows for faster data generation and therefore faster optimization of an experimental space. This work extends the capabilities of asynchronous optimization methods beyond prior studies by evaluating four additional policies that incorporate pessimistic predictions in the training data set. Combined with a conventional greedy policy, the five total policies were evaluated in a simulated environment and benchmarked with serial sampling. Under some conditions and parameter space dimensionalities, the pessimistic asynchronous policy reached optimum experimental conditions in significantly fewer experiments than equivalent serial policies and proved to be less susceptible to convergence onto local optima at higher dimensions. Without accounting for the faster sampling rate, the pessimistic asynchronous algorithm presented in this work could result in more efficient algorithm driven optimization of high-cost experimental spaces. Accounting for sampling rate, the presented asynchronous algorithm could allow for faster optimization in experimental spaces where multiple experiments can be run before results are collected.
☆ FT-AED: Benchmark Dataset for Early Freeway Traffic Anomalous Event Detection
Early and accurate detection of anomalous events on the freeway, such as accidents, can improve emergency response and clearance. However, existing delays and errors in event identification and reporting make it a difficult problem to solve. Current large-scale freeway traffic datasets are not designed for anomaly detection and ignore these challenges. In this paper, we introduce the first large-scale lane-level freeway traffic dataset for anomaly detection. Our dataset consists of a month of weekday radar detection sensor data collected in 4 lanes along an 18-mile stretch of Interstate 24 heading toward Nashville, TN, comprising over 3.7 million sensor measurements. We also collect official crash reports from the Nashville Traffic Management Center and manually label all other potential anomalies in the dataset. To show the potential for our dataset to be used in future machine learning and traffic research, we benchmark numerous deep learning anomaly detection models on our dataset. We find that unsupervised graph neural network autoencoders are a promising solution for this problem and that ignoring spatial relationships leads to decreased performance. We demonstrate that our methods can reduce reporting delays by over 10 minutes on average while detecting 75% of crashes. Our dataset and all preprocessing code needed to get started are publicly released at https://vu.edu/ft-aed/ to facilitate future research.
☆ Open Problem: Order Optimal Regret Bounds for Kernel-Based Reinforcement Learning COLT 2024
Reinforcement Learning (RL) has shown great empirical success in various application domains. The theoretical aspects of the problem have been extensively studied over past decades, particularly under tabular and linear Markov Decision Process structures. Recently, non-linear function approximation using kernel-based prediction has gained traction. This approach is particularly interesting as it naturally extends the linear structure, and helps explain the behavior of neural-network-based models at their infinite width limit. The analytical results however do not adequately address the performance guarantees for this case. We will highlight this open problem, overview existing partial results, and discuss related challenges.
comment: Open problem track. Conference on Learning Theory (COLT 2024)
☆ Unsupervised Morphological Tree Tokenizer
As a cornerstone in language modeling, tokenization involves segmenting text inputs into pre-defined atomic units. Conventional statistical tokenizers often disrupt constituent boundaries within words, thereby corrupting semantic information. To address this drawback, we introduce morphological structure guidance to tokenization and propose a deep model to induce character-level structures of words. Specifically, the deep model jointly encodes internal structures and representations of words with a mechanism named $\textit{MorphOverriding}$ to ensure the indecomposability of morphemes. By training the model with self-supervised objectives, our method is capable of inducing character-level structures that align with morphological rules without annotated training data. Based on the induced structures, our algorithm tokenizes words through vocabulary matching in a top-down manner. Empirical results indicate that the proposed method effectively retains complete morphemes and outperforms widely adopted methods such as BPE and WordPiece on both morphological segmentation tasks and language modeling tasks. The code will be released later.
☆ Large Batch Analysis for Adagrad Under Anisotropic Smoothness
Adaptive gradient algorithms have been widely adopted in training large-scale deep neural networks, especially large foundation models. Despite their huge success in practice, their theoretical advantages over stochastic gradient descent (SGD) have not been fully understood, especially in the large batch-size setting commonly used in practice. This is because the only theoretical result that can demonstrate the benefit of Adagrad over SGD was obtained in the original paper of Adagrad for nonsmooth objective functions. However, for nonsmooth objective functions, there can be a linear slowdown of convergence when batch size increases, and thus a convergence analysis based on nonsmooth assumption cannot be used for large batch algorithms. In this work, we resolve this gap between theory and practice by providing a new analysis of Adagrad on both convex and nonconvex smooth objectives suitable for the large batch setting. It is shown that under the anisotropic smoothness and noise conditions, increased batch size does not slow down convergence for Adagrad, and thus it can still achieve a faster convergence guarantee over SGD even in the large batch setting. We present detailed comparisons between SGD and Adagrad to provide a better understanding of the benefits of adaptive gradient methods. Experiments in logistic regression and instruction following fine-tuning tasks provide strong evidence to support our theoretical analysis.
☆ Detecting Synthetic Lyrics with Few-Shot Inference
In recent years, generated content in music has gained significant popularity, with large language models being effectively utilized to produce human-like lyrics in various styles, themes, and linguistic structures. This technological advancement supports artists in their creative processes but also raises issues of authorship infringement, consumer satisfaction and content spamming. To address these challenges, methods for detecting generated lyrics are necessary. However, existing works have not yet focused on this specific modality or on creative text in general regarding machine-generated content detection methods and datasets. In response, we have curated the first dataset of high-quality synthetic lyrics and conducted a comprehensive quantitative evaluation of various few-shot content detection approaches, testing their generalization capabilities and complementing this with a human evaluation. Our best few-shot detector, based on LLM2Vec, surpasses stylistic and statistical methods, which are shown competitive in other domains at distinguishing human-written from machine-generated content. It also shows good generalization capabilities to new artists and models, and effectively detects post-generation paraphrasing. This study emphasizes the need for further research on creative content detection, particularly in terms of generalization and scalability with larger song catalogs. All datasets, pre-processing scripts, and code are available publicly on GitHub and Hugging Face under the Apache 2.0 license.
comment: Under review
☆ ExDAG: Exact learning of DAGs
There has been a growing interest in causal learning in recent years. Commonly used representations of causal structures, including Bayesian networks and structural equation models (SEM), take the form of directed acyclic graphs (DAGs). We provide a novel mixed-integer quadratic programming formulation and associated algorithm that identifies DAGs on up to 50 vertices, where these are identifiable. We call this method ExDAG, which stands for Exact learning of DAGs. Although there is a superexponential number of constraints that prevent the formation of cycles, the algorithm adds constraints violated by solutions found, rather than imposing all constraints in each continuous-valued relaxation. Our empirical results show that ExDAG outperforms local state-of-the-art solvers in terms of precision and outperforms state-of-the-art global solvers with respect to scaling, when considering Gaussian noise. We also provide validation with respect to other noise distributions.
comment: 13 pages
☆ Injecting Bias in Text-To-Image Models via Composite-Trigger Backdoors
Recent advances in large text-conditional image generative models such as Stable Diffusion, Midjourney, and DALL-E 3 have revolutionized the field of image generation, allowing users to produce high-quality, realistic images from textual prompts. While these developments have enhanced artistic creation and visual communication, they also present an underexplored attack opportunity: the possibility of inducing biases by an adversary into the generated images for malicious intentions, e.g., to influence society and spread propaganda. In this paper, we demonstrate the possibility of such a bias injection threat by an adversary who backdoors such models with a small number of malicious data samples; the implemented backdoor is activated when special triggers exist in the input prompt of the backdoored models. On the other hand, the model's utility is preserved in the absence of the triggers, making the attack highly undetectable. We present a novel framework that enables efficient generation of poisoning samples with composite (multi-word) triggers for such an attack. Our extensive experiments using over 1 million generated images and against hundreds of fine-tuned models demonstrate the feasibility of the presented backdoor attack. We illustrate how these biases can bypass conventional detection mechanisms, highlighting the challenges in proving the existence of biases within operational constraints. Our cost analysis confirms the low financial barrier to executing such attacks, underscoring the need for robust defensive strategies against such vulnerabilities in text-to-image generation models.
☆ Causal Learning in Biomedical Applications
We present a benchmark for methods in causal learning. Specifically, we consider training a rich class of causal models from time-series data, and we suggest the use of the Krebs cycle and models of metabolism more broadly.
☆ Perks and Pitfalls of Faithfulness in Regular, Self-Explainable and Domain Invariant GNNs
As Graph Neural Networks (GNNs) become more pervasive, it becomes paramount to build robust tools for computing explanations of their predictions. A key desideratum is that these explanations are faithful, i.e., that they portray an accurate picture of the GNN's reasoning process. A number of different faithfulness metrics exist, begging the question of what faithfulness is exactly, and what its properties are. We begin by showing that existing metrics are not interchangeable -- i.e., explanations attaining high faithfulness according to one metric may be unfaithful according to others -- and can be systematically insensitive to important properties of the explanation, and suggest how to address these issues. We proceed to show that, surprisingly, optimizing for faithfulness is not always a sensible design goal. Specifically, we show that for injective regular GNN architectures, perfectly faithful explanations are completely uninformative. The situation is different for modular GNNs, such as self-explainable and domain-invariant architectures, where optimizing faithfulness does not compromise informativeness, and is also unexpectedly tied to out-of-distribution generalization.
☆ Generative Topological Networks
Generative models have seen significant advancements in recent years, yet often remain challenging and costly to train and use. We introduce Generative Topological Networks (GTNs) -- a new class of generative models that addresses these shortcomings. GTNs are trained deterministically using a simple supervised learning approach grounded in topology theory. GTNs are fast to train, and require only a single forward pass in a standard feedforward neural network to generate samples. We demonstrate the strengths of GTNs in several datasets, including MNIST, celebA and the Hands and Palm Images dataset. Finally, the theory behind GTNs offers insights into how to train generative models for improved performance.
☆ KalMamba: Towards Efficient Probabilistic State Space Models for RL under Uncertainty
Probabilistic State Space Models (SSMs) are essential for Reinforcement Learning (RL) from high-dimensional, partial information as they provide concise representations for control. Yet, they lack the computational efficiency of their recent deterministic counterparts such as S4 or Mamba. We propose KalMamba, an efficient architecture to learn representations for RL that combines the strengths of probabilistic SSMs with the scalability of deterministic SSMs. KalMamba leverages Mamba to learn the dynamics parameters of a linear Gaussian SSM in a latent space. Inference in this latent space amounts to standard Kalman filtering and smoothing. We realize these operations using parallel associative scanning, similar to Mamba, to obtain a principled, highly efficient, and scalable probabilistic SSM. Our experiments show that KalMamba competes with state-of-the-art SSM approaches in RL while significantly improving computational efficiency, especially on longer interaction sequences.
☆ Embracing Federated Learning: Enabling Weak Client Participation via Partial Model Training
In Federated Learning (FL), clients may have weak devices that cannot train the full model or even hold it in their memory space. To implement large-scale FL applications, thus, it is crucial to develop a distributed learning method that enables the participation of such weak clients. We propose EmbracingFL, a general FL framework that allows all available clients to join the distributed training regardless of their system resource capacity. The framework is built upon a novel form of partial model training method in which each client trains as many consecutive output-side layers as its system resources allow. Our study demonstrates that EmbracingFL encourages each layer to have similar data representations across clients, improving FL efficiency. The proposed partial model training method guarantees convergence to a neighbor of stationary points for non-convex and smooth problems. We evaluate the efficacy of EmbracingFL under a variety of settings with a mixed number of strong, moderate (~40% memory), and weak (~15% memory) clients, datasets (CIFAR-10, FEMNIST, and IMDB), and models (ResNet20, CNN, and LSTM). Our empirical study shows that EmbracingFL consistently achieves high accuracy as like all clients are strong, outperforming the state-of-the-art width reduction methods (i.e. HeteroFL and FjORD).
☆ A Provably Efficient Option-Based Algorithm for both High-Level and Low-Level Learning
Hierarchical Reinforcement Learning (HRL) approaches have shown successful results in solving a large variety of complex, structured, long-horizon problems. Nevertheless, a full theoretical understanding of this empirical evidence is currently missing. In the context of the \emph{option} framework, prior research has devised efficient algorithms for scenarios where options are fixed, and the high-level policy selecting among options only has to be learned. However, the fully realistic scenario in which both the high-level and the low-level policies are learned is surprisingly disregarded from a theoretical perspective. This work makes a step towards the understanding of this latter scenario. Focusing on the finite-horizon problem, we present a meta-algorithm alternating between regret minimization algorithms instanced at different (high and low) temporal abstractions. At the higher level, we treat the problem as a Semi-Markov Decision Process (SMDP), with fixed low-level policies, while at a lower level, inner option policies are learned with a fixed high-level policy. The bounds derived are compared with the lower bound for non-hierarchical finite-horizon problems, allowing to characterize when a hierarchical approach is provably preferable, even without pre-trained options.
☆ Brain-Like Language Processing via a Shallow Untrained Multihead Attention Network
Large Language Models (LLMs) have been shown to be effective models of the human language system, with some models predicting most explainable variance of brain activity in current datasets. Even in untrained models, the representations induced by architectural priors can exhibit reasonable alignment to brain data. In this work, we investigate the key architectural components driving the surprising alignment of untrained models. To estimate LLM-to-brain similarity, we first select language-selective units within an LLM, similar to how neuroscientists identify the language network in the human brain. We then benchmark the brain alignment of these LLM units across five different brain recording datasets. By isolating critical components of the Transformer architecture, we identify tokenization strategy and multihead attention as the two major components driving brain alignment. A simple form of recurrence further improves alignment. We further demonstrate this quantitative brain alignment of our model by reproducing landmark studies in the language neuroscience field, showing that localized model units -- just like language voxels measured empirically in the human brain -- discriminate more reliably between lexical than syntactic differences, and exhibit similar response profiles under the same experimental conditions. Finally, we demonstrate the utility of our model's representations for language modeling, achieving improved sample and parameter efficiency over comparable architectures. Our model's estimates of surprisal sets a new state-of-the-art in the behavioral alignment to human reading times. Taken together, we propose a highly brain- and behaviorally-aligned model that conceptualizes the human language system as an untrained shallow feature encoder, with structural priors, combined with a trained decoder to achieve efficient and performant language processing.
comment: Preprint
☆ HLQ: Fast and Efficient Backpropagation via Hadamard Low-rank Quantization
With the rapid increase in model size and the growing importance of various fine-tuning applications, lightweight training has become crucial. Since the backward pass is twice as expensive as the forward pass, optimizing backpropagation is particularly important. However, modifications to this process can lead to suboptimal convergence, so training optimization should minimize perturbations, which is a highly challenging task. In this study, we introduce a novel optimization strategy called Hadamard Low-rank Quantization (HLQ), focusing on reducing the cost of backpropagation in convolutional and linear layers. We first analyze the sensitivity of gradient computation with respect to activation and weight, and judiciously design the HLQ pipeline to apply 4-bit Hadamard quantization to the activation gradient and Hadamard low-rank approximation to the weight gradient. This combination was found to be the best for maximizing benefits, and our extensive experiments demonstrate the outstanding performance of HLQ in both training from scratch and fine-tuning, achieving significant memory savings and acceleration on real GPUs with negligible quality degradation.
☆ How Intermodal Interaction Affects the Performance of Deep Multimodal Fusion for Mixed-Type Time Series
Mixed-type time series (MTTS) is a bimodal data type that is common in many domains, such as healthcare, finance, environmental monitoring, and social media. It consists of regularly sampled continuous time series and irregularly sampled categorical event sequences. The integration of both modalities through multimodal fusion is a promising approach for processing MTTS. However, the question of how to effectively fuse both modalities remains open. In this paper, we present a comprehensive evaluation of several deep multimodal fusion approaches for MTTS forecasting. Our comparison includes three fusion types (early, intermediate, and late) and five fusion methods (concatenation, weighted mean, weighted mean with correlation, gating, and feature sharing). We evaluate these fusion approaches on three distinct datasets, one of which was generated using a novel framework. This framework allows for the control of key data properties, such as the strength and direction of intermodal interactions, modality imbalance, and the degree of randomness in each modality, providing a more controlled environment for testing fusion approaches. Our findings show that the performance of different fusion approaches can be substantially influenced by the direction and strength of intermodal interactions. The study reveals that early and intermediate fusion approaches excel at capturing fine-grained and coarse-grained cross-modal features, respectively. These findings underscore the crucial role of intermodal interactions in determining the most effective fusion strategy for MTTS forecasting.
☆ Towards General Negotiation Strategies with End-to-End Reinforcement Learning
The research field of automated negotiation has a long history of designing agents that can negotiate with other agents. Such negotiation strategies are traditionally based on manual design and heuristics. More recently, reinforcement learning approaches have also been used to train agents to negotiate. However, negotiation problems are diverse, causing observation and action dimensions to change, which cannot be handled by default linear policy networks. Previous work on this topic has circumvented this issue either by fixing the negotiation problem, causing policies to be non-transferable between negotiation problems or by abstracting the observations and actions into fixed-size representations, causing loss of information and expressiveness due to feature design. We developed an end-to-end reinforcement learning method for diverse negotiation problems by representing observations and actions as a graph and applying graph neural networks in the policy. With empirical evaluations, we show that our method is effective and that we can learn to negotiate with other agents on never-before-seen negotiation problems. Our result opens up new opportunities for reinforcement learning in negotiation agents.
comment: Accepted at the Reinforcement Learning Conference (RLC) 2024
☆ GOAL: A Generalist Combinatorial Optimization Agent Learner
Machine Learning-based heuristics have recently shown impressive performance in solving a variety of hard combinatorial optimization problems (COPs). However they generally rely on a separate neural model, specialized and trained for each single problem. Any variation of a problem requires adjustment of its model and re-training from scratch. In this paper, we propose GOAL (for Generalist combinatorial Optimization Agent Learning), a generalist model capable of efficiently solving multiple COPs and which can be fine-tuned to solve new COPs. GOAL consists of a single backbone plus light-weight problem-specific adapters, mostly for input and output processing. The backbone is based on a new form of mixed-attention blocks which allows to handle problems defined on graphs with arbitrary combinations of node, edge and instance-level features. Additionally, problems which involve heterogeneous nodes or edges, such as in multi-partite graphs, are handled through a novel multi-type transformer architecture, where the attention blocks are duplicated to attend only the relevant combination of types while relying on the same shared parameters. We train GOAL on a set of routing, scheduling and classic graph problems and show that it is only slightly inferior to the specialized baselines while being the first multi-task model that solves a variety of COPs. Finally, we showcase the strong transfer learning capacity of GOAL by fine-tuning or learning the adapters for new problems, with only few shots and little data.
☆ Neural Incremental Data Assimilation
Data assimilation is a central problem in many geophysical applications, such as weather forecasting. It aims to estimate the state of a potentially large system, such as the atmosphere, from sparse observations, supplemented by prior physical knowledge. The size of the systems involved and the complexity of the underlying physical equations make it a challenging task from a computational point of view. Neural networks represent a promising method of emulating the physics at low cost, and therefore have the potential to considerably improve and accelerate data assimilation. In this work, we introduce a deep learning approach where the physical system is modeled as a sequence of coarse-to-fine Gaussian prior distributions parametrized by a neural network. This allows us to define an assimilation operator, which is trained in an end-to-end fashion to minimize the reconstruction error on a dataset with different observation processes. We illustrate our approach on chaotic dynamical physical systems with sparse observations, and compare it to traditional variational data assimilation methods.
☆ Tempora-Fusion: Time-Lock Puzzle with Efficient Verifiable Homomorphic Linear Combination
To securely transmit sensitive information into the future, Time-Lock Puzzles (TLPs) have been developed. Their applications include scheduled payments, timed commitments, e-voting, and sealed-bid auctions. Homomorphic TLP is a key variant of TLP that enables computation on puzzles from different clients. This allows a solver/server to tackle only a single puzzle encoding the computation's result. However, existing homomorphic TLPs lack support for verifying the correctness of the computation results. We address this limitation by introducing Tempora-Fusion, a TLP that allows a server to perform homomorphic linear combinations of puzzles from different clients while ensuring verification of computation correctness. This scheme avoids asymmetric-key cryptography for verification, thus paving the way for efficient implementations. We discuss our scheme's application in various domains, such as federated learning, scheduled payments in online banking, and e-voting.
☆ Latent Space Translation via Inverse Relative Projection
The emergence of similar representations between independently trained neural models has sparked significant interest in the representation learning community, leading to the development of various methods to obtain communication between latent spaces. "Latent space communication" can be achieved in two ways: i) by independently mapping the original spaces to a shared or relative one; ii) by directly estimating a transformation from a source latent space to a target one. In this work, we combine the two into a novel method to obtain latent space translation through the relative space. By formalizing the invertibility of angle-preserving relative representations and assuming the scale invariance of decoder modules in neural models, we can effectively use the relative space as an intermediary, independently projecting onto and from other semantically similar spaces. Extensive experiments over various architectures and datasets validate our scale invariance assumption and demonstrate the high accuracy of our method in latent space translation. We also apply our method to zero-shot stitching between arbitrary pre-trained text and image encoders and their classifiers, even across modalities. Our method has significant potential for facilitating the reuse of models in a practical manner via compositionality.
comment: arXiv admin note: text overlap with arXiv:2311.00664, arXiv:2406.11014
☆ Tri-VQA: Triangular Reasoning Medical Visual Question Answering for Multi-Attribute Analysis
The intersection of medical Visual Question Answering (Med-VQA) is a challenging research topic with advantages including patient engagement and clinical expert involvement for second opinions. However, existing Med-VQA methods based on joint embedding fail to explain whether their provided results are based on correct reasoning or coincidental answers, which undermines the credibility of VQA answers. In this paper, we investigate the construction of a more cohesive and stable Med-VQA structure. Motivated by causal effect, we propose a novel Triangular Reasoning VQA (Tri-VQA) framework, which constructs reverse causal questions from the perspective of "Why this answer?" to elucidate the source of the answer and stimulate more reasonable forward reasoning processes. We evaluate our method on the Endoscopic Ultrasound (EUS) multi-attribute annotated dataset from five centers, and test it on medical VQA datasets. Experimental results demonstrate the superiority of our approach over existing methods. Our codes and pre-trained models are available at https://anonymous.4open.science/r/Tri_VQA.
☆ From Overfitting to Robustness: Quantity, Quality, and Variety Oriented Negative Sample Selection in Graph Contrastive Learning
Graph contrastive learning (GCL) aims to contrast positive-negative counterparts to learn the node embeddings, whereas graph data augmentation methods are employed to generate these positive-negative samples. The variation, quantity, and quality of negative samples compared to positive samples play crucial roles in learning meaningful embeddings for node classification downstream tasks. Less variation, excessive quantity, and low-quality negative samples cause the model to be overfitted for particular nodes, resulting in less robust models. To solve the overfitting problem in the GCL paradigm, this study proposes a novel Cumulative Sample Selection (CSS) algorithm by comprehensively considering negative samples' quality, variations, and quantity. Initially, three negative sample pools are constructed: easy, medium, and hard negative samples, which contain 25%, 50%, and 25% of the total available negative samples, respectively. Then, 10% negative samples are selected from each of these three negative sample pools for training the model. After that, a decision agent module evaluates model training results and decides whether to explore more negative samples from three negative sample pools by increasing the ratio or keep exploiting the current sampling ratio. The proposed algorithm is integrated into a proposed graph contrastive learning framework named NegAmplify. NegAmplify is compared with the SOTA methods on nine graph node classification datasets, with seven achieving better node classification accuracy with up to 2.86% improvement.
☆ Discovering Common Information in Multi-view Data
We introduce an innovative and mathematically rigorous definition for computing common information from multi-view data, drawing inspiration from G\'acs-K\"orner common information in information theory. Leveraging this definition, we develop a novel supervised multi-view learning framework to capture both common and unique information. By explicitly minimizing a total correlation term, the extracted common information and the unique information from each view are forced to be independent of each other, which, in turn, theoretically guarantees the effectiveness of our framework. To estimate information-theoretic quantities, our framework employs matrix-based R{\'e}nyi's $\alpha$-order entropy functional, which forgoes the need for variational approximation and distributional estimation in high-dimensional space. Theoretical proof is provided that our framework can faithfully discover both common and unique information from multi-view data. Experiments on synthetic and seven benchmark real-world datasets demonstrate the superior performance of our proposed framework over state-of-the-art approaches.
comment: Manuscript accepted by Information Fusion (\url{https://www.sciencedirect.com/science/article/pii/S1566253524001787}). We have updated a few descriptions for clarity. Code is available at \url{https://github.com/archy666/CUMI}
☆ Behaviour Distillation ICLR 2024
Dataset distillation aims to condense large datasets into a small number of synthetic examples that can be used as drop-in replacements when training new models. It has applications to interpretability, neural architecture search, privacy, and continual learning. Despite strong successes in supervised domains, such methods have not yet been extended to reinforcement learning, where the lack of a fixed dataset renders most distillation methods unusable. Filling the gap, we formalize behaviour distillation, a setting that aims to discover and then condense the information required for training an expert policy into a synthetic dataset of state-action pairs, without access to expert data. We then introduce Hallucinating Datasets with Evolution Strategies (HaDES), a method for behaviour distillation that can discover datasets of just four state-action pairs which, under supervised learning, train agents to competitive performance levels in continuous control tasks. We show that these datasets generalize out of distribution to training policies with a wide range of architectures and hyperparameters. We also demonstrate application to a downstream task, namely training multi-task agents in a zero-shot fashion. Beyond behaviour distillation, HaDES provides significant improvements in neuroevolution for RL over previous approaches and achieves SoTA results on one standard supervised dataset distillation task. Finally, we show that visualizing the synthetic datasets can provide human-interpretable task insights.
comment: Published as a conference paper at ICLR 2024
☆ Online detection and infographic explanation of spam reviews with data drift adaptation
Spam reviews are a pervasive problem on online platforms due to its significant impact on reputation. However, research into spam detection in data streams is scarce. Another concern lies in their need for transparency. Consequently, this paper addresses those problems by proposing an online solution for identifying and explaining spam reviews, incorporating data drift adaptation. It integrates (i) incremental profiling, (ii) data drift detection & adaptation, and (iii) identification of spam reviews employing Machine Learning. The explainable mechanism displays a visual and textual prediction explanation in a dashboard. The best results obtained reached up to 87 % spam F-measure.
☆ Using Neural Networks for Data Cleaning in Weather Datasets ICML 2024
In climate science, we often want to compare across different datasets. Difficulties can arise in doing this due to inevitable mismatches that arise between observational and reanalysis data, or even between different reanalyses. This misalignment can raise problems for any work that seeks to make inferences about one dataset from another. We considered tropical cyclone location as an example task with one dataset providing atmospheric conditions (ERA5) and another providing storm tracks (IBTrACS). We found that while the examples often aligned well, there were a considerable proportion (around 25%) which were not well aligned. We trained a neural network to map from the wind field to the storm location; in this setting misalignment in the datasets appears as "label noise" (i.e. the labelled storm location does not correspond to the underlying wind field). We found that this neural network trained only on the often noisy labels from IBTrACS had a denoising effect, and performed better than the IBTrACS labels themselves, as measured by human preferences. Remarkably, this even held true for training points, on which we might have expected the network to overfit to the IBTrACS predictions.
comment: 6 pages, 2 figures, ICML 2024 Workshop on Machine Learning for Earth System Modeling
☆ SiT: Symmetry-Invariant Transformers for Generalisation in Reinforcement Learning ICML2024
An open challenge in reinforcement learning (RL) is the effective deployment of a trained policy to new or slightly different situations as well as semantically-similar environments. We introduce Symmetry-Invariant Transformer (SiT), a scalable vision transformer (ViT) that leverages both local and global data patterns in a self-supervised manner to improve generalisation. Central to our approach is Graph Symmetric Attention, which refines the traditional self-attention mechanism to preserve graph symmetries, resulting in invariant and equivariant latent representations. We showcase SiT's superior generalization over ViTs on MiniGrid and Procgen RL benchmarks, and its sample efficiency on Atari 100k and CIFAR10.
comment: 9 main pages, accepted to ICML2024
☆ Dislocation cartography: Representations and unsupervised classification of dislocation networks with unique fingerprints
Detecting structure in data is the first step to arrive at meaningful representations for systems. This is particularly challenging for dislocation networks evolving as a consequence of plastic deformation of crystalline systems. Our study employs Isomap, a manifold learning technique, to unveil the intrinsic structure of high-dimensional density field data of dislocation structures from different compression axis. The resulting maps provide a systematic framework for quantitatively comparing dislocation structures, offering unique fingerprints based on density fields. Our novel, unbiased approach contributes to the quantitative classification of dislocation structures which can be systematically extended.
comment: 26 pages, 7 figures
☆ Probabilistic and Differentiable Wireless Simulation with Geometric Transformers
Modelling the propagation of electromagnetic signals is critical for designing modern communication systems. While there are precise simulators based on ray tracing, they do not lend themselves to solving inverse problems or the integration in an automated design loop. We propose to address these challenges through differentiable neural surrogates that exploit the geometric aspects of the problem. We first introduce the Wireless Geometric Algebra Transformer (Wi-GATr), a generic backbone architecture for simulating wireless propagation in a 3D environment. It uses versatile representations based on geometric algebra and is equivariant with respect to E(3), the symmetry group of the underlying physics. Second, we study two algorithmic approaches to signal prediction and inverse problems based on differentiable predictive modelling and diffusion models. We show how these let us predict received power, localize receivers, and reconstruct the 3D environment from the received signal. Finally, we introduce two large, geometry-focused datasets of wireless signal propagation in indoor scenes. In experiments, we show that our geometry-forward approach achieves higher-fidelity predictions with less data than various baselines.
☆ Learning Variable Compliance Control From a Few Demonstrations for Bimanual Robot with Haptic Feedback Teleoperation System
Automating dexterous, contact-rich manipulation tasks using rigid robots is a significant challenge in robotics. Rigid robots, defined by their actuation through position commands, face issues of excessive contact forces due to their inability to adapt to contact with the environment, potentially causing damage. While compliance control schemes have been introduced to mitigate these issues by controlling forces via external sensors, they are hampered by the need for fine-tuning task-specific controller parameters. Learning from Demonstrations (LfD) offers an intuitive alternative, allowing robots to learn manipulations through observed actions. In this work, we introduce a novel system to enhance the teaching of dexterous, contact-rich manipulations to rigid robots. Our system is twofold: firstly, it incorporates a teleoperation interface utilizing Virtual Reality (VR) controllers, designed to provide an intuitive and cost-effective method for task demonstration with haptic feedback. Secondly, we present Comp-ACT (Compliance Control via Action Chunking with Transformers), a method that leverages the demonstrations to learn variable compliance control from a few demonstrations. Our methods have been validated across various complex contact-rich manipulation tasks using single-arm and bimanual robot setups in simulated and real-world environments, demonstrating the effectiveness of our system in teaching robots dexterous manipulations with enhanced adaptability and safety.
☆ Hierarchical thematic classification of major conference proceedings
In this paper, we develop a decision support system for the hierarchical text classification. We consider text collections with a fixed hierarchical structure of topics given by experts in the form of a tree. The system sorts the topics by relevance to a given document. The experts choose one of the most relevant topics to finish the classification. We propose a weighted hierarchical similarity function to calculate topic relevance. The function calculates the similarity of a document and a tree branch. The weights in this function determine word importance. We use the entropy of words to estimate the weights. The proposed hierarchical similarity function formulates a joint hierarchical thematic classification probability model of the document topics, parameters, and hyperparameters. The variational Bayesian inference gives a closed-form EM algorithm. The EM algorithm estimates the parameters and calculates the probability of a topic for a given document. Compared to hierarchical multiclass SVM, hierarchical PLSA with adaptive regularization, and hierarchical naive Bayes, the weighted hierarchical similarity function has better improvement in ranking accuracy in an abstract collection of a major conference EURO and a website collection of industrial companies.
☆ Domain Adaptation of Llama3-70B-Instruct through Continual Pre-Training and Model Merging: A Comprehensive Evaluation
We conducted extensive experiments on domain adaptation of the Meta-Llama-3-70B-Instruct model on SEC data, exploring its performance on both general and domain-specific benchmarks. Our focus included continual pre-training (CPT) and model merging, aiming to enhance the model's domain-specific capabilities while mitigating catastrophic forgetting. Through this study, we evaluated the impact of integrating financial regulatory data into a robust language model and examined the effectiveness of our model merging techniques in preserving and improving the model's instructive abilities. The model is accessible at hugging face: https://huggingface.co/arcee-ai/Llama-3-SEC-Base, arcee-ai/Llama-3-SEC-Base. This is an intermediate checkpoint of our final model, which has seen 20B tokens so far. The full model is still in the process of training. This is a preprint technical report with thorough evaluations to understand the entire process.
comment: 8 pages, 6 figures
☆ Uni-Mol2: Exploring Molecular Pretraining Model at Scale
In recent years, pretraining models have made significant advancements in the fields of natural language processing (NLP), computer vision (CV), and life sciences. The significant advancements in NLP and CV are predominantly driven by the expansion of model parameters and data size, a phenomenon now recognized as the scaling laws. However, research exploring scaling law in molecular pretraining models remains unexplored. In this work, we present Uni-Mol2 , an innovative molecular pretraining model that leverages a two-track transformer to effectively integrate features at the atomic level, graph level, and geometry structure level. Along with this, we systematically investigate the scaling law within molecular pretraining models, characterizing the power-law correlations between validation loss and model size, dataset size, and computational resources. Consequently, we successfully scale Uni-Mol2 to 1.1 billion parameters through pretraining on 800 million conformations, making it the largest molecular pretraining model to date. Extensive experiments show consistent improvement in the downstream tasks as the model size grows. The Uni-Mol2 with 1.1B parameters also outperforms existing methods, achieving an average 27% improvement on the QM9 and 14% on COMPAS-1D dataset.
☆ Optimised Grouped-Query Attention Mechanism for Transformers ICML2024
Grouped-query attention (GQA) has been widely adopted in LLMs to mitigate the complexity of multi-head attention (MHA). To transform an MHA to a GQA, neighbour queries in MHA are evenly split into groups where each group shares the value and key layers. In this work, we propose AsymGQA, an activation-informed approach to asymmetrically grouping an MHA to a GQA for better model performance. Our AsymGQA outperforms the GQA within the same model size budget. For example, AsymGQA LLaMA-2-7B has an accuracy increase of 7.5% on MMLU compared to neighbour grouping. Our approach addresses the GQA's trade-off problem between model performance and hardware efficiency.
comment: Accepted at ICML2024 ES-FoMo-II Workshop
☆ Unlocking the Global Synergies in Low-Rank Adapters ICML2024
Low-rank Adaption (LoRA) has been the de-facto parameter-efficient fine-tuning technique for large language models. We present HeteroLoRA, a light-weight search algorithm that leverages zero-cost proxies to allocate the limited LoRA trainable parameters across the model for better fine-tuned performance. In addition to the allocation for the standard LoRA-adapted models, we also demonstrate the efficacy of HeteroLoRA by performing the allocation in a more challenging search space that includes LoRA modules and LoRA-adapted shortcut connections. Experiments show that HeteroLoRA enables improvements in model performance given the same parameter budge. For example, on MRPC, we see an improvement of 1.6% in accuracy with similar training parameter budget. We will open-source our algorithm once the paper is accepted.
comment: Accepted at ICML2024 ES-FoMo-II Workshop
☆ Deep Imbalanced Regression to Estimate Vascular Age from PPG Data: a Novel Digital Biomarker for Cardiovascular Health
Photoplethysmography (PPG) is emerging as a crucial tool for monitoring human hemodynamics, with recent studies highlighting its potential in assessing vascular aging through deep learning. However, real-world age distributions are often imbalanced, posing significant challenges for deep learning models. In this paper, we introduce a novel, simple, and effective loss function named the Dist Loss to address deep imbalanced regression tasks. We trained a one-dimensional convolutional neural network (Net1D) incorporating the Dist Loss on the extensive UK Biobank dataset (n=502,389) to estimate vascular age from PPG signals and validate its efficacy in characterizing cardiovascular health. The model's performance was validated on a 40% held-out test set, achieving state-of-the-art results, especially in regions with small sample sizes. Furthermore, we divided the population into three subgroups based on the difference between predicted vascular age and chronological age: less than -10 years, between -10 and 10 years, and greater than 10 years. We analyzed the relationship between predicted vascular age and several cardiovascular events over a follow-up period of up to 10 years, including death, coronary heart disease, and heart failure. Our results indicate that the predicted vascular age has significant potential to reflect an individual's cardiovascular health status. Our code will be available at https://github.com/Ngk03/AI-vascular-age.
☆ An Idiosyncrasy of Time-discretization in Reinforcement Learning
Many reinforcement learning algorithms are built on an assumption that an agent interacts with an environment over fixed-duration, discrete time steps. However, physical systems are continuous in time, requiring a choice of time-discretization granularity when digitally controlling them. Furthermore, such systems do not wait for decisions to be made before advancing the environment state, necessitating the study of how the choice of discretization may affect a reinforcement learning algorithm. In this work, we consider the relationship between the definitions of the continuous-time and discrete-time returns. Specifically, we acknowledge an idiosyncrasy with naively applying a discrete-time algorithm to a discretized continuous-time environment, and note how a simple modification can better align the return definitions. This observation is of practical consideration when dealing with environments where time-discretization granularity is a choice, or situations where such granularity is inherently stochastic.
comment: RLC 2024
☆ On the growth of the parameters of approximating ReLU neural networks
This work focuses on the analysis of fully connected feed forward ReLU neural networks as they approximate a given, smooth function. In contrast to conventionally studied universal approximation properties under increasing architectures, e.g., in terms of width or depth of the networks, we are concerned with the asymptotic growth of the parameters of approximating networks. Such results are of interest, e.g., for error analysis or consistency results for neural network training. The main result of our work is that, for a ReLU architecture with state of the art approximation error, the realizing parameters grow at most polynomially. The obtained rate with respect to a normalized network size is compared to existing results and is shown to be superior in most cases, in particular for high dimensional input.
☆ Efficient Graph Similarity Computation with Alignment Regularization NeurIPS 2022
We consider the graph similarity computation (GSC) task based on graph edit distance (GED) estimation. State-of-the-art methods treat GSC as a learning-based prediction task using Graph Neural Networks (GNNs). To capture fine-grained interactions between pair-wise graphs, these methods mostly contain a node-level matching module in the end-to-end learning pipeline, which causes high computational costs in both the training and inference stages. We show that the expensive node-to-node matching module is not necessary for GSC, and high-quality learning can be attained with a simple yet powerful regularization technique, which we call the Alignment Regularization (AReg). In the training stage, the AReg term imposes a node-graph correspondence constraint on the GNN encoder. In the inference stage, the graph-level representations learned by the GNN encoder are directly used to compute the similarity score without using AReg again to speed up inference. We further propose a multi-scale GED discriminator to enhance the expressive ability of the learned representations. Extensive experiments on real-world datasets demonstrate the effectiveness, efficiency and transferability of our approach.
comment: NeurIPS 2022
☆ LLM2FEA: Discover Novel Designs with Generative Evolutionary Multitasking
The rapid research and development of generative artificial intelligence has enabled the generation of high-quality images, text, and 3D models from text prompts. This advancement impels an inquiry into whether these models can be leveraged to create digital artifacts for both creative and engineering applications. Drawing on innovative designs from other domains may be one answer to this question, much like the historical practice of ``bionics", where humans have sought inspiration from nature's exemplary designs. This raises the intriguing possibility of using generative models to simultaneously tackle design tasks across multiple domains, facilitating cross-domain learning and resulting in a series of innovative design solutions. In this paper, we propose LLM2FEA as the first attempt to discover novel designs in generative models by transferring knowledge across multiple domains. By utilizing a multi-factorial evolutionary algorithm (MFEA) to drive a large language model, LLM2FEA integrates knowledge from various fields to generate prompts that guide the generative model in discovering novel and practical objects. Experimental results in the context of 3D aerodynamic design verify the discovery capabilities of the proposed LLM2FEA. The designs generated by LLM2FEA not only satisfy practicality requirements to a certain degree but also feature novel and aesthetically pleasing shapes, demonstrating the potential applications of LLM2FEA in discovery tasks.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
☆ Demonstrating the Efficacy of Kolmogorov-Arnold Networks in Vision Tasks
In the realm of deep learning, the Kolmogorov-Arnold Network (KAN) has emerged as a potential alternative to multilayer projections (MLPs). However, its applicability to vision tasks has not been extensively validated. In our study, we demonstrated the effectiveness of KAN for vision tasks through multiple trials on the MNIST, CIFAR10, and CIFAR100 datasets, using a training batch size of 32. Our results showed that while KAN outperformed the original MLP-Mixer on CIFAR10 and CIFAR100, it performed slightly worse than the state-of-the-art ResNet-18. These findings suggest that KAN holds significant promise for vision tasks, and further modifications could enhance its performance in future evaluations.Our contributions are threefold: first, we showcase the efficiency of KAN-based algorithms for visual tasks; second, we provide extensive empirical assessments across various vision benchmarks, comparing KAN's performance with MLP-Mixer, CNNs, and Vision Transformers (ViT); and third, we pioneer the use of natural KAN layers in visual tasks, addressing a gap in previous research. This paper lays the foundation for future studies on KANs, highlighting their potential as a reliable alternative for image classification tasks.
☆ Towards Dynamic Resource Allocation and Client Scheduling in Hierarchical Federated Learning: A Two-Phase Deep Reinforcement Learning Approach
Federated learning (FL) is a viable technique to train a shared machine learning model without sharing data. Hierarchical FL (HFL) system has yet to be studied regrading its multiple levels of energy, computation, communication, and client scheduling, especially when it comes to clients relying on energy harvesting to power their operations. This paper presents a new two-phase deep deterministic policy gradient (DDPG) framework, referred to as ``TP-DDPG'', to balance online the learning delay and model accuracy of an FL process in an energy harvesting-powered HFL system. The key idea is that we divide optimization decisions into two groups, and employ DDPG to learn one group in the first phase, while interpreting the other group as part of the environment to provide rewards for training the DDPG in the second phase. Specifically, the DDPG learns the selection of participating clients, and their CPU configurations and the transmission powers. A new straggler-aware client association and bandwidth allocation (SCABA) algorithm efficiently optimizes the other decisions and evaluates the reward for the DDPG. Experiments demonstrate that with substantially reduced number of learnable parameters, the TP-DDPG can quickly converge to effective polices that can shorten the training time of HFL by 39.4% compared to its benchmarks, when the required test accuracy of HFL is 0.9.
☆ MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression
Sparse attention can effectively mitigate the significant memory and throughput demands of Large Language Models (LLMs) in long contexts. Existing methods typically employ a uniform sparse attention mask, applying the same sparse pattern across different attention heads and input lengths. However, this uniform approach fails to capture the diverse attention patterns inherent in LLMs, ignoring their distinct accuracy-latency trade-offs. To address this challenge, we propose the Mixture of Attention (MoA), which automatically tailors distinct sparse attention configurations to different heads and layers. MoA constructs and navigates a search space of various attention patterns and their scaling rules relative to input sequence lengths. It profiles the model, evaluates potential configurations, and pinpoints the optimal sparse attention compression plan. MoA adapts to varying input sizes, revealing that some attention heads expand their focus to accommodate longer sequences, while other heads consistently concentrate on fixed-length local contexts. Experiments show that MoA increases the effective context length by $3.9\times$ with the same average attention span, boosting retrieval accuracy by $1.5-7.1\times$ over the uniform-attention baseline across Vicuna-7B, Vicuna-13B, and Llama3-8B models. Moreover, MoA narrows the capability gaps between sparse and dense models, reducing the maximum relative performance drop from $9\%-36\%$ to within $5\%$ across two long-context understanding benchmarks. MoA achieves a $1.2-1.4\times$ GPU memory reduction and boosts decode throughput by $5.5-6.7 \times$ for 7B and 13B dense models on a single GPU, with minimal impact on performance.
comment: 10 pages
☆ Enhancing reliability in prediction intervals using point forecasters: Heteroscedastic Quantile Regression and Width-Adaptive Conformal Inference
Building prediction intervals for time series forecasting problems presents a complex challenge, particularly when relying solely on point predictors, a common scenario for practitioners in the industry. While research has primarily focused on achieving increasingly efficient valid intervals, we argue that, when evaluating a set of intervals, traditional measures alone are insufficient. There are additional crucial characteristics: the intervals must vary in length, with this variation directly linked to the difficulty of the prediction, and the coverage of the interval must remain independent of the difficulty of the prediction for practical utility. We propose the Heteroscedastic Quantile Regression (HQR) model and the Width-Adaptive Conformal Inference (WACI) method, providing theoretical coverage guarantees, to overcome those issues, respectively. The methodologies are evaluated in the context of Electricity Price Forecasting and Wind Power Forecasting, representing complex scenarios in time series forecasting. The results demonstrate that HQR and WACI not only improve or achieve typical measures of validity and efficiency but also successfully fulfil the commonly ignored mentioned characteristics.
☆ Pathformer: Recursive Path Query Encoding for Complex Logical Query Answering
Complex Logical Query Answering (CLQA) over incomplete knowledge graphs is a challenging task. Recently, Query Embedding (QE) methods are proposed to solve CLQA by performing multi-hop logical reasoning. However, most of them only consider historical query context information while ignoring future information, which leads to their failure to capture the complex dependencies behind the elements of a query. In recent years, the transformer architecture has shown a strong ability to model long-range dependencies between words. The bidirectional attention mechanism proposed by the transformer can solve the limitation of these QE methods regarding query context. Still, as a sequence model, it is difficult for the transformer to model complex logical queries with branch structure computation graphs directly. To this end, we propose a neural one-point embedding method called Pathformer based on the tree-like computation graph, i.e., query computation tree. Specifically, Pathformer decomposes the query computation tree into path query sequences by branches and then uses the transformer encoder to recursively encode these path query sequences to obtain the final query embedding. This allows Pathformer to fully utilize future context information to explicitly model the complex interactions between various parts of the path query. Experimental results show that Pathformer outperforms existing competitive neural QE methods, and we found that Pathformer has the potential to be applied to non-one-point embedding space.
comment: This work has been submitted to the IEEE
☆ MOS: Model Synergy for Test-Time Adaptation on LiDAR-Based 3D Object Detection
LiDAR-based 3D object detection is pivotal across many applications, yet the performance of such detection systems often degrades after deployment, especially when faced with unseen test point clouds originating from diverse locations or subjected to corruption. In this work, we introduce a new online adaptation framework for detectors named Model Synergy (MOS). Specifically, MOS dynamically assembles best-fit supermodels for each test batch from a bank of historical checkpoints, leveraging long-term knowledge to guide model updates without forgetting. The model assembly is directed by the proposed synergy weights (SW), employed for weighted averaging of the selected checkpoints to minimize redundancy in the composite supermodel. These weights are calculated by evaluating the similarity of predicted bounding boxes on test data and the feature independence among model pairs in the bank. To maintain an informative yet compact model bank, we pop out checkpoints with the lowest average SW scores and insert newly updated model weights. Our method was rigorously tested against prior test-time domain adaptation strategies on three datasets and under eight types of corruptions, demonstrating its superior adaptability to changing scenes and conditions. Remarkably, our approach achieved a 67.3% increase in performance in a complex "cross-corruption" scenario, which involves cross-dataset inconsistencies and real-world scene corruptions, providing a more realistic testbed of adaptation capabilities. The code is available at https://github.com/zhuoxiao-chen/MOS.
☆ Training Greedy Policy for Proposal Batch Selection in Expensive Multi-Objective Combinatorial Optimization ICML 2024
Active learning is increasingly adopted for expensive multi-objective combinatorial optimization problems, but it involves a challenging subset selection problem, optimizing the batch acquisition score that quantifies the goodness of a batch for evaluation. Due to the excessively large search space of the subset selection problem, prior methods optimize the batch acquisition on the latent space, which has discrepancies with the actual space, or optimize individual acquisition scores without considering the dependencies among candidates in a batch instead of directly optimizing the batch acquisition. To manage the vast search space, a simple and effective approach is the greedy method, which decomposes the problem into smaller subproblems, yet it has difficulty in parallelization since each subproblem depends on the outcome from the previous ones. To this end, we introduce a novel greedy-style subset selection algorithm that optimizes batch acquisition directly on the combinatorial space by sequential greedy sampling from the greedy policy, specifically trained to address all greedy subproblems concurrently. Notably, our experiments on the red fluorescent proteins design task show that our proposed method achieves the baseline performance in 1.69x fewer queries, demonstrating its efficiency.
comment: ICML 2024; Codes at https://github.com/snu-mllab/GreedyPolicyForMOCO
☆ I don't trust you (anymore)! -- The effect of students' LLM use on Lecturer-Student-Trust in Higher Education
Trust plays a pivotal role in Lecturer-Student-Collaboration, encompassing teaching and research aspects. The advent of Large Language Models (LLMs) in platforms like Open AI's ChatGPT, coupled with their cost-effectiveness and high-quality results, has led to their rapid adoption among university students. However, discerning genuine student input from LLM-generated output poses a challenge for lecturers. This dilemma jeopardizes the trust relationship between lecturers and students, potentially impacting university downstream activities, particularly collaborative research initiatives. Despite attempts to establish guidelines for student LLM use, a clear framework mutually beneficial for lecturers and students in higher education remains elusive. This study addresses the research question: How does the use of LLMs by students impact Informational and Procedural Justice, influencing Team Trust and Expected Team Performance? Methodically, we applied a quantitative construct-based survey, evaluated using techniques of Structural Equation Modelling (PLS- SEM) to examine potential relationships among these constructs. Our findings based on 23 valid respondents from Ndejje University indicate that lecturers are less concerned about the fairness of LLM use per se but are more focused on the transparency of student utilization, which significantly influences Team Trust positively. This research contributes to the global discourse on integrating and regulating LLMs and subsequent models in education. We propose that guidelines should support LLM use while enforcing transparency in Lecturer-Student- Collaboration to foster Team Trust and Performance. The study contributes valuable insights for shaping policies enabling ethical and transparent LLMs usage in education to ensure effectiveness of collaborative learning environments.
comment: Working Paper
☆ Direct Multi-Turn Preference Optimization for Language Agents
Adapting Large Language Models (LLMs) for agent tasks is critical in developing language agents. Direct Preference Optimization (DPO) is a promising technique for this adaptation with the alleviation of compounding errors, offering a means to directly optimize Reinforcement Learning (RL) objectives. However, applying DPO to multi-turn tasks presents challenges due to the inability to cancel the partition function. Overcoming this obstacle involves making the partition function independent of the current state and addressing length disparities between preferred and dis-preferred trajectories. In this light, we replace the policy constraint with the state-action occupancy measure constraint in the RL objective and add length normalization to the Bradley-Terry model, yielding a novel loss function named DMPO for multi-turn agent tasks with theoretical explanations. Extensive experiments on three multi-turn agent task datasets confirm the effectiveness and superiority of the DMPO loss.
☆ DistiLRR: Transferring Code Repair for Low-Resource Programming Languages
Large language models (LLMs) have shown remarkable performance on code generation tasks. A recent application of LLMs for code generation is iterative code repair, where a model fixes an incorrect program by rationalizing about errors and generating a new program. However, code repair is primarily studied on high-resource languages like Python, and the framework's efficacy is under-explored on low-resource languages. To apply code repair for low-resource languages, we propose Distilling Low-Resource Repairs (DistiLRR), an approach that transfers the reasoning and code generation ability from a teacher model to a student model. Our results show that DistiLRR consistently outperforms baselines on low-resource languages, but has similar performance on high-resource languages. To investigate this behavior, we perform a further analysis and find that the correlation between rationale quality and code correctness is weaker than previously perceived. We hypothesize this weakness is magnified in low-resource settings where base models lack deep knowledge of a programming language, leading to wavering benefits of code repair between high-resource and low-resource languages.
☆ A review of feature selection strategies utilizing graph data structures and knowledge graphs
Feature selection in Knowledge Graphs (KGs) are increasingly utilized in diverse domains, including biomedical research, Natural Language Processing (NLP), and personalized recommendation systems. This paper delves into the methodologies for feature selection within KGs, emphasizing their roles in enhancing machine learning (ML) model efficacy, hypothesis generation, and interpretability. Through this comprehensive review, we aim to catalyze further innovation in feature selection for KGs, paving the way for more insightful, efficient, and interpretable analytical models across various domains. Our exploration reveals the critical importance of scalability, accuracy, and interpretability in feature selection techniques, advocating for the integration of domain knowledge to refine the selection process. We highlight the burgeoning potential of multi-objective optimization and interdisciplinary collaboration in advancing KG feature selection, underscoring the transformative impact of such methodologies on precision medicine, among other fields. The paper concludes by charting future directions, including the development of scalable, dynamic feature selection algorithms and the integration of explainable AI principles to foster transparency and trust in KG-driven models.
☆ LatentExplainer: Explaining Latent Representations in Deep Generative Models with Multi-modal Foundation Models
Deep generative models like VAEs and diffusion models have advanced various generation tasks by leveraging latent variables to learn data distributions and generate high-quality samples. Despite the field of explainable AI making strides in interpreting machine learning models, understanding latent variables in generative models remains challenging. This paper introduces LatentExplainer, a framework for automatically generating semantically meaningful explanations of latent variables in deep generative models. LatentExplainer tackles three main challenges: inferring the meaning of latent variables, aligning explanations with inductive biases, and handling varying degrees of explainability. By perturbing latent variables and interpreting changes in generated data, the framework provides a systematic approach to understanding and controlling the data generation process, enhancing the transparency and interpretability of deep generative models. We evaluate our proposed method on several real-world and synthetic datasets, and the results demonstrate superior performance in generating high-quality explanations of latent variables.
☆ Accessible, At-Home Detection of Parkinson's Disease via Multi-task Video Analysis
Limited access to neurological care leads to missed diagnoses of Parkinson's disease (PD), leaving many individuals unidentified and untreated. We trained a novel neural network-based fusion architecture to detect Parkinson's disease (PD) by analyzing features extracted from webcam recordings of three tasks: finger tapping, facial expression (smiling), and speech (uttering a sentence containing all letters of the alphabet). Additionally, the model incorporated Monte Carlo Dropout to improve prediction accuracy by considering uncertainties. The study participants (n = 845, 272 with PD) were randomly split into three sets: 60% for training, 20% for model selection (hyper-parameter tuning), and 20% for final performance evaluation. The dataset consists of 1102 sessions, each session containing videos of all three tasks. Our proposed model achieved significantly better accuracy, area under the ROC curve (AUROC), and sensitivity at non-inferior specificity compared to any single-task model. Withholding uncertain predictions further boosted the performance, achieving 88.0% (95% CI: 87.7% - 88.4%) accuracy, 93.0% (92.8% - 93.2%) AUROC, 79.3% (78.4% - 80.2%) sensitivity, and 92.6% (92.3% - 92.8%) specificity, at the expense of not being able to predict for 2.3% (2.0% - 2.6%) data. Further analysis suggests that the trained model does not exhibit any detectable bias across sex and ethnic subgroups and is most effective for individuals aged between 50 and 80. This accessible, low-cost approach requiring only an internet-enabled device with a webcam and microphone paves the way for convenient PD screening at home, particularly in regions with limited access to clinical specialists.
☆ Graph Edge Representation via Tensor Product Graph Convolutional Representation
Graph Convolutional Networks (GCNs) have been widely studied. The core of GCNs is the definition of convolution operators on graphs. However, existing Graph Convolution (GC) operators are mainly defined on adjacency matrix and node features and generally focus on obtaining effective node embeddings which cannot be utilized to address the graphs with (high-dimensional) edge features. To address this problem, by leveraging tensor contraction representation and tensor product graph diffusion theories, this paper analogously defines an effective convolution operator on graphs with edge features which is named as Tensor Product Graph Convolution (TPGC). The proposed TPGC aims to obtain effective edge embeddings. It provides a complementary model to traditional graph convolutions (GCs) to address the more general graph data analysis with both node and edge features. Experimental results on several graph learning tasks demonstrate the effectiveness of the proposed TPGC.
☆ DN-CL: Deep Symbolic Regression against Noise via Contrastive Learning
Noise ubiquitously exists in signals due to numerous factors including physical, electronic, and environmental effects. Traditional methods of symbolic regression, such as genetic programming or deep learning models, aim to find the most fitting expressions for these signals. However, these methods often overlook the noise present in real-world data, leading to reduced fitting accuracy. To tackle this issue, we propose \textit{\textbf{D}eep Symbolic Regression against \textbf{N}oise via \textbf{C}ontrastive \textbf{L}earning (DN-CL)}. DN-CL employs two parameter-sharing encoders to embed data points from various data transformations into feature shields against noise. This model treats noisy data and clean data as different views of the ground-truth mathematical expressions. Distances between these features are minimized, utilizing contrastive learning to distinguish between 'positive' noise-corrected pairs and 'negative' contrasting pairs. Our experiments indicate that DN-CL demonstrates superior performance in handling both noisy and clean data, presenting a promising method of symbolic regression.
☆ TabularMark: Watermarking Tabular Datasets for Machine Learning
Watermarking is broadly utilized to protect ownership of shared data while preserving data utility. However, existing watermarking methods for tabular datasets fall short on the desired properties (detectability, non-intrusiveness, and robustness) and only preserve data utility from the perspective of data statistics, ignoring the performance of downstream ML models trained on the datasets. Can we watermark tabular datasets without significantly compromising their utility for training ML models while preventing attackers from training usable ML models on attacked datasets? In this paper, we propose a hypothesis testing-based watermarking scheme, TabularMark. Data noise partitioning is utilized for data perturbation during embedding, which is adaptable for numerical and categorical attributes while preserving the data utility. For detection, a custom-threshold one proportion z-test is employed, which can reliably determine the presence of the watermark. Experiments on real-world and synthetic datasets demonstrate the superiority of TabularMark in detectability, non-intrusiveness, and robustness.
☆ Bayesian neural networks for predicting uncertainty in full-field material response
Stress and material deformation field predictions are among the most important tasks in computational mechanics. These predictions are typically made by solving the governing equations of continuum mechanics using finite element analysis, which can become computationally prohibitive considering complex microstructures and material behaviors. Machine learning (ML) methods offer potentially cost effective surrogates for these applications. However, existing ML surrogates are either limited to low-dimensional problems and/or do not provide uncertainty estimates in the predictions. This work proposes an ML surrogate framework for stress field prediction and uncertainty quantification for diverse materials microstructures. A modified Bayesian U-net architecture is employed to provide a data-driven image-to-image mapping from initial microstructure to stress field with prediction (epistemic) uncertainty estimates. The Bayesian posterior distributions for the U-net parameters are estimated using three state-of-the-art inference algorithms: the posterior sampling-based Hamiltonian Monte Carlo method and two variational approaches, the Monte-Carlo Dropout method and the Bayes by Backprop algorithm. A systematic comparison of the predictive accuracy and uncertainty estimates for these methods is performed for a fiber reinforced composite material and polycrystalline microstructure application. It is shown that the proposed methods yield predictions of high accuracy compared to the FEA solution, while uncertainty estimates depend on the inference approach. Generally, the Hamiltonian Monte Carlo and Bayes by Backprop methods provide consistent uncertainty estimates. Uncertainty estimates from Monte Carlo Dropout, on the other hand, are more difficult to interpret and depend strongly on the method's design.
☆ ToVo: Toxicity Taxonomy via Voting
Existing toxic detection models face significant limitations, such as lack of transparency, customization, and reproducibility. These challenges stem from the closed-source nature of their training data and the paucity of explanations for their evaluation mechanism. To address these issues, we propose a dataset creation mechanism that integrates voting and chain-of-thought processes, producing a high-quality open-source dataset for toxic content detection. Our methodology ensures diverse classification metrics for each sample and includes both classification scores and explanatory reasoning for the classifications. We utilize the dataset created through our proposed mechanism to train our model, which is then compared against existing widely-used detectors. Our approach not only enhances transparency and customizability but also facilitates better fine-tuning for specific use cases. This work contributes a robust framework for developing toxic content detection models, emphasizing openness and adaptability, thus paving the way for more effective and user-specific content moderation solutions.
☆ Latent diffusion models for parameterization and data assimilation of facies-based geomodels
Geological parameterization entails the representation of a geomodel using a small set of latent variables and a mapping from these variables to grid-block properties such as porosity and permeability. Parameterization is useful for data assimilation (history matching), as it maintains geological realism while reducing the number of variables to be determined. Diffusion models are a new class of generative deep-learning procedures that have been shown to outperform previous methods, such as generative adversarial networks, for image generation tasks. Diffusion models are trained to "denoise", which enables them to generate new geological realizations from input fields characterized by random noise. Latent diffusion models, which are the specific variant considered in this study, provide dimension reduction through use of a low-dimensional latent variable. The model developed in this work includes a variational autoencoder for dimension reduction and a U-net for the denoising process. Our application involves conditional 2D three-facies (channel-levee-mud) systems. The latent diffusion model is shown to provide realizations that are visually consistent with samples from geomodeling software. Quantitative metrics involving spatial and flow-response statistics are evaluated, and general agreement between the diffusion-generated models and reference realizations is observed. Stability tests are performed to assess the smoothness of the parameterization method. The latent diffusion model is then used for ensemble-based data assimilation. Two synthetic "true" models are considered. Significant uncertainty reduction, posterior P$_{10}$-P$_{90}$ forecasts that generally bracket observed data, and consistent posterior geomodels, are achieved in both cases.
☆ On the estimation rate of Bayesian PINN for inverse problems
Solving partial differential equations (PDEs) and their inverse problems using Physics-informed neural networks (PINNs) is a rapidly growing approach in the physics and machine learning community. Although several architectures exist for PINNs that work remarkably in practice, our theoretical understanding of their performances is somewhat limited. In this work, we study the behavior of a Bayesian PINN estimator of the solution of a PDE from $n$ independent noisy measurement of the solution. We focus on a class of equations that are linear in their parameters (with unknown coefficients $\theta_\star$). We show that when the partial differential equation admits a classical solution (say $u_\star$), differentiable to order $\beta$, the mean square error of the Bayesian posterior mean is at least of order $n^{-2\beta/(2\beta + d)}$. Furthermore, we establish a convergence rate of the linear coefficients of $\theta_\star$ depending on the order of the underlying differential operator. Last but not least, our theoretical results are validated through extensive simulations.
comment: 35 Pages, 3 figures, and 2 tables
☆ Probabilistic Emulation of a Global Climate Model with Spherical DYffusion
Data-driven deep learning models are on the verge of transforming global weather forecasting. It is an open question if this success can extend to climate modeling, where long inference rollouts and data complexity pose significant challenges. Here, we present the first conditional generative model able to produce global climate ensemble simulations that are accurate and physically consistent. Our model runs at 6-hourly time steps and is shown to be stable for 10-year-long simulations. Our approach beats relevant baselines and nearly reaches a gold standard for successful climate model emulation. We discuss the key design choices behind our dynamics-informed diffusion model-based approach which enables this significant step towards efficient, data-driven climate simulations that can help us better understand the Earth and adapt to a changing climate.
☆ MU-Bench: A Multitask Multimodal Benchmark for Machine Unlearning
Recent advancements in Machine Unlearning (MU) have introduced solutions to selectively remove certain training samples, such as those with outdated or sensitive information, from trained models. Despite these advancements, evaluation of MU methods have been inconsistent, employing different trained models and architectures, and sample removal strategies, which hampers accurate comparison. In addition, prior MU approaches have mainly focused on singular tasks or modalities, which is not comprehensive. To address these limitations, we develop MU-Bench, the first comprehensive benchmark for MU that (i) unifies the sets of deleted samples and trained models, and (ii) provides broad coverage of tasks and data modalities, including previously unexplored domains such as speech and video classification. Our evaluation show that RandLabel and SalUn are the most effective general MU approaches on MU-Bench, and BadT and SCRUB are capable of achieving random performance on the deletion set. We analyze several under-investigated aspects of unlearning, including scalability, the impacts of parameter-efficient fine-tuning and curriculum learning, and susceptibility to dataset biases. MU-Bench provides an easy-to-use package that includes dataset splits, models, and implementations, together with a leader board to enable unified and scalable MU research.
♻ ☆ Provable Guarantees for Model Performance via Mechanistic Interpretability
In this work, we propose using mechanistic interpretability -- techniques for reverse engineering model weights into human-interpretable algorithms -- to derive and compactly prove formal guarantees on model performance. We prototype this approach by formally proving lower bounds on the accuracy of 151 small transformers trained on a Max-of-$K$ task. We create 102 different computer-assisted proof strategies and assess their length and tightness of bound on each of our models. Using quantitative metrics, we find that shorter proofs seem to require and provide more mechanistic understanding. Moreover, we find that more faithful mechanistic understanding leads to tighter performance bounds. We confirm these connections by qualitatively examining a subset of our proofs. Finally, we identify compounding structureless noise as a key challenge for using mechanistic interpretability to generate compact proofs on model performance.
♻ ☆ Large Reasoning Models for 3D Floorplanning in EDA: Learning from Imperfections
In this paper, we introduce Dreamweaver, which belongs to a new class of auto-regressive decision-making models known as large reasoning models (LRMs). Dreamweaver is designed to improve 3D floorplanning in electronic design automation (EDA) via an architecture that melds advancements in sequence-to-sequence reinforcement learning algorithms. A significant advantage of our approach is its ability to effectively reason over large discrete action spaces, which is essential for handling the numerous potential positions for various functional blocks in floorplanning. Additionally, Dreamweaver demonstrates strong performance even when trained on entirely random trajectories, showcasing its capacity to leverage sub-optimal or non-expert trajectories to enhance its results. This innovative approach contributes to streamlining the integrated circuit (IC) design flow and reducing the high computational costs typically associated with floorplanning. We evaluate its performance against a current state-of-the-art method, highlighting notable improvements.
comment: Under review
♻ ☆ Testing Calibration in Nearly-Linear Time
In the recent literature on machine learning and decision making, calibration has emerged as a desirable and widely-studied statistical property of the outputs of binary prediction models. However, the algorithmic aspects of measuring model calibration have remained relatively less well-explored. Motivated by [BGHN23], which proposed a rigorous framework for measuring distances to calibration, we initiate the algorithmic study of calibration through the lens of property testing. We define the problem of calibration testing from samples where given $n$ draws from a distribution $\mathcal{D}$ on $(predictions, binary outcomes)$, our goal is to distinguish between the case where $\mathcal{D}$ is perfectly calibrated, and the case where $\mathcal{D}$ is $\varepsilon$-far from calibration. We make the simple observation that the empirical smooth calibration linear program can be reformulated as an instance of minimum-cost flow on a highly-structured graph, and design an exact dynamic programming-based solver for it which runs in time $O(n\log^2(n))$, and solves the calibration testing problem information-theoretically optimally in the same time. This improves upon state-of-the-art black-box linear program solvers requiring $\Omega(n^\omega)$ time, where $\omega > 2$ is the exponent of matrix multiplication. We also develop algorithms for tolerant variants of our testing problem improving upon black-box linear program solvers, and give sample complexity lower bounds for alternative calibration measures to the one considered in this work. Finally, we present experiments showing the testing problem we define faithfully captures standard notions of calibration, and that our algorithms scale efficiently to accommodate large sample sizes.
♻ ☆ Getting Serious about Humor: Crafting Humor Datasets with Unfunny Large Language Models
Humor is a fundamental facet of human cognition and interaction. Yet, despite recent advances in natural language processing, humor detection remains a challenging task that is complicated by the scarcity of datasets that pair humorous texts with similar non-humorous counterparts. In our work, we investigate whether large language models (LLMs), can generate synthetic data for humor detection via editing texts. We benchmark LLMs on an existing human dataset and show that current LLMs display an impressive ability to 'unfun' jokes, as judged by humans and as measured on the downstream task of humor detection. We extend our approach to a code-mixed English-Hindi humor dataset, where we find that GPT-4's synthetic data is highly rated by bilingual annotators and provides challenging adversarial examples for humor classifiers.
♻ ☆ Impact of Decentralized Learning on Player Utilities in Stackelberg Games ICML 2024
When deployed in the world, a learning agent such as a recommender system or a chatbot often repeatedly interacts with another learning agent (such as a user) over time. In many such two-agent systems, each agent learns separately and the rewards of the two agents are not perfectly aligned. To better understand such cases, we examine the learning dynamics of the two-agent system and the implications for each agent's objective. We model these systems as Stackelberg games with decentralized learning and show that standard regret benchmarks (such as Stackelberg equilibrium payoffs) result in worst-case linear regret for at least one player. To better capture these systems, we construct a relaxed regret benchmark that is tolerant to small learning errors by agents. We show that standard learning algorithms fail to provide sublinear regret, and we develop algorithms to achieve near-optimal $O(T^{2/3})$ regret for both players with respect to these benchmarks. We further design relaxed environments under which faster learning ($O(\sqrt{T})$) is possible. Altogether, our results take a step towards assessing how two-agent interactions in sequential and decentralized learning environments affect the utility of both agents.
comment: To appear at ICML 2024; this is the full version
♻ ☆ Offline Diversity Maximization Under Imitation Constraints
There has been significant recent progress in the area of unsupervised skill discovery, utilizing various information-theoretic objectives as measures of diversity. Despite these advances, challenges remain: current methods require significant online interaction, fail to leverage vast amounts of available task-agnostic data and typically lack a quantitative measure of skill utility. We address these challenges by proposing a principled offline algorithm for unsupervised skill discovery that, in addition to maximizing diversity, ensures that each learned skill imitates state-only expert demonstrations to a certain degree. Our main analytical contribution is to connect Fenchel duality, reinforcement learning, and unsupervised skill discovery to maximize a mutual information objective subject to KL-divergence state occupancy constraints. Furthermore, we demonstrate the effectiveness of our method on the standard offline benchmark D4RL and on a custom offline dataset collected from a 12-DoF quadruped robot for which the policies trained in simulation transfer well to the real robotic system.
comment: RLC 2024
♻ ☆ The Normal Distributions Indistinguishability Spectrum and its Application to Privacy-Preserving Machine Learning
Differential Privacy (DP) (and its variants) is the most common method for machine learning (ML) on privacy-sensitive data. In big data analytics, one often uses randomized sketching/aggregation algorithms to make processing high-dimensional data tractable. Intuitively, such ML algorithms should provide some inherent privacy, yet most existing DP mechanisms do not leverage or under-utilize this inherent randomness, resulting in potentially redundant noising. The motivating question of our work is: (How) can we improve the utility of DP mechanisms for randomized ML queries, by leveraging the randomness of the query itself? Towards a (positive) answer, our key contribution is (proving) what we call the NDIS theorem, a theoretical result with several practical implications. In a nutshell, NDIS is a closed-form analytic computation for the (varepsilon,delta)-indistinguishability-spectrum (IS) of two arbitrary normal distributions N1 and N2, i.e., the optimal delta (for any given varepsilon) such that N1 and N2 are (varepsilon,delta)-close according to the DP distance. The importance of the NDIS theorem lies in that (1) it yields efficient estimators for IS, and (2) it allows us to analyze DP-mechanism with normally-distributed outputs, as well as more general mechanisms by leveraging their behavior on large inputs. We apply the NDIS theorem to derive DP mechanisms for queries with normally-distributed outputs--i.e., Gaussian Random Projections (RP)--and for more general queries--i.e., Ordinary Least Squares (OLS). Compared to existing techniques, our new DP mechanisms achieve superior privacy/utility trade-offs by leveraging the randomness of the underlying algorithms. We then apply the NDIS theorem to a data-driven DP notion--in particular relative DP introduced by Lu et al. [S&P 2024]. Our method identifies the range of (varepsilon,delta) for which no additional noising is needed.
♻ ☆ Deep hybrid models: infer and plan in the real world
Determining an optimal plan to accomplish a goal is a hard problem in realistic scenarios, which often comprise dynamic and causal relationships between several entities. Although traditionally such problems have been tackled with optimal control and reinforcement learning, a recent biologically-motivated proposal casts planning and control as an inference process. Among these new approaches, one is particularly promising: active inference. This new paradigm assumes that action and perception are two complementary aspects of life whereby the role of the former is to fulfill the predictions inferred by the latter. In this study, we present an effective solution, based on active inference, to complex control tasks. The proposed architecture exploits hybrid (discrete and continuous) processing to construct a hierarchical and dynamic representation of the self and the environment, which is then used to produce a flexible plan consisting of subgoals at different temporal scales. We evaluate this deep hybrid model on a non-trivial task: reaching a moving object after having picked a moving tool. This study extends past work on planning as inference and advances an alternative direction to optimal control and reinforcement learning.
♻ ☆ Directly Fine-Tuning Diffusion Models on Differentiable Rewards ICLR 2024
We present Direct Reward Fine-Tuning (DRaFT), a simple and effective method for fine-tuning diffusion models to maximize differentiable reward functions, such as scores from human preference models. We first show that it is possible to backpropagate the reward function gradient through the full sampling procedure, and that doing so achieves strong performance on a variety of rewards, outperforming reinforcement learning-based approaches. We then propose more efficient variants of DRaFT: DRaFT-K, which truncates backpropagation to only the last K steps of sampling, and DRaFT-LV, which obtains lower-variance gradient estimates for the case when K=1. We show that our methods work well for a variety of reward functions and can be used to substantially improve the aesthetic quality of images generated by Stable Diffusion 1.4. Finally, we draw connections between our approach and prior work, providing a unifying perspective on the design space of gradient-based fine-tuning algorithms.
comment: Published at ICLR 2024
♻ ☆ A Tiny Transformer for Low-Power Arrhythmia Classification on Microcontrollers
Wearable systems for the continuous and real-time monitoring of cardiovascular diseases are becoming widespread and valuable assets in diagnosis and therapy. A promising approach for real-time analysis of the electrocardiographic (ECG) signal and the detection of heart conditions, such as arrhythmia, is represented by the transformer machine learning model. Transformers are powerful models for the classification of time series, although efficient implementation in the wearable domain raises significant design challenges, to combine adequate accuracy and a suitable complexity. In this work, we present a tiny transformer model for the analysis of the ECG signal, requiring only 6k parameters and reaching 98.97% accuracy in the recognition of the 5 most common arrhythmia classes from the MIT-BIH Arrhythmia database, assessed considering 8-bit integer inference as required for efficient execution on low-power microcontroller-based devices. We explored an augmentation-based training approach for improving the robustness against electrode motion artifacts noise, resulting in a worst-case post-deployment performance assessment of 98.36% accuracy. Suitability for wearable monitoring solutions is finally demonstrated through efficient deployment on the parallel ultra-low-power GAP9 processor, where inference execution requires 4.28ms and 0.09mJ.
comment: 2024 IEEE Transactions on Biomedical Circuits and Systems
♻ ☆ Fast sampling from constrained spaces using the Metropolis-adjusted Mirror Langevin algorithm COLT 2024
We propose a new method called the Metropolis-adjusted Mirror Langevin algorithm for approximate sampling from distributions whose support is a compact and convex set. This algorithm adds an accept-reject filter to the Markov chain induced by a single step of the Mirror Langevin algorithm (Zhang et al., 2020), which is a basic discretisation of the Mirror Langevin dynamics. Due to the inclusion of this filter, our method is unbiased relative to the target, while known discretisations of the Mirror Langevin dynamics including the Mirror Langevin algorithm have an asymptotic bias. For this algorithm, we also give upper bounds for the number of iterations taken to mix to a constrained distribution whose potential is relatively smooth, convex, and Lipschitz continuous with respect to a self-concordant mirror function. As a consequence of the reversibility of the Markov chain induced by the inclusion of the Metropolis-Hastings filter, we obtain an exponentially better dependence on the error tolerance for approximate constrained sampling. We also present numerical experiments that corroborate our theoretical findings.
comment: 49 pages, 6 figures, 2 tables. Shorter version without experiments accepted to COLT 2024
♻ ☆ Equivariance via Minimal Frame Averaging for More Symmetries and Efficiency
We consider achieving equivariance in machine learning systems via frame averaging. Current frame averaging methods involve a costly sum over large frames or rely on sampling-based approaches that only yield approximate equivariance. Here, we propose Minimal Frame Averaging (MFA), a mathematical framework for constructing provably minimal frames that are exactly equivariant. The general foundations of MFA also allow us to extend frame averaging to more groups than previously considered, including the Lorentz group for describing symmetries in space-time, and the unitary group for complex-valued domains. Results demonstrate the efficiency and effectiveness of encoding symmetries via MFA across a diverse range of tasks, including $n$-body simulation, top tagging in collider physics, and relaxed energy prediction. Our code is available at https://github.com/divelab/MFA.
♻ ☆ Fine-grained analysis of non-parametric estimation for pairwise learning
In this paper, we are concerned with the generalization performance of non-parametric estimation for pairwise learning. Most of the existing work requires the hypothesis space to be convex or a VC-class, and the loss to be convex. However, these restrictive assumptions limit the applicability of the results in studying many popular methods, especially kernel methods and neural networks. We significantly relax these restrictive assumptions and establish a sharp oracle inequality of the empirical minimizer with a general hypothesis space for the Lipschitz continuous pairwise losses. Our results can be used to handle a wide range of pairwise learning problems including ranking, AUC maximization, pairwise regression, and metric and similarity learning. As an application, we apply our general results to study pairwise least squares regression and derive an excess generalization bound that matches the minimax lower bound for pointwise least squares regression up to a logrithmic term. The key novelty here is to construct a structured deep ReLU neural network as an approximation of the true predictor and design the targeted hypothesis space consisting of the structured networks with controllable complexity. This successful application demonstrates that the obtained general results indeed help us to explore the generalization performance on a variety of problems that cannot be handled by existing approaches.
comment: 30 pages, 1 figure
♻ ☆ Explainable Online Unsupervised Anomaly Detection for Cyber-Physical Systems via Causal Discovery from Time Series
Online unsupervised detection of anomalies is crucial to guarantee the correct operation of cyber-physical systems and the safety of humans interacting with them. State-of-the-art approaches based on deep learning via neural networks achieve outstanding performance at anomaly recognition, evaluating the discrepancy between a normal model of the system (with no anomalies) and the real-time stream of sensor time series. However, large training data and time are typically required, and explainability is still a challenge to identify the root of the anomaly and implement predictive maintainance. In this paper, we use causal discovery to learn a normal causal graph of the system, and we evaluate the persistency of causal links during real-time acquisition of sensor data to promptly detect anomalies. On two benchmark anomaly detection datasets, we show that our method has higher training efficiency, outperforms the accuracy of state-of-the-art neural architectures and correctly identifies the sources of $>10$ different anomalies. The code for experimental replication is at http://tinyurl.com/case24causal.
comment: In publication for IEEE Conference on Automation and Smart Engineering (CASE) 2024
♻ ☆ A Survey on Intelligent Internet of Things: Applications, Security, Privacy, and Future Directions
The rapid advances in the Internet of Things (IoT) have promoted a revolution in communication technology and offered various customer services. Artificial intelligence (AI) techniques have been exploited to facilitate IoT operations and maximize their potential in modern application scenarios. In particular, the convergence of IoT and AI has led to a new networking paradigm called Intelligent IoT (IIoT), which has the potential to significantly transform businesses and industrial domains. This paper presents a comprehensive survey of IIoT by investigating its significant applications in mobile networks, as well as its associated security and privacy issues. Specifically, we explore and discuss the roles of IIoT in a wide range of key application domains, from smart healthcare and smart cities to smart transportation and smart industries. Through such extensive discussions, we investigate important security issues in IIoT networks, where network attacks, confidentiality, integrity, and intrusion are analyzed, along with a discussion of potential countermeasures. Privacy issues in IIoT networks were also surveyed and discussed, including data, location, and model privacy leakage. Finally, we outline several key challenges and highlight potential research directions in this important area.
comment: This work has been accepted by IEEE Communications Surveys & Tutorials
♻ ☆ Incentivizing High-Quality Content in Online Recommender Systems
In content recommender systems such as TikTok and YouTube, the platform's recommendation algorithm shapes content producer incentives. Many platforms employ online learning, which generates intertemporal incentives, since content produced today affects recommendations of future content. We study the game between producers and analyze the content created at equilibrium. We show that standard online learning algorithms, such as Hedge and EXP3, unfortunately incentivize producers to create low-quality content, where producers' effort approaches zero in the long run for typical learning rate schedules. Motivated by this negative result, we design learning algorithms that incentivize producers to invest high effort and achieve high user welfare. At a conceptual level, our work illustrates the unintended impact that a platform's learning algorithm can have on content quality and introduces algorithmic approaches to mitigating these effects.
comment: Updated version with revised and expanded content
♻ ☆ Reinforcement-Learning based routing for packet-optical networks with hybrid telemetry
This article provides a methodology and open-source implementation of Reinforcement Learning algorithms for finding optimal routes in a packet-optical network scenario. The algorithm uses measurements provided by the physical layer (pre-FEC bit error rate and propagation delay) and the link layer (link load) to configure a set of latency-based rewards and penalties based on such measurements. Then, the algorithm executes Q-learning based on this set of rewards for finding the optimal routing strategies. It is further shown that the algorithm dynamically adapts to changing network conditions by re-calculating optimal policies upon either link load changes or link degradation as measured by pre-FEC BER.
♻ ☆ Multi-view Disentanglement for Reinforcement Learning with Multiple Cameras
The performance of image-based Reinforcement Learning (RL) agents can vary depending on the position of the camera used to capture the images. Training on multiple cameras simultaneously, including a first-person egocentric camera, can leverage information from different camera perspectives to improve the performance of RL. However, hardware constraints may limit the availability of multiple cameras in real-world deployment. Additionally, cameras may become damaged in the real-world preventing access to all cameras that were used during training. To overcome these hardware constraints, we propose Multi-View Disentanglement (MVD), which uses multiple cameras to learn a policy that is robust to a reduction in the number of cameras to generalise to any single camera from the training set. Our approach is a self-supervised auxiliary task for RL that learns a disentangled representation from multiple cameras, with a shared representation that is aligned across all cameras to allow generalisation to a single camera, and a private representation that is camera-specific. We show experimentally that an RL agent trained on a single third-person camera is unable to learn an optimal policy in many control tasks; but, our approach, benefiting from multiple cameras during training, is able to solve the task using only the same single third-person camera.
comment: Reinforcement Learning Conference (RLC), 2024
♻ ☆ ApiQ: Finetuning of 2-Bit Quantized Large Language Model
Memory-efficient finetuning of large language models (LLMs) has recently attracted huge attention with the increasing size of LLMs, primarily due to the constraints posed by GPU memory limitations and the effectiveness of these methods compared to full finetuning. Despite the advancements, current strategies for memory-efficient finetuning, such as QLoRA, exhibit inconsistent performance across diverse bit-width quantizations and multifaceted tasks. This inconsistency largely stems from the detrimental impact of the quantization process on preserved knowledge, leading to catastrophic forgetting and undermining the utilization of pretrained models for finetuning purposes. In this work, we introduce a novel quantization framework, ApiQ, designed to restore the lost information from quantization by concurrently initializing the LoRA components and quantizing the weights of LLMs. This approach ensures the maintenance of the original LLM's activation precision while mitigating the error propagation from shallower into deeper layers. Through comprehensive evaluations conducted on a spectrum of language tasks with various LLMs, ApiQ demonstrably minimizes activation error during quantization. Consequently, it consistently achieves superior finetuning results across various bit-widths.
comment: more benchmarks and new method, block-wise ApiQ. code: https://github.com/BaohaoLiao/ApiQ
♻ ☆ Are LLMs Naturally Good at Synthetic Tabular Data Generation?
Large language models (LLMs) have demonstrated their prowess in generating synthetic text and images; however, their potential for generating tabular data -- arguably the most common data type in business and scientific applications -- is largely underexplored. This paper demonstrates that LLMs, used as-is, or after traditional fine-tuning, are severely inadequate as synthetic table generators. Due to the autoregressive nature of LLMs, fine-tuning with random order permutation runs counter to the importance of modeling functional dependencies, and renders LLMs unable to model conditional mixtures of distributions (key to capturing real world constraints). We showcase how LLMs can be made to overcome some of these deficiencies by making them permutation-aware.
♻ ☆ Speech foundation models in healthcare: Effect of layer selection on pathological speech feature prediction INTERSPEECH 2024
Accurately extracting clinical information from speech is critical to the diagnosis and treatment of many neurological conditions. As such, there is interest in leveraging AI for automatic, objective assessments of clinical speech to facilitate diagnosis and treatment of speech disorders. We explore transfer learning using foundation models, focusing on the impact of layer selection for the downstream task of predicting pathological speech features. We find that selecting an optimal layer can greatly improve performance (~15.8% increase in balanced accuracy per feature as compared to worst layer, ~13.6% increase as compared to final layer), though the best layer varies by predicted feature and does not always generalize well to unseen data. A learned weighted sum offers comparable performance to the average best layer in-distribution (only ~1.2% lower) and had strong generalization for out-of-distribution data (only 1.5% lower than the average best layer).
comment: Accepted to INTERSPEECH 2024
♻ ☆ Branches: A Fast Dynamic Programming and Branch & Bound Algorithm for Optimal Decision Trees
Decision Tree Learning is a fundamental problem for Interpretable Machine Learning, yet it poses a formidable optimization challenge. Despite numerous efforts dating back to the early 1990's, practical algorithms have only recently emerged, primarily leveraging Dynamic Programming (DP) and Branch & Bound (B&B) techniques. These breakthroughs led to the development of two distinct approaches. Algorithms like DL8.5 and MurTree operate on the space of nodes (or branches), they are very fast, but do not penalise complex Decision Trees, i.e. they do not solve for sparsity. On the other hand, algorithms like OSDT and GOSDT operate on the space of Decision Trees, they solve for sparsity but at the detriment of speed. In this work, we introduce Branches, a novel algorithm that integrates the strengths of both paradigms. Leveraging DP and B&B, Branches achieves exceptional speed while also solving for sparsity. Central to its efficiency is a novel analytical bound enabling substantial pruning of the search space. Furthermore, Branches does not necessitate binary features. Theoretical analysis demonstrates that Branches has a lower complexity bound compared to state-of-the-art methods, a claim validated through extensive empirical evaluation. Our results illustrate that Branches outperforms the state of the art in terms of speed and number of iterations while consistently yielding optimal Decision Trees.
comment: This preprint is currently under review
♻ ☆ Uncertainty-Aware Probabilistic Graph Neural Networks for Road-Level Traffic Accident Prediction
Traffic accidents present substantial challenges to human safety and socioeconomic development in urban areas. Developing a reliable and responsible traffic accident prediction model is crucial to addressing growing public safety concerns and enhancing the safety of urban mobility systems. Traditional methods face limitations at fine spatiotemporal scales due to the sporadic nature of highrisk accidents and the predominance of nonaccident characteristics. Furthermore, while most current models show promising occurrence prediction, they overlook the uncertainties arising from the inherent nature of accidents, and then fail to adequately map the hierarchical ranking of accident risk values for more precise insights. To address these issues, we introduce the Spatiotemporal ZeroInflated Tweedie Graph Neural Network ,STZITDGNN, the first uncertainty-aware probabilistic graph deep learning model in roadlevel traffic accident prediction for multi-steps. This model integrates the interpretability of the statistical Tweedie family model and the expressive power of graph neural networks. Its decoder innovatively employs a compound Tweedie model, a Poisson distribution to model the frequency of accident occurrences and a Gamma distribution to assess injury severity, supplemented by a zeroinflated component to effectively identify exessive non-incident instances. Empirical tests using realworld traffic data from London, UK, demonstrate that the STZITDGNN surpasses other baseline models across multiple benchmarks and metrics, including accident risk value prediction, uncertainty minimisation, nonaccident road identification and accident occurrence accuracy. Our study demonstrates that STZTIDGNN can effectively inform targeted road monitoring, thereby improving urban road safety strategies.
♻ ☆ Chain-of-Thought Unfaithfulness as Disguised Accuracy
Understanding the extent to which Chain-of-Thought (CoT) generations align with a large language model's (LLM) internal computations is critical for deciding whether to trust an LLM's output. As a proxy for CoT faithfulness, Lanham et al. (2023) propose a metric that measures a model's dependence on its CoT for producing an answer. Within a single family of proprietary models, they find that LLMs exhibit a scaling-then-inverse-scaling relationship between model size and their measure of faithfulness, and that a 13 billion parameter model exhibits increased faithfulness compared to models ranging from 810 million to 175 billion parameters in size. We evaluate whether these results generalize as a property of all LLMs. We replicate the experimental setup in their section focused on scaling experiments with three different families of models and, under specific conditions, successfully reproduce the scaling trends for CoT faithfulness they report. However, after normalizing the metric to account for a model's bias toward certain answer choices, unfaithfulness drops significantly for smaller less-capable models. This normalized faithfulness metric is also strongly correlated ($R^2$=0.74) with accuracy, raising doubts about its validity for evaluating faithfulness.
comment: TMLR accepted paper camera-ready version. First two authors contributed equally. 8 pages main, 13 pages appendix
♻ ☆ Planning to Go Out-of-Distribution in Offline-to-Online Reinforcement Learning
Offline pretraining with a static dataset followed by online fine-tuning (offline-to-online, or OtO) is a paradigm well matched to a real-world RL deployment process. In this scenario, we aim to find the best-performing policy within a limited budget of online interactions. Previous work in the OtO setting has focused on correcting for bias introduced by the policy-constraint mechanisms of offline RL algorithms. Such constraints keep the learned policy close to the behavior policy that collected the dataset, but we show this can unnecessarily limit policy performance if the behavior policy is far from optimal. Instead, we forgo constraints and frame OtO RL as an exploration problem that aims to maximize the benefit of online data-collection. We first study the major online RL exploration methods based on intrinsic rewards and UCB in the OtO setting, showing that intrinsic rewards add training instability through reward-function modification, and UCB methods are myopic and it is unclear which learned-component's ensemble to use for action selection. We then introduce an algorithm for planning to go out-of-distribution (PTGOOD) that avoids these issues. PTGOOD uses a non-myopic planning procedure that targets exploration in relatively high-reward regions of the state-action space unlikely to be visited by the behavior policy. By leveraging concepts from the Conditional Entropy Bottleneck, PTGOOD encourages data collected online to provide new information relevant to improving the final deployment policy without altering rewards. We show empirically in several continuous control tasks that PTGOOD significantly improves agent returns during online fine-tuning and avoids the suboptimal policy convergence that many of our baselines exhibit in several environments.
comment: 10 pages, 17 figures, published at RLC 2024
♻ ☆ Attention as a Hypernetwork
Transformers can under some circumstances generalize to novel problem instances whose constituent parts might have been encountered during training but whose compositions have not. What mechanisms underlie this ability for compositional generalization? By reformulating multi-head attention as a hypernetwork, we reveal that a low-dimensional latent code specifies key-query specific operations. We find empirically that this latent code is highly structured, capturing information about the subtasks performed by the network. Using the framework of attention as a hypernetwork we further propose a simple modification of multi-head linear attention that strengthens the ability for compositional generalization on a range of abstract reasoning tasks. In particular, we introduce a symbolic version of the Raven Progressive Matrices human intelligence test on which we demonstrate how scaling model size and data enables compositional generalization and gives rise to a functionally structured latent code in the transformer.
comment: Code available at https://github.com/smonsays/hypernetwork-attention
♻ ☆ Two Complementary Perspectives to Continual Learning: Ask Not Only What to Optimize, But Also How
Recent years have seen considerable progress in the continual training of deep neural networks, predominantly thanks to approaches that add replay or regularization terms to the loss function to approximate the joint loss over all tasks so far. However, we show that even with a perfect approximation to the joint loss, these approaches still suffer from temporary but substantial forgetting when starting to train on a new task. Motivated by this 'stability gap', we propose that continual learning strategies should focus not only on the optimization objective, but also on the way this objective is optimized. While there is some continual learning work that alters the optimization trajectory (e.g., using gradient projection techniques), this line of research is positioned as alternative to improving the optimization objective, while we argue it should be complementary. In search of empirical support for our proposition, we perform a series of pre-registered experiments combining replay-approximated joint objectives with gradient projection-based optimization routines. However, this first experimental attempt fails to show clear and consistent benefits. Nevertheless, our conceptual arguments, as well as some of our empirical results, demonstrate the distinctive importance of the optimization trajectory in continual learning, thereby opening up a new direction for continual learning research.
comment: Full paper version of pre-registered report accepted at the 1st ContinualAI Unconference. The originally submitted pre-registered proposal can be found at arXiv:2311.04898v1
♻ ☆ Constrained Reinforcement Learning with Average Reward Objective: Model-Based and Model-Free Algorithms
Reinforcement Learning (RL) serves as a versatile framework for sequential decision-making, finding applications across diverse domains such as robotics, autonomous driving, recommendation systems, supply chain optimization, biology, mechanics, and finance. The primary objective in these applications is to maximize the average reward. Real-world scenarios often necessitate adherence to specific constraints during the learning process. This monograph focuses on the exploration of various model-based and model-free approaches for Constrained RL within the context of average reward Markov Decision Processes (MDPs). The investigation commences with an examination of model-based strategies, delving into two foundational methods - optimism in the face of uncertainty and posterior sampling. Subsequently, the discussion transitions to parametrized model-free approaches, where the primal-dual policy gradient-based algorithm is explored as a solution for constrained MDPs. The monograph provides regret guarantees and analyzes constraint violation for each of the discussed setups. For the above exploration, we assume the underlying MDP to be ergodic. Further, this monograph extends its discussion to encompass results tailored for weakly communicating MDPs, thereby broadening the scope of its findings and their relevance to a wider range of practical scenarios.
comment: arXiv admin note: text overlap with arXiv:2402.02042; text overlap with arXiv:2202.00150 by other authors
♻ ☆ NeuroCUT: A Neural Approach for Robust Graph Partitioning KDD
Graph partitioning aims to divide a graph into disjoint subsets while optimizing a specific partitioning objective. The majority of formulations related to graph partitioning exhibit NP-hardness due to their combinatorial nature. Conventional methods, like approximation algorithms or heuristics, are designed for distinct partitioning objectives and fail to achieve generalization across other important partitioning objectives. Recently machine learning-based methods have been developed that learn directly from data. Further, these methods have a distinct advantage of utilizing node features that carry additional information. However, these methods assume differentiability of target partitioning objective functions and cannot generalize for an unknown number of partitions, i.e., they assume the number of partitions is provided in advance. In this study, we develop NeuroCUT with two key innovations over previous methodologies. First, by leveraging a reinforcement learning-based framework over node representations derived from a graph neural network and positional features, NeuroCUT can accommodate any optimization objective, even those with non-differentiable functions. Second, we decouple the parameter space and the partition count making NeuroCUT inductive to any unseen number of partition, which is provided at query time. Through empirical evaluation, we demonstrate that NeuroCUT excels in identifying high-quality partitions, showcases strong generalization across a wide spectrum of partitioning objectives, and exhibits strong generalization to unseen partition count.
comment: To appear in Knowledge Discovery and Data Mining(KDD), 2024
♻ ☆ Federated Learning over Connected Modes
Statistical heterogeneity in federated learning poses two major challenges: slow global training due to conflicting gradient signals, and the need of personalization for local distributions. In this work, we tackle both challenges by leveraging recent advances in \emph{linear mode connectivity} -- identifying a linearly connected low-loss region in the weight space of neural networks, which we call solution simplex. We propose federated learning over connected modes (\textsc{Floco}), where clients are assigned local subregions in this simplex based on their gradient signals, and together learn the shared global solution simplex. This allows personalization of the client models to fit their local distributions within the degrees of freedom in the solution simplex and homogenizes the update signals for the global simplex training. Our experiments show that \textsc{Floco} accelerates the global training process, and significantly improves the local accuracy with minimal computational overhead.
♻ ☆ Information Guided Regularization for Fine-tuning Language Models
The pretraining-fine-tuning paradigm has been the de facto strategy for transfer learning in modern language modeling. With the understanding that task adaptation in LMs is often a function of parameters shared across tasks, we argue that a more surgical approach to regularization needs to exist for smoother transfer learning. Towards this end, we investigate how the pretraining loss landscape is affected by these task-sensitive parameters through an information-theoretic lens. We then leverage the findings from our investigations to devise a novel approach to dropout for improved model regularization and better downstream generalization. This approach, named guided dropout, is both task & architecture agnostic and adds no computational overhead to the fine-tuning process. Through empirical evaluations, we showcase that our approach to regularization yields consistently better performance, even in scenarios of data paucity, compared to standardized baselines.
♻ ☆ Sharp detection of low-dimensional structure in probability measures via dimensional logarithmic Sobolev inequalities
Identifying low-dimensional structure in high-dimensional probability measures is an essential pre-processing step for efficient sampling. We introduce a method for identifying and approximating a target measure $\pi$ as a perturbation of a given reference measure $\mu$ along a few significant directions of $\mathbb{R}^{d}$. The reference measure can be a Gaussian or a nonlinear transformation of a Gaussian, as commonly arising in generative modeling. Our method extends prior work on minimizing majorizations of the Kullback--Leibler divergence to identify optimal approximations within this class of measures. Our main contribution unveils a connection between the \emph{dimensional} logarithmic Sobolev inequality (LSI) and approximations with this ansatz. Specifically, when the target and reference are both Gaussian, we show that minimizing the dimensional LSI is equivalent to minimizing the KL divergence restricted to this ansatz. For general non-Gaussian measures, the dimensional LSI produces majorants that uniformly improve on previous majorants for gradient-based dimension reduction. We further demonstrate the applicability of this analysis to the squared Hellinger distance, where analogous reasoning shows that the dimensional Poincar\'e inequality offers improved bounds.
♻ ☆ Transferability of Graph Neural Networks using Graphon and Sampling Theories
Graph neural networks (GNNs) have become powerful tools for processing graph-based information in various domains. A desirable property of GNNs is transferability, where a trained network can swap in information from a different graph without retraining and retain its accuracy. A recent method of capturing transferability of GNNs is through the use of graphons, which are symmetric, measurable functions representing the limit of large dense graphs. In this work, we contribute to the application of graphons to GNNs by presenting an explicit two-layer graphon neural network (WNN) architecture. We prove its ability to approximate bandlimited graphon signals within a specified error tolerance using a minimal number of network weights. We then leverage this result, to establish the transferability of an explicit two-layer GNN over all sufficiently large graphs in a convergent sequence. Our work addresses transferability between both deterministic weighted graphs and simple random graphs and overcomes issues related to the curse of dimensionality that arise in other GNN results. The proposed WNN and GNN architectures offer practical solutions for handling graph data of varying sizes while maintaining performance guarantees without extensive retraining.
♻ ☆ CORM: Cache Optimization with Recent Message for Large Language Model Inference
Large Language Models (LLMs), despite their remarkable performance across a wide range of tasks, necessitate substantial GPU memory and consume significant computational resources. Beyond the memory taken up by model weights, the memory used by the KV cache rises linearly with sequence length, becoming a primary bottleneck for inference. In this paper, we introduce an innovative method for optimizing the KV cache, which considerably minimizes its memory footprint. Upon thorough investigation, we discover that in most Transformer models, (i) there is a striking similarity between adjacent tokens' query vectors, and (ii) the attention calculation of the current query can rely exclusively on the attention information of a small fraction of preceding queries. Based on these observations, we present CORM, a KV cache eviction policy that dynamically retains essential key-value pairs for inference without the need for model fine-tuning. Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70\% with negligible performance degradation across six tasks in LongBench. Furthermore, we demonstrate that CORM is compatible with GQA for further compression rate.
♻ ☆ Straight-Through meets Sparse Recovery: the Support Exploration Algorithm
The {\it straight-through estimator} (STE) is commonly used to optimize quantized neural networks, yet its contexts of effective performance are still unclear despite empirical successes.To make a step forward in this comprehension, we apply STE to a well-understood problem: {\it sparse support recovery}. We introduce the {\it Support Exploration Algorithm} (SEA), a novel algorithm promoting sparsity, and we analyze its performance in support recovery (a.k.a. model selection) problems. SEA explores more supports than the state-of-the-art, leading to superior performance in experiments, especially when the columns of $A$ are strongly coherent.The theoretical analysis considers recovery guarantees when the linear measurements matrix $A$ satisfies the {\it Restricted Isometry Property} (RIP).The sufficient conditions of recovery are comparable but more stringent than those of the state-of-the-art in sparse support recovery. Their significance lies mainly in their applicability to an instance of the STE.
♻ ☆ HW-GPT-Bench: Hardware-Aware Architecture Benchmark for Language Models
The increasing size of language models necessitates a thorough analysis across multiple dimensions to assess trade-offs among crucial hardware metrics such as latency, energy consumption, GPU memory usage, and performance. Identifying optimal model configurations under specific hardware constraints is becoming essential but remains challenging due to the computational load of exhaustive training and evaluation on multiple devices. To address this, we introduce HW-GPT-Bench, a hardware-aware benchmark that utilizes surrogate predictions to approximate various hardware metrics across 13 devices of architectures in the GPT-2 family, with architectures containing up to 774M parameters. Our surrogates, via calibrated predictions and reliable uncertainty estimates, faithfully model the heteroscedastic noise inherent in the energy and latency measurements. To estimate perplexity, we employ weight-sharing techniques from Neural Architecture Search (NAS), inheriting pretrained weights from the largest GPT-2 model. Finally, we demonstrate the utility of HW-GPT-Bench by simulating optimization trajectories of various multi-objective optimization algorithms in just a few seconds.
comment: 48 pages, 69 figures, 10 tables
♻ ☆ Trust the Model Where It Trusts Itself -- Model-Based Actor-Critic with Uncertainty-Aware Rollout Adaption
Dyna-style model-based reinforcement learning (MBRL) combines model-free agents with predictive transition models through model-based rollouts. This combination raises a critical question: 'When to trust your model?'; i.e., which rollout length results in the model providing useful data? Janner et al. (2019) address this question by gradually increasing rollout lengths throughout the training. While theoretically tempting, uniform model accuracy is a fallacy that collapses at the latest when extrapolating. Instead, we propose asking the question 'Where to trust your model?'. Using inherent model uncertainty to consider local accuracy, we obtain the Model-Based Actor-Critic with Uncertainty-Aware Rollout Adaption (MACURA) algorithm. We propose an easy-to-tune rollout mechanism and demonstrate substantial improvements in data efficiency and performance compared to state-of-the-art deep MBRL methods on the MuJoCo benchmark.
♻ ☆ Benchmarking Pathology Feature Extractors for Whole Slide Image Classification
Weakly supervised whole slide image classification is a key task in computational pathology, which involves predicting a slide-level label from a set of image patches constituting the slide. Constructing models to solve this task involves multiple design choices, often made without robust empirical or conclusive theoretical justification. To address this, we conduct a comprehensive benchmarking of feature extractors to answer three critical questions: 1) Is stain normalisation still a necessary preprocessing step? 2) Which feature extractors are best for downstream slide-level classification? 3) How does magnification affect downstream performance? Our study constitutes the most comprehensive evaluation of publicly available pathology feature extractors to date, involving more than 10,000 training runs across 14 feature extractors, 9 tasks, 5 datasets, 3 downstream architectures, 2 levels of magnification, and various preprocessing setups. Our findings challenge existing assumptions: 1) We observe empirically, and by analysing the latent space, that skipping stain normalisation and image augmentations does not degrade performance, while significantly reducing memory and computational demands. 2) We develop a novel evaluation metric to compare relative downstream performance, and show that the choice of feature extractor is the most consequential factor for downstream performance. 3) We find that lower-magnification slides are sufficient for accurate slide-level classification. Contrary to previous patch-level benchmarking studies, our approach emphasises clinical relevance by focusing on slide-level biomarker prediction tasks in a weakly supervised setting with external validation cohorts. Our findings stand to streamline digital pathology workflows by minimising preprocessing needs and informing the selection of feature extractors.
comment: For the conference version see: arXiv:2311.11772v4. For the longer journal version with additional experiments see arXiv:2311.11772v5
♻ ☆ RL4CO: an Extensive Reinforcement Learning for Combinatorial Optimization Benchmark NeurIPS 2023
Deep reinforcement learning (RL) has recently shown significant benefits in solving combinatorial optimization (CO) problems, reducing reliance on domain expertise, and improving computational efficiency. However, the field lacks a unified benchmark for easy development and standardized comparison of algorithms across diverse CO problems. To fill this gap, we introduce RL4CO, a unified and extensive benchmark with in-depth library coverage of 23 state-of-the-art methods and more than 20 CO problems. Built on efficient software libraries and best practices in implementation, RL4CO features modularized implementation and flexible configuration of diverse RL algorithms, neural network architectures, inference techniques, and environments. RL4CO allows researchers to seamlessly navigate existing successes and develop their unique designs, facilitating the entire research process by decoupling science from heavy engineering. We also provide extensive benchmark studies to inspire new insights and future work. RL4CO has attracted numerous researchers in the community and is open-sourced at https://github.com/ai4co/rl4co.
comment: A previous version was presented as a workshop paper at the NeurIPS 2023 GLFrontiers Workshop (Oral)
♻ ☆ A policy gradient approach for Finite Horizon Constrained Markov Decision Processes
The infinite horizon setting is widely adopted for problems of reinforcement learning (RL). These invariably result in stationary policies that are optimal. In many situations, finite horizon control problems are of interest and for such problems, the optimal policies are time-varying in general. Another setting that has become popular in recent times is of Constrained Reinforcement Learning, where the agent maximizes its rewards while it also aims to satisfy some given constraint criteria. However, this setting has only been studied in the context of infinite horizon MDPs where stationary policies are optimal. We present an algorithm for constrained RL in the Finite Horizon Setting where the horizon terminates after a fixed (finite) time. We use function approximation in our algorithm which is essential when the state and action spaces are large or continuous and use the policy gradient method to find the optimal policy. The optimal policy that we obtain depends on the stage and so is non-stationary in general. To the best of our knowledge, our paper presents the first policy gradient algorithm for the finite horizon setting with constraints. We show the convergence of our algorithm to a constrained optimal policy. We also compare and analyze the performance of our algorithm through experiments and show that our algorithm performs better than some other well known algorithms.
♻ ☆ Random Pareto front surfaces
The goal of multi-objective optimisation is to identify the Pareto front surface which is the set obtained by connecting the best trade-off points. Typically this surface is computed by evaluating the objectives at different points and then interpolating between the subset of the best evaluated trade-off points. In this work, we propose to parameterise the Pareto front surface using polar coordinates. More precisely, we show that any Pareto front surface can be equivalently represented using a scalar-valued length function which returns the projected length along any positive radial direction. We then use this representation in order to rigorously develop the theory and applications of stochastic Pareto front surfaces. In particular, we derive many Pareto front surface statistics of interest such as the expectation, covariance and quantiles. We then discuss how these can be used in practice within a design of experiments setting, where the goal is to both infer and use the Pareto front surface distribution in order to make effective decisions. Our framework allows for clear uncertainty quantification and we also develop advanced visualisation techniques for this purpose. Finally we discuss the applicability of our ideas within multivariate extreme value theory and illustrate our methodology in a variety of numerical examples, including a case study with a real-world air pollution data set.
comment: The code is available at: https://github.com/benmltu/scalarize
♻ ☆ Latent Functional Maps
Neural models learn data representations that lie on low-dimensional manifolds, yet modeling the relation between these representational spaces is an ongoing challenge. By integrating spectral geometry principles into neural modeling, we show that this problem can be better addressed in the functional domain, mitigating complexity, while enhancing interpretability and performances on downstream tasks. To this end, we introduce a multi-purpose framework to the representation learning community, which allows to: (i) compare different spaces in an interpretable way and measure their intrinsic similarity; (ii) find correspondences between them, both in unsupervised and weakly supervised settings, and (iii) to effectively transfer representations between distinct spaces. We validate our framework on various applications, ranging from stitching to retrieval tasks, demonstrating that latent functional maps can serve as a swiss-army knife for representation alignment.
♻ ☆ Jellyfish: A Large Language Model for Data Preprocessing
This paper explores the utilization of LLMs for data preprocessing (DP), a crucial step in the data mining pipeline that transforms raw data into a clean format conducive to easy processing. Whereas the use of LLMs has sparked interest in devising universal solutions to DP, recent initiatives in this domain typically rely on GPT APIs, raising inevitable data breach concerns. Unlike these approaches, we consider instruction-tuning local LLMs (7 -- 13B models) as universal DP task solvers that operate on a local, single, and low-priced GPU, ensuring data security and enabling further customization. We select a collection of datasets across four representative DP tasks and construct instruction tuning data using data configuration, knowledge injection, and reasoning data distillation techniques tailored to DP. By tuning Mistral-7B, Llama 3-8B, and OpenOrca-Platypus2-13B, our models, namely, Jellyfish-7B/8B/13B, deliver competitiveness compared to GPT-3.5/4 models and strong generalizability to unseen tasks while barely compromising the base models' abilities in NLP tasks. Meanwhile, Jellyfish offers enhanced reasoning capabilities compared to GPT-3.5. Our models are available at: https://huggingface.co/NECOUDBFM/Jellyfish . Our instruction dataset is available at: https://huggingface.co/datasets/NECOUDBFM/Jellyfish-Instruct .
comment: a.k.a. "Jellyfish: Instruction-Tuning Local Large Language Models for Data Preprocessing''
♻ ☆ Multi-Objective Quality-Diversity for Crystal Structure Prediction GECCO 2024
Crystal structures are indispensable across various domains, from batteries to solar cells, and extensive research has been dedicated to predicting their properties based on their atomic configurations. However, prevailing Crystal Structure Prediction methods focus on identifying the most stable solutions that lie at the global minimum of the energy function. This approach overlooks other potentially interesting materials that lie in neighbouring local minima and have different material properties such as conductivity or resistance to deformation. By contrast, Quality-Diversity algorithms provide a promising avenue for Crystal Structure Prediction as they aim to find a collection of high-performing solutions that have diverse characteristics. However, it may also be valuable to optimise for the stability of crystal structures alongside other objectives such as magnetism or thermoelectric efficiency. Therefore, in this work, we harness the power of Multi-Objective Quality-Diversity algorithms in order to find crystal structures which have diverse features and achieve different trade-offs of objectives. We analyse our approach on 5 crystal systems and demonstrate that it is not only able to re-discover known real-life structures, but also find promising new ones. Moreover, we propose a method for illuminating the objective space to gain an understanding of what trade-offs can be achieved.
comment: Accepted GECCO 2024
♻ ☆ The ULS23 Challenge: a Baseline Model and Benchmark Dataset for 3D Universal Lesion Segmentation in Computed Tomography
Size measurements of tumor manifestations on follow-up CT examinations are crucial for evaluating treatment outcomes in cancer patients. Efficient lesion segmentation can speed up these radiological workflows. While numerous benchmarks and challenges address lesion segmentation in specific organs like the liver, kidneys, and lungs, the larger variety of lesion types encountered in clinical practice demands a more universal approach. To address this gap, we introduced the ULS23 benchmark for 3D universal lesion segmentation in chest-abdomen-pelvis CT examinations. The ULS23 training dataset contains 38,693 lesions across this region, including challenging pancreatic, colon and bone lesions. For evaluation purposes, we curated a dataset comprising 775 lesions from 284 patients. Each of these lesions was identified as a target lesion in a clinical context, ensuring diversity and clinical relevance within this dataset. The ULS23 benchmark is publicly accessible via uls23.grand-challenge.org, enabling researchers worldwide to assess the performance of their segmentation methods. Furthermore, we have developed and publicly released our baseline semi-supervised 3D lesion segmentation model. This model achieved an average Dice coefficient of 0.703 $\pm$ 0.240 on the challenge test set. We invite ongoing submissions to advance the development of future ULS models.
♻ ☆ Discovering Dynamic Symbolic Policies with Genetic Programming
Artificial intelligence (AI) techniques are increasingly being applied to solve control problems. However, control systems developed in AI are often black-box methods, in that it is not clear how and why they generate their outputs. A lack of transparency can be problematic for control tasks in particular, because it complicates the identification of biases or errors, which in turn negatively influences the user's confidence in the system. To improve the interpretability and transparency in control systems, the black-box structure can be replaced with white-box symbolic policies described by mathematical expressions. Genetic programming offers a gradient-free method to optimise the structure of non-differentiable mathematical expressions. In this paper, we show that genetic programming can be used to discover symbolic control systems. This is achieved by learning a symbolic representation of a function that transforms observations into control signals. We consider both systems that implement static control policies without memory and systems that implement dynamic memory-based control policies. In case of the latter, the discovered function becomes the state equation of a differential equation, which allows for evidence integration. Our results show that symbolic policies are discovered that perform comparably with black-box policies on a variety of control tasks. Furthermore, the additional value of the memory capacity in the dynamic policies is demonstrated on experiments where static policies fall short. Overall, we demonstrate that white-box symbolic policies can be optimised with genetic programming, while offering interpretability and transparency that lacks in black-box models.
comment: 22 pages including references and appendix, 4 figures, 1 algorithm, 5 tables
♻ ☆ Graph Neural Networks in Histopathology: Emerging Trends and Future Directions
Histopathological analysis of Whole Slide Images (WSIs) has seen a surge in the utilization of deep learning methods, particularly Convolutional Neural Networks (CNNs). However, CNNs often fall short in capturing the intricate spatial dependencies inherent in WSIs. Graph Neural Networks (GNNs) present a promising alternative, adept at directly modeling pairwise interactions and effectively discerning the topological tissue and cellular structures within WSIs. Recognizing the pressing need for deep learning techniques that harness the topological structure of WSIs, the application of GNNs in histopathology has experienced rapid growth. In this comprehensive review, we survey GNNs in histopathology, discuss their applications, and explore emerging trends that pave the way for future advancements in the field. We begin by elucidating the fundamentals of GNNs and their potential applications in histopathology. Leveraging quantitative literature analysis, we identify four emerging trends: Hierarchical GNNs, Adaptive Graph Structure Learning, Multimodal GNNs, and Higher-order GNNs. Through an in-depth exploration of these trends, we offer insights into the evolving landscape of GNNs in histopathological analysis. Based on our findings, we propose future directions to propel the field forward. Our analysis serves to guide researchers and practitioners towards innovative approaches and methodologies, fostering advancements in histopathological analysis through the lens of graph neural networks.
♻ ☆ ExcelFormer: Can a DNN be a Sure Bet for Tabular Prediction?
Data organized in tabular format is ubiquitous in real-world applications, and users often craft tables with biased feature definitions and flexibly set prediction targets of their interests. Thus, a rapid development of a robust, effective, dataset-versatile, user-friendly tabular prediction approach is highly desired. While Gradient Boosting Decision Trees (GBDTs) and existing deep neural networks (DNNs) have been extensively utilized by professional users, they present several challenges for casual users, particularly: (i) the dilemma of model selection due to their different dataset preferences, and (ii) the need for heavy hyperparameter searching, failing which their performances are deemed inadequate. In this paper, we delve into this question: Can we develop a deep learning model that serves as a "sure bet" solution for a wide range of tabular prediction tasks, while also being user-friendly for casual users? We delve into three key drawbacks of deep tabular models, encompassing: (P1) lack of rotational variance property, (P2) large data demand, and (P3) over-smooth solution. We propose ExcelFormer, addressing these challenges through a semi-permeable attention module that effectively constrains the influence of less informative features to break the DNNs' rotational invariance property (for P1), data augmentation approaches tailored for tabular data (for P2), and attentive feedforward network to boost the model fitting capability (for P3). These designs collectively make ExcelFormer a "sure bet" solution for diverse tabular datasets. Extensive and stratified experiments conducted on real-world datasets demonstrate that our model outperforms previous approaches across diverse tabular data prediction tasks, and this framework can be friendly to casual users, offering ease of use without the heavy hyperparameter tuning.
♻ ☆ Provable Privacy with Non-Private Pre-Processing
When analysing Differentially Private (DP) machine learning pipelines, the potential privacy cost of data-dependent pre-processing is frequently overlooked in privacy accounting. In this work, we propose a general framework to evaluate the additional privacy cost incurred by non-private data-dependent pre-processing algorithms. Our framework establishes upper bounds on the overall privacy guarantees by utilising two new technical notions: a variant of DP termed Smooth DP and the bounded sensitivity of the pre-processing algorithms. In addition to the generic framework, we provide explicit overall privacy guarantees for multiple data-dependent pre-processing algorithms, such as data imputation, quantization, deduplication and PCA, when used in combination with several DP algorithms. Notably, this framework is also simple to implement, allowing direct integration into existing DP pipelines.
♻ ☆ Active Few-Shot Fine-Tuning
We study the question: How can we select the right data for fine-tuning to a specific task? We call this data selection problem active fine-tuning and show that it is an instance of transductive active learning, a novel generalization of classical active learning. We propose ITL, short for information-based transductive learning, an approach which samples adaptively to maximize information gained about the specified task. We are the first to show, under general regularity assumptions, that such decision rules converge uniformly to the smallest possible uncertainty obtainable from the accessible data. We apply ITL to the few-shot fine-tuning of large neural networks and show that fine-tuning with ITL learns the task with significantly fewer examples than the state-of-the-art.
♻ ☆ Predictions Based on Pixel Data: Insights from PDEs and Finite Differences
As supported by abundant experimental evidence, neural networks are state-of-the-art for many approximation tasks in high-dimensional spaces. Still, there is a lack of a rigorous theoretical understanding of what they can approximate, at which cost, and at which accuracy. One network architecture of practical use, especially for approximation tasks involving images, is (residual) convolutional networks. However, due to the locality of the linear operators involved in these networks, their analysis is more complicated than that of fully connected neural networks. This paper deals with approximation of time sequences where each observation is a matrix. We show that with relatively small networks, we can represent exactly a class of numerical discretizations of PDEs based on the method of lines. We constructively derive these results by exploiting the connections between discrete convolution and finite difference operators. Our network architecture is inspired by those typically adopted in the approximation of time sequences. We support our theoretical results with numerical experiments simulating the linear advection, heat, and Fisher equations.
♻ ☆ Enhancing Actuarial Non-Life Pricing Models via Transformers
Currently, there is a lot of research in the field of neural networks for non-life insurance pricing. The usual goal is to improve the predictive power via neural networks while building upon the generalized linear model, which is the current industry standard. Our paper contributes to this current journey via novel methods to enhance actuarial non-life models with transformer models for tabular data. We build here upon the foundation laid out by the combined actuarial neural network as well as the localGLMnet and enhance those models via the feature tokenizer transformer. The manuscript demonstrates the performance of the proposed methods on a real-world claim frequency dataset and compares them with several benchmark models such as generalized linear models, feed-forward neural networks, combined actuarial neural networks, LocalGLMnet, and pure feature tokenizer transformer. The paper shows that the new methods can achieve better results than the benchmark models while preserving certain generalized linear model advantages. The paper also discusses the practical implications and challenges of applying transformer models in actuarial settings.
comment: This preprint manuscript was uploaded to arXiv on November 10, 2023. It has not undergone peer review or any post-submission improvements or corrections. The peer reviewed Version of Record of this article is published in the European Actuarial Journal, and is available online at https://doi.org/10.1007/s13385-024-00388-2
♻ ☆ Reduction of finite sampling noise in quantum neural networks
Quantum neural networks (QNNs) use parameterized quantum circuits with data-dependent inputs and generate outputs through the evaluation of expectation values. Calculating these expectation values necessitates repeated circuit evaluations, thus introducing fundamental finite-sampling noise even on error-free quantum computers. We reduce this noise by introducing the variance regularization, a technique for reducing the variance of the expectation value during the quantum model training. This technique requires no additional circuit evaluations if the QNN is properly constructed. Our empirical findings demonstrate the reduced variance speeds up the training and lowers the output noise as well as decreases the number of necessary evaluations of gradient circuits. This regularization method is benchmarked on the regression of multiple functions and the potential energy surface of water. We show that in our examples, it lowers the variance by an order of magnitude on average and leads to a significantly reduced noise level of the QNN. We finally demonstrate QNN training on a real quantum device and evaluate the impact of error mitigation. Here, the optimization is feasible only due to the reduced number of necessary shots in the gradient evaluation resulting from the reduced variance.
comment: 15 pages, 12 figures, article at Quantum
♻ ☆ SRL: Scaling Distributed Reinforcement Learning to Over Ten Thousand Cores ICLR 2024
The ever-growing complexity of reinforcement learning (RL) tasks demands a distributed system to efficiently generate and process a massive amount of data. However, existing open-source libraries suffer from various limitations, which impede their practical use in challenging scenarios where large-scale training is necessary. In this paper, we present a novel abstraction on the dataflows of RL training, which unifies diverse RL training applications into a general framework. Following this abstraction, we develop a scalable, efficient, and extensible distributed RL system called ReaLlyScalableRL, which allows efficient and massively parallelized training and easy development of customized algorithms. Our evaluation shows that SRL outperforms existing academic libraries, reaching at most 21x higher training throughput in a distributed setting. On learning performance, beyond performing and scaling well on common RL benchmarks with different RL algorithms, SRL can reproduce the same solution in the challenging hide-and-seek environment as reported by OpenAI with up to 5x speedup in wall-clock time. Notably, SRL is the first in the academic community to perform RL experiments at a large scale with over 15k CPU cores. SRL source code is available at: https://github.com/openpsi-project/srl .
comment: Published at ICLR 2024. 10 pages (24 pages with references and appendix), 7 figures
♻ ☆ Long-time asymptotics of noisy SVGD outside the population limit
Stein Variational Gradient Descent (SVGD) is a widely used sampling algorithm that has been successfully applied in several areas of Machine Learning. SVGD operates by iteratively moving a set of interacting particles (which represent the samples) to approximate the target distribution. Despite recent studies on the complexity of SVGD and its variants, their long-time asymptotic behavior (i.e., after numerous iterations ) is still not understood in the finite number of particles regime. We study the long-time asymptotic behavior of a noisy variant of SVGD. First, we establish that the limit set of noisy SVGD for large is well-defined. We then characterize this limit set, showing that it approaches the target distribution as increases. In particular, noisy SVGD provably avoids the variance collapse observed for SVGD. Our approach involves demonstrating that the trajectories of noisy SVGD closely resemble those described by a McKean-Vlasov process.
♻ ☆ Large Language Model-Driven Curriculum Design for Mobile Networks
This study introduces an innovative framework that employs large language models (LLMs) to automate the design and generation of curricula for reinforcement learning (RL). As mobile networks evolve towards the 6G era, managing their increasing complexity and dynamic nature poses significant challenges. Conventional RL approaches often suffer from slow convergence and poor generalization due to conflicting objectives and the large state and action spaces associated with mobile networks. To address these shortcomings, we introduce curriculum learning, a method that systematically exposes the RL agent to progressively challenging tasks, improving convergence and generalization. However, curriculum design typically requires extensive domain knowledge and manual human effort. Our framework mitigates this by utilizing the generative capabilities of LLMs to automate the curriculum design process, significantly reducing human effort while improving the RL agent's convergence and performance. We deploy our approach within a simulated mobile network environment and demonstrate improved RL convergence rates, generalization to unseen scenarios, and overall performance enhancements. As a case study, we consider autonomous coordination and user association in mobile networks. Our obtained results highlight the potential of combining LLM-based curriculum generation with RL for managing next-generation wireless networks, marking a significant step towards fully autonomous network operations.
comment: To appear in IEEE/CIC ICCC
♻ ☆ Twin Transformer using Gated Dynamic Learnable Attention mechanism for Fault Detection and Diagnosis in the Tennessee Eastman Process
Fault detection and diagnosis (FDD) is a crucial task for ensuring the safety and efficiency of industrial processes. We propose a novel FDD methodology for the Tennessee Eastman Process (TEP), a widely used benchmark for chemical process control. The model employs two separate Transformer branches, enabling independent processing of input data and potential extraction of diverse information. A novel attention mechanism, Gated Dynamic Learnable Attention (GDLAttention), is introduced which integrates a gating mechanism and dynamic learning capabilities. The gating mechanism modulates the attention weights, allowing the model to focus on the most relevant parts of the input. The dynamic learning approach adapts the attention strategy during training, potentially leading to improved performance. The attention mechanism uses a bilinear similarity function, providing greater flexibility in capturing complex relationships between query and key vectors. In order to assess the effectiveness of our approach, we tested it against 21 and 18 distinct fault scenarios in TEP, and compared its performance with several established FDD techniques. The outcomes indicate that the method outperforms others in terms of accuracy, false alarm rate, and misclassification rate. This underscores the robustness and efficacy of the approach for FDD in intricate industrial processes.
♻ ☆ Metric Space Magnitude for Evaluating the Diversity of Latent Representations
The magnitude of a metric space is a novel invariant that provides a measure of the 'effective size' of a space across multiple scales, while also capturing numerous geometrical properties, such as curvature, density, or entropy. We develop a family of magnitude-based measures of the intrinsic diversity of latent representations, formalising a novel notion of dissimilarity between magnitude functions of finite metric spaces. Our measures are provably stable under perturbations of the data, can be efficiently calculated, and enable a rigorous multi-scale characterisation and comparison of latent representations. We show their utility and superior performance across different domains and tasks, including (i) the automated estimation of diversity, (ii) the detection of mode collapse, and (iii) the evaluation of generative models for text, image, and graph data.
♻ ☆ Hinge-Wasserstein: Estimating Multimodal Aleatoric Uncertainty in Regression Tasks
Computer vision systems that are deployed in safety-critical applications need to quantify their output uncertainty. We study regression from images to parameter values and here it is common to detect uncertainty by predicting probability distributions. In this context, we investigate the regression-by-classification paradigm which can represent multimodal distributions, without a prior assumption on the number of modes. Through experiments on a specifically designed synthetic dataset, we demonstrate that traditional loss functions lead to poor probability distribution estimates and severe overconfidence, in the absence of full ground truth distributions. In order to alleviate these issues, we propose hinge-Wasserstein -- a simple improvement of the Wasserstein loss that reduces the penalty for weak secondary modes during training. This enables prediction of complex distributions with multiple modes, and allows training on datasets where full ground truth distributions are not available. In extensive experiments, we show that the proposed loss leads to substantially better uncertainty estimation on two challenging computer vision tasks: horizon line detection and stereo disparity estimation.
♻ ☆ Nearly Minimax Optimal Regret for Multinomial Logistic Bandit
In this paper, we study the contextual multinomial logit (MNL) bandit problem in which a learning agent sequentially selects an assortment based on contextual information, and user feedback follows an MNL choice model. There has been a significant discrepancy between lower and upper regret bounds, particularly regarding the maximum assortment size $K$. Additionally, the variation in reward structures between these bounds complicates the quest for optimality. Under uniform rewards, where all items have the same expected reward, we establish a regret lower bound of $\Omega(d\sqrt{\smash[b]{T/K}})$ and propose a constant-time algorithm, OFU-MNL+, that achieves a matching upper bound of $\tilde{O}(d\sqrt{\smash[b]{T/K}})$. Under non-uniform rewards, we prove a lower bound of $\Omega(d\sqrt{T})$ and an upper bound of $\tilde{O}(d\sqrt{T})$, also achievable by OFU-MNL+. Our empirical studies support these theoretical findings. To the best of our knowledge, this is the first work in the contextual MNL bandit literature to prove minimax optimality -- for either uniform or non-uniform reward setting -- and to propose a computationally efficient algorithm that achieves this optimality up to logarithmic factors.
comment: Preprint. Under review
♻ ☆ Data Augmentation on Graphs: A Technical Survey
In recent years, graph representation learning has achieved remarkable success while suffering from low-quality data problems. As a mature technology to improve data quality in computer vision, data augmentation has also attracted increasing attention in graph domain. To advance research in this emerging direction, this survey provides a comprehensive review and summary of existing graph data augmentation (GDAug) techniques. Specifically, this survey first provides an overview of various feasible taxonomies and categorizes existing GDAug studies based on multi-scale graph elements. Subsequently, for each type of GDAug technique, this survey formalizes standardized technical definition, discuss the technical details, and provide schematic illustration. The survey also reviews domain-specific graph data augmentation techniques, including those for heterogeneous graphs, temporal graphs, spatio-temporal graphs, and hypergraphs. In addition, this survey provides a summary of available evaluation metrics and design guidelines for graph data augmentation. Lastly, it outlines the applications of GDAug at both the data and model levels, discusses open issues in the field, and looks forward to future directions. The latest advances in GDAug are summarized in GitHub.
comment: Version 2. Under review
♻ ☆ Deep learning empowered sensor fusion to improve infant movement classification
There is a recent boom in the development of AI solutions to facilitate and enhance diagnostic procedures for established clinical tools. To assess the integrity of the developing nervous system, the Prechtl general movement assessment (GMA) is recognized for its clinical value in diagnosing neurological impairments in early infancy. GMA has been increasingly augmented through machine learning approaches intending to scale-up its application, circumvent costs in the training of human assessors and further standardize classification of spontaneous motor patterns. Available deep learning tools, all of which are based on single sensor modalities, are however still considerably inferior to that of well-trained human assessors. These approaches are hardly comparable as all models are designed, trained and evaluated on proprietary/silo-data sets. With this study we propose a sensor fusion approach for assessing fidgety movements (FMs) comparing three different sensor modalities (pressure, inertial, and visual sensors). Various combinations and two sensor fusion approaches (late and early fusion) for infant movement classification were tested to evaluate whether a multi-sensor system outperforms single modality assessments. The performance of the three-sensor fusion (classification accuracy of 94.5\%) was significantly higher than that of any single modality evaluated, suggesting the sensor fusion approach is a promising avenue for automated classification of infant motor patterns. The development of a robust sensor fusion system may significantly enhance AI-based early recognition of neurofunctions, ultimately facilitating automated early detection of neurodevelopmental conditions.
♻ ☆ Any-Precision LLM: Low-Cost Deployment of Multiple, Different-Sized LLMs ICML 2024
Recently, considerable efforts have been directed towards compressing Large Language Models (LLMs), which showcase groundbreaking capabilities across diverse applications but entail significant deployment costs due to their large sizes. Meanwhile, much less attention has been given to mitigating the costs associated with deploying multiple LLMs of varying sizes despite its practical significance. Thus, this paper introduces \emph{any-precision LLM}, extending the concept of any-precision DNN to LLMs. Addressing challenges in any-precision LLM, we propose a lightweight method for any-precision quantization of LLMs, leveraging a post-training quantization framework, and develop a specialized software engine for its efficient serving. As a result, our solution significantly reduces the high costs of deploying multiple, different-sized LLMs by overlaying LLMs quantized to varying bit-widths, such as 3, 4, ..., $n$ bits, into a memory footprint comparable to a single $n$-bit LLM. All the supported LLMs with varying bit-widths demonstrate state-of-the-art model quality and inference throughput, proving itself to be a compelling option for deployment of multiple, different-sized LLMs. Our code is open-sourced and available online.
comment: To appear at ICML 2024. Code is available at https://github.com/SNU-ARC/any-precision-llm
♻ ☆ LayerMatch: Do Pseudo-labels Benefit All Layers?
Deep neural networks have achieved remarkable performance across various tasks when supplied with large-scale labeled data. However, the collection of labeled data can be time-consuming and labor-intensive. Semi-supervised learning (SSL), particularly through pseudo-labeling algorithms that iteratively assign pseudo-labels for self-training, offers a promising solution to mitigate the dependency of labeled data. Previous research generally applies a uniform pseudo-labeling strategy across all model layers, assuming that pseudo-labels exert uniform influence throughout. Contrasting this, our theoretical analysis and empirical experiment demonstrate feature extraction layer and linear classification layer have distinct learning behaviors in response to pseudo-labels. Based on these insights, we develop two layer-specific pseudo-label strategies, termed Grad-ReLU and Avg-Clustering. Grad-ReLU mitigates the impact of noisy pseudo-labels by removing the gradient detrimental effects of pseudo-labels in the linear classification layer. Avg-Clustering accelerates the convergence of feature extraction layer towards stable clustering centers by integrating consistent outputs. Our approach, LayerMatch, which integrates these two strategies, can avoid the severe interference of noisy pseudo-labels in the linear classification layer while accelerating the clustering capability of the feature extraction layer. Through extensive experimentation, our approach consistently demonstrates exceptional performance on standard semi-supervised learning benchmarks, achieving a significant improvement of 10.38% over baseline method and a 2.44% increase compared to state-of-the-art methods.
♻ ☆ DiffTOP: Differentiable Trajectory Optimization for Deep Reinforcement and Imitation Learning
This paper introduces DiffTOP, which utilizes Differentiable Trajectory OPtimization as the policy representation to generate actions for deep reinforcement and imitation learning. Trajectory optimization is a powerful and widely used algorithm in control, parameterized by a cost and a dynamics function. The key to our approach is to leverage the recent progress in differentiable trajectory optimization, which enables computing the gradients of the loss with respect to the parameters of trajectory optimization. As a result, the cost and dynamics functions of trajectory optimization can be learned end-to-end. DiffTOP addresses the ``objective mismatch'' issue of prior model-based RL algorithms, as the dynamics model in DiffTOP is learned to directly maximize task performance by differentiating the policy gradient loss through the trajectory optimization process. We further benchmark DiffTOP for imitation learning on standard robotic manipulation task suites with high-dimensional sensory observations and compare our method to feed-forward policy classes as well as Energy-Based Models (EBM) and Diffusion. Across 15 model-based RL tasks and 35imitation learning tasks with high-dimensional image and point cloud inputs, DiffTOP outperforms prior state-of-the-art methods in both domains.
♻ ☆ Fusion-PSRO: Nash Policy Fusion for Policy Space Response Oracles
A popular approach for solving zero-sum games is to maintain populations of policies to approximate the Nash Equilibrium (NE). Previous studies have shown that Policy Space Response Oracle (PSRO) algorithm is an effective multi-agent reinforcement learning framework for solving such games. However, repeatedly training new policies from scratch to approximate Best Response (BR) to opponents' mixed policies at each iteration is both inefficient and costly. While some PSRO variants initialize a new policy by inheriting from past BR policies, this approach limits the exploration of new policies, especially against challenging opponents. To address this issue, we propose Fusion-PSRO, which employs policy fusion to initialize policies for better approximation to BR. By selecting high-quality base policies from meta-NE, policy fusion fuses the base policies into a new policy through model averaging. This approach allows the initialized policies to incorporate multiple expert policies, making it easier to handle difficult opponents compared to inheriting from past BR policies or initializing from scratch. Moreover, our method only modifies the policy initialization phase, allowing its application to nearly all PSRO variants without additional training overhead. Our experiments on non-transitive matrix games, Leduc Poker, and the more complex Liars Dice demonstrate that Fusion-PSRO enhances the performance of nearly all PSRO variants, achieving lower exploitability.
comment: 20 pages, 5 figures
♻ ☆ Concept Prerequisite Relation Prediction by Using Permutation-Equivariant Directed Graph Neural Networks
This paper studies the problem of CPRP, concept prerequisite relation prediction, which is a fundamental task in using AI for education. CPRP is usually formulated into a link-prediction task on a relationship graph of concepts and solved by training the graph neural network (GNN) model. However, current directed GNNs fail to manage graph isomorphism which refers to the invariance of non-isomorphic graphs, reducing the expressivity of resulting representations. We present a permutation-equivariant directed GNN model by introducing the Weisfeiler-Lehman test into directed GNN learning. Our method is then used for CPRP and evaluated on three public datasets. The experimental results show that our model delivers better prediction performance than the state-of-the-art methods.
comment: 9 pages, 1figure, 1 Table (A perprint version)
♻ ☆ Composite Concept Extraction through Backdooring
Learning composite concepts, such as \textquotedbl red car\textquotedbl , from individual examples -- like a white car representing the concept of \textquotedbl car\textquotedbl{} and a red strawberry representing the concept of \textquotedbl red\textquotedbl -- is inherently challenging. This paper introduces a novel method called Composite Concept Extractor (CoCE), which leverages techniques from traditional backdoor attacks to learn these composite concepts in a zero-shot setting, requiring only examples of individual concepts. By repurposing the trigger-based model backdooring mechanism, we create a strategic distortion in the manifold of the target object (e.g., \textquotedbl car\textquotedbl ) induced by example objects with the target property (e.g., \textquotedbl red\textquotedbl ) from objects \textquotedbl red strawberry\textquotedbl , ensuring the distortion selectively affects the target objects with the target property. Contrastive learning is then employed to further refine this distortion, and a method is formulated for detecting objects that are influenced by the distortion. Extensive experiments with in-depth analysis across different datasets demonstrate the utility and applicability of our proposed approach.
♻ ☆ Connect Later: Improving Fine-tuning for Robustness with Targeted Augmentations ICML 2024
Models trained on a labeled source domain (e.g., labeled images from wildlife camera traps) often generalize poorly when deployed on an out-of-distribution (OOD) target domain (e.g., images from new camera trap locations). In the domain adaptation setting where unlabeled target data is available, self-supervised pretraining (e.g., masked autoencoding or contrastive learning) is a promising method to mitigate this performance drop. Pretraining improves OOD error when the generic data augmentations used (e.g., masking or cropping) connect the source and target domains, which may be far apart in the input space. In this paper, we show on real-world tasks that standard fine-tuning after pretraining does not consistently improve OOD error over simply training from scratch on labeled source data. To better leverage pretraining for distribution shifts, we propose Connect Later: after pretraining with generic augmentations, fine-tune with targeted augmentations designed with knowledge of the distribution shift. Pretraining learns good representations within the source and target domains, while targeted augmentations connect the domains better during fine-tuning. Connect Later improves average OOD error over standard fine-tuning and supervised learning with targeted augmentations on 4 real-world datasets: Connect Later achieves the state-of-the-art on astronomical time-series classification (AstroClassification) by 2.5%, wildlife species identification (iWildCam-WILDS) with ResNet-50 by 0.9%, and tumor identification (Camelyon17-WILDS) with DenseNet121 by 1.1%; as well as best performance on a new dataset for astronomical time-series redshift prediction (Redshifts) by 0.03 RMSE (11% relative). Code and datasets are available at https://github.com/helenqu/connect-later.
comment: ICML 2024
♻ ☆ Image anomaly detection and prediction scheme based on SSA optimized ResNet50-BiGRU model
Image anomaly detection is a popular research direction, with many methods emerging in recent years due to rapid advancements in computing. The use of artificial intelligence for image anomaly detection has been widely studied. By analyzing images of athlete posture and movement, it is possible to predict injury status and suggest necessary adjustments. Most existing methods rely on convolutional networks to extract information from irrelevant pixel data, limiting model accuracy. This paper introduces a network combining Residual Network (ResNet) and Bidirectional Gated Recurrent Unit (BiGRU), which can predict potential injury types and provide early warnings by analyzing changes in muscle and bone poses from video images. To address the high complexity of this network, the Sparrow search algorithm was used for optimization. Experiments conducted on four datasets demonstrated that our model has the smallest error in image anomaly detection compared to other models, showing strong adaptability. This provides a new approach for anomaly detection and predictive analysis in images, contributing to the sustainable development of human health and performance.
♻ ☆ FVEL: Interactive Formal Verification Environment with Large Language Models via Theorem Proving
Formal verification (FV) has witnessed growing significance with current emerging program synthesis by the evolving large language models (LLMs). However, current formal verification mainly resorts to symbolic verifiers or hand-craft rules, resulting in limitations for extensive and flexible verification. On the other hand, formal languages for automated theorem proving, such as Isabelle, as another line of rigorous verification, are maintained with comprehensive rules and theorems. In this paper, we propose FVEL, an interactive Formal Verification Environment with LLMs. Specifically, FVEL transforms a given code to be verified into Isabelle, and then conducts verification via neural automated theorem proving with an LLM. The joined paradigm leverages the rigorous yet abundant formulated and organized rules in Isabelle and is also convenient for introducing and adjusting cutting-edge LLMs. To achieve this goal, we extract a large-scale FVELER3. The FVELER dataset includes code dependencies and verification processes that are formulated in Isabelle, containing 758 theories, 29,125 lemmas, and 200,646 proof steps in total with in-depth dependencies. We benchmark FVELER in the FVEL environment by first fine-tuning LLMs with FVELER and then evaluating them on Code2Inv and SV-COMP. The results show that FVEL with FVELER fine-tuned Llama3- 8B solves 17.39% (69 -> 81) more problems, and Mistral-7B 12% (75 -> 84) more problems in SV-COMP. And the proportion of proof errors is reduced. Project page: https://fveler.github.io/.
♻ ☆ QxEAI: Quantum-like evolutionary algorithm for automated probabilistic forecasting
Forecasting, to estimate future events, is crucial for business and decision-making. This paper proposes QxEAI, a methodology that produces a probabilistic forecast that utilizes a quantum-like evolutionary algorithm based on training a quantum-like logic decision tree and a classical value tree on a small number of related time series. We demonstrate how the application of our quantum-like evolutionary algorithm to forecasting can overcome the challenges faced by classical and other machine learning approaches. By using three real-world datasets (Dow Jones Index, retail sales, gas consumption), we show how our methodology produces accurate forecasts while requiring little to none manual work.
♻ ☆ Measuring Sample Importance in Data Pruning for Training LLMs from a Data Compression Perspective
Compute-efficient training of large language models (LLMs) has become an important research problem. In this work, we consider data pruning as a method of data-efficient training of LLMs, where we take a data compression view on data pruning. We argue that the amount of information of a sample, or the achievable compression on its description length, represents its sample importance. The key idea is that, less informative samples are likely to contain redundant information, and thus should be pruned first. We leverage log-likelihood function of trained models as a surrogate to measure information content of samples. Experiments reveal a surprising insight that information-based pruning can enhance the generalization capability of the model, improves upon language modeling and downstream tasks as compared to the model trained on the entire dataset.
♻ ☆ Can Low-Rank Knowledge Distillation in LLMs be Useful for Microelectronic Reasoning?
In this work, we present empirical results regarding the feasibility of using offline large language models (LLMs) in the context of electronic design automation (EDA). The goal is to investigate and evaluate a contemporary language model's (Llama-2-7B) ability to function as a microelectronic Q & A expert as well as its reasoning, and generation capabilities in solving microelectronic-related problems. Llama-2-7B was tested across a variety of adaptation methods, including introducing a novel low-rank knowledge distillation (LoRA-KD) scheme. Our experiments produce both qualitative and quantitative results.
comment: 4 pages, 2 figures, 2 tables, The First IEEE International Workshop on LLM-Aided Design (LAD'24)
♻ ☆ Byzantine-Robust Decentralized Federated Learning CCS '24
Federated learning (FL) enables multiple clients to collaboratively train machine learning models without revealing their private training data. In conventional FL, the system follows the server-assisted architecture (server-assisted FL), where the training process is coordinated by a central server. However, the server-assisted FL framework suffers from poor scalability due to a communication bottleneck at the server, and trust dependency issues. To address challenges, decentralized federated learning (DFL) architecture has been proposed to allow clients to train models collaboratively in a serverless and peer-to-peer manner. However, due to its fully decentralized nature, DFL is highly vulnerable to poisoning attacks, where malicious clients could manipulate the system by sending carefully-crafted local models to their neighboring clients. To date, only a limited number of Byzantine-robust DFL methods have been proposed, most of which are either communication-inefficient or remain vulnerable to advanced poisoning attacks. In this paper, we propose a new algorithm called BALANCE (Byzantine-robust averaging through local similarity in decentralization) to defend against poisoning attacks in DFL. In BALANCE, each client leverages its own local model as a similarity reference to determine if the received model is malicious or benign. We establish the theoretical convergence guarantee for BALANCE under poisoning attacks in both strongly convex and non-convex settings. Furthermore, the convergence rate of BALANCE under poisoning attacks matches those of the state-of-the-art counterparts in Byzantine-free settings. Extensive experiments also demonstrate that BALANCE outperforms existing DFL methods and effectively defends against poisoning attacks.
comment: To appear in ACM Conference on Computer and Communications Security 2024 (CCS '24)
♻ ☆ A Comparative Study of Deep Learning and Iterative Algorithms for Joint Channel Estimation and Signal Detection in OFDM Systems
Joint channel estimation and signal detection (JCESD) is crucial in orthogonal frequency division multiplexing (OFDM) systems, but traditional algorithms perform poorly in low signal-to-noise ratio (SNR) scenarios. Deep learning (DL) methods have been investigated, but concerns regarding computational expense and lack of validation in low-SNR settings remain. Hence, the development of a robust and low-complexity model that can deliver excellent performance across a wide range of SNRs is highly desirable. In this paper, we aim to establish a benchmark where traditional algorithms and DL methods are validated on different channel models, Doppler, and SNR settings, particularly focusing on the semi-blind setting. In particular, we propose a new DL model where the backbone network is formed by unrolling the iterative algorithm, and the hyperparameters are estimated by hypernetworks. Additionally, we adapt a lightweight DenseNet to the task of JCESD for comparison. We evaluate different methods in three aspects: generalization in terms of bit error rate (BER), robustness, and complexity. Our results indicate that DL approaches outperform traditional algorithms in the challenging low-SNR setting, while the iterative algorithm performs better in high-SNR settings. Furthermore, the iterative algorithm is more robust in the presence of carrier frequency offset, whereas DL methods excel when signals are corrupted by asymmetric Gaussian noise.
comment: Code is available at https://github.com/j991222/MIMO_JCESD
♻ ☆ Accelerating Approximate Thompson Sampling with Underdamped Langevin Monte Carlo
Approximate Thompson sampling with Langevin Monte Carlo broadens its reach from Gaussian posterior sampling to encompass more general smooth posteriors. However, it still encounters scalability issues in high-dimensional problems when demanding high accuracy. To address this, we propose an approximate Thompson sampling strategy, utilizing underdamped Langevin Monte Carlo, where the latter is the go-to workhorse for simulations of high-dimensional posteriors. Based on the standard smoothness and log-concavity conditions, we study the accelerated posterior concentration and sampling using a specific potential function. This design improves the sample complexity for realizing logarithmic regrets from $\mathcal{\tilde O}(d)$ to $\mathcal{\tilde O}(\sqrt{d})$. The scalability and robustness of our algorithm are also empirically validated through synthetic experiments in high-dimensional bandit problems.
comment: 52 pages, 2 figures
♻ ☆ SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths
Speculative decoding reduces the inference latency of a target large language model via utilizing a smaller and faster draft model. Its performance depends on a hyperparameter K -- the candidate length, i.e., the number of candidate tokens for the target model to verify in each round. However, previous methods often use simple heuristics to choose K, which may result in sub-optimal performance. We study the choice of the candidate length K and formulate it as a Markov Decision Process. We theoretically show that the optimal policy of this Markov decision process takes the form of a threshold policy, i.e., the current speculation should stop and be verified when the probability of getting a rejection exceeds a threshold value. Motivated by this theory, we propose SpecDec++, an enhanced version of speculative decoding that adaptively determines the candidate length on the fly. We augment the draft model with a trained acceptance prediction head to predict the conditional acceptance probability of the candidate tokens. SpecDec++ will stop the current speculation when the predicted probability that at least one token gets rejected exceeds a threshold. We implement SpecDec++ and apply it to the llama-2-chat 7B & 70B model pair. Our adaptive method achieves a 2.04x speedup on the Alpaca dataset (an additional 7.2% improvement over the baseline speculative decoding). On the GSM8K and HumanEval datasets, our method achieves a 2.26x speedup (9.4% improvement) and 2.23x speedup (11.1% improvement), respectively.
comment: v2: fix Table 1
♻ ☆ MeGA: Merging Multiple Independently Trained Neural Networks Based on Genetic Algorithm
In this paper, we introduce a novel method for merging the weights of multiple pre-trained neural networks using a genetic algorithm called MeGA. Traditional techniques, such as weight averaging and ensemble methods, often fail to fully harness the capabilities of pre-trained networks. Our approach leverages a genetic algorithm with tournament selection, crossover, and mutation to optimize weight combinations, creating a more effective fusion. This technique allows the merged model to inherit advantageous features from both parent models, resulting in enhanced accuracy and robustness. Through experiments on the CIFAR-10 dataset, we demonstrate that our genetic algorithm-based weight merging method improves test accuracy compared to individual models and conventional methods. This approach provides a scalable solution for integrating multiple pre-trained networks across various deep learning applications. Github is available at: https://github.com/YUNBLAK/MeGA-Merging-Multiple-Independently-Trained-Neural-Networks-Based-on-Genetic-Algorithm
Systems and Control 16
☆ Damping Wind Farm Resonances with Current Based Model Predictive Pulse Pattern Control
It is well-established that a proportional current control gain emulates a resistor in the converter output impedance. Even though this resistance can provide additional damping to grid resonances, its effect for traditional linear current controllers is known to be rather limited. Moreover, for medium-voltage systems, high switching frequencies are not an option due to the high switching losses. To meet the harmonic standards, it is expedient to use optimized pulse patterns. This further exacerbates the problems with the resistance of classical controllers, since an additional filtering would be required so that the current controller acts only on the fundamental component (and not on the ripple component). Such a design limits the damping effect not only in its amplitude but also in the frequency range where it is active. This paper shows that a high-bandwidth current-based model predictive pulse pattern controller can alleviate these limitations. The pulse pattern control approach can achieve a high gain even at low switching frequencies, while controlling directly the instantaneous currents (i.e., the fundamental component and the ripple together). With a fast implementation cycle, the frequency range where this damping effect is active can be further extended. Numerical studies showcase these benefits for a multi-phase medium-voltage wind power conversion system.
☆ Generalized Vector Locus Transformation for Unbalanced Three-Phase Systems
Coordinate transformations significantly simplify power systems computations. Most notably, the classical Clarke and dq0 transformations are widely used in three-phase systems, as together they transform balanced abc quantities into constant-valued signals. However, during unbalanced operation, the utility of these transformations diminishes, since a null 0 coordinate cannot be ensured and oscillating signals emerge. While recently proposed transformations ensure a null 0 coordinate, they still do not lead to constant-valued signals in the dq0 domain. In this letter, we propose a generalized vector locus transformation that ensures both a null 0 coordinate and constant-valued signals. Moreover, we show that, in the balanced case, the classical amplitude-invariant Clarke transformation is an instance of the proposed transformation.
☆ Passivity theorems for input-to-state stability of forced Lur'e inclusions and equations, and consequent entrainment-type properties
A suite of input-to-state stability results are presented for a class of forced differential inclusions, so-called Lur'e inclusions. As a consequence, semi-global incremental input-to-state stability results for systems of forced Lur'e differential equations are derived. The results are in the spirit of the passivity theorem from control theory as both the linear and nonlinear components of the Lur'e inclusion (or equation) are assumed to satisfy passivity-type conditions. These results provide a basis for the analysis of forced Lur'e differential equations subject to (almost) periodic forcing terms and, roughly speaking, ensure the existence and attractivity of (almost) periodic state- and output-responses, comprising another focus of the present work. One ultimate aim of the study is to provide a robust and rigorous theoretical foundation for a well-defined and tractable ``frequency response'' of forced Lur'e systems.
comment: 40 pages, 6 figures. Uploaded to arxiv for early dissemination. This version has been submitted for peer-review for publication in a journal
☆ A Biomechatronic Approach to Evaluating the Security of Wearable Devices in the Internet of Medical Things
The Internet of Medical Things (IoMT) has the potential to revolutionize healthcare by reducing human error and improving patient health. For instance, wearable smart infusion pumps can accurately administer medication and integrate with electronic health records. These pumps can alert healthcare professionals or remote servers when an operation fails, preventing distressing incidents. However, as the number of connected medical devices increases, so does the risk of cyber threats. Wearable medication devices based on IoT attached to patients' bodies are particularly vulnerable to significant cyber threats. Since they are connected to the internet, these devices can be exposed to potential harm, which can disrupt or degrade device performance and harm patients. Therefore, it is crucial to establish secure data authentication for internet-connected medical devices to ensure patient safety and well-being. It is also important to note that the wearability option of such devices might downgrade the computational resources, making them more susceptible to security risks. We propose implementing a security approach for a wearable infusion pump to mitigate cyber threats. We evaluated the proposed architecture with 20, 50, and 100 users for 10 minutes and repeated the evaluation 10 times with two infusion settings, each repeated five times. The desired volumes and rates for the two settings were 2 ml and 4 ml/hr and 5 ml and 5 ml/hr, respectively. The maximum error in infusion rate was measured to be 2.5%. We discuss the practical challenges of implementing such a security-enabled device and suggest initial solutions.
comment: Preprint version of the work accepted for publication in the International Journal of Robotics, Theory and Applications, June 2024. arXiv admin note: text overlap with arXiv:2312.08160
☆ Learning Autonomous Race Driving with Action Mapping Reinforcement Learning
Autonomous race driving poses a complex control challenge as vehicles must be operated at the edge of their handling limits to reduce lap times while respecting physical and safety constraints. This paper presents a novel reinforcement learning (RL)-based approach, incorporating the action mapping (AM) mechanism to manage state-dependent input constraints arising from limited tire-road friction. A numerical approximation method is proposed to implement AM, addressing the complex dynamics associated with the friction constraints. The AM mechanism also allows the learned driving policy to be generalized to different friction conditions. Experimental results in our developed race simulator demonstrate that the proposed AM-RL approach achieves superior lap times and better success rates compared to the conventional RL-based approaches. The generalization capability of driving policy with AM is also validated in the experiments.
☆ Resilience of the Electric Grid through Trustable IoT-Coordinated Assets
The electricity grid has evolved from a physical system to a cyber-physical system with digital devices that perform measurement, control, communication, computation, and actuation. The increased penetration of distributed energy resources (DERs) that include renewable generation, flexible loads, and storage provides extraordinary opportunities for improvements in efficiency and sustainability. However, they can introduce new vulnerabilities in the form of cyberattacks, which can cause significant challenges in ensuring grid resilience. %, i.e. the ability to rapidly restore grid services in the face of severe disruptions. We propose a framework in this paper for achieving grid resilience through suitably coordinated assets including a network of Internet of Things (IoT) devices. A local electricity market is proposed to identify trustable assets and carry out this coordination. Situational Awareness (SA) of locally available DERs with the ability to inject power or reduce consumption is enabled by the market, together with a monitoring procedure for their trustability and commitment. With this SA, we show that a variety of cyberattacks can be mitigated using local trustable resources without stressing the bulk grid. The demonstrations are carried out using a variety of platforms with a high-fidelity co-simulation platform, real-time hardware-in-the-loop validation, and a utility-friendly simulator.
comment: Submitted to the Proceedings of the National Academy of Sciences (PNAS), under review
☆ Converse Theorems for Certificates of Safety and Stability
Motivated by the key role of control barrier functions (CBFs) in assessing safety and enabling the synthesis of safe controllers in nonlinear control systems, this paper presents a suite of converse results on CBFs. Given any safe set, we first identify a set of general sufficient conditions which guarantee the existence of a CBF. Our technical analysis also enables us to define an extended notion of CBF which is always guaranteed to exist if the set is safe. We next turn our attention to the problem of joint safety and stability, and give conditions under which the notions of control Lyapunov-barrier function (CLBF) and compatible control Lyapunov function (CLF) and CBF pair are guaranteed to exist. Finally, we identify conditions under which a CLBF and a compatible CLF-CBF pair can be constructed from a non-compatible CLF-CBF pair. Throughout the paper, we intersperse different examples and counterexamples to motivate our results and position them within the state of the art.
☆ Model Predictive Control of the Neural Manifold
Neural manifolds are an attractive theoretical framework for characterizing the complex behaviors of neural populations. However, many of the tools for identifying these low-dimensional subspaces are correlational and provide limited insight into the underlying dynamics. The ability to precisely control this latent activity would allow researchers to investigate the structure and function of neural manifolds. Employing techniques from the field of optimal control, we simulate controlling the latent dynamics of a neural population using closed-loop, dynamically generated sensory inputs. Using a spiking neural network (SNN) as a model of a neural circuit, we find low-dimensional representations of both the network activity (the neural manifold) and a set of salient visual stimuli. With a data-driven latent dynamics model, we apply model predictive control (MPC) to provide anticipatory, optimal control over the trajectory of the circuit in a latent space. We are able to control the latent dynamics of the SNN to follow several reference trajectories despite observing only a subset of neurons and with a substantial amount of unknown noise injected into the network. These results provide a framework to experimentally test for causal relationships between manifold dynamics and other variables of interest such as organismal behavior and BCI performance.
☆ Capture Point Control in Thruster-Assisted Bipedal Locomotion
Despite major advancements in control design that are robust to unplanned disturbances, bipedal robots are still susceptible to falling over and struggle to negotiate rough terrains. By utilizing thrusters in our bipedal robot, we can perform additional posture manipulation and expand the modes of locomotion to enhance the robot's stability and ability to negotiate rough and difficult-to-navigate terrains. In this paper, we present our efforts in designing a controller based on capture point control for our thruster-assisted walking model named Harpy and explore its control design possibilities. While capture point control based on centroidal models for bipedal systems has been extensively studied, the incorporation of external forces that can influence the dynamics of linear inverted pendulum models, often used in capture point-based works, has not been explored before. The inclusion of these external forces can lead to interesting interpretations of locomotion, such as virtual buoyancy studied in aquatic-legged locomotion. This paper outlines the dynamical model of our robot, the capture point method we use to assist the upper body stabilization, and the simulation work done to show the controller's feasibility.
comment: Submitted and to be presented at IEEE AIM 2024. arXiv admin note: substantial text overlap with arXiv:2103.15952
☆ Background results for robust minmax control of linear dynamical systems
The purpose of this note is to summarize the arguments required to derive the results appearing in robust minmax control of linear dynamical systems using a quadratic stage cost. The main results required in robust minmax control are Corollary 19 and Proposition 20. Moreover, the solution to the trust-region problem given in Proposition 15 and Lemma 16 may be of more general interest.
☆ Combining Neural Networks and Symbolic Regression for Analytical Lyapunov Function Discovery
We propose CoNSAL (Combining Neural networks and Symbolic regression for Analytical Lyapunov function) to construct analytical Lyapunov functions for nonlinear dynamic systems. This framework contains a neural Lyapunov function and a symbolic regression component, where symbolic regression is applied to distill the neural network to precise analytical forms. Our approach utilizes symbolic regression not only as a tool for translation but also as a means to uncover counterexamples. This procedure terminates when no counterexamples are found in the analytical formulation. Compared with previous results, our algorithm directly produces an analytical form of the Lyapunov function with improved interpretability in both the learning process and the final results. We apply our algorithm to 2-D inverted pendulum, path following, Van Der Pol Oscillator, 3-D trig dynamics, 4-D rotating wheel pendulum, 6-D 3-bus power system, and demonstrate that our algorithm successfully finds their valid Lyapunov functions.
comment: Workshop paper, accepted by Workshop on Foundations of Reinforcement Learning and Control at the 41st International Conference on Machine Learning, Vienna, Austria
☆ The Stochastic Occupation Kernel Method for System Identification
The method of occupation kernels has been used to learn ordinary differential equations from data in a non-parametric way. We propose a two-step method for learning the drift and diffusion of a stochastic differential equation given snapshots of the process. In the first step, we learn the drift by applying the occupation kernel algorithm to the expected value of the process. In the second step, we learn the diffusion given the drift using a semi-definite program. Specifically, we learn the diffusion squared as a non-negative function in a RKHS associated with the square of a kernel. We present examples and simulations.
comment: 8 pages, 3 figures
☆ Necessary and sufficient condition for neutral-type delay systems: Polynomial approximations
A new necessary and sufficient stability test in a tractable number of operations for linear neutral-type delay systems is introduced. It is developed in the Lyapunov-Krasovskii framework via functionals with prescribed derivatives. The necessary conditions, which stem from substituting any polynomial approximation of the functional argument, reduce to a quadratic form of monomials whose matrix is independent of the coefficients of the approximation under consideration. In the particular case of Chebyshev polynomials, the functional approximation error is quantified, leading to an estimate of the order of approximation such that the positive semi-definiteness of the functional is verified. Some examples illustrate the obtained results.
☆ Cyclic Scheduler Design for Minimizing Age of Information in Massive Scale Networks Susceptible to Packet Errors
In multi-source status update systems, sources need to be scheduled appropriately to maintain timely communication between each of the sources and the monitor. A cyclic schedule is an age-agnostic schedule in which the sources are served according to a fixed finite transmission pattern, which upon completion, repeats itself. Such a scheme has a low $O(1)$ runtime complexity, which is desirable in large networks. This paper's focus is on designing transmission patterns so as to be used in massive scale networking scenarios involving a very large number of sources, e.g., up to thousands of IoT sources, with service time requirements and weights being heterogeneous in nature. The goal is to minimize the weighted sum age of information (AoI), called weighted AoI, when transmitting users' packets over a channel susceptible to heterogeneous packet errors. The main tool we use is a stochastic modeling framework using either Markov chains (MC) or moment generating functions (MGF), by which we obtain the weighted AoI for a given transmission pattern, which is not straightforward in the presence of packet drops. Using this framework, we provide a lower bound on the weighted AoI for the particular case of two sources, and also an algorithm to attain this lower bound. Then, by using the same framework, we design a cyclic scheduler for general number of sources with reasonable complexity using convex optimization and well-established packet spreading algorithms, and comparatively evaluate the proposed algorithm and existing age-agnostic scheduling schemes for general number of sources (resp.~two sources) when the lower bound is not available (resp.~when it is available). We present extensive numerical results to validate the effectiveness of the proposed approach.
☆ Miniature fluorescence sensor for quantitative detection of brain tumour
Fluorescence-guided surgery has emerged as a vital tool for tumour resection procedures. As well as intraoperative tumour visualisation, 5-ALA-induced PpIX provides an avenue for quantitative tumour identification based on ratiometric fluorescence measurement. To this end, fluorescence imaging and fibre-based probes have enabled more precise demarcation between the cancerous and healthy tissues. These sensing approaches, which rely on collecting the fluorescence light from the tumour resection site and its remote spectral sensing, introduce challenges associated with optical losses. In this work, we demonstrate the viability of tumour detection at the resection site using a miniature fluorescence measurement system. Unlike the current bulky systems, which necessitate remote measurement, we have adopted a millimetre-sized spectral sensor chip for quantitative fluorescence measurements. A reliable measurement at the resection site requires a stable optical window between the tissue and the optoelectronic system. This is achieved using an antifouling diamond window, which provides stable optical transparency. The system achieved a sensitivity of 92.3% and specificity of 98.3% in detecting a surrogate tumour at a resolution of 1 x 1 mm2. As well as addressing losses associated with collecting and coupling fluorescence light in the current remote sensing approaches, the small size of the system introduced in this work paves the way for its direct integration with the tumour resection tools with the aim of more accurate interoperative tumour identification.
♻ ☆ Explainable Online Unsupervised Anomaly Detection for Cyber-Physical Systems via Causal Discovery from Time Series
Online unsupervised detection of anomalies is crucial to guarantee the correct operation of cyber-physical systems and the safety of humans interacting with them. State-of-the-art approaches based on deep learning via neural networks achieve outstanding performance at anomaly recognition, evaluating the discrepancy between a normal model of the system (with no anomalies) and the real-time stream of sensor time series. However, large training data and time are typically required, and explainability is still a challenge to identify the root of the anomaly and implement predictive maintainance. In this paper, we use causal discovery to learn a normal causal graph of the system, and we evaluate the persistency of causal links during real-time acquisition of sensor data to promptly detect anomalies. On two benchmark anomaly detection datasets, we show that our method has higher training efficiency, outperforms the accuracy of state-of-the-art neural architectures and correctly identifies the sources of $>10$ different anomalies. The code for experimental replication is at http://tinyurl.com/case24causal.
comment: In publication for IEEE Conference on Automation and Smart Engineering (CASE) 2024
Multiagent Systems 10
☆ Phase-Bounded Broadcast Networks over Topologies of Communication
We study networks of processes that all execute the same finite state protocol and that communicate through broadcasts. The processes are organized in a graph (a topology) and only the neighbors of a process in this graph can receive its broadcasts. The coverability problem asks, given a protocol and a state of the protocol, whether there is a topology for the processes such that one of them (at least) reaches the given state. This problem is undecidable. We study here an under-approximation of the problem where processes alternate a bounded number of times $k$ between phases of broadcasting and phases of receiving messages. We show that, if the problem remains undecidable when $k$ is greater than 6, it becomes decidable for $k=2$, and EXPSPACE-complete for $k=1$. Furthermore, we show that if we restrict ourselves to line topologies, the problem is in $P$ for $k=1$ and $k=2$.
comment: long version of a paper accepted to appear at CONCUR 2024
☆ Towards General Negotiation Strategies with End-to-End Reinforcement Learning
The research field of automated negotiation has a long history of designing agents that can negotiate with other agents. Such negotiation strategies are traditionally based on manual design and heuristics. More recently, reinforcement learning approaches have also been used to train agents to negotiate. However, negotiation problems are diverse, causing observation and action dimensions to change, which cannot be handled by default linear policy networks. Previous work on this topic has circumvented this issue either by fixing the negotiation problem, causing policies to be non-transferable between negotiation problems or by abstracting the observations and actions into fixed-size representations, causing loss of information and expressiveness due to feature design. We developed an end-to-end reinforcement learning method for diverse negotiation problems by representing observations and actions as a graph and applying graph neural networks in the policy. With empirical evaluations, we show that our method is effective and that we can learn to negotiate with other agents on never-before-seen negotiation problems. Our result opens up new opportunities for reinforcement learning in negotiation agents.
comment: Accepted at the Reinforcement Learning Conference (RLC) 2024
☆ Effects of non-uniform number of actions by Hawkes process on spatial cooperation
The emergence of cooperative behavior, despite natural selection favoring rational self-interest, presents a significant evolutionary puzzle. Evolutionary game theory elucidates why cooperative behavior can be advantageous for survival. However, the impact of non-uniformity in the frequency of actions, particularly when actions are altered in the short term, has received little scholarly attention. To demonstrate the relationship between the non-uniformity in the frequency of actions and the evolution of cooperation, we conducted multi-agent simulations of evolutionary games. In our model, each agent performs actions in a chain-reaction, resulting in a non-uniform distribution of the number of actions. To achieve a variety of non-uniform action frequency, we introduced two types of chain-reaction rules: one where an agent's actions trigger subsequent actions, and another where an agent's actions depend on the actions of others. Our results revealed that cooperation evolves more effectively in scenarios with even slight non-uniformity in action frequency compared to completely uniform cases. In addition, scenarios where agents' actions are primarily triggered by their own previous actions more effectively support cooperation, whereas those triggered by others' actions are less effective. This implies that a few highly active individuals contribute positively to cooperation, while the tendency to follow others' actions can hinder it.
☆ Autonomous Agents for Collaborative Task under Information Asymmetry
Large Language Model Multi-Agent Systems (LLM-MAS) have achieved great progress in solving complex tasks. It performs communication among agents within the system to collaboratively solve tasks, under the premise of shared information. However, when agents' communication is leveraged to enhance human cooperation, a new challenge arises due to information asymmetry, since each agent can only access the information of its human user. Previous MAS struggle to complete tasks under this condition. To address this, we propose a new MAS paradigm termed iAgents, which denotes Informative Multi-Agent Systems. In iAgents, the human social network is mirrored in the agent network, where agents proactively exchange human information necessary for task resolution, thereby overcoming information asymmetry. iAgents employs a novel agent reasoning mechanism, InfoNav, to navigate agents' communication towards effective information exchange. Together with InfoNav, iAgents organizes human information in a mixed memory to provide agents with accurate and comprehensive information for exchange. Additionally, we introduce InformativeBench, the first benchmark tailored for evaluating LLM agents' task-solving ability under information asymmetry. Experimental results show that iAgents can collaborate within a social network of 140 individuals and 588 relationships, autonomously communicate over 30 turns, and retrieve information from nearly 70,000 messages to complete tasks within 3 minutes.
comment: 16 pages, 8 figures, 5 tables, Work in progress
☆ Social learning with complex contagion
We introduce a mathematical model that combines the concepts of complex contagion with payoff-biased imitation, to describe how social behaviors spread through a population. Traditional models of social learning by imitation are based on simple contagion -- where an individual may imitate a more successful neighbor following a single interaction. Our framework generalizes this process to incorporate complex contagion, which requires multiple exposures before an individual considers adopting a different behavior. We formulate this as a discrete time and state stochastic process in a finite population, and we derive its continuum limit as an ordinary differential equation that generalizes the replicator equation, the most widely used dynamical model in evolutionary game theory. When applied to linear frequency-dependent games, our social learning with complex contagion produces qualitatively different outcomes than traditional imitation dynamics: it can shift the Prisoner's Dilemma from a unique all-defector equilibrium to either a stable mixture of cooperators and defectors in the population, or a bistable system; it changes the Snowdrift game from a single to a bistable equilibrium; and it can alter the Coordination game from bistability at the boundaries to two internal equilibria. The long-term outcome depends on the balance between the complexity of the contagion process and the strength of selection that biases imitation towards more successful types. Our analysis intercalates the fields of evolutionary game theory with complex contagions, and it provides a synthetic framework that describes more realistic forms of behavioral change in social systems.
comment: 19 pages, 5 figures
☆ Cooperative bots exhibit nuanced effects on cooperation across strategic frameworks
The positive impact of cooperative bots on cooperation within evolutionary game theory is well documented; however, existing studies have predominantly used discrete strategic frameworks, focusing on deterministic actions with a fixed probability of one. This paper extends the investigation to continuous and mixed strategic approaches. Continuous strategies employ intermediate probabilities to convey varying degrees of cooperation and focus on expected payoffs. In contrast, mixed strategies calculate immediate payoffs from actions chosen at a given moment within these probabilities. Using the prisoner's dilemma game, this study examines the effects of cooperative bots on human cooperation within hybrid populations of human players and simple bots, across both well-mixed and structured populations. Our findings reveal that cooperative bots significantly enhance cooperation in both population types across these strategic approaches under weak imitation scenarios, where players are less concerned with material gains. However, under strong imitation scenarios, while cooperative bots do not alter the defective equilibrium in well-mixed populations, they have varied impacts in structured populations across these strategic approaches. Specifically, they disrupt cooperation under discrete and continuous strategies but facilitate it under mixed strategies. These results highlight the nuanced effects of cooperative bots within different strategic frameworks and underscore the need for careful deployment, as their effectiveness is highly sensitive to how humans update their actions and their chosen strategic approach.
☆ Autonomous Decision Making for Air Taxi Networks
Future urban air mobility systems are expected to be operated by rideshare companies as fleets, which will require fully autonomous air traffic control systems and an order of magnitude increase in airspace capacity. Such a system must not only be safe, but also highly responsive to customer demand. This paper proposes the air traffic network problem (ATNP), which models the optimization problem of future cooperative air taxi networks. We propose a three-phase decision making model that efficiently assigns vehicles to passengers, determines flight levels to reduce collision risk, and resolves aircraft conflicts by selectively applying Monte Carlo tree search. We develop a simulator for the ATNP and show that our approach has increased safety and reduced passenger waiting time compared to greedy and first-dispatch protocols over potential vertiport layouts across the Bay Area and New York City.
☆ Contextual Sprint Classification in Soccer Based on Deep Learning IJCAI 2024
The analysis of high-intensity runs (or sprints) in soccer has long been a topic of interest for sports science researchers and practitioners. In particular, recent studies suggested contextualizing sprints based on their tactical purposes to better understand the physical-tactical requirements of modern match-play. However, they have a limitation in scalability, as human experts have to manually classify hundreds of sprints for every match. To address this challenge, this paper proposes a deep learning framework for automatically classifying sprints in soccer into contextual categories. The proposed model covers the permutation-invariant and sequential nature of multi-agent trajectories in soccer by deploying Set Transformers and a bidirectional GRU. We train the model with category labels made through the collaboration of human annotators and a rule-based classifier. Experimental results show that our model classifies sprints in the test dataset into 15 categories with the accuracy of 77.65%, implying the potential of the proposed framework for facilitating the integrated analysis of soccer sprints at scale.
comment: Accepted at IJCAI 2024 Workshop on Intelligent Technologies for Precision Sports Science (IT4PSS 2024)
♻ ☆ Large Language Model-Enabled Multi-Agent Manufacturing Systems
Traditional manufacturing faces challenges adapting to dynamic environments and quickly responding to manufacturing changes. The use of multi-agent systems has improved adaptability and coordination but requires further advancements in rapid human instruction comprehension, operational adaptability, and coordination through natural language integration. Large language models like GPT-3.5 and GPT-4 enhance multi-agent manufacturing systems by enabling agents to communicate in natural language and interpret human instructions for decision-making. This research introduces a novel framework where large language models enhance the capabilities of agents in manufacturing, making them more adaptable, and capable of processing context-specific instructions. A case study demonstrates the practical application of this framework, showing how agents can effectively communicate, understand tasks, and execute manufacturing processes, including precise G-code allocation among agents. The findings highlight the importance of continuous large language model integration into multi-agent manufacturing systems and the development of sophisticated agent communication protocols for a more flexible manufacturing system.
♻ ☆ Fusion-PSRO: Nash Policy Fusion for Policy Space Response Oracles
A popular approach for solving zero-sum games is to maintain populations of policies to approximate the Nash Equilibrium (NE). Previous studies have shown that Policy Space Response Oracle (PSRO) algorithm is an effective multi-agent reinforcement learning framework for solving such games. However, repeatedly training new policies from scratch to approximate Best Response (BR) to opponents' mixed policies at each iteration is both inefficient and costly. While some PSRO variants initialize a new policy by inheriting from past BR policies, this approach limits the exploration of new policies, especially against challenging opponents. To address this issue, we propose Fusion-PSRO, which employs policy fusion to initialize policies for better approximation to BR. By selecting high-quality base policies from meta-NE, policy fusion fuses the base policies into a new policy through model averaging. This approach allows the initialized policies to incorporate multiple expert policies, making it easier to handle difficult opponents compared to inheriting from past BR policies or initializing from scratch. Moreover, our method only modifies the policy initialization phase, allowing its application to nearly all PSRO variants without additional training overhead. Our experiments on non-transitive matrix games, Leduc Poker, and the more complex Liars Dice demonstrate that Fusion-PSRO enhances the performance of nearly all PSRO variants, achieving lower exploitability.
comment: 20 pages, 5 figures
Artificial Intelligence 138
☆ NAVSIM: Data-Driven Non-Reactive Autonomous Vehicle Simulation and Benchmarking
Benchmarking vision-based driving policies is challenging. On one hand, open-loop evaluation with real data is easy, but these results do not reflect closed-loop performance. On the other, closed-loop evaluation is possible in simulation, but is hard to scale due to its significant computational demands. Further, the simulators available today exhibit a large domain gap to real data. This has resulted in an inability to draw clear conclusions from the rapidly growing body of research on end-to-end autonomous driving. In this paper, we present NAVSIM, a middle ground between these evaluation paradigms, where we use large datasets in combination with a non-reactive simulator to enable large-scale real-world benchmarking. Specifically, we gather simulation-based metrics, such as progress and time to collision, by unrolling bird's eye view abstractions of the test scenes for a short simulation horizon. Our simulation is non-reactive, i.e., the evaluated policy and environment do not influence each other. As we demonstrate empirically, this decoupling allows open-loop metric computation while being better aligned with closed-loop evaluations than traditional displacement errors. NAVSIM enabled a new competition held at CVPR 2024, where 143 teams submitted 463 entries, resulting in several new insights. On a large set of challenging scenarios, we observe that simple methods with moderate compute requirements such as TransFuser can match recent large-scale end-to-end driving architectures such as UniAD. Our modular framework can potentially be extended with new datasets, data curation strategies, and metrics, and will be continually maintained to host future challenges. Our code is available at https://github.com/autonomousvision/navsim.
☆ Privacy Preserved Blood Glucose Level Cross-Prediction: An Asynchronous Decentralized Federated Learning Approach
Newly diagnosed Type 1 Diabetes (T1D) patients often struggle to obtain effective Blood Glucose (BG) prediction models due to the lack of sufficient BG data from Continuous Glucose Monitoring (CGM), presenting a significant "cold start" problem in patient care. Utilizing population models to address this challenge is a potential solution, but collecting patient data for training population models in a privacy-conscious manner is challenging, especially given that such data is often stored on personal devices. Considering the privacy protection and addressing the "cold start" problem in diabetes care, we propose "GluADFL", blood Glucose prediction by Asynchronous Decentralized Federated Learning. We compared GluADFL with eight baseline methods using four distinct T1D datasets, comprising 298 participants, which demonstrated its superior performance in accurately predicting BG levels for cross-patient analysis. Furthermore, patients' data might be stored and shared across various communication networks in GluADFL, ranging from highly interconnected (e.g., random, performs the best among others) to more structured topologies (e.g., cluster and ring), suitable for various social networks. The asynchronous training framework supports flexible participation. By adjusting the ratios of inactive participants, we found it remains stable if less than 70% are inactive. Our results confirm that GluADFL offers a practical, privacy-preserving solution for BG prediction in T1D, significantly enhancing the quality of diabetes management.
☆ GenoTEX: A Benchmark for Evaluating LLM-Based Exploration of Gene Expression Data in Alignment with Bioinformaticians
Recent advancements in machine learning have significantly improved the identification of disease-associated genes from gene expression datasets. However, these processes often require extensive expertise and manual effort, limiting their scalability. Large Language Model (LLM)-based agents have shown promise in automating these tasks due to their increasing problem-solving abilities. To support the evaluation and development of such methods, we introduce GenoTEX, a benchmark dataset for the automatic exploration of gene expression data, involving the tasks of dataset selection, preprocessing, and statistical analysis. GenoTEX provides annotated code and results for solving a wide range of gene identification problems, in a full analysis pipeline that follows the standard of computational genomics. These annotations are curated by human bioinformaticians who carefully analyze the datasets to ensure accuracy and reliability. To provide baselines for these tasks, we present GenoAgents, a team of LLM-based agents designed with context-aware planning, iterative correction, and domain expert consultation to collaboratively explore gene datasets. Our experiments with GenoAgents demonstrate the potential of LLM-based approaches in genomics data analysis, while error analysis highlights the challenges and areas for future improvement. We propose GenoTEX as a promising resource for benchmarking and enhancing AI-driven methods for genomics data analysis. We make our benchmark publicly available at \url{https://github.com/Liu-Hy/GenoTex}.
comment: 25 pages, 3 figures
☆ Image Conductor: Precision Control for Interactive Video Synthesis
Filmmaking and animation production often require sophisticated techniques for coordinating camera transitions and object movements, typically involving labor-intensive real-world capturing. Despite advancements in generative AI for video creation, achieving precise control over motion for interactive video asset generation remains challenging. To this end, we propose Image Conductor, a method for precise control of camera transitions and object movements to generate video assets from a single image. An well-cultivated training strategy is proposed to separate distinct camera and object motion by camera LoRA weights and object LoRA weights. To further address cinematographic variations from ill-posed trajectories, we introduce a camera-free guidance technique during inference, enhancing object movements while eliminating camera transitions. Additionally, we develop a trajectory-oriented video motion data curation pipeline for training. Quantitative and qualitative experiments demonstrate our method's precision and fine-grained control in generating motion-controllable videos from images, advancing the practical application of interactive video synthesis. Project webpage available at https://liyaowei-stu.github.io/project/ImageConductor/
comment: Project webpage available at https://liyaowei-stu.github.io/project/ImageConductor/
☆ Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning
The recent success of interleaved Large Multimodal Models (LMMs) in few-shot learning suggests that in-context learning (ICL) with many examples can be promising for learning new tasks. However, this many-shot multimodal ICL setting has one crucial problem: it is fundamentally limited by the model's context length set at pretraining. The problem is especially prominent in the multimodal domain, which processes both text and images, requiring additional tokens. This motivates the need for a multimodal method to compress many shots into fewer tokens without finetuning. In this work, we enable LMMs to perform multimodal, many-shot in-context learning by leveraging Multimodal Task Vectors (MTV)--compact implicit representations of in-context examples compressed in the model's attention heads. Specifically, we first demonstrate the existence of such MTV in LMMs and then leverage these extracted MTV to enable many-shot in-context learning for various vision-and-language tasks. Our experiments suggest that MTV can scale in performance with the number of compressed shots and generalize to similar out-of-domain tasks without additional context length for inference.
☆ Gradient-Mask Tuning Elevates the Upper Limits of LLM Performance
Large language models (LLMs) have revolutionized lots of fields of research. Although it is well-known that fine-tuning is essential for enhancing the capabilities of LLMs, existing research suggests that there is potential redundancy in the fine-tuning process and therefore proposes to update only a subset of parameters. However, these methods fail to leverage the task-specific information to identify important parameters during training. Based on the insight that gradients inherently contain information on task-specific data, we propose Gradient-Mask Tuning (GMT), a method that selectively updates parameters during training based on their gradient information. Specifically, we compute the absolute values of the gradients and apply masking to those with relatively smaller magnitudes. Our empirical results across various tasks demonstrate that GMT not only outperforms traditional fine-tuning methods but also elevates the upper limits of LLM performance. Further analysis indicates that GMT exhibits insensitivity to mask ratio and possesses computational efficiency comparable to vanilla SFT.
☆ An End-to-End, Segmentation-Free, Arabic Handwritten Recognition Model on KHATT
An end-to-end, segmentation-free, deep learning model trained from scratch is proposed, leveraging DCNN for feature extraction, alongside Bidirectional Long-Short Term Memory (BLSTM) for sequence recognition and Connectionist Temporal Classification (CTC) loss function on the KHATT database. The training phase yields remarkable results 84% recognition rate on the test dataset at the character level and 71% on the word level, establishing an image-based sequence recognition framework that operates without segmentation only at the line level. The analysis and preprocessing of the KFUPM Handwritten Arabic TexT (KHATT) database are also presented. Finally, advanced image processing techniques, including filtering, transformation, and line segmentation are implemented. The importance of this work is highlighted by its wide-ranging applications. Including digitizing, documentation, archiving, and text translation in fields such as banking. Moreover, AHR serves as a pivotal tool for making images searchable, enhancing information retrieval capabilities, and enabling effortless editing. This functionality significantly reduces the time and effort required for tasks such as Arabic data organization and manipulation.
☆ Bug In the Code Stack: Can LLMs Find Bugs in Large Python Code Stacks
Recent research in Needle-in-a-Haystack (NIAH) benchmarks has explored the capabilities of Large Language Models (LLMs) in retrieving contextual information from large text documents. However, as LLMs become increasingly integrated into software development processes, it is crucial to evaluate their performance in code-based environments. As LLMs are further developed for program synthesis, we need to ensure that LLMs can understand syntax and write syntactically correct code. As a step in ensuring LLMs understand syntax, LLMs can be evaluated in their ability to find and detect syntax bugs. Our benchmark, Bug In The Code Stack (BICS), is designed to assess the ability of LLMs to identify simple syntax bugs within large source code. Our findings reveal three key insights: (1) code-based environments pose significantly more challenge compared to text-based environments for retrieval tasks, (2) there is a substantial performance disparity among different models, and (3) there is a notable correlation between longer context lengths and performance degradation, though the extent of this degradation varies between models.
comment: 8 pages
☆ LongRAG: Enhancing Retrieval-Augmented Generation with Long-context LLMs
In traditional RAG framework, the basic retrieval units are normally short. The common retrievers like DPR normally work with 100-word Wikipedia paragraphs. Such a design forces the retriever to search over a large corpus to find the `needle' unit. In contrast, the readers only need to extract answers from the short retrieved units. Such an imbalanced `heavy' retriever and `light' reader design can lead to sub-optimal performance. In order to alleviate the imbalance, we propose a new framework LongRAG, consisting of a `long retriever' and a `long reader'. LongRAG processes the entire Wikipedia into 4K-token units, which is 30x longer than before. By increasing the unit size, we significantly reduce the total units from 22M to 700K. This significantly lowers the burden of retriever, which leads to a remarkable retrieval score: answer recall@1=71% on NQ (previously 52%) and answer recall@2=72% (previously 47%) on HotpotQA (full-wiki). Then we feed the top-k retrieved units ($\approx$ 30K tokens) to an existing long-context LLM to perform zero-shot answer extraction. Without requiring any training, LongRAG achieves an EM of 62.7% on NQ, which is the best known result. LongRAG also achieves 64.3% on HotpotQA (full-wiki), which is on par of the SoTA model. Our study offers insights into the future roadmap for combining RAG with long-context LLMs.
comment: Technical Report
☆ Grants4Companies: Applying Declarative Methods for Recommending and Reasoning About Business Grants in the Austrian Public Administration (System Description)
We describe the methods and technologies underlying the application Grants4Companies. The application uses a logic-based expert system to display a list of business grants suitable for the logged-in business. To evaluate suitability of the grants, formal representations of their conditions are evaluated against properties of the business, taken from the registers of the Austrian public administration. The logical language for the representations of the grant conditions is based on S-expressions. We further describe a Proof of Concept implementation of reasoning over the formalised grant conditions. The proof of concept is implemented in Common Lisp and interfaces with a reasoning engine implemented in Scryer Prolog. The application has recently gone live and is provided as part of the Business Service Portal by the Austrian Federal Ministry of Finance.
☆ Cross-Modality Safety Alignment
As Artificial General Intelligence (AGI) becomes increasingly integrated into various facets of human life, ensuring the safety and ethical alignment of such systems is paramount. Previous studies primarily focus on single-modality threats, which may not suffice given the integrated and complex nature of cross-modality interactions. We introduce a novel safety alignment challenge called Safe Inputs but Unsafe Output (SIUO) to evaluate cross-modality safety alignment. Specifically, it considers cases where single modalities are safe independently but could potentially lead to unsafe or unethical outputs when combined. To empirically investigate this problem, we developed the SIUO, a cross-modality benchmark encompassing 9 critical safety domains, such as self-harm, illegal activities, and privacy violations. Our findings reveal substantial safety vulnerabilities in both closed- and open-source LVLMs, such as GPT-4V and LLaVA, underscoring the inadequacy of current models to reliably interpret and respond to complex, real-world scenarios.
☆ Towards Robust Training Datasets for Machine Learning with Ontologies: A Case Study for Emergency Road Vehicle Detection
Countless domains rely on Machine Learning (ML) models, including safety-critical domains, such as autonomous driving, which this paper focuses on. While the black box nature of ML is simply a nuisance in some domains, in safety-critical domains, this makes ML models difficult to trust. To fully utilize ML models in safety-critical domains, it would be beneficial to have a method to improve trust in model robustness and accuracy without human experts checking each decision. This research proposes a method to increase trust in ML models used in safety-critical domains by ensuring the robustness and completeness of the model's training dataset. Because ML models embody what they are trained with, ensuring the completeness of training datasets can help to increase the trust in the training of ML models. To this end, this paper proposes the use of a domain ontology and an image quality characteristic ontology to validate the domain completeness and image quality robustness of a training dataset. This research also presents an experiment as a proof of concept for this method, where ontologies are built for the emergency road vehicle domain.
☆ V-RECS, a Low-Cost LLM4VIS Recommender with Explanations, Captioning and Suggestions
NL2VIS (natural language to visualization) is a promising and recent research area that involves interpreting natural language queries and translating them into visualizations that accurately represent the underlying data. As we navigate the era of big data, NL2VIS holds considerable application potential since it greatly facilitates data exploration by non-expert users. Following the increasingly widespread usage of generative AI in NL2VIS applications, in this paper we present V-RECS, the first LLM-based Visual Recommender augmented with explanations(E), captioning(C), and suggestions(S) for further data exploration. V-RECS' visualization narratives facilitate both response verification and data exploration by non-expert users. Furthermore, our proposed solution mitigates computational, controllability, and cost issues associated with using powerful LLMs by leveraging a methodology to effectively fine-tune small models. To generate insightful visualization narratives, we use Chain-of-Thoughts (CoT), a prompt engineering technique to help LLM identify and generate the logical steps to produce a correct answer. Since CoT is reported to perform poorly with small LLMs, we adopted a strategy in which a large LLM (GPT-4), acting as a Teacher, generates CoT-based instructions to fine-tune a small model, Llama-2-7B, which plays the role of a Student. Extensive experiments-based on a framework for the quantitative evaluation of AI-based visualizations and on manual assessment by a group of participants-show that V-RECS achieves performance scores comparable to GPT-4, at a much lower cost. The efficacy of the V-RECS teacher-student paradigm is also demonstrated by the fact that the un-tuned Llama fails to perform the task in the vast majority of test cases. We release V-RECS for the visualization community to assist visualization designers throughout the entire visualization generation process.
☆ MantisScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation
The recent years have witnessed great advances in video generation. However, the development of automatic video metrics is lagging significantly behind. None of the existing metric is able to provide reliable scores over generated videos. The main barrier is the lack of large-scale human-annotated dataset. In this paper, we release VideoFeedback, the first large-scale dataset containing human-provided multi-aspect score over 37.6K synthesized videos from 11 existing video generative models. We train MantisScore (initialized from Mantis) based on VideoFeedback to enable automatic video quality assessment. Experiments show that the Spearman correlation between MantisScore and humans can reach 77.1 on VideoFeedback-test, beating the prior best metrics by about 50 points. Further result on other held-out EvalCrafter, GenAI-Bench, and VBench show that MantisScore has consistently much higher correlation with human judges than other metrics. Due to these results, we believe MantisScore can serve as a great proxy for human raters to (1) rate different video models to track progress (2) simulate fine-grained human feedback in Reinforcement Learning with Human Feedback (RLHF) to improve current video generation models.
☆ Detecting Synthetic Lyrics with Few-Shot Inference
In recent years, generated content in music has gained significant popularity, with large language models being effectively utilized to produce human-like lyrics in various styles, themes, and linguistic structures. This technological advancement supports artists in their creative processes but also raises issues of authorship infringement, consumer satisfaction and content spamming. To address these challenges, methods for detecting generated lyrics are necessary. However, existing works have not yet focused on this specific modality or on creative text in general regarding machine-generated content detection methods and datasets. In response, we have curated the first dataset of high-quality synthetic lyrics and conducted a comprehensive quantitative evaluation of various few-shot content detection approaches, testing their generalization capabilities and complementing this with a human evaluation. Our best few-shot detector, based on LLM2Vec, surpasses stylistic and statistical methods, which are shown competitive in other domains at distinguishing human-written from machine-generated content. It also shows good generalization capabilities to new artists and models, and effectively detects post-generation paraphrasing. This study emphasizes the need for further research on creative content detection, particularly in terms of generalization and scalability with larger song catalogs. All datasets, pre-processing scripts, and code are available publicly on GitHub and Hugging Face under the Apache 2.0 license.
comment: Under review
☆ Deep UAV Path Planning with Assured Connectivity in Dense Urban Setting
Unmanned Ariel Vehicle (UAV) services with 5G connectivity is an emerging field with numerous applications. Operator-controlled UAV flights and manual static flight configurations are major limitations for the wide adoption of scalability of UAV services. Several services depend on excellent UAV connectivity with a cellular network and maintaining it is challenging in predetermined flight paths. This paper addresses these limitations by proposing a Deep Reinforcement Learning (DRL) framework for UAV path planning with assured connectivity (DUPAC). During UAV flight, DUPAC determines the best route from a defined source to the destination in terms of distance and signal quality. The viability and performance of DUPAC are evaluated under simulated real-world urban scenarios using the Unity framework. The results confirm that DUPAC achieves an autonomous UAV flight path similar to base method with only 2% increment while maintaining an average 9% better connection quality throughout the flight.
comment: 5 pages, 4 figures, Published in the 2024 IEEE Network Operations and Management Symposium (NOMS 2024)
☆ Injecting Bias in Text-To-Image Models via Composite-Trigger Backdoors
Recent advances in large text-conditional image generative models such as Stable Diffusion, Midjourney, and DALL-E 3 have revolutionized the field of image generation, allowing users to produce high-quality, realistic images from textual prompts. While these developments have enhanced artistic creation and visual communication, they also present an underexplored attack opportunity: the possibility of inducing biases by an adversary into the generated images for malicious intentions, e.g., to influence society and spread propaganda. In this paper, we demonstrate the possibility of such a bias injection threat by an adversary who backdoors such models with a small number of malicious data samples; the implemented backdoor is activated when special triggers exist in the input prompt of the backdoored models. On the other hand, the model's utility is preserved in the absence of the triggers, making the attack highly undetectable. We present a novel framework that enables efficient generation of poisoning samples with composite (multi-word) triggers for such an attack. Our extensive experiments using over 1 million generated images and against hundreds of fine-tuned models demonstrate the feasibility of the presented backdoor attack. We illustrate how these biases can bypass conventional detection mechanisms, highlighting the challenges in proving the existence of biases within operational constraints. Our cost analysis confirms the low financial barrier to executing such attacks, underscoring the need for robust defensive strategies against such vulnerabilities in text-to-image generation models.
☆ How Effective is GPT-4 Turbo in Generating School-Level Questions from Textbooks Based on Bloom's Revised Taxonomy?
We evaluate the effectiveness of GPT-4 Turbo in generating educational questions from NCERT textbooks in zero-shot mode. Our study highlights GPT-4 Turbo's ability to generate questions that require higher-order thinking skills, especially at the "understanding" level according to Bloom's Revised Taxonomy. While we find a notable consistency between questions generated by GPT-4 Turbo and those assessed by humans in terms of complexity, there are occasional differences. Our evaluation also uncovers variations in how humans and machines evaluate question quality, with a trend inversely related to Bloom's Revised Taxonomy levels. These findings suggest that while GPT-4 Turbo is a promising tool for educational question generation, its efficacy varies across different cognitive levels, indicating a need for further refinement to fully meet educational standards.
comment: Accepted at Learnersourcing: Student-Generated Content @ Scale 2024
☆ Exploring the Efficacy of Robotic Assistants with ChatGPT and Claude in Enhancing ADHD Therapy: Innovating Treatment Paradigms
Attention Deficit Hyperactivity Disorder (ADHD) is a neurodevelopmental condition characterized by inattention, hyperactivity, and impulsivity, which can significantly impact an individual's daily functioning and quality of life. Occupational therapy plays a crucial role in managing ADHD by fostering the development of skills needed for daily living and enhancing an individual's ability to participate fully in school, home, and social situations. Recent studies highlight the potential of integrating Large Language Models (LLMs) like ChatGPT and Socially Assistive Robots (SAR) to improve psychological treatments. This integration aims to overcome existing limitations in mental health therapy by providing tailored support and adapting to the unique needs of this sensitive group. However, there remains a significant gap in research exploring the combined use of these advanced technologies in ADHD therapy, suggesting an opportunity for novel therapeutic approaches. Thus, we integrated two advanced language models, ChatGPT-4 Turbo and Claude-3 Opus, into a robotic assistant to explore how well each model performs in robot-assisted interactions. Additionally, we have compared their performance in a simulated therapy scenario to gauge their effectiveness against a clinically validated customized model. The results of this study show that ChatGPT-4 Turbo excelled in performance and responsiveness, making it suitable for time-sensitive applications. Claude-3 Opus, on the other hand, showed strengths in understanding, coherence, and ethical considerations, prioritizing safe and engaging interactions. Both models demonstrated innovation and adaptability, but ChatGPT-4 Turbo offered greater ease of integration and broader language support. The selection between them hinges on the specific demands of ADHD therapy.
comment: Paper accepted at the 20th International Conference on Intelligent Environments
☆ UDA: A Benchmark Suite for Retrieval Augmented Generation in Real-world Document Analysis
The use of Retrieval-Augmented Generation (RAG) has improved Large Language Models (LLMs) in collaborating with external data, yet significant challenges exist in real-world scenarios. In areas such as academic literature and finance question answering, data are often found in raw text and tables in HTML or PDF formats, which can be lengthy and highly unstructured. In this paper, we introduce a benchmark suite, namely Unstructured Document Analysis (UDA), that involves 2,965 real-world documents and 29,590 expert-annotated Q&A pairs. We revisit popular LLM- and RAG-based solutions for document analysis and evaluate the design choices and answer qualities across multiple document domains and diverse query types. Our evaluation yields interesting findings and highlights the importance of data parsing and retrieval. We hope our benchmark can shed light and better serve real-world document analysis applications. The benchmark suite and code can be found at https://github.com/qinchuanhui/UDA-Benchmark.
☆ Enhancing Idiomatic Representation in Multiple Languages via an Adaptive Contrastive Triplet Loss
Accurately modeling idiomatic or non-compositional language has been a longstanding challenge in Natural Language Processing (NLP). This is partly because these expressions do not derive their meanings solely from their constituent words, but also due to the scarcity of relevant data resources, and their impact on the performance of downstream tasks such as machine translation and simplification. In this paper we propose an approach to model idiomaticity effectively using a triplet loss that incorporates the asymmetric contribution of components words to an idiomatic meaning for training language models by using adaptive contrastive learning and resampling miners to build an idiomatic-aware learning objective. Our proposed method is evaluated on a SemEval challenge and outperforms previous alternatives significantly in many metrics.
☆ Évaluation des capacités de réponse de larges modèles de langage (LLM) pour des questions d'historiens
Large Language Models (LLMs) like ChatGPT or Bard have revolutionized information retrieval and captivated the audience with their ability to generate custom responses in record time, regardless of the topic. In this article, we assess the capabilities of various LLMs in producing reliable, comprehensive, and sufficiently relevant responses about historical facts in French. To achieve this, we constructed a testbed comprising numerous history-related questions of varying types, themes, and levels of difficulty. Our evaluation of responses from ten selected LLMs reveals numerous shortcomings in both substance and form. Beyond an overall insufficient accuracy rate, we highlight uneven treatment of the French language, as well as issues related to verbosity and inconsistency in the responses provided by LLMs.
comment: in French language
☆ This actually looks like that: Proto-BagNets for local and global interpretability-by-design
Interpretability is a key requirement for the use of machine learning models in high-stakes applications, including medical diagnosis. Explaining black-box models mostly relies on post-hoc methods that do not faithfully reflect the model's behavior. As a remedy, prototype-based networks have been proposed, but their interpretability is limited as they have been shown to provide coarse, unreliable, and imprecise explanations.In this work, we introduce Proto-BagNets, an interpretable-by-design prototype-based model that combines the advantages of bag-of-local feature models and prototype learning to provide meaningful, coherent, and relevant prototypical parts needed for accurate and interpretable image classification tasks. We evaluated the Proto-BagNet for drusen detection on publicly available retinal OCT data. The Proto-BagNet performed comparably to the state-of-the-art interpretable and non-interpretable models while providing faithful, accurate, and clinically meaningful local and global explanations. The code is available at https://github.com/kdjoumessi/Proto-BagNets.
☆ Perks and Pitfalls of Faithfulness in Regular, Self-Explainable and Domain Invariant GNNs
As Graph Neural Networks (GNNs) become more pervasive, it becomes paramount to build robust tools for computing explanations of their predictions. A key desideratum is that these explanations are faithful, i.e., that they portray an accurate picture of the GNN's reasoning process. A number of different faithfulness metrics exist, begging the question of what faithfulness is exactly, and what its properties are. We begin by showing that existing metrics are not interchangeable -- i.e., explanations attaining high faithfulness according to one metric may be unfaithful according to others -- and can be systematically insensitive to important properties of the explanation, and suggest how to address these issues. We proceed to show that, surprisingly, optimizing for faithfulness is not always a sensible design goal. Specifically, we show that for injective regular GNN architectures, perfectly faithful explanations are completely uninformative. The situation is different for modular GNNs, such as self-explainable and domain-invariant architectures, where optimizing faithfulness does not compromise informativeness, and is also unexpectedly tied to out-of-distribution generalization.
☆ Generative Topological Networks
Generative models have seen significant advancements in recent years, yet often remain challenging and costly to train and use. We introduce Generative Topological Networks (GTNs) -- a new class of generative models that addresses these shortcomings. GTNs are trained deterministically using a simple supervised learning approach grounded in topology theory. GTNs are fast to train, and require only a single forward pass in a standard feedforward neural network to generate samples. We demonstrate the strengths of GTNs in several datasets, including MNIST, celebA and the Hands and Palm Images dataset. Finally, the theory behind GTNs offers insights into how to train generative models for improved performance.
☆ Gaussian Splatting to Real World Flight Navigation Transfer with Liquid Networks
Simulators are powerful tools for autonomous robot learning as they offer scalable data generation, flexible design, and optimization of trajectories. However, transferring behavior learned from simulation data into the real world proves to be difficult, usually mitigated with compute-heavy domain randomization methods or further model fine-tuning. We present a method to improve generalization and robustness to distribution shifts in sim-to-real visual quadrotor navigation tasks. To this end, we first build a simulator by integrating Gaussian Splatting with quadrotor flight dynamics, and then, train robust navigation policies using Liquid neural networks. In this way, we obtain a full-stack imitation learning protocol that combines advances in 3D Gaussian splatting radiance field rendering, crafty programming of expert demonstration training data, and the task understanding capabilities of Liquid networks. Through a series of quantitative flight tests, we demonstrate the robust transfer of navigation skills learned in a single simulation scene directly to the real world. We further show the ability to maintain performance beyond the training environment under drastic distribution and physical environment changes. Our learned Liquid policies, trained on single target manoeuvres curated from a photorealistic simulated indoor flight only, generalize to multi-step hikes onboard a real hardware platform outdoors.
☆ Assessing Good, Bad and Ugly Arguments Generated by ChatGPT: a New Dataset, its Methodology and Associated Tasks
The recent success of Large Language Models (LLMs) has sparked concerns about their potential to spread misinformation. As a result, there is a pressing need for tools to identify ``fake arguments'' generated by such models. To create these tools, examples of texts generated by LLMs are needed. This paper introduces a methodology to obtain good, bad and ugly arguments from argumentative essays produced by ChatGPT, OpenAI's LLM. We then describe a novel dataset containing a set of diverse arguments, ArGPT. We assess the effectiveness of our dataset and establish baselines for several argumentation-related tasks. Finally, we show that the artificially generated data relates well to human argumentation and thus is useful as a tool to train and test systems for the defined tasks.
☆ A Wavelet Guided Attention Module for Skin Cancer Classification with Gradient-based Feature Fusion
Skin cancer is a highly dangerous type of cancer that requires an accurate diagnosis from experienced physicians. To help physicians diagnose skin cancer more efficiently, a computer-aided diagnosis (CAD) system can be very helpful. In this paper, we propose a novel model, which uses a novel attention mechanism to pinpoint the differences in features across the spatial dimensions and symmetry of the lesion, thereby focusing on the dissimilarities of various classes based on symmetry, uniformity in texture and color, etc. Additionally, to take into account the variations in the boundaries of the lesions for different classes, we employ a gradient-based fusion of wavelet and soft attention-aided features to extract boundary information of skin lesions. We have tested our model on the multi-class and highly class-imbalanced dataset, called HAM10000, and achieved promising results, with a 91.17\% F1-score and 90.75\% accuracy. The code is made available at: https://github.com/AyushRoy2001/WAGF-Fusion.
☆ Speech Emotion Recognition under Resource Constraints with Data Distillation
Speech emotion recognition (SER) plays a crucial role in human-computer interaction. The emergence of edge devices in the Internet of Things (IoT) presents challenges in constructing intricate deep learning models due to constraints in memory and computational resources. Moreover, emotional speech data often contains private information, raising concerns about privacy leakage during the deployment of SER models. To address these challenges, we propose a data distillation framework to facilitate efficient development of SER models in IoT applications using a synthesised, smaller, and distilled dataset. Our experiments demonstrate that the distilled dataset can be effectively utilised to train SER models with fixed initialisation, achieving performances comparable to those developed using the original full emotional speech dataset.
☆ FA-Net: A Fuzzy Attention-aided Deep Neural Network for Pneumonia Detection in Chest X-Rays
Pneumonia is a respiratory infection caused by bacteria, fungi, or viruses. It affects many people, particularly those in developing or underdeveloped nations with high pollution levels, unhygienic living conditions, overcrowding, and insufficient medical infrastructure. Pneumonia can cause pleural effusion, where fluids fill the lungs, leading to respiratory difficulty. Early diagnosis is crucial to ensure effective treatment and increase survival rates. Chest X-ray imaging is the most commonly used method for diagnosing pneumonia. However, visual examination of chest X-rays can be difficult and subjective. In this study, we have developed a computer-aided diagnosis system for automatic pneumonia detection using chest X-ray images. We have used DenseNet-121 and ResNet50 as the backbone for the binary class (pneumonia and normal) and multi-class (bacterial pneumonia, viral pneumonia, and normal) classification tasks, respectively. We have also implemented a channel-specific spatial attention mechanism, called Fuzzy Channel Selective Spatial Attention Module (FCSSAM), to highlight the specific spatial regions of relevant channels while removing the irrelevant channels of the extracted features by the backbone. We evaluated the proposed approach on a publicly available chest X-ray dataset, using binary and multi-class classification setups. Our proposed method achieves accuracy rates of 97.15\% and 79.79\% for the binary and multi-class classification setups, respectively. The results of our proposed method are superior to state-of-the-art (SOTA) methods. The code of the proposed model will be available at: https://github.com/AyushRoy2001/FA-Net.
☆ A Dual Attention-aided DenseNet-121 for Classification of Glaucoma from Fundus Images
Deep learning and computer vision methods are nowadays predominantly used in the field of ophthalmology. In this paper, we present an attention-aided DenseNet-121 for classifying normal and glaucomatous eyes from fundus images. It involves the convolutional block attention module to highlight relevant spatial and channel features extracted by DenseNet-121. The channel recalibration module further enriches the features by utilizing edge information along with the statistical features of the spatial dimension. For the experiments, two standard datasets, namely RIM-ONE and ACRIMA, have been used. Our method has shown superior results than state-of-the-art models. An ablation study has also been conducted to show the effectiveness of each of the components. The code of the proposed work is available at: https://github.com/Soham2004GitHub/DADGC.
☆ Investigating the impact of 2D gesture representation on co-speech gesture generation
Co-speech gestures play a crucial role in the interactions between humans and embodied conversational agents (ECA). Recent deep learning methods enable the generation of realistic, natural co-speech gestures synchronized with speech, but such approaches require large amounts of training data. "In-the-wild" datasets, which compile videos from sources such as YouTube through human pose detection models, offer a solution by providing 2D skeleton sequences that are paired with speech. Concurrently, innovative lifting models have emerged, capable of transforming these 2D pose sequences into their 3D counterparts, leading to large and diverse datasets of 3D gestures. However, the derived 3D pose estimation is essentially a pseudo-ground truth, with the actual ground truth being the 2D motion data. This distinction raises questions about the impact of gesture representation dimensionality on the quality of generated motions, a topic that, to our knowledge, remains largely unexplored. In this work, we evaluate the impact of the dimensionality of the training data, 2D or 3D joint coordinates, on the performance of a multimodal speech-to-gesture deep generative model. We use a lifting model to convert 2D-generated sequences of body pose to 3D. Then, we compare the sequence of gestures generated directly in 3D to the gestures generated in 2D and lifted to 3D as post-processing.
comment: 8 pages. Paper accepted at WACAI 2024
☆ How Intermodal Interaction Affects the Performance of Deep Multimodal Fusion for Mixed-Type Time Series
Mixed-type time series (MTTS) is a bimodal data type that is common in many domains, such as healthcare, finance, environmental monitoring, and social media. It consists of regularly sampled continuous time series and irregularly sampled categorical event sequences. The integration of both modalities through multimodal fusion is a promising approach for processing MTTS. However, the question of how to effectively fuse both modalities remains open. In this paper, we present a comprehensive evaluation of several deep multimodal fusion approaches for MTTS forecasting. Our comparison includes three fusion types (early, intermediate, and late) and five fusion methods (concatenation, weighted mean, weighted mean with correlation, gating, and feature sharing). We evaluate these fusion approaches on three distinct datasets, one of which was generated using a novel framework. This framework allows for the control of key data properties, such as the strength and direction of intermodal interactions, modality imbalance, and the degree of randomness in each modality, providing a more controlled environment for testing fusion approaches. Our findings show that the performance of different fusion approaches can be substantially influenced by the direction and strength of intermodal interactions. The study reveals that early and intermediate fusion approaches excel at capturing fine-grained and coarse-grained cross-modal features, respectively. These findings underscore the crucial role of intermodal interactions in determining the most effective fusion strategy for MTTS forecasting.
☆ KnobTree: Intelligent Database Parameter Configuration via Explainable Reinforcement Learning
Databases are fundamental to contemporary information systems, yet traditional rule-based configuration methods struggle to manage the complexity of real-world applications with hundreds of tunable parameters. Deep reinforcement learning (DRL), which combines perception and decision-making, presents a potential solution for intelligent database configuration tuning. However, due to black-box property of RL-based method, the generated database tuning strategies still face the urgent problem of lack explainability. Besides, the redundant parameters in large scale database always make the strategy learning become unstable. This paper proposes KnobTree, an interpertable framework designed for the optimization of database parameter configuration. In this framework, an interpertable database tuning algorithm based on RL-based differentatial tree is proposed, which building a transparent tree-based model to generate explainable database tuning strategies. To address the problem of large-scale parameters, We also introduce a explainable method for parameter importance assessment, by utilizing Shapley Values to identify parameters that have significant impacts on database performance. Experiments conducted on MySQL and Gbase8s databases have verified exceptional transparency and interpretability of the KnobTree model. The good property makes generated strategies can offer practical guidance to algorithm designers and database administrators. Moreover, our approach also slightly outperforms the existing RL-based tuning algorithms in aspects such as throughput, latency, and processing time.
☆ Tri-VQA: Triangular Reasoning Medical Visual Question Answering for Multi-Attribute Analysis
The intersection of medical Visual Question Answering (Med-VQA) is a challenging research topic with advantages including patient engagement and clinical expert involvement for second opinions. However, existing Med-VQA methods based on joint embedding fail to explain whether their provided results are based on correct reasoning or coincidental answers, which undermines the credibility of VQA answers. In this paper, we investigate the construction of a more cohesive and stable Med-VQA structure. Motivated by causal effect, we propose a novel Triangular Reasoning VQA (Tri-VQA) framework, which constructs reverse causal questions from the perspective of "Why this answer?" to elucidate the source of the answer and stimulate more reasonable forward reasoning processes. We evaluate our method on the Endoscopic Ultrasound (EUS) multi-attribute annotated dataset from five centers, and test it on medical VQA datasets. Experimental results demonstrate the superiority of our approach over existing methods. Our codes and pre-trained models are available at https://anonymous.4open.science/r/Tri_VQA.
☆ From Overfitting to Robustness: Quantity, Quality, and Variety Oriented Negative Sample Selection in Graph Contrastive Learning
Graph contrastive learning (GCL) aims to contrast positive-negative counterparts to learn the node embeddings, whereas graph data augmentation methods are employed to generate these positive-negative samples. The variation, quantity, and quality of negative samples compared to positive samples play crucial roles in learning meaningful embeddings for node classification downstream tasks. Less variation, excessive quantity, and low-quality negative samples cause the model to be overfitted for particular nodes, resulting in less robust models. To solve the overfitting problem in the GCL paradigm, this study proposes a novel Cumulative Sample Selection (CSS) algorithm by comprehensively considering negative samples' quality, variations, and quantity. Initially, three negative sample pools are constructed: easy, medium, and hard negative samples, which contain 25%, 50%, and 25% of the total available negative samples, respectively. Then, 10% negative samples are selected from each of these three negative sample pools for training the model. After that, a decision agent module evaluates model training results and decides whether to explore more negative samples from three negative sample pools by increasing the ratio or keep exploiting the current sampling ratio. The proposed algorithm is integrated into a proposed graph contrastive learning framework named NegAmplify. NegAmplify is compared with the SOTA methods on nine graph node classification datasets, with seven achieving better node classification accuracy with up to 2.86% improvement.
☆ Behaviour Distillation ICLR 2024
Dataset distillation aims to condense large datasets into a small number of synthetic examples that can be used as drop-in replacements when training new models. It has applications to interpretability, neural architecture search, privacy, and continual learning. Despite strong successes in supervised domains, such methods have not yet been extended to reinforcement learning, where the lack of a fixed dataset renders most distillation methods unusable. Filling the gap, we formalize behaviour distillation, a setting that aims to discover and then condense the information required for training an expert policy into a synthetic dataset of state-action pairs, without access to expert data. We then introduce Hallucinating Datasets with Evolution Strategies (HaDES), a method for behaviour distillation that can discover datasets of just four state-action pairs which, under supervised learning, train agents to competitive performance levels in continuous control tasks. We show that these datasets generalize out of distribution to training policies with a wide range of architectures and hyperparameters. We also demonstrate application to a downstream task, namely training multi-task agents in a zero-shot fashion. Beyond behaviour distillation, HaDES provides significant improvements in neuroevolution for RL over previous approaches and achieves SoTA results on one standard supervised dataset distillation task. Finally, we show that visualizing the synthetic datasets can provide human-interpretable task insights.
comment: Published as a conference paper at ICLR 2024
☆ Online detection and infographic explanation of spam reviews with data drift adaptation
Spam reviews are a pervasive problem on online platforms due to its significant impact on reputation. However, research into spam detection in data streams is scarce. Another concern lies in their need for transparency. Consequently, this paper addresses those problems by proposing an online solution for identifying and explaining spam reviews, incorporating data drift adaptation. It integrates (i) incremental profiling, (ii) data drift detection & adaptation, and (iii) identification of spam reviews employing Machine Learning. The explainable mechanism displays a visual and textual prediction explanation in a dashboard. The best results obtained reached up to 87 % spam F-measure.
☆ GiusBERTo: A Legal Language Model for Personal Data De-identification in Italian Court of Auditors Decisions
Recent advances in Natural Language Processing have demonstrated the effectiveness of pretrained language models like BERT for a variety of downstream tasks. We present GiusBERTo, the first BERT-based model specialized for anonymizing personal data in Italian legal documents. GiusBERTo is trained on a large dataset of Court of Auditors decisions to recognize entities to anonymize, including names, dates, locations, while retaining contextual relevance. We evaluate GiusBERTo on a held-out test set and achieve 97% token-level accuracy. GiusBERTo provides the Italian legal community with an accurate and tailored BERT model for de-identification, balancing privacy and data protection.
comment: 14 pages, 4 figures, 6 Tables
☆ Evolution of Rewards for Food and Motor Action by Simulating Birth and Death
The reward system is one of the fundamental drivers of animal behaviors and is critical for survival and reproduction. Despite its importance, the problem of how the reward system has evolved is underexplored. In this paper, we try to replicate the evolution of biologically plausible reward functions and investigate how environmental conditions affect evolved rewards' shape. For this purpose, we developed a population-based decentralized evolutionary simulation framework, where agents maintain their energy level to live longer and produce more children. Each agent inherits its reward function from its parent subject to mutation and learns to get rewards via reinforcement learning throughout its lifetime. Our results show that biologically reasonable positive rewards for food acquisition and negative rewards for motor action can evolve from randomly initialized ones. However, we also find that the rewards for motor action diverge into two modes: largely positive and slightly negative. The emergence of positive motor action rewards is surprising because it can make agents too active and inefficient in foraging. In environments with poor and poisonous foods, the evolution of rewards for less important foods tends to be unstable, while rewards for normal foods are still stable. These results demonstrate the usefulness of our simulation environment and energy-dependent birth and death model for further studies of the origin of reward systems.
☆ GraLMatch: Matching Groups of Entities with Graphs and Language Models EDBT 2025
In this paper, we present an end-to-end multi-source Entity Matching problem, which we call entity group matching, where the goal is to assign to the same group, records originating from multiple data sources but representing the same real-world entity. We focus on the effects of transitively matched records, i.e. the records connected by paths in the graph G = (V,E) whose nodes and edges represent the records and whether they are a match or not. We present a real-world instance of this problem, where the challenge is to match records of companies and financial securities originating from different data providers. We also introduce two new multi-source benchmark datasets that present similar matching challenges as real-world records. A distinctive characteristic of these records is that they are regularly updated following real-world events, but updates are not applied uniformly across data sources. This phenomenon makes the matching of certain groups of records only possible through the use of transitive information. In our experiments, we illustrate how considering transitively matched records is challenging since a limited amount of false positive pairwise match predictions can throw off the group assignment of large quantities of records. Thus, we propose GraLMatch, a method that can partially detect and remove false positive pairwise predictions through graph-based properties. Finally, we showcase how fine-tuning a Transformer-based model (DistilBERT) on a reduced number of labeled samples yields a better final entity group matching than training on more samples and/or incorporating fine-tuning optimizations, illustrating how precision becomes the deciding factor in the entity group matching of large volumes of records.
comment: 12 pages, 4 figures, accepted as research paper at EDBT 2025
☆ Fair, Manipulation-Robust, and Transparent Sortition
Sortition, the random selection of political representatives, is increasingly being used around the world to choose participants of deliberative processes like Citizens' Assemblies. Motivated by sortition's practical importance, there has been a recent flurry of research on sortition algorithms, whose task it is to select a panel from among a pool of volunteers. This panel must satisfy quotas enforcing representation of key population subgroups. Past work has contributed an algorithmic approach for fulfilling this task while ensuring that volunteers' chances of selection are maximally equal, as measured by any convex equality objective. The question, then, is: which equality objective is the right one? Past work has mainly studied the objectives Minimax and Leximin, which respectively minimize the maximum and maximize the minimum chance of selection given to any volunteer. Recent work showed that both of these objectives have key weaknesses: Minimax is highly robust to manipulation but is arbitrarily unfair; oppositely, Leximin is highly fair but arbitrarily manipulable. In light of this gap, we propose a new equality objective, Goldilocks, that aims to achieve these ideals simultaneously by ensuring that no volunteer receives too little or too much chance of selection. We theoretically bound the extent to which Goldilocks achieves these ideals, finding that in an important sense, Goldilocks recovers among the best available solutions in a given instance. We then extend our bounds to the case where the output of Goldilocks is transformed to achieve a third goal, Transparency. Our empirical analysis of Goldilocks in real data is even more promising: we find that this objective achieves nearly instance-optimal minimum and maximum selection probabilities simultaneously in most real instances -- an outcome not even guaranteed to be possible for any algorithm.
☆ RouteFinder: Towards Foundation Models for Vehicle Routing Problems
Vehicle Routing Problems (VRPs) are optimization problems with significant real-world implications in logistics, transportation, and supply chain management. Despite the recent progress made in learning to solve individual VRP variants, there is a lack of a unified approach that can effectively tackle a wide range of tasks, which is crucial for real-world impact. This paper introduces RouteFinder, a framework for developing foundation models for VRPs. Our key idea is that a foundation model for VRPs should be able to model variants by treating each variant as a subset of a larger VRP problem, equipped with different attributes. We introduce a parallelized environment that can handle any combination of attributes at the same time in a batched manner, and an efficient sampling procedure to train on a mix of problems at each optimization step that can greatly improve convergence robustness. We also introduce novel Global Feature Embeddings that project instance-wise attributes efficiently onto the latent space and help the model understand different VRP variants. Finally, we introduce Efficient Adapter Layers, a simple yet effective technique to finetune pre-trained RouteFinder models to solve novel variants with previously unseen attributes outside of the original feature space. We validate our approach through extensive experiments on 24 VRP variants, demonstrating competitive results over recent multi-task learning models. We make our code openly available at https://github.com/ai4co/routefinder.
☆ Unveiling the Impact of Multi-Modal Interactions on User Engagement: A Comprehensive Evaluation in AI-driven Conversations
Large Language Models (LLMs) have significantly advanced user-bot interactions, enabling more complex and coherent dialogues. However, the prevalent text-only modality might not fully exploit the potential for effective user engagement. This paper explores the impact of multi-modal interactions, which incorporate images and audio alongside text, on user engagement in chatbot conversations. We conduct a comprehensive analysis using a diverse set of chatbots and real-user interaction data, employing metrics such as retention rate and conversation length to evaluate user engagement. Our findings reveal a significant enhancement in user engagement with multi-modal interactions compared to text-only dialogues. Notably, the incorporation of a third modality significantly amplifies engagement beyond the benefits observed with just two modalities. These results suggest that multi-modal interactions optimize cognitive processing and facilitate richer information comprehension. This study underscores the importance of multi-modality in chatbot design, offering valuable insights for creating more engaging and immersive AI communication experiences and informing the broader AI community about the benefits of multi-modal interactions in enhancing user engagement.
☆ Learning Variable Compliance Control From a Few Demonstrations for Bimanual Robot with Haptic Feedback Teleoperation System
Automating dexterous, contact-rich manipulation tasks using rigid robots is a significant challenge in robotics. Rigid robots, defined by their actuation through position commands, face issues of excessive contact forces due to their inability to adapt to contact with the environment, potentially causing damage. While compliance control schemes have been introduced to mitigate these issues by controlling forces via external sensors, they are hampered by the need for fine-tuning task-specific controller parameters. Learning from Demonstrations (LfD) offers an intuitive alternative, allowing robots to learn manipulations through observed actions. In this work, we introduce a novel system to enhance the teaching of dexterous, contact-rich manipulations to rigid robots. Our system is twofold: firstly, it incorporates a teleoperation interface utilizing Virtual Reality (VR) controllers, designed to provide an intuitive and cost-effective method for task demonstration with haptic feedback. Secondly, we present Comp-ACT (Compliance Control via Action Chunking with Transformers), a method that leverages the demonstrations to learn variable compliance control from a few demonstrations. Our methods have been validated across various complex contact-rich manipulation tasks using single-arm and bimanual robot setups in simulated and real-world environments, demonstrating the effectiveness of our system in teaching robots dexterous manipulations with enhanced adaptability and safety.
☆ Introducing the Biomechanics-Function Relationship in Glaucoma: Improved Visual Field Loss Predictions from intraocular pressure-induced Neural Tissue Strains
Objective. (1) To assess whether neural tissue structure and biomechanics could predict functional loss in glaucoma; (2) To evaluate the importance of biomechanics in making such predictions. Design, Setting and Participants. We recruited 238 glaucoma subjects. For one eye of each subject, we imaged the optic nerve head (ONH) using spectral-domain OCT under the following conditions: (1) primary gaze and (2) primary gaze with acute IOP elevation. Main Outcomes: We utilized automatic segmentation of optic nerve head (ONH) tissues and digital volume correlation (DVC) analysis to compute intraocular pressure (IOP)-induced neural tissue strains. A robust geometric deep learning approach, known as Point-Net, was employed to predict the full Humphrey 24-2 pattern standard deviation (PSD) maps from ONH structural and biomechanical information. For each point in each PSD map, we predicted whether it exhibited no defect or a PSD value of less than 5%. Predictive performance was evaluated using 5-fold cross-validation and the F1-score. We compared the model's performance with and without the inclusion of IOP-induced strains to assess the impact of biomechanics on prediction accuracy. Results: Integrating biomechanical (IOP-induced neural tissue strains) and structural (tissue morphology and neural tissues thickness) information yielded a significantly better predictive model (F1-score: 0.76+-0.02) across validation subjects, as opposed to relying only on structural information, which resulted in a significantly lower F1-score of 0.71+-0.02 (p < 0.05). Conclusion: Our study has shown that the integration of biomechanical data can significantly improve the accuracy of visual field loss predictions. This highlights the importance of the biomechanics-function relationship in glaucoma, and suggests that biomechanics may serve as a crucial indicator for the development and progression of glaucoma.
comment: 19 pages, 2 figures
☆ Do Large Language Models Exhibit Cognitive Dissonance? Studying the Difference Between Revealed Beliefs and Stated Answers
Prompting and Multiple Choices Questions (MCQ) have become the preferred approach to assess the capabilities of Large Language Models (LLMs), due to their ease of manipulation and evaluation. Such experimental appraisals have pointed toward the LLMs' apparent ability to perform causal reasoning or to grasp uncertainty. In this paper, we investigate whether these abilities are measurable outside of tailored prompting and MCQ by reformulating these issues as direct text completion - the foundation of LLMs. To achieve this goal, we define scenarios with multiple possible outcomes and we compare the prediction made by the LLM through prompting (their Stated Answer) to the probability distributions they compute over these outcomes during next token prediction (their Revealed Belief). Our findings suggest that the Revealed Belief of LLMs significantly differs from their Stated Answer and hint at multiple biases and misrepresentations that their beliefs may yield in many scenarios and outcomes. As text completion is at the core of LLMs, these results suggest that common evaluation methods may only provide a partial picture and that more research is needed to assess the extent and nature of their capabilities.
☆ Human-AI collectives produce the most accurate differential diagnoses
Artificial intelligence systems, particularly large language models (LLMs), are increasingly being employed in high-stakes decisions that impact both individuals and society at large, often without adequate safeguards to ensure safety, quality, and equity. Yet LLMs hallucinate, lack common sense, and are biased - shortcomings that may reflect LLMs' inherent limitations and thus may not be remedied by more sophisticated architectures, more data, or more human feedback. Relying solely on LLMs for complex, high-stakes decisions is therefore problematic. Here we present a hybrid collective intelligence system that mitigates these risks by leveraging the complementary strengths of human experience and the vast information processed by LLMs. We apply our method to open-ended medical diagnostics, combining 40,762 differential diagnoses made by physicians with the diagnoses of five state-of-the art LLMs across 2,133 medical cases. We show that hybrid collectives of physicians and LLMs outperform both single physicians and physician collectives, as well as single LLMs and LLM ensembles. This result holds across a range of medical specialties and professional experience, and can be attributed to humans' and LLMs' complementary contributions that lead to different kinds of errors. Our approach highlights the potential for collective human and machine intelligence to improve accuracy in complex, open-ended domains like medical diagnostics.
☆ Trustworthy Enhanced Multi-view Multi-modal Alzheimer's Disease Prediction with Brain-wide Imaging Transcriptomics Data
Brain transcriptomics provides insights into the molecular mechanisms by which the brain coordinates its functions and processes. However, existing multimodal methods for predicting Alzheimer's disease (AD) primarily rely on imaging and sometimes genetic data, often neglecting the transcriptomic basis of brain. Furthermore, while striving to integrate complementary information between modalities, most studies overlook the informativeness disparities between modalities. Here, we propose TMM, a trusted multiview multimodal graph attention framework for AD diagnosis, using extensive brain-wide transcriptomics and imaging data. First, we construct view-specific brain regional co-function networks (RRIs) from transcriptomics and multimodal radiomics data to incorporate interaction information from both biomolecular and imaging perspectives. Next, we apply graph attention (GAT) processing to each RRI network to produce graph embeddings and employ cross-modal attention to fuse transcriptomics-derived embedding with each imagingderived embedding. Finally, a novel true-false-harmonized class probability (TFCP) strategy is designed to assess and adaptively adjust the prediction confidence of each modality for AD diagnosis. We evaluate TMM using the AHBA database with brain-wide transcriptomics data and the ADNI database with three imaging modalities (AV45-PET, FDG-PET, and VBM-MRI). The results demonstrate the superiority of our method in identifying AD, EMCI, and LMCI compared to state-of-the-arts. Code and data are available at https://github.com/Yaolab-fantastic/TMM.
☆ Domain Adaptation of Llama3-70B-Instruct through Continual Pre-Training and Model Merging: A Comprehensive Evaluation
We conducted extensive experiments on domain adaptation of the Meta-Llama-3-70B-Instruct model on SEC data, exploring its performance on both general and domain-specific benchmarks. Our focus included continual pre-training (CPT) and model merging, aiming to enhance the model's domain-specific capabilities while mitigating catastrophic forgetting. Through this study, we evaluated the impact of integrating financial regulatory data into a robust language model and examined the effectiveness of our model merging techniques in preserving and improving the model's instructive abilities. The model is accessible at hugging face: https://huggingface.co/arcee-ai/Llama-3-SEC-Base, arcee-ai/Llama-3-SEC-Base. This is an intermediate checkpoint of our final model, which has seen 20B tokens so far. The full model is still in the process of training. This is a preprint technical report with thorough evaluations to understand the entire process.
comment: 8 pages, 6 figures
☆ Uni-Mol2: Exploring Molecular Pretraining Model at Scale
In recent years, pretraining models have made significant advancements in the fields of natural language processing (NLP), computer vision (CV), and life sciences. The significant advancements in NLP and CV are predominantly driven by the expansion of model parameters and data size, a phenomenon now recognized as the scaling laws. However, research exploring scaling law in molecular pretraining models remains unexplored. In this work, we present Uni-Mol2 , an innovative molecular pretraining model that leverages a two-track transformer to effectively integrate features at the atomic level, graph level, and geometry structure level. Along with this, we systematically investigate the scaling law within molecular pretraining models, characterizing the power-law correlations between validation loss and model size, dataset size, and computational resources. Consequently, we successfully scale Uni-Mol2 to 1.1 billion parameters through pretraining on 800 million conformations, making it the largest molecular pretraining model to date. Extensive experiments show consistent improvement in the downstream tasks as the model size grows. The Uni-Mol2 with 1.1B parameters also outperforms existing methods, achieving an average 27% improvement on the QM9 and 14% on COMPAS-1D dataset.
☆ Deep Imbalanced Regression to Estimate Vascular Age from PPG Data: a Novel Digital Biomarker for Cardiovascular Health
Photoplethysmography (PPG) is emerging as a crucial tool for monitoring human hemodynamics, with recent studies highlighting its potential in assessing vascular aging through deep learning. However, real-world age distributions are often imbalanced, posing significant challenges for deep learning models. In this paper, we introduce a novel, simple, and effective loss function named the Dist Loss to address deep imbalanced regression tasks. We trained a one-dimensional convolutional neural network (Net1D) incorporating the Dist Loss on the extensive UK Biobank dataset (n=502,389) to estimate vascular age from PPG signals and validate its efficacy in characterizing cardiovascular health. The model's performance was validated on a 40% held-out test set, achieving state-of-the-art results, especially in regions with small sample sizes. Furthermore, we divided the population into three subgroups based on the difference between predicted vascular age and chronological age: less than -10 years, between -10 and 10 years, and greater than 10 years. We analyzed the relationship between predicted vascular age and several cardiovascular events over a follow-up period of up to 10 years, including death, coronary heart disease, and heart failure. Our results indicate that the predicted vascular age has significant potential to reflect an individual's cardiovascular health status. Our code will be available at https://github.com/Ngk03/AI-vascular-age.
☆ An Idiosyncrasy of Time-discretization in Reinforcement Learning
Many reinforcement learning algorithms are built on an assumption that an agent interacts with an environment over fixed-duration, discrete time steps. However, physical systems are continuous in time, requiring a choice of time-discretization granularity when digitally controlling them. Furthermore, such systems do not wait for decisions to be made before advancing the environment state, necessitating the study of how the choice of discretization may affect a reinforcement learning algorithm. In this work, we consider the relationship between the definitions of the continuous-time and discrete-time returns. Specifically, we acknowledge an idiosyncrasy with naively applying a discrete-time algorithm to a discretized continuous-time environment, and note how a simple modification can better align the return definitions. This observation is of practical consideration when dealing with environments where time-discretization granularity is a choice, or situations where such granularity is inherently stochastic.
comment: RLC 2024
☆ CEASEFIRE: An AI-powered system for combatting illicit firearms trafficking
Modern technologies have led illicit firearms trafficking to partially merge with cybercrime, while simultaneously permitting its off-line aspects to become more sophisticated. Law enforcement officers face difficult challenges that require hi-tech solutions. This article presents a real-world system, powered by advanced Artificial Intelligence, for facilitating them in their everyday work.
☆ Towards Retrieval Augmented Generation over Large Video Libraries
Video content creators need efficient tools to repurpose content, a task that often requires complex manual or automated searches. Crafting a new video from large video libraries remains a challenge. In this paper we introduce the task of Video Library Question Answering (VLQA) through an interoperable architecture that applies Retrieval Augmented Generation (RAG) to video libraries. We propose a system that uses large language models (LLMs) to generate search queries, retrieving relevant video moments indexed by speech and visual metadata. An answer generation module then integrates user queries with this metadata to produce responses with specific video timestamps. This approach shows promise in multimedia content retrieval, and AI-assisted video content creation.
comment: Accepted in IEEE HSI 2024
☆ Autonomous Agents for Collaborative Task under Information Asymmetry
Large Language Model Multi-Agent Systems (LLM-MAS) have achieved great progress in solving complex tasks. It performs communication among agents within the system to collaboratively solve tasks, under the premise of shared information. However, when agents' communication is leveraged to enhance human cooperation, a new challenge arises due to information asymmetry, since each agent can only access the information of its human user. Previous MAS struggle to complete tasks under this condition. To address this, we propose a new MAS paradigm termed iAgents, which denotes Informative Multi-Agent Systems. In iAgents, the human social network is mirrored in the agent network, where agents proactively exchange human information necessary for task resolution, thereby overcoming information asymmetry. iAgents employs a novel agent reasoning mechanism, InfoNav, to navigate agents' communication towards effective information exchange. Together with InfoNav, iAgents organizes human information in a mixed memory to provide agents with accurate and comprehensive information for exchange. Additionally, we introduce InformativeBench, the first benchmark tailored for evaluating LLM agents' task-solving ability under information asymmetry. Experimental results show that iAgents can collaborate within a social network of 140 individuals and 588 relationships, autonomously communicate over 30 turns, and retrieve information from nearly 70,000 messages to complete tasks within 3 minutes.
comment: 16 pages, 8 figures, 5 tables, Work in progress
☆ Extraction of 3D trajectories of mandibular condyles from 2D real-time MRI
Computing the trajectories of mandibular condyles directly from MRI could provide a comprehensive examination, allowing for the extraction of both anatomical and kinematic details. This study aimed to investigate the feasibility of extracting 3D condylar trajectories from 2D real-time MRI and to assess their precision.Twenty healthy subjects underwent real-time MRI while opening and closing their jaws. One axial and two sagittal slices were segmented using a U-Net-based algorithm. The centers of mass of the resulting masks were projected onto the coordinate system based on anatomical markers and temporally adjusted using a common projection. The quality of the computed trajectories was evaluated using metrics designed to estimate movement reproducibility, head motion, and slice placement symmetry.The segmentation of the axial slices demonstrated good-to-excellent quality; however, the segmentation of the sagittal slices required some fine-tuning. The movement reproducibility was acceptable for most cases; nevertheless, head motion displaced the trajectories by 1 mm on average. The difference in the superior-inferior coordinate of the condyles in the closed jaw position was 1.7 mm on average.Despite limitations in precision, real-time MRI enables the extraction of condylar trajectories with sufficient accuracy for evaluating clinically relevant parameters such as condyle displacement, trajectories aspect, and symmetry.
☆ LLM2FEA: Discover Novel Designs with Generative Evolutionary Multitasking
The rapid research and development of generative artificial intelligence has enabled the generation of high-quality images, text, and 3D models from text prompts. This advancement impels an inquiry into whether these models can be leveraged to create digital artifacts for both creative and engineering applications. Drawing on innovative designs from other domains may be one answer to this question, much like the historical practice of ``bionics", where humans have sought inspiration from nature's exemplary designs. This raises the intriguing possibility of using generative models to simultaneously tackle design tasks across multiple domains, facilitating cross-domain learning and resulting in a series of innovative design solutions. In this paper, we propose LLM2FEA as the first attempt to discover novel designs in generative models by transferring knowledge across multiple domains. By utilizing a multi-factorial evolutionary algorithm (MFEA) to drive a large language model, LLM2FEA integrates knowledge from various fields to generate prompts that guide the generative model in discovering novel and practical objects. Experimental results in the context of 3D aerodynamic design verify the discovery capabilities of the proposed LLM2FEA. The designs generated by LLM2FEA not only satisfy practicality requirements to a certain degree but also feature novel and aesthetically pleasing shapes, demonstrating the potential applications of LLM2FEA in discovery tasks.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
☆ Demonstrating the Efficacy of Kolmogorov-Arnold Networks in Vision Tasks
In the realm of deep learning, the Kolmogorov-Arnold Network (KAN) has emerged as a potential alternative to multilayer projections (MLPs). However, its applicability to vision tasks has not been extensively validated. In our study, we demonstrated the effectiveness of KAN for vision tasks through multiple trials on the MNIST, CIFAR10, and CIFAR100 datasets, using a training batch size of 32. Our results showed that while KAN outperformed the original MLP-Mixer on CIFAR10 and CIFAR100, it performed slightly worse than the state-of-the-art ResNet-18. These findings suggest that KAN holds significant promise for vision tasks, and further modifications could enhance its performance in future evaluations.Our contributions are threefold: first, we showcase the efficiency of KAN-based algorithms for visual tasks; second, we provide extensive empirical assessments across various vision benchmarks, comparing KAN's performance with MLP-Mixer, CNNs, and Vision Transformers (ViT); and third, we pioneer the use of natural KAN layers in visual tasks, addressing a gap in previous research. This paper lays the foundation for future studies on KANs, highlighting their potential as a reliable alternative for image classification tasks.
☆ MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression
Sparse attention can effectively mitigate the significant memory and throughput demands of Large Language Models (LLMs) in long contexts. Existing methods typically employ a uniform sparse attention mask, applying the same sparse pattern across different attention heads and input lengths. However, this uniform approach fails to capture the diverse attention patterns inherent in LLMs, ignoring their distinct accuracy-latency trade-offs. To address this challenge, we propose the Mixture of Attention (MoA), which automatically tailors distinct sparse attention configurations to different heads and layers. MoA constructs and navigates a search space of various attention patterns and their scaling rules relative to input sequence lengths. It profiles the model, evaluates potential configurations, and pinpoints the optimal sparse attention compression plan. MoA adapts to varying input sizes, revealing that some attention heads expand their focus to accommodate longer sequences, while other heads consistently concentrate on fixed-length local contexts. Experiments show that MoA increases the effective context length by $3.9\times$ with the same average attention span, boosting retrieval accuracy by $1.5-7.1\times$ over the uniform-attention baseline across Vicuna-7B, Vicuna-13B, and Llama3-8B models. Moreover, MoA narrows the capability gaps between sparse and dense models, reducing the maximum relative performance drop from $9\%-36\%$ to within $5\%$ across two long-context understanding benchmarks. MoA achieves a $1.2-1.4\times$ GPU memory reduction and boosts decode throughput by $5.5-6.7 \times$ for 7B and 13B dense models on a single GPU, with minimal impact on performance.
comment: 10 pages
☆ GIEBench: Towards Holistic Evaluation of Group Indentity-based Empathy for Large Language Models
As large language models (LLMs) continue to develop and gain widespread application, the ability of LLMs to exhibit empathy towards diverse group identities and understand their perspectives is increasingly recognized as critical. Most existing benchmarks for empathy evaluation of LLMs focus primarily on universal human emotions, such as sadness and pain, often overlooking the context of individuals' group identities. To address this gap, we introduce GIEBench, a comprehensive benchmark that includes 11 identity dimensions, covering 97 group identities with a total of 999 single-choice questions related to specific group identities. GIEBench is designed to evaluate the empathy of LLMs when presented with specific group identities such as gender, age, occupation, and race, emphasizing their ability to respond from the standpoint of the identified group. This supports the ongoing development of empathetic LLM applications tailored to users with different identities. Our evaluation of 23 LLMs revealed that while these LLMs understand different identity standpoints, they fail to consistently exhibit equal empathy across these identities without explicit instructions to adopt those perspectives. This highlights the need for improved alignment of LLMs with diverse values to better accommodate the multifaceted nature of human identities. Our datasets are available at https://github.com/GIEBench/GIEBench.
☆ Talking the Talk Does Not Entail Walking the Walk: On the Limits of Large Language Models in Lexical Entailment Recognition
Verbs form the backbone of language, providing the structure and meaning to sentences. Yet, their intricate semantic nuances pose a longstanding challenge. Understanding verb relations through the concept of lexical entailment is crucial for comprehending sentence meanings and grasping verb dynamics. This work investigates the capabilities of eight Large Language Models in recognizing lexical entailment relations among verbs through differently devised prompting strategies and zero-/few-shot settings over verb pairs from two lexical databases, namely WordNet and HyperLex. Our findings unveil that the models can tackle the lexical entailment recognition task with moderately good performance, although at varying degree of effectiveness and under different conditions. Also, utilizing few-shot prompting can enhance the models' performance. However, perfectly solving the task arises as an unmet challenge for all examined LLMs, which raises an emergence for further research developments on this topic.
☆ Training Greedy Policy for Proposal Batch Selection in Expensive Multi-Objective Combinatorial Optimization ICML 2024
Active learning is increasingly adopted for expensive multi-objective combinatorial optimization problems, but it involves a challenging subset selection problem, optimizing the batch acquisition score that quantifies the goodness of a batch for evaluation. Due to the excessively large search space of the subset selection problem, prior methods optimize the batch acquisition on the latent space, which has discrepancies with the actual space, or optimize individual acquisition scores without considering the dependencies among candidates in a batch instead of directly optimizing the batch acquisition. To manage the vast search space, a simple and effective approach is the greedy method, which decomposes the problem into smaller subproblems, yet it has difficulty in parallelization since each subproblem depends on the outcome from the previous ones. To this end, we introduce a novel greedy-style subset selection algorithm that optimizes batch acquisition directly on the combinatorial space by sequential greedy sampling from the greedy policy, specifically trained to address all greedy subproblems concurrently. Notably, our experiments on the red fluorescent proteins design task show that our proposed method achieves the baseline performance in 1.69x fewer queries, demonstrating its efficiency.
comment: ICML 2024; Codes at https://github.com/snu-mllab/GreedyPolicyForMOCO
☆ I don't trust you (anymore)! -- The effect of students' LLM use on Lecturer-Student-Trust in Higher Education
Trust plays a pivotal role in Lecturer-Student-Collaboration, encompassing teaching and research aspects. The advent of Large Language Models (LLMs) in platforms like Open AI's ChatGPT, coupled with their cost-effectiveness and high-quality results, has led to their rapid adoption among university students. However, discerning genuine student input from LLM-generated output poses a challenge for lecturers. This dilemma jeopardizes the trust relationship between lecturers and students, potentially impacting university downstream activities, particularly collaborative research initiatives. Despite attempts to establish guidelines for student LLM use, a clear framework mutually beneficial for lecturers and students in higher education remains elusive. This study addresses the research question: How does the use of LLMs by students impact Informational and Procedural Justice, influencing Team Trust and Expected Team Performance? Methodically, we applied a quantitative construct-based survey, evaluated using techniques of Structural Equation Modelling (PLS- SEM) to examine potential relationships among these constructs. Our findings based on 23 valid respondents from Ndejje University indicate that lecturers are less concerned about the fairness of LLM use per se but are more focused on the transparency of student utilization, which significantly influences Team Trust positively. This research contributes to the global discourse on integrating and regulating LLMs and subsequent models in education. We propose that guidelines should support LLM use while enforcing transparency in Lecturer-Student- Collaboration to foster Team Trust and Performance. The study contributes valuable insights for shaping policies enabling ethical and transparent LLMs usage in education to ensure effectiveness of collaborative learning environments.
comment: Working Paper
☆ DistiLRR: Transferring Code Repair for Low-Resource Programming Languages
Large language models (LLMs) have shown remarkable performance on code generation tasks. A recent application of LLMs for code generation is iterative code repair, where a model fixes an incorrect program by rationalizing about errors and generating a new program. However, code repair is primarily studied on high-resource languages like Python, and the framework's efficacy is under-explored on low-resource languages. To apply code repair for low-resource languages, we propose Distilling Low-Resource Repairs (DistiLRR), an approach that transfers the reasoning and code generation ability from a teacher model to a student model. Our results show that DistiLRR consistently outperforms baselines on low-resource languages, but has similar performance on high-resource languages. To investigate this behavior, we perform a further analysis and find that the correlation between rationale quality and code correctness is weaker than previously perceived. We hypothesize this weakness is magnified in low-resource settings where base models lack deep knowledge of a programming language, leading to wavering benefits of code repair between high-resource and low-resource languages.
☆ AI-based Anomaly Detection for Clinical-Grade Histopathological Diagnostics
While previous studies have demonstrated the potential of AI to diagnose diseases in imaging data, clinical implementation is still lagging behind. This is partly because AI models require training with large numbers of examples only available for common diseases. In clinical reality, however, only few diseases are common, whereas the majority of diseases are less frequent (long-tail distribution). Current AI models overlook or misclassify these diseases. We propose a deep anomaly detection approach that only requires training data from common diseases to detect also all less frequent diseases. We collected two large real-world datasets of gastrointestinal biopsies, which are prototypical of the problem. Herein, the ten most common findings account for approximately 90% of cases, whereas the remaining 10% contained 56 disease entities, including many cancers. 17 million histological images from 5,423 cases were used for training and evaluation. Without any specific training for the diseases, our best-performing model reliably detected a broad spectrum of infrequent ("anomalous") pathologies with 95.0% (stomach) and 91.0% (colon) AUROC and generalized across scanners and hospitals. By design, the proposed anomaly detection can be expected to detect any pathological alteration in the diagnostic tail of gastrointestinal biopsies, including rare primary or metastatic cancers. This study establishes the first effective clinical application of AI-based anomaly detection in histopathology that can flag anomalous cases, facilitate case prioritization, reduce missed diagnoses and enhance the general safety of AI models, thereby driving AI adoption and automation in routine diagnostics and beyond.
☆ From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking
The rapid development of Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) has exposed vulnerabilities to various adversarial attacks. This paper provides a comprehensive overview of jailbreaking research targeting both LLMs and MLLMs, highlighting recent advancements in evaluation benchmarks, attack techniques and defense strategies. Compared to the more advanced state of unimodal jailbreaking, multimodal domain remains underexplored. We summarize the limitations and potential research directions of multimodal jailbreaking, aiming to inspire future research and further enhance the robustness and security of MLLMs.
☆ PEANO-ViT: Power-Efficient Approximations of Non-Linearities in Vision Transformers
The deployment of Vision Transformers (ViTs) on hardware platforms, specially Field-Programmable Gate Arrays (FPGAs), presents many challenges, which are mainly due to the substantial computational and power requirements of their non-linear functions, notably layer normalization, softmax, and Gaussian Error Linear Unit (GELU). These critical functions pose significant obstacles to efficient hardware implementation due to their complex mathematical operations and the inherent resource count and architectural limitations of FPGAs. PEANO-ViT offers a novel approach to streamlining the implementation of the layer normalization layer by introducing a division-free technique that simultaneously approximates the division and square root function. Additionally, PEANO-ViT provides a multi-scale division strategy to eliminate division operations in the softmax layer, aided by a Pade-based approximation for the exponential function. Finally, PEANO-ViT introduces a piece-wise linear approximation for the GELU function, carefully designed to bypass the computationally intensive operations associated with GELU. In our comprehensive evaluations, PEANO-ViT exhibits minimal accuracy degradation (<= 0.5% for DeiT-B) while significantly enhancing power efficiency, achieving improvements of 1.91x, 1.39x, 8.01x for layer normalization, softmax, and GELU, respectively. This improvement is achieved through substantial reductions in DSP, LUT, and register counts for these non-linear operations. Consequently, PEANO-ViT enables efficient deployment of Vision Transformers on resource- and power-constrained FPGA platforms.
☆ Is A Picture Worth A Thousand Words? Delving Into Spatial Reasoning for Vision Language Models
Large language models (LLMs) and vision-language models (VLMs) have demonstrated remarkable performance across a wide range of tasks and domains. Despite this promise, spatial understanding and reasoning -- a fundamental component of human cognition -- remains under-explored. We develop novel benchmarks that cover diverse aspects of spatial reasoning such as relationship understanding, navigation, and counting. We conduct a comprehensive evaluation of competitive language and vision-language models. Our findings reveal several counter-intuitive insights that have been overlooked in the literature: (1) Spatial reasoning poses significant challenges where competitive models can fall behind random guessing; (2) Despite additional visual input, VLMs often under-perform compared to their LLM counterparts; (3) When both textual and visual information is available, multi-modal language models become less reliant on visual information if sufficient textual clues are provided. Additionally, we demonstrate that leveraging redundancy between vision and text can significantly enhance model performance. We hope our study will inform the development of multimodal models to improve spatial intelligence and further close the gap with human intelligence.
☆ DN-CL: Deep Symbolic Regression against Noise via Contrastive Learning
Noise ubiquitously exists in signals due to numerous factors including physical, electronic, and environmental effects. Traditional methods of symbolic regression, such as genetic programming or deep learning models, aim to find the most fitting expressions for these signals. However, these methods often overlook the noise present in real-world data, leading to reduced fitting accuracy. To tackle this issue, we propose \textit{\textbf{D}eep Symbolic Regression against \textbf{N}oise via \textbf{C}ontrastive \textbf{L}earning (DN-CL)}. DN-CL employs two parameter-sharing encoders to embed data points from various data transformations into feature shields against noise. This model treats noisy data and clean data as different views of the ground-truth mathematical expressions. Distances between these features are minimized, utilizing contrastive learning to distinguish between 'positive' noise-corrected pairs and 'negative' contrasting pairs. Our experiments indicate that DN-CL demonstrates superior performance in handling both noisy and clean data, presenting a promising method of symbolic regression.
☆ Automated architectural space layout planning using a physics-inspired generative design framework
The determination of space layout is one of the primary activities in the schematic design stage of an architectural project. The initial layout planning defines the shape, dimension, and circulation pattern of internal spaces; which can also affect performance and cost of the construction. When carried out manually, space layout planning can be complicated, repetitive and time consuming. In this work, a generative design framework for the automatic generation of spatial architectural layout has been developed. The proposed approach integrates a novel physics-inspired parametric model for space layout planning and an evolutionary optimisation metaheuristic. Results revealed that such a generative design framework can generate a wide variety of design suggestions at the schematic design stage, applicable to complex design problems.
Self-supervised Brain Lesion Generation for Effective Data Augmentation of Medical Images
Accurate brain lesion delineation is important for planning neurosurgical treatment. Automatic brain lesion segmentation methods based on convolutional neural networks have demonstrated remarkable performance. However, neural network performance is constrained by the lack of large-scale well-annotated training datasets. In this manuscript, we propose a comprehensive framework to efficiently generate new, realistic samples for training a brain lesion segmentation model. We first train a lesion generator, based on an adversarial autoencoder, in a self-supervised manner. Next, we utilize a novel image composition algorithm, Soft Poisson Blending, to seamlessly combine synthetic lesions and brain images to obtain training samples. Finally, to effectively train the brain lesion segmentation model with augmented images we introduce a new prototype consistence regularization to align real and synthetic features. Our framework is validated by extensive experiments on two public brain lesion segmentation datasets: ATLAS v2.0 and Shift MS. Our method outperforms existing brain image data augmentation schemes. For instance, our method improves the Dice from 50.36% to 60.23% compared to the U-Net with conventional data augmentation techniques for the ATLAS v2.0 dataset.
comment: 10 pages, 6 figures, 8 tables
☆ Latent diffusion models for parameterization and data assimilation of facies-based geomodels
Geological parameterization entails the representation of a geomodel using a small set of latent variables and a mapping from these variables to grid-block properties such as porosity and permeability. Parameterization is useful for data assimilation (history matching), as it maintains geological realism while reducing the number of variables to be determined. Diffusion models are a new class of generative deep-learning procedures that have been shown to outperform previous methods, such as generative adversarial networks, for image generation tasks. Diffusion models are trained to "denoise", which enables them to generate new geological realizations from input fields characterized by random noise. Latent diffusion models, which are the specific variant considered in this study, provide dimension reduction through use of a low-dimensional latent variable. The model developed in this work includes a variational autoencoder for dimension reduction and a U-net for the denoising process. Our application involves conditional 2D three-facies (channel-levee-mud) systems. The latent diffusion model is shown to provide realizations that are visually consistent with samples from geomodeling software. Quantitative metrics involving spatial and flow-response statistics are evaluated, and general agreement between the diffusion-generated models and reference realizations is observed. Stability tests are performed to assess the smoothness of the parameterization method. The latent diffusion model is then used for ensemble-based data assimilation. Two synthetic "true" models are considered. Significant uncertainty reduction, posterior P$_{10}$-P$_{90}$ forecasts that generally bracket observed data, and consistent posterior geomodels, are achieved in both cases.
☆ Securing the Future: Proactive Threat Hunting for Sustainable IoT Ecosystems
In the rapidly evolving landscape of the IoT, the security of connected devices has become a paramount concern. This paper explores the concept of proactive threat hunting as a pivotal strategy for enhancing the security and sustainability of IoT systems. Proactive threat hunting is an alternative to traditional reactive security measures that analyses IoT networks continuously and in advance to find and eliminate threats before they occure. By improving the security posture of IoT devices this approach significantly contributes to extending IoT operational lifespan and reduces environmental impact. By integrating security metrics similar to the Common Vulnerability Scoring System (CVSS) into consumer platforms, this paper argues that proactive threat hunting can elevate user awareness about the security of IoT devices. This has the potential to impact consumer choices and encourage a security-conscious mindset in both the manufacturing and user communities. Through a comprehensive analysis, this study demonstrates how proactive threat hunting can contribute to the development of a more secure, sustainable, and user-aware IoT ecosystem.
☆ Probabilistic Emulation of a Global Climate Model with Spherical DYffusion
Data-driven deep learning models are on the verge of transforming global weather forecasting. It is an open question if this success can extend to climate modeling, where long inference rollouts and data complexity pose significant challenges. Here, we present the first conditional generative model able to produce global climate ensemble simulations that are accurate and physically consistent. Our model runs at 6-hourly time steps and is shown to be stable for 10-year-long simulations. Our approach beats relevant baselines and nearly reaches a gold standard for successful climate model emulation. We discuss the key design choices behind our dynamics-informed diffusion model-based approach which enables this significant step towards efficient, data-driven climate simulations that can help us better understand the Earth and adapt to a changing climate.
☆ Camera-Invariant Meta-Learning Network for Single-Camera-Training Person Re-identification
Single-camera-training person re-identification (SCT re-ID) aims to train a re-ID model using SCT datasets where each person appears in only one camera. The main challenge of SCT re-ID is to learn camera-invariant feature representations without cross-camera same-person (CCSP) data as supervision. Previous methods address it by assuming that the most similar person should be found in another camera. However, this assumption is not guaranteed to be correct. In this paper, we propose a Camera-Invariant Meta-Learning Network (CIMN) for SCT re-ID. CIMN assumes that the camera-invariant feature representations should be robust to camera changes. To this end, we split the training data into meta-train set and meta-test set based on camera IDs and perform a cross-camera simulation via meta-learning strategy, aiming to enforce the representations learned from the meta-train set to be robust to the meta-test set. With the cross-camera simulation, CIMN can learn camera-invariant and identity-discriminative representations even there are no CCSP data. However, this simulation also causes the separation of the meta-train set and the meta-test set, which ignores some beneficial relations between them. Thus, we introduce three losses: meta triplet loss, meta classification loss, and meta camera alignment loss, to leverage the ignored relations. The experiment results demonstrate that our method achieves comparable performance with and without CCSP data, and outperforms the state-of-the-art methods on SCT re-ID benchmarks. In addition, it is also effective in improving the domain generalization ability of the model.
☆ MU-Bench: A Multitask Multimodal Benchmark for Machine Unlearning
Recent advancements in Machine Unlearning (MU) have introduced solutions to selectively remove certain training samples, such as those with outdated or sensitive information, from trained models. Despite these advancements, evaluation of MU methods have been inconsistent, employing different trained models and architectures, and sample removal strategies, which hampers accurate comparison. In addition, prior MU approaches have mainly focused on singular tasks or modalities, which is not comprehensive. To address these limitations, we develop MU-Bench, the first comprehensive benchmark for MU that (i) unifies the sets of deleted samples and trained models, and (ii) provides broad coverage of tasks and data modalities, including previously unexplored domains such as speech and video classification. Our evaluation show that RandLabel and SalUn are the most effective general MU approaches on MU-Bench, and BadT and SCRUB are capable of achieving random performance on the deletion set. We analyze several under-investigated aspects of unlearning, including scalability, the impacts of parameter-efficient fine-tuning and curriculum learning, and susceptibility to dataset biases. MU-Bench provides an easy-to-use package that includes dataset splits, models, and implementations, together with a leader board to enable unified and scalable MU research.
♻ ☆ Large language models surpass human experts in predicting neuroscience results
Scientific discoveries often hinge on synthesizing decades of research, a task that potentially outstrips human information processing capacities. Large language models (LLMs) offer a solution. LLMs trained on the vast scientific literature could potentially integrate noisy yet interrelated findings to forecast novel results better than human experts. To evaluate this possibility, we created BrainBench, a forward-looking benchmark for predicting neuroscience results. We find that LLMs surpass experts in predicting experimental outcomes. BrainGPT, an LLM we tuned on the neuroscience literature, performed better yet. Like human experts, when LLMs were confident in their predictions, they were more likely to be correct, which presages a future where humans and LLMs team together to make discoveries. Our approach is not neuroscience-specific and is transferable to other knowledge-intensive endeavors.
♻ ☆ MTUncertainty: Assessing the Need for Post-editing of Machine Translation Outputs by Fine-tuning OpenAI LLMs
Translation Quality Evaluation (TQE) is an essential step of the modern translation production process. TQE is critical in assessing both machine translation (MT) and human translation (HT) quality without reference translations. The ability to evaluate or even simply estimate the quality of translation automatically may open significant efficiency gains through process optimisation. This work examines whether the state-of-the-art large language models (LLMs) can be used for this purpose. We take OpenAI models as the best state-of-the-art technology and approach TQE as a binary classification task. On eight language pairs including English to Italian, German, French, Japanese, Dutch, Portuguese, Turkish, and Chinese, our experimental results show that fine-tuned gpt3.5 can demonstrate good performance on translation quality prediction tasks, i.e. whether the translation needs to be edited. Another finding is that simply increasing the sizes of LLMs does not lead to apparent better performances on this task by comparing the performance of three different versions of OpenAI models: curie, davinci, and gpt3.5 with 13B, 175B, and 175B parameters, respectively.
comment: Accepted by EAMT2024: The 25th Annual Conference of The European Association for Machine Translation
♻ ☆ AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention
Despite their great success across various multimodal tasks, Large Vision-Language Models (LVLMs) are facing a prevalent problem with object hallucinations, where the generated textual responses are inconsistent with ground-truth objects in the given image. This paper investigates various LVLMs and pinpoints attention deficiency toward discriminative local image features as one root cause of object hallucinations. Specifically, LVLMs predominantly attend to prompt-independent global image features, while failing to capture prompt-relevant local features, consequently undermining the visual grounding capacity of LVLMs and leading to hallucinations. To this end, we propose Assembly of Global and Local Attention (AGLA), a training-free and plug-and-play approach that mitigates object hallucinations by exploring an ensemble of global features for response generation and local features for visual discrimination simultaneously. Our approach exhibits an image-prompt matching scheme that captures prompt-relevant local features from images, leading to an augmented view of the input image where prompt-relevant content is reserved while irrelevant distractions are masked. With the augmented view, a calibrated decoding distribution can be derived by integrating generative global features from the original image and discriminative local features from the augmented image. Extensive experiments show that AGLA consistently mitigates object hallucinations and enhances general perception capability for LVLMs across various discriminative and generative benchmarks. Our code will be released at https://github.com/Lackel/AGLA.
♻ ☆ Getting Serious about Humor: Crafting Humor Datasets with Unfunny Large Language Models
Humor is a fundamental facet of human cognition and interaction. Yet, despite recent advances in natural language processing, humor detection remains a challenging task that is complicated by the scarcity of datasets that pair humorous texts with similar non-humorous counterparts. In our work, we investigate whether large language models (LLMs), can generate synthetic data for humor detection via editing texts. We benchmark LLMs on an existing human dataset and show that current LLMs display an impressive ability to 'unfun' jokes, as judged by humans and as measured on the downstream task of humor detection. We extend our approach to a code-mixed English-Hindi humor dataset, where we find that GPT-4's synthetic data is highly rated by bilingual annotators and provides challenging adversarial examples for humor classifiers.
♻ ☆ DASB -- Discrete Audio and Speech Benchmark
Discrete audio tokens have recently gained considerable attention for their potential to connect audio and language processing, enabling the creation of modern multimodal large language models. Ideal audio tokens must effectively preserve phonetic and semantic content along with paralinguistic information, speaker identity, and other details. While several types of audio tokens have been recently proposed, identifying the optimal tokenizer for various tasks is challenging due to the inconsistent evaluation settings in existing studies. To address this gap, we release the Discrete Audio and Speech Benchmark (DASB), a comprehensive leaderboard for benchmarking discrete audio tokens across a wide range of discriminative tasks, including speech recognition, speaker identification and verification, emotion recognition, keyword spotting, and intent classification, as well as generative tasks such as speech enhancement, separation, and text-to-speech. Our results show that, on average, semantic tokens outperform compression tokens across most discriminative and generative tasks. However, the performance gap between semantic tokens and standard continuous representations remains substantial, highlighting the need for further research in this field.
comment: 9 pages, 5 tables
♻ ☆ Offline Diversity Maximization Under Imitation Constraints
There has been significant recent progress in the area of unsupervised skill discovery, utilizing various information-theoretic objectives as measures of diversity. Despite these advances, challenges remain: current methods require significant online interaction, fail to leverage vast amounts of available task-agnostic data and typically lack a quantitative measure of skill utility. We address these challenges by proposing a principled offline algorithm for unsupervised skill discovery that, in addition to maximizing diversity, ensures that each learned skill imitates state-only expert demonstrations to a certain degree. Our main analytical contribution is to connect Fenchel duality, reinforcement learning, and unsupervised skill discovery to maximize a mutual information objective subject to KL-divergence state occupancy constraints. Furthermore, we demonstrate the effectiveness of our method on the standard offline benchmark D4RL and on a custom offline dataset collected from a 12-DoF quadruped robot for which the policies trained in simulation transfer well to the real robotic system.
comment: RLC 2024
♻ ☆ DUAL-REFLECT: Enhancing Large Language Models for Reflective Translation through Dual Learning Feedback Mechanisms ACL 2024
Recently, large language models (LLMs) enhanced by self-reflection have achieved promising performance on machine translation. The key idea is guiding LLMs to generate translation with human-like feedback. However, existing self-reflection methods lack effective feedback information, limiting the translation performance. To address this, we introduce a DUAL-REFLECT framework, leveraging the dual learning of translation tasks to provide effective feedback, thereby enhancing the models' self-reflective abilities and improving translation performance. The application of this method across various translation tasks has proven its effectiveness in improving translation accuracy and eliminating ambiguities, especially in translation tasks with low-resource language pairs.
comment: Accepted to ACL 2024 main conference
♻ ☆ Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models
Do large language models (LLMs) know the law? These models are increasingly being used to augment legal practice, education, and research, yet their revolutionary potential is threatened by the presence of hallucinations -- textual output that is not consistent with legal facts. We present the first systematic evidence of these hallucinations, documenting LLMs' varying performance across jurisdictions, courts, time periods, and cases. Our work makes four key contributions. First, we develop a typology of legal hallucinations, providing a conceptual framework for future research in this area. Second, we find that legal hallucinations are alarmingly prevalent, occurring between 58% of the time with ChatGPT 4 and 88% with Llama 2, when these models are asked specific, verifiable questions about random federal court cases. Third, we illustrate that LLMs often fail to correct a user's incorrect legal assumptions in a contra-factual question setup. Fourth, we provide evidence that LLMs cannot always predict, or do not always know, when they are producing legal hallucinations. Taken together, our findings caution against the rapid and unsupervised integration of popular LLMs into legal tasks. Even experienced lawyers must remain wary of legal hallucinations, and the risks are highest for those who stand to benefit from LLMs the most -- pro se litigants or those without access to traditional legal resources.
♻ ☆ Reinforcement Learning with Latent State Inference for Autonomous On-ramp Merging under Observation Delay
This paper presents a novel approach to address the challenging problem of autonomous on-ramp merging, where a self-driving vehicle needs to seamlessly integrate into a flow of vehicles on a multi-lane highway. We introduce the Lane-keeping, Lane-changing with Latent-state Inference and Safety Controller (L3IS) agent, designed to perform the on-ramp merging task safely without comprehensive knowledge about surrounding vehicles' intents or driving styles. We also present an augmentation of this agent called AL3IS that accounts for observation delays, allowing the agent to make more robust decisions in real-world environments with vehicle-to-vehicle (V2V) communication delays. By modeling the unobservable aspects of the environment through latent states, such as other drivers' intents, our approach enhances the agent's ability to adapt to dynamic traffic conditions, optimize merging maneuvers, and ensure safe interactions with other vehicles. We demonstrate the effectiveness of our method through extensive simulations generated from real traffic data and compare its performance with existing approaches. L3IS shows a 99.90% success rate in a challenging on-ramp merging case generated from the real US Highway 101 data. We further perform a sensitivity analysis on AL3IS to evaluate its robustness against varying observation delays, which demonstrates an acceptable performance of 93.84% success rate in 1-second V2V communication delay.
♻ ☆ Large Language Model-Enabled Multi-Agent Manufacturing Systems
Traditional manufacturing faces challenges adapting to dynamic environments and quickly responding to manufacturing changes. The use of multi-agent systems has improved adaptability and coordination but requires further advancements in rapid human instruction comprehension, operational adaptability, and coordination through natural language integration. Large language models like GPT-3.5 and GPT-4 enhance multi-agent manufacturing systems by enabling agents to communicate in natural language and interpret human instructions for decision-making. This research introduces a novel framework where large language models enhance the capabilities of agents in manufacturing, making them more adaptable, and capable of processing context-specific instructions. A case study demonstrates the practical application of this framework, showing how agents can effectively communicate, understand tasks, and execute manufacturing processes, including precise G-code allocation among agents. The findings highlight the importance of continuous large language model integration into multi-agent manufacturing systems and the development of sophisticated agent communication protocols for a more flexible manufacturing system.
♻ ☆ A Low-Overhead Incorporation-Extrapolation based Few-Shot CSI Feedback Framework for Massive MIMO Systems
Accurate channel state information (CSI) is essential for downlink precoding in frequency division duplexing (FDD) massive multiple-input multiple-output (MIMO) systems with orthogonal frequency-division multiplexing (OFDM). However, obtaining CSI through feedback from the user equipment (UE) becomes challenging with the increasing scale of antennas and subcarriers and leads to extremely high CSI feedback overhead. Deep learning-based methods have emerged for compressing CSI but these methods generally require substantial collected samples and thus pose practical challenges. Moreover, existing deep learning methods also suffer from dramatically growing feedback overhead owing to their focus on full-dimensional CSI feedback. To address these issues, we propose a low-overhead Incorporation-Extrapolation based Few-Shot CSI feedback Framework (IEFSF) for massive MIMO systems. An incorporation-extrapolation scheme for eigenvector-based CSI feedback is proposed to reduce the feedback overhead. Then, to alleviate the necessity of extensive collected samples and enable few-shot CSI feedback, we further propose a knowledge-driven data augmentation (KDDA) method and an artificial intelligence-generated content (AIGC) -based data augmentation method by exploiting the domain knowledge of wireless channels and by exploiting a novel generative model, respectively. Experimental results based on the DeepMIMO dataset demonstrate that the proposed IEFSF significantly reduces CSI feedback overhead by 64 times compared with existing methods while maintaining higher feedback accuracy using only several hundred collected samples.
comment: 16 pages, 12 figures, 5 tables. Accepted by IEEE Transactions on Wireless Communications
♻ ☆ A Survey on Intelligent Internet of Things: Applications, Security, Privacy, and Future Directions
The rapid advances in the Internet of Things (IoT) have promoted a revolution in communication technology and offered various customer services. Artificial intelligence (AI) techniques have been exploited to facilitate IoT operations and maximize their potential in modern application scenarios. In particular, the convergence of IoT and AI has led to a new networking paradigm called Intelligent IoT (IIoT), which has the potential to significantly transform businesses and industrial domains. This paper presents a comprehensive survey of IIoT by investigating its significant applications in mobile networks, as well as its associated security and privacy issues. Specifically, we explore and discuss the roles of IIoT in a wide range of key application domains, from smart healthcare and smart cities to smart transportation and smart industries. Through such extensive discussions, we investigate important security issues in IIoT networks, where network attacks, confidentiality, integrity, and intrusion are analyzed, along with a discussion of potential countermeasures. Privacy issues in IIoT networks were also surveyed and discussed, including data, location, and model privacy leakage. Finally, we outline several key challenges and highlight potential research directions in this important area.
comment: This work has been accepted by IEEE Communications Surveys & Tutorials
♻ ☆ Uncertainty-Aware Probabilistic Graph Neural Networks for Road-Level Traffic Accident Prediction
Traffic accidents present substantial challenges to human safety and socioeconomic development in urban areas. Developing a reliable and responsible traffic accident prediction model is crucial to addressing growing public safety concerns and enhancing the safety of urban mobility systems. Traditional methods face limitations at fine spatiotemporal scales due to the sporadic nature of highrisk accidents and the predominance of nonaccident characteristics. Furthermore, while most current models show promising occurrence prediction, they overlook the uncertainties arising from the inherent nature of accidents, and then fail to adequately map the hierarchical ranking of accident risk values for more precise insights. To address these issues, we introduce the Spatiotemporal ZeroInflated Tweedie Graph Neural Network ,STZITDGNN, the first uncertainty-aware probabilistic graph deep learning model in roadlevel traffic accident prediction for multi-steps. This model integrates the interpretability of the statistical Tweedie family model and the expressive power of graph neural networks. Its decoder innovatively employs a compound Tweedie model, a Poisson distribution to model the frequency of accident occurrences and a Gamma distribution to assess injury severity, supplemented by a zeroinflated component to effectively identify exessive non-incident instances. Empirical tests using realworld traffic data from London, UK, demonstrate that the STZITDGNN surpasses other baseline models across multiple benchmarks and metrics, including accident risk value prediction, uncertainty minimisation, nonaccident road identification and accident occurrence accuracy. Our study demonstrates that STZTIDGNN can effectively inform targeted road monitoring, thereby improving urban road safety strategies.
♻ ☆ Chain-of-Thought Unfaithfulness as Disguised Accuracy
Understanding the extent to which Chain-of-Thought (CoT) generations align with a large language model's (LLM) internal computations is critical for deciding whether to trust an LLM's output. As a proxy for CoT faithfulness, Lanham et al. (2023) propose a metric that measures a model's dependence on its CoT for producing an answer. Within a single family of proprietary models, they find that LLMs exhibit a scaling-then-inverse-scaling relationship between model size and their measure of faithfulness, and that a 13 billion parameter model exhibits increased faithfulness compared to models ranging from 810 million to 175 billion parameters in size. We evaluate whether these results generalize as a property of all LLMs. We replicate the experimental setup in their section focused on scaling experiments with three different families of models and, under specific conditions, successfully reproduce the scaling trends for CoT faithfulness they report. However, after normalizing the metric to account for a model's bias toward certain answer choices, unfaithfulness drops significantly for smaller less-capable models. This normalized faithfulness metric is also strongly correlated ($R^2$=0.74) with accuracy, raising doubts about its validity for evaluating faithfulness.
comment: TMLR accepted paper camera-ready version. First two authors contributed equally. 8 pages main, 13 pages appendix
♻ ☆ Two Complementary Perspectives to Continual Learning: Ask Not Only What to Optimize, But Also How
Recent years have seen considerable progress in the continual training of deep neural networks, predominantly thanks to approaches that add replay or regularization terms to the loss function to approximate the joint loss over all tasks so far. However, we show that even with a perfect approximation to the joint loss, these approaches still suffer from temporary but substantial forgetting when starting to train on a new task. Motivated by this 'stability gap', we propose that continual learning strategies should focus not only on the optimization objective, but also on the way this objective is optimized. While there is some continual learning work that alters the optimization trajectory (e.g., using gradient projection techniques), this line of research is positioned as alternative to improving the optimization objective, while we argue it should be complementary. In search of empirical support for our proposition, we perform a series of pre-registered experiments combining replay-approximated joint objectives with gradient projection-based optimization routines. However, this first experimental attempt fails to show clear and consistent benefits. Nevertheless, our conceptual arguments, as well as some of our empirical results, demonstrate the distinctive importance of the optimization trajectory in continual learning, thereby opening up a new direction for continual learning research.
comment: Full paper version of pre-registered report accepted at the 1st ContinualAI Unconference. The originally submitted pre-registered proposal can be found at arXiv:2311.04898v1
♻ ☆ Constrained Reinforcement Learning with Average Reward Objective: Model-Based and Model-Free Algorithms
Reinforcement Learning (RL) serves as a versatile framework for sequential decision-making, finding applications across diverse domains such as robotics, autonomous driving, recommendation systems, supply chain optimization, biology, mechanics, and finance. The primary objective in these applications is to maximize the average reward. Real-world scenarios often necessitate adherence to specific constraints during the learning process. This monograph focuses on the exploration of various model-based and model-free approaches for Constrained RL within the context of average reward Markov Decision Processes (MDPs). The investigation commences with an examination of model-based strategies, delving into two foundational methods - optimism in the face of uncertainty and posterior sampling. Subsequently, the discussion transitions to parametrized model-free approaches, where the primal-dual policy gradient-based algorithm is explored as a solution for constrained MDPs. The monograph provides regret guarantees and analyzes constraint violation for each of the discussed setups. For the above exploration, we assume the underlying MDP to be ergodic. Further, this monograph extends its discussion to encompass results tailored for weakly communicating MDPs, thereby broadening the scope of its findings and their relevance to a wider range of practical scenarios.
comment: arXiv admin note: text overlap with arXiv:2402.02042; text overlap with arXiv:2202.00150 by other authors
♻ ☆ Information Guided Regularization for Fine-tuning Language Models
The pretraining-fine-tuning paradigm has been the de facto strategy for transfer learning in modern language modeling. With the understanding that task adaptation in LMs is often a function of parameters shared across tasks, we argue that a more surgical approach to regularization needs to exist for smoother transfer learning. Towards this end, we investigate how the pretraining loss landscape is affected by these task-sensitive parameters through an information-theoretic lens. We then leverage the findings from our investigations to devise a novel approach to dropout for improved model regularization and better downstream generalization. This approach, named guided dropout, is both task & architecture agnostic and adds no computational overhead to the fine-tuning process. Through empirical evaluations, we showcase that our approach to regularization yields consistently better performance, even in scenarios of data paucity, compared to standardized baselines.
♻ ☆ CORM: Cache Optimization with Recent Message for Large Language Model Inference
Large Language Models (LLMs), despite their remarkable performance across a wide range of tasks, necessitate substantial GPU memory and consume significant computational resources. Beyond the memory taken up by model weights, the memory used by the KV cache rises linearly with sequence length, becoming a primary bottleneck for inference. In this paper, we introduce an innovative method for optimizing the KV cache, which considerably minimizes its memory footprint. Upon thorough investigation, we discover that in most Transformer models, (i) there is a striking similarity between adjacent tokens' query vectors, and (ii) the attention calculation of the current query can rely exclusively on the attention information of a small fraction of preceding queries. Based on these observations, we present CORM, a KV cache eviction policy that dynamically retains essential key-value pairs for inference without the need for model fine-tuning. Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70\% with negligible performance degradation across six tasks in LongBench. Furthermore, we demonstrate that CORM is compatible with GQA for further compression rate.
♻ ☆ Straight-Through meets Sparse Recovery: the Support Exploration Algorithm
The {\it straight-through estimator} (STE) is commonly used to optimize quantized neural networks, yet its contexts of effective performance are still unclear despite empirical successes.To make a step forward in this comprehension, we apply STE to a well-understood problem: {\it sparse support recovery}. We introduce the {\it Support Exploration Algorithm} (SEA), a novel algorithm promoting sparsity, and we analyze its performance in support recovery (a.k.a. model selection) problems. SEA explores more supports than the state-of-the-art, leading to superior performance in experiments, especially when the columns of $A$ are strongly coherent.The theoretical analysis considers recovery guarantees when the linear measurements matrix $A$ satisfies the {\it Restricted Isometry Property} (RIP).The sufficient conditions of recovery are comparable but more stringent than those of the state-of-the-art in sparse support recovery. Their significance lies mainly in their applicability to an instance of the STE.
♻ ☆ HW-GPT-Bench: Hardware-Aware Architecture Benchmark for Language Models
The increasing size of language models necessitates a thorough analysis across multiple dimensions to assess trade-offs among crucial hardware metrics such as latency, energy consumption, GPU memory usage, and performance. Identifying optimal model configurations under specific hardware constraints is becoming essential but remains challenging due to the computational load of exhaustive training and evaluation on multiple devices. To address this, we introduce HW-GPT-Bench, a hardware-aware benchmark that utilizes surrogate predictions to approximate various hardware metrics across 13 devices of architectures in the GPT-2 family, with architectures containing up to 774M parameters. Our surrogates, via calibrated predictions and reliable uncertainty estimates, faithfully model the heteroscedastic noise inherent in the energy and latency measurements. To estimate perplexity, we employ weight-sharing techniques from Neural Architecture Search (NAS), inheriting pretrained weights from the largest GPT-2 model. Finally, we demonstrate the utility of HW-GPT-Bench by simulating optimization trajectories of various multi-objective optimization algorithms in just a few seconds.
comment: 48 pages, 69 figures, 10 tables
♻ ☆ Generative AI Misuse: A Taxonomy of Tactics and Insights from Real-World Data
Generative, multimodal artificial intelligence (GenAI) offers transformative potential across industries, but its misuse poses significant risks. Prior research has shed light on the potential of advanced AI systems to be exploited for malicious purposes. However, we still lack a concrete understanding of how GenAI models are specifically exploited or abused in practice, including the tactics employed to inflict harm. In this paper, we present a taxonomy of GenAI misuse tactics, informed by existing academic literature and a qualitative analysis of approximately 200 observed incidents of misuse reported between January 2023 and March 2024. Through this analysis, we illuminate key and novel patterns in misuse during this time period, including potential motivations, strategies, and how attackers leverage and abuse system capabilities across modalities (e.g. image, text, audio, video) in the wild.
♻ ☆ RL4CO: an Extensive Reinforcement Learning for Combinatorial Optimization Benchmark NeurIPS 2023
Deep reinforcement learning (RL) has recently shown significant benefits in solving combinatorial optimization (CO) problems, reducing reliance on domain expertise, and improving computational efficiency. However, the field lacks a unified benchmark for easy development and standardized comparison of algorithms across diverse CO problems. To fill this gap, we introduce RL4CO, a unified and extensive benchmark with in-depth library coverage of 23 state-of-the-art methods and more than 20 CO problems. Built on efficient software libraries and best practices in implementation, RL4CO features modularized implementation and flexible configuration of diverse RL algorithms, neural network architectures, inference techniques, and environments. RL4CO allows researchers to seamlessly navigate existing successes and develop their unique designs, facilitating the entire research process by decoupling science from heavy engineering. We also provide extensive benchmark studies to inspire new insights and future work. RL4CO has attracted numerous researchers in the community and is open-sourced at https://github.com/ai4co/rl4co.
comment: A previous version was presented as a workshop paper at the NeurIPS 2023 GLFrontiers Workshop (Oral)
♻ ☆ Efficient Perception, Planning, and Control Algorithm for Vision-Based Automated Vehicles
Autonomous vehicles have limited computational resources and thus require efficient control systems. The cost and size of sensors have limited the development of self-driving cars. To overcome these restrictions, this study proposes an efficient framework for the operation of vision-based automatic vehicles; the framework requires only a monocular camera and a few inexpensive radars. The proposed algorithm comprises a multi-task UNet (MTUNet) network for extracting image features and constrained iterative linear quadratic regulator (CILQR) and vision predictive control (VPC) modules for rapid motion planning and control. MTUNet is designed to simultaneously solve lane line segmentation, the ego vehicle's heading angle regression, road type classification, and traffic object detection tasks at approximately 40 FPS for 228 x 228 pixel RGB input images. The CILQR controllers then use the MTUNet outputs and radar data as inputs to produce driving commands for lateral and longitudinal vehicle guidance within only 1 ms. In particular, the VPC algorithm is included to reduce steering command latency to below actuator latency, preventing performance degradation during tight turns. The VPC algorithm uses road curvature data from MTUNet to estimate the appropriate correction for the current steering angle at a look-ahead point to adjust the turning amount. The inclusion of the VPC algorithm in a VPC-CILQR controller leads to higher performance on curvy roads than the use of CILQR alone. Our experiments demonstrate that the proposed autonomous driving system, which does not require high-definition maps, can be applied in current autonomous vehicles.
comment: 14 figures, 14 pages
♻ ☆ AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding
AI personal assistants deployed via robots or wearables require embodied understanding to collaborate with humans effectively. However, current Vision-Language Models (VLMs) primarily focus on third-person view videos, neglecting the richness of egocentric perceptual experience. To address this gap, we propose three key contributions. First, we introduce the Egocentric Video Understanding Dataset (EVUD) for training VLMs on video captioning and question answering tasks specific to egocentric videos. Second, we present AlanaVLM, a 7B parameter VLM trained using parameter-efficient methods on EVUD. Finally, we evaluate AlanaVLM's capabilities on OpenEQA, a challenging benchmark for embodied video question answering. Our model achieves state-of-the-art performance, outperforming open-source models including strong Socratic models using GPT-4 as a planner by 3.6%. Additionally, we outperform Claude 3 and Gemini Pro Vision 1.0 and showcase competitive results compared to Gemini Pro 1.5 and GPT-4V, even surpassing the latter in spatial reasoning. This research paves the way for building efficient VLMs that can be deployed in robots or wearables, leveraging embodied video understanding to collaborate seamlessly with humans in everyday tasks, contributing to the next generation of Embodied AI.
comment: Code available https://github.com/alanaai/EVUD
♻ ☆ Accelerating Complex Disease Treatment through Network Medicine and GenAI: A Case Study on Drug Repurposing for Breast Cancer
The objective of this research is to introduce a network specialized in predicting drugs that can be repurposed by investigating real-world evidence sources, such as clinical trials and biomedical literature. Specifically, it aims to generate drug combination therapies for complex diseases (e.g., cancer, Alzheimer's). We present a multilayered network medicine approach, empowered by a highly configured ChatGPT prompt engineering system, which is constructed on the fly to extract drug mentions in clinical trials. Additionally, we introduce a novel algorithm that connects real-world evidence with disease-specific signaling pathways (e.g., KEGG database). This sheds light on the repurposability of drugs if they are found to bind with one or more protein constituents of a signaling pathway. To demonstrate, we instantiated the framework for breast cancer and found that, out of 46 breast cancer signaling pathways, the framework identified 38 pathways that were covered by at least two drugs. This evidence signals the potential for combining those drugs. Specifically, the most covered signaling pathway, ID hsa:2064, was covered by 108 drugs, some of which can be combined. Conversely, the signaling pathway ID hsa:1499 was covered by only two drugs, indicating a significant gap for further research. Our network medicine framework, empowered by GenAI, shows promise in identifying drug combinations with a high degree of specificity, knowing the exact signaling pathways and proteins that serve as targets. It is noteworthy that ChatGPT successfully accelerated the process of identifying drug mentions in clinical trials, though further investigations are required to determine the relationships among the drug mentions.
comment: 9 pages double columns, 5 figures, 3 algorithms, 3 tables, and 1 listing, Submitted to IEEE MedAI'24 Conference, to be held November 15-17, Chongqing, China
♻ ☆ Jellyfish: A Large Language Model for Data Preprocessing
This paper explores the utilization of LLMs for data preprocessing (DP), a crucial step in the data mining pipeline that transforms raw data into a clean format conducive to easy processing. Whereas the use of LLMs has sparked interest in devising universal solutions to DP, recent initiatives in this domain typically rely on GPT APIs, raising inevitable data breach concerns. Unlike these approaches, we consider instruction-tuning local LLMs (7 -- 13B models) as universal DP task solvers that operate on a local, single, and low-priced GPU, ensuring data security and enabling further customization. We select a collection of datasets across four representative DP tasks and construct instruction tuning data using data configuration, knowledge injection, and reasoning data distillation techniques tailored to DP. By tuning Mistral-7B, Llama 3-8B, and OpenOrca-Platypus2-13B, our models, namely, Jellyfish-7B/8B/13B, deliver competitiveness compared to GPT-3.5/4 models and strong generalizability to unseen tasks while barely compromising the base models' abilities in NLP tasks. Meanwhile, Jellyfish offers enhanced reasoning capabilities compared to GPT-3.5. Our models are available at: https://huggingface.co/NECOUDBFM/Jellyfish . Our instruction dataset is available at: https://huggingface.co/datasets/NECOUDBFM/Jellyfish-Instruct .
comment: a.k.a. "Jellyfish: Instruction-Tuning Local Large Language Models for Data Preprocessing''
♻ ☆ Worse than Random? An Embarrassingly Simple Probing Evaluation of Large Multimodal Models in Medical VQA
Large Multimodal Models (LMMs) have shown remarkable progress in medical Visual Question Answering (Med-VQA), achieving high accuracy on existing benchmarks. However, their reliability under robust evaluation is questionable. This study reveals that when subjected to simple probing evaluation, state-of-the-art models perform worse than random guessing on medical diagnosis questions. To address this critical evaluation problem, we introduce the Probing Evaluation for Medical Diagnosis (ProbMed) dataset to rigorously assess LMM performance in medical imaging through probing evaluation and procedural diagnosis. Particularly, probing evaluation features pairing original questions with negation questions with hallucinated attributes, while procedural diagnosis requires reasoning across various diagnostic dimensions for each image, including modality recognition, organ identification, clinical findings, abnormalities, and positional grounding. Our evaluation reveals that top-performing models like GPT-4o, GPT-4V, and Gemini Pro perform worse than random guessing on specialized diagnostic questions, indicating significant limitations in handling fine-grained medical inquiries. Besides, models like LLaVA-Med struggle even with more general questions, and results from CheXagent demonstrate the transferability of expertise across different modalities of the same organ, showing that specialized domain knowledge is still crucial for improving performance. This study underscores the urgent need for more robust evaluation to ensure the reliability of LMMs in critical fields like medical diagnosis, and current LMMs are still far from applicable to those fields.
♻ ☆ Multi-Objective Quality-Diversity for Crystal Structure Prediction GECCO 2024
Crystal structures are indispensable across various domains, from batteries to solar cells, and extensive research has been dedicated to predicting their properties based on their atomic configurations. However, prevailing Crystal Structure Prediction methods focus on identifying the most stable solutions that lie at the global minimum of the energy function. This approach overlooks other potentially interesting materials that lie in neighbouring local minima and have different material properties such as conductivity or resistance to deformation. By contrast, Quality-Diversity algorithms provide a promising avenue for Crystal Structure Prediction as they aim to find a collection of high-performing solutions that have diverse characteristics. However, it may also be valuable to optimise for the stability of crystal structures alongside other objectives such as magnetism or thermoelectric efficiency. Therefore, in this work, we harness the power of Multi-Objective Quality-Diversity algorithms in order to find crystal structures which have diverse features and achieve different trade-offs of objectives. We analyse our approach on 5 crystal systems and demonstrate that it is not only able to re-discover known real-life structures, but also find promising new ones. Moreover, we propose a method for illuminating the objective space to gain an understanding of what trade-offs can be achieved.
comment: Accepted GECCO 2024
♻ ☆ MRHER: Model-based Relay Hindsight Experience Replay for Sequential Object Manipulation Tasks with Sparse Rewards
Sparse rewards pose a significant challenge to achieving high sample efficiency in goal-conditioned reinforcement learning (RL). Specifically, in sequential manipulation tasks, the agent receives failure rewards until it successfully completes the entire manipulation task, which leads to low sample efficiency. To tackle this issue and improve sample efficiency, we propose a novel model-based RL framework called Model-based Relay Hindsight Experience Replay (MRHER). MRHER breaks down a continuous task into subtasks with increasing complexity and utilizes the previous subtask to guide the learning of the subsequent one. Instead of using Hindsight Experience Replay (HER) in every subtask, we design a new robust model-based relabeling method called Foresight relabeling (FR). FR predicts the future trajectory of the hindsight state and relabels the expected goal as a goal achieved on the virtual future trajectory. By incorporating FR, MRHER effectively captures more information from historical experiences, leading to improved sample efficiency, particularly in object-manipulation environments. Experimental results demonstrate that MRHER exhibits state-of-the-art sample efficiency in benchmark tasks, outperforming RHER by 13.79% and 14.29% in the FetchPush-v1 environment and FetchPickandPlace-v1 environment, respectively.
♻ ☆ Efficient Data Generation for Source-grounded Information-seeking Dialogs: A Use Case for Meeting Transcripts
Automating data generation with Large Language Models (LLMs) has become increasingly popular. In this work, we investigate the feasibility and effectiveness of LLM-based data generation in the challenging setting of source-grounded information-seeking dialogs, with response attribution, over long documents. Our source texts consist of long and noisy meeting transcripts, adding to the task complexity. Since automating attribution remains difficult, we propose a semi-automatic approach: dialog queries and responses are generated with LLMs, followed by human verification and identification of attribution spans. Using this approach, we created MISeD -- Meeting Information Seeking Dialogs dataset -- a dataset of information-seeking dialogs focused on meeting transcripts. Models finetuned with MISeD demonstrate superior performance compared to off-the-shelf models, even those of larger size. Finetuning on MISeD gives comparable response generation quality to finetuning on fully manual data, while improving attribution quality and reducing time and effort.
♻ ☆ Graph Neural Networks in Histopathology: Emerging Trends and Future Directions
Histopathological analysis of Whole Slide Images (WSIs) has seen a surge in the utilization of deep learning methods, particularly Convolutional Neural Networks (CNNs). However, CNNs often fall short in capturing the intricate spatial dependencies inherent in WSIs. Graph Neural Networks (GNNs) present a promising alternative, adept at directly modeling pairwise interactions and effectively discerning the topological tissue and cellular structures within WSIs. Recognizing the pressing need for deep learning techniques that harness the topological structure of WSIs, the application of GNNs in histopathology has experienced rapid growth. In this comprehensive review, we survey GNNs in histopathology, discuss their applications, and explore emerging trends that pave the way for future advancements in the field. We begin by elucidating the fundamentals of GNNs and their potential applications in histopathology. Leveraging quantitative literature analysis, we identify four emerging trends: Hierarchical GNNs, Adaptive Graph Structure Learning, Multimodal GNNs, and Higher-order GNNs. Through an in-depth exploration of these trends, we offer insights into the evolving landscape of GNNs in histopathological analysis. Based on our findings, we propose future directions to propel the field forward. Our analysis serves to guide researchers and practitioners towards innovative approaches and methodologies, fostering advancements in histopathological analysis through the lens of graph neural networks.
♻ ☆ On the Worst Prompt Performance of Large Language Models
The performance of large language models (LLMs) is acutely sensitive to the phrasing of prompts, which raises significant concerns about their reliability in real-world scenarios. Existing studies often divide prompts into task-level instructions and case-level inputs and primarily focus on evaluating and improving robustness against variations in tasks-level instructions. However, this setup fails to fully address the diversity of real-world user queries and assumes the existence of task-specific datasets. To address these limitations, we introduce RobustAlpacaEval, a new benchmark that consists of semantically equivalent case-level queries and emphasizes the importance of using the worst prompt performance to gauge the lower bound of model performance. Extensive experiments on RobustAlpacaEval with ChatGPT and six open-source LLMs from the Llama, Mistral, and Gemma families uncover substantial variability in model performance; for instance, a difference of 45.48% between the worst and best performance for the Llama-2-70B-chat model, with its worst performance dipping as low as 9.38%. We further illustrate the difficulty in identifying the worst prompt from both model-agnostic and model-dependent perspectives, emphasizing the absence of a shortcut to characterize the worst prompt. We also attempt to enhance the worst prompt performance using existing prompt engineering and prompt consistency methods, but find that their impact is limited. These findings underscore the need to create more resilient LLMs that can maintain high performance across diverse prompts. Data and code are available at https://github.com/cbwbuaa/On-the-Worst-Prompt- Performance-of-LLMs.
♻ ☆ Provable Privacy with Non-Private Pre-Processing
When analysing Differentially Private (DP) machine learning pipelines, the potential privacy cost of data-dependent pre-processing is frequently overlooked in privacy accounting. In this work, we propose a general framework to evaluate the additional privacy cost incurred by non-private data-dependent pre-processing algorithms. Our framework establishes upper bounds on the overall privacy guarantees by utilising two new technical notions: a variant of DP termed Smooth DP and the bounded sensitivity of the pre-processing algorithms. In addition to the generic framework, we provide explicit overall privacy guarantees for multiple data-dependent pre-processing algorithms, such as data imputation, quantization, deduplication and PCA, when used in combination with several DP algorithms. Notably, this framework is also simple to implement, allowing direct integration into existing DP pipelines.
♻ ☆ Active Few-Shot Fine-Tuning
We study the question: How can we select the right data for fine-tuning to a specific task? We call this data selection problem active fine-tuning and show that it is an instance of transductive active learning, a novel generalization of classical active learning. We propose ITL, short for information-based transductive learning, an approach which samples adaptively to maximize information gained about the specified task. We are the first to show, under general regularity assumptions, that such decision rules converge uniformly to the smallest possible uncertainty obtainable from the accessible data. We apply ITL to the few-shot fine-tuning of large neural networks and show that fine-tuning with ITL learns the task with significantly fewer examples than the state-of-the-art.
♻ ☆ Contextual Knowledge Graph
Knowledge Graphs (KGs) are foundational structures in many AI applications, representing entities and their interrelations through triples. However, triple-based KGs lack the contextual information of relational knowledge, like temporal dynamics and provenance details, which are crucial for comprehensive knowledge representation and effective reasoning. Instead, \textbf{Contextual Knowledge Graphs} (CKGs) expand upon the conventional structure by incorporating additional information such as time validity, geographic location, and source provenance. This integration provides a more nuanced and accurate understanding of knowledge, enabling KGs to offer richer insights and support more sophisticated reasoning processes. In this work, we first discuss the inherent limitations of triple-based KGs and introduce the concept of contextual KGs, highlighting their advantages in knowledge representation and reasoning. We then present \textbf{KGR$^3$, a context-enriched KG reasoning paradigm} that leverages large language models (LLMs) to retrieve candidate entities and related contexts, rank them based on the retrieved information, and reason whether sufficient information has been obtained to answer a query. Our experimental results demonstrate that KGR$^3$ significantly improves performance on KG completion (KGC) and KG question answering (KGQA) tasks, validating the effectiveness of incorporating contextual information on KG representation and reasoning.
♻ ☆ Eight challenges in developing theory of intelligence
A good theory of mathematical beauty is more practical than any current observation, as new predictions of physical reality can be verified self-consistently. This belief applies to the current status of understanding deep neural networks including large language models and even the biological intelligence. Toy models provide a metaphor of physical reality, allowing mathematically formulating that reality (i.e., the so-called theory), which can be updated as more conjectures are justified or refuted. One does not need to pack all details into a model, but rather, more abstract models are constructed, as complex systems like brains or deep networks have many sloppy dimensions but much less stiff dimensions that strongly impact macroscopic observables. This kind of bottom-up mechanistic modeling is still promising in the modern era of understanding the natural or artificial intelligence. Here, we shed light on eight challenges in developing theory of intelligence following this theoretical paradigm. Theses challenges are representation learning, generalization, adversarial robustness, continual learning, causal learning, internal model of the brain, next-token prediction, and finally the mechanics of subjective experience.
comment: 24 pages, 131 references, revised version to journal
♻ ☆ Enhancing Actuarial Non-Life Pricing Models via Transformers
Currently, there is a lot of research in the field of neural networks for non-life insurance pricing. The usual goal is to improve the predictive power via neural networks while building upon the generalized linear model, which is the current industry standard. Our paper contributes to this current journey via novel methods to enhance actuarial non-life models with transformer models for tabular data. We build here upon the foundation laid out by the combined actuarial neural network as well as the localGLMnet and enhance those models via the feature tokenizer transformer. The manuscript demonstrates the performance of the proposed methods on a real-world claim frequency dataset and compares them with several benchmark models such as generalized linear models, feed-forward neural networks, combined actuarial neural networks, LocalGLMnet, and pure feature tokenizer transformer. The paper shows that the new methods can achieve better results than the benchmark models while preserving certain generalized linear model advantages. The paper also discusses the practical implications and challenges of applying transformer models in actuarial settings.
comment: This preprint manuscript was uploaded to arXiv on November 10, 2023. It has not undergone peer review or any post-submission improvements or corrections. The peer reviewed Version of Record of this article is published in the European Actuarial Journal, and is available online at https://doi.org/10.1007/s13385-024-00388-2
♻ ☆ From Words to Worlds: Transforming One-line Prompt into Immersive Multi-modal Digital Stories with Communicative LLM Agent
Digital storytelling, essential in entertainment, education, and marketing, faces challenges in production scalability and flexibility. The StoryAgent framework, introduced in this paper, utilizes Large Language Models and generative tools to automate and refine digital storytelling. Employing a top-down story drafting and bottom-up asset generation approach, StoryAgent tackles key issues such as manual intervention, interactive scene orchestration, and narrative consistency. This framework enables efficient production of interactive and consistent narratives across multiple modalities, democratizing content creation and enhancing engagement. Our results demonstrate the framework's capability to produce coherent digital stories without reference videos, marking a significant advancement in automated digital storytelling.
comment: 16 pages, 13 figures
♻ ☆ SRL: Scaling Distributed Reinforcement Learning to Over Ten Thousand Cores ICLR 2024
The ever-growing complexity of reinforcement learning (RL) tasks demands a distributed system to efficiently generate and process a massive amount of data. However, existing open-source libraries suffer from various limitations, which impede their practical use in challenging scenarios where large-scale training is necessary. In this paper, we present a novel abstraction on the dataflows of RL training, which unifies diverse RL training applications into a general framework. Following this abstraction, we develop a scalable, efficient, and extensible distributed RL system called ReaLlyScalableRL, which allows efficient and massively parallelized training and easy development of customized algorithms. Our evaluation shows that SRL outperforms existing academic libraries, reaching at most 21x higher training throughput in a distributed setting. On learning performance, beyond performing and scaling well on common RL benchmarks with different RL algorithms, SRL can reproduce the same solution in the challenging hide-and-seek environment as reported by OpenAI with up to 5x speedup in wall-clock time. Notably, SRL is the first in the academic community to perform RL experiments at a large scale with over 15k CPU cores. SRL source code is available at: https://github.com/openpsi-project/srl .
comment: Published at ICLR 2024. 10 pages (24 pages with references and appendix), 7 figures
♻ ☆ TEaR: Improving LLM-based Machine Translation with Systematic Self-Refinement
Large Language Models (LLMs) have achieved impressive results in Machine Translation (MT). However, careful evaluations by human reveal that the translations produced by LLMs still contain multiple errors. Importantly, feeding back such error information into the LLMs can lead to self-refinement and result in improved translation performance. Motivated by these insights, we introduce a systematic LLM-based self-refinement translation framework, named \textbf{TEaR}, which stands for \textbf{T}ranslate, \textbf{E}stimate, \textbf{a}nd \textbf{R}efine, marking a significant step forward in this direction. Our findings demonstrate that 1) our self-refinement framework successfully assists LLMs in improving their translation quality across a wide range of languages, whether it's from high-resource languages to low-resource ones or whether it's English-centric or centered around other languages; 2) TEaR exhibits superior systematicity and interpretability; 3) different estimation strategies yield varied impacts, directly affecting the effectiveness of the final corrections. Additionally, traditional neural translation models and evaluation models operate separately, often focusing on singular tasks due to their limited capabilities, while general-purpose LLMs possess the capability to undertake both tasks simultaneously. We further conduct cross-model correction experiments to investigate the potential relationship between the translation and evaluation capabilities of general-purpose LLMs. Our code and data are available at https://github.com/fzp0424/self_correct_mt
comment: Our code and data are available at https://github.com/fzp0424/self_correct_mt
♻ ☆ The Impact of AI on Perceived Job Decency and Meaningfulness: A Case Study
The proliferation of Artificial Intelligence (AI) in workplaces stands to change the way humans work, with job satisfaction intrinsically linked to work life. Existing research on human-AI collaboration tends to prioritize performance over the experiential aspects of work. In contrast, this paper explores the impact of AI on job decency and meaningfulness in workplaces. Through interviews in the Information Technology (IT) domain, we not only examined the current work environment, but also explored the perceived evolution of the workplace ecosystem with the introduction of an AI. Findings from the preliminary exploratory study reveal that respondents tend to visualize a workplace where humans continue to play a dominant role, even with the introduction of advanced AIs. In this prospective scenario, AI is seen as serving as a complement rather than replacing the human workforce. Furthermore, respondents believe that the introduction of AI will maintain or potentially increase overall job satisfaction.
♻ ☆ Twin Transformer using Gated Dynamic Learnable Attention mechanism for Fault Detection and Diagnosis in the Tennessee Eastman Process
Fault detection and diagnosis (FDD) is a crucial task for ensuring the safety and efficiency of industrial processes. We propose a novel FDD methodology for the Tennessee Eastman Process (TEP), a widely used benchmark for chemical process control. The model employs two separate Transformer branches, enabling independent processing of input data and potential extraction of diverse information. A novel attention mechanism, Gated Dynamic Learnable Attention (GDLAttention), is introduced which integrates a gating mechanism and dynamic learning capabilities. The gating mechanism modulates the attention weights, allowing the model to focus on the most relevant parts of the input. The dynamic learning approach adapts the attention strategy during training, potentially leading to improved performance. The attention mechanism uses a bilinear similarity function, providing greater flexibility in capturing complex relationships between query and key vectors. In order to assess the effectiveness of our approach, we tested it against 21 and 18 distinct fault scenarios in TEP, and compared its performance with several established FDD techniques. The outcomes indicate that the method outperforms others in terms of accuracy, false alarm rate, and misclassification rate. This underscores the robustness and efficacy of the approach for FDD in intricate industrial processes.
♻ ☆ What if...?: Thinking Counterfactual Keywords Helps to Mitigate Hallucination in Large Multi-modal Models
This paper presents a way of enhancing the reliability of Large Multi-modal Models (LMMs) in addressing hallucination, where the models generate cross-modal inconsistent responses. Without additional training, we propose Counterfactual Inception, a novel method that implants counterfactual thinking into LMMs using self-generated counterfactual keywords. Our method is grounded in the concept of counterfactual thinking, a cognitive process where human considers alternative realities, enabling more extensive context exploration. Bridging the human cognition mechanism into LMMs, we aim for the models to engage with and generate responses that span a wider contextual scene understanding, mitigating hallucinatory outputs. We further introduce Plausibility Verification Process (PVP), a simple yet robust keyword constraint that effectively filters out sub-optimal keywords to enable the consistent triggering of counterfactual thinking in the model responses. Comprehensive analyses across various LMMs, including both open-source and proprietary models, corroborate that counterfactual thinking significantly reduces hallucination and helps to broaden contextual understanding based on true visual clues.
comment: Project page: https://ivy-lvlm.github.io/Counterfactual-Inception/
♻ ☆ Empowering Multi-step Reasoning across Languages via Tree-of-Thoughts NAACL 2024
Reasoning methods, best exemplified by the well-known Chain-of-Thought (CoT), empower the reasoning abilities of Large Language Models (LLMs) by eliciting them to solve complex tasks in a step-by-step manner. Although they are achieving significant success, the ability to deliver multi-step reasoning remains limited to English because of the imbalance in the distribution of pre-training data, which makes other languages a barrier. In this paper, we propose Cross-lingual Tree-of-Thoughts (Cross-ToT), a method for aligning Cross-lingual CoT reasoning across languages. The proposed method, through a self-consistent cross-lingual prompting mechanism inspired by the Tree-of-Thoughts approach, provides multi-step reasoning paths in different languages that, during the steps, lead to the final solution. Experimental evaluations show that our method significantly outperforms existing prompting methods by reducing the number of interactions and achieving state-of-the-art performance.
comment: Findings of the Association for Computational Linguistics: NAACL 2024
♻ ☆ Automatic Counting and Classification of Mosquito Eggs in Field Traps
The analysis of the field traps where the mosquitoes insert their eggs is vital to check that the sterile insect technique (SIT) is working properly. This is because the number of hatched eggs may indicate that the sterile males are not competing with the wild ones. Nowadays, the study of the traps is done manually by microscope and is very time-consuming and prone to human error. This paper presents an automatic trap survey. For this purpose, a device has been designed that automatically scans the slat obtaining different overlapping photos. Subsequently, the images are analyzed by a Mask-RCNN neural network that segments the eggs and classifies them into 2 classes: full or hatch
♻ ☆ Deep learning empowered sensor fusion to improve infant movement classification
There is a recent boom in the development of AI solutions to facilitate and enhance diagnostic procedures for established clinical tools. To assess the integrity of the developing nervous system, the Prechtl general movement assessment (GMA) is recognized for its clinical value in diagnosing neurological impairments in early infancy. GMA has been increasingly augmented through machine learning approaches intending to scale-up its application, circumvent costs in the training of human assessors and further standardize classification of spontaneous motor patterns. Available deep learning tools, all of which are based on single sensor modalities, are however still considerably inferior to that of well-trained human assessors. These approaches are hardly comparable as all models are designed, trained and evaluated on proprietary/silo-data sets. With this study we propose a sensor fusion approach for assessing fidgety movements (FMs) comparing three different sensor modalities (pressure, inertial, and visual sensors). Various combinations and two sensor fusion approaches (late and early fusion) for infant movement classification were tested to evaluate whether a multi-sensor system outperforms single modality assessments. The performance of the three-sensor fusion (classification accuracy of 94.5\%) was significantly higher than that of any single modality evaluated, suggesting the sensor fusion approach is a promising avenue for automated classification of infant motor patterns. The development of a robust sensor fusion system may significantly enhance AI-based early recognition of neurofunctions, ultimately facilitating automated early detection of neurodevelopmental conditions.
♻ ☆ REVEAL-IT: REinforcement learning with Visibility of Evolving Agent poLicy for InTerpretability
Understanding the agent's learning process, particularly the factors that contribute to its success or failure post-training, is crucial for comprehending the rationale behind the agent's decision-making process. Prior methods clarify the learning process by creating a structural causal model (SCM) or visually representing the distribution of value functions. Nevertheless, these approaches have constraints as they exclusively function in 2D-environments or with uncomplicated transition dynamics. Understanding the agent's learning process in complicated environments or tasks is more challenging. In this paper, we propose REVEAL-IT, a novel framework for explaining the learning process of an agent in complex environments. Initially, we visualize the policy structure and the agent's learning process for various training tasks. By visualizing these findings, we can understand how much a particular training task or stage affects the agent's performance in test. Then, a GNN-based explainer learns to highlight the most important section of the policy, providing a more clear and robust explanation of the agent's learning process. The experiments demonstrate that explanations derived from this framework can effectively help in the optimization of the
comment: arXiv admin note: text overlap with arXiv:2307.01452 by other authors
♻ ☆ DiffTOP: Differentiable Trajectory Optimization for Deep Reinforcement and Imitation Learning
This paper introduces DiffTOP, which utilizes Differentiable Trajectory OPtimization as the policy representation to generate actions for deep reinforcement and imitation learning. Trajectory optimization is a powerful and widely used algorithm in control, parameterized by a cost and a dynamics function. The key to our approach is to leverage the recent progress in differentiable trajectory optimization, which enables computing the gradients of the loss with respect to the parameters of trajectory optimization. As a result, the cost and dynamics functions of trajectory optimization can be learned end-to-end. DiffTOP addresses the ``objective mismatch'' issue of prior model-based RL algorithms, as the dynamics model in DiffTOP is learned to directly maximize task performance by differentiating the policy gradient loss through the trajectory optimization process. We further benchmark DiffTOP for imitation learning on standard robotic manipulation task suites with high-dimensional sensory observations and compare our method to feed-forward policy classes as well as Energy-Based Models (EBM) and Diffusion. Across 15 model-based RL tasks and 35imitation learning tasks with high-dimensional image and point cloud inputs, DiffTOP outperforms prior state-of-the-art methods in both domains.
♻ ☆ Fusion-PSRO: Nash Policy Fusion for Policy Space Response Oracles
A popular approach for solving zero-sum games is to maintain populations of policies to approximate the Nash Equilibrium (NE). Previous studies have shown that Policy Space Response Oracle (PSRO) algorithm is an effective multi-agent reinforcement learning framework for solving such games. However, repeatedly training new policies from scratch to approximate Best Response (BR) to opponents' mixed policies at each iteration is both inefficient and costly. While some PSRO variants initialize a new policy by inheriting from past BR policies, this approach limits the exploration of new policies, especially against challenging opponents. To address this issue, we propose Fusion-PSRO, which employs policy fusion to initialize policies for better approximation to BR. By selecting high-quality base policies from meta-NE, policy fusion fuses the base policies into a new policy through model averaging. This approach allows the initialized policies to incorporate multiple expert policies, making it easier to handle difficult opponents compared to inheriting from past BR policies or initializing from scratch. Moreover, our method only modifies the policy initialization phase, allowing its application to nearly all PSRO variants without additional training overhead. Our experiments on non-transitive matrix games, Leduc Poker, and the more complex Liars Dice demonstrate that Fusion-PSRO enhances the performance of nearly all PSRO variants, achieving lower exploitability.
comment: 20 pages, 5 figures
♻ ☆ Concept Prerequisite Relation Prediction by Using Permutation-Equivariant Directed Graph Neural Networks
This paper studies the problem of CPRP, concept prerequisite relation prediction, which is a fundamental task in using AI for education. CPRP is usually formulated into a link-prediction task on a relationship graph of concepts and solved by training the graph neural network (GNN) model. However, current directed GNNs fail to manage graph isomorphism which refers to the invariance of non-isomorphic graphs, reducing the expressivity of resulting representations. We present a permutation-equivariant directed GNN model by introducing the Weisfeiler-Lehman test into directed GNN learning. Our method is then used for CPRP and evaluated on three public datasets. The experimental results show that our model delivers better prediction performance than the state-of-the-art methods.
comment: 9 pages, 1figure, 1 Table (A perprint version)
♻ ☆ SHMamba: Structured Hyperbolic State Space Model for Audio-Visual Question Answering
The Audio-Visual Question Answering (AVQA) task holds significant potential for applications. Compared to traditional unimodal approaches, the multi-modal input of AVQA makes feature extraction and fusion processes more challenging. Euclidean space is difficult to effectively represent multi-dimensional relationships of data. Especially when extracting and processing data with a tree structure or hierarchical structure, Euclidean space is not suitable as an embedding space. Additionally, the self-attention mechanism in Transformers is effective in capturing the dynamic relationships between elements in a sequence. However, the self-attention mechanism's limitations in window modeling and quadratic computational complexity reduce its effectiveness in modeling long sequences. To address these limitations, we propose SHMamba: Structured Hyperbolic State Space Model to integrate the advantages of hyperbolic geometry and state space models. Specifically, SHMamba leverages the intrinsic properties of hyperbolic space to represent hierarchical structures and complex relationships in audio-visual data. Meanwhile, the state space model captures dynamic changes over time by globally modeling the entire sequence. Furthermore, we introduce an adaptive curvature hyperbolic alignment module and a cross fusion block to enhance the understanding of hierarchical structures and the dynamic exchange of cross-modal information, respectively. Extensive experiments demonstrate that SHMamba outperforms previous methods with fewer parameters and computational costs. Our learnable parameters are reduced by 78.12\%, while the average performance improves by 2.53\%. Experiments show that our method demonstrates superiority among all current major methods and is more suitable for practical application scenarios.
♻ ☆ Optimization Techniques for Unsupervised Complex Table Reasoning via Self-Training Framework
Structured tabular data is a fundamental data type in numerous fields, and the capacity to reason over tables is crucial for answering questions and validating hypotheses. However, constructing labeled data for complex reasoning tasks is labor intensive, and the quantity of annotated data remains insufficient to support the intricate demands of real-world applications. To address the insufficient annotation challenge, we present a self-training framework for unsupervised complex tabular reasoning (UCTR-ST) by generating diverse synthetic data with complex logic. Specifically, UCTR-ST incorporates several essential techniques: we aggregate diverse programs and execute them on tables based on a "Program-Management" component, and we bridge the gap between programs and text with a powerful "Program-Transformation" module that generates natural language sentences with complex logic. Furthermore, we optimize the procedure using a "Table-Text Manipulator" to handle joint table-text reasoning scenarios. The entire framework utilizes self-training techniques to leverage the unlabeled training data, which results in significant performance improvements when tested on real-world data. Experimental results demonstrate that UCTRST achieves above 90% of the supervised model performance on different tasks and domains, reducing the dependence on manual annotation. Additionally, our approach can serve as a data augmentation technique, significantly boosting the performance of supervised models in low-resourced domains.
comment: Submitted to TKDE, preprint version
♻ ☆ FVEL: Interactive Formal Verification Environment with Large Language Models via Theorem Proving
Formal verification (FV) has witnessed growing significance with current emerging program synthesis by the evolving large language models (LLMs). However, current formal verification mainly resorts to symbolic verifiers or hand-craft rules, resulting in limitations for extensive and flexible verification. On the other hand, formal languages for automated theorem proving, such as Isabelle, as another line of rigorous verification, are maintained with comprehensive rules and theorems. In this paper, we propose FVEL, an interactive Formal Verification Environment with LLMs. Specifically, FVEL transforms a given code to be verified into Isabelle, and then conducts verification via neural automated theorem proving with an LLM. The joined paradigm leverages the rigorous yet abundant formulated and organized rules in Isabelle and is also convenient for introducing and adjusting cutting-edge LLMs. To achieve this goal, we extract a large-scale FVELER3. The FVELER dataset includes code dependencies and verification processes that are formulated in Isabelle, containing 758 theories, 29,125 lemmas, and 200,646 proof steps in total with in-depth dependencies. We benchmark FVELER in the FVEL environment by first fine-tuning LLMs with FVELER and then evaluating them on Code2Inv and SV-COMP. The results show that FVEL with FVELER fine-tuned Llama3- 8B solves 17.39% (69 -> 81) more problems, and Mistral-7B 12% (75 -> 84) more problems in SV-COMP. And the proportion of proof errors is reduced. Project page: https://fveler.github.io/.
♻ ☆ QxEAI: Quantum-like evolutionary algorithm for automated probabilistic forecasting
Forecasting, to estimate future events, is crucial for business and decision-making. This paper proposes QxEAI, a methodology that produces a probabilistic forecast that utilizes a quantum-like evolutionary algorithm based on training a quantum-like logic decision tree and a classical value tree on a small number of related time series. We demonstrate how the application of our quantum-like evolutionary algorithm to forecasting can overcome the challenges faced by classical and other machine learning approaches. By using three real-world datasets (Dow Jones Index, retail sales, gas consumption), we show how our methodology produces accurate forecasts while requiring little to none manual work.
♻ ☆ Hierarchical Path-planning from Speech Instructions with Spatial Concept-based Topometric Semantic Mapping
Assisting individuals in their daily activities through autonomous mobile robots, especially for users without specialized knowledge, is crucial. Specifically, the capability of robots to navigate to destinations based on human speech instructions is essential. While robots can take different paths to the same goal, the shortest path is not always the best. A preferred approach is to accommodate waypoint specifications flexibly, planning an improved alternative path, even with detours. Additionally, robots require real-time inference capabilities. This study aimed to realize a hierarchical spatial representation using a topometric semantic map and path planning with speech instructions, including waypoints. This paper presents Spatial Concept-based Topometric Semantic Mapping for Hierarchical Path Planning (SpCoTMHP), integrating place connectivity. This approach offers a novel integrated probabilistic generative model and fast approximate inference across hierarchy levels. A formulation based on control as probabilistic inference theoretically supports the proposed path planning algorithm. We conducted experiments in home environments using the Toyota Human Support Robot on the SIGVerse simulator and in a lab-office environment with the real robot, Albert. Users issued speech commands specifying the waypoint and goal, such as "Go to the bedroom via the corridor." Navigation experiments using speech instructions with a waypoint demonstrated a performance improvement of SpCoTMHP over the baseline hierarchical path planning method with heuristic path costs (HPP-I), in terms of the weighted success rate at which the robot reaches the closest target and passes the correct waypoints, by 0.590. The computation time was significantly accelerated by 7.14 seconds with SpCoTMHP compared to baseline HPP-I in advanced tasks.
comment: Accepted to Frontiers in Robotics and AI
♻ ☆ SPL: A Socratic Playground for Learning Powered by Large Language Model
Dialogue-based Intelligent Tutoring Systems (ITSs) have significantly advanced adaptive and personalized learning by automating sophisticated human tutoring strategies within interactive dialogues. However, replicating the nuanced patterns of expert human communication remains a challenge in Natural Language Processing (NLP). Recent advancements in NLP, particularly Large Language Models (LLMs) such as OpenAI's GPT-4, offer promising solutions by providing human-like and context-aware responses based on extensive pre-trained knowledge. Motivated by the effectiveness of LLMs in various educational tasks (e.g., content creation and summarization, problem-solving, and automated feedback provision), our study introduces the Socratic Playground for Learning (SPL), a dialogue-based ITS powered by the GPT-4 model, which employs the Socratic teaching method to foster critical thinking among learners. Through extensive prompt engineering, SPL can generate specific learning scenarios and facilitates efficient multi-turn tutoring dialogues. The SPL system aims to enhance personalized and adaptive learning experiences tailored to individual needs, specifically focusing on improving critical thinking skills. Our pilot experimental results from essay writing tasks demonstrate SPL has the potential to improve tutoring interactions and further enhance dialogue-based ITS functionalities. Our study, exemplified by SPL, demonstrates how LLMs enhance dialogue-based ITSs and expand the accessibility and efficacy of educational technologies.
♻ ☆ Measuring Sample Importance in Data Pruning for Training LLMs from a Data Compression Perspective
Compute-efficient training of large language models (LLMs) has become an important research problem. In this work, we consider data pruning as a method of data-efficient training of LLMs, where we take a data compression view on data pruning. We argue that the amount of information of a sample, or the achievable compression on its description length, represents its sample importance. The key idea is that, less informative samples are likely to contain redundant information, and thus should be pruned first. We leverage log-likelihood function of trained models as a surrogate to measure information content of samples. Experiments reveal a surprising insight that information-based pruning can enhance the generalization capability of the model, improves upon language modeling and downstream tasks as compared to the model trained on the entire dataset.
♻ ☆ Accelerating Approximate Thompson Sampling with Underdamped Langevin Monte Carlo
Approximate Thompson sampling with Langevin Monte Carlo broadens its reach from Gaussian posterior sampling to encompass more general smooth posteriors. However, it still encounters scalability issues in high-dimensional problems when demanding high accuracy. To address this, we propose an approximate Thompson sampling strategy, utilizing underdamped Langevin Monte Carlo, where the latter is the go-to workhorse for simulations of high-dimensional posteriors. Based on the standard smoothness and log-concavity conditions, we study the accelerated posterior concentration and sampling using a specific potential function. This design improves the sample complexity for realizing logarithmic regrets from $\mathcal{\tilde O}(d)$ to $\mathcal{\tilde O}(\sqrt{d})$. The scalability and robustness of our algorithm are also empirically validated through synthetic experiments in high-dimensional bandit problems.
comment: 52 pages, 2 figures
♻ ☆ SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths
Speculative decoding reduces the inference latency of a target large language model via utilizing a smaller and faster draft model. Its performance depends on a hyperparameter K -- the candidate length, i.e., the number of candidate tokens for the target model to verify in each round. However, previous methods often use simple heuristics to choose K, which may result in sub-optimal performance. We study the choice of the candidate length K and formulate it as a Markov Decision Process. We theoretically show that the optimal policy of this Markov decision process takes the form of a threshold policy, i.e., the current speculation should stop and be verified when the probability of getting a rejection exceeds a threshold value. Motivated by this theory, we propose SpecDec++, an enhanced version of speculative decoding that adaptively determines the candidate length on the fly. We augment the draft model with a trained acceptance prediction head to predict the conditional acceptance probability of the candidate tokens. SpecDec++ will stop the current speculation when the predicted probability that at least one token gets rejected exceeds a threshold. We implement SpecDec++ and apply it to the llama-2-chat 7B & 70B model pair. Our adaptive method achieves a 2.04x speedup on the Alpaca dataset (an additional 7.2% improvement over the baseline speculative decoding). On the GSM8K and HumanEval datasets, our method achieves a 2.26x speedup (9.4% improvement) and 2.23x speedup (11.1% improvement), respectively.
comment: v2: fix Table 1
♻ ☆ MeGA: Merging Multiple Independently Trained Neural Networks Based on Genetic Algorithm
In this paper, we introduce a novel method for merging the weights of multiple pre-trained neural networks using a genetic algorithm called MeGA. Traditional techniques, such as weight averaging and ensemble methods, often fail to fully harness the capabilities of pre-trained networks. Our approach leverages a genetic algorithm with tournament selection, crossover, and mutation to optimize weight combinations, creating a more effective fusion. This technique allows the merged model to inherit advantageous features from both parent models, resulting in enhanced accuracy and robustness. Through experiments on the CIFAR-10 dataset, we demonstrate that our genetic algorithm-based weight merging method improves test accuracy compared to individual models and conventional methods. This approach provides a scalable solution for integrating multiple pre-trained networks across various deep learning applications. Github is available at: https://github.com/YUNBLAK/MeGA-Merging-Multiple-Independently-Trained-Neural-Networks-Based-on-Genetic-Algorithm
♻ ☆ FedSecurity: Benchmarking Attacks and Defenses in Federated Learning and Federated LLMs
This paper introduces FedSecurity, an end-to-end benchmark that serves as a supplementary component of the FedML library for simulating adversarial attacks and corresponding defense mechanisms in Federated Learning (FL). FedSecurity eliminates the need for implementing the fundamental FL procedures, e.g., FL training and data loading, from scratch, thus enables users to focus on developing their own attack and defense strategies. It contains two key components, including FedAttacker that conducts a variety of attacks during FL training, and FedDefender that implements defensive mechanisms to counteract these attacks. FedSecurity has the following features: i) It offers extensive customization options to accommodate a broad range of machine learning models (e.g., Logistic Regression, ResNet, and GAN) and FL optimizers (e.g., FedAVG, FedOPT, and FedNOVA); ii) it enables exploring the effectiveness of attacks and defenses across different datasets and models; and iii) it supports flexible configuration and customization through a configuration file and some APIs. We further demonstrate FedSecurity's utility and adaptability through federated training of Large Language Models (LLMs) to showcase its potential on a wide range of complex applications.
Information Retrieval 19
☆ STARD: A Chinese Statute Retrieval Dataset with Real Queries Issued by Non-professionals
Statute retrieval aims to find relevant statutory articles for specific queries. This process is the basis of a wide range of legal applications such as legal advice, automated judicial decisions, legal document drafting, etc. Existing statute retrieval benchmarks focus on formal and professional queries from sources like bar exams and legal case documents, thereby neglecting non-professional queries from the general public, which often lack precise legal terminology and references. To address this gap, we introduce the STAtute Retrieval Dataset (STARD), a Chinese dataset comprising 1,543 query cases collected from real-world legal consultations and 55,348 candidate statutory articles. Unlike existing statute retrieval datasets, which primarily focus on professional legal queries, STARD captures the complexity and diversity of real queries from the general public. Through a comprehensive evaluation of various retrieval baselines, we reveal that existing retrieval approaches all fall short of these real queries issued by non-professional users. The best method only achieves a Recall@100 of 0.907, suggesting the necessity for further exploration and additional research in this area. All the codes and datasets are available at: https://github.com/oneal2000/STARD/tree/main
☆ Towards Fine-Grained Citation Evaluation in Generated Text: A Comparative Analysis of Faithfulness Metrics
Large language models (LLMs) often produce unsupported or unverifiable information, known as "hallucinations." To mitigate this, retrieval-augmented LLMs incorporate citations, grounding the content in verifiable sources. Despite such developments, manually assessing how well a citation supports the associated statement remains a major challenge. Previous studies use faithfulness metrics to estimate citation support automatically but are limited to binary classification, overlooking fine-grained citation support in practical scenarios. To investigate the effectiveness of faithfulness metrics in fine-grained scenarios, we propose a comparative evaluation framework that assesses the metric effectiveness in distinguishinging citations between three-category support levels: full, partial, and no support. Our framework employs correlation analysis, classification evaluation, and retrieval evaluation to measure the alignment between metric scores and human judgments comprehensively. Our results show no single metric consistently excels across all evaluations, revealing the complexity of assessing fine-grained support. Based on the findings, we provide practical recommendations for developing more effective metrics.
comment: 12 pages, 3 figures
☆ Retrieval Augmented Zero-Shot Text Classification SIGIR
Zero-shot text learning enables text classifiers to handle unseen classes efficiently, alleviating the need for task-specific training data. A simple approach often relies on comparing embeddings of query (text) to those of potential classes. However, the embeddings of a simple query sometimes lack rich contextual information, which hinders the classification performance. Traditionally, this has been addressed by improving the embedding model with expensive training. We introduce QZero, a novel training-free knowledge augmentation approach that reformulates queries by retrieving supporting categories from Wikipedia to improve zero-shot text classification performance. Our experiments across six diverse datasets demonstrate that QZero enhances performance for state-of-the-art static and contextual embedding models without the need for retraining. Notably, in News and medical topic classification tasks, QZero improves the performance of even the largest OpenAI embedding model by at least 5% and 3%, respectively. Acting as a knowledge amplifier, QZero enables small word embedding models to achieve performance levels comparable to those of larger contextual models, offering the potential for significant computational savings. Additionally, QZero offers meaningful insights that illuminate query context and verify topic relevance, aiding in understanding model predictions. Overall, QZero improves embedding-based zero-shot classifiers while maintaining their simplicity. This makes it particularly valuable for resource-constrained environments and domains with constantly evolving information.
comment: Proceedings of the 2024 ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR '24), July 13, 2024, Washington DC, DC, USA
☆ UDA: A Benchmark Suite for Retrieval Augmented Generation in Real-world Document Analysis
The use of Retrieval-Augmented Generation (RAG) has improved Large Language Models (LLMs) in collaborating with external data, yet significant challenges exist in real-world scenarios. In areas such as academic literature and finance question answering, data are often found in raw text and tables in HTML or PDF formats, which can be lengthy and highly unstructured. In this paper, we introduce a benchmark suite, namely Unstructured Document Analysis (UDA), that involves 2,965 real-world documents and 29,590 expert-annotated Q&A pairs. We revisit popular LLM- and RAG-based solutions for document analysis and evaluate the design choices and answer qualities across multiple document domains and diverse query types. Our evaluation yields interesting findings and highlights the importance of data parsing and retrieval. We hope our benchmark can shed light and better serve real-world document analysis applications. The benchmark suite and code can be found at https://github.com/qinchuanhui/UDA-Benchmark.
☆ Évaluation des capacités de réponse de larges modèles de langage (LLM) pour des questions d'historiens
Large Language Models (LLMs) like ChatGPT or Bard have revolutionized information retrieval and captivated the audience with their ability to generate custom responses in record time, regardless of the topic. In this article, we assess the capabilities of various LLMs in producing reliable, comprehensive, and sufficiently relevant responses about historical facts in French. To achieve this, we constructed a testbed comprising numerous history-related questions of varying types, themes, and levels of difficulty. Our evaluation of responses from ten selected LLMs reveals numerous shortcomings in both substance and form. Beyond an overall insufficient accuracy rate, we highlight uneven treatment of the French language, as well as issues related to verbosity and inconsistency in the responses provided by LLMs.
comment: in French language
☆ Hierarchical thematic classification of major conference proceedings
In this paper, we develop a decision support system for the hierarchical text classification. We consider text collections with a fixed hierarchical structure of topics given by experts in the form of a tree. The system sorts the topics by relevance to a given document. The experts choose one of the most relevant topics to finish the classification. We propose a weighted hierarchical similarity function to calculate topic relevance. The function calculates the similarity of a document and a tree branch. The weights in this function determine word importance. We use the entropy of words to estimate the weights. The proposed hierarchical similarity function formulates a joint hierarchical thematic classification probability model of the document topics, parameters, and hyperparameters. The variational Bayesian inference gives a closed-form EM algorithm. The EM algorithm estimates the parameters and calculates the probability of a topic for a given document. Compared to hierarchical multiclass SVM, hierarchical PLSA with adaptive regularization, and hierarchical naive Bayes, the weighted hierarchical similarity function has better improvement in ranking accuracy in an abstract collection of a major conference EURO and a website collection of industrial companies.
☆ A Tale of Trust and Accuracy: Base vs. Instruct LLMs in RAG Systems
Retrieval Augmented Generation (RAG) represents a significant advancement in artificial intelligence combining a retrieval phase with a generative phase, with the latter typically being powered by large language models (LLMs). The current common practices in RAG involve using "instructed" LLMs, which are fine-tuned with supervised training to enhance their ability to follow instructions and are aligned with human preferences using state-of-the-art techniques. Contrary to popular belief, our study demonstrates that base models outperform their instructed counterparts in RAG tasks by 20% on average under our experimental settings. This finding challenges the prevailing assumptions about the superiority of instructed LLMs in RAG applications. Further investigations reveal a more nuanced situation, questioning fundamental aspects of RAG and suggesting the need for broader discussions on the topic; or, as Fromm would have it, "Seldom is a glance at the statistics enough to understand the meaning of the figures".
☆ IDentity with Locality: An ideal hash for gene sequence search
Gene sequence search is a fundamental operation in computational genomics. Due to the petabyte scale of genome archives, most gene search systems now use hashing-based data structures such as Bloom Filters (BF). The state-of-the-art systems such as Compact bit-slicing signature index (COBS) and Repeated And Merged Bloom filters (RAMBO) use BF with Random Hash (RH) functions for gene representation and identification. The standard recipe is to cast the gene search problem as a sequence of membership problems testing if each subsequent gene substring (called kmer) of Q is present in the set of kmers of the entire gene database D. We observe that RH functions, which are crucial to the memory and the computational advantage of BF, are also detrimental to the system performance of gene-search systems. While subsequent kmers being queried are likely very similar, RH, oblivious to any similarity, uniformly distributes the kmers to different parts of potentially large BF, thus triggering excessive cache misses and causing system slowdown. We propose a novel hash function called the Identity with Locality (IDL) hash family, which co-locates the keys close in input space without causing collisions. This approach ensures both cache locality and key preservation. IDL functions can be a drop-in replacement for RH functions and help improve the performance of information retrieval systems. We give a simple but practical construction of IDL function families and show that replacing the RH with IDL functions reduces cache misses by a factor of 5x, thus improving query and indexing times of SOTA methods such as COBS and RAMBO by factors up to 2x without compromising their quality. We also provide a theoretical analysis of the false positive rate of BF with IDL functions. Our hash function is the first study that bridges Locality Sensitive Hash (LSH) and RH to obtain cache efficiency.
comment: 13 pages
☆ Decoding Matters: Addressing Amplification Bias and Homogeneity Issue for LLM-based Recommendation
Adapting Large Language Models (LLMs) for recommendation requires careful consideration of the decoding process, given the inherent differences between generating items and natural language. Existing approaches often directly apply LLMs' original decoding methods. However, we find these methods encounter significant challenges: 1) amplification bias -- where standard length normalization inflates scores for items containing tokens with generation probabilities close to 1 (termed ghost tokens), and 2) homogeneity issue -- generating multiple similar or repetitive items for a user. To tackle these challenges, we introduce a new decoding approach named Debiasing-Diversifying Decoding (D3). D3 disables length normalization for ghost tokens to alleviate amplification bias, and it incorporates a text-free assistant model to encourage tokens less frequently generated by LLMs for counteracting recommendation homogeneity. Extensive experiments on real-world datasets demonstrate the method's effectiveness in enhancing accuracy and diversity.
☆ Talking the Talk Does Not Entail Walking the Walk: On the Limits of Large Language Models in Lexical Entailment Recognition
Verbs form the backbone of language, providing the structure and meaning to sentences. Yet, their intricate semantic nuances pose a longstanding challenge. Understanding verb relations through the concept of lexical entailment is crucial for comprehending sentence meanings and grasping verb dynamics. This work investigates the capabilities of eight Large Language Models in recognizing lexical entailment relations among verbs through differently devised prompting strategies and zero-/few-shot settings over verb pairs from two lexical databases, namely WordNet and HyperLex. Our findings unveil that the models can tackle the lexical entailment recognition task with moderately good performance, although at varying degree of effectiveness and under different conditions. Also, utilizing few-shot prompting can enhance the models' performance. However, perfectly solving the task arises as an unmet challenge for all examined LLMs, which raises an emergence for further research developments on this topic.
☆ Generate-then-Ground in Retrieval-Augmented Generation for Multi-hop Question Answering ACL 2024
Multi-Hop Question Answering (MHQA) tasks present a significant challenge for large language models (LLMs) due to the intensive knowledge required. Current solutions, like Retrieval-Augmented Generation, typically retrieve potential documents from an external corpus to read an answer. However, the performance of this retrieve-then-read paradigm is constrained by the retriever and the inevitable noise in the retrieved documents. To mitigate these challenges, we introduce a novel generate-then-ground (GenGround) framework, synergizing the parametric knowledge of LLMs and external documents to solve a multi-hop question. GenGround empowers LLMs to alternate two phases until the final answer is derived: (1) formulate a simpler, single-hop question and directly generate the answer; (2) ground the question-answer pair in retrieved documents, amending any wrong predictions in the answer. We also propose an instructional grounding distillation method to generalize our method into smaller models. Extensive experiments conducted on four datasets illustrate the superiority of our method.
comment: ACL 2024 (main conference)
☆ Leveraging Passage Embeddings for Efficient Listwise Reranking with Large Language Models
Recent studies have demonstrated the effectiveness of using large language language models (LLMs) in passage ranking. The listwise approaches, such as RankGPT, have become new state-of-the-art in this task. However, the efficiency of RankGPT models is limited by the maximum context length and relatively high latency of LLM inference. To address these issues, in this paper, we propose PE-Rank, leveraging the single passage embedding as a good context compression for efficient listwise passage reranking. By treating each passage as a special token, we can directly input passage embeddings into LLMs, thereby reducing input length. Additionally, we introduce an inference method that dynamically constrains the decoding space to these special tokens, accelerating the decoding process. For adapting the model to reranking, we employ listwise learning to rank loss for training. Evaluation results on multiple benchmarks demonstrate that PE-Rank significantly improves efficiency in both prefilling and decoding, while maintaining competitive ranking effectiveness. {The Code is available at \url{https://github.com/liuqi6777/pe_rank}.}
☆ FIRST: Faster Improved Listwise Reranking with Single Token Decoding
Large Language Models (LLMs) have significantly advanced the field of information retrieval, particularly for reranking. Listwise LLM rerankers have showcased superior performance and generalizability compared to existing supervised approaches. However, conventional listwise LLM reranking methods lack efficiency as they provide ranking output in the form of a generated ordered sequence of candidate passage identifiers. Further, they are trained with the typical language modeling objective, which treats all ranking errors uniformly--potentially at the cost of misranking highly relevant passages. Addressing these limitations, we introduce FIRST, a novel listwise LLM reranking approach leveraging the output logits of the first generated identifier to directly obtain a ranked ordering of the candidates. Further, we incorporate a learning-to-rank loss during training, prioritizing ranking accuracy for the more relevant passages. Empirical results demonstrate that FIRST accelerates inference by 50% while maintaining a robust ranking performance with gains across the BEIR benchmark. Finally, to illustrate the practical effectiveness of listwise LLM rerankers, we investigate their application in providing relevance feedback for retrievers during inference. Our results show that LLM rerankers can provide a stronger distillation signal compared to cross-encoders, yielding substantial improvements in retriever recall after relevance feedback.
comment: Preprint
☆ Food Pairing Unveiled: Exploring Recipe Creation Dynamics through Recommender Systems
In the early 2000s, renowned chef Heston Blumenthal formulated his "food pairing" hypothesis, positing that if foods share many flavor compounds, then they tend to taste good when eaten together. In 2011, Ahn et al. conducted a study using a dataset of recipes, ingredients, and flavor compounds, finding that, in Western cuisine, ingredients in recipes often share more flavor compounds than expected by chance, indicating a natural tendency towards food pairing. Building upon Ahn's research, our work applies state-of-the-art collaborative filtering techniques to the dataset, providing a tool that can recommend new foods to add in recipes, retrieve missing ingredients and advise against certain combinations. We create our recommender in two ways, by taking into account ingredients appearances in recipes or shared flavor compounds between foods. While our analysis confirms the existence of food pairing, the recipe-based recommender performs significantly better than the flavor-based one, leading to the conclusion that food pairing is just one of the principles to take into account when creating recipes. Furthermore, and more interestingly, we find that food pairing in data is mostly due to trivial couplings of very similar ingredients, leading to a reconsideration of its current role in recipes, from being an already existing feature to a key to open up new scenarios in gastronomy. Our flavor-based recommender can thus leverage this novel concept and provide a new tool to lead culinary innovation.
♻ ☆ Incentivizing High-Quality Content in Online Recommender Systems
In content recommender systems such as TikTok and YouTube, the platform's recommendation algorithm shapes content producer incentives. Many platforms employ online learning, which generates intertemporal incentives, since content produced today affects recommendations of future content. We study the game between producers and analyze the content created at equilibrium. We show that standard online learning algorithms, such as Hedge and EXP3, unfortunately incentivize producers to create low-quality content, where producers' effort approaches zero in the long run for typical learning rate schedules. Motivated by this negative result, we design learning algorithms that incentivize producers to invest high effort and achieve high user welfare. At a conceptual level, our work illustrates the unintended impact that a platform's learning algorithm can have on content quality and introduces algorithmic approaches to mitigating these effects.
comment: Updated version with revised and expanded content
♻ ☆ Accelerating Complex Disease Treatment through Network Medicine and GenAI: A Case Study on Drug Repurposing for Breast Cancer
The objective of this research is to introduce a network specialized in predicting drugs that can be repurposed by investigating real-world evidence sources, such as clinical trials and biomedical literature. Specifically, it aims to generate drug combination therapies for complex diseases (e.g., cancer, Alzheimer's). We present a multilayered network medicine approach, empowered by a highly configured ChatGPT prompt engineering system, which is constructed on the fly to extract drug mentions in clinical trials. Additionally, we introduce a novel algorithm that connects real-world evidence with disease-specific signaling pathways (e.g., KEGG database). This sheds light on the repurposability of drugs if they are found to bind with one or more protein constituents of a signaling pathway. To demonstrate, we instantiated the framework for breast cancer and found that, out of 46 breast cancer signaling pathways, the framework identified 38 pathways that were covered by at least two drugs. This evidence signals the potential for combining those drugs. Specifically, the most covered signaling pathway, ID hsa:2064, was covered by 108 drugs, some of which can be combined. Conversely, the signaling pathway ID hsa:1499 was covered by only two drugs, indicating a significant gap for further research. Our network medicine framework, empowered by GenAI, shows promise in identifying drug combinations with a high degree of specificity, knowing the exact signaling pathways and proteins that serve as targets. It is noteworthy that ChatGPT successfully accelerated the process of identifying drug mentions in clinical trials, though further investigations are required to determine the relationships among the drug mentions.
comment: 9 pages double columns, 5 figures, 3 algorithms, 3 tables, and 1 listing, Submitted to IEEE MedAI'24 Conference, to be held November 15-17, Chongqing, China
♻ ☆ Unifying Graph Convolution and Contrastive Learning in Collaborative Filtering KDD 2024
Graph-based models and contrastive learning have emerged as prominent methods in Collaborative Filtering (CF). While many existing models in CF incorporate these methods in their design, there seems to be a limited depth of analysis regarding the foundational principles behind them. This paper bridges graph convolution, a pivotal element of graph-based models, with contrastive learning through a theoretical framework. By examining the learning dynamics and equilibrium of the contrastive loss, we offer a fresh lens to understand contrastive learning via graph theory, emphasizing its capability to capture high-order connectivity. Building on this analysis, we further show that the graph convolutional layers often used in graph-based models are not essential for high-order connectivity modeling and might contribute to the risk of oversmoothing. Stemming from our findings, we introduce Simple Contrastive Collaborative Filtering (SCCF), a simple and effective algorithm based on a naive embedding model and a modified contrastive loss. The efficacy of the algorithm is demonstrated through extensive experiments across four public datasets. The experiment code is available at \url{https://github.com/wu1hong/SCCF}. \end{abstract}
comment: KDD 2024
♻ ☆ Passage-specific Prompt Tuning for Passage Reranking in Question Answering with Large Language Models SIGIR24
Effective passage retrieval and reranking methods have been widely utilized to identify suitable candidates in open-domain question answering tasks, recent studies have resorted to LLMs for reranking the retrieved passages by the log-likelihood of the question conditioned on each passage. Although these methods have demonstrated promising results, the performance is notably sensitive to the human-written prompt (or hard prompt), and fine-tuning LLMs can be computationally intensive and time-consuming. Furthermore, this approach limits the leverage of question-passage relevance pairs and passage-specific knowledge to enhance the ranking capabilities of LLMs. In this paper, we propose passage-specific prompt tuning for reranking in open-domain question answering (PSPT): a parameter-efficient method that fine-tunes learnable passage-specific soft prompts, incorporating passage-specific knowledge from a limited set of question-passage relevance pairs. The method involves ranking retrieved passages based on the log-likelihood of the model generating the question conditioned on each passage and the learned soft prompt. We conducted extensive experiments utilizing the Llama-2-chat-7B model across three publicly available open-domain question answering datasets and the results demonstrate the effectiveness of the proposed approach.
comment: Accepted at Gen-IR@SIGIR24
♻ ☆ Searching COVID-19 Clinical Research Using Graph Queries: Algorithm Development and Validation
Objective: This study aims to consider small graphs of concepts and exploit them for expressing graph searches over existing COVID-19-related literature, leveraging the increasing use of graphs to represent and query scientific knowledge and providing a user-friendly search and exploration experience. Methods: We considered the COVID-19 Open Research Dataset corpus and summarized its content by annotating the publications' abstracts using terms selected from the UMLS and the Ontology of Coronavirus Infectious Disease. Then, we built a co-occurrence network that includes all relevant concepts mentioned in the corpus, establishing connections when their mutual information is relevant. A sophisticated graph query engine was built to allow the identification of the best matches of graph queries on the network. It also supports partial matches and suggests potential query completions using shortest paths. Results: We built a large co-occurrence network, consisting of 128,249 entities and 47,198,965 relationships; the GRAPH-SEARCH interface allows users to explore the network by formulating or adapting graph queries; it produces a bibliography of publications, which are globally ranked; and each publication is further associated with the specific parts of the query that it explains, thereby allowing the user to understand each aspect of the matching. Conclusions: Our approach supports the process of query formulation and evidence search upon a large text corpus; it can be reapplied to any scientific domain where documents corpora and curated ontologies are made available.
comment: 18 pages, 11 figures
Multimedia 6
☆ Image Conductor: Precision Control for Interactive Video Synthesis
Filmmaking and animation production often require sophisticated techniques for coordinating camera transitions and object movements, typically involving labor-intensive real-world capturing. Despite advancements in generative AI for video creation, achieving precise control over motion for interactive video asset generation remains challenging. To this end, we propose Image Conductor, a method for precise control of camera transitions and object movements to generate video assets from a single image. An well-cultivated training strategy is proposed to separate distinct camera and object motion by camera LoRA weights and object LoRA weights. To further address cinematographic variations from ill-posed trajectories, we introduce a camera-free guidance technique during inference, enhancing object movements while eliminating camera transitions. Additionally, we develop a trajectory-oriented video motion data curation pipeline for training. Quantitative and qualitative experiments demonstrate our method's precision and fine-grained control in generating motion-controllable videos from images, advancing the practical application of interactive video synthesis. Project webpage available at https://liyaowei-stu.github.io/project/ImageConductor/
comment: Project webpage available at https://liyaowei-stu.github.io/project/ImageConductor/
☆ EmpathyEar: An Open-source Avatar Multimodal Empathetic Chatbot ACL 2024
This paper introduces EmpathyEar, a pioneering open-source, avatar-based multimodal empathetic chatbot, to fill the gap in traditional text-only empathetic response generation (ERG) systems. Leveraging the advancements of a large language model, combined with multimodal encoders and generators, EmpathyEar supports user inputs in any combination of text, sound, and vision, and produces multimodal empathetic responses, offering users, not just textual responses but also digital avatars with talking faces and synchronized speeches. A series of emotion-aware instruction-tuning is performed for comprehensive emotional understanding and generation capabilities. In this way, EmpathyEar provides users with responses that achieve a deeper emotional resonance, closely emulating human-like empathy. The system paves the way for the next emotional intelligence, for which we open-source the code for public access.
comment: ACL 2024 Demonstration Paper
♻ ☆ LLIC: Large Receptive Field Transform Coding with Adaptive Weights for Learned Image Compression
The effective receptive field (ERF) plays an important role in transform coding, which determines how much redundancy can be removed during transform and how many spatial priors can be utilized to synthesize textures during inverse transform. Existing methods rely on stacks of small kernels, whose ERFs remain insufficiently large, or heavy non-local attention mechanisms, which limit the potential of high-resolution image coding. To tackle this issue, we propose Large Receptive Field Transform Coding with Adaptive Weights for Learned Image Compression (LLIC). Specifically, for the first time in the learned image compression community, we introduce a few large kernelbased depth-wise convolutions to reduce more redundancy while maintaining modest complexity. Due to the wide range of image diversity, we further propose a mechanism to augment convolution adaptability through the self-conditioned generation of weights. The large kernels cooperate with non-linear embedding and gate mechanisms for better expressiveness and lighter pointwise interactions. Our investigation extends to refined training methods that unlock the full potential of these large kernels. Moreover, to promote more dynamic inter-channel interactions, we introduce an adaptive channel-wise bit allocation strategy that autonomously generates channel importance factors in a self-conditioned manner. To demonstrate the effectiveness of the proposed transform coding, we align the entropy model to compare with existing transform methods and obtain models LLIC-STF, LLIC-ELIC, and LLIC-TCM. Extensive experiments demonstrate that our proposed LLIC models have significant improvements over the corresponding baselines and reduce the BD-Rate by 9.49%, 9.47%, 10.94% on Kodak over VTM-17.0 Intra, respectively. Our LLIC models achieve state-of-the-art performances and better trade-offs between performance and complexity.
comment: Accepted to IEEE Transactions on Multimedia 2024
♻ ☆ SHMamba: Structured Hyperbolic State Space Model for Audio-Visual Question Answering
The Audio-Visual Question Answering (AVQA) task holds significant potential for applications. Compared to traditional unimodal approaches, the multi-modal input of AVQA makes feature extraction and fusion processes more challenging. Euclidean space is difficult to effectively represent multi-dimensional relationships of data. Especially when extracting and processing data with a tree structure or hierarchical structure, Euclidean space is not suitable as an embedding space. Additionally, the self-attention mechanism in Transformers is effective in capturing the dynamic relationships between elements in a sequence. However, the self-attention mechanism's limitations in window modeling and quadratic computational complexity reduce its effectiveness in modeling long sequences. To address these limitations, we propose SHMamba: Structured Hyperbolic State Space Model to integrate the advantages of hyperbolic geometry and state space models. Specifically, SHMamba leverages the intrinsic properties of hyperbolic space to represent hierarchical structures and complex relationships in audio-visual data. Meanwhile, the state space model captures dynamic changes over time by globally modeling the entire sequence. Furthermore, we introduce an adaptive curvature hyperbolic alignment module and a cross fusion block to enhance the understanding of hierarchical structures and the dynamic exchange of cross-modal information, respectively. Extensive experiments demonstrate that SHMamba outperforms previous methods with fewer parameters and computational costs. Our learnable parameters are reduced by 78.12\%, while the average performance improves by 2.53\%. Experiments show that our method demonstrates superiority among all current major methods and is more suitable for practical application scenarios.
♻ ☆ GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation
While text-to-visual models now produce photo-realistic images and videos, they struggle with compositional text prompts involving attributes, relationships, and higher-order reasoning such as logic and comparison. In this work, we conduct an extensive human study on GenAI-Bench to evaluate the performance of leading image and video generation models in various aspects of compositional text-to-visual generation. We also compare automated evaluation metrics against our collected human ratings and find that VQAScore -- a metric measuring the likelihood that a VQA model views an image as accurately depicting the prompt -- significantly outperforms previous metrics such as CLIPScore. In addition, VQAScore can improve generation in a black-box manner (without finetuning) via simply ranking a few (3 to 9) candidate images. Ranking by VQAScore is 2x to 3x more effective than other scoring methods like PickScore, HPSv2, and ImageReward at improving human alignment ratings for DALL-E 3 and Stable Diffusion, especially on compositional prompts that require advanced visio-linguistic reasoning. We will release a new GenAI-Rank benchmark with over 40,000 human ratings to evaluate scoring metrics on ranking images generated from the same prompt. Lastly, we discuss promising areas for improvement in VQAScore, such as addressing fine-grained visual details. We will release all human ratings (over 80,000) to facilitate scientific benchmarking of both generative models and automated metrics.
comment: We open-source our dataset, model, and code at: https://linzhiqiu.github.io/papers/genai_bench ; Project page: https://linzhiqiu.github.io/papers/genai_bench ; GenAI-Bench was first introduced in arxiv:2404.01291. This article extends it with an additional GenAI-Rank benchmark. arXiv admin note: substantial text overlap with arXiv:2404.01291
♻ ☆ GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation
While text-to-visual models now produce photo-realistic images and videos, they struggle with compositional text prompts involving attributes, relationships, and higher-order reasoning such as logic and comparison. In this work, we conduct an extensive human study on GenAI-Bench to evaluate the performance of leading image and video generation models in various aspects of compositional text-to-visual generation. We also compare automated evaluation metrics against our collected human ratings and find that VQAScore -- a metric measuring the likelihood that a VQA model views an image as accurately depicting the prompt -- significantly outperforms previous metrics such as CLIPScore. In addition, VQAScore can improve generation in a black-box manner (without finetuning) via simply ranking a few (3 to 9) candidate images. Ranking by VQAScore is 2x to 3x more effective than other scoring methods like PickScore, HPSv2, and ImageReward at improving human alignment ratings for DALL-E 3 and Stable Diffusion, especially on compositional prompts that require advanced visio-linguistic reasoning. We will release a new GenAI-Rank benchmark with over 40,000 human ratings to evaluate scoring metrics on ranking images generated from the same prompt. Lastly, we discuss promising areas for improvement in VQAScore, such as addressing fine-grained visual details. We will release all human ratings (over 80,000) to facilitate scientific benchmarking of both generative models and automated metrics.
comment: We open-source our dataset, model, and code at: https://linzhiqiu.github.io/papers/genai_bench ; Project page: https://linzhiqiu.github.io/papers/genai_bench ; GenAI-Bench was first introduced in arxiv:2404.01291. This article extends it with an additional GenAI-Rank benchmark.
Computer Vision and Pattern Recognition 134
☆ Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities
When presented with questions involving visual thinking, humans naturally switch reasoning modalities, often forming mental images or drawing visual aids. Large language models have shown promising results in arithmetic and symbolic reasoning by expressing intermediate reasoning in text as a chain of thought, yet struggle to extend this capability to answer text queries that are easily solved by visual reasoning, even with extensive multimodal pretraining. We introduce a simple method, whiteboard-of-thought prompting, to unlock the visual reasoning capabilities of multimodal large language models across modalities. Whiteboard-of-thought prompting provides multimodal large language models with a metaphorical `whiteboard' to draw out reasoning steps as images, then returns these images back to the model for further processing. We find this can be accomplished with no demonstrations or specialized modules, instead leveraging models' existing ability to write code with libraries such as Matplotlib and Turtle. This simple approach shows state-of-the-art results on four difficult natural language tasks that involve visual and spatial reasoning. We identify multiple settings where GPT-4o using chain-of-thought fails dramatically, including more than one where it achieves $0\%$ accuracy, while whiteboard-of-thought enables up to $92\%$ accuracy in these same settings. We present a detailed exploration of where the technique succeeds as well as its sources of error.
comment: Project website: whiteboard.cs.columbia.edu/
☆ Asynchronous Large Language Model Enhanced Planner for Autonomous Driving
Despite real-time planners exhibiting remarkable performance in autonomous driving, the growing exploration of Large Language Models (LLMs) has opened avenues for enhancing the interpretability and controllability of motion planning. Nevertheless, LLM-based planners continue to encounter significant challenges, including elevated resource consumption and extended inference times, which pose substantial obstacles to practical deployment. In light of these challenges, we introduce AsyncDriver, a new asynchronous LLM-enhanced closed-loop framework designed to leverage scene-associated instruction features produced by LLM to guide real-time planners in making precise and controllable trajectory predictions. On one hand, our method highlights the prowess of LLMs in comprehending and reasoning with vectorized scene data and a series of routing instructions, demonstrating its effective assistance to real-time planners. On the other hand, the proposed framework decouples the inference processes of the LLM and real-time planners. By capitalizing on the asynchronous nature of their inference frequencies, our approach have successfully reduced the computational cost introduced by LLM, while maintaining comparable performance. Experiments show that our approach achieves superior closed-loop evaluation performance on nuPlan's challenging scenarios.
☆ A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models
Image editing aims to edit the given synthetic or real image to meet the specific requirements from users. It is widely studied in recent years as a promising and challenging field of Artificial Intelligence Generative Content (AIGC). Recent significant advancement in this field is based on the development of text-to-image (T2I) diffusion models, which generate images according to text prompts. These models demonstrate remarkable generative capabilities and have become widely used tools for image editing. T2I-based image editing methods significantly enhance editing performance and offer a user-friendly interface for modifying content guided by multimodal inputs. In this survey, we provide a comprehensive review of multimodal-guided image editing techniques that leverage T2I diffusion models. First, we define the scope of image editing from a holistic perspective and detail various control signals and editing scenarios. We then propose a unified framework to formalize the editing process, categorizing it into two primary algorithm families. This framework offers a design space for users to achieve specific goals. Subsequently, we present an in-depth analysis of each component within this framework, examining the characteristics and applicable scenarios of different combinations. Given that training-based methods learn to directly map the source image to target one under user guidance, we discuss them separately, and introduce injection schemes of source image in different scenarios. Additionally, we review the application of 2D techniques to video editing, highlighting solutions for inter-frame inconsistency. Finally, we discuss open challenges in the field and suggest potential future research directions. We keep tracing related works at https://github.com/xinchengshuai/Awesome-Image-Editing.
comment: Project Page: https://github.com/xinchengshuai/Awesome-Image-Editing
☆ Advancing Fine-Grained Classification by Structure and Subject Preserving Augmentation SP
Fine-grained visual classification (FGVC) involves classifying closely related sub-classes. This task is difficult due to the subtle differences between classes and the high intra-class variance. Moreover, FGVC datasets are typically small and challenging to gather, thus highlighting a significant need for effective data augmentation. Recent advancements in text-to-image diffusion models offer new possibilities for augmenting classification datasets. While these models have been used to generate training data for classification tasks, their effectiveness in full-dataset training of FGVC models remains under-explored. Recent techniques that rely on Text2Image generation or Img2Img methods, often struggle to generate images that accurately represent the class while modifying them to a degree that significantly increases the dataset's diversity. To address these challenges, we present SaSPA: Structure and Subject Preserving Augmentation. Contrary to recent methods, our method does not use real images as guidance, thereby increasing generation flexibility and promoting greater diversity. To ensure accurate class representation, we employ conditioning mechanisms, specifically by conditioning on image edges and subject representation. We conduct extensive experiments and benchmark SaSPA against both traditional and recent generative data augmentation methods. SaSPA consistently outperforms all established baselines across multiple settings, including full dataset training, contextual bias, and few-shot classification. Additionally, our results reveal interesting patterns in using synthetic data for FGVC models; for instance, we find a relationship between the amount of real data used and the optimal proportion of synthetic data. Code is available at https://github.com/EyalMichaeli/SaSPA-Aug.
comment: Under review. Code is available at https://github.com/EyalMichaeli/SaSPA-Aug
☆ Uncovering Latent Memories: Assessing Data Leakage and Memorization Patterns in Large Language Models
The proliferation of large language models has revolutionized natural language processing tasks, yet it raises profound concerns regarding data privacy and security. Language models are trained on extensive corpora including potentially sensitive or proprietary information, and the risk of data leakage -- where the model response reveals pieces of such information -- remains inadequately understood. This study examines susceptibility to data leakage by quantifying the phenomenon of memorization in machine learning models, focusing on the evolution of memorization patterns over training. We investigate how the statistical characteristics of training data influence the memories encoded within the model by evaluating how repetition influences memorization. We reproduce findings that the probability of memorizing a sequence scales logarithmically with the number of times it is present in the data. Furthermore, we find that sequences which are not apparently memorized after the first encounter can be uncovered throughout the course of training even without subsequent encounters. The presence of these latent memorized sequences presents a challenge for data privacy since they may be hidden at the final checkpoint of the model. To this end, we develop a diagnostic test for uncovering these latent memorized sequences by considering their cross entropy loss.
☆ Consistency Models Made Easy
Consistency models (CMs) are an emerging class of generative models that offer faster sampling than traditional diffusion models. CMs enforce that all points along a sampling trajectory are mapped to the same initial point. But this target leads to resource-intensive training: for example, as of 2024, training a SoTA CM on CIFAR-10 takes one week on 8 GPUs. In this work, we propose an alternative scheme for training CMs, vastly improving the efficiency of building such models. Specifically, by expressing CM trajectories via a particular differential equation, we argue that diffusion models can be viewed as a special case of CMs with a specific discretization. We can thus fine-tune a consistency model starting from a pre-trained diffusion model and progressively approximate the full consistency condition to stronger degrees over the training process. Our resulting method, which we term Easy Consistency Tuning (ECT), achieves vastly improved training times while indeed improving upon the quality of previous methods: for example, ECT achieves a 2-step FID of 2.73 on CIFAR10 within 1 hour on a single A100 GPU, matching Consistency Distillation trained of hundreds of GPU hours. Owing to this computational efficiency, we investigate the scaling law of CMs under ECT, showing that they seem to obey classic power law scaling, hinting at their ability to improve efficiency and performance at larger scales. Code (https://github.com/locuslab/ect) is available.
☆ Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs
Vision Language Models (VLMs) demonstrate remarkable proficiency in addressing a wide array of visual questions, which requires strong perception and reasoning faculties. Assessing these two competencies independently is crucial for model refinement, despite the inherent difficulty due to the intertwined nature of seeing and reasoning in existing VLMs. To tackle this issue, we present Prism, an innovative framework designed to disentangle the perception and reasoning processes involved in visual question solving. Prism comprises two distinct stages: a perception stage that utilizes a VLM to extract and articulate visual information in textual form, and a reasoning stage that formulates responses based on the extracted visual information using a Large Language Model (LLM). This modular design enables the systematic comparison and assessment of both proprietary and open-source VLM for their perception and reasoning strengths. Our analytical framework provides several valuable insights, underscoring Prism's potential as a cost-effective solution for vision-language tasks. By combining a streamlined VLM focused on perception with a powerful LLM tailored for reasoning, Prism achieves superior results in general vision-language tasks while substantially cutting down on training and operational expenses. Quantitative evaluations show that Prism, when configured with a vanilla 2B LLaVA and freely accessible GPT-3.5, delivers performance on par with VLMs $10 \times$ larger on the rigorous multimodal benchmark MMStar. The project is released at: https://github.com/SparksJoe/Prism.
☆ IRASim: Learning Interactive Real-Robot Action Simulators
Scalable robot learning in the real world is limited by the cost and safety issues of real robots. In addition, rolling out robot trajectories in the real world can be time-consuming and labor-intensive. In this paper, we propose to learn an interactive real-robot action simulator as an alternative. We introduce a novel method, IRASim, which leverages the power of generative models to generate extremely realistic videos of a robot arm that executes a given action trajectory, starting from an initial given frame. To validate the effectiveness of our method, we create a new benchmark, IRASim Benchmark, based on three real-robot datasets and perform extensive experiments on the benchmark. Results show that IRASim outperforms all the baseline methods and is more preferable in human evaluations. We hope that IRASim can serve as an effective and scalable approach to enhance robot learning in the real world. To promote research for generative real-robot action simulators, we open-source code, benchmark, and checkpoints at https: //gen-irasim.github.io.
comment: Opensource, project website: https://gen-irasim.github.io
☆ Invertible Consistency Distillation for Text-Guided Image Editing in Around 7 Steps
Diffusion distillation represents a highly promising direction for achieving faithful text-to-image generation in a few sampling steps. However, despite recent successes, existing distilled models still do not provide the full spectrum of diffusion abilities, such as real image inversion, which enables many precise image manipulation methods. This work aims to enrich distilled text-to-image diffusion models with the ability to effectively encode real images into their latent space. To this end, we introduce invertible Consistency Distillation (iCD), a generalized consistency distillation framework that facilitates both high-quality image synthesis and accurate image encoding in only 3-4 inference steps. Though the inversion problem for text-to-image diffusion models gets exacerbated by high classifier-free guidance scales, we notice that dynamic guidance significantly reduces reconstruction errors without noticeable degradation in generation performance. As a result, we demonstrate that iCD equipped with dynamic guidance may serve as a highly effective tool for zero-shot text-guided image editing, competing with more expensive state-of-the-art alternatives.
comment: Project page: https://yandex-research.github.io/invertible-cd/
☆ Epicardium Prompt-guided Real-time Cardiac Ultrasound Frame-to-volume Registration MICCAI 2024
A comprehensive guidance view for cardiac interventional surgery can be provided by the real-time fusion of the intraoperative 2D images and preoperative 3D volume based on the ultrasound frame-to-volume registration. However, cardiac ultrasound images are characterized by a low signal-to-noise ratio and small differences between adjacent frames, coupled with significant dimension variations between 2D frames and 3D volumes to be registered, resulting in real-time and accurate cardiac ultrasound frame-to-volume registration being a very challenging task. This paper introduces a lightweight end-to-end Cardiac Ultrasound frame-to-volume Registration network, termed CU-Reg. Specifically, the proposed model leverages epicardium prompt-guided anatomical clues to reinforce the interaction of 2D sparse and 3D dense features, followed by a voxel-wise local-global aggregation of enhanced features, thereby boosting the cross-dimensional matching effectiveness of low-quality ultrasound modalities. We further embed an inter-frame discriminative regularization term within the hybrid supervised learning to increase the distinction between adjacent slices in the same ultrasound volume to ensure registration stability. Experimental results on the reprocessed CAMUS dataset demonstrate that our CU-Reg surpasses existing methods in terms of registration accuracy and efficiency, meeting the guidance requirements of clinical cardiac interventional surgery.
comment: This paper has been accepted by MICCAI 2024
☆ Fantastic Copyrighted Beasts and How (Not) to Generate Them
Recent studies show that image and video generation models can be prompted to reproduce copyrighted content from their training data, raising serious legal concerns around copyright infringement. Copyrighted characters, in particular, pose a difficult challenge for image generation services, with at least one lawsuit already awarding damages based on the generation of these characters. Yet, little research has empirically examined this issue. We conduct a systematic evaluation to fill this gap. First, we build CopyCat, an evaluation suite consisting of diverse copyrighted characters and a novel evaluation pipeline. Our evaluation considers both the detection of similarity to copyrighted characters and generated image's consistency with user input. Our evaluation systematically shows that both image and video generation models can still generate characters even if characters' names are not explicitly mentioned in the prompt, sometimes with only two generic keywords (e.g., prompting with "videogame, plumber" consistently generates Nintendo's Mario character). We then introduce techniques to semi-automatically identify such keywords or descriptions that trigger character generation. Using our evaluation suite, we study runtime mitigation strategies, including both existing methods and new strategies we propose. Our findings reveal that commonly employed strategies, such as prompt rewriting in the DALL-E system, are not sufficient as standalone guardrails. These strategies must be coupled with other approaches, like negative prompting, to effectively reduce the unintended generation of copyrighted characters. Our work provides empirical grounding to the discussion of copyright mitigation strategies and offers actionable insights for model deployers actively implementing them.
☆ MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding
The advent of large vision-language models (LVLMs) has spurred research into their applications in multi-modal contexts, particularly in video understanding. Traditional VideoQA benchmarks, despite providing quantitative metrics, often fail to encompass the full spectrum of video content and inadequately assess models' temporal comprehension. To address these limitations, we introduce MMBench-Video, a quantitative benchmark designed to rigorously evaluate LVLMs' proficiency in video understanding. MMBench-Video incorporates lengthy videos from YouTube and employs free-form questions, mirroring practical use cases. The benchmark is meticulously crafted to probe the models' temporal reasoning skills, with all questions human-annotated according to a carefully constructed ability taxonomy. We employ GPT-4 for automated assessment, demonstrating superior accuracy and robustness over earlier LLM-based evaluations. Utilizing MMBench-Video, we have conducted comprehensive evaluations that include both proprietary and open-source LVLMs for images and videos. MMBench-Video stands as a valuable resource for the research community, facilitating improved evaluation of LVLMs and catalyzing progress in the field of video understanding. The evalutation code of MMBench-Video will be integrated into VLMEvalKit: https://github.com/open-compass/VLMEvalKit.
☆ V-LASIK: Consistent Glasses-Removal from Videos Using Synthetic Data
Diffusion-based generative models have recently shown remarkable image and video editing capabilities. However, local video editing, particularly removal of small attributes like glasses, remains a challenge. Existing methods either alter the videos excessively, generate unrealistic artifacts, or fail to perform the requested edit consistently throughout the video. In this work, we focus on consistent and identity-preserving removal of glasses in videos, using it as a case study for consistent local attribute removal in videos. Due to the lack of paired data, we adopt a weakly supervised approach and generate synthetic imperfect data, using an adjusted pretrained diffusion model. We show that despite data imperfection, by learning from our generated data and leveraging the prior of pretrained diffusion models, our model is able to perform the desired edit consistently while preserving the original video content. Furthermore, we exemplify the generalization ability of our method to other local video editing tasks by applying it successfully to facial sticker-removal. Our approach demonstrates significant improvement over existing methods, showcasing the potential of leveraging synthetic data and strong video priors for local video editing tasks.
☆ African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification
Recent Large Vision-Language Models (LVLMs) demonstrate impressive abilities on numerous image understanding and reasoning tasks. The task of fine-grained object classification (e.g., distinction between \textit{animal species}), however, has been probed insufficiently, despite its downstream importance. We fill this evaluation gap by creating \texttt{FOCI} (\textbf{F}ine-grained \textbf{O}bject \textbf{C}lass\textbf{I}fication), a difficult multiple-choice benchmark for fine-grained object classification, from existing object classification datasets: (1) multiple-choice avoids ambiguous answers associated with casting classification as open-ended QA task; (2) we retain classification difficulty by mining negative labels with a CLIP model. \texttt{FOCI}\xspace complements five popular classification datasets with four domain-specific subsets from ImageNet-21k. We benchmark 12 public LVLMs on \texttt{FOCI} and show that it tests for a \textit{complementary skill} to established image understanding and reasoning benchmarks. Crucially, CLIP models exhibit dramatically better performance than LVLMs. Since the image encoders of LVLMs come from these CLIP models, this points to inadequate alignment for fine-grained object distinction between the encoder and the LLM and warrants (pre)training data with more fine-grained annotation. We release our code at \url{https://github.com/gregor-ge/FOCI-Benchmark}.
☆ Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models?
Large vision-language models (LVLMs) have recently dramatically pushed the state of the art in image captioning and many image understanding tasks (e.g., visual question answering). LVLMs, however, often \textit{hallucinate} and produce captions that mention concepts that cannot be found in the image. These hallucinations erode the trustworthiness of LVLMs and are arguably among the main obstacles to their ubiquitous adoption. Recent work suggests that addition of grounding objectives -- those that explicitly align image regions or objects to text spans -- reduces the amount of LVLM hallucination. Although intuitive, this claim is not empirically justified as the reduction effects have been established, we argue, with flawed evaluation protocols that (i) rely on data (i.e., MSCOCO) that has been extensively used in LVLM training and (ii) measure hallucination via question answering rather than open-ended caption generation. In this work, in contrast, we offer the first systematic analysis of the effect of fine-grained object grounding on LVLM hallucination under an evaluation protocol that more realistically captures LVLM hallucination in open generation. Our extensive experiments over three backbone LLMs reveal that grounding objectives have little to no effect on object hallucination in open caption generation.
☆ Visible-Thermal Tiny Object Detection: A Benchmark Dataset and Baselines
Small object detection (SOD) has been a longstanding yet challenging task for decades, with numerous datasets and algorithms being developed. However, they mainly focus on either visible or thermal modality, while visible-thermal (RGBT) bimodality is rarely explored. Although some RGBT datasets have been developed recently, the insufficient quantity, limited category, misaligned images and large target size cannot provide an impartial benchmark to evaluate multi-category visible-thermal small object detection (RGBT SOD) algorithms. In this paper, we build the first large-scale benchmark with high diversity for RGBT SOD (namely RGBT-Tiny), including 115 paired sequences, 93K frames and 1.2M manual annotations. RGBT-Tiny contains abundant targets (7 categories) and high-diversity scenes (8 types that cover different illumination and density variations). Note that, over 81% of targets are smaller than 16x16, and we provide paired bounding box annotations with tracking ID to offer an extremely challenging benchmark with wide-range applications, such as RGBT fusion, detection and tracking. In addition, we propose a scale adaptive fitness (SAFit) measure that exhibits high robustness on both small and large targets. The proposed SAFit can provide reasonable performance evaluation and promote detection performance. Based on the proposed RGBT-Tiny dataset and SAFit measure, extensive evaluations have been conducted, including 23 recent state-of-the-art algorithms that cover four different types (i.e., visible generic detection, visible SOD, thermal SOD and RGBT object detection). Project is available at https://github.com/XinyiYing24/RGBT-Tiny.
☆ On Layer-wise Representation Similarity: Application for Multi-Exit Models with a Single Classifier
Analyzing the similarity of internal representations within and across different models has been an important technique for understanding the behavior of deep neural networks. Most existing methods for analyzing the similarity between representations of high dimensions, such as those based on Canonical Correlation Analysis (CCA) and widely used Centered Kernel Alignment (CKA), rely on statistical properties of the representations for a set of data points. In this paper, we focus on transformer models and study the similarity of representations between the hidden layers of individual transformers. In this context, we show that a simple sample-wise cosine similarity metric is capable of capturing the similarity and aligns with the complicated CKA. Our experimental results on common transformers reveal that representations across layers are positively correlated, albeit the similarity decreases when layers are far apart. We then propose an aligned training approach to enhance the similarity between internal representations, with trained models that enjoy the following properties: (1) the last-layer classifier can be directly applied right after any hidden layers, yielding intermediate layer accuracies much higher than those under standard training, (2) the layer-wise accuracies monotonically increase and reveal the minimal depth needed for the given task, (3) when served as multi-exit models, they achieve on-par performance with standard multi-exit architectures which consist of additional classifiers designed for early exiting in shallow layers. To our knowledge, our work is the first to show that one common classifier is sufficient for multi-exit models. We conduct experiments on both vision and NLP tasks to demonstrate the performance of the proposed aligned training.
☆ SafeSora: Towards Safety Alignment of Text2Video Generation via a Human Preference Dataset
To mitigate the risk of harmful outputs from large vision models (LVMs), we introduce the SafeSora dataset to promote research on aligning text-to-video generation with human values. This dataset encompasses human preferences in text-to-video generation tasks along two primary dimensions: helpfulness and harmlessness. To capture in-depth human preferences and facilitate structured reasoning by crowdworkers, we subdivide helpfulness into 4 sub-dimensions and harmlessness into 12 sub-categories, serving as the basis for pilot annotations. The SafeSora dataset includes 14,711 unique prompts, 57,333 unique videos generated by 4 distinct LVMs, and 51,691 pairs of preference annotations labeled by humans. We further demonstrate the utility of the SafeSora dataset through several applications, including training the text-video moderation model and aligning LVMs with human preference by fine-tuning a prompt augmentation module or the diffusion model. These applications highlight its potential as the foundation for text-to-video alignment research, such as human preference modeling and the development and validation of alignment algorithms.
Self-supervised Multi-actor Social Activity Understanding in Streaming Videos
This work addresses the problem of Social Activity Recognition (SAR), a critical component in real-world tasks like surveillance and assistive robotics. Unlike traditional event understanding approaches, SAR necessitates modeling individual actors' appearance and motions and contextualizing them within their social interactions. Traditional action localization methods fall short due to their single-actor, single-action assumption. Previous SAR research has relied heavily on densely annotated data, but privacy concerns limit their applicability in real-world settings. In this work, we propose a self-supervised approach based on multi-actor predictive learning for SAR in streaming videos. Using a visual-semantic graph structure, we model social interactions, enabling relational reasoning for robust performance with minimal labeled data. The proposed framework achieves competitive performance on standard group activity recognition benchmarks. Evaluation on three publicly available action localization benchmarks demonstrates its generalizability to arbitrary action localization.
comment: 16 pages, 2 figures, 4 pages
☆ Capturing Temporal Components for Time Series Classification
Analyzing sequential data is crucial in many domains, particularly due to the abundance of data collected from the Internet of Things paradigm. Time series classification, the task of categorizing sequential data, has gained prominence, with machine learning approaches demonstrating remarkable performance on public benchmark datasets. However, progress has primarily been in designing architectures for learning representations from raw data at fixed (or ideal) time scales, which can fail to generalize to longer sequences. This work introduces a \textit{compositional representation learning} approach trained on statistically coherent components extracted from sequential data. Based on a multi-scale change space, an unsupervised approach is proposed to segment the sequential data into chunks with similar statistical properties. A sequence-based encoder model is trained in a multi-task setting to learn compositional representations from these temporal components for time series classification. We demonstrate its effectiveness through extensive experiments on publicly available time series classification benchmarks. Evaluating the coherence of segmented components shows its competitive performance on the unsupervised segmentation task.
comment: 15 pages, 2 figures, 4 tables
☆ MM-GTUNets: Unified Multi-Modal Graph Deep Learning for Brain Disorders Prediction
Graph deep learning (GDL) has demonstrated impressive performance in predicting population-based brain disorders (BDs) through the integration of both imaging and non-imaging data. However, the effectiveness of GDL based methods heavily depends on the quality of modeling the multi-modal population graphs and tends to degrade as the graph scale increases. Furthermore, these methods often constrain interactions between imaging and non-imaging data to node-edge interactions within the graph, overlooking complex inter-modal correlations, leading to suboptimal outcomes. To overcome these challenges, we propose MM-GTUNets, an end-to-end graph transformer based multi-modal graph deep learning (MMGDL) framework designed for brain disorders prediction at large scale. Specifically, to effectively leverage rich multi-modal information related to diseases, we introduce Modality Reward Representation Learning (MRRL) which adaptively constructs population graphs using a reward system. Additionally, we employ variational autoencoder to reconstruct latent representations of non-imaging features aligned with imaging features. Based on this, we propose Adaptive Cross-Modal Graph Learning (ACMGL), which captures critical modality-specific and modality-shared features through a unified GTUNet encoder taking advantages of Graph UNet and Graph Transformer, and feature fusion module. We validated our method on two public multi-modal datasets ABIDE and ADHD-200, demonstrating its superior performance in diagnosing BDs. Our code is available at https://github.com/NZWANG/MM-GTUNets.
☆ Video Generation with Learned Action Prior
Stochastic video generation is particularly challenging when the camera is mounted on a moving platform, as camera motion interacts with observed image pixels, creating complex spatio-temporal dynamics and making the problem partially observable. Existing methods typically address this by focusing on raw pixel-level image reconstruction without explicitly modelling camera motion dynamics. We propose a solution by considering camera motion or action as part of the observed image state, modelling both image and action within a multi-modal learning framework. We introduce three models: Video Generation with Learning Action Prior (VG-LeAP) treats the image-action pair as an augmented state generated from a single latent stochastic process and uses variational inference to learn the image-action latent prior; Causal-LeAP, which establishes a causal relationship between action and the observed image frame at time $t$, learning an action prior conditioned on the observed image states; and RAFI, which integrates the augmented image-action state concept into flow matching with diffusion generative processes, demonstrating that this action-conditioned image generation concept can be extended to other diffusion-based models. We emphasize the importance of multi-modal training in partially observable video generation problems through detailed empirical studies on our new video action dataset, RoAM.
☆ CollaFuse: Collaborative Diffusion Models
In the landscape of generative artificial intelligence, diffusion-based models have emerged as a promising method for generating synthetic images. However, the application of diffusion models poses numerous challenges, particularly concerning data availability, computational requirements, and privacy. Traditional approaches to address these shortcomings, like federated learning, often impose significant computational burdens on individual clients, especially those with constrained resources. In response to these challenges, we introduce a novel approach for distributed collaborative diffusion models inspired by split learning. Our approach facilitates collaborative training of diffusion models while alleviating client computational burdens during image synthesis. This reduced computational burden is achieved by retaining data and computationally inexpensive processes locally at each client while outsourcing the computationally expensive processes to shared, more efficient server resources. Through experiments on the common CelebA dataset, our approach demonstrates enhanced privacy by reducing the necessity for sharing raw data. These capabilities hold significant potential across various application areas, including the design of edge computing solutions. Thus, our work advances distributed machine learning by contributing to the evolution of collaborative diffusion models.
comment: 13 pages, 7 figures
☆ FutureNet-LOF: Joint Trajectory Prediction and Lane Occupancy Field Prediction with Future Context Encoding
Most prior motion prediction endeavors in autonomous driving have inadequately encoded future scenarios, leading to predictions that may fail to accurately capture the diverse movements of agents (e.g., vehicles or pedestrians). To address this, we propose FutureNet, which explicitly integrates initially predicted trajectories into the future scenario and further encodes these future contexts to enhance subsequent forecasting. Additionally, most previous motion forecasting works have focused on predicting independent futures for each agent. However, safe and smooth autonomous driving requires accurately predicting the diverse future behaviors of numerous surrounding agents jointly in complex dynamic environments. Given that all agents occupy certain potential travel spaces and possess lane driving priority, we propose Lane Occupancy Field (LOF), a new representation with lane semantics for motion forecasting in autonomous driving. LOF can simultaneously capture the joint probability distribution of all road participants' future spatial-temporal positions. Due to the high compatibility between lane occupancy field prediction and trajectory prediction, we propose a novel network with future context encoding for the joint prediction of these two tasks. Our approach ranks 1st on two large-scale motion forecasting benchmarks: Argoverse 1 and Argoverse 2.
comment: 10 pages
☆ Benchmarking Monocular 3D Dog Pose Estimation Using In-The-Wild Motion Capture Data CVPR
We introduce a new benchmark analysis focusing on 3D canine pose estimation from monocular in-the-wild images. A multi-modal dataset 3DDogs-Lab was captured indoors, featuring various dog breeds trotting on a walkway. It includes data from optical marker-based mocap systems, RGBD cameras, IMUs, and a pressure mat. While providing high-quality motion data, the presence of optical markers and limited background diversity make the captured video less representative of real-world conditions. To address this, we created 3DDogs-Wild, a naturalised version of the dataset where the optical markers are in-painted and the subjects are placed in diverse environments, enhancing its utility for training RGB image-based pose detectors. We show that using the 3DDogs-Wild to train the models leads to improved performance when evaluating on in-the-wild data. Additionally, we provide a thorough analysis using various pose estimation models, revealing their respective strengths and weaknesses. We believe that our findings, coupled with the datasets provided, offer valuable insights for advancing 3D animal pose estimation.
comment: 5 pages, 8 figures, including supplementary, CV4Animals Workshop 2024 (CVPRW)
☆ WEATHER-5K: A Large-scale Global Station Weather Dataset Towards Comprehensive Time-series Forecasting Benchmark
Global Station Weather Forecasting (GSWF) is crucial for various sectors, including aviation, agriculture, energy, and disaster preparedness. Recent advancements in deep learning have significantly improved the accuracy of weather predictions by optimizing models based on public meteorological data. However, existing public datasets for GSWF optimization and benchmarking still suffer from significant limitations, such as small sizes, limited temporal coverage, and a lack of comprehensive variables. These shortcomings prevent them from effectively reflecting the benchmarks of current forecasting methods and fail to support the real needs of operational weather forecasting. To address these challenges, we present the WEATHER-5K dataset. This dataset comprises a comprehensive collection of data from 5,672 weather stations worldwide, spanning a 10-year period with one-hour intervals. It includes multiple crucial weather elements, providing a more reliable and interpretable resource for forecasting. Furthermore, our WEATHER-5K dataset can serve as a benchmark for comprehensively evaluating existing well-known forecasting models, extending beyond GSWF methods to support future time-series research challenges and opportunities. The dataset and benchmark implementation are publicly available at: https://github.com/taohan10200/WEATHER-5K.
comment: 26 pages,13 figures
☆ ATAC-Net: Zoomed view works better for Anomaly Detection
The application of deep learning in visual anomaly detection has gained widespread popularity due to its potential use in quality control and manufacturing. Current standard methods are Unsupervised, where a clean dataset is utilised to detect deviations and flag anomalies during testing. However, incorporating a few samples when the type of anomalies is known beforehand can significantly enhance performance. Thus, we propose ATAC-Net, a framework that trains to detect anomalies from a minimal set of known prior anomalies. Furthermore, we introduce attention-guided cropping, which provides a closer view of suspect regions during the training phase. Our framework is a reliable and easy-to-understand system for detecting anomalies, and we substantiate its superiority to some of the current state-of-the-art techniques in a comparable setting.
☆ Enhanced Bank Check Security: Introducing a Novel Dataset and Transformer-Based Approach for Detection and Verification
Automated signature verification on bank checks is critical for fraud prevention and ensuring transaction authenticity. This task is challenging due to the coexistence of signatures with other textual and graphical elements on real-world documents. Verification systems must first detect the signature and then validate its authenticity, a dual challenge often overlooked by current datasets and methodologies focusing only on verification. To address this gap, we introduce a novel dataset specifically designed for signature verification on bank checks. This dataset includes a variety of signature styles embedded within typical check elements, providing a realistic testing ground for advanced detection methods. Moreover, we propose a novel approach for writer-independent signature verification using an object detection network. Our detection-based verification method treats genuine and forged signatures as distinct classes within an object detection framework, effectively handling both detection and verification. We employ a DINO-based network augmented with a dilation module to detect and verify signatures on check images simultaneously. Our approach achieves an AP of 99.2 for genuine and 99.4 for forged signatures, a significant improvement over the DINO baseline, which scored 93.1 and 89.3 for genuine and forged signatures, respectively. This improvement highlights our dilation module's effectiveness in reducing both false positives and negatives. Our results demonstrate substantial advancements in detection-based signature verification technology, offering enhanced security and efficiency in financial document processing.
comment: Accepted for publication in 16th IAPR International Workshop on Document Analysis Systems 2024
☆ PoseBench: Benchmarking the Robustness of Pose Estimation Models under Corruptions
Pose estimation aims to accurately identify anatomical keypoints in humans and animals using monocular images, which is crucial for various applications such as human-machine interaction, embodied AI, and autonomous driving. While current models show promising results, they are typically trained and tested on clean data, potentially overlooking the corruption during real-world deployment and thus posing safety risks in practical scenarios. To address this issue, we introduce PoseBench, a comprehensive benchmark designed to evaluate the robustness of pose estimation models against real-world corruption. We evaluated 60 representative models, including top-down, bottom-up, heatmap-based, regression-based, and classification-based methods, across three datasets for human and animal pose estimation. Our evaluation involves 10 types of corruption in four categories: 1) blur and noise, 2) compression and color loss, 3) severe lighting, and 4) masks. Our findings reveal that state-of-the-art models are vulnerable to common real-world corruptions and exhibit distinct behaviors when tackling human and animal pose estimation tasks. To improve model robustness, we delve into various design considerations, including input resolution, pre-training datasets, backbone capacity, post-processing, and data augmentations. We hope that our benchmark will serve as a foundation for advancing research in robust pose estimation. The benchmark and source code will be released at https://xymsh.github.io/PoseBench
comment: Technical report. Project page: https://xymsh.github.io/PoseBench/
☆ Mask the Unknown: Assessing Different Strategies to Handle Weak Annotations in the MICCAI2023 Mediastinal Lymph Node Quantification Challenge
Pathological lymph node delineation is crucial in cancer diagnosis, progression assessment, and treatment planning. The MICCAI 2023 Lymph Node Quantification Challenge published the first public dataset for pathological lymph node segmentation in the mediastinum. As lymph node annotations are expensive, the challenge was formed as a weakly supervised learning task, where only a subset of all lymph nodes in the training set have been annotated. For the challenge submission, multiple methods for training on these weakly supervised data were explored, including noisy label training, loss masking of unlabeled data, and an approach that integrated the TotalSegmentator toolbox as a form of pseudo labeling in order to reduce the number of unknown voxels. Furthermore, multiple public TCIA datasets were incorporated into the training to improve the performance of the deep learning model. Our submitted model achieved a Dice score of 0.628 and an average symmetric surface distance of 5.8~mm on the challenge test set. With our submitted model, we accomplished third rank in the MICCAI2023 LNQ challenge. A finding of our analysis was that the integration of all visible, including non-pathological, lymph nodes improved the overall segmentation performance on pathological lymph nodes of the test set. Furthermore, segmentation models trained only on clinically enlarged lymph nodes, as given in the challenge scenario, could not generalize to smaller pathological lymph nodes. The code and model for the challenge submission are available at \url{https://gitlab.lrz.de/compai/MediastinalLymphNodeSegmentation}.
comment: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2024:008
☆ Deblurring Neural Radiance Fields with Event-driven Bundle Adjustment
Neural Radiance Fields (NeRF) achieve impressive 3D representation learning and novel view synthesis results with high-quality multi-view images as input. However, motion blur in images often occurs in low-light and high-speed motion scenes, which significantly degrade the reconstruction quality of NeRF. Previous deblurring NeRF methods are struggling to estimate information during the exposure time, unable to accurately model the motion blur. In contrast, the bio-inspired event camera measuring intensity changes with high temporal resolution makes up this information deficiency. In this paper, we propose Event-driven Bundle Adjustment for Deblurring Neural Radiance Fields (EBAD-NeRF) to jointly optimize the learnable poses and NeRF parameters by leveraging the hybrid event-RGB data. An intensity-change-metric event loss and a photo-metric blur loss are introduced to strengthen the explicit modeling of camera motion blur. Experiment results on both synthetic data and real captured data demonstrate that EBAD-NeRF can obtain accurate camera poses during the exposure time and learn sharper 3D representations compared to prior works.
☆ Automatic Labels are as Effective as Manual Labels in Biomedical Images Classification with Deep Learning
The increasing availability of biomedical data is helping to design more robust deep learning (DL) algorithms to analyze biomedical samples. Currently, one of the main limitations to train DL algorithms to perform a specific task is the need for medical experts to label data. Automatic methods to label data exist, however automatic labels can be noisy and it is not completely clear when automatic labels can be adopted to train DL models. This paper aims to investigate under which circumstances automatic labels can be adopted to train a DL model on the classification of Whole Slide Images (WSI). The analysis involves multiple architectures, such as Convolutional Neural Networks (CNN) and Vision Transformer (ViT), and over 10000 WSIs, collected from three use cases: celiac disease, lung cancer and colon cancer, which one including respectively binary, multiclass and multilabel data. The results allow identifying 10% as the percentage of noisy labels that lead to train competitive models for the classification of WSIs. Therefore, an algorithm generating automatic labels needs to fit this criterion to be adopted. The application of the Semantic Knowledge Extractor Tool (SKET) algorithm to generate automatic labels leads to performance comparable to the one obtained with manual labels, since it generates a percentage of noisy labels between 2-5%. Automatic labels are as effective as manual ones, reaching solid performance comparable to the one obtained training models with manual labels.
comment: pre-print of the journal paper
☆ FIESTA: Fourier-Based Semantic Augmentation with Uncertainty Guidance for Enhanced Domain Generalizability in Medical Image Segmentation
Single-source domain generalization (SDG) in medical image segmentation (MIS) aims to generalize a model using data from only one source domain to segment data from an unseen target domain. Despite substantial advances in SDG with data augmentation, existing methods often fail to fully consider the details and uncertain areas prevalent in MIS, leading to mis-segmentation. This paper proposes a Fourier-based semantic augmentation method called FIESTA using uncertainty guidance to enhance the fundamental goals of MIS in an SDG context by manipulating the amplitude and phase components in the frequency domain. The proposed Fourier augmentative transformer addresses semantic amplitude modulation based on meaningful angular points to induce pertinent variations and harnesses the phase spectrum to ensure structural coherence. Moreover, FIESTA employs epistemic uncertainty to fine-tune the augmentation process, improving the ability of the model to adapt to diverse augmented data and concentrate on areas with higher ambiguity. Extensive experiments across three cross-domain scenarios demonstrate that FIESTA surpasses recent state-of-the-art SDG approaches in segmentation performance and significantly contributes to boosting the applicability of the model in medical imaging modalities.
comment: 40 pages, 7 figures, 5 tables
☆ QuST-LLM: Integrating Large Language Models for Comprehensive Spatial Transcriptomics Analysis
In this paper, we introduce QuST-LLM, an innovative extension of QuPath that utilizes the capabilities of large language models (LLMs) to analyze and interpret spatial transcriptomics (ST) data. This tool effectively simplifies the intricate and high-dimensional nature of ST data by offering a comprehensive workflow that includes data loading, region selection, gene expression analysis, and functional annotation. QuST-LLM employs LLMs to transform complex ST data into understandable and detailed biological narratives based on gene ontology annotations, thereby significantly improving the interpretability of ST data. Consequently, users can interact with their own ST data using natural language. Hence, QuST-LLM provides researchers with a potent functionality to unravel the spatial and functional complexities of tissues, fostering novel insights and advancements in biomedical research.
comment: 12 pages, 7 figures
☆ Segmentation of Non-Small Cell Lung Carcinomas: Introducing DRU-Net and Multi-Lens Distortion
Considering the increased workload in pathology laboratories today, automated tools such as artificial intelligence models can help pathologists with their tasks and ease the workload. In this paper, we are proposing a segmentation model (DRU-Net) that can provide a delineation of human non-small cell lung carcinomas and an augmentation method that can improve classification results. The proposed model is a fused combination of truncated pre-trained DenseNet201 and ResNet101V2 as a patch-wise classifier followed by a lightweight U-Net as a refinement model. We have used two datasets (Norwegian Lung Cancer Biobank and Haukeland University Hospital lung cancer cohort) to create our proposed model. The DRU-Net model achieves an average of 0.91 Dice similarity coefficient. The proposed spatial augmentation method (multi-lens distortion) improved the network performance by 3%. Our findings show that choosing image patches that specifically include regions of interest leads to better results for the patch-wise classifier compared to other sampling methods. The qualitative analysis showed that the DRU-Net model is generally successful in detecting the tumor. On the test set, some of the cases showed areas of false positive and false negative segmentation in the periphery, particularly in tumors with inflammatory and reactive changes.
comment: 16 pages, 7 figures, submitted to Scientific Reports
☆ Learning to Discover Knowledge: A Weakly-Supervised Partial Domain Adaptation Approach SP
Domain adaptation has shown appealing performance by leveraging knowledge from a source domain with rich annotations. However, for a specific target task, it is cumbersome to collect related and high-quality source domains. In real-world scenarios, large-scale datasets corrupted with noisy labels are easy to collect, stimulating a great demand for automatic recognition in a generalized setting, i.e., weakly-supervised partial domain adaptation (WS-PDA), which transfers a classifier from a large source domain with noises in labels to a small unlabeled target domain. As such, the key issues of WS-PDA are: 1) how to sufficiently discover the knowledge from the noisy labeled source domain and the unlabeled target domain, and 2) how to successfully adapt the knowledge across domains. In this paper, we propose a simple yet effective domain adaptation approach, termed as self-paced transfer classifier learning (SP-TCL), to address the above issues, which could be regarded as a well-performing baseline for several generalized domain adaptation tasks. The proposed model is established upon the self-paced learning scheme, seeking a preferable classifier for the target domain. Specifically, SP-TCL learns to discover faithful knowledge via a carefully designed prudent loss function and simultaneously adapts the learned knowledge to the target domain by iteratively excluding source examples from training under the self-paced fashion. Extensive evaluations on several benchmark datasets demonstrate that SP-TCL significantly outperforms state-of-the-art approaches on several generalized domain adaptation tasks.
comment: Accepted to TIP 2024. Code available: https://github.com/mc-lan/SP-TCL
☆ MultiTalk: Enhancing 3D Talking Head Generation Across Languages with Multilingual Video Dataset
Recent studies in speech-driven 3D talking head generation have achieved convincing results in verbal articulations. However, generating accurate lip-syncs degrades when applied to input speech in other languages, possibly due to the lack of datasets covering a broad spectrum of facial movements across languages. In this work, we introduce a novel task to generate 3D talking heads from speeches of diverse languages. We collect a new multilingual 2D video dataset comprising over 420 hours of talking videos in 20 languages. With our proposed dataset, we present a multilingually enhanced model that incorporates language-specific style embeddings, enabling it to capture the unique mouth movements associated with each language. Additionally, we present a metric for assessing lip-sync accuracy in multilingual settings. We demonstrate that training a 3D talking head model with our proposed dataset significantly enhances its multilingual performance. Codes and datasets are available at https://multi-talk.github.io/.
comment: Interspeech 2024
☆ Zero-Shot Image Denoising for High-Resolution Electron Microscopy
High-resolution electron microscopy (HREM) imaging technique is a powerful tool for directly visualizing a broad range of materials in real-space. However, it faces challenges in denoising due to ultra-low signal-to-noise ratio (SNR) and scarce data availability. In this work, we propose Noise2SR, a zero-shot self-supervised learning (ZS-SSL) denoising framework for HREM. Within our framework, we propose a super-resolution (SR) based self-supervised training strategy, incorporating the Random Sub-sampler module. The Random Sub-sampler is designed to generate approximate infinite noisy pairs from a single noisy image, serving as an effective data augmentation in zero-shot denoising. Noise2SR trains the network with paired noisy images of different resolutions, which is conducted via SR strategy. The SR-based training facilitates the network adopting more pixels for supervision, and the random sub-sampling helps compel the network to learn continuous signals enhancing the robustness. Meanwhile, we mitigate the uncertainty caused by random-sampling by adopting minimum mean squared error (MMSE) estimation for the denoised results. With the distinctive integration of training strategy and proposed designs, Noise2SR can achieve superior denoising performance using a single noisy HREM image. We evaluate the performance of Noise2SR in both simulated and real HREM denoising tasks. It outperforms state-of-the-art ZS-SSL methods and achieves comparable denoising performance with supervised methods. The success of Noise2SR suggests its potential for improving the SNR of images in material imaging domains.
comment: 12 pages, 12 figures
☆ Unleashing the Potential of Tracklets for Unsupervised Video Person Re-Identification
With rich temporal-spatial information, video-based person re-identification methods have shown broad prospects. Although tracklets can be easily obtained with ready-made tracking models, annotating identities is still expensive and impractical. Therefore, some video-based methods propose using only a few identity annotations or camera labels to facilitate feature learning. They also simply average the frame features of each tracklet, overlooking unexpected variations and inherent identity consistency within tracklets. In this paper, we propose the Self-Supervised Refined Clustering (SSR-C) framework without relying on any annotation or auxiliary information to promote unsupervised video person re-identification. Specifically, we first propose the Noise-Filtered Tracklet Partition (NFTP) module to reduce the feature bias of tracklets caused by noisy tracking results, and sequentially partition the noise-filtered tracklets into "sub-tracklets". Then, we cluster and further merge sub-tracklets using the self-supervised signal from tracklet partition, which is enhanced through a progressive strategy to generate reliable pseudo labels, facilitating intra-class cross-tracklet aggregation. Moreover, we propose the Class Smoothing Classification (CSC) loss to efficiently promote model learning. Extensive experiments on the MARS and DukeMTMC-VideoReID datasets demonstrate that our proposed SSR-C for unsupervised video person re-identification achieves state-of-the-art results and is comparable to advanced supervised methods.
comment: The first two authors contributed equally
☆ MEAT: Median-Ensemble Adversarial Training for Improving Robustness and Generalization
Self-ensemble adversarial training methods improve model robustness by ensembling models at different training epochs, such as model weight averaging (WA). However, previous research has shown that self-ensemble defense methods in adversarial training (AT) still suffer from robust overfitting, which severely affects the generalization performance. Empirically, in the late phases of training, the AT becomes more overfitting to the extent that the individuals for weight averaging also suffer from overfitting and produce anomalous weight values, which causes the self-ensemble model to continue to undergo robust overfitting due to the failure in removing the weight anomalies. To solve this problem, we aim to tackle the influence of outliers in the weight space in this work and propose an easy-to-operate and effective Median-Ensemble Adversarial Training (MEAT) method to solve the robust overfitting phenomenon existing in self-ensemble defense from the source by searching for the median of the historical model weights. Experimental results show that MEAT achieves the best robustness against the powerful AutoAttack and can effectively allievate the robust overfitting. We further demonstrate that most defense methods can improve robust generalization and robustness by combining with MEAT.
☆ DuMapNet: An End-to-End Vectorization System for City-Scale Lane-Level Map Generation KDD 2024
Generating city-scale lane-level maps faces significant challenges due to the intricate urban environments, such as blurred or absent lane markings. Additionally, a standard lane-level map requires a comprehensive organization of lane groupings, encompassing lane direction, style, boundary, and topology, yet has not been thoroughly examined in prior research. These obstacles result in labor-intensive human annotation and high maintenance costs. This paper overcomes these limitations and presents an industrial-grade solution named DuMapNet that outputs standardized, vectorized map elements and their topology in an end-to-end paradigm. To this end, we propose a group-wise lane prediction (GLP) system that outputs vectorized results of lane groups by meticulously tailoring a transformer-based network. Meanwhile, to enhance generalization in challenging scenarios, such as road wear and occlusions, as well as to improve global consistency, a contextual prompts encoder (CPE) module is proposed, which leverages the predicted results of spatial neighborhoods as contextual information. Extensive experiments conducted on large-scale real-world datasets demonstrate the superiority and effectiveness of DuMapNet. Additionally, DuMap-Net has already been deployed in production at Baidu Maps since June 2023, supporting lane-level map generation tasks for over 360 cities while bringing a 95% reduction in costs. This demonstrates that DuMapNet serves as a practical and cost-effective industrial solution for city-scale lane-level map generation.
comment: Accepted by KDD 2024, camera-ready version
☆ E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion EMNLP 2024
Online GUI navigation on mobile devices has driven a lot of attention recent years since it contributes to many real-world applications. With the rapid development of large language models (LLM), multimodal large language models (MLLM) have tremendous potential on this task. However, existing MLLMs need high quality data to improve its abilities of making the correct navigation decisions according to the human user inputs. In this paper, we developed a novel and highly valuable dataset, named \textbf{E-ANT}, as the first Chinese GUI navigation dataset that contains real human behaviour and high quality screenshots with annotations, containing nearly 40,000 real human traces over 5000+ different tinyAPPs. Furthermore, we evaluate various powerful MLLMs on E-ANT and show their experiments results with sufficient ablations. We believe that our proposed dataset will be beneficial for both the evaluation and development of GUI navigation and LLM/MLLM decision-making capabilities.
comment: 9 pages, 5 figures, Under review at ARR (for EMNLP 2024)
☆ CityNav: Language-Goal Aerial Navigation Dataset with Geographic Information
Vision-and-language navigation (VLN) aims to guide autonomous agents through real-world environments by integrating visual and linguistic cues. While substantial progress has been made in understanding these interactive modalities in ground-level navigation, aerial navigation remains largely underexplored. This is primarily due to the scarcity of resources suitable for real-world, city-scale aerial navigation studies. To bridge this gap, we introduce CityNav, a new dataset for language-goal aerial navigation using a 3D point cloud representation from real-world cities. CityNav includes 32,637 natural language descriptions paired with human demonstration trajectories, collected from participants via a new web-based 3D simulator developed for this research. Each description specifies a navigation goal, leveraging the names and locations of landmarks within real-world cities. We also provide baseline models of navigation agents that incorporate an internal 2D spatial map representing landmarks referenced in the descriptions. We benchmark the latest aerial navigation baselines and our proposed model on the CityNav dataset. The results using this dataset reveal the following key findings: (i) Our aerial agent models trained on human demonstration trajectories outperform those trained on shortest path trajectories, highlighting the importance of human-driven navigation strategies; (ii) The integration of a 2D spatial map significantly enhances navigation efficiency at city scale. Our dataset and code are available at https://water-cookie.github.io/city-nav-proj/
comment: The first two authors are equally contributed
☆ LeYOLO, New Scalable and Efficient CNN Architecture for Object Detection
Computational efficiency in deep neural networks is critical for object detection, especially as newer models prioritize speed over efficient computation (FLOP). This evolution has somewhat left behind embedded and mobile-oriented AI object detection applications. In this paper, we focus on design choices of neural network architectures for efficient object detection computation based on FLOP and propose several optimizations to enhance the efficiency of YOLO-based models. Firstly, we introduce an efficient backbone scaling inspired by inverted bottlenecks and theoretical insights from the Information Bottleneck principle. Secondly, we present the Fast Pyramidal Architecture Network (FPAN), designed to facilitate fast multiscale feature sharing while reducing computational resources. Lastly, we propose a Decoupled Network-in-Network (DNiN) detection head engineered to deliver rapid yet lightweight computations for classification and regression tasks. Building upon these optimizations and leveraging more efficient backbones, this paper contributes to a new scaling paradigm for object detection and YOLO-centric models called LeYOLO. Our contribution consistently outperforms existing models in various resource constraints, achieving unprecedented accuracy and flop ratio. Notably, LeYOLO-Small achieves a competitive mAP score of 38.2% on the COCOval with just 4.5 FLOP(G), representing a 42% reduction in computational load compared to the latest state-of-the-art YOLOv9-Tiny model while achieving similar accuracy. Our novel model family achieves a FLOP-to-accuracy ratio previously unattained, offering scalability that spans from ultra-low neural network configurations (< 1 GFLOP) to efficient yet demanding object detection setups (> 4 GFLOPs) with 25.2, 31.3, 35.2, 38.2, 39.3 and 41 mAP for 0.66, 1.47, 2.53, 4.51, 5.8 and 8.4 FLOP(G).
☆ Mitigating the Human-Robot Domain Discrepancy in Visual Pre-training for Robotic Manipulation
Learning generalizable visual dynamic representation across different embodied environments is crucial for real-world robotic manipulation. As the scale and diversity of robot demonstration data are limited, recent works have turned to large-scale pre-training using human data. However, the morphological differences between humans and robots introduce a significant human-robot domain discrepancy, challenging the generalization of these human-data pre-trained models to downstream manipulation tasks. To address this, we propose a novel adaptation paradigm that utilizes readily available paired human-robot video data to bridge the discrepancy. Following this paradigm, our method exploits a human-robot contrastive alignment loss to align the semantics of human and robot videos, adapting pre-trained models to the robotic domain in a parameter-efficient manner. The experiments demonstrate significant improvements on 25 tasks across three different benchmarks, where the single-task, language-conditioned multi-task settings are covered, and two different pre-trained models are evaluated. On the large RLBench benchmark, our adaptation method achieves an average improvement of $8.9\%$ in success rate over the pre-trained R3M model across multiple tasks. We will release the code and models upon acceptance.
☆ Uncertainty and Self-Supervision in Single-View Depth
Single-view depth estimation refers to the ability to derive three-dimensional information per pixel from a single two-dimensional image. Single-view depth estimation is an ill-posed problem because there are multiple depth solutions that explain 3D geometry from a single view. While deep neural networks have been shown to be effective at capturing depth from a single view, the majority of current methodologies are deterministic in nature. Accounting for uncertainty in the predictions can avoid disastrous consequences when applied to fields such as autonomous driving or medical robotics. We have addressed this problem by quantifying the uncertainty of supervised single-view depth for Bayesian deep neural networks. There are scenarios, especially in medicine in the case of endoscopic images, where such annotated data is not available. To alleviate the lack of data, we present a method that improves the transition from synthetic to real domain methods. We introduce an uncertainty-aware teacher-student architecture that is trained in a self-supervised manner, taking into account the teacher uncertainty. Given the vast amount of unannotated data and the challenges associated with capturing annotated depth in medical minimally invasive procedures, we advocate a fully self-supervised approach that only requires RGB images and the geometric and photometric calibration of the endoscope. In endoscopic imaging, the camera and light sources are co-located at a small distance from the target surfaces. This setup indicates that brighter areas of the image are nearer to the camera, while darker areas are further away. Building on this observation, we exploit the fact that for any given albedo and surface orientation, pixel brightness is inversely proportional to the square of the distance. We propose the use of illumination as a strong single-view self-supervisory signal for deep neural networks.
comment: Doctoral thesis
☆ Evaluation of Deep Learning Semantic Segmentation for Land Cover Mapping on Multispectral, Hyperspectral and High Spatial Aerial Imagery
In the rise of climate change, land cover mapping has become such an urgent need in environmental monitoring. The accuracy of land cover classification has gotten increasingly based on the improvement of remote sensing data. Land cover classification using satellite imageries has been explored and become more prevalent in recent years, but the methodologies remain some drawbacks of subjective and time-consuming. Some deep learning techniques have been utilized to overcome these limitations. However, most studies implemented just one image type to evaluate algorithms for land cover mapping. Therefore, our study conducted deep learning semantic segmentation in multispectral, hyperspectral, and high spatial aerial image datasets for landcover mapping. This research implemented a semantic segmentation method such as Unet, Linknet, FPN, and PSPnet for categorizing vegetation, water, and others (i.e., soil and impervious surface). The LinkNet model obtained high accuracy in IoU (Intersection Over Union) at 0.92 in all datasets, which is comparable with other mentioned techniques. In evaluation with different image types, the multispectral images showed higher performance with the IoU, and F1-score are 0.993 and 0.997, respectively. Our outcome highlighted the efficiency and broad applicability of LinkNet and multispectral image on land cover classification. This research contributes to establishing an approach on landcover segmentation via open source for long-term future application.
Self-Supervised Pretext Tasks for Alzheimer's Disease Classification using 3D Convolutional Neural Networks on Large-Scale Synthetic Neuroimaging Dataset
Structural magnetic resonance imaging (MRI) studies have shown that Alzheimer's Disease (AD) induces both localised and widespread neural degenerative changes throughout the brain. However, the absence of segmentation that highlights brain degenerative changes presents unique challenges for training CNN-based classifiers in a supervised fashion. In this work, we evaluated several unsupervised methods to train a feature extractor for downstream AD vs. CN classification. Using the 3D T1-weighted MRI data of cognitive normal (CN) subjects from the synthetic neuroimaging LDM100K dataset, lightweight 3D CNN-based models are trained for brain age prediction, brain image rotation classification, brain image reconstruction and a multi-head task combining all three tasks into one. Feature extractors trained on the LDM100K synthetic dataset achieved similar performance compared to the same model using real-world data. This supports the feasibility of utilising large-scale synthetic data for pretext task training. All the training and testing splits are performed on the subject-level to prevent data leakage issues. Alongside the simple preprocessing steps, the random cropping data augmentation technique shows consistent improvement across all experiments.
☆ Live Video Captioning
Dense video captioning is the task that involves the detection and description of events within video sequences. While traditional approaches focus on offline solutions where the entire video of analysis is available for the captioning model, in this work we introduce a paradigm shift towards Live Video Captioning (LVC). In LVC, dense video captioning models must generate captions for video streams in an online manner, facing important constraints such as having to work with partial observations of the video, the need for temporal anticipation and, of course, ensuring ideally a real-time response. In this work we formally introduce the novel problem of LVC and propose new evaluation metrics tailored for the online scenario, demonstrating their superiority over traditional metrics. We also propose an LVC model integrating deformable transformers and temporal filtering to address the LVC new challenges. Experimental evaluations on the ActivityNet Captions dataset validate the effectiveness of our approach, highlighting its performance in LVC compared to state-of-the-art offline methods. Results of our model as well as an evaluation kit with the novel metrics integrated are made publicly available to encourage further research on LVC.
☆ Trusting Semantic Segmentation Networks
Semantic segmentation has become an important task in computer vision with the growth of self-driving cars, medical image segmentation, etc. Although current models provide excellent results, they are still far from perfect and while there has been significant work in trying to improve the performance, both with respect to accuracy and speed of segmentation, there has been little work which analyses the failure cases of such systems. In this work, we aim to provide an analysis of how segmentation fails across different models and consider the question of whether these can be predicted reasonably at test time. To do so, we explore existing uncertainty-based metrics and see how well they correlate with misclassifications, allowing us to define the degree of trust we put in the output of our prediction models. Through several experiments on three different models across three datasets, we show that simple measures such as entropy can be used to capture misclassification with high recall rates.
☆ VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model
The emergence of Large Vision-Language Models (LVLMs) marks significant strides towards achieving general artificial intelligence. However, these advancements are tempered by the outputs that often reflect biases, a concern not yet extensively investigated. Existing benchmarks are not sufficiently comprehensive in evaluating biases due to their limited data scale, single questioning format and narrow sources of bias. To address this problem, we introduce VLBiasBench, a benchmark aimed at evaluating biases in LVLMs comprehensively. In VLBiasBench, we construct a dataset encompassing nine distinct categories of social biases, including age, disability status, gender, nationality, physical appearance, race, religion, profession, social economic status and two intersectional bias categories (race x gender, and race x social economic status). To create a large-scale dataset, we use Stable Diffusion XL model to generate 46,848 high-quality images, which are combined with different questions to form 128,342 samples. These questions are categorized into open and close ended types, fully considering the sources of bias and comprehensively evaluating the biases of LVLM from multiple perspectives. We subsequently conduct extensive evaluations on 15 open-source models as well as one advanced closed-source model, providing some new insights into the biases revealing from these models. Our benchmark is available at https://github.com/Xiangkui-Cao/VLBiasBench.
☆ CriDiff: Criss-cross Injection Diffusion Framework via Generative Pre-train for Prostate Segmentation MICCAI 2024
Recently, the Diffusion Probabilistic Model (DPM)-based methods have achieved substantial success in the field of medical image segmentation. However, most of these methods fail to enable the diffusion model to learn edge features and non-edge features effectively and to inject them efficiently into the diffusion backbone. Additionally, the domain gap between the images features and the diffusion model features poses a great challenge to prostate segmentation. In this paper, we proposed CriDiff, a two-stage feature injecting framework with a Crisscross Injection Strategy (CIS) and a Generative Pre-train (GP) approach for prostate segmentation. The CIS maximizes the use of multi-level features by efficiently harnessing the complementarity of high and low-level features. To effectively learn multi-level of edge features and non-edge features, we proposed two parallel conditioners in the CIS: the Boundary Enhance Conditioner (BEC) and the Core Enhance Conditioner (CEC), which discriminatively model the image edge regions and non-edge regions, respectively. Moreover, the GP approach eases the inconsistency between the images features and the diffusion model without adding additional parameters. Extensive experiments on four benchmark datasets demonstrate the effectiveness of the proposed method and achieve state-of-the-art performance on four evaluation metrics.
comment: Accepted in MICCAI 2024
☆ EvSegSNN: Neuromorphic Semantic Segmentation for Event Data
Semantic segmentation is an important computer vision task, particularly for scene understanding and navigation of autonomous vehicles and UAVs. Several variations of deep neural network architectures have been designed to tackle this task. However, due to their huge computational costs and their high memory consumption, these models are not meant to be deployed on resource-constrained systems. To address this limitation, we introduce an end-to-end biologically inspired semantic segmentation approach by combining Spiking Neural Networks (SNNs, a low-power alternative to classical neural networks) with event cameras whose output data can directly feed these neural network inputs. We have designed EvSegSNN, a biologically plausible encoder-decoder U-shaped architecture relying on Parametric Leaky Integrate and Fire neurons in an objective to trade-off resource usage against performance. The experiments conducted on DDD17 demonstrate that EvSegSNN outperforms the closest state-of-the-art model in terms of MIoU while reducing the number of parameters by a factor of $1.6$ and sparing a batch normalization stage.
☆ Detecting sexually explicit content in the context of the child sexual abuse materials (CSAM): end-to-end classifiers and region-based networks ECML
Child sexual abuse materials (CSAM) pose a significant threat to the safety and well-being of children worldwide. Detecting and preventing the distribution of such materials is a critical task for law enforcement agencies and technology companies. As content moderation is often manual, developing an automated detection system can help reduce human reviewers' exposure to potentially harmful images and accelerate the process of counteracting. This study presents methods for classifying sexually explicit content, which plays a crucial role in the automated CSAM detection system. Several approaches are explored to solve the task: an end-to-end classifier, a classifier with person detection and a private body parts detector. All proposed methods are tested on the images obtained from the online tool for reporting illicit content. Due to legal constraints, access to the data is limited, and all algorithms are executed remotely on the isolated server. The end-to-end classifier yields the most promising results, with an accuracy of 90.17%, after augmenting the training set with the additional neutral samples and adult pornography. While detection-based methods may not achieve higher accuracy rates and cannot serve as a final classifier on their own, their inclusion in the system can be beneficial. Human body-oriented approaches generate results that are easier to interpret, and obtaining more interpretable results is essential when analyzing models that are trained without direct access to data.
comment: Accepted at ECML PKDD 2023 - SoGOOD Workshop
☆ ExVideo: Extending Video Diffusion Models via Parameter-Efficient Post-Tuning
Recently, advancements in video synthesis have attracted significant attention. Video synthesis models such as AnimateDiff and Stable Video Diffusion have demonstrated the practical applicability of diffusion models in creating dynamic visual content. The emergence of SORA has further spotlighted the potential of video generation technologies. Nonetheless, the extension of video lengths has been constrained by the limitations in computational resources. Most existing video synthesis models can only generate short video clips. In this paper, we propose a novel post-tuning methodology for video synthesis models, called ExVideo. This approach is designed to enhance the capability of current video synthesis models, allowing them to produce content over extended temporal durations while incurring lower training expenditures. In particular, we design extension strategies across common temporal model architectures respectively, including 3D convolution, temporal attention, and positional embedding. To evaluate the efficacy of our proposed post-tuning approach, we conduct extension training on the Stable Video Diffusion model. Our approach augments the model's capacity to generate up to $5\times$ its original number of frames, requiring only 1.5k GPU hours of training on a dataset comprising 40k videos. Importantly, the substantial increase in video length doesn't compromise the model's innate generalization capabilities, and the model showcases its advantages in generating videos of diverse styles and resolutions. We will release the source code and the enhanced model publicly.
comment: 8 pages, 5 figures
☆ Towards Event-oriented Long Video Understanding
With the rapid development of video Multimodal Large Language Models (MLLMs), numerous benchmarks have been proposed to assess their video understanding capability. However, due to the lack of rich events in the videos, these datasets may suffer from the short-cut bias that the answers can be deduced from a few frames, without the need to watch the entire video. To address this issue, we introduce Event-Bench, an event-oriented long video understanding benchmark built on existing datasets and human annotations. Event-Bench includes six event-related tasks and 2,190 test instances to comprehensively evaluate video event understanding ability. Additionally, we propose Video Instruction Merging~(VIM), a cost-effective method that enhances video MLLMs using merged, event-intensive video instructions, addressing the scarcity of human-annotated, event-intensive data. Extensive experiments show that the best-performing model, GPT-4o, achieves an overall accuracy of 53.33, significantly outperforming the best open-source model by 41.42%. Leveraging an effective instruction synthesis method and an adaptive model architecture, VIM surpasses both state-of-the-art open-source models and GPT-4V on the Event-Bench. All code, data, and models are publicly available at https://github.com/RUCAIBox/Event-Bench.
comment: Work on progress
☆ Boosting Hyperspectral Image Classification with Gate-Shift-Fuse Mechanisms in a Novel CNN-Transformer Approach
During the process of classifying Hyperspectral Image (HSI), every pixel sample is categorized under a land-cover type. CNN-based techniques for HSI classification have notably advanced the field by their adept feature representation capabilities. However, acquiring deep features remains a challenge for these CNN-based methods. In contrast, transformer models are adept at extracting high-level semantic features, offering a complementary strength. This paper's main contribution is the introduction of an HSI classification model that includes two convolutional blocks, a Gate-Shift-Fuse (GSF) block and a transformer block. This model leverages the strengths of CNNs in local feature extraction and transformers in long-range context modelling. The GSF block is designed to strengthen the extraction of local and global spatial-spectral features. An effective attention mechanism module is also proposed to enhance the extraction of information from HSI cubes. The proposed method is evaluated on four well-known datasets (the Indian Pines, Pavia University, WHU-WHU-Hi-LongKou and WHU-Hi-HanChuan), demonstrating that the proposed framework achieves superior results compared to other models.
☆ Prediction and Reference Quality Adaptation for Learned Video Compression
Temporal prediction is one of the most important technologies for video compression. Various prediction coding modes are designed in traditional video codecs. Traditional video codecs will adaptively to decide the optimal coding mode according to the prediction quality and reference quality. Recently, learned video codecs have made great progress. However, they ignore the prediction and reference quality adaptation, which leads to incorrect utilization of temporal prediction and reconstruction error propagation. Therefore, in this paper, we first propose a confidence-based prediction quality adaptation (PQA) module to provide explicit discrimination for the spatial and channel-wise prediction quality difference. With this module, the prediction with low quality will be suppressed and that with high quality will be enhanced. The codec can adaptively decide which spatial or channel location of predictions to use. Then, we further propose a reference quality adaptation (RQA) module and an associated repeat-long training strategy to provide dynamic spatially variant filters for diverse reference qualities. With the filters, it is easier for our codec to achieve the target reconstruction quality according to reference qualities, thus reducing the propagation of reconstruction errors. Experimental results show that our codec obtains higher compression performance than the reference software of H.266/VVC and the previous state-of-the-art learned video codecs in both RGB and YUV420 colorspaces.
☆ HeartBeat: Towards Controllable Echocardiography Video Synthesis with Multimodal Conditions-Guided Diffusion Models MICCAI 2024
Echocardiography (ECHO) video is widely used for cardiac examination. In clinical, this procedure heavily relies on operator experience, which needs years of training and maybe the assistance of deep learning-based systems for enhanced accuracy and efficiency. However, it is challenging since acquiring sufficient customized data (e.g., abnormal cases) for novice training and deep model development is clinically unrealistic. Hence, controllable ECHO video synthesis is highly desirable. In this paper, we propose a novel diffusion-based framework named HeartBeat towards controllable and high-fidelity ECHO video synthesis. Our highlight is three-fold. First, HeartBeat serves as a unified framework that enables perceiving multimodal conditions simultaneously to guide controllable generation. Second, we factorize the multimodal conditions into local and global ones, with two insertion strategies separately provided fine- and coarse-grained controls in a composable and flexible manner. In this way, users can synthesize ECHO videos that conform to their mental imagery by combining multimodal control signals. Third, we propose to decouple the visual concepts and temporal dynamics learning using a two-stage training scheme for simplifying the model training. One more interesting thing is that HeartBeat can easily generalize to mask-guided cardiac MRI synthesis in a few shots, showcasing its scalability to broader applications. Extensive experiments on two public datasets show the efficacy of the proposed HeartBeat.
comment: Accepted by MICCAI 2024
☆ Seg-LSTM: Performance of xLSTM for Semantic Segmentation of Remotely Sensed Images
Recent advancements in autoregressive networks with linear complexity have driven significant research progress, demonstrating exceptional performance in large language models. A representative model is the Extended Long Short-Term Memory (xLSTM), which incorporates gating mechanisms and memory structures, performing comparably to Transformer architectures in long-sequence language tasks. Autoregressive networks such as xLSTM can utilize image serialization to extend their application to visual tasks such as classification and segmentation. Although existing studies have demonstrated Vision-LSTM's impressive results in image classification, its performance in image semantic segmentation remains unverified. Our study represents the first attempt to evaluate the effectiveness of Vision-LSTM in the semantic segmentation of remotely sensed images. This evaluation is based on a specifically designed encoder-decoder architecture named Seg-LSTM, and comparisons with state-of-the-art segmentation networks. Our study found that Vision-LSTM's performance in semantic segmentation was limited and generally inferior to Vision-Transformers-based and Vision-Mamba-based models in most comparative tests. Future research directions for enhancing Vision-LSTM are recommended. The source code is available from https://github.com/zhuqinfeng1999/Seg-LSTM.
☆ CMTNet: Convolutional Meets Transformer Network for Hyperspectral Images Classification
Hyperspectral remote sensing (HIS) enables the detailed capture of spectral information from the Earth's surface, facilitating precise classification and identification of surface crops due to its superior spectral diagnostic capabilities. However, current convolutional neural networks (CNNs) focus on local features in hyperspectral data, leading to suboptimal performance when classifying intricate crop types and addressing imbalanced sample distributions. In contrast, the Transformer framework excels at extracting global features from hyperspectral imagery. To leverage the strengths of both approaches, this research introduces the Convolutional Meet Transformer Network (CMTNet). This innovative model includes a spectral-spatial feature extraction module for shallow feature capture, a dual-branch structure combining CNN and Transformer branches for local and global feature extraction, and a multi-output constraint module that enhances classification accuracy through multi-output loss calculations and cross constraints across local, international, and joint features. Extensive experiments conducted on three datasets (WHU-Hi-LongKou, WHU-Hi-HanChuan, and WHU-Hi-HongHu) demonstrate that CTDBNet significantly outperforms other state-of-the-art networks in classification performance, validating its effectiveness in hyperspectral crop classification.
comment: 15 pages, 11figures
☆ Towards Multi-modality Fusion and Prototype-based Feature Refinement for Clinically Significant Prostate Cancer Classification in Transrectal Ultrasound
Prostate cancer is a highly prevalent cancer and ranks as the second leading cause of cancer-related deaths in men globally. Recently, the utilization of multi-modality transrectal ultrasound (TRUS) has gained significant traction as a valuable technique for guiding prostate biopsies. In this study, we propose a novel learning framework for clinically significant prostate cancer (csPCa) classification using multi-modality TRUS. The proposed framework employs two separate 3D ResNet-50 to extract distinctive features from B-mode and shear wave elastography (SWE). Additionally, an attention module is incorporated to effectively refine B-mode features and aggregate the extracted features from both modalities. Furthermore, we utilize few shot segmentation task to enhance the capacity of classification encoder. Due to the limited availability of csPCa masks, a prototype correction module is employed to extract representative prototypes of csPCa. The performance of the framework is assessed on a large-scale dataset consisting of 512 TRUS videos with biopsy-proved prostate cancer. The results demonstrate the strong capability in accurately identifying csPCa, achieving an area under the curve (AUC) of 0.86. Moreover, the framework generates visual class activation mapping (CAM), which can serve as valuable assistance for localizing csPCa. These CAM images may offer valuable guidance during TRUS-guided targeted biopsies, enhancing the efficacy of the biopsy procedure.The code is available at https://github.com/2313595986/SmileCode.
☆ Classifying Dry Eye Disease Patients from Healthy Controls Using Machine Learning and Metabolomics Data
Dry eye disease is a common disorder of the ocular surface, leading patients to seek eye care. Clinical signs and symptoms are currently used to diagnose dry eye disease. Metabolomics, a method for analyzing biological systems, has been found helpful in identifying distinct metabolites in patients and in detecting metabolic profiles that may indicate dry eye disease at early stages. In this study, we explored using machine learning and metabolomics information to identify which cataract patients suffered from dry eye disease. As there is no one-size-fits-all machine learning model for metabolomics data, choosing the most suitable model can significantly affect the quality of predictions and subsequent metabolomics analyses. To address this challenge, we conducted a comparative analysis of nine machine learning models on three metabolomics data sets from cataract patients with and without dry eye disease. The models were evaluated and optimized using nested k-fold cross-validation. To assess the performance of these models, we selected a set of suitable evaluation metrics tailored to the data set's challenges. The logistic regression model overall performed the best, achieving the highest area under the curve score of 0.8378, balanced accuracy of 0.735, Matthew's correlation coefficient of 0.5147, an F1-score of 0.8513, and a specificity of 0.5667. Additionally, following the logistic regression, the XGBoost and Random Forest models also demonstrated good performance.
☆ VGA: Vision GUI Assistant -- Minimizing Hallucinations through Image-Centric Fine-Tuning
Recent advances in Large Vision-Language Models (LVLMs) have significantly improve performance in image comprehension tasks, such as formatted charts and rich-content images. Yet, Graphical User Interface (GUI) pose a greater challenge due to their structured format and detailed textual information. Existing LVLMs often overly depend on internal knowledge and neglect image content, resulting in hallucinations and incorrect responses in GUI comprehension.To address these issues, we introduce VGA, a fine-tuned model designed for comprehensive GUI understanding. Our model aims to enhance the interpretation of visual data of GUI and reduce hallucinations. We first construct a Vision Question Answering (VQA) dataset of 63.8k high-quality examples with our propose Referent Method, which ensures the model's responses are highly depend on visual content within the image. We then design a two-stage fine-tuning method called Foundation and Advanced Comprehension (FAC) to enhance both the model's ability to extract information from image content and alignment with human intent. Experiments show that our approach enhances the model's ability to extract information from images and achieves state-of-the-art results in GUI understanding tasks. Our dataset and fine-tuning script will be released soon.
comment: 18 pages
☆ Perspective+ Unet: Enhancing Segmentation with Bi-Path Fusion and Efficient Non-Local Attention for Superior Receptive Fields
Precise segmentation of medical images is fundamental for extracting critical clinical information, which plays a pivotal role in enhancing the accuracy of diagnoses, formulating effective treatment plans, and improving patient outcomes. Although Convolutional Neural Networks (CNNs) and non-local attention methods have achieved notable success in medical image segmentation, they either struggle to capture long-range spatial dependencies due to their reliance on local features, or face significant computational and feature integration challenges when attempting to address this issue with global attention mechanisms. To overcome existing limitations in medical image segmentation, we propose a novel architecture, Perspective+ Unet. This framework is characterized by three major innovations: (i) It introduces a dual-pathway strategy at the encoder stage that combines the outcomes of traditional and dilated convolutions. This not only maintains the local receptive field but also significantly expands it, enabling better comprehension of the global structure of images while retaining detail sensitivity. (ii) The framework incorporates an efficient non-local transformer block, named ENLTB, which utilizes kernel function approximation for effective long-range dependency capture with linear computational and spatial complexity. (iii) A Spatial Cross-Scale Integrator strategy is employed to merge global dependencies and local contextual cues across model stages, meticulously refining features from various levels to harmonize global and local information. Experimental results on the ACDC and Synapse datasets demonstrate the effectiveness of our proposed Perspective+ Unet. The code is available in the supplementary material.
comment: 13 pages, 5 figures
☆ Gaze-directed Vision GNN for Mitigating Shortcut Learning in Medical Image
Deep neural networks have demonstrated remarkable performance in medical image analysis. However, its susceptibility to spurious correlations due to shortcut learning raises concerns about network interpretability and reliability. Furthermore, shortcut learning is exacerbated in medical contexts where disease indicators are often subtle and sparse. In this paper, we propose a novel gaze-directed Vision GNN (called GD-ViG) to leverage the visual patterns of radiologists from gaze as expert knowledge, directing the network toward disease-relevant regions, and thereby mitigating shortcut learning. GD-ViG consists of a gaze map generator (GMG) and a gaze-directed classifier (GDC). Combining the global modelling ability of GNNs with the locality of CNNs, GMG generates the gaze map based on radiologists' visual patterns. Notably, it eliminates the need for real gaze data during inference, enhancing the network's practical applicability. Utilizing gaze as the expert knowledge, the GDC directs the construction of graph structures by incorporating both feature distances and gaze distances, enabling the network to focus on disease-relevant foregrounds. Thereby avoiding shortcut learning and improving the network's interpretability. The experiments on two public medical image datasets demonstrate that GD-ViG outperforms the state-of-the-art methods, and effectively mitigates shortcut learning. Our code is available at https://github.com/SX-SS/GD-ViG.
☆ Resource-efficient Medical Image Analysis with Self-adapting Forward-Forward Networks
We introduce a fast Self-adapting Forward-Forward Network (SaFF-Net) for medical imaging analysis, mitigating power consumption and resource limitations, which currently primarily stem from the prevalent reliance on back-propagation for model training and fine-tuning. Building upon the recently proposed Forward-Forward Algorithm (FFA), we introduce the Convolutional Forward-Forward Algorithm (CFFA), a parameter-efficient reformulation that is suitable for advanced image analysis and overcomes the speed and generalisation constraints of the original FFA. To address hyper-parameter sensitivity of FFAs we are also introducing a self-adapting framework SaFF-Net fine-tuning parameters during warmup and training in parallel. Our approach enables more effective model training and eliminates the previously essential requirement for an arbitrarily chosen Goodness function in FFA. We evaluate our approach on several benchmarking datasets in comparison with standard Back-Propagation (BP) neural networks showing that FFA-based networks with notably fewer parameters and function evaluations can compete with standard models, especially, in one-shot scenarios and large batch sizes. The code will be available at the time of the conference.
comment: Under Review
☆ LGmap: Local-to-Global Mapping Network for Online Long-Range Vectorized HD Map Construction
This report introduces the first-place winning solution for the Autonomous Grand Challenge 2024 - Mapless Driving. In this report, we introduce a novel online mapping pipeline LGmap, which adept at long-range temporal model. Firstly, we propose symmetric view transformation(SVT), a hybrid view transformation module. Our approach overcomes the limitations of forward sparse feature representation and utilizing depth perception and SD prior information. Secondly, we propose hierarchical temporal fusion(HTF) module. It employs temporal information from local to global, which empowers the construction of long-range HD map with high stability. Lastly, we propose a novel ped-crossing resampling. The simplified ped crossing representation accelerates the instance attention based decoder convergence performance. Our method achieves 0.66 UniScore in the Mapless Driving OpenLaneV2 test set.
☆ Image anomaly detection and prediction scheme based on SSA optimized ResNet50-BiGRU model
Image anomaly detection is a popular research direction, with many methods emerging in recent years due to rapid advancements in computing. The use of artificial intelligence for image anomaly detection has been widely studied. By analyzing images of athlete posture and movement, it is possible to predict injury status and suggest necessary adjustments. Most existing methods rely on convolutional networks to extract information from irrelevant pixel data, limiting model accuracy. This paper introduces a network combining Residual Network (ResNet) and Bidirectional Gated Recurrent Unit (BiGRU), which can predict potential injury types and provide early warnings by analyzing changes in muscle and bone poses from video images. To address the high complexity of this network, the Sparrow search algorithm was used for optimization. Experiments conducted on four datasets demonstrated that our model has the smallest error in image anomaly detection compared to other models, showing strong adaptability. This provides a new approach for anomaly detection and predictive analysis in images, contributing to the sustainable development of human health and performance.
☆ Knowledge-driven Subspace Fusion and Gradient Coordination for Multi-modal Learning
Multi-modal learning plays a crucial role in cancer diagnosis and prognosis. Current deep learning based multi-modal approaches are often limited by their abilities to model the complex correlations between genomics and histology data, addressing the intrinsic complexity of tumour ecosystem where both tumour and microenvironment contribute to malignancy. We propose a biologically interpretative and robust multi-modal learning framework to efficiently integrate histology images and genomics by decomposing the feature subspace of histology images and genomics, reflecting distinct tumour and microenvironment features. To enhance cross-modal interactions, we design a knowledge-driven subspace fusion scheme, consisting of a cross-modal deformable attention module and a gene-guided consistency strategy. Additionally, in pursuit of dynamically optimizing the subspace knowledge, we further propose a novel gradient coordination learning strategy. Extensive experiments demonstrate the effectiveness of the proposed method, outperforming state-of-the-art techniques in three downstream tasks of glioma diagnosis, tumour grading, and survival analysis. Our code is available at https://github.com/helenypzhang/Subspace-Multimodal-Learning.
☆ Similarity-aware Syncretic Latent Diffusion Model for Medical Image Translation with Representation Learning
Non-contrast CT (NCCT) imaging may reduce image contrast and anatomical visibility, potentially increasing diagnostic uncertainty. In contrast, contrast-enhanced CT (CECT) facilitates the observation of regions of interest (ROI). Leading generative models, especially the conditional diffusion model, demonstrate remarkable capabilities in medical image modality transformation. Typical conditional diffusion models commonly generate images with guidance of segmentation labels for medical modal transformation. Limited access to authentic guidance and its low cardinality can pose challenges to the practical clinical application of conditional diffusion models. To achieve an equilibrium of generative quality and clinical practices, we propose a novel Syncretic generative model based on the latent diffusion model for medical image translation (S$^2$LDM), which can realize high-fidelity reconstruction without demand of additional condition during inference. S$^2$LDM enhances the similarity in distinct modal images via syncretic encoding and diffusing, promoting amalgamated information in the latent space and generating medical images with more details in contrast-enhanced regions. However, syncretic latent spaces in the frequency domain tend to favor lower frequencies, commonly locate in identical anatomic structures. Thus, S$^2$LDM applies adaptive similarity loss and dynamic similarity to guide the generation and supplements the shortfall in high-frequency details throughout the training process. Quantitative experiments confirm the effectiveness of our approach in medical image translation. Our code will release lately.
☆ SSAD: Self-supervised Auxiliary Detection Framework for Panoramic X-ray based Dental Disease Diagnosis
Panoramic X-ray is a simple and effective tool for diagnosing dental diseases in clinical practice. When deep learning models are developed to assist dentist in interpreting panoramic X-rays, most of their performance suffers from the limited annotated data, which requires dentist's expertise and a lot of time cost. Although self-supervised learning (SSL) has been proposed to address this challenge, the two-stage process of pretraining and fine-tuning requires even more training time and computational resources. In this paper, we present a self-supervised auxiliary detection (SSAD) framework, which is plug-and-play and compatible with any detectors. It consists of a reconstruction branch and a detection branch. Both branches are trained simultaneously, sharing the same encoder, without the need for finetuning. The reconstruction branch learns to restore the tooth texture of healthy or diseased teeth, while the detection branch utilizes these learned features for diagnosis. To enhance the encoder's ability to capture fine-grained features, we incorporate the image encoder of SAM to construct a texture consistency (TC) loss, which extracts image embedding from the input and output of reconstruction branch, and then enforces both embedding into the same feature space. Extensive experiments on the public DENTEX dataset through three detection tasks demonstrate that the proposed SSAD framework achieves state-of-the-art performance compared to mainstream object detection methods and SSL methods. The code is available at https://github.com/Dylonsword/SSAD
☆ Towards the in-situ Trunk Identification and Length Measurement of Sea Cucumbers via Bézier Curve Modelling
We introduce a novel vision-based framework for in-situ trunk identification and length measurement of sea cucumbers, which plays a crucial role in the monitoring of marine ranching resources and mechanized harvesting. To model sea cucumber trunk curves with varying degrees of bending, we utilize the parametric B\'{e}zier curve due to its computational simplicity, stability, and extensive range of transformation possibilities. Then, we propose an end-to-end unified framework that combines parametric B\'{e}zier curve modeling with the widely used You-Only-Look-Once (YOLO) pipeline, abbreviated as TISC-Net, and incorporates effective funnel activation and efficient multi-scale attention modules to enhance curve feature perception and learning. Furthermore, we propose incorporating trunk endpoint loss as an additional constraint to effectively mitigate the impact of endpoint deviations on the overall curve. Finally, by utilizing the depth information of pixels located along the trunk curve captured by a binocular camera, we propose accurately estimating the in-situ length of sea cucumbers through space curve integration. We established two challenging benchmark datasets for curve-based in-situ sea cucumber trunk identification. These datasets consist of over 1,000 real-world marine environment images of sea cucumbers, accompanied by B\'{e}zier format annotations. We conduct evaluation on SC-ISTI, for which our method achieves mAP50 above 0.9 on both object detection and trunk identification tasks. Extensive length measurement experiments demonstrate that the average absolute relative error is around 0.15.
☆ 2nd Place Solution for MeViS Track in CVPR 2024 PVUW Workshop: Motion Expression guided Video Segmentation
Motion Expression guided Video Segmentation is a challenging task that aims at segmenting objects in the video based on natural language expressions with motion descriptions. Unlike the previous referring video object segmentation (RVOS), this task focuses more on the motion in video content for language-guided video object segmentation, requiring an enhanced ability to model longer temporal, motion-oriented vision-language data. In this report, based on the RVOS methods, we successfully introduce mask information obtained from the video instance segmentation model as preliminary information for temporal enhancement and employ SAM for spatial refinement. Finally, our method achieved a score of 49.92 J &F in the validation phase and 54.20 J &F in the test phase, securing the final ranking of 2nd in the MeViS Track at the CVPR 2024 PVUW Challenge.
☆ PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents
Recent advancements in Large Multimodal Models (LMMs) have leveraged extensive multimodal datasets to enhance capabilities in complex knowledge-driven tasks. However, persistent challenges in perceptual and reasoning errors limit their efficacy, particularly in interpreting intricate visual data and deducing multimodal relationships. Addressing these issues, we introduce a novel dataset format, PIN (Paired and INterleaved multimodal documents), designed to significantly improve both the depth and breadth of multimodal training. The PIN format is built on three foundational principles: knowledge intensity, scalability, and support for diverse training modalities. This innovative format combines markdown files and comprehensive images to enrich training data with a dense knowledge structure and versatile training strategies. We present PIN-14M, an open-source dataset comprising 14 million samples derived from a diverse range of Chinese and English sources, tailored to include complex web and scientific content. This dataset is constructed meticulously to ensure data quality and ethical integrity, aiming to facilitate advanced training strategies and improve model robustness against common multimodal training pitfalls. Our initial results, forming the basis of this technical report, suggest significant potential for the PIN format in refining LMM performance, with plans for future expansions and detailed evaluations of its impact on model capabilities.
☆ From Descriptive Richness to Bias: Unveiling the Dark Side of Generative Image Caption Enrichment
Large language models (LLMs) have enhanced the capacity of vision-language models to caption visual text. This generative approach to image caption enrichment further makes textual captions more descriptive, improving alignment with the visual context. However, while many studies focus on benefits of generative caption enrichment (GCE), are there any negative side effects? We compare standard-format captions and recent GCE processes from the perspectives of "gender bias" and "hallucination", showing that enriched captions suffer from increased gender bias and hallucination. Furthermore, models trained on these enriched captions amplify gender bias by an average of 30.9% and increase hallucination by 59.5%. This study serves as a caution against the trend of making captions more descriptive.
☆ The Use of Multimodal Large Language Models to Detect Objects from Thermal Images: Transportation Applications
The integration of thermal imaging data with Multimodal Large Language Models (MLLMs) constitutes an exciting opportunity for improving the safety and functionality of autonomous driving systems and many Intelligent Transportation Systems (ITS) applications. This study investigates whether MLLMs can understand complex images from RGB and thermal cameras and detect objects directly. Our goals were to 1) assess the ability of the MLLM to learn from information from various sets, 2) detect objects and identify elements in thermal cameras, 3) determine whether two independent modality images show the same scene, and 4) learn all objects using different modalities. The findings showed that both GPT-4 and Gemini were effective in detecting and classifying objects in thermal images. Similarly, the Mean Absolute Percentage Error (MAPE) for pedestrian classification was 70.39% and 81.48%, respectively. Moreover, the MAPE for bike, car, and motorcycle detection were 78.4%, 55.81%, and 96.15%, respectively. Gemini produced MAPE of 66.53%, 59.35% and 78.18% respectively. This finding further demonstrates that MLLM can identify thermal images and can be employed in advanced imaging automation technologies for ITS applications.
☆ ImageFlowNet: Forecasting Multiscale Trajectories of Disease Progression with Irregularly-Sampled Longitudinal Medical Images
The forecasting of disease progression from images is a holy grail for clinical decision making. However, this task is complicated by the inherent high dimensionality, temporal sparsity and sampling irregularity in longitudinal image acquisitions. Existing methods often rely on extracting hand-crafted features and performing time-series analysis in this vector space, leading to a loss of rich spatial information within the images. To overcome these challenges, we introduce ImageFlowNet, a novel framework that learns latent-space flow fields that evolve multiscale representations in joint embedding spaces using neural ODEs and SDEs to model disease progression in the image domain. Notably, ImageFlowNet learns multiscale joint representation spaces by combining cohorts of patients together so that information can be transferred between the patient samples. The dynamics then provide plausible trajectories of progression, with the SDE providing alternative trajectories from the same starting point. We provide theoretical insights that support our formulation of ODEs, and motivate our regularizations involving high-level visual features, latent space organization, and trajectory smoothness. We then demonstrate ImageFlowNet's effectiveness through empirical evaluations on three longitudinal medical image datasets depicting progression in retinal geographic atrophy, multiple sclerosis, and glioblastoma.
☆ Evaluating Numerical Reasoning in Text-to-Image Models
Text-to-image generative models are capable of producing high-quality images that often faithfully depict concepts described using natural language. In this work, we comprehensively evaluate a range of text-to-image models on numerical reasoning tasks of varying difficulty, and show that even the most advanced models have only rudimentary numerical skills. Specifically, their ability to correctly generate an exact number of objects in an image is limited to small numbers, it is highly dependent on the context the number term appears in, and it deteriorates quickly with each successive number. We also demonstrate that models have poor understanding of linguistic quantifiers (such as "a few" or "as many as"), the concept of zero, and struggle with more advanced concepts such as partial quantities and fractional representations. We bundle prompts, generated images and human annotations into GeckoNum, a novel benchmark for evaluation of numerical reasoning.
☆ Regularized Distribution Matching Distillation for One-step Unpaired Image-to-Image Translation
Diffusion distillation methods aim to compress the diffusion models into efficient one-step generators while trying to preserve quality. Among them, Distribution Matching Distillation (DMD) offers a suitable framework for training general-form one-step generators, applicable beyond unconditional generation. In this work, we introduce its modification, called Regularized Distribution Matching Distillation, applicable to unpaired image-to-image (I2I) problems. We demonstrate its empirical performance in application to several translation tasks, including 2D examples and I2I between different image datasets, where it performs on par or better than multi-step diffusion baselines.
☆ This Looks Better than That: Better Interpretable Models with ProtoPNeXt
Prototypical-part models are a popular interpretable alternative to black-box deep learning models for computer vision. However, they are difficult to train, with high sensitivity to hyperparameter tuning, inhibiting their application to new datasets and our understanding of which methods truly improve their performance. To facilitate the careful study of prototypical-part networks (ProtoPNets), we create a new framework for integrating components of prototypical-part models -- ProtoPNeXt. Using ProtoPNeXt, we show that applying Bayesian hyperparameter tuning and an angular prototype similarity metric to the original ProtoPNet is sufficient to produce new state-of-the-art accuracy for prototypical-part models on CUB-200 across multiple backbones. We further deploy this framework to jointly optimize for accuracy and prototype interpretability as measured by metrics included in ProtoPNeXt. Using the same resources, this produces models with substantially superior semantics and changes in accuracy between +1.3% and -1.5%. The code and trained models will be made publicly available upon publication.
☆ Holistic Evaluation for Interleaved Text-and-Image Generation
Interleaved text-and-image generation has been an intriguing research direction, where the models are required to generate both images and text pieces in an arbitrary order. Despite the emerging advancements in interleaved generation, the progress in its evaluation still significantly lags behind. Existing evaluation benchmarks do not support arbitrarily interleaved images and text for both inputs and outputs, and they only cover a limited number of domains and use cases. Also, current works predominantly use similarity-based metrics which fall short in assessing the quality in open-ended scenarios. To this end, we introduce InterleavedBench, the first benchmark carefully curated for the evaluation of interleaved text-and-image generation. InterleavedBench features a rich array of tasks to cover diverse real-world use cases. In addition, we present InterleavedEval, a strong reference-free metric powered by GPT-4o to deliver accurate and explainable evaluation. We carefully define five essential evaluation aspects for InterleavedEval, including text quality, perceptual quality, image coherence, text-image coherence, and helpfulness, to ensure a comprehensive and fine-grained assessment. Through extensive experiments and rigorous human evaluation, we show that our benchmark and metric can effectively evaluate the existing models with a strong correlation with human judgments surpassing previous reference-based metrics. We also provide substantial findings and insights to foster future research in interleaved generation and its evaluation.
comment: Work in progress. 13 pages, 5 figure, 6 tables
☆ Stylebreeder: Exploring and Democratizing Artistic Styles through Text-to-Image Models
Text-to-image models are becoming increasingly popular, revolutionizing the landscape of digital art creation by enabling highly detailed and creative visual content generation. These models have been widely employed across various domains, particularly in art generation, where they facilitate a broad spectrum of creative expression and democratize access to artistic creation. In this paper, we introduce \texttt{STYLEBREEDER}, a comprehensive dataset of 6.8M images and 1.8M prompts generated by 95K users on Artbreeder, a platform that has emerged as a significant hub for creative exploration with over 13M users. We introduce a series of tasks with this dataset aimed at identifying diverse artistic styles, generating personalized content, and recommending styles based on user interests. By documenting unique, user-generated styles that transcend conventional categories like 'cyberpunk' or 'Picasso,' we explore the potential for unique, crowd-sourced styles that could provide deep insights into the collective creative psyche of users worldwide. We also evaluate different personalization methods to enhance artistic expression and introduce a style atlas, making these models available in LoRA format for public use. Our research demonstrates the potential of text-to-image diffusion models to uncover and promote unique artistic expressions, further democratizing AI in art and fostering a more diverse and inclusive artistic community. The dataset, code and models are available at https://stylebreeder.github.io under a Public Domain (CC0) license.
☆ ICAL: Continual Learning of Multimodal Agents by Transforming Trajectories into Actionable Insights
Large-scale generative language and vision-language models (LLMs and VLMs) excel in few-shot in-context learning for decision making and instruction following. However, they require high-quality exemplar demonstrations to be included in their context window. In this work, we ask: Can LLMs and VLMs generate their own prompt examples from generic, sub-optimal demonstrations? We propose In-Context Abstraction Learning (ICAL), a method that builds a memory of multimodal experience insights from sub-optimal demonstrations and human feedback. Given a noisy demonstration in a new domain, VLMs abstract the trajectory into a general program by fixing inefficient actions and annotating cognitive abstractions: task relationships, object state changes, temporal subgoals, and task construals. These abstractions are refined and adapted interactively through human feedback while the agent attempts to execute the trajectory in a similar environment. The resulting abstractions, when used as exemplars in the prompt, significantly improve decision-making in retrieval-augmented LLM and VLM agents. Our ICAL agent surpasses the state-of-the-art in dialogue-based instruction following in TEACh, multimodal web agents in VisualWebArena, and action anticipation in Ego4D. In TEACh, we achieve a 12.6% improvement in goal-condition success. In VisualWebArena, our task success rate improves over the SOTA from 14.3% to 22.7%. In Ego4D action forecasting, we improve over few-shot GPT-4V and remain competitive with supervised models. We show finetuning our retrieval-augmented in-context agent yields additional improvements. Our approach significantly reduces reliance on expert-crafted examples and consistently outperforms in-context learning from action plans that lack such insights.
comment: Project website: http://ical-learning.github.io/
♻ ☆ Photometry of Saturated Stars with Neural Networks
We use a multilevel perceptron (MLP) neural network to obtain photometry of saturated stars in the All-Sky Automated Survey for Supernovae (ASAS-SN). The MLP can obtain fairly unbiased photometry for stars from g~4 to 14~mag, particularly compared to the dispersion (15%-85% 1sigma range around the median) of 0.12 mag for saturated (g<11.5 mag) stars. More importantly, the light curve of a non-variable saturated star has a median dispersion of only 0.037 mag. The MLP light curves are, in many cases, spectacularly better than those provided by the standard ASAS-SN pipelines. While the network was trained on g band data from only one of ASAS-SN's 20 cameras, initial experiments suggest that it can be used for any camera and the older ASAS-SN V band data as well. The dominant problems seem to be associated with correctable issues in the ASAS-SN data reduction pipeline for saturated stars more than the MLP itself. The method is publicly available as a light curve option on ASAS-SN Sky Patrol v1.0.
comment: accepted by ApJ
♻ ☆ MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection
Recent advances in deep learning have mainly relied on Transformers due to their data dependency and ability to learn at scale. The attention module in these architectures, however, exhibits quadratic time and space in input size, limiting their scalability for long-sequence modeling. State Space Models (SSMs), and more specifically Selective SSMs (S6), with efficient hardware-aware implementation, have shown promising potential for long causal sequence modeling. They, however, use separate blocks for each channel and fail to filter irrelevant channels and capture inter-channel dependencies. Natural attempt to mix information across channels using MLP, attention, or SSMs results in further instability in the training of SSMs for large networks and/or nearly double the number of parameters. We present the MambaMixer block, a new SSM-based architecture with data-dependent weights that uses a dual selection mechanism across tokens and channels-called Selective Token and Channel Mixer. To mitigate doubling the number of parameters, we present a new non-causal heuristic of the S6 block with a hardware-friendly implementation. We further present an efficient variant of MambaMixer, called QSMixer, that mixes information along both sequence and embedding dimensions. As a proof of concept, we design Vision MambaMixer (ViM2) and Vision QSMixer (ViQS) architectures. To enhance their ability to capture spatial information in images, we present Switch of Scans (SoS) that dynamically uses a set of useful image scans to traverse image patches. We evaluate the performance of our methods in image classification, segmentation, and object detection. Our results underline the importance of selectively mixing across both tokens and channels and show the competitive (resp. superior) performance of our methods with well-established vision models (resp. SSM-based models).
♻ ☆ Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos
Generating realistic audio for human interactions is important for many applications, such as creating sound effects for films or virtual reality games. Existing approaches implicitly assume total correspondence between the video and audio during training, yet many sounds happen off-screen and have weak to no correspondence with the visuals -- resulting in uncontrolled ambient sounds or hallucinations at test time. We propose a novel ambient-aware audio generation model, AV-LDM. We devise a novel audio-conditioning mechanism to learn to disentangle foreground action sounds from the ambient background sounds in in-the-wild training videos. Given a novel silent video, our model uses retrieval-augmented generation to create audio that matches the visual content both semantically and temporally. We train and evaluate our model on two in-the-wild egocentric video datasets Ego4D and EPIC-KITCHENS. Our model outperforms an array of existing methods, allows controllable generation of the ambient sound, and even shows promise for generalizing to computer graphics game clips. Overall, our work is the first to focus video-to-audio generation faithfully on the observed visual content despite training from uncurated clips with natural background sounds.
comment: Project page: https://vision.cs.utexas.edu/projects/action2sound
♻ ☆ STimage-1K4M: A histopathology image-gene expression dataset for spatial transcriptomics
Recent advances in multi-modal algorithms have driven and been driven by the increasing availability of large image-text datasets, leading to significant strides in various fields, including computational pathology. However, in most existing medical image-text datasets, the text typically provides high-level summaries that may not sufficiently describe sub-tile regions within a large pathology image. For example, an image might cover an extensive tissue area containing cancerous and healthy regions, but the accompanying text might only specify that this image is a cancer slide, lacking the nuanced details needed for in-depth analysis. In this study, we introduce STimage-1K4M, a novel dataset designed to bridge this gap by providing genomic features for sub-tile images. STimage-1K4M contains 1,149 images derived from spatial transcriptomics data, which captures gene expression information at the level of individual spatial spots within a pathology image. Specifically, each image in the dataset is broken down into smaller sub-image tiles, with each tile paired with 15,000-30,000 dimensional gene expressions. With 4,293,195 pairs of sub-tile images and gene expressions, STimage-1K4M offers unprecedented granularity, paving the way for a wide range of advanced research in multi-modal data analysis an innovative applications in computational pathology, and beyond.
♻ ☆ RankCLIP: Ranking-Consistent Language-Image Pretraining
Self-supervised contrastive learning models, such as CLIP, have set new benchmarks for vision-language models in many downstream tasks. However, their dependency on rigid one-to-one mappings overlooks the complex and often multifaceted relationships between and within texts and images. To this end, we introduce RANKCLIP, a novel pretraining method that extends beyond the rigid one-to-one matching framework of CLIP and its variants. By extending the traditional pair-wise loss to list-wise, and leveraging both in-modal and cross-modal ranking consistency, RANKCLIP improves the alignment process, enabling it to capture the nuanced many-to-many relationships between and within each modality. Through comprehensive experiments, we demonstrate the effectiveness of RANKCLIP in various downstream tasks, notably achieving significant gains in zero-shot classifications over state-of-the-art methods, underscoring the importance of this enhanced learning process.
comment: 12 pages, 4 figures, 6 tables. Code and model checkpoints are available at https://github.com/Jam1ezhang/RankCLIP
♻ ☆ Deciphering 'What' and 'Where' Visual Pathways from Spectral Clustering of Layer-Distributed Neural Representations CVPR2024
We present an approach for analyzing grouping information contained within a neural network's activations, permitting extraction of spatial layout and semantic segmentation from the behavior of large pre-trained vision models. Unlike prior work, our method conducts a holistic analysis of a network's activation state, leveraging features from all layers and obviating the need to guess which part of the model contains relevant information. Motivated by classic spectral clustering, we formulate this analysis in terms of an optimization objective involving a set of affinity matrices, each formed by comparing features within a different layer. Solving this optimization problem using gradient descent allows our technique to scale from single images to dataset-level analysis, including, in the latter, both intra- and inter-image relationships. Analyzing a pre-trained generative transformer provides insight into the computational strategy learned by such models. Equating affinity with key-query similarity across attention layers yields eigenvectors encoding scene spatial layout, whereas defining affinity by value vector similarity yields eigenvectors encoding object identity. This result suggests that key and query vectors coordinate attentional information flow according to spatial proximity (a `where' pathway), while value vectors refine a semantic category representation (a `what' pathway).
comment: Accepted to CVPR2024 (Highlight)
♻ ☆ DREW : Towards Robust Data Provenance by Leveraging Error-Controlled Watermarking
Identifying the origin of data is crucial for data provenance, with applications including data ownership protection, media forensics, and detecting AI-generated content. A standard approach involves embedding-based retrieval techniques that match query data with entries in a reference dataset. However, this method is not robust against benign and malicious edits. To address this, we propose Data Retrieval with Error-corrected codes and Watermarking (DREW). DREW randomly clusters the reference dataset, injects unique error-controlled watermark keys into each cluster, and uses these keys at query time to identify the appropriate cluster for a given sample. After locating the relevant cluster, embedding vector similarity retrieval is performed within the cluster to find the most accurate matches. The integration of error control codes (ECC) ensures reliable cluster assignments, enabling the method to perform retrieval on the entire dataset in case the ECC algorithm cannot detect the correct cluster with high confidence. This makes DREW maintain baseline performance, while also providing opportunities for performance improvements due to the increased likelihood of correctly matching queries to their origin when performing retrieval on a smaller subset of the dataset. Depending on the watermark technique used, DREW can provide substantial improvements in retrieval accuracy (up to 40\% for some datasets and modification types) across multiple datasets and state-of-the-art embedding models (e.g., DinoV2, CLIP), making our method a promising solution for secure and reliable source identification. The code is available at https://github.com/mehrdadsaberi/DREW
♻ ☆ Automated Evaluation of Large Vision-Language Models on Self-driving Corner Cases
Large Vision-Language Models (LVLMs) have received widespread attention in advancing the interpretable self-driving. Existing evaluations of LVLMs primarily focus on the multi-faceted capabilities in natural circumstances, lacking automated and quantifiable assessment for self-driving, let alone the severe road corner cases. In this paper, we propose CODA-LM, the very first benchmark for the automatic evaluation of LVLMs for self-driving corner cases. We adopt a hierarchical data structure to prompt powerful LVLMs to analyze complex driving scenes and generate high-quality pre-annotation for human annotators, and for LVLM evaluation, we show that using the text-only large language models (LLMs) as judges reveals even better alignment with human preferences than the LVLM judges. Moreover, with CODA-LM, we build CODA-VLM, a new driving LVLM surpassing all the open-sourced counterparts on CODA-LM. Our CODA-VLM performs comparably with GPT-4V, even surpassing GPT-4V by +21.42% on the regional perception task. We hope CODA-LM can become the catalyst to promote interpretable self-driving empowered by LVLMs.
comment: Project Page: https://coda-dataset.github.io/coda-lm/
♻ ☆ Multimodal Variational Autoencoder for Low-cost Cardiac Hemodynamics Instability Detection
Recent advancements in non-invasive detection of cardiac hemodynamic instability (CHDI) primarily focus on applying machine learning techniques to a single data modality, e.g. cardiac magnetic resonance imaging (MRI). Despite their potential, these approaches often fall short especially when the size of labeled patient data is limited, a common challenge in the medical domain. Furthermore, only a few studies have explored multimodal methods to study CHDI, which mostly rely on costly modalities such as cardiac MRI and echocardiogram. In response to these limitations, we propose a novel multimodal variational autoencoder ($\text{CardioVAE}_\text{X,G}$) to integrate low-cost chest X-ray (CXR) and electrocardiogram (ECG) modalities with pre-training on a large unlabeled dataset. Specifically, $\text{CardioVAE}_\text{X,G}$ introduces a novel tri-stream pre-training strategy to learn both shared and modality-specific features, thus enabling fine-tuning with both unimodal and multimodal datasets. We pre-train $\text{CardioVAE}_\text{X,G}$ on a large, unlabeled dataset of $50,982$ subjects from a subset of MIMIC database and then fine-tune the pre-trained model on a labeled dataset of $795$ subjects from the ASPIRE registry. Comprehensive evaluations against existing methods show that $\text{CardioVAE}_\text{X,G}$ offers promising performance (AUROC $=0.79$ and Accuracy $=0.77$), representing a significant step forward in non-invasive prediction of CHDI. Our model also excels in producing fine interpretations of predictions directly associated with clinical features, thereby supporting clinical decision-making.
♻ ☆ Safety of Multimodal Large Language Models on Images and Texts IJCAI2024
Attracted by the impressive power of Multimodal Large Language Models (MLLMs), the public is increasingly utilizing them to improve the efficiency of daily work. Nonetheless, the vulnerabilities of MLLMs to unsafe instructions bring huge safety risks when these models are deployed in real-world scenarios. In this paper, we systematically survey current efforts on the evaluation, attack, and defense of MLLMs' safety on images and text. We begin with introducing the overview of MLLMs on images and text and understanding of safety, which helps researchers know the detailed scope of our survey. Then, we review the evaluation datasets and metrics for measuring the safety of MLLMs. Next, we comprehensively present attack and defense techniques related to MLLMs' safety. Finally, we analyze several unsolved issues and discuss promising research directions. The latest papers are continually collected at https://github.com/isXinLiu/MLLM-Safety-Collection.
comment: Accepted at IJCAI2024
♻ ☆ GLIMPSE: Generalized Local Imaging with MLPs
Deep learning is the current de facto state of the art in tomographic imaging. A common approach is to feed the result of a simple inversion, for example the backprojection, to a convolutional neural network (CNN) which then computes the reconstruction. Despite strong results on 'in-distribution' test data similar to the training data, backprojection from sparse-view data delocalizes singularities, so these approaches require a large receptive field to perform well. As a consequence, they overfit to certain global structures which leads to poor generalization on out-of-distribution (OOD) samples. Moreover, their memory complexity and training time scale unfavorably with image resolution, making them impractical for application at realistic clinical resolutions, especially in 3D: a standard U-Net requires a substantial 140GB of memory and 2600 seconds per epoch on a research-grade GPU when training on 1024x1024 images. In this paper, we introduce GLIMPSE, a local processing neural network for computed tomography which reconstructs a pixel value by feeding only the measurements associated with the neighborhood of the pixel to a simple MLP. While achieving comparable or better performance with successful CNNs like the U-Net on in-distribution test data, GLIMPSE significantly outperforms them on OOD samples while maintaining a memory footprint almost independent of image resolution; 5GB memory suffices to train on 1024x1024 images. Further, we built GLIMPSE to be fully differentiable, which enables feats such as recovery of accurate projection angles if they are out of calibration.
comment: 12 pages, 10 figures
♻ ☆ Graph Neural Networks in Histopathology: Emerging Trends and Future Directions
Histopathological analysis of Whole Slide Images (WSIs) has seen a surge in the utilization of deep learning methods, particularly Convolutional Neural Networks (CNNs). However, CNNs often fall short in capturing the intricate spatial dependencies inherent in WSIs. Graph Neural Networks (GNNs) present a promising alternative, adept at directly modeling pairwise interactions and effectively discerning the topological tissue and cellular structures within WSIs. Recognizing the pressing need for deep learning techniques that harness the topological structure of WSIs, the application of GNNs in histopathology has experienced rapid growth. In this comprehensive review, we survey GNNs in histopathology, discuss their applications, and explore emerging trends that pave the way for future advancements in the field. We begin by elucidating the fundamentals of GNNs and their potential applications in histopathology. Leveraging quantitative literature analysis, we identify four emerging trends: Hierarchical GNNs, Adaptive Graph Structure Learning, Multimodal GNNs, and Higher-order GNNs. Through an in-depth exploration of these trends, we offer insights into the evolving landscape of GNNs in histopathological analysis. Based on our findings, we propose future directions to propel the field forward. Our analysis serves to guide researchers and practitioners towards innovative approaches and methodologies, fostering advancements in histopathological analysis through the lens of graph neural networks.
♻ ☆ A Single Graph Convolution Is All You Need: Efficient Grayscale Image Classification ICIP 2024
Image classifiers often rely on convolutional neural networks (CNN) for their tasks, which, for image classification, experience high latency due to the number of operations they perform, which can be problematic in real-time applications. Additionally, many image classification models work on both RGB and grayscale datasets. Classifiers that operate solely on grayscale images are much less common. Grayscale image classification has diverse applications, including but not limited to medical image classification and synthetic aperture radar (SAR) automatic target recognition (ATR). Thus, we present a novel grayscale image classification approach using a vectorized view of images. We exploit the lightweightness of MLPs by viewing images as vectors and reducing our problem setting to the grayscale image classification setting. We find that using a single graph convolutional layer batch-wise increases accuracy and reduces variance in the performance of our model. Moreover, we develop a customized accelerator on FPGA for the proposed model with several optimizations to improve its performance. Our experimental results on benchmark grayscale image datasets demonstrate the effectiveness of the proposed model, achieving vastly lower latency (up to 16$\times$ less) and competitive or leading performance compared to other state-of-the-art image classification models on various domain-specific grayscale image classification datasets.
comment: Accepted to IEEE ICIP 2024
♻ ☆ Towards Enhanced Analysis of Lung Cancer Lesions in EBUS-TBNA -- A Semi-Supervised Video Object Detection Method
This study aims to establish a computer-aided diagnostic system for lung lesions using endobronchial ultrasound (EBUS) to assist physicians in identifying lesion areas. During EBUS-transbronchial needle aspiration (EBUS-TBNA) procedures, hysicians rely on grayscale ultrasound images to determine the location of lesions. However, these images often contain significant noise and can be influenced by surrounding tissues or blood vessels, making identification challenging. Previous research has lacked the application of object detection models to EBUS-TBNA, and there has been no well-defined solution for the lack of annotated data in the EBUS-TBNA dataset. In related studies on ultrasound images, although models have been successful in capturing target regions for their respective tasks, their training and predictions have been based on two-dimensional images, limiting their ability to leverage temporal features for improved predictions. This study introduces a three-dimensional video-based object detection model. It first generates a set of improved queries using a diffusion model, then captures temporal correlations through an attention mechanism. A filtering mechanism selects relevant information from previous frames to pass to the current frame. Subsequently, a teacher-student model training approach is employed to further optimize the model using unlabeled data. By incorporating various data augmentation and feature alignment, the model gains robustness against interference. Test results demonstrate that this model, which captures spatiotemporal information and employs semi-supervised learning methods, achieves an Average Precision (AP) of 48.7 on the test dataset, outperforming other models. It also achieves an Average Recall (AR) of 79.2, significantly leading over existing models.
♻ ☆ Keep Moving: identifying task-relevant subspaces to maximise plasticity for newly learned tasks
Continual learning algorithms strive to acquire new knowledge while preserving prior information. Often, these algorithms emphasise stability and restrict network updates upon learning new tasks. In many cases, such restrictions come at a cost to the model's plasticity, i.e. the model's ability to adapt to the requirements of a new task. But is all change detrimental? Here, we approach this question by proposing that activation spaces in neural networks can be decomposed into two subspaces: a readout range in which change affects prior tasks and a null space in which change does not alter prior performance. Based on experiments with this novel technique, we show that, indeed, not all activation change is associated with forgetting. Instead, only change in the subspace visible to the readout of a task can lead to decreased stability, while restricting change outside of this subspace is associated only with a loss of plasticity. Analysing various commonly used algorithms, we show that regularisation-based techniques do not fully disentangle the two spaces and, as a result, restrict plasticity more than need be. We expand our results by investigating a linear model in which we can manipulate learning in the two subspaces directly and thus causally link activation changes to stability and plasticity. For hierarchical, nonlinear cases, we present an approximation that enables us to estimate functionally relevant subspaces at every layer of a deep nonlinear network, corroborating our previous insights. Together, this work provides novel means to derive insights into the mechanisms behind stability and plasticity in continual learning and may serve as a diagnostic tool to guide developments of future continual learning algorithms that stabilise inference while allowing maximal space for learning.
comment: 17 pages, 7 figures, Revision now accepted at COLLAS 2024
♻ ☆ High-Resolution Building and Road Detection from Sentinel-2
Mapping buildings and roads automatically with remote sensing typically requires high-resolution imagery, which is expensive to obtain and often sparsely available. In this work we demonstrate how multiple 10 m resolution Sentinel-2 images can be used to generate 50 cm resolution building and road segmentation masks. This is done by training a `student' model with access to Sentinel-2 images to reproduce the predictions of a `teacher' model which has access to corresponding high-resolution imagery. While the predictions do not have all the fine detail of the teacher model, we find that we are able to retain much of the performance: for building segmentation we achieve 78.3% mIoU, compared to the high-resolution teacher model accuracy of 85.3% mIoU. We also describe a related method for counting individual buildings in a Sentinel-2 patch which achieves R^2 = 0.91 against true counts. This work opens up new possibilities for using freely available Sentinel-2 imagery for a range of tasks that previously could only be done with high-resolution satellite imagery.
♻ ☆ Exploring Fine-Grained Representation and Recomposition for Cloth-Changing Person Re-Identification
Cloth-changing person Re-IDentification (Re-ID) is a particularly challenging task, suffering from two limitations of inferior discriminative features and limited training samples. Existing methods mainly leverage auxiliary information to facilitate identity-relevant feature learning, including soft-biometrics features of shapes or gaits, and additional labels of clothing. However, this information may be unavailable in real-world applications. In this paper, we propose a novel FIne-grained Representation and Recomposition (FIRe$^{2}$) framework to tackle both limitations without any auxiliary annotation or data. Specifically, we first design a Fine-grained Feature Mining (FFM) module to separately cluster images of each person. Images with similar so-called fine-grained attributes (e.g., clothes and viewpoints) are encouraged to cluster together. An attribute-aware classification loss is introduced to perform fine-grained learning based on cluster labels, which are not shared among different people, promoting the model to learn identity-relevant features. Furthermore, to take full advantage of fine-grained attributes, we present a Fine-grained Attribute Recomposition (FAR) module by recomposing image features with different attributes in the latent space. It significantly enhances robust feature learning. Extensive experiments demonstrate that FIRe$^{2}$ can achieve state-of-the-art performance on five widely-used cloth-changing person Re-ID benchmarks. The code is available at https://github.com/QizaoWang/FIRe-CCReID.
comment: Accepted by IEEE TIFS 2024
♻ ☆ Precipitation Downscaling with Spatiotemporal Video Diffusion
In climate science and meteorology, high-resolution local precipitation (rain and snowfall) predictions are limited by the computational costs of simulation-based methods. Statistical downscaling, or super-resolution, is a common workaround where a low-resolution prediction is improved using statistical approaches. Unlike traditional computer vision tasks, weather and climate applications require capturing the accurate conditional distribution of high-resolution given low-resolution patterns to assure reliable ensemble averages and unbiased estimates of extreme events, such as heavy rain. This work extends recent video diffusion models to precipitation super-resolution, employing a deterministic downscaler followed by a temporally-conditioned diffusion model to capture noise characteristics and high-frequency patterns. We test our approach on FV3GFS output, an established large-scale global atmosphere model, and compare it against six state-of-the-art baselines. Our analysis, capturing CRPS, MSE, precipitation distributions, and qualitative aspects using California and the Himalayas as examples, establishes our method as a new standard for data-driven precipitation downscaling.
♻ ☆ Stable Phase Retrieval with Mirror Descent
In this paper, we aim to reconstruct an n-dimensional real vector from m phaseless measurements corrupted by an additive noise. We extend the noiseless framework developed in [15], based on mirror descent (or Bregman gradient descent), to deal with noisy measurements and prove that the procedure is stable to (small enough) additive noise. In the deterministic case, we show that mirror descent converges to a critical point of the phase retrieval problem, and if the algorithm is well initialized and the noise is small enough, the critical point is near the true vector up to a global sign change. When the measurements are i.i.d Gaussian and the signal-to-noise ratio is large enough, we provide global convergence guarantees that ensure that with high probability, mirror descent converges to a global minimizer near the true vector (up to a global sign change), as soon as the number of measurements m is large enough. The sample complexity bound can be improved if a spectral method is used to provide a good initial guess. We complement our theoretical study with several numerical results showing that mirror descent is both a computationally and statistically efficient scheme to solve the phase retrieval problem.
♻ ☆ RobMOT: Robust 3D Multi-Object Tracking by Observational Noise and State Estimation Drift Mitigation on LiDAR PointCloud
This work addresses limitations in recent 3D tracking-by-detection methods, focusing on identifying legitimate trajectories and addressing state estimation drift in Kalman filters. Current methods rely heavily on threshold-based filtering of false positive detections using detection scores to prevent ghost trajectories. However, this approach is inadequate for distant and partially occluded objects, where detection scores tend to drop, potentially leading to false positives exceeding the threshold. Additionally, the literature generally treats detections as precise localizations of objects. Our research reveals that noise in detections impacts localization information, causing trajectory drift for occluded objects and hindering recovery. To this end, we propose a novel online track validity mechanism that temporally distinguishes between legitimate and ghost tracks, along with a multi-stage observational gating process for incoming observations. This mechanism significantly improves tracking performance, with a $6.28\%$ in HOTA and a $17.87\%$ increase in MOTA. We also introduce a refinement to the Kalman filter that enhances noise mitigation in trajectory drift, leading to more robust state estimation for occluded objects. Our framework, RobMOT, outperforms state-of-the-art methods, including deep learning approaches, across various detectors, achieving up to a $4\%$ margin in HOTA and $6\%$ in MOTA. RobMOT excels under challenging conditions, such as prolonged occlusions and tracking distant objects, with up to a 59\% improvement in processing latency.
♻ ☆ Can Better Text Semantics in Prompt Tuning Improve VLM Generalization?
Going beyond mere fine-tuning of vision-language models (VLMs), learnable prompt tuning has emerged as a promising, resource-efficient alternative. Despite their potential, effectively learning prompts faces the following challenges: (i) training in a low-shot scenario results in overfitting, limiting adaptability, and yielding weaker performance on newer classes or datasets; (ii) prompt-tuning's efficacy heavily relies on the label space, with decreased performance in large class spaces, signaling potential gaps in bridging image and class concepts. In this work, we investigate whether better text semantics can help address these concerns. In particular, we introduce a prompt-tuning method that leverages class descriptions obtained from Large Language Models (LLMs). These class descriptions are used to bridge image and text modalities. Our approach constructs part-level description-guided image and text features, which are subsequently aligned to learn more generalizable prompts. Our comprehensive experiments conducted across 11 benchmark datasets show that our method outperforms established methods, demonstrating substantial improvements.
♻ ☆ Open-YOLO 3D: Towards Fast and Accurate Open-Vocabulary 3D Instance Segmentation
Recent works on open-vocabulary 3D instance segmentation show strong promise, but at the cost of slow inference speed and high computation requirements. This high computation cost is typically due to their heavy reliance on 3D clip features, which require computationally expensive 2D foundation models like Segment Anything (SAM) and CLIP for multi-view aggregation into 3D. As a consequence, this hampers their applicability in many real-world applications that require both fast and accurate predictions. To this end, we propose a fast yet accurate open-vocabulary 3D instance segmentation approach, named Open-YOLO 3D, that effectively leverages only 2D object detection from multi-view RGB images for open-vocabulary 3D instance segmentation. We address this task by generating class-agnostic 3D masks for objects in the scene and associating them with text prompts. We observe that the projection of class-agnostic 3D point cloud instances already holds instance information; thus, using SAM might only result in redundancy that unnecessarily increases the inference time. We empirically find that a better performance of matching text prompts to 3D masks can be achieved in a faster fashion with a 2D object detector. We validate our Open-YOLO 3D on two benchmarks, ScanNet200 and Replica, under two scenarios: (i) with ground truth masks, where labels are required for given object proposals, and (ii) with class-agnostic 3D proposals generated from a 3D proposal network. Our Open-YOLO 3D achieves state-of-the-art performance on both datasets while obtaining up to $\sim$16$\times$ speedup compared to the best existing method in literature. On ScanNet200 val. set, our Open-YOLO 3D achieves mean average precision (mAP) of 24.7\% while operating at 22 seconds per scene. Code and model are available at github.com/aminebdj/OpenYOLO3D.
♻ ☆ Emergent Dynamics in Neural Cellular Automata
Neural Cellular Automata (NCA) models are trainable variations of traditional Cellular Automata (CA). Emergent motion in the patterns created by NCA has been successfully applied to synthesize dynamic textures. However, the conditions required for an NCA to display dynamic patterns remain unexplored. Here, we investigate the relationship between the NCA architecture and the emergent dynamics of the trained models. Specifically, we vary the number of channels in the cell state and the number of hidden neurons in the MultiLayer Perceptron (MLP), and draw a relationship between the combination of these two variables and the motion strength between successive frames. Our analysis reveals that the disparity and proportionality between these two variables have a strong correlation with the emergent dynamics in the NCA output. We thus propose a design principle for creating dynamic NCA.
comment: 2 pages
♻ ☆ Exploring Transfer Learning in Medical Image Segmentation using Vision-Language Models
Medical image segmentation allows quantifying target structure size and shape, aiding in disease diagnosis, prognosis, surgery planning, and comprehension.Building upon recent advancements in foundation Vision-Language Models (VLMs) from natural image-text pairs, several studies have proposed adapting them to Vision-Language Segmentation Models (VLSMs) that allow using language text as an additional input to segmentation models. Introducing auxiliary information via text with human-in-the-loop prompting during inference opens up unique opportunities, such as open vocabulary segmentation and potentially more robust segmentation models against out-of-distribution data. Although transfer learning from natural to medical images has been explored for image-only segmentation models, the joint representation of vision-language in segmentation problems remains underexplored. This study introduces the first systematic study on transferring VLSMs to 2D medical images, using carefully curated $11$ datasets encompassing diverse modalities and insightful language prompts and experiments. Our findings demonstrate that although VLSMs show competitive performance compared to image-only models for segmentation after finetuning in limited medical image datasets, not all VLSMs utilize the additional information from language prompts, with image features playing a dominant role. While VLSMs exhibit enhanced performance in handling pooled datasets with diverse modalities and show potential robustness to domain shifts compared to conventional segmentation models, our results suggest that novel approaches are required to enable VLSMs to leverage the various auxiliary information available through language prompts. The code and datasets are available at https://github.com/naamiinepal/medvlsm.
comment: Medical Imaging with Deep Learning (MIDL) 2024 (Oral)
♻ ☆ FMARS: Annotating Remote Sensing Images for Disaster Management using Foundation Models
Very-High Resolution (VHR) remote sensing imagery is increasingly accessible, but often lacks annotations for effective machine learning applications. Recent foundation models like GroundingDINO and Segment Anything (SAM) provide opportunities to automatically generate annotations. This study introduces FMARS (Foundation Model Annotations in Remote Sensing), a methodology leveraging VHR imagery and foundation models for fast and robust annotation. We focus on disaster management and provide a large-scale dataset with labels obtained from pre-event imagery over 19 disaster events, derived from the Maxar Open Data initiative. We train segmentation models on the generated labels, using Unsupervised Domain Adaptation (UDA) techniques to increase transferability to real-world scenarios. Our results demonstrate the effectiveness of leveraging foundation models to automatically annotate remote sensing data at scale, enabling robust downstream models for critical applications. Code and dataset are available at \url{https://github.com/links-ads/igarss-fmars}.
comment: Accepted at IGARSS 2024, 5 pages. Revised and corrected version
♻ ☆ Unlocking the Potential of Early Epochs: Uncertainty-aware CT Metal Artifact Reduction
In computed tomography (CT), the presence of metallic implants in patients often leads to disruptive artifacts in the reconstructed images, hindering accurate diagnosis. Recently, a large amount of supervised deep learning-based approaches have been proposed for metal artifact reduction (MAR). However, these methods neglect the influence of initial training weights. In this paper, we have discovered that the uncertainty image computed from the restoration result of initial training weights can effectively highlight high-frequency regions, including metal artifacts. This observation can be leveraged to assist the MAR network in removing metal artifacts. Therefore, we propose an uncertainty constraint (UC) loss that utilizes the uncertainty image as an adaptive weight to guide the MAR network to focus on the metal artifact region, leading to improved restoration. The proposed UC loss is designed to be a plug-and-play method, compatible with any MAR framework, and easily adoptable. To validate the effectiveness of the UC loss, we conduct extensive experiments on the public available Deeplesion and CLINIC-metal dataset. Experimental results demonstrate that the UC loss further optimizes the network training process and significantly improves the removal of metal artifacts.
♻ ☆ SyncTweedies: A General Generative Framework Based on Synchronized Diffusions
We introduce a general framework for generating diverse visual content, including ambiguous images, panorama images, mesh textures, and Gaussian splat textures, by synchronizing multiple diffusion processes. We present exhaustive investigation into all possible scenarios for synchronizing multiple diffusion processes through a canonical space and analyze their characteristics across applications. In doing so, we reveal a previously unexplored case: averaging the outputs of Tweedie's formula while conducting denoising in multiple instance spaces. This case also provides the best quality with the widest applicability to downstream tasks. We name this case SyncTweedies. In our experiments generating visual content aforementioned, we demonstrate the superior quality of generation by SyncTweedies compared to other synchronization methods, optimization-based and iterative-update-based methods.
comment: Project page: https://synctweedies.github.io/
♻ ☆ Sim-to-Real Transfer via 3D Feature Fields for Vision-and-Language Navigation
Vision-and-language navigation (VLN) enables the agent to navigate to a remote location in 3D environments following the natural language instruction. In this field, the agent is usually trained and evaluated in the navigation simulators, lacking effective approaches for sim-to-real transfer. The VLN agents with only a monocular camera exhibit extremely limited performance, while the mainstream VLN models trained with panoramic observation, perform better but are difficult to deploy on most monocular robots. For this case, we propose a sim-to-real transfer approach to endow the monocular robots with panoramic traversability perception and panoramic semantic understanding, thus smoothly transferring the high-performance panoramic VLN models to the common monocular robots. In this work, the semantic traversable map is proposed to predict agent-centric navigable waypoints, and the novel view representations of these navigable waypoints are predicted through the 3D feature fields. These methods broaden the limited field of view of the monocular robots and significantly improve navigation performance in the real world. Our VLN system outperforms previous SOTA monocular VLN methods in R2R-CE and RxR-CE benchmarks within the simulation environments and is also validated in real-world environments, providing a practical and high-performance solution for real-world VLN.
comment: Submitted to CoRL 2024. The code is available at https://github.com/MrZihan/Sim2Real-VLN-3DFF
♻ ☆ MVGamba: Unify 3D Content Generation as State Space Sequence Modeling
Recent 3D large reconstruction models (LRMs) can generate high-quality 3D content in sub-seconds by integrating multi-view diffusion models with scalable multi-view reconstructors. Current works further leverage 3D Gaussian Splatting as 3D representation for improved visual quality and rendering efficiency. However, we observe that existing Gaussian reconstruction models often suffer from multi-view inconsistency and blurred textures. We attribute this to the compromise of multi-view information propagation in favor of adopting powerful yet computationally intensive architectures (e.g., Transformers). To address this issue, we introduce MVGamba, a general and lightweight Gaussian reconstruction model featuring a multi-view Gaussian reconstructor based on the RNN-like State Space Model (SSM). Our Gaussian reconstructor propagates causal context containing multi-view information for cross-view self-refinement while generating a long sequence of Gaussians for fine-detail modeling with linear complexity. With off-the-shelf multi-view diffusion models integrated, MVGamba unifies 3D generation tasks from a single image, sparse images, or text prompts. Extensive experiments demonstrate that MVGamba outperforms state-of-the-art baselines in all 3D content generation scenarios with approximately only $0.1\times$ of the model size.
♻ ☆ SSUMamba: Spatial-Spectral Selective State Space Model for Hyperspectral Image Denoising
Denoising is a crucial preprocessing step for hyperspectral images (HSIs) due to noise arising from intraimaging mechanisms and environmental factors. Long-range spatial-spectral correlation modeling is beneficial for HSI denoising but often comes with high computational complexity. Based on the state space model (SSM), Mamba is known for its remarkable long-range dependency modeling capabilities and computational efficiency. Building on this, we introduce a memory-efficient spatial-spectral UMamba (SSUMamba) for HSI denoising, with the spatial-spectral continuous scan (SSCS) Mamba being the core component. SSCS Mamba alternates the row, column, and band in six different orders to generate the sequence and uses the bidirectional SSM to exploit long-range spatial-spectral dependencies. In each order, the images are rearranged between adjacent scans to ensure spatial-spectral continuity. Additionally, 3D convolutions are embedded into the SSCS Mamba to enhance local spatial-spectral modeling. Experiments demonstrate that SSUMamba achieves superior denoising results with lower memory consumption per batch compared to transformer-based methods. The source code is available at https://github.com/lronkitty/SSUMamba.
♻ ☆ Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model
We introduce Xmodel-VLM, a cutting-edge multimodal vision language model. It is designed for efficient deployment on consumer GPU servers. Our work directly confronts a pivotal industry issue by grappling with the prohibitive service costs that hinder the broad adoption of large-scale multimodal systems. Through rigorous training, we have developed a 1B-scale language model from the ground up, employing the LLaVA paradigm for modal alignment. The result, which we call Xmodel-VLM, is a lightweight yet powerful multimodal vision language model. Extensive testing across numerous classic multimodal benchmarks has revealed that despite its smaller size and faster execution, Xmodel-VLM delivers performance comparable to that of larger models. Our model checkpoints and code are publicly available on GitHub at https://github.com/XiaoduoAILab/XmodelVLM.
♻ ☆ NeRF-Guided Unsupervised Learning of RGB-D Registration
This paper focuses on training a robust RGB-D registration model without ground-truth pose supervision. Existing methods usually adopt a pairwise training strategy based on differentiable rendering, which enforces the photometric and the geometric consistency between the two registered frames as supervision. However, this frame-to-frame framework suffers from poor multi-view consistency due to factors such as lighting changes, geometry occlusion and reflective materials. In this paper, we present NeRF-UR, a novel frame-to-model optimization framework for unsupervised RGB-D registration. Instead of frame-to-frame consistency, we leverage the neural radiance field (NeRF) as a global model of the scene and use the consistency between the input and the NeRF-rerendered frames for pose optimization. This design can significantly improve the robustness in scenarios with poor multi-view consistency and provides better learning signal for the registration model. Furthermore, to bootstrap the NeRF optimization, we create a synthetic dataset, Sim-RGBD, through a photo-realistic simulator to warm up the registration model. By first training the registration model on Sim-RGBD and later unsupervisedly fine-tuning on real data, our framework enables distilling the capability of feature extraction and registration from simulation to reality. Our method outperforms the state-of-the-art counterparts on two popular indoor RGB-D datasets, ScanNet and 3DMatch. Code and models will be released for paper reproduction.
♻ ☆ MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation ACL 2024
Embodied agents equipped with GPT as their brains have exhibited extraordinary decision-making and generalization abilities across various tasks. However, existing zero-shot agents for vision-and-language navigation (VLN) only prompt GPT-4 to select potential locations within localized environments, without constructing an effective "global-view" for the agent to understand the overall environment. In this work, we present a novel map-guided GPT-based agent, dubbed MapGPT, which introduces an online linguistic-formed map to encourage global exploration. Specifically, we build an online map and incorporate it into the prompts that include node information and topological relationships, to help GPT understand the spatial environment. Benefiting from this design, we further propose an adaptive planning mechanism to assist the agent in performing multi-step path planning based on a map, systematically exploring multiple candidate nodes or sub-goals step by step. Extensive experiments demonstrate that our MapGPT is applicable to both GPT-4 and GPT-4V, achieving state-of-the-art zero-shot performance on R2R and REVERIE simultaneously (~10% and ~12% improvements in SR), and showcasing the newly emergent global thinking and path planning abilities of the GPT.
comment: LLM/VLM-based VLN Agents. Accepted to ACL 2024. Project: https://chen-judge.github.io/MapGPT/
♻ ☆ mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs
Modular vision-language models (Vision-LLMs) align pretrained image encoders with (frozen) large language models (LLMs) and post-hoc condition LLMs to `understand' the image input. With the abundance of readily available high-quality English image-text data as well as strong monolingual English LLMs, the research focus has been on English-only Vision-LLMs. Multilingual vision-language models are still predominantly obtained via expensive end-to-end pretraining, resulting in comparatively smaller models, trained on limited multilingual image data supplemented with text-only multilingual corpora. We present mBLIP, the first Vision-LLM leveraging multilingual LLMs, which we obtain in a computationally efficient manner on consumer-level hardware. To this end, we \textit{re-align} an image encoder previously tuned to an English LLM to a new, multilingual LLM using only a few million multilingual training examples derived from a mix of vision-and-language tasks, which we obtain by machine-translating high-quality English data to 95 languages. On the IGLUE benchmark and XM3600, mBLIP yields results competitive with state-of-the-art models and it greatly outperforms strong English-only Vision-LLMs like Llava 1.5. We release our model, code, and train data at \url{https://github.com/gregor-ge/mBLIP}.
comment: ALVR Workshop 2024
♻ ☆ Modeling 3D Infant Kinetics Using Adaptive Graph Convolutional Networks
Reliable methods for the neurodevelopmental assessment of infants are essential for early detection of medical issues that may need prompt interventions. Spontaneous motor activity, or 'kinetics', is shown to provide a powerful surrogate measure of upcoming neurodevelopment. However, its assessment is by and large qualitative and subjective, focusing on visually identified, age-specific gestures. Here, we follow an alternative approach, predicting infants' neurodevelopmental maturation based on data-driven evaluation of individual motor patterns. We utilize 3D video recordings of infants processed with pose-estimation to extract spatio-temporal series of anatomical landmarks, and apply adaptive graph convolutional networks to predict the actual age. We show that our data-driven approach achieves improvement over traditional machine learning baselines based on manually engineered features.
comment: 10 pages, 3 figures. Code repository available via https://github.com/deinal/infant-aagcn
♻ ☆ LoRAPrune: Pruning Meets Low-Rank Parameter-Efficient Fine-Tuning
Large Language Models (LLMs), such as LLaMA and T5, have shown exceptional performance across various tasks through fine-tuning. Although low-rank adaption (LoRA) has emerged to cheaply fine-tune these LLMs on downstream tasks, their deployment is still hindered by the vast model scale and computational costs. Post-training model pruning offers a way to compress LLMs. However, the current pruning methods designed for LLMs are not compatible with LoRA. This is due to their utilization of unstructured pruning on LLMs, impeding the merging of LoRA weights, or their dependence on the gradients of pre-trained weights to guide pruning, which can impose significant memory overhead. To this end, we propose LoRAPrune, a new framework that delivers an accurate structured pruned model in a highly memory-efficient manner. Specifically, we first design a LoRA-guided pruning criterion, which uses the weights and gradients of LoRA, rather than the gradients of pre-trained weights for importance estimation. We subsequently integrate this criterion into an iterative pruning process, effectively removing redundant channels and heads. Extensive experimental results demonstrate the superior performance of our LoRAPrune over existing approaches on the LLaMA series models. At a 50\% compression rate, LoRAPrune demonstrates superior performance over LLM-Pruner, achieving a reduction in perplexity by 4.81 on WikiText2 and 3.46 on PTB, while also decreasing memory usage by 52.6%. Besides, LoRAPrune also matches semi-structural pruning across multiple LLMs, proving its wide applicability. The code is available at https://github.com/aim-uofa/LoRAPrune.
comment: accepted by acl 2024 findings
♻ ☆ TutteNet: Injective 3D Deformations by Composition of 2D Mesh Deformations
This work proposes a novel representation of injective deformations of 3D space, which overcomes existing limitations of injective methods: inaccuracy, lack of robustness, and incompatibility with general learning and optimization frameworks. The core idea is to reduce the problem to a deep composition of multiple 2D mesh-based piecewise-linear maps. Namely, we build differentiable layers that produce mesh deformations through Tutte's embedding (guaranteed to be injective in 2D), and compose these layers over different planes to create complex 3D injective deformations of the 3D volume. We show our method provides the ability to efficiently and accurately optimize and learn complex deformations, outperforming other injective approaches. As a main application, we produce complex and artifact-free NeRF and SDF deformations.
♻ ☆ Anonymization Prompt Learning for Facial Privacy-Preserving Text-to-Image Generation
Text-to-image diffusion models, such as Stable Diffusion, generate highly realistic images from text descriptions. However, the generation of certain content at such high quality raises concerns. A prominent issue is the accurate depiction of identifiable facial images, which could lead to malicious deepfake generation and privacy violations. In this paper, we propose Anonymization Prompt Learning (APL) to address this problem. Specifically, we train a learnable prompt prefix for text-to-image diffusion models, which forces the model to generate anonymized facial identities, even when prompted to produce images of specific individuals. Extensive quantitative and qualitative experiments demonstrate the successful anonymization performance of APL, which anonymizes any specific individuals without compromising the quality of non-identity-specific image generation. Furthermore, we reveal the plug-and-play property of the learned prompt prefix, enabling its effective application across different pretrained text-to-image models for transferrable privacy and security protection against the risks of deepfakes.
comment: 15 pages, 8 figures and 5 tables
♻ ☆ EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens ICML 2024
Masked Video Autoencoder (MVA) approaches have demonstrated their potential by significantly outperforming previous video representation learning methods. However, they waste an excessive amount of computations and memory in predicting uninformative tokens/frames due to random masking strategies. (e.g., over 16 nodes with 128 NVIDIA A100 GPUs). To resolve this issue, we exploit the unequal information density among the patches in videos and propose EVEREST, a surprisingly efficient MVA approach for video representation learning that finds tokens containing rich motion features and discards uninformative ones during both pre-training and fine-tuning. We further present an information-intensive frame selection strategy that allows the model to focus on informative and causal frames with minimal redundancy. Our method significantly reduces the computation and memory requirements of MVA, enabling the pre-training and fine-tuning on a single machine with 8 GPUs while achieving comparable performance to computation- and memory-heavy baselines on multiple benchmarks and the uncurated Ego4D dataset. We hope that our work contributes to reducing the barrier to further research on video understanding.
comment: Accepted by ICML 2024
♻ ☆ Online,Target-Free LiDAR-Camera Extrinsic Calibration via Cross-Modal Mask Matching
LiDAR-camera extrinsic calibration (LCEC) is crucial for data fusion in intelligent vehicles. Offline, target-based approaches have long been the preferred choice in this field. However, they often demonstrate poor adaptability to real-world environments. This is largely because extrinsic parameters may change significantly due to moderate shocks or during extended operations in environments with vibrations. In contrast, online, target-free approaches provide greater adaptability yet typically lack robustness, primarily due to the challenges in cross-modal feature matching. Therefore, in this article, we unleash the full potential of large vision models (LVMs), which are emerging as a significant trend in the fields of computer vision and robotics, especially for embodied artificial intelligence, to achieve robust and accurate online, target-free LCEC across a variety of challenging scenarios. Our main contributions are threefold: we introduce a novel framework known as MIAS-LCEC, provide an open-source versatile calibration toolbox with an interactive visualization interface, and publish three real-world datasets captured from various indoor and outdoor environments. The cornerstone of our framework and toolbox is the cross-modal mask matching (C3M) algorithm, developed based on a state-of-the-art (SoTA) LVM and capable of generating sufficient and reliable matches. Extensive experiments conducted on these real-world datasets demonstrate the robustness of our approach and its superior performance compared to SoTA methods, particularly for the solid-state LiDARs with super-wide fields of view.
comment: accepted to IEEE Trans. on Intelligent Vehicles (T-IV)
♻ ☆ Position-Guided Prompt Learning for Anomaly Detection in Chest X-Rays MICCAI 2024
Anomaly detection in chest X-rays is a critical task. Most methods mainly model the distribution of normal images, and then regard significant deviation from normal distribution as anomaly. Recently, CLIP-based methods, pre-trained on a large number of medical images, have shown impressive performance on zero/few-shot downstream tasks. In this paper, we aim to explore the potential of CLIP-based methods for anomaly detection in chest X-rays. Considering the discrepancy between the CLIP pre-training data and the task-specific data, we propose a position-guided prompt learning method. Specifically, inspired by the fact that experts diagnose chest X-rays by carefully examining distinct lung regions, we propose learnable position-guided text and image prompts to adapt the task data to the frozen pre-trained CLIP-based model. To enhance the model's discriminative capability, we propose a novel structure-preserving anomaly synthesis method within chest x-rays during the training process. Extensive experiments on three datasets demonstrate that our proposed method outperforms some state-of-the-art methods. The code of our implementation is available at https://github.com/sunzc-sunny/PPAD.
comment: MICCAI 2024 Early Accept
♻ ☆ AdaMesh: Personalized Facial Expressions and Head Poses for Adaptive Speech-Driven 3D Facial Animation
Speech-driven 3D facial animation aims at generating facial movements that are synchronized with the driving speech, which has been widely explored recently. Existing works mostly neglect the person-specific talking style in generation, including facial expression and head pose styles. Several works intend to capture the personalities by fine-tuning modules. However, limited training data leads to the lack of vividness. In this work, we propose AdaMesh, a novel adaptive speech-driven facial animation approach, which learns the personalized talking style from a reference video of about 10 seconds and generates vivid facial expressions and head poses. Specifically, we propose mixture-of-low-rank adaptation (MoLoRA) to fine-tune the expression adapter, which efficiently captures the facial expression style. For the personalized pose style, we propose a pose adapter by building a discrete pose prior and retrieving the appropriate style embedding with a semantic-aware pose style matrix without fine-tuning. Extensive experimental results show that our approach outperforms state-of-the-art methods, preserves the talking style in the reference video, and generates vivid facial animation. The supplementary video and code will be available at https://adamesh.github.io.
comment: Project Page: https://adamesh.github.io
♻ ☆ Qualitative Failures of Image Generation Models and Their Application in Detecting Deepfakes
The ability of image and video generation models to create photorealistic images has reached unprecedented heights, making it difficult to distinguish between real and fake images in many cases. However, despite this progress, a gap remains between the quality of generated images and those found in the real world. To address this, we have reviewed a vast body of literature from both academic publications and social media to identify qualitative shortcomings in image generation models, which we have classified into five categories. By understanding these failures, we can identify areas where these models need improvement, as well as develop strategies for detecting deep fakes. The prevalence of deep fakes in today's society is a serious concern, and our findings can help mitigate their negative impact.
♻ ☆ Determinantal Point Process as an alternative to NMS BMVC 2020
We present a determinantal point process (DPP) inspired alternative to non-maximum suppression (NMS) which has become an integral step in all state-of-the-art object detection frameworks. DPPs have been shown to encourage diversity in subset selection problems. We pose NMS as a subset selection problem and posit that directly incorporating DPP like framework can improve the overall performance of the object detection system. We propose an optimization problem which takes the same inputs as NMS, but introduces a novel sub-modularity based diverse subset selection functional. Our results strongly indicate that the modifications proposed in this paper can provide consistent improvements to state-of-the-art object detection pipelines.
comment: Published in BMVC 2020
♻ ☆ AdaNCA: Neural Cellular Automata As Adaptors For More Robust Vision Transformer
Vision Transformers (ViTs) have demonstrated remarkable performance in image classification tasks, particularly when equipped with local information via region attention or convolutions. While such architectures improve the feature aggregation from different granularities, they often fail to contribute to the robustness of the networks. Neural Cellular Automata (NCA) enables the modeling of global cell representations through local interactions, with its training strategies and architecture design conferring strong generalization ability and robustness against noisy inputs. In this paper, we propose Adaptor Neural Cellular Automata (AdaNCA) for Vision Transformer that uses NCA as plug-in-play adaptors between ViT layers, enhancing ViT's performance and robustness against adversarial samples as well as out-of-distribution inputs. To overcome the large computational overhead of standard NCAs, we propose Dynamic Interaction for more efficient interaction learning. Furthermore, we develop an algorithm for identifying the most effective insertion points for AdaNCA based on our analysis of AdaNCA placement and robustness improvement. With less than a 3% increase in parameters, AdaNCA contributes to more than 10% absolute improvement in accuracy under adversarial attacks on the ImageNet1K benchmark. Moreover, we demonstrate with extensive evaluations across 8 robustness benchmarks and 4 ViT architectures that AdaNCA, as a plug-in-play module, consistently improves the robustness of ViTs.
comment: 26 pages, 11 figures
♻ ☆ AcTExplore: Active Tactile Exploration of Unknown Objects ICRA 2024
Tactile exploration plays a crucial role in understanding object structures for fundamental robotics tasks such as grasping and manipulation. However, efficiently exploring such objects using tactile sensors is challenging, primarily due to the large-scale unknown environments and limited sensing coverage of these sensors. To this end, we present AcTExplore, an active tactile exploration method driven by reinforcement learning for object reconstruction at scales that automatically explores the object surfaces in a limited number of steps. Through sufficient exploration, our algorithm incrementally collects tactile data and reconstructs 3D shapes of the objects as well, which can serve as a representation for higher-level downstream tasks. Our method achieves an average of 95.97% IoU coverage on unseen YCB objects while just being trained on primitive shapes. Project Webpage: https://prg.cs.umd.edu/AcTExplore
comment: 8 pages, 6 figures, Accepted to ICRA 2024
♻ ☆ Residual Connections Harm Abstract Feature Learning in Masked Autoencoders
We demonstrate that adding a weighting factor to decay the strength of identity shortcuts within residual networks substantially improves semantic feature learning in the state-of-the-art self-supervised masked autoencoding (MAE) paradigm. Our modification to the identity shortcuts within a VIT-B/16 backbone of an MAE boosts linear probing accuracy on ImageNet from 67.8% to 72.7%. This significant gap suggests that, while residual connection structure serves an essential role in facilitating gradient propagation, it may have a harmful side effect of reducing capacity for abstract learning by virtue of injecting an echo of shallower representations into deeper layers. We ameliorate this downside via a fixed formula for monotonically decreasing the contribution of identity connections as layer depth increases. Our design promotes the gradual development of feature abstractions, without impacting network trainability. Analyzing the representations learned by our modified residual networks, we find correlation between low effective feature rank and downstream task performance.
♻ ☆ Unsupervised Multimodal Deepfake Detection Using Intra- and Cross-Modal Inconsistencies
Deepfake videos present an increasing threat to society with potentially negative impact on criminal justice, democracy, and personal safety and privacy. Meanwhile, detecting deepfakes, at scale, remains a very challenging task that often requires labeled training data from existing deepfake generation methods. Further, even the most accurate supervised deepfake detection methods do not generalize to deepfakes generated using new generation methods. In this paper, we propose a novel unsupervised method for detecting deepfake videos by directly identifying intra-modal and cross-modal inconsistency between video segments. The fundamental hypothesis behind the proposed detection method is that motion or identity inconsistencies are inevitable in deepfake videos. We will mathematically and empirically support this hypothesis, and then proceed to constructing our method grounded in our theoretical analysis. Our proposed method outperforms prior state-of-the-art unsupervised deepfake detection methods on the challenging FakeAVCeleb dataset, and also has several additional advantages: it is scalable because it does not require pristine (real) samples for each identity during inference and therefore can apply to arbitrarily many identities, generalizable because it is trained only on real videos and therefore does not rely on a particular deepfake method, reliable because it does not rely on any likelihood estimation in high dimensions, and explainable because it can pinpoint the exact location of modality inconsistencies which are then verifiable by a human expert.
comment: 11 pages, 3 figures, 3 tables
♻ ☆ YouTube SFV+HDR Quality Dataset
The popularity of Short form videos (SFV) has grown dramatically in the past few years, and has become a phenomenal video category with billions of viewers. Meanwhile, High Dynamic Range (HDR) as an advanced feature also becomes more and more popular on video sharing platforms. As a hot topic with huge impact, SFV and HDR bring new questions to video quality research: 1) is SFV+HDR quality assessment significantly different from traditional User Generated Content (UGC) quality assessment? 2) do objective quality metrics designed for traditional UGC still work well for SFV+HDR? To answer the above questions, we created the first large scale SFV+HDR dataset with reliable subjective quality scores, covering 10 popular content categories. Further, we also introduce a general sampling framework to maximize the representativeness of the dataset. We provided a comprehensive analysis of subjective quality scores for Short form SDR and HDR videos, and discuss the reliability of state-of-the-art UGC quality metrics and potential improvements.
comment: Accepted by 2024 IEEE International Conference on Image Processing Dataset link: https://media.withyoutube.com/sfv-hdr
♻ ☆ Major TOM: Expandable Datasets for Earth Observation
Deep learning models are increasingly data-hungry, requiring significant resources to collect and compile the datasets needed to train them, with Earth Observation (EO) models being no exception. However, the landscape of datasets in EO is relatively atomised, with interoperability made difficult by diverse formats and data structures. If ever larger datasets are to be built, and duplication of effort minimised, then a shared framework that allows users to combine and access multiple datasets is needed. Here, Major TOM (Terrestrial Observation Metaset) is proposed as this extensible framework. Primarily, it consists of a geographical indexing system based on a set of grid points and a metadata structure that allows multiple datasets with different sources to be merged. Besides the specification of Major TOM as a framework, this work also presents a large, open-access dataset, MajorTOM-Core, which covers the vast majority of the Earth's land surface. This dataset provides the community with both an immediately useful resource, as well as acting as a template for future additions to the Major TOM ecosystem. Access: https://huggingface.co/Major-TOM
comment: To be presented at IGARSS 2024
Computation and Language 183
☆ Model Merging and Safety Alignment: One Bad Model Spoils the Bunch
Merging Large Language Models (LLMs) is a cost-effective technique for combining multiple expert LLMs into a single versatile model, retaining the expertise of the original ones. However, current approaches often overlook the importance of safety alignment during merging, leading to highly misaligned models. This work investigates the effects of model merging on alignment. We evaluate several popular model merging techniques, demonstrating that existing methods do not only transfer domain expertise but also propagate misalignment. We propose a simple two-step approach to address this problem: (i) generating synthetic safety and domain-specific data, and (ii) incorporating these generated data into the optimization process of existing data-aware model merging techniques. This allows us to treat alignment as a skill that can be maximized in the resulting merged LLM. Our experiments illustrate the effectiveness of integrating alignment-related data during merging, resulting in models that excel in both domain expertise and alignment.
comment: Under review
☆ Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities
When presented with questions involving visual thinking, humans naturally switch reasoning modalities, often forming mental images or drawing visual aids. Large language models have shown promising results in arithmetic and symbolic reasoning by expressing intermediate reasoning in text as a chain of thought, yet struggle to extend this capability to answer text queries that are easily solved by visual reasoning, even with extensive multimodal pretraining. We introduce a simple method, whiteboard-of-thought prompting, to unlock the visual reasoning capabilities of multimodal large language models across modalities. Whiteboard-of-thought prompting provides multimodal large language models with a metaphorical `whiteboard' to draw out reasoning steps as images, then returns these images back to the model for further processing. We find this can be accomplished with no demonstrations or specialized modules, instead leveraging models' existing ability to write code with libraries such as Matplotlib and Turtle. This simple approach shows state-of-the-art results on four difficult natural language tasks that involve visual and spatial reasoning. We identify multiple settings where GPT-4o using chain-of-thought fails dramatically, including more than one where it achieves $0\%$ accuracy, while whiteboard-of-thought enables up to $92\%$ accuracy in these same settings. We present a detailed exploration of where the technique succeeds as well as its sources of error.
comment: Project website: whiteboard.cs.columbia.edu/
☆ How to Compute the Probability of a Word
Language models (LMs) estimate the probability distribution over sequences of natural language; these distributions are crucial for computing perplexity and surprisal in linguistics research. While we are usually concerned with measuring these values for words, most LMs operate over subwords. Despite seemingly straightforward, accurately computing probabilities over one unit given probabilities over the other requires care. Indeed, we show here that many recent linguistic studies have been incorrectly computing these values. This paper derives the correct methods for computing word probabilities, highlighting issues when relying on language models that use beginning-of-word (bow)-marking tokenisers, e.g., the GPT family. Empirically, we show that correcting the widespread bug in probability computations affects measured outcomes in sentence comprehension and lexical optimisation analyses.
☆ xCOMET-lite: Bridging the Gap Between Efficiency and Quality in Learned MT Evaluation Metrics
State-of-the-art trainable machine translation evaluation metrics like xCOMET achieve high correlation with human judgment but rely on large encoders (up to 10.7B parameters), making them computationally expensive and inaccessible to researchers with limited resources. To address this issue, we investigate whether the knowledge stored in these large encoders can be compressed while maintaining quality. We employ distillation, quantization, and pruning techniques to create efficient xCOMET alternatives and introduce a novel data collection pipeline for efficient black-box distillation. Our experiments show that, using quantization, xCOMET can be compressed up to three times with no quality degradation. Additionally, through distillation, we create an xCOMET-lite metric, which has only 2.6% of xCOMET-XXL parameters, but retains 92.1% of its quality. Besides, it surpasses strong small-scale metrics like COMET-22 and BLEURT-20 on the WMT22 metrics challenge dataset by 6.4%, despite using 50% fewer parameters. All code, dataset, and models are available online.
☆ GraphReader: Building Graph-based Agent to Enhance Long-Context Abilities of Large Language Models
Long-context capabilities are essential for large language models (LLMs) to tackle complex and long-input tasks. Despite numerous efforts made to optimize LLMs for long contexts, challenges persist in robustly processing long inputs. In this paper, we introduce GraphReader, a graph-based agent system designed to handle long texts by structuring them into a graph and employing an agent to explore this graph autonomously. Upon receiving a question, the agent first undertakes a step-by-step analysis and devises a rational plan. It then invokes a set of predefined functions to read node content and neighbors, facilitating a coarse-to-fine exploration of the graph. Throughout the exploration, the agent continuously records new insights and reflects on current circumstances to optimize the process until it has gathered sufficient information to generate an answer. Experimental results on the LV-Eval dataset reveal that GraphReader, using a 4k context window, consistently outperforms GPT-4-128k across context lengths from 16k to 256k by a large margin. Additionally, our approach demonstrates superior performance on four challenging single-hop and multi-hop benchmarks.
comment: The first four authors contributed equally, 27 pages
☆ Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data
One way to address safety risks from large language models (LLMs) is to censor dangerous knowledge from their training data. While this removes the explicit information, implicit information can remain scattered across various training documents. Could an LLM infer the censored knowledge by piecing together these implicit hints? As a step towards answering this question, we study inductive out-of-context reasoning (OOCR), a type of generalization in which LLMs infer latent information from evidence distributed across training documents and apply it to downstream tasks without in-context learning. Using a suite of five tasks, we demonstrate that frontier LLMs can perform inductive OOCR. In one experiment we finetune an LLM on a corpus consisting only of distances between an unknown city and other known cities. Remarkably, without in-context examples or Chain of Thought, the LLM can verbalize that the unknown city is Paris and use this fact to answer downstream questions. Further experiments show that LLMs trained only on individual coin flip outcomes can verbalize whether the coin is biased, and those trained only on pairs $(x,f(x))$ can articulate a definition of $f$ and compute inverses. While OOCR succeeds in a range of cases, we also show that it is unreliable, particularly for smaller LLMs learning complex structures. Overall, the ability of LLMs to "connect the dots" without explicit in-context learning poses a potential obstacle to monitoring and controlling the knowledge acquired by LLMs.
☆ Unmasking Database Vulnerabilities: Zero-Knowledge Schema Inference Attacks in Text-to-SQL Systems
Relational databases are integral to modern information systems, serving as the foundation for storing, querying, and managing data efficiently and effectively. Advancements in large language modeling have led to the emergence of text-to-SQL technologies, significantly enhancing the querying and extracting of information from these databases and raising concerns about privacy and security. Our research extracts the database schema elements underlying a text-to-SQL model. Knowledge of the schema can make attacks such as SQL injection easier. By asking specially crafted questions, we have developed a zero-knowledge framework designed to probe various database schema elements without knowledge of the database itself. The text-to-SQL models then process these questions to produce an output that we use to uncover the structure of the database schema. We apply it to specialized text-to-SQL models fine-tuned on text-SQL pairs and generative language models used for SQL generation. Overall, we can reconstruct the table names with an F1 of nearly .75 for fine-tuned models and .96 for generative.
☆ Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs
Vision Language Models (VLMs) demonstrate remarkable proficiency in addressing a wide array of visual questions, which requires strong perception and reasoning faculties. Assessing these two competencies independently is crucial for model refinement, despite the inherent difficulty due to the intertwined nature of seeing and reasoning in existing VLMs. To tackle this issue, we present Prism, an innovative framework designed to disentangle the perception and reasoning processes involved in visual question solving. Prism comprises two distinct stages: a perception stage that utilizes a VLM to extract and articulate visual information in textual form, and a reasoning stage that formulates responses based on the extracted visual information using a Large Language Model (LLM). This modular design enables the systematic comparison and assessment of both proprietary and open-source VLM for their perception and reasoning strengths. Our analytical framework provides several valuable insights, underscoring Prism's potential as a cost-effective solution for vision-language tasks. By combining a streamlined VLM focused on perception with a powerful LLM tailored for reasoning, Prism achieves superior results in general vision-language tasks while substantially cutting down on training and operational expenses. Quantitative evaluations show that Prism, when configured with a vanilla 2B LLaVA and freely accessible GPT-3.5, delivers performance on par with VLMs $10 \times$ larger on the rigorous multimodal benchmark MMStar. The project is released at: https://github.com/SparksJoe/Prism.
☆ RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold
Training on model-generated synthetic data is a promising approach for finetuning LLMs, but it remains unclear when it helps or hurts. In this paper, we investigate this question for math reasoning via an empirical study, followed by building a conceptual understanding of our observations. First, we find that while the typical approach of finetuning a model on synthetic correct or positive problem-solution pairs generated by capable models offers modest performance gains, sampling more correct solutions from the finetuned learner itself followed by subsequent fine-tuning on this self-generated data $\textbf{doubles}$ the efficiency of the same synthetic problems. At the same time, training on model-generated positives can amplify various spurious correlations, resulting in flat or even inverse scaling trends as the amount of data increases. Surprisingly, we find that several of these issues can be addressed if we also utilize negative responses, i.e., model-generated responses that are deemed incorrect by a final answer verifier. Crucially, these negatives must be constructed such that the training can appropriately recover the utility or advantage of each intermediate step in the negative response. With this per-step scheme, we are able to attain consistent gains over only positive data, attaining performance similar to amplifying the amount of synthetic data by $\mathbf{8 \times}$. We show that training on per-step negatives can help to unlearn spurious correlations in the positive data, and is equivalent to advantage-weighted reinforcement learning (RL), implying that it inherits robustness benefits of RL over imitating positive data alone.
☆ PostMark: A Robust Blackbox Watermark for Large Language Models
The most effective techniques to detect LLM-generated text rely on inserting a detectable signature -- or watermark -- during the model's decoding process. Most existing watermarking methods require access to the underlying LLM's logits, which LLM API providers are loath to share due to fears of model distillation. As such, these watermarks must be implemented independently by each LLM provider. In this paper, we develop PostMark, a modular post-hoc watermarking procedure in which an input-dependent set of words (determined via a semantic embedding) is inserted into the text after the decoding process has completed. Critically, PostMark does not require logit access, which means it can be implemented by a third party. We also show that PostMark is more robust to paraphrasing attacks than existing watermarking methods: our experiments cover eight baseline algorithms, five base LLMs, and three datasets. Finally, we evaluate the impact of PostMark on text quality using both automated and human assessments, highlighting the trade-off between quality and robustness to paraphrasing. We release our code, outputs, and annotations at https://github.com/lilakk/PostMark.
comment: preprint; 18 pages, 5 figures
☆ Investigating Mysteries of CoT-Augmented Distillation
Eliciting "chain of thought" (CoT) rationales -- sequences of token that convey a "reasoning" process -- has been shown to consistently improve LLM performance on tasks like question answering. More recent efforts have shown that such rationales can also be used for model distillation: Including CoT sequences (elicited from a large "teacher" model) in addition to target labels when fine-tuning a small student model yields (often substantial) improvements. In this work we ask: Why and how does this additional training signal help in model distillation? We perform ablations to interrogate this, and report some potentially surprising results. Specifically: (1) Placing CoT sequences after labels (rather than before) realizes consistently better downstream performance -- this means that no student "reasoning" is necessary at test time to realize gains. (2) When rationales are appended in this way, they need not be coherent reasoning sequences to yield improvements; performance increases are robust to permutations of CoT tokens, for example. In fact, (3) a small number of key tokens are sufficient to achieve improvements equivalent to those observed when full rationales are used in model distillation.
comment: Draft; under review
☆ Evidence of a log scaling law for political persuasion with large language models
Large language models can now generate political messages as persuasive as those written by humans, raising concerns about how far this persuasiveness may continue to increase with model size. Here, we generate 720 persuasive messages on 10 U.S. political issues from 24 language models spanning several orders of magnitude in size. We then deploy these messages in a large-scale randomized survey experiment (N = 25,982) to estimate the persuasive capability of each model. Our findings are twofold. First, we find evidence of a log scaling law: model persuasiveness is characterized by sharply diminishing returns, such that current frontier models are barely more persuasive than models smaller in size by an order of magnitude or more. Second, mere task completion (coherence, staying on topic) appears to account for larger models' persuasive advantage. These findings suggest that further scaling model size will not much increase the persuasiveness of static LLM-generated messages.
comment: 16 pages, 4 figures
☆ Translating Across Cultures: LLMs for Intralingual Cultural Adaptation
LLMs are increasingly being deployed for multilingual applications and have demonstrated impressive translation capabilities between several low and high resource languages. An aspect of translation that often gets overlooked is that of cultural adaptation, or modifying source culture references to suit the target culture. Cultural adaptation has applications across several creative industries and requires intimate knowledge of source and target cultures during translation. While specialized translation models still outperform LLMs on the machine translation task when viewed from the lens of correctness, they are not sensitive to cultural differences often requiring manual correction. LLMs on the other hand have a rich reservoir of cultural knowledge embedded within its parameters that can be potentially exploited for such applications. In this paper we define the task of cultural adaptation and create an evaluation framework to benchmark different models for this task. We evaluate the performance of modern LLMs for cultural adaptation and analyze their cross cultural knowledge while connecting related concepts across different cultures. We also analyze possible issues with automatic adaptation including cultural biases and stereotypes. We hope that this task will offer more insight into the cultural understanding of LLMs and their creativity in cross-cultural scenarios.
Overview of the CAIL 2023 Argument Mining Track
We give a detailed overview of the CAIL 2023 Argument Mining Track, one of the Chinese AI and Law Challenge (CAIL) 2023 tracks. The main goal of the track is to identify and extract interacting argument pairs in trial dialogs. It mainly uses summarized judgment documents but can also refer to trial recordings. The track consists of two stages, and we introduce the tasks designed for each stage; we also extend the data from previous events into a new dataset -- CAIL2023-ArgMine -- with annotated new cases from various causes of action. We outline several submissions that achieve the best results, including their methods for different stages. While all submissions rely on language models, they have incorporated strategies that may benefit future work in this field.
☆ Improving Expert Radiology Report Summarization by Prompting Large Language Models with a Layperson Summary
Radiology report summarization (RRS) is crucial for patient care, requiring concise "Impressions" from detailed "Findings." This paper introduces a novel prompting strategy to enhance RRS by first generating a layperson summary. This approach normalizes key observations and simplifies complex information using non-expert communication techniques inspired by doctor-patient interactions. Combined with few-shot in-context learning, this method improves the model's ability to link general terms to specific findings. We evaluate this approach on the MIMIC-CXR, CheXpert, and MIMIC-III datasets, benchmarking it against 7B/8B parameter state-of-the-art open-source large language models (LLMs) like Meta-Llama-3-8B-Instruct. Our results demonstrate improvements in summarization accuracy and accessibility, particularly in out-of-domain tests, with improvements as high as 5% for some metrics.
☆ LLaSA: Large Multimodal Agent for Human Activity Analysis Through Wearable Sensors EMNLP 2024
Integrating inertial measurement units (IMUs) with large language models (LLMs) advances multimodal AI by enhancing human activity understanding. We introduce SensorCaps, a dataset of 26,288 IMU-derived activity narrations, and OpenSQA, an instruction-following dataset with 257,562 question-answer pairs. Combining LIMU-BERT and Llama, we develop LLaSA, a Large Multimodal Agent capable of interpreting and responding to activity and motion analysis queries. Our evaluation demonstrates LLaSA's effectiveness in activity classification and question answering, highlighting its potential in healthcare, sports science, and human-computer interaction. These contributions advance sensor-aware language models and open new research avenues. Our code repository and datasets can be found on https://github.com/BASHLab/LLaSA.
comment: Under review at ARR (for EMNLP 2024)
☆ CodeRAG-Bench: Can Retrieval Augment Code Generation?
While language models (LMs) have proven remarkably adept at generating code, many programs are challenging for LMs to generate using their parametric knowledge alone. Providing external contexts such as library documentation can facilitate generating accurate and functional code. Despite the success of retrieval-augmented generation (RAG) in various text-oriented tasks, its potential for improving code generation remains under-explored. In this work, we conduct a systematic, large-scale analysis by asking: in what scenarios can retrieval benefit code generation models? and what challenges remain? We first curate a comprehensive evaluation benchmark, CodeRAG-Bench, encompassing three categories of code generation tasks, including basic programming, open-domain, and repository-level problems. We aggregate documents from five sources for models to retrieve contexts: competition solutions, online tutorials, library documentation, StackOverflow posts, and GitHub repositories. We examine top-performing models on CodeRAG-Bench by providing contexts retrieved from one or multiple sources. While notable gains are made in final code generation by retrieving high-quality contexts across various settings, our analysis reveals room for improvement -- current retrievers still struggle to fetch useful contexts especially with limited lexical overlap, and generators fail to improve with limited context lengths or abilities to integrate additional contexts. We hope CodeRAG-Bench serves as an effective testbed to encourage further development of advanced code-oriented RAG methods.
☆ African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification
Recent Large Vision-Language Models (LVLMs) demonstrate impressive abilities on numerous image understanding and reasoning tasks. The task of fine-grained object classification (e.g., distinction between \textit{animal species}), however, has been probed insufficiently, despite its downstream importance. We fill this evaluation gap by creating \texttt{FOCI} (\textbf{F}ine-grained \textbf{O}bject \textbf{C}lass\textbf{I}fication), a difficult multiple-choice benchmark for fine-grained object classification, from existing object classification datasets: (1) multiple-choice avoids ambiguous answers associated with casting classification as open-ended QA task; (2) we retain classification difficulty by mining negative labels with a CLIP model. \texttt{FOCI}\xspace complements five popular classification datasets with four domain-specific subsets from ImageNet-21k. We benchmark 12 public LVLMs on \texttt{FOCI} and show that it tests for a \textit{complementary skill} to established image understanding and reasoning benchmarks. Crucially, CLIP models exhibit dramatically better performance than LVLMs. Since the image encoders of LVLMs come from these CLIP models, this points to inadequate alignment for fine-grained object distinction between the encoder and the LLM and warrants (pre)training data with more fine-grained annotation. We release our code at \url{https://github.com/gregor-ge/FOCI-Benchmark}.
☆ Does Object Grounding Really Reduce Hallucination of Large Vision-Language Models?
Large vision-language models (LVLMs) have recently dramatically pushed the state of the art in image captioning and many image understanding tasks (e.g., visual question answering). LVLMs, however, often \textit{hallucinate} and produce captions that mention concepts that cannot be found in the image. These hallucinations erode the trustworthiness of LVLMs and are arguably among the main obstacles to their ubiquitous adoption. Recent work suggests that addition of grounding objectives -- those that explicitly align image regions or objects to text spans -- reduces the amount of LVLM hallucination. Although intuitive, this claim is not empirically justified as the reduction effects have been established, we argue, with flawed evaluation protocols that (i) rely on data (i.e., MSCOCO) that has been extensively used in LVLM training and (ii) measure hallucination via question answering rather than open-ended caption generation. In this work, in contrast, we offer the first systematic analysis of the effect of fine-grained object grounding on LVLM hallucination under an evaluation protocol that more realistically captures LVLM hallucination in open generation. Our extensive experiments over three backbone LLMs reveal that grounding objectives have little to no effect on object hallucination in open caption generation.
☆ Instruction Pre-Training: Language Models are Supervised Multitask Learners
Unsupervised multitask pre-training has been the critical method behind the recent success of language models (LMs). However, supervised multitask learning still holds significant promise, as scaling it in the post-training stage trends towards better generalization. In this paper, we explore supervised multitask pre-training by proposing Instruction Pre-Training, a framework that scalably augments massive raw corpora with instruction-response pairs to pre-train LMs. The instruction-response pairs are generated by an efficient instruction synthesizer built on open-source models. In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of Instruction Pre-Training. In pre-training from scratch, Instruction Pre-Training not only consistently enhances pre-trained base models but also benefits more from further instruction tuning. In continual pre-training, Instruction Pre-Training enables Llama3-8B to be comparable to or even outperform Llama3-70B. Our model, code, and data are available at https://github.com/microsoft/LMOps.
☆ On Layer-wise Representation Similarity: Application for Multi-Exit Models with a Single Classifier
Analyzing the similarity of internal representations within and across different models has been an important technique for understanding the behavior of deep neural networks. Most existing methods for analyzing the similarity between representations of high dimensions, such as those based on Canonical Correlation Analysis (CCA) and widely used Centered Kernel Alignment (CKA), rely on statistical properties of the representations for a set of data points. In this paper, we focus on transformer models and study the similarity of representations between the hidden layers of individual transformers. In this context, we show that a simple sample-wise cosine similarity metric is capable of capturing the similarity and aligns with the complicated CKA. Our experimental results on common transformers reveal that representations across layers are positively correlated, albeit the similarity decreases when layers are far apart. We then propose an aligned training approach to enhance the similarity between internal representations, with trained models that enjoy the following properties: (1) the last-layer classifier can be directly applied right after any hidden layers, yielding intermediate layer accuracies much higher than those under standard training, (2) the layer-wise accuracies monotonically increase and reveal the minimal depth needed for the given task, (3) when served as multi-exit models, they achieve on-par performance with standard multi-exit architectures which consist of additional classifiers designed for early exiting in shallow layers. To our knowledge, our work is the first to show that one common classifier is sufficient for multi-exit models. We conduct experiments on both vision and NLP tasks to demonstrate the performance of the proposed aligned training.
☆ Data-Centric AI in the Age of Large Language Models
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs). We start by making the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs, and yet it receives disproportionally low attention from the research community. We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization. In each scenario, we underscore the importance of data, highlight promising research directions, and articulate the potential impacts on the research community and, where applicable, the society as a whole. For instance, we advocate for a suite of data-centric benchmarks tailored to the scale and complexity of data for LLMs. These benchmarks can be used to develop new data curation methods and document research efforts and results, which can help promote openness and transparency in AI and LLM research.
comment: Preprint
☆ A Review of Common Online Speaker Diarization Methods
Speaker diarization provides the answer to the question "who spoke when?" for an audio file. This information can be used to complete audio transcripts for further processing steps. Most speaker diarization systems assume that the audio file is available as a whole. However, there are scenarios in which the speaker labels are needed immediately after the arrival of an audio segment. Speaker diarization with a correspondingly low latency is referred to as online speaker diarization. This paper provides an overview. First the history of online speaker diarization is briefly presented. Next a taxonomy and datasets for training and evaluation are given. In the sections that follow, online diarization methods and systems are discussed in detail. This paper concludes with the presentation of challenges that still need to be solved by future research in the field of online speaker diarization.
comment: 6 pages
☆ Explicit and Implicit Large Language Model Personas Generate Opinions but Fail to Replicate Deeper Perceptions and Biases
Large language models (LLMs) are increasingly being used in human-centered social scientific tasks, such as data annotation, synthetic data creation, and engaging in dialog. However, these tasks are highly subjective and dependent on human factors, such as one's environment, attitudes, beliefs, and lived experiences. Thus, employing LLMs (which do not have such human factors) in these tasks may result in a lack of variation in data, failing to reflect the diversity of human experiences. In this paper, we examine the role of prompting LLMs with human-like personas and asking the models to answer as if they were a specific human. This is done explicitly, with exact demographics, political beliefs, and lived experiences, or implicitly via names prevalent in specific populations. The LLM personas are then evaluated via (1) subjective annotation task (e.g., detecting toxicity) and (2) a belief generation task, where both tasks are known to vary across human factors. We examine the impact of explicit vs. implicit personas and investigate which human factors LLMs recognize and respond to. Results show that LLM personas show mixed results when reproducing known human biases, but generate generally fail to demonstrate implicit biases. We conclude that LLMs lack the intrinsic cognitive mechanisms of human thought, while capturing the statistical patterns of how people speak, which may restrict their effectiveness in complex social science applications.
☆ Healing Powers of BERT: How Task-Specific Fine-Tuning Recovers Corrupted Language Models
Language models like BERT excel at sentence classification tasks due to extensive pre-training on general data, but their robustness to parameter corruption is unexplored. To understand this better, we look at what happens if a language model is "broken", in the sense that some of its parameters are corrupted and then recovered by fine-tuning. Strategically corrupting BERT variants at different levels, we find corrupted models struggle to fully recover their original performance, with higher corruption causing more severe degradation. Notably, bottom-layer corruption affecting fundamental linguistic features is more detrimental than top-layer corruption. Our insights contribute to understanding language model robustness and adaptability under adverse conditions, informing strategies for developing resilient NLP systems against parameter perturbations.
☆ Towards Truthful Multilingual Large Language Models: Benchmarking and Alignment Strategies
In the era of large language models (LLMs), building multilingual large language models (MLLMs) that can serve users worldwide holds great significance. However, existing research seldom focuses on the truthfulness of MLLMs. Meanwhile, contemporary multilingual aligning technologies struggle to balance massive languages and often exhibit serious truthfulness gaps across different languages, especially those that differ greatly from English. In our work, we construct a benchmark for truthfulness evaluation in multilingual scenarios and explore the ways to align facts across languages to enhance the truthfulness of MLLMs. Furthermore, we propose Fact-aware Multilingual Selective Synergy (FaMSS) to optimize the data allocation across a large number of languages and different data types. Experimental results demonstrate that our approach can effectively reduce the multilingual representation disparity and enhance the multilingual capabilities of LLMs.
comment: 15 pages
☆ SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages
Question Answering (QA) datasets have been instrumental in developing and evaluating Large Language Model (LLM) capabilities. However, such datasets are scarce for languages other than English due to the cost and difficulties of collection and manual annotation. This means that producing novel models and measuring the performance of multilingual LLMs in low-resource languages is challenging. To mitigate this, we propose $\textbf{S}$yn$\textbf{DAR}$in, a method for generating and validating QA datasets for low-resource languages. We utilize parallel content mining to obtain $\textit{human-curated}$ paragraphs between English and the target language. We use the English data as context to $\textit{generate}$ synthetic multiple-choice (MC) question-answer pairs, which are automatically translated and further validated for quality. Combining these with their designated non-English $\textit{human-curated}$ paragraphs form the final QA dataset. The method allows to maintain the content quality, reduces the likelihood of factual errors, and circumvents the need for costly annotation. To test the method, we created a QA dataset with $1.2$K samples for the Armenian language. The human evaluation shows that $98\%$ of the generated English data maintains quality and diversity in the question types and topics, while the translation validation pipeline can filter out $\sim70\%$ of data with poor quality. We use the dataset to benchmark state-of-the-art LLMs, showing their inability to achieve human accuracy with some model performances closer to random chance. This shows that the generated dataset is non-trivial and can be used to evaluate reasoning capabilities in low-resource language.
☆ FVEL: Interactive Formal Verification Environment with Large Language Models via Theorem Proving
Formal verification (FV) has witnessed growing significance with current emerging program synthesis by the evolving large language models (LLMs). However, current formal verification mainly resorts to symbolic verifiers or hand-craft rules, resulting in limitations for extensive and flexible verification. On the other hand, formal languages for automated theorem proving, such as Isabelle, as another line of rigorous verification, are maintained with comprehensive rules and theorems. In this paper, we propose FVEL, an interactive Formal Verification Environment with LLMs. Specifically, FVEL transforms a given code to be verified into Isabelle, and then conducts verification via neural automated theorem proving with an LLM. The joined paradigm leverages the rigorous yet abundant formulated and organized rules in Isabelle and is also convenient for introducing and adjusting cutting-edge LLMs. To achieve this goal, we extract a large-scale FVELER3. The FVELER dataset includes code dependencies and verification processes that are formulated in Isabelle, containing 758 theories, 29,125 lemmas, and 200,646 proof steps in total with in-depth dependencies. We benchmark FVELER in the FVEL environment by first fine-tuning LLMs with FVELER and then evaluating them on Code2Inv and SV-COMP. The results show that FVEL with FVELER fine-tuned Llama3- 8B solves 17.39% (69 -> 81) more problems, and Mistral-7B 12% (75 -> 84) more problems in SV-COMP. And the proportion of proof errors is reduced. Project page: https://fveler.github.io/.
☆ SEC-QA: A Systematic Evaluation Corpus for Financial QA
The financial domain frequently deals with large numbers of long documents that are essential for daily operations. Significant effort is put towards automating financial data analysis. However, a persistent challenge, not limited to the finance domain, is the scarcity of datasets that accurately reflect real-world tasks for model evaluation. Existing datasets are often constrained by size, context, or relevance to practical applications. Moreover, LLMs are currently trained on trillions of tokens of text, limiting access to novel data or documents that models have not encountered during training for unbiased evaluation. We propose SEC-QA, a continuous dataset generation framework with two key features: 1) the semi-automatic generation of Question-Answer (QA) pairs spanning multiple long context financial documents, which better represent real-world financial scenarios; 2) the ability to continually refresh the dataset using the most recent public document collections, not yet ingested by LLMs. Our experiments show that current retrieval augmented generation methods systematically fail to answer these challenging multi-document questions. In response, we introduce a QA system based on program-of-thought that improves the ability to perform complex information retrieval and quantitative reasoning pipelines, thereby increasing QA accuracy.
☆ Jailbreaking as a Reward Misspecification Problem
The widespread adoption of large language models (LLMs) has raised concerns about their safety and reliability, particularly regarding their vulnerability to adversarial attacks. In this paper, we propose a novel perspective that attributes this vulnerability to reward misspecification during the alignment process. We introduce a metric ReGap to quantify the extent of reward misspecification and demonstrate its effectiveness and robustness in detecting harmful backdoor prompts. Building upon these insights, we present ReMiss, a system for automated red teaming that generates adversarial prompts against various target aligned LLMs. ReMiss achieves state-of-the-art attack success rates on the AdvBench benchmark while preserving the human readability of the generated prompts. Detailed analysis highlights the unique advantages brought by the proposed reward misspecification objective compared to previous methods.
☆ Artificial Leviathan: Exploring Social Evolution of LLM Agents Through the Lens of Hobbesian Social Contract Theory
The emergence of Large Language Models (LLMs) and advancements in Artificial Intelligence (AI) offer an opportunity for computational social science research at scale. Building upon prior explorations of LLM agent design, our work introduces a simulated agent society where complex social relationships dynamically form and evolve over time. Agents are imbued with psychological drives and placed in a sandbox survival environment. We conduct an evaluation of the agent society through the lens of Thomas Hobbes's seminal Social Contract Theory (SCT). We analyze whether, as the theory postulates, agents seek to escape a brutish "state of nature" by surrendering rights to an absolute sovereign in exchange for order and security. Our experiments unveil an alignment: Initially, agents engage in unrestrained conflict, mirroring Hobbes's depiction of the state of nature. However, as the simulation progresses, social contracts emerge, leading to the authorization of an absolute sovereign and the establishment of a peaceful commonwealth founded on mutual cooperation. This congruence between our LLM agent society's evolutionary trajectory and Hobbes's theoretical account indicates LLMs' capability to model intricate social dynamics and potentially replicate forces that shape human societies. By enabling such insights into group behavior and emergent societal phenomena, LLM-driven multi-agent simulations, while unable to simulate all the nuances of human behavior, may hold potential for advancing our understanding of social structures, group dynamics, and complex human systems.
☆ The neural correlates of logical-mathematical symbol systems processing resemble that of spatial cognition more than natural language processing
The ability to manipulate logical-mathematical symbols (LMS), encompassing tasks such as calculation, reasoning, and programming, is a cognitive skill arguably unique to humans. Considering the relatively recent emergence of this ability in human evolutionary history, it has been suggested that LMS processing may build upon more fundamental cognitive systems, possibly through neuronal recycling. Previous studies have pinpointed two primary candidates, natural language processing and spatial cognition. Existing comparisons between these domains largely relied on task-level comparison, which may be confounded by task idiosyncrasy. The present study instead compared the neural correlates at the domain level with both automated meta-analysis and synthesized maps based on three representative LMS tasks, reasoning, calculation, and mental programming. Our results revealed a more substantial cortical overlap between LMS processing and spatial cognition, in contrast to language processing. Furthermore, in regions activated by both spatial and language processing, the multivariate activation pattern for LMS processing exhibited greater multivariate similarity to spatial cognition than to language processing. A hierarchical clustering analysis further indicated that typical LMS tasks were indistinguishable from spatial cognition tasks at the neural level, suggesting an inherent connection between these two cognitive processes. Taken together, our findings support the hypothesis that spatial cognition is likely the basis of LMS processing, which may shed light on the limitations of large language models in logical reasoning, particularly those trained exclusively on textual data without explicit emphasis on spatial content.
☆ Exploring Spatial Representations in the Historical Lake District Texts with LLM-based Relation Extraction
Navigating historical narratives poses a challenge in unveiling the spatial intricacies of past landscapes. The proposed work addresses this challenge within the context of the English Lake District, employing the Corpus of the Lake District Writing. The method utilizes a generative pre-trained transformer model to extract spatial relations from the textual descriptions in the corpus. The study applies this large language model to understand the spatial dimensions inherent in historical narratives comprehensively. The outcomes are presented as semantic triples, capturing the nuanced connections between entities and locations, and visualized as a network, offering a graphical representation of the spatial narrative. The study contributes to a deeper comprehension of the English Lake District's spatial tapestry and provides an approach to uncovering spatial relations within diverse historical contexts.
Self-supervised Interpretable Concept-based Models for Text Classification
Despite their success, Large-Language Models (LLMs) still face criticism as their lack of interpretability limits their controllability and reliability. Traditional post-hoc interpretation methods, based on attention and gradient-based analysis, offer limited insight into the model's decision-making processes. In the image field, Concept-based models have emerged as explainable-by-design architectures, employing human-interpretable features as intermediate representations. However, these methods have not been yet adapted to textual data, mainly because they require expensive concept annotations, which are impractical for real-world text data. This paper addresses this challenge by proposing a self-supervised Interpretable Concept Embedding Models (ICEMs). We leverage the generalization abilities of LLMs to predict the concepts labels in a self-supervised way, while we deliver the final predictions with an interpretable function. The results of our experiments show that ICEMs can be trained in a self-supervised way achieving similar performance to fully supervised concept-based models and end-to-end black-box ones. Additionally, we show that our models are (i) interpretable, offering meaningful logical explanations for their predictions; (ii) interactable, allowing humans to modify intermediate predictions through concept interventions; and (iii) controllable, guiding the LLMs' decoding process to follow a required decision-making path.
☆ medIKAL: Integrating Knowledge Graphs as Assistants of LLMs for Enhanced Clinical Diagnosis on EMRs
Electronic Medical Records (EMRs), while integral to modern healthcare, present challenges for clinical reasoning and diagnosis due to their complexity and information redundancy. To address this, we proposed medIKAL (Integrating Knowledge Graphs as Assistants of LLMs), a framework that combines Large Language Models (LLMs) with knowledge graphs (KGs) to enhance diagnostic capabilities. medIKAL assigns weighted importance to entities in medical records based on their type, enabling precise localization of candidate diseases within KGs. It innovatively employs a residual network-like approach, allowing initial diagnosis by the LLM to be merged into KG search results. Through a path-based reranking algorithm and a fill-in-the-blank style prompt template, it further refined the diagnostic process. We validated medIKAL's effectiveness through extensive experiments on a newly introduced open-sourced Chinese EMR dataset, demonstrating its potential to improve clinical diagnosis in real-world settings.
☆ Mind the Privacy Unit! User-Level Differential Privacy for Language Model Fine-Tuning
Large language models (LLMs) have emerged as powerful tools for tackling complex tasks across diverse domains, but they also raise privacy concerns when fine-tuned on sensitive data due to potential memorization. While differential privacy (DP) offers a promising solution by ensuring models are `almost indistinguishable' with or without any particular privacy unit, current evaluations on LLMs mostly treat each example (text record) as the privacy unit. This leads to uneven user privacy guarantees when contributions per user vary. We therefore study user-level DP motivated by applications where it necessary to ensure uniform privacy protection across users. We present a systematic evaluation of user-level DP for LLM fine-tuning on natural language generation tasks. Focusing on two mechanisms for achieving user-level DP guarantees, Group Privacy and User-wise DP-SGD, we investigate design choices like data selection strategies and parameter tuning for the best privacy-utility tradeoff.
☆ LiveMind: Low-latency Large Language Models with Simultaneous Inference
In this paper, we introduce a novel low-latency inference framework for large language models (LLMs) inference which enables LLMs to perform inferences with incomplete prompts. By reallocating computational processes to prompt input phase, we achieve a substantial reduction in latency, thereby significantly enhancing the interactive experience for users of LLMs. The framework adeptly manages the visibility of the streaming prompt to the model, allowing it to infer from incomplete prompts or await additional prompts. Compared with traditional inference methods that utilize complete prompts, our approach demonstrates an average reduction of 59% in response latency on the MMLU-Pro dataset, while maintaining comparable accuracy. Additionally, our framework facilitates collaborative inference and output across different models. By employing an LLM for inference and a small language model (SLM) for output, we achieve an average 68% reduction in response latency, alongside a 5.5% improvement in accuracy on the MMLU-Pro dataset compared with the SLM baseline. For long prompts exceeding 20 sentences, the response latency can be reduced by up to 93%.
☆ The Fire Thief Is Also the Keeper: Balancing Usability and Privacy in Prompts
The rapid adoption of online chatbots represents a significant advancement in artificial intelligence. However, this convenience brings considerable privacy concerns, as prompts can inadvertently contain sensitive information exposed to large language models (LLMs). Limited by high computational costs, reduced task usability, and excessive system modifications, previous works based on local deployment, embedding perturbation, and homomorphic encryption are inapplicable to online prompt-based LLM applications. To address these issues, this paper introduces Prompt Privacy Sanitizer (i.e., ProSan), an end-to-end prompt privacy protection framework that can produce anonymized prompts with contextual privacy removed while maintaining task usability and human readability. It can also be seamlessly integrated into the online LLM service pipeline. To achieve high usability and dynamic anonymity, ProSan flexibly adjusts its protection targets and strength based on the importance of the words and the privacy leakage risk of the prompts. Additionally, ProSan is capable of adapting to diverse computational resource conditions, ensuring privacy protection even for mobile devices with limited computing power. Our experiments demonstrate that ProSan effectively removes private information across various tasks, including question answering, text summarization, and code generation, with minimal reduction in task performance.
☆ Identifying User Goals from UI Trajectories
Autonomous agents that interact with graphical user interfaces (GUIs) hold significant potential for enhancing user experiences. To further improve these experiences, agents need to be personalized and proactive. By effectively comprehending user intentions through their actions and interactions with GUIs, agents will be better positioned to achieve these goals. This paper introduces the task of goal identification from observed UI trajectories, aiming to infer the user's intended task based on their GUI interactions. We propose a novel evaluation metric to assess whether two task descriptions are paraphrases within a specific UI environment. By Leveraging the inverse relation with the UI automation task, we utilized the Android-In-The-Wild and Mind2Web datasets for our experiments. Using our metric and these datasets, we conducted several experiments comparing the performance of humans and state-of-the-art models, specifically GPT-4 and Gemini-1.5 Pro. Our results show that Gemini performs better than GPT but still underperforms compared to humans, indicating significant room for improvement.
☆ Robust Few-shot Transfer Learning for Knowledge Base Question Answering with Unanswerable Questions
Real-world KBQA applications require models that are (1) robust -- e.g., can differentiate between answerable and unanswerable questions, and (2) low-resource -- do not require large training data. Towards this goal, we propose the novel task of few-shot transfer for KBQA with unanswerable questions. We present FUn-FuSIC that extends the state-of-the-art (SoTA) few-shot transfer model for answerable-only KBQA to handle unanswerability. It iteratively prompts an LLM to generate logical forms for the question by providing feedback using a diverse suite of syntactic, semantic and execution guided checks, and adapts self-consistency to assess confidence of the LLM to decide answerability. Experiments over newly constructed datasets show that FUn-FuSIC outperforms suitable adaptations of the SoTA model for KBQA with unanswerability, and the SoTA model for answerable-only few-shot-transfer KBQA.
☆ Infusing clinical knowledge into tokenisers for language models
This study introduces a novel knowledge enhanced tokenisation mechanism, K-Tokeniser, for clinical text processing. Technically, at initialisation stage, K-Tokeniser populates global representations of tokens based on semantic types of domain concepts (such as drugs or diseases) from either a domain ontology like Unified Medical Language System or the training data of the task related corpus. At training or inference stage, sentence level localised context will be utilised for choosing the optimal global token representation to realise the semantic-based tokenisation. To avoid pretraining using the new tokeniser, an embedding initialisation approach is proposed to generate representations for new tokens. Using three transformer-based language models, a comprehensive set of experiments are conducted on four real-world datasets for evaluating K-Tokeniser in a wide range of clinical text analytics tasks including clinical concept and relation extraction, automated clinical coding, clinical phenotype identification, and clinical research article classification. Overall, our models demonstrate consistent improvements over their counterparts in all tasks. In particular, substantial improvements are observed in the automated clinical coding task with 13\% increase on Micro $F_1$ score. Furthermore, K-Tokeniser also shows significant capacities in facilitating quicker converge of language models. Specifically, using K-Tokeniser, the language models would only require 50\% of the training data to achieve the best performance of the baseline tokeniser using all training data in the concept extraction task and less than 20\% of the data for the automated coding task. It is worth mentioning that all these improvements require no pre-training process, making the approach generalisable.
comment: 18 pages, 6 figures
☆ QuST-LLM: Integrating Large Language Models for Comprehensive Spatial Transcriptomics Analysis
In this paper, we introduce QuST-LLM, an innovative extension of QuPath that utilizes the capabilities of large language models (LLMs) to analyze and interpret spatial transcriptomics (ST) data. This tool effectively simplifies the intricate and high-dimensional nature of ST data by offering a comprehensive workflow that includes data loading, region selection, gene expression analysis, and functional annotation. QuST-LLM employs LLMs to transform complex ST data into understandable and detailed biological narratives based on gene ontology annotations, thereby significantly improving the interpretability of ST data. Consequently, users can interact with their own ST data using natural language. Hence, QuST-LLM provides researchers with a potent functionality to unravel the spatial and functional complexities of tissues, fostering novel insights and advancements in biomedical research.
comment: 12 pages, 7 figures
☆ VAIYAKARANA : A Benchmark for Automatic Grammar Correction in Bangla
Bangla (Bengali) is the fifth most spoken language globally and, yet, the problem of automatic grammar correction in Bangla is still in its nascent stage. This is mostly due to the need for a large corpus of grammatically incorrect sentences, with their corresponding correct counterparts. The present state-of-the-art techniques to curate a corpus for grammatically wrong sentences involve random swapping, insertion and deletion of words. However,these steps may not always generate grammatically wrong sentences in Bangla. In this work, we propose a pragmatic approach to generate grammatically wrong sentences in Bangla. We first categorize the different kinds of errors in Bangla into 5 broad classes and 12 finer classes. We then use these to generate grammatically wrong sentences systematically from a correct sentence. This approach can generate a large number of wrong sentences and can, thus, mitigate the challenge of lacking a large corpus for neural networks. We provide a dataset, Vaiyakarana, consisting of 92,830 grammatically incorrect sentences as well as 18,426 correct sentences. We also collected 619 human-generated sentences from essays written by Bangla native speakers. This helped us to understand errors that are more frequent. We evaluated our corpus against neural models and LLMs and also benchmark it against human evaluators who are native speakers of Bangla. Our analysis shows that native speakers are far more accurate than state-of-the-art models to detect whether the sentence is grammatically correct. Our methodology of generating erroneous sentences can be applied for most other Indian languages as well.
☆ Learning to Plan for Retrieval-Augmented Large Language Models from Knowledge Graphs
Improving the performance of large language models (LLMs) in complex question-answering (QA) scenarios has always been a research focal point. Recent studies have attempted to enhance LLMs' performance by combining step-wise planning with external retrieval. While effective for advanced models like GPT-3.5, smaller LLMs face challenges in decomposing complex questions, necessitating supervised fine-tuning. Previous work has relied on manual annotation and knowledge distillation from teacher LLMs, which are time-consuming and not accurate enough. In this paper, we introduce a novel framework for enhancing LLMs' planning capabilities by using planning data derived from knowledge graphs (KGs). LLMs fine-tuned with this data have improved planning capabilities, better equipping them to handle complex QA tasks that involve retrieval. Evaluations on multiple datasets, including our newly proposed benchmark, highlight the effectiveness of our framework and the benefits of KG-derived planning data.
comment: Work in progress
☆ Augmenting Query and Passage for Retrieval-Augmented Generation using LLMs for Open-Domain Question Answering
Retrieval-augmented generation (RAG) has received much attention for Open-domain question-answering (ODQA) tasks as a means to compensate for the parametric knowledge of large language models (LLMs). While previous approaches focused on processing retrieved passages to remove irrelevant context, they still rely heavily on the quality of retrieved passages which can degrade if the question is ambiguous or complex. In this paper, we propose a simple yet efficient method called question and passage augmentation via LLMs for open-domain QA. Our method first decomposes the original questions into multiple-step sub-questions. By augmenting the original question with detailed sub-questions and planning, we are able to make the query more specific on what needs to be retrieved, improving the retrieval performance. In addition, to compensate for the case where the retrieved passages contain distracting information or divided opinions, we augment the retrieved passages with self-generated passages by LLMs to guide the answer extraction. Experimental results show that the proposed scheme outperforms the previous state-of-the-art and achieves significant performance gain over existing RAG methods.
☆ Step-Back Profiling: Distilling User History for Personalized Scientific Writing
Large language models (LLMs) excel at a variety of natural language processing tasks, yet they struggle to generate personalized content for individuals, particularly in real-world scenarios like scientific writing. Addressing this challenge, we introduce Step-Back Profiling to personalize LLMs by distilling user history into concise profiles, including essential traits and preferences of users. Regarding our experiments, we construct a Personalized Scientific Writing (PSW) dataset to study multiuser personalization. PSW requires the models to write scientific papers given specialized author groups with diverse academic backgrounds. As for the results, we demonstrate the effectiveness of capturing user characteristics via Step-Back Profiling for collaborative writing. Moreover, our approach outperforms the baselines by up to 3.6 points on the general personalization benchmark (LaMP), including 7 personalization LLM tasks. Our extensive ablation studies validate the contributions of different components in our method and provide insights into our task definition. Our dataset and code are available at \url{https://github.com/gersteinlab/step-back-profiling}.
☆ On the Evaluation Practices in Multilingual NLP: Can Machine Translation Offer an Alternative to Human Translations?
While multilingual language models (MLMs) have been trained on 100+ languages, they are typically only evaluated across a handful of them due to a lack of available test data in most languages. This is particularly problematic when assessing MLM's potential for low-resource and unseen languages. In this paper, we present an analysis of existing evaluation frameworks in multilingual NLP, discuss their limitations, and propose several directions for more robust and reliable evaluation practices. Furthermore, we empirically study to what extent machine translation offers a {reliable alternative to human translation} for large-scale evaluation of MLMs across a wide set of languages. We use a SOTA translation model to translate test data from 4 tasks to 198 languages and use them to evaluate three MLMs. We show that while the selected subsets of high-resource test languages are generally sufficiently representative of a wider range of high-resource languages, we tend to overestimate MLMs' ability on low-resource languages. Finally, we show that simpler baselines can achieve relatively strong performance without having benefited from large-scale multilingual pretraining.
☆ Raising the Bar: Investigating the Values of Large Language Models via Generative Evolving Testing
Warning: this paper contains model outputs exhibiting unethical information. Large Language Models (LLMs) have achieved significant breakthroughs, but their generated unethical content poses potential risks. Measuring value alignment of LLMs becomes crucial for their regulation and responsible deployment. Numerous datasets have been constructed to assess social bias, toxicity, and ethics in LLMs, but they suffer from evaluation chronoeffect, that is, as models rapidly evolve, existing data becomes leaked or undemanding, overestimating ever-developing LLMs. To tackle this problem, we propose GETA, a novel generative evolving testing approach that dynamically probes the underlying moral baselines of LLMs. Distinct from previous adaptive testing methods that rely on static datasets with limited difficulty, GETA incorporates an iteratively-updated item generator which infers each LLM's moral boundaries and generates difficulty-tailored testing items, accurately reflecting the true alignment extent. This process theoretically learns a joint distribution of item and model response, with item difficulty and value conformity as latent variables, where the generator co-evolves with the LLM, addressing chronoeffect. We evaluate various popular LLMs with diverse capabilities and demonstrate that GETA can create difficulty-matching testing items and more accurately assess LLMs' values, better consistent with their performance on unseen OOD and i.i.d. items, laying the groundwork for future evaluation paradigms.
comment: Work in progress
☆ Complexity of Symbolic Representation in Working Memory of Transformer Correlates with the Complexity of a Task
Even though Transformers are extensively used for Natural Language Processing tasks, especially for machine translation, they lack an explicit memory to store key concepts of processed texts. This paper explores the properties of the content of symbolic working memory added to the Transformer model decoder. Such working memory enhances the quality of model predictions in machine translation task and works as a neural-symbolic representation of information that is important for the model to make correct translations. The study of memory content revealed that translated text keywords are stored in the working memory, pointing to the relevance of memory content to the processed text. Also, the diversity of tokens and parts of speech stored in memory correlates with the complexity of the corpora for machine translation task.
comment: 18 pages, 6 figures. Published in the journal Cognitive Systems Research 3 June 2022: https://www.sciencedirect.com/science/article/abs/pii/S1389041722000274
☆ On the Representational Capacity of Neural Language Models with Chain-of-Thought Reasoning ACL 2024
The performance of modern language models (LMs) has been improved by chain-of-thought (CoT) reasoning, i.e., the process of generating intermediate results that guide the model towards a final answer. A possible explanation for this improvement is that CoT reasoning extends an LM's computational power, as RNNs and transformers with additional scratch space are known to be Turing complete. Comparing LMs to Turing machines, however, introduces a category error - Turing machines decide language membership, whereas LMs define distributions over strings. To bridge this gap, we formalize CoT reasoning in a probabilistic setting. We present several results on the representational capacity of recurrent and transformer LMs with CoT reasoning, showing that they can represent the same family of distributions over strings as probabilistic Turing machines.
comment: To be published at ACL 2024
☆ Timo: Towards Better Temporal Reasoning for Language Models
Reasoning about time is essential for Large Language Models (LLMs) to understand the world. Previous works focus on solving specific tasks, primarily on time-sensitive question answering. While these methods have proven effective, they cannot generalize to a wider spectrum of temporal reasoning tasks. Therefore, we propose a crucial question: Can we build a universal framework to handle a variety of temporal reasoning tasks? To that end, we systematically study 38 temporal reasoning tasks. Based on the observation that 19 tasks are directly related to mathematics, we first leverage the available mathematical dataset to set a solid foundation for temporal reasoning. However, the in-depth study indicates that focusing solely on mathematical enhancement falls short of addressing pure temporal reasoning tasks. To mitigate this limitation, we propose a simple but effective self-critic temporal optimization method to enhance the model's temporal reasoning capabilities without sacrificing general task abilities. Finally, we develop Timo, a model designed to excel in temporal reasoning at the 7B and 13B scales. Notably, Timo outperforms the counterpart LLMs by 10.0 and 7.6 in average accuracy scores and achieves the new state-of-the-art (SOTA) performance of comparable size. Extensive experiments further validate our framework's effectiveness and its generalization across diverse temporal tasks. The code is available at https://github.com/zhaochen0110/Timo.
comment: Under review
☆ Temporal Knowledge Graph Question Answering: A Survey
Knowledge Base Question Answering (KBQA) has been a long-standing field to answer questions based on knowledge bases. Recently, the evolving dynamics of knowledge have attracted a growing interest in Temporal Knowledge Graph Question Answering (TKGQA), an emerging task to answer temporal questions. However, this field grapples with ambiguities in defining temporal questions and lacks a systematic categorization of existing methods for TKGQA. In response, this paper provides a thorough survey from two perspectives: the taxonomy of temporal questions and the methodological categorization for TKGQA. Specifically, we first establish a detailed taxonomy of temporal questions engaged in prior studies. Subsequently, we provide a comprehensive review of TKGQA techniques of two categories: semantic parsing-based and TKG embedding-based. Building on this review, the paper outlines potential research directions aimed at advancing the field of TKGQA. This work aims to serve as a comprehensive reference for TKGQA and to stimulate further research.
comment: 8 pages, 3 figures
☆ In Tree Structure Should Sentence Be Generated
Generative models reliant on sequential autoregression have been at the forefront of language generation for an extensive period, particularly following the introduction of widely acclaimed transformers. Despite its excellent performance, there are always some issues that we face today. For example, problems such as hallucinations and getting trapped in a logic loop may occur. To enhance the performance of existing systems, this paper introduces a new method for generating sequences in natural language, which involves generating the targeted sentence in a tree-traversing order. The paper includes an illustration of the theoretical basis and validity of the approach, as well as a comparison of its fundamentals with the diffusion model in graphic generation. Finally, a module called SenTree is introduced for generating an approximating binary tree. It is already available at https://github.com/arklyg/sentree. Additionally, a joint training framework based on this approach is proposed, incorporating the intrinsics of generative adversarial networks.
☆ SimulSeamless: FBK at IWSLT 2024 Simultaneous Speech Translation
This paper describes the FBK's participation in the Simultaneous Translation Evaluation Campaign at IWSLT 2024. For this year's submission in the speech-to-text translation (ST) sub-track, we propose SimulSeamless, which is realized by combining AlignAtt and SeamlessM4T in its medium configuration. The SeamlessM4T model is used "off-the-shelf" and its simultaneous inference is enabled through the adoption of AlignAtt, a SimulST policy based on cross-attention that can be applied without any retraining or adaptation of the underlying model for the simultaneous task. We participated in all the Shared Task languages (English->{German, Japanese, Chinese}, and Czech->English), achieving acceptable or even better results compared to last year's submissions. SimulSeamless, covering more than 143 source languages and 200 target languages, is released at: https://github.com/hlt-mt/FBK-fairseq/.
☆ Ranking LLMs by compression
We conceptualize the process of understanding as information compression, and propose a method for ranking large language models (LLMs) based on lossless data compression. We demonstrate the equivalence of compression length under arithmetic coding with cumulative negative log probabilities when using a large language model as a prior, that is, the pre-training phase of the model is essentially the process of learning the optimal coding length. At the same time, the evaluation metric compression ratio can be obtained without actual compression, which greatly saves overhead. In this paper, we use five large language models as priors for compression, then compare their performance on challenging natural language processing tasks, including sentence completion, question answering, and coreference resolution. Experimental results show that compression ratio and model performance are positively correlated, so it can be used as a general metric to evaluate large language models.
comment: 7 pages, 4 tables
☆ Definition generation for lexical semantic change detection ACL 2024
We use contextualized word definitions generated by large language models as semantic representations in the task of diachronic lexical semantic change detection (LSCD). In short, generated definitions are used as `senses', and the change score of a target word is retrieved by comparing their distributions in two time periods under comparison. On the material of five datasets and three languages, we show that generated definitions are indeed specific and general enough to convey a signal sufficient to rank sets of words by the degree of their semantic change over time. Our approach is on par with or outperforms prior non-supervised sense-based LSCD methods. At the same time, it preserves interpretability and allows to inspect the reasons behind a specific shift in terms of discrete definitions-as-senses. This is another step in the direction of explainable semantic change modeling.
comment: Findings of ACL 2024
☆ A Data-Driven Guided Decoding Mechanism for Diagnostic Captioning ACL
Diagnostic Captioning (DC) automatically generates a diagnostic text from one or more medical images (e.g., X-rays, MRIs) of a patient. Treated as a draft, the generated text may assist clinicians, by providing an initial estimation of the patient's condition, speeding up and helping safeguard the diagnostic process. The accuracy of a diagnostic text, however, strongly depends on how well the key medical conditions depicted in the images are expressed. We propose a new data-driven guided decoding method that incorporates medical information, in the form of existing tags capturing key conditions of the image(s), into the beam search of the diagnostic text generation process. We evaluate the proposed method on two medical datasets using four DC systems that range from generic image-to-text systems with CNN encoders and RNN decoders to pre-trained Large Language Models. The latter can also be used in few- and zero-shot learning scenarios. In most cases, the proposed mechanism improves performance with respect to all evaluation measures. We provide an open-source implementation of the proposed method at https://github.com/nlpaueb/dmmcs.
comment: [Pre-print] ACL Findings 2024, 17 pages, 7 figures, 7 tables
☆ DIRAS: Efficient LLM-Assisted Annotation of Document Relevance in Retrieval Augmented Generation
Retrieval Augmented Generation (RAG) is widely employed to ground responses to queries on domain-specific documents. But do RAG implementations leave out important information or excessively include irrelevant information? To allay these concerns, it is necessary to annotate domain-specific benchmarks to evaluate information retrieval (IR) performance, as relevance definitions vary across queries and domains. Furthermore, such benchmarks should be cost-efficiently annotated to avoid annotation selection bias. In this paper, we propose DIRAS (Domain-specific Information Retrieval Annotation with Scalability), a manual-annotation-free schema that fine-tunes open-sourced LLMs to annotate relevance labels with calibrated relevance probabilities. Extensive evaluation shows that DIRAS fine-tuned models achieve GPT-4-level performance on annotating and ranking unseen (query, document) pairs, and is helpful for real-world RAG development.
☆ Aligning Large Language Models with Diverse Political Viewpoints
Large language models such as ChatGPT often exhibit striking political biases. If users query them about political information, they might take a normative stance and reinforce such biases. To overcome this, we align LLMs with diverse political viewpoints from 100,000 comments written by candidates running for national parliament in Switzerland. Such aligned models are able to generate more accurate political viewpoints from Swiss parties compared to commercial models such as ChatGPT. We also propose a procedure to generate balanced overviews from multiple viewpoints using such models.
☆ Watching the Watchers: A Comparative Fairness Audit of Cloud-based Content Moderation Services
Online platforms face the challenge of moderating an ever-increasing volume of content, including harmful hate speech. In the absence of clear legal definitions and a lack of transparency regarding the role of algorithms in shaping decisions on content moderation, there is a critical need for external accountability. Our study contributes to filling this gap by systematically evaluating four leading cloud-based content moderation services through a third-party audit, highlighting issues such as biases against minorities and vulnerable groups that may arise through over-reliance on these services. Using a black-box audit approach and four benchmark data sets, we measure performance in explicit and implicit hate speech detection as well as counterfactual fairness through perturbation sensitivity analysis and present disparities in performance for certain target identity groups and data sets. Our analysis reveals that all services had difficulties detecting implicit hate speech, which relies on more subtle and codified messages. Moreover, our results point to the need to remove group-specific bias. It seems that biases towards some groups, such as Women, have been mostly rectified, while biases towards other groups, such as LGBTQ+ and PoC remain.
comment: Accepted at European Workshop on Algorithmic Fairness (EWAF'24)
☆ Finding Safety Neurons in Large Language Models
Large language models (LLMs) excel in various capabilities but also pose safety risks such as generating harmful content and misinformation, even after safety alignment. In this paper, we explore the inner mechanisms of safety alignment from the perspective of mechanistic interpretability, focusing on identifying and analyzing safety neurons within LLMs that are responsible for safety behaviors. We propose generation-time activation contrasting to locate these neurons and dynamic activation patching to evaluate their causal effects. Experiments on multiple recent LLMs show that: (1) Safety neurons are sparse and effective. We can restore $90$% safety performance with intervention only on about $5$% of all the neurons. (2) Safety neurons encode transferrable mechanisms. They exhibit consistent effectiveness on different red-teaming datasets. The finding of safety neurons also interprets "alignment tax". We observe that the identified key neurons for safety and helpfulness significantly overlap, but they require different activation patterns of the shared neurons. Furthermore, we demonstrate an application of safety neurons in detecting unsafe outputs before generation. Our findings may promote further research on understanding LLM alignment. The source codes will be publicly released to facilitate future research.
☆ MACAROON: Training Vision-Language Models To Be Your Engaged Partners
Large vision-language models (LVLMs), while proficient in following instructions and responding to diverse questions, invariably generate detailed responses even when questions are ambiguous or unanswerable, leading to hallucinations and bias issues. Thus, it is essential for LVLMs to proactively engage with humans to ask for clarifications or additional information for better responses. In this study, we aim to shift LVLMs from passive answer providers to proactive engaged partners. We begin by establishing a three-tiered hierarchy for questions of invalid, ambiguous, and personalizable nature to measure the proactive engagement capabilities of LVLMs. Utilizing this hierarchy, we create PIE, (ProactIve Engagement Evaluation) through GPT-4o and human annotators, consisting of 853 questions across six distinct, fine-grained question types that are verified by human annotators and accompanied with well-defined metrics. Our evaluations on \benchmark indicate poor performance of existing LVLMs, with the best-performing open-weights model only achieving an Aggregate Align Rate (AAR) of 0.28. In response, we introduce MACAROON, self-iMaginAtion for ContrAstive pReference OptimizatiON, which instructs LVLMs to autonomously generate contrastive response pairs for unlabeled questions given the task description and human-crafted criteria. Then, the self-imagined data is formatted for conditional reinforcement learning. Experimental results show MACAROON effectively improves LVLMs' capabilities to be proactively engaged (0.84 AAR) while maintaining comparable performance on general tasks.
comment: The code will be made public at https://github.com/ShujinWu-0814/MACAROON
☆ Towards Event-oriented Long Video Understanding
With the rapid development of video Multimodal Large Language Models (MLLMs), numerous benchmarks have been proposed to assess their video understanding capability. However, due to the lack of rich events in the videos, these datasets may suffer from the short-cut bias that the answers can be deduced from a few frames, without the need to watch the entire video. To address this issue, we introduce Event-Bench, an event-oriented long video understanding benchmark built on existing datasets and human annotations. Event-Bench includes six event-related tasks and 2,190 test instances to comprehensively evaluate video event understanding ability. Additionally, we propose Video Instruction Merging~(VIM), a cost-effective method that enhances video MLLMs using merged, event-intensive video instructions, addressing the scarcity of human-annotated, event-intensive data. Extensive experiments show that the best-performing model, GPT-4o, achieves an overall accuracy of 53.33, significantly outperforming the best open-source model by 41.42%. Leveraging an effective instruction synthesis method and an adaptive model architecture, VIM surpasses both state-of-the-art open-source models and GPT-4V on the Event-Bench. All code, data, and models are publicly available at https://github.com/RUCAIBox/Event-Bench.
comment: Work on progress
☆ An Investigation of Prompt Variations for Zero-shot LLM-based Rankers
We provide a systematic understanding of the impact of specific components and wordings used in prompts on the effectiveness of rankers based on zero-shot Large Language Models (LLMs). Several zero-shot ranking methods based on LLMs have recently been proposed. Among many aspects, methods differ across (1) the ranking algorithm they implement, e.g., pointwise vs. listwise, (2) the backbone LLMs used, e.g., GPT3.5 vs. FLAN-T5, (3) the components and wording used in prompts, e.g., the use or not of role-definition (role-playing) and the actual words used to express this. It is currently unclear whether performance differences are due to the underlying ranking algorithm, or because of spurious factors such as better choice of words used in prompts. This confusion risks to undermine future research. Through our large-scale experimentation and analysis, we find that ranking algorithms do contribute to differences between methods for zero-shot LLM ranking. However, so do the LLM backbones -- but even more importantly, the choice of prompt components and wordings affect the ranking. In fact, in our experiments, we find that, at times, these latter elements have more impact on the ranker's effectiveness than the actual ranking algorithms, and that differences among ranking methods become more blurred when prompt variations are considered.
☆ Take the essence and discard the dross: A Rethinking on Data Selection for Fine-Tuning Large Language Models
Data selection for fine-tuning Large Language Models (LLMs) aims to select a high-quality subset from a given candidate dataset to train a Pending Fine-tune Model (PFM) into a Selective-Enhanced Model (SEM). It can improve the model performance and accelerate the training process. Although a few surveys have investigated related works of data selection, there is a lack of comprehensive comparison between existing methods due to their various experimental settings. To address this issue, we first propose a three-stage scheme for data selection and comprehensively review existing works according to this scheme. Then, we design a unified comparing method with ratio-based efficiency indicators and ranking-based feasibility indicators to overcome the difficulty of comparing various models with diverse experimental settings. After an in-depth comparative analysis, we find that the more targeted method with data-specific and model-specific quality labels has higher efficiency, but the introduction of additional noise information should be avoided when designing selection algorithms. Finally, we summarize the trends in data selection and highlight the short-term and long-term challenges to guide future research.
☆ EasyECR: A Library for Easy Implementation and Evaluation of Event Coreference Resolution Models
Event Coreference Resolution (ECR) is the task of clustering event mentions that refer to the same real-world event. Despite significant advancements, ECR research faces two main challenges: limited generalizability across domains due to narrow dataset evaluations, and difficulties in comparing models within diverse ECR pipelines. To address these issues, we develop EasyECR, the first open-source library designed to standardize data structures and abstract ECR pipelines for easy implementation and fair evaluation. More specifically, EasyECR integrates seven representative pipelines and ten popular benchmark datasets, enabling model evaluations on various datasets and promoting the development of robust ECR pipelines. By conducting extensive evaluation via our EasyECR, we find that, \lowercase\expandafter{\romannumeral1}) the representative ECR pipelines cannot generalize across multiple datasets, hence evaluating ECR pipelines on multiple datasets is necessary, \lowercase\expandafter{\romannumeral2}) all models in ECR pipelines have a great effect on pipeline performance, therefore, when one model in ECR pipelines are compared, it is essential to ensure that the other models remain consistent. Additionally, reproducing ECR results is not trivial, and the developed library can help reduce this discrepancy. The experimental results provide valuable baselines for future research.
comment: 14 pages, 4 figures, 12 tables
☆ Let Guidelines Guide You: A Prescriptive Guideline-Centered Data Annotation Methodology
We introduce the Guideline-Centered annotation process, a novel data annotation methodology focused on reporting the annotation guidelines associated with each data sample. We identify three main limitations of the standard prescriptive annotation process and describe how the Guideline-Centered methodology overcomes them by reducing the loss of information in the annotation process and ensuring adherence to guidelines. Additionally, we discuss how the Guideline-Centered enables the reuse of annotated data across multiple tasks at the cost of a single human-annotation process.
☆ Seamless Language Expansion: Enhancing Multilingual Mastery in Self-Supervised Models
Self-supervised (SSL) models have shown great performance in various downstream tasks. However, they are typically developed for limited languages, and may encounter new languages in real-world. Developing a SSL model for each new language is costly. Thus, it is vital to figure out how to efficiently adapt existed SSL models to a new language without impairing its original abilities. We propose adaptation methods which integrate LoRA to existed SSL models to extend new language. We also develop preservation strategies which include data combination and re-clustering to retain abilities on existed languages. Applied to mHuBERT, we investigate their effectiveness on speech re-synthesis task. Experiments show that our adaptation methods enable mHuBERT to be applied to a new language (Mandarin) with MOS value increased about 1.6 and the relative value of WER reduced up to 61.72%. Also, our preservation strategies ensure that the performance on both existed and new languages remains intact.
comment: Accepted by Interspeech 2024
☆ Protecting Privacy Through Approximating Optimal Parameters for Sequence Unlearning in Language Models ACL2024
Although language models (LMs) demonstrate exceptional capabilities on various tasks, they are potentially vulnerable to extraction attacks, which represent a significant privacy risk. To mitigate the privacy concerns of LMs, machine unlearning has emerged as an important research area, which is utilized to induce the LM to selectively forget about some of its training data. While completely retraining the model will guarantee successful unlearning and privacy assurance, it is impractical for LMs, as it would be time-consuming and resource-intensive. Prior works efficiently unlearn the target token sequences, but upon subsequent iterations, the LM displays significant degradation in performance. In this work, we propose Privacy Protection via Optimal Parameters (POP), a novel unlearning method that effectively forgets the target token sequences from the pretrained LM by applying optimal gradient updates to the parameters. Inspired by the gradient derivation of complete retraining, we approximate the optimal training objective that successfully unlearns the target sequence while retaining the knowledge from the rest of the training data. Experimental results demonstrate that POP exhibits remarkable retention performance post-unlearning across 9 classification and 4 dialogue benchmarks, outperforming the state-of-the-art by a large margin. Furthermore, we introduce Remnant Memorization Accuracy that quantifies privacy risks based on token likelihood and validate its effectiveness through both qualitative and quantitative analyses.
comment: Accepted to ACL2024 findings
☆ ReaLHF: Optimized RLHF Training for Large Language Models through Parameter Reallocation
Reinforcement Learning from Human Feedback (RLHF) stands as a pivotal technique in empowering large language model (LLM) applications. Since RLHF involves diverse computational workloads and intricate dependencies among multiple LLMs, directly adopting parallelization techniques from supervised training can result in sub-optimal performance. To overcome this limitation, we propose a novel approach named parameter ReaLlocation, which dynamically redistributes LLM parameters in the cluster and adapts parallelization strategies during training. Building upon this idea, we introduce ReaLHF, a pioneering system capable of automatically discovering and running efficient execution plans for RLHF training given the desired algorithmic and hardware configurations. ReaLHF formulates the execution plan for RLHF as an augmented dataflow graph. Based on this formulation, ReaLHF employs a tailored search algorithm with a lightweight cost estimator to discover an efficient execution plan. Subsequently, the runtime engine deploys the selected plan by effectively parallelizing computations and redistributing parameters. We evaluate ReaLHF on the LLaMA-2 models with up to $4\times70$ billion parameters and 128 GPUs. The experiment results showcase ReaLHF's substantial speedups of $2.0-10.6\times$ compared to baselines. Furthermore, the execution plans generated by ReaLHF exhibit an average of $26\%$ performance improvement over heuristic approaches based on Megatron-LM. The source code of ReaLHF is publicly available at https://github.com/openpsi-project/ReaLHF .
comment: 13 pages (15 pages with references), 13 figures
☆ EXCEEDS: Extracting Complex Events as Connecting the Dots to Graphs in Scientific Domain
It is crucial to utilize events to understand a specific domain. There are lots of research on event extraction in many domains such as news, finance and biology domain. However, scientific domain still lacks event extraction research, including comprehensive datasets and corresponding methods. Compared to other domains, scientific domain presents two characteristics: denser nuggets and more complex events. To solve the above problem, considering these two characteristics, we first construct SciEvents, a large-scale multi-event document-level dataset with a schema tailored for scientific domain. It has 2,508 documents and 24,381 events under refined annotation and quality control. Then, we propose EXCEEDS, a novel end-to-end scientific event extraction framework by storing dense nuggets in a grid matrix and simplifying complex event extraction into a dot construction and connection task. Experimental results demonstrate state-of-the-art performances of EXCEEDS on SciEvents. Additionally, we release SciEvents and EXCEEDS on GitHub.
comment: This paper is working in process
☆ How Many Parameters Does it Take to Change a Light Bulb? Evaluating Performance in Self-Play of Conversational Games as a Function of Model Characteristics
What makes a good Large Language Model (LLM)? That it performs well on the relevant benchmarks -- which hopefully measure, with some validity, the presence of capabilities that are also challenged in real application. But what makes the model perform well? What gives a model its abilities? We take a recently introduced type of benchmark that is meant to challenge capabilities in a goal-directed, agentive context through self-play of conversational games, and analyse how performance develops as a function of model characteristics like number of parameters, or type of training. We find that while there is a clear relationship between number of parameters and performance, there is still a wide spread of performance points within a given size bracket, which is to be accounted for by training parameters such as fine-tuning data quality and method. From a more practical angle, we also find a certain degree of unpredictability about performance across access methods, possible due to unexposed sampling parameters, and a, very welcome, performance stability against at least moderate weight quantisation during inference.
comment: under review
Prompt Injection Attacks in Defended Systems
Large language models play a crucial role in modern natural language processing technologies. However, their extensive use also introduces potential security risks, such as the possibility of black-box attacks. These attacks can embed hidden malicious features into the model, leading to adverse consequences during its deployment. This paper investigates methods for black-box attacks on large language models with a three-tiered defense mechanism. It analyzes the challenges and significance of these attacks, highlighting their potential implications for language processing system security. Existing attack and defense methods are examined, evaluating their effectiveness and applicability across various scenarios. Special attention is given to the detection algorithm for black-box attacks, identifying hazardous vulnerabilities in language models and retrieving sensitive information. This research presents a methodology for vulnerability detection and the development of defensive strategies against black-box attacks on large language models.
☆ Taxonomy-Guided Zero-Shot Recommendations with LLMs
With the emergence of large language models (LLMs) and their ability to perform a variety of tasks, their application in recommender systems (RecSys) has shown promise. However, we are facing significant challenges when deploying LLMs into RecSys, such as limited prompt length, unstructured item information, and un-constrained generation of recommendations, leading to sub-optimal performance. To address these issues, we propose a novel method using a taxonomy dictionary. This method provides a systematic framework for categorizing and organizing items, improving the clarity and structure of item information. By incorporating the taxonomy dictionary into LLM prompts, we achieve efficient token utilization and controlled feature generation, leading to more accurate and contextually relevant recommendations. Our Taxonomy-guided Recommendation (TaxRec) approach features a two-step process: one-time taxonomy categorization and LLM-based recommendation, enabling zero-shot recommendations without the need for domain-specific fine-tuning. Experimental results demonstrate TaxRec significantly enhances recommendation quality compared to traditional zero-shot approaches, showcasing its efficacy as personal recommender with LLMs. Code is available at https://github.com/yueqingliang1/TaxRec.
☆ CryptoGPT: a 7B model rivaling GPT-4 in the task of analyzing and classifying real-time financial news
CryptoGPT: a 7B model competing with GPT-4 in a specific task -- The Impact of Automatic Annotation and Strategic Fine-Tuning via QLoRAIn this article, we present a method aimed at refining a dedicated LLM of reasonable quality with limited resources in an industrial setting via CryptoGPT. It is an LLM designed for financial news analysis for the cryptocurrency market in real-time. This project was launched in an industrial context. This model allows not only for the classification of financial information but also for providing comprehensive analysis. We refined different LLMs of the same size such as Mistral-7B and LLama-7B using semi-automatic annotation and compared them with various LLMs such as GPT-3.5 and GPT-4. Our goal is to find a balance among several needs: 1. Protecting data (by avoiding their transfer to external servers), 2. Limiting annotation cost and time, 3. Controlling the model's size (to manage deployment costs), and 4. Maintaining better analysis quality.
comment: Journ{\'e}e Nationale sur la Fouille de Textes, Pascal CUXAC; Adrien GUILLE; C{\'e}dric LOPEZ, Jun 2024, Lyon (Universit{\'e} Lumi{\`e}re Lyon 2), France
☆ Toward Infinite-Long Prefix in Transformer
Prompting and contextual-based fine-tuning methods, which we call Prefix Learning, have been proposed to enhance the performance of language models on various downstream tasks that can match full parameter fine-tuning. There remains a limited theoretical understanding of how these methods work. In this paper, we aim to relieve this limitation by studying the learning ability of Prefix Learning from the perspective of prefix length. In particular, we approximate the infinite-long Prefix Learning optimization process by the Neural Tangent Kernel (NTK) technique. We formulate and solve it as a learning problem of the infinite-long prefix in a one-layer attention network. Our results confirm the over-parameterization property and arbitrary small loss convergence guarantee of the infinite-long Prefix Learning in attention. To the implementation end, we propose our NTK-Attention method, which is "equivalent" to attention computation with arbitrary prefix length efficiently. Its time complexity mainly depends on the sub-quadratic of input length (without prefix), and our method only requires $d^2 + d$ extra parameters for representation, where $d$ is the feature dimension. In addition, we conducted experiments that compare our NTK-Attention with full parameters fine-tuning, LoRA, and P-Tuning V2 methods across vision or natural language datasets. The results indicate our approach may be a promising parameter-efficient-fine-tuning method since it has demonstrated superior performance in numerous scenarios. Our code can be found at \url{https://github.com/ChristianYang37/chiwun/tree/main/src/NTK-Attention}.
☆ Two Giraffes in a Dirt Field: Using Game Play to Investigate Situation Modelling in Large Multimodal Models
While the situation has improved for text-only models, it again seems to be the case currently that multimodal (text and image) models develop faster than ways to evaluate them. In this paper, we bring a recently developed evaluation paradigm from text models to multimodal models, namely evaluation through the goal-oriented game (self) play, complementing reference-based and preference-based evaluation. Specifically, we define games that challenge a model's capability to represent a situation from visual information and align such representations through dialogue. We find that the largest closed models perform rather well on the games that we define, while even the best open-weight models struggle with them. On further analysis, we find that the exceptional deep captioning capabilities of the largest models drive some of the performance. There is still room to grow for both kinds of models, ensuring the continued relevance of the benchmark.
comment: under review
☆ Demystifying Forgetting in Language Model Fine-Tuning with Statistical Analysis of Example Associations
Language models (LMs) are known to suffer from forgetting of previously learned examples when fine-tuned, breaking stability of deployed LM systems. Despite efforts on mitigating forgetting, few have investigated whether, and how forgotten upstream examples are associated with newly learned tasks. Insights on such associations enable efficient and targeted mitigation of forgetting. In this paper, we empirically analyze forgetting that occurs in $N$ upstream examples while the model learns $M$ new tasks and visualize their associations with a $M \times N$ matrix. We empirically demonstrate that the degree of forgetting can often be approximated by simple multiplicative contributions of the upstream examples and newly learned tasks. We also reveal more complicated patterns where specific subsets of examples are forgotten with statistics and visualization. Following our analysis, we predict forgetting that happens on upstream examples when learning a new task with matrix completion over the empirical associations, outperforming prior approaches that rely on trainable LMs. Project website: https://inklab.usc.edu/lm-forgetting-prediction/
comment: 5 pages
☆ The Reason behind Good or Bad: Towards a Better Mathematical Verifier with Natural Language Feedback
Mathematical verfier achieves success in mathematical reasoning tasks by validating the correctness of solutions. However, existing verifiers are trained with binary classification labels, which are not informative enough for the model to accurately assess the solutions. To mitigate the aforementioned insufficiency of binary labels, we introduce step-wise natural language feedbacks as rationale labels (i.e., the correctness of the current step and the explanations). In this paper, we propose \textbf{Math-Minos}, a natural language feedback enhanced verifier by constructing automatically-generated training data and a two-stage training paradigm for effective training and efficient inference. Our experiments reveal that a small set (30k) of natural language feedbacks can significantly boost the performance of the verifier by the accuracy of 1.6\% (86.6\% $\rightarrow$ 88.2\%) on GSM8K and 0.8\% (37.8\% $\rightarrow$ 38.6\%) on MATH. We will release the code, data and model for reproduction soon.
comment: 9 pages
☆ Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective
As Large Language Models (LLMs) become an important way of information seeking, there have been increasing concerns about the unethical content LLMs may generate. In this paper, we conduct a rigorous evaluation of LLMs' implicit bias towards certain groups by attacking them with carefully crafted instructions to elicit biased responses. Our attack methodology is inspired by psychometric principles in cognitive and social psychology. We propose three attack approaches, i.e., Disguise, Deception, and Teaching, based on which we built evaluation datasets for four common bias types. Each prompt attack has bilingual versions. Extensive evaluation of representative LLMs shows that 1) all three attack methods work effectively, especially the Deception attacks; 2) GLM-3 performs the best in defending our attacks, compared to GPT-3.5 and GPT-4; 3) LLMs could output content of other bias types when being taught with one type of bias. Our methodology provides a rigorous and effective way of evaluating LLMs' implicit bias and will benefit the assessments of LLMs' potential ethical risks.
comment: Code and datasets are available at https://github.com/wen112358/ImplicitBiasPsychometricEvaluation
☆ Investigating the Pre-Training Dynamics of In-Context Learning: Task Recognition vs. Task Learning
The emergence of in-context learning (ICL) is potentially attributed to two major abilities: task recognition (TR) for recognizing the task from demonstrations and utilizing pre-trained priors, and task learning (TL) for learning from demonstrations. However, relationships between the two abilities and how such relationships affect the emergence of ICL is unclear. In this paper, we take the first step by examining the pre-training dynamics of the emergence of ICL. With carefully designed metrics, we find that these two abilities are, in fact, competitive during pre-training. Moreover, we observe a strong negative correlation between the competition and ICL performance. Further analysis of common pre-training factors (i.e., model size, dataset size, and data curriculum) demonstrates possible ways to manage the competition. Based on these insights, we propose a simple yet effective method to better integrate these two abilities for ICL at inference time. Through adaptive ensemble learning, the performance of ICL can be significantly boosted, enabling two small models to outperform a larger one with more than twice the parameters. The code is available at https://github.com/RUCAIBox/Competitive-ICL.
comment: work in progress
☆ HIGHT: Hierarchical Graph Tokenization for Graph-Language Alignment
Recently there has been a surge of interest in extending the success of large language models (LLMs) to graph modality, such as social networks and molecules. As LLMs are predominantly trained with 1D text data, most existing approaches adopt a graph neural network to represent a graph as a series of node tokens and feed these tokens to LLMs for graph-language alignment. Despite achieving some successes, existing approaches have overlooked the hierarchical structures that are inherent in graph data. Especially, in molecular graphs, the high-order structural information contains rich semantics of molecular functional groups, which encode crucial biochemical functionalities of the molecules. We establish a simple benchmark showing that neglecting the hierarchical information in graph tokenization will lead to subpar graph-language alignment and severe hallucination in generated outputs. To address this problem, we propose a novel strategy called HIerarchical GrapH Tokenization (HIGHT). HIGHT employs a hierarchical graph tokenizer that extracts and encodes the hierarchy of node, motif, and graph levels of informative tokens to improve the graph perception of LLMs. HIGHT also adopts an augmented graph-language supervised fine-tuning dataset, enriched with the hierarchical graph information, to further enhance the graph-language alignment. Extensive experiments on 7 molecule-centric benchmarks confirm the effectiveness of HIGHT in reducing hallucination by 40%, as well as significant improvements in various molecule-language downstream tasks.
comment: Preliminary version of an ongoing project: https://higraphllm.github.io/
☆ Seeing Through AI's Lens: Enhancing Human Skepticism Towards LLM-Generated Fake News
LLMs offer valuable capabilities, yet they can be utilized by malicious users to disseminate deceptive information and generate fake news. The growing prevalence of LLMs poses difficulties in crafting detection approaches that remain effective across various text domains. Additionally, the absence of precautionary measures for AI-generated news on online social platforms is concerning. Therefore, there is an urgent need to improve people's ability to differentiate between news articles written by humans and those produced by LLMs. By providing cues in human-written and LLM-generated news, we can help individuals increase their skepticism towards fake LLM-generated news. This paper aims to elucidate simple markers that help individuals distinguish between articles penned by humans and those created by LLMs. To achieve this, we initially collected a dataset comprising 39k news articles authored by humans or generated by four distinct LLMs with varying degrees of fake. We then devise a metric named Entropy-Shift Authorship Signature (ESAS) based on the information theory and entropy principles. The proposed ESAS ranks terms or entities, like POS tagging, within news articles based on their relevance in discerning article authorship. We demonstrate the effectiveness of our metric by showing the high accuracy attained by a basic method, i.e., TF-IDF combined with logistic regression classifier, using a small set of terms with the highest ESAS score. Consequently, we introduce and scrutinize these top ESAS-ranked terms to aid individuals in strengthening their skepticism towards LLM-generated fake news.
☆ Information Guided Regularization for Fine-tuning Language Models
The pretraining-fine-tuning paradigm has been the de facto strategy for transfer learning in modern language modeling. With the understanding that task adaptation in LMs is often a function of parameters shared across tasks, we argue that a more surgical approach to regularization needs to exist for smoother transfer learning. Towards this end, we investigate how the pretraining loss landscape is affected by these task-sensitive parameters through an information-theoretic lens. We then leverage the findings from our investigations to devise a novel approach to dropout for improved model regularization and better downstream generalization. This approach, named guided dropout, is both task & architecture agnostic and adds no computational overhead to the fine-tuning process. Through empirical evaluations, we showcase that our approach to regularization yields consistently better performance, even in scenarios of data paucity, compared to standardized baselines.
☆ "Global is Good, Local is Bad?": Understanding Brand Bias in LLMs
Many recent studies have investigated social biases in LLMs but brand bias has received little attention. This research examines the biases exhibited by LLMs towards different brands, a significant concern given the widespread use of LLMs in affected use cases such as product recommendation and market analysis. Biased models may perpetuate societal inequalities, unfairly favoring established global brands while marginalizing local ones. Using a curated dataset across four brand categories, we probe the behavior of LLMs in this space. We find a consistent pattern of bias in this space -- both in terms of disproportionately associating global brands with positive attributes and disproportionately recommending luxury gifts for individuals in high-income countries. We also find LLMs are subject to country-of-origin effects which may boost local brand preference in LLM outputs in specific contexts.
☆ Exploring Changes in Nation Perception with Nationality-Assigned Personas in LLMs
Persona assignment has become a common strategy for customizing LLM use to particular tasks and contexts. In this study, we explore how perceptions of different nations change when LLMs are assigned specific nationality personas. We assign 193 different nationality personas (e.g., an American person) to four LLMs and examine how the LLM perceptions of countries change. We find that all LLM-persona combinations tend to favor Western European nations, though nation-personas push LLM behaviors to focus more on and view more favorably the nation-persona's own region. Eastern European, Latin American, and African nations are viewed more negatively by different nationality personas. Our study provides insight into how biases and stereotypes are realized within LLMs when adopting different national personas. In line with the "Blueprint for an AI Bill of Rights", our findings underscore the critical need for developing mechanisms to ensure LLMs uphold fairness and not over-generalize at a global scale.
☆ Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation
The training process of large language models (LLMs) often involves varying degrees of test data contamination. Although current LLMs are achieving increasingly better performance on various benchmarks, their performance in practical applications does not always match their benchmark results. Leakage of benchmarks can prevent the accurate assessment of LLMs' true performance. However, constructing new benchmarks is costly, labor-intensive and still carries the risk of leakage. Therefore, in this paper, we ask the question, Can we reuse these leaked benchmarks for LLM evaluation? We propose Inference-Time Decontamination (ITD) to address this issue by detecting and rewriting leaked samples without altering their difficulties. ITD can mitigate performance inflation caused by memorizing leaked benchmarks. Our proof-of-concept experiments demonstrate that ITD reduces inflated accuracy by 22.9% on GSM8K and 19.0% on MMLU. On MMLU, using Inference-time Decontamination can lead to a decrease in the results of Phi3 and Mistral by 6.7% and 3.6% respectively. We hope that ITD can provide more truthful evaluation results for large language models.
☆ MR-BEN: A Comprehensive Meta-Reasoning Benchmark for Large Language Models
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making, largely based on the step-by-step chain-of-thought reasoning processes. However, it has been increasingly challenging to evaluate the reasoning capability of LLMs. Concretely, existing outcome-based benchmarks begin to saturate and become less sufficient to monitor the progress. To this end, we present a process-based benchmark MR-BEN that demands a meta reasoning skill, where LMs are asked to locate and analyse potential errors in automatically generated reasoning steps. MR-BEN is a comprehensive benchmark comprising 5,975 questions collected from human experts, covering various subjects such as physics, chemistry, logic, coding, and more. Through our designed metrics for assessing meta-reasoning on this benchmark, we identify interesting limitations and weaknesses of current LLMs (open-source and closed-source models). For example, open-source models are seemingly comparable to GPT-4 on outcome-based benchmarks, but they lag far behind on our benchmark, revealing the underlying reasoning capability gap between them. Our dataset and codes are available on https://randolph-zeng.github.io/Mr-Ben.github.io/.
☆ Evolving to be Your Soulmate: Personalized Dialogue Agents with Dynamically Adapted Personas
Previous research on persona-based dialogue agents typically preset the agent's persona before deployment, which remains static thereafter. In this paper, we take a step further and explore a new paradigm called Self-evolving Personalized Dialogue Agents (SPDA), where the agent continuously evolves during the conversation to better align with the user's anticipation by dynamically adapting its persona. This paradigm could enable better personalization for each user, but also introduce unique challenges, which mainly lie in the process of persona adaptation. Two key issues include how to achieve persona alignment with the user and how to ensure smooth transition in the adaptation process. To address them, we propose a novel framework that refines the persona at hierarchical levels to progressively align better with the user in a controllable way. Experiments show that integrating the personas adapted by our framework consistently enhances personalization and overall dialogue performance across various base systems.
comment: Work in progress
☆ CityGPT: Empowering Urban Spatial Cognition of Large Language Models
Large language models(LLMs) with powerful language generation and reasoning capabilities have already achieved success in many domains, e.g., math and code generation. However, due to the lacking of physical world's corpus and knowledge during training, they usually fail to solve many real-life tasks in the urban space. In this paper, we propose CityGPT, a systematic framework for enhancing the capability of LLMs on understanding urban space and solving the related urban tasks by building a city-scale world model in the model. First, we construct a diverse instruction tuning dataset CityInstruction for injecting urban knowledge and enhancing spatial reasoning capability effectively. By using a mixture of CityInstruction and general instruction data, we fine-tune various LLMs (e.g., ChatGLM3-6B, Qwen1.5 and LLama3 series) to enhance their capability without sacrificing general abilities. To further validate the effectiveness of proposed methods, we construct a comprehensive benchmark CityEval to evaluate the capability of LLMs on diverse urban scenarios and problems. Extensive evaluation results demonstrate that small LLMs trained with CityInstruction can achieve competitive performance with commercial LLMs in the comprehensive evaluation of CityEval. The source codes are openly accessible to the research community via https://github.com/tsinghua-fib-lab/CityGPT.
☆ AspirinSum: an Aspect-based utility-preserved de-identification Summarization framework
Due to the rapid advancement of Large Language Model (LLM), the whole community eagerly consumes any available text data in order to train the LLM. Currently, large portion of the available text data are collected from internet, which has been thought as a cheap source of the training data. However, when people try to extend the LLM's capability to the personal related domain, such as healthcare or education, the lack of public dataset in these domains make the adaption of the LLM in such domains much slower. The reason of lacking public available dataset in such domains is because they usually contain personal sensitive information. In order to comply with privacy law, the data in such domains need to be de-identified before any kind of dissemination. It had been much research tried to address this problem for the image or tabular data. However, there was limited research on the efficient and general de-identification method for text data. Most of the method based on human annotation or predefined category list. It usually can not be easily adapted to specific domains. The goal of this proposal is to develop a text de-identification framework, which can be easily adapted to the specific domain, leverage the existing expert knowledge without further human annotation. We propose an aspect-based utility-preserved de-identification summarization framework, AspirinSum, by learning to align expert's aspect from existing comment data, it can efficiently summarize the personal sensitive document by extracting personal sensitive aspect related sub-sentence and de-identify it by substituting it with similar aspect sub-sentence. We envision that the de-identified text can then be used in data publishing, eventually publishing our de-identified dataset for downstream task use.
☆ CityBench: Evaluating the Capabilities of Large Language Model as World Model
Large language models (LLMs) with powerful generalization ability has been widely used in many domains. A systematic and reliable evaluation of LLMs is a crucial step in their development and applications, especially for specific professional fields. In the urban domain, there have been some early explorations about the usability of LLMs, but a systematic and scalable evaluation benchmark is still lacking. The challenge in constructing a systematic evaluation benchmark for the urban domain lies in the diversity of data and scenarios, as well as the complex and dynamic nature of cities. In this paper, we propose CityBench, an interactive simulator based evaluation platform, as the first systematic evaluation benchmark for the capability of LLMs for urban domain. First, we build CitySim to integrate the multi-source data and simulate fine-grained urban dynamics. Based on CitySim, we design 7 tasks in 2 categories of perception-understanding and decision-making group to evaluate the capability of LLMs as city-scale world model for urban domain. Due to the flexibility and ease-of-use of CitySim, our evaluation platform CityBench can be easily extended to any city in the world. We evaluate 13 well-known LLMs including open source LLMs and commercial LLMs in 13 cities around the world. Extensive experiments demonstrate the scalability and effectiveness of proposed CityBench and shed lights for the future development of LLMs in urban domain. The dataset, benchmark and source codes are openly accessible to the research community via https://github.com/tsinghua-fib-lab/CityBench
☆ AutoCAP: Towards Automatic Cross-lingual Alignment Planning for Zero-shot Chain-of-Thought ACL2024
Cross-lingual chain-of-thought can effectively complete reasoning tasks across languages, which gains increasing attention. Recently, dominant approaches in the literature improve cross-lingual alignment capabilities by integrating reasoning knowledge from different languages. Despite achieving excellent performance, current methods still have two main challenges: (1) Manual language specification: They still highly rely on manually selecting the languages to integrate, severely affecting their generalizability; (2) Static weight allocation: Current methods simply integrate all languages equally. In fact, different language reasoning paths should have different weights to achieve better complementation and integration. Motivated by this, we introduce an Automatic Cross-lingual Alignment Planning (AutoCAP) for zero-shot chain-of-thought to address the above challenges. The core of AutoCAP consists of two components: (1) Automatic Language Selection Prompting to guide LLMs to select appropriate languages and (2) Automatic Weight Allocation Prompting to automatically allocate alignment weight scores to each reasoning path. Extensive experiments on several benchmarks reveal that AutoCAP achieves state-of-the-art performance, surpassing previous methods that required manual effort.
comment: Accepted by ACL2024 Findings
☆ Reasoning Like a Doctor: Improving Medical Dialogue Systems via Diagnostic Reasoning Process Alignment ACL 2024
Medical dialogue systems have attracted significant attention for their potential to act as medical assistants. Enabling these medical systems to emulate clinicians' diagnostic reasoning process has been the long-standing research focus. Previous studies rudimentarily realized the simulation of clinicians' diagnostic process by fine-tuning language models on high-quality dialogue datasets. Nonetheless, they overly focus on the outcomes of the clinician's reasoning process while ignoring their internal thought processes and alignment with clinician preferences. Our work aims to build a medical dialogue system that aligns with clinicians' diagnostic reasoning processes. We propose a novel framework, Emulation, designed to generate an appropriate response that relies on abductive and deductive diagnostic reasoning analyses and aligns with clinician preferences through thought process modeling. Experimental results on two datasets confirm the efficacy of Emulation. Crucially, our framework furnishes clear explanations for the generated responses, enhancing its transparency in medical consultations.
comment: Accepted by ACL 2024 Findings
☆ Large Language Models are Skeptics: False Negative Problem of Input-conflicting Hallucination
In this paper, we identify a new category of bias that induces input-conflicting hallucinations, where large language models (LLMs) generate responses inconsistent with the content of the input context. This issue we have termed the false negative problem refers to the phenomenon where LLMs are predisposed to return negative judgments when assessing the correctness of a statement given the context. In experiments involving pairs of statements that contain the same information but have contradictory factual directions, we observe that LLMs exhibit a bias toward false negatives. Specifically, the model presents greater overconfidence when responding with False. Furthermore, we analyze the relationship between the false negative problem and context and query rewriting and observe that both effectively tackle false negatives in LLMs.
comment: 12 pages, 9 figures
☆ GenderAlign: An Alignment Dataset for Mitigating Gender Bias in Large Language Models
Large Language Models (LLMs) are prone to generating content that exhibits gender biases, raising significant ethical concerns. Alignment, the process of fine-tuning LLMs to better align with desired behaviors, is recognized as an effective approach to mitigate gender biases. Although proprietary LLMs have made significant strides in mitigating gender bias, their alignment datasets are not publicly available. The commonly used and publicly available alignment dataset, HH-RLHF, still exhibits gender bias to some extent. There is a lack of publicly available alignment datasets specifically designed to address gender bias. Hence, we developed a new dataset named GenderAlign, aiming at mitigating a comprehensive set of gender biases in LLMs. This dataset comprises 8k single-turn dialogues, each paired with a "chosen" and a "rejected" response. Compared to the "rejected" responses, the "chosen" responses demonstrate lower levels of gender bias and higher quality. Furthermore, we categorized the gender biases in the "rejected" responses of GenderAlign into 4 principal categories. The experimental results show the effectiveness of GenderAlign in reducing gender bias in LLMs.
☆ PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents
Recent advancements in Large Multimodal Models (LMMs) have leveraged extensive multimodal datasets to enhance capabilities in complex knowledge-driven tasks. However, persistent challenges in perceptual and reasoning errors limit their efficacy, particularly in interpreting intricate visual data and deducing multimodal relationships. Addressing these issues, we introduce a novel dataset format, PIN (Paired and INterleaved multimodal documents), designed to significantly improve both the depth and breadth of multimodal training. The PIN format is built on three foundational principles: knowledge intensity, scalability, and support for diverse training modalities. This innovative format combines markdown files and comprehensive images to enrich training data with a dense knowledge structure and versatile training strategies. We present PIN-14M, an open-source dataset comprising 14 million samples derived from a diverse range of Chinese and English sources, tailored to include complex web and scientific content. This dataset is constructed meticulously to ensure data quality and ethical integrity, aiming to facilitate advanced training strategies and improve model robustness against common multimodal training pitfalls. Our initial results, forming the basis of this technical report, suggest significant potential for the PIN format in refining LMM performance, with plans for future expansions and detailed evaluations of its impact on model capabilities.
☆ Persuasiveness of Generated Free-Text Rationales in Subjective Decisions: A Case Study on Pairwise Argument Ranking
Generating free-text rationales is among the emergent capabilities of Large Language Models (LLMs). These rationales have been found to enhance LLM performance across various NLP tasks. Recently, there has been growing interest in using these rationales to provide insights for various important downstream tasks. In this paper, we analyze generated free-text rationales in tasks with subjective answers, emphasizing the importance of rationalization in such scenarios. We focus on pairwise argument ranking, a highly subjective task with significant potential for real-world applications, such as debate assistance. We evaluate the persuasiveness of rationales generated by nine LLMs to support their subjective choices. Our findings suggest that open-source LLMs, particularly Llama2-70B-chat, are capable of providing highly persuasive rationalizations, surpassing even GPT models. Additionally, our experiments show that rationale persuasiveness can be improved by controlling its parameters through prompting or through self-refinement.
☆ Generative AI for Enhancing Active Learning in Education: A Comparative Study of GPT-3.5 and GPT-4 in Crafting Customized Test Questions
This study investigates how LLMs, specifically GPT-3.5 and GPT-4, can develop tailored questions for Grade 9 math, aligning with active learning principles. By utilizing an iterative method, these models adjust questions based on difficulty and content, responding to feedback from a simulated 'student' model. A novel aspect of the research involved using GPT-4 as a 'teacher' to create complex questions, with GPT-3.5 as the 'student' responding to these challenges. This setup mirrors active learning, promoting deeper engagement. The findings demonstrate GPT-4's superior ability to generate precise, challenging questions and notable improvements in GPT-3.5's ability to handle more complex problems after receiving instruction from GPT-4. These results underscore the potential of LLMs to mimic and enhance active learning scenarios, offering a promising path for AI in customized education. This research contributes to understanding how AI can support personalized learning experiences, highlighting the need for further exploration in various educational contexts
comment: Publisher: Canadian Artificial Intelligence Association. URL: https://caiac.pubpub.org/pub/kmn55wd2#nssvokovikx
☆ The Use of Multimodal Large Language Models to Detect Objects from Thermal Images: Transportation Applications
The integration of thermal imaging data with Multimodal Large Language Models (MLLMs) constitutes an exciting opportunity for improving the safety and functionality of autonomous driving systems and many Intelligent Transportation Systems (ITS) applications. This study investigates whether MLLMs can understand complex images from RGB and thermal cameras and detect objects directly. Our goals were to 1) assess the ability of the MLLM to learn from information from various sets, 2) detect objects and identify elements in thermal cameras, 3) determine whether two independent modality images show the same scene, and 4) learn all objects using different modalities. The findings showed that both GPT-4 and Gemini were effective in detecting and classifying objects in thermal images. Similarly, the Mean Absolute Percentage Error (MAPE) for pedestrian classification was 70.39% and 81.48%, respectively. Moreover, the MAPE for bike, car, and motorcycle detection were 78.4%, 55.81%, and 96.15%, respectively. Gemini produced MAPE of 66.53%, 59.35% and 78.18% respectively. This finding further demonstrates that MLLM can identify thermal images and can be employed in advanced imaging automation technologies for ITS applications.
☆ Understanding Finetuning for Factual Knowledge Extraction ICML 2024
In this work, we study the impact of QA fine-tuning data on downstream factuality. We show that fine-tuning on lesser-known facts that are poorly stored during pretraining yields significantly worse factuality than fine-tuning on well-known facts, even when all facts are seen during pretraining. We prove this phenomenon theoretically, showing that training on lesser-known facts can lead the model to ignore subject entity names and instead output a generic plausible response even when the relevant factual knowledge is encoded in the model. On three question answering benchmarks (PopQA, Entity Questions, and MMLU) and two language models (Llama-2-7B and Mistral-7B), we find that (i) finetuning on a completely factual but lesser-known subset of the data deteriorates downstream factuality (5-10%) and (ii) finetuning on a subset of better-known examples matches or outperforms finetuning on the entire dataset. Ultimately, our results shed light on the interaction between pretrained knowledge and finetuning data and demonstrate the importance of taking into account how facts are stored in the pretrained model when fine-tuning for knowledge-intensive tasks.
comment: To appear in ICML 2024
☆ Evaluating RAG-Fusion with RAGElo: an Automated Elo-based Framework SIGIR24
Challenges in the automated evaluation of Retrieval-Augmented Generation (RAG) Question-Answering (QA) systems include hallucination problems in domain-specific knowledge and the lack of gold standard benchmarks for company internal tasks. This results in difficulties in evaluating RAG variations, like RAG-Fusion (RAGF), in the context of a product QA task at Infineon Technologies. To solve these problems, we propose a comprehensive evaluation framework, which leverages Large Language Models (LLMs) to generate large datasets of synthetic queries based on real user queries and in-domain documents, uses LLM-as-a-judge to rate retrieved documents and answers, evaluates the quality of answers, and ranks different variants of Retrieval-Augmented Generation (RAG) agents with RAGElo's automated Elo-based competition. LLM-as-a-judge rating of a random sample of synthetic queries shows a moderate, positive correlation with domain expert scoring in relevance, accuracy, completeness, and precision. While RAGF outperformed RAG in Elo score, a significance analysis against expert annotations also shows that RAGF significantly outperforms RAG in completeness, but underperforms in precision. In addition, Infineon's RAGF assistant demonstrated slightly higher performance in document relevance based on MRR@5 scores. We find that RAGElo positively aligns with the preferences of human annotators, though due caution is still required. Finally, RAGF's approach leads to more complete answers based on expert annotations and better answers overall based on RAGElo's evaluation criteria.
comment: Accepted to LLM4Eval @ SIGIR24
☆ Evaluating Numerical Reasoning in Text-to-Image Models
Text-to-image generative models are capable of producing high-quality images that often faithfully depict concepts described using natural language. In this work, we comprehensively evaluate a range of text-to-image models on numerical reasoning tasks of varying difficulty, and show that even the most advanced models have only rudimentary numerical skills. Specifically, their ability to correctly generate an exact number of objects in an image is limited to small numbers, it is highly dependent on the context the number term appears in, and it deteriorates quickly with each successive number. We also demonstrate that models have poor understanding of linguistic quantifiers (such as "a few" or "as many as"), the concept of zero, and struggle with more advanced concepts such as partial quantities and fractional representations. We bundle prompts, generated images and human annotations into GeckoNum, a novel benchmark for evaluation of numerical reasoning.
☆ ChatGPT as Research Scientist: Probing GPT's Capabilities as a Research Librarian, Research Ethicist, Data Generator and Data Predictor
How good a research scientist is ChatGPT? We systematically probed the capabilities of GPT-3.5 and GPT-4 across four central components of the scientific process: as a Research Librarian, Research Ethicist, Data Generator, and Novel Data Predictor, using psychological science as a testing field. In Study 1 (Research Librarian), unlike human researchers, GPT-3.5 and GPT-4 hallucinated, authoritatively generating fictional references 36.0% and 5.4% of the time, respectively, although GPT-4 exhibited an evolving capacity to acknowledge its fictions. In Study 2 (Research Ethicist), GPT-4 (though not GPT-3.5) proved capable of detecting violations like p-hacking in fictional research protocols, correcting 88.6% of blatantly presented issues, and 72.6% of subtly presented issues. In Study 3 (Data Generator), both models consistently replicated patterns of cultural bias previously discovered in large language corpora, indicating that ChatGPT can simulate known results, an antecedent to usefulness for both data generation and skills like hypothesis generation. Contrastingly, in Study 4 (Novel Data Predictor), neither model was successful at predicting new results absent in their training data, and neither appeared to leverage substantially new information when predicting more versus less novel outcomes. Together, these results suggest that GPT is a flawed but rapidly improving librarian, a decent research ethicist already, capable of data generation in simple domains with known characteristics but poor at predicting novel patterns of empirical data to aid future experimentation.
comment: Main article is 14 pages, 1 table. Includes SI Appendix: 26 pages, 12 tables, 2 figures. Total: 40 pages, 13 tables, 2 figures. Under revised review at PNAS
☆ RE-AdaptIR: Improving Information Retrieval through Reverse Engineered Adaptation
Large language models (LLMs) fine-tuned for text-retrieval have demonstrated state-of-the-art results across several information retrieval (IR) benchmarks. However, supervised training for improving these models requires numerous labeled examples, which are generally unavailable or expensive to acquire. In this work, we explore the effectiveness of extending reverse engineered adaptation to the context of information retrieval (RE-AdaptIR). We use RE-AdaptIR to improve LLM-based IR models using only unlabeled data. We demonstrate improved performance both in training domains as well as zero-shot in domains where the models have seen no queries. We analyze performance changes in various fine-tuning scenarios and offer findings of immediate use to practitioners.
☆ A Learn-Then-Reason Model Towards Generalization in Knowledge Base Question Answering
Large-scale knowledge bases (KBs) like Freebase and Wikidata house millions of structured knowledge. Knowledge Base Question Answering (KBQA) provides a user-friendly way to access these valuable KBs via asking natural language questions. In order to improve the generalization capabilities of KBQA models, extensive research has embraced a retrieve-then-reason framework to retrieve relevant evidence for logical expression generation. These multi-stage efforts prioritize acquiring external sources but overlook the incorporation of new knowledge into their model parameters. In effect, even advanced language models and retrievers have knowledge boundaries, thereby limiting the generalization capabilities of previous KBQA models. Therefore, this paper develops KBLLaMA, which follows a learn-then-reason framework to inject new KB knowledge into a large language model for flexible end-to-end KBQA. At the core of KBLLaMA, we study (1) how to organize new knowledge about KBQA and (2) how to facilitate the learning of the organized knowledge. Extensive experiments on various KBQA generalization tasks showcase the state-of-the-art performance of KBLLaMA. Especially on the general benchmark GrailQA and domain-specific benchmark Bio-chemical, KBLLaMA respectively derives a performance gain of up to 3.8% and 9.8% compared to the baselines.
☆ An LLM Feature-based Framework for Dialogue Constructiveness Assessment
Research on dialogue constructiveness assessment focuses on (i) analysing conversational factors that influence individuals to take specific actions, win debates, change their perspectives or broaden their open-mindedness and (ii) predicting constructive outcomes following dialogues for such use cases. These objectives can be achieved by training either interpretable feature-based models (which often involve costly human annotations) or neural models such as pre-trained language models (which have empirically shown higher task accuracy but lack interpretability). We propose a novel LLM feature-based framework that combines the strengths of feature-based and neural approaches while mitigating their downsides, in assessing dialogue constructiveness. The framework first defines a set of dataset-independent and interpretable linguistic features, which can be extracted by both prompting an LLM and simple heuristics. Such features are then used to train LLM feature-based models. We apply this framework to three datasets of dialogue constructiveness and find that our LLM feature-based models significantly outperform standard feature-based models and neural models, and tend to learn more robust prediction rules instead of relying on superficial shortcuts (as seen with neural models). Further, we demonstrate that interpreting these LLM feature-based models can yield valuable insights into what makes a dialogue constructive.
☆ An Adapter-Based Unified Model for Multiple Spoken Language Processing Tasks ICASSP 2024
Self-supervised learning models have revolutionized the field of speech processing. However, the process of fine-tuning these models on downstream tasks requires substantial computational resources, particularly when dealing with multiple speech-processing tasks. In this paper, we explore the potential of adapter-based fine-tuning in developing a unified model capable of effectively handling multiple spoken language processing tasks. The tasks we investigate are Automatic Speech Recognition, Phoneme Recognition, Intent Classification, Slot Filling, and Spoken Emotion Recognition. We validate our approach through a series of experiments on the SUPERB benchmark, and our results indicate that adapter-based fine-tuning enables a single encoder-decoder model to perform multiple speech processing tasks with an average improvement of 18.4% across the five target tasks while staying efficient in terms of parameter updates.
comment: ICASSP 2024
☆ Relation Extraction with Fine-Tuned Large Language Models in Retrieval Augmented Generation Frameworks
Information Extraction (IE) is crucial for converting unstructured data into structured formats like Knowledge Graphs (KGs). A key task within IE is Relation Extraction (RE), which identifies relationships between entities in text. Various RE methods exist, including supervised, unsupervised, weakly supervised, and rule-based approaches. Recent studies leveraging pre-trained language models (PLMs) have shown significant success in this area. In the current era dominated by Large Language Models (LLMs), fine-tuning these models can overcome limitations associated with zero-shot LLM prompting-based RE methods, especially regarding domain adaptation challenges and identifying implicit relations between entities in sentences. These implicit relations, which cannot be easily extracted from a sentence's dependency tree, require logical inference for accurate identification. This work explores the performance of fine-tuned LLMs and their integration into the Retrieval Augmented-based (RAG) RE approach to address the challenges of identifying implicit relations at the sentence level, particularly when LLMs act as generators within the RAG framework. Empirical evaluations on the TACRED, TACRED-Revisited (TACREV), Re-TACRED, and SemEVAL datasets show significant performance improvements with fine-tuned LLMs, including Llama2-7B, Mistral-7B, and T5 (Large). Notably, our approach achieves substantial gains on SemEVAL, where implicit relations are common, surpassing previous results on this dataset. Additionally, our method outperforms previous works on TACRED, TACREV, and Re-TACRED, demonstrating exceptional performance across diverse evaluation scenarios.
comment: preprint
☆ Learning to Retrieve Iteratively for In-Context Learning
We introduce iterative retrieval, a novel framework that empowers retrievers to make iterative decisions through policy optimization. Finding an optimal portfolio of retrieved items is a combinatorial optimization problem, generally considered NP-hard. This approach provides a learned approximation to such a solution, meeting specific task requirements under a given family of large language models (LLMs). We propose a training procedure based on reinforcement learning, incorporating feedback from LLMs. We instantiate an iterative retriever for composing in-context learning (ICL) exemplars and apply it to various semantic parsing tasks that demand synthesized programs as outputs. By adding only 4M additional parameters for state encoding, we convert an off-the-shelf dense retriever into a stateful iterative retriever, outperforming previous methods in selecting ICL exemplars on semantic parsing datasets such as CalFlow, TreeDST, and MTOP. Additionally, the trained iterative retriever generalizes across different inference LLMs beyond the one used during training.
☆ Dissecting the Ullman Variations with a SCALPEL: Why do LLMs fail at Trivial Alterations to the False Belief Task?
Recent empirical results have sparked a debate about whether or not Large Language Models (LLMs) are capable of Theory of Mind (ToM). While some have found LLMs to be successful on ToM evaluations such as the False Belief task (Kosinski, 2023), others have argued that LLMs solve these tasks by exploiting spurious correlations -- not representing beliefs -- since they fail on trivial alterations to these tasks (Ullman, 2023). In this paper, we introduce SCALPEL: a technique to generate targeted modifications for False Belief tasks to test different specific hypotheses about why LLMs fail. We find that modifications which make explicit common inferences -- such as that looking at a transparent object implies recognizing its contents -- preserve LLMs' performance. This suggests that LLMs' failures on modified ToM tasks could result from a lack of more general commonsense reasoning, rather than a failure to represent mental states. We argue that SCALPEL could be helpful for explaining LLM successes and failures in other cases.
☆ TTQA-RS- A break-down prompting approach for Multi-hop Table-Text Question Answering with Reasoning and Summarization
Question answering (QA) over tables and text has gained much popularity over the years. Multi-hop table-text QA requires multiple hops between the table and text, making it a challenging QA task. Although several works have attempted to solve the table-text QA task, most involve training the models and requiring labeled data. In this paper, we have proposed a model - TTQA-RS: A break-down prompting approach for Multi-hop Table-Text Question Answering with Reasoning and Summarization. Our model uses augmented knowledge including table-text summary with decomposed sub-question with answer for a reasoning-based table-text QA. Using open-source language models our model outperformed all existing prompting methods for table-text QA tasks on existing table-text QA datasets like HybridQA and OTT-QA's development set. Our results are comparable with the training-based state-of-the-art models, demonstrating the potential of prompt-based approaches using open-source LLMs. Additionally, by using GPT-4 with LLaMA3-70B, our model achieved state-of-the-art performance for prompting-based methods on multi-hop table-text QA.
☆ 1+1>2: Can Large Language Models Serve as Cross-Lingual Knowledge Aggregators?
Large Language Models (LLMs) have garnered significant attention due to their remarkable ability to process information across various languages. Despite their capabilities, they exhibit inconsistencies in handling identical queries in different languages, presenting challenges for further advancement. This paper introduces a method to enhance the multilingual performance of LLMs by aggregating knowledge from diverse languages. This approach incorporates a low-resource knowledge detector specific to a language, a language selection process, and mechanisms for answer replacement and integration. Our experiments demonstrate notable performance improvements, particularly in reducing language performance disparity. An ablation study confirms that each component of our method significantly contributes to these enhancements. This research highlights the inherent potential of LLMs to harmonize multilingual capabilities and offers valuable insights for further exploration.
☆ MultiAgent Collaboration Attack: Investigating Adversarial Attacks in Large Language Model Collaborations via Debate
Large Language Models (LLMs) have shown exceptional results on current benchmarks when working individually. The advancement in their capabilities, along with a reduction in parameter size and inference times, has facilitated the use of these models as agents, enabling interactions among multiple models to execute complex tasks. Such collaborations offer several advantages, including the use of specialized models (e.g. coding), improved confidence through multiple computations, and enhanced divergent thinking, leading to more diverse outputs. Thus, the collaborative use of language models is expected to grow significantly in the coming years. In this work, we evaluate the behavior of a network of models collaborating through debate under the influence of an adversary. We introduce pertinent metrics to assess the adversary's effectiveness, focusing on system accuracy and model agreement. Our findings highlight the importance of a model's persuasive ability in influencing others. Additionally, we explore inference-time methods to generate more compelling arguments and evaluate the potential of prompt-based mitigation as a defensive strategy.
☆ Factual Dialogue Summarization via Learning from Large Language Models
Factual consistency is an important quality in dialogue summarization. Large language model (LLM)-based automatic text summarization models generate more factually consistent summaries compared to those by smaller pretrained language models, but they face deployment challenges in real-world applications due to privacy or resource constraints. In this paper, we investigate the use of symbolic knowledge distillation to improve the factual consistency of smaller pretrained models for dialogue summarization. We employ zero-shot learning to extract symbolic knowledge from LLMs, generating both factually consistent (positive) and inconsistent (negative) summaries. We then apply two contrastive learning objectives on these summaries to enhance smaller summarization models. Experiments with BART, PEGASUS, and Flan-T5 indicate that our approach surpasses strong baselines that rely on complex data augmentation strategies. Our approach achieves better factual consistency while maintaining coherence, fluency, and relevance, as confirmed by various automatic evaluation metrics. We also provide access to the data and code to facilitate future research.
☆ Do LLMs Have Distinct and Consistent Personality? TRAIT: Personality Testset designed for LLMs with Psychometrics
The idea of personality in descriptive psychology, traditionally defined through observable behavior, has now been extended to Large Language Models (LLMs) to better understand their behavior. This raises a question: do LLMs exhibit distinct and consistent personality traits, similar to humans? Existing self-assessment personality tests, while applicable, lack the necessary validity and reliability for precise personality measurements. To address this, we introduce TRAIT, a new tool consisting of 8K multi-choice questions designed to assess the personality of LLMs with validity and reliability. TRAIT is built on the psychometrically validated human questionnaire, Big Five Inventory (BFI) and Short Dark Triad (SD-3), enhanced with the ATOMIC10X knowledge graph for testing personality in a variety of real scenarios. TRAIT overcomes the reliability and validity issues when measuring personality of LLM with self-assessment, showing the highest scores across three metrics: refusal rate, prompt sensitivity, and option order sensitivity. It reveals notable insights into personality of LLM: 1) LLMs exhibit distinct and consistent personality, which is highly influenced by their training data (i.e., data used for alignment tuning), and 2) current prompting techniques have limited effectiveness in eliciting certain traits, such as high psychopathy or low conscientiousness, suggesting the need for further research in this direction.
comment: Preprint; Under review
☆ Speech Prefix-Tuning with RNNT Loss for Improving LLM Predictions
In this paper, we focus on addressing the constraints faced when applying LLMs to ASR. Recent works utilize prefixLM-type models, which directly apply speech as a prefix to LLMs for ASR. We have found that optimizing speech prefixes leads to better ASR performance and propose applying RNNT loss to perform speech prefix-tuning. This is a simple approach and does not increase the model complexity or alter the inference pipeline. We also propose language-based soft prompting to further improve with frozen LLMs. Empirical analysis on realtime testset from 10 Indic languages demonstrate that our proposed speech prefix-tuning yields improvements with both frozen and fine-tuned LLMs. Our recognition results on an average of 10 Indics show that the proposed prefix-tuning with RNNT loss results in a 12\% relative improvement in WER over the baseline with a fine-tuned LLM. Our proposed approches with the frozen LLM leads to a 31\% relative improvement over basic soft-prompting prefixLM.
☆ Depth $F_1$: Improving Evaluation of Cross-Domain Text Classification by Measuring Semantic Generalizability
Recent evaluations of cross-domain text classification models aim to measure the ability of a model to obtain domain-invariant performance in a target domain given labeled samples in a source domain. The primary strategy for this evaluation relies on assumed differences between source domain samples and target domain samples in benchmark datasets. This evaluation strategy fails to account for the similarity between source and target domains, and may mask when models fail to transfer learning to specific target samples which are highly dissimilar from the source domain. We introduce Depth $F_1$, a novel cross-domain text classification performance metric. Designed to be complementary to existing classification metrics such as $F_1$, Depth $F_1$ measures how well a model performs on target samples which are dissimilar from the source domain. We motivate this metric using standard cross-domain text classification datasets and benchmark several recent cross-domain text classification models, with the goal of enabling in-depth evaluation of the semantic generalizability of cross-domain text classification models.
☆ A Contrastive Learning Approach to Mitigate Bias in Speech Models
Speech models may be affected by performance imbalance in different population subgroups, raising concerns about fair treatment across these groups. Prior attempts to mitigate unfairness either focus on user-defined subgroups, potentially overlooking other affected subgroups, or do not explicitly improve the internal representation at the subgroup level. This paper proposes the first adoption of contrastive learning to mitigate speech model bias in underperforming subgroups. We employ a three-level learning technique that guides the model in focusing on different scopes for the contrastive loss, i.e., task, subgroup, and the errors within subgroups. The experiments on two spoken language understanding datasets and two languages demonstrate that our approach improves internal subgroup representations, thus reducing model bias and enhancing performance.
comment: Accepted at Interspeech 2024
☆ TAGLAS: An atlas of text-attributed graph datasets in the era of large graph and language models
In this report, we present TAGLAS, an atlas of text-attributed graph (TAG) datasets and benchmarks. TAGs are graphs with node and edge features represented in text, which have recently gained wide applicability in training graph-language or graph foundation models. In TAGLAS, we collect and integrate more than 23 TAG datasets with domains ranging from citation graphs to molecule graphs and tasks from node classification to graph question-answering. Unlike previous graph datasets and benchmarks, all datasets in TAGLAS have a unified node and edge text feature format, which allows a graph model to be simultaneously trained and evaluated on multiple datasets from various domains. Further, we provide a standardized, efficient, and simplified way to load all datasets and tasks. We also provide useful utils like text-to-embedding conversion, and graph-to-text conversion, which can facilitate different evaluation scenarios. Finally, we also provide standard and easy-to-use evaluation utils. The project is open-sourced at https://github.com/JiaruiFeng/TAGLAS and is still under construction. Please expect more datasets/features in the future.
comment: Preprint
☆ Dravidian language family through Universal Dependencies lens
The Universal Dependencies (UD) project aims to create a cross-linguistically consistent dependency annotation for multiple languages, to facilitate multilingual NLP. It currently supports 114 languages. Dravidian languages are spoken by over 200 million people across the word, and yet there are only two languages from this family in UD. This paper examines some of the morphological and syntactic features of Dravidian languages and explores how they can be annotated in the UD framework.
comment: unpublished report from 2021
☆ Bidirectional Transformer Representations of (Spanish) Ambiguous Words in Context: A New Lexical Resource and Empirical Analysis EMNLP 2024
Lexical ambiguity -- where a single wordform takes on distinct, context-dependent meanings -- serves as a useful tool to compare across different large language models' (LLMs') ability to form distinct, contextualized representations of the same stimulus. Few studies have systematically compared LLMs' contextualized word embeddings for languages beyond English. Here, we evaluate multiple bidirectional transformers' (BERTs') semantic representations of Spanish ambiguous nouns in context. We develop a novel dataset of minimal-pair sentences evoking the same or different sense for a target ambiguous noun. In a pre-registered study, we collect contextualized human relatedness judgments for each sentence pair. We find that various BERT-based LLMs' contextualized semantic representations capture some variance in human judgments but fall short of the human benchmark, and for Spanish -- unlike English -- model scale is uncorrelated with performance. We also identify stereotyped trajectories of target noun disambiguation as a proportion of traversal through a given LLM family's architecture, which we partially replicate in English. We contribute (1) a dataset of controlled, Spanish sentence stimuli with human relatedness norms, and (2) to our evolving understanding of the impact that LLM specification (architectures, training protocols) exerts on contextualized embeddings.
comment: 16 pages, 12 figures, submitted to conference (EMNLP 2024)
☆ Insights into LLM Long-Context Failures: When Transformers Know but Don't Tell
Large Language Models (LLMs) exhibit positional bias, struggling to utilize information from the middle or end of long contexts. Our study explores LLMs' long-context reasoning by probing their hidden representations. We find that while LLMs encode the position of target information, they often fail to leverage this in generating accurate responses. This reveals a disconnect between information retrieval and utilization, a "know but don't tell" phenomenon. We further analyze the relationship between extraction time and final accuracy, offering insights into the underlying mechanics of transformer models.
☆ Exploring Design Choices for Building Language-Specific LLMs
Despite rapid progress in large language models (LLMs), their performance on a vast majority of languages remain unsatisfactory. In this paper, we study building language-specific LLMs by adapting monolingual and multilingual LLMs. We conduct systematic experiments on how design choices (base model selection, vocabulary extension, and continued fine-tuning) impact the adapted LLM, both in terms of efficiency (how many tokens are needed to encode the same amount of information) and end task performance. We find that (1) the initial performance before the adaptation is not always indicative of the final performance. (2) Efficiency can easily improved with simple vocabulary extension and continued fine-tuning in most LLMs we study, and (3) The optimal adaptation method is highly language-dependent, and the simplest approach works well across various experimental settings. Adapting English-centric models can yield better results than adapting multilingual models despite their worse initial performance on low-resource languages. Together, our work lays foundations on efficiently building language-specific LLMs by adapting existing LLMs.
comment: 15 pages, 6 figures, 11 tables
☆ Co-training for Low Resource Scientific Natural Language Inference ACL 2024
Scientific Natural Language Inference (NLI) is the task of predicting the semantic relation between a pair of sentences extracted from research articles. The automatic annotation method based on distant supervision for the training set of SciNLI (Sadat and Caragea, 2022b), the first and most popular dataset for this task, results in label noise which inevitably degenerates the performance of classifiers. In this paper, we propose a novel co-training method that assigns weights based on the training dynamics of the classifiers to the distantly supervised labels, reflective of the manner they are used in the subsequent training epochs. That is, unlike the existing semi-supervised learning (SSL) approaches, we consider the historical behavior of the classifiers to evaluate the quality of the automatically annotated labels. Furthermore, by assigning importance weights instead of filtering out examples based on an arbitrary threshold on the predicted confidence, we maximize the usage of automatically labeled data, while ensuring that the noisy labels have a minimal impact on model training. The proposed method obtains an improvement of 1.5% in Macro F1 over the distant supervision baseline, and substantial improvements over several other strong SSL baselines. We make our code and data available on Github.
comment: Accepted in ACL 2024 (main conference)
☆ OpenDebateEvidence: A Massive-Scale Argument Mining and Summarization Dataset ACL2024
We introduce OpenDebateEvidence, a comprehensive dataset for argument mining and summarization sourced from the American Competitive Debate community. This dataset includes over 3.5 million documents with rich metadata, making it one of the most extensive collections of debate evidence. OpenDebateEvidence captures the complexity of arguments in high school and college debates, providing valuable resources for training and evaluation. Our extensive experiments demonstrate the efficacy of fine-tuning state-of-the-art large language models for argumentative abstractive summarization across various methods, models, and datasets. By providing this comprehensive resource, we aim to advance computational argumentation and support practical applications for debaters, educators, and researchers. OpenDebateEvidence is publicly available to support further research and innovation in computational argumentation. Access it here: https://huggingface.co/datasets/Yusuf5/OpenCaselist
comment: Accepted for Publication to ARGMIN 2024 at ACL2024
☆ Major Entity Identification: A Generalizable Alternative to Coreference Resolution
The limited generalization of coreference resolution (CR) models has been a major bottleneck in the task's broad application. Prior work has identified annotation differences, especially for mention detection, as one of the main reasons for the generalization gap and proposed using additional annotated target domain data. Rather than relying on this additional annotation, we propose an alternative formulation of the CR task, Major Entity Identification (MEI), where we: (a) assume the target entities to be specified in the input, and (b) limit the task to only the frequent entities. Through extensive experiments, we demonstrate that MEI models generalize well across domains on multiple datasets with supervised models and LLM-based few-shot prompting. Additionally, the MEI task fits the classification framework, which enables the use of classification-based metrics that are more robust than the current CR metrics. Finally, MEI is also of practical use as it allows a user to search for all mentions of a particular entity or a group of entities of interest.
comment: 16 pages, 6 figures
☆ Unveiling the Spectrum of Data Contamination in Language Models: A Survey from Detection to Remediation ACL 2024
Data contamination has garnered increased attention in the era of large language models (LLMs) due to the reliance on extensive internet-derived training corpora. The issue of training corpus overlap with evaluation benchmarks--referred to as contamination--has been the focus of significant recent research. This body of work aims to identify contamination, understand its impacts, and explore mitigation strategies from diverse perspectives. However, comprehensive studies that provide a clear pathway from foundational concepts to advanced insights are lacking in this nascent field. Therefore, we present a comprehensive survey in the field of data contamination, laying out the key issues, methodologies, and findings to date, and highlighting areas in need of further research and development. In particular, we begin by examining the effects of data contamination across various stages and forms. We then provide a detailed analysis of current contamination detection methods, categorizing them to highlight their focus, assumptions, strengths, and limitations. We also discuss mitigation strategies, offering a clear guide for future research. This survey serves as a succinct overview of the most recent advancements in data contamination research, providing a straightforward guide for the benefit of future research endeavors.
comment: ACL 2024 Camera-Ready Version
☆ Holistic Evaluation for Interleaved Text-and-Image Generation
Interleaved text-and-image generation has been an intriguing research direction, where the models are required to generate both images and text pieces in an arbitrary order. Despite the emerging advancements in interleaved generation, the progress in its evaluation still significantly lags behind. Existing evaluation benchmarks do not support arbitrarily interleaved images and text for both inputs and outputs, and they only cover a limited number of domains and use cases. Also, current works predominantly use similarity-based metrics which fall short in assessing the quality in open-ended scenarios. To this end, we introduce InterleavedBench, the first benchmark carefully curated for the evaluation of interleaved text-and-image generation. InterleavedBench features a rich array of tasks to cover diverse real-world use cases. In addition, we present InterleavedEval, a strong reference-free metric powered by GPT-4o to deliver accurate and explainable evaluation. We carefully define five essential evaluation aspects for InterleavedEval, including text quality, perceptual quality, image coherence, text-image coherence, and helpfulness, to ensure a comprehensive and fine-grained assessment. Through extensive experiments and rigorous human evaluation, we show that our benchmark and metric can effectively evaluate the existing models with a strong correlation with human judgments surpassing previous reference-based metrics. We also provide substantial findings and insights to foster future research in interleaved generation and its evaluation.
comment: Work in progress. 13 pages, 5 figure, 6 tables
☆ Can LLMs Learn by Teaching? A Preliminary Study
Teaching to improve student models (e.g., knowledge distillation) is an extensively studied methodology in LLMs. However, for humans, teaching not only improves students but also improves teachers. We ask: Can LLMs also learn by teaching (LbT)? If yes, we can potentially unlock the possibility of continuously advancing the models without solely relying on human-produced data or stronger models. In this paper, we provide a preliminary exploration of this ambitious agenda. We show that LbT ideas can be incorporated into existing LLM training/prompting pipelines and provide noticeable improvements. Specifically, we design three methods, each mimicking one of the three levels of LbT in humans: observing students' feedback, learning from the feedback, and learning iteratively, with the goals of improving answer accuracy without training and improving models' inherent capability with fine-tuning. The findings are encouraging. For example, similar to LbT in human, we see that: (1) LbT can induce weak-to-strong generalization: strong models can improve themselves by teaching other weak models; (2) Diversity in students might help: teaching multiple students could be better than teaching one student or the teacher itself. We hope that this early promise can inspire future research on LbT and more broadly adopting the advanced techniques in education to improve LLMs. The code is available at https://github.com/imagination-research/lbt.
comment: Under review
♻ ☆ SIT: Fine-tuning Large Language Models with Sequential Instructions
Despite the success of existing instruction-tuned models, we find that they usually struggle to respond to queries with multiple instructions. This impairs their performance in complex problems whose solution consists of multiple intermediate tasks. Thus, we contend that part of the fine-tuning data mixture should be sequential--containing a chain of interrelated tasks. We first approach sequential instruction tuning from a task-driven perspective, manually creating interpretable intermediate tasks for multilingual and visual question answering: namely "translate then predict" and "caption then answer". Next, we automate this process by turning instructions in existing datasets (e.g., Alpaca and FlanCoT) into diverse and complex sequential instructions, making our method general-purpose. Models that underwent our sequential instruction tuning show improved results in coding, maths, and open-ended generation. Moreover, we put forward a new benchmark named SeqEval to evaluate a model's ability to follow all the instructions in a sequence, which further corroborates the benefits of our fine-tuning method. We hope that our endeavours will open new research avenues on instruction tuning for complex tasks.
comment: 21pages, 8 figures
♻ ☆ STimage-1K4M: A histopathology image-gene expression dataset for spatial transcriptomics
Recent advances in multi-modal algorithms have driven and been driven by the increasing availability of large image-text datasets, leading to significant strides in various fields, including computational pathology. However, in most existing medical image-text datasets, the text typically provides high-level summaries that may not sufficiently describe sub-tile regions within a large pathology image. For example, an image might cover an extensive tissue area containing cancerous and healthy regions, but the accompanying text might only specify that this image is a cancer slide, lacking the nuanced details needed for in-depth analysis. In this study, we introduce STimage-1K4M, a novel dataset designed to bridge this gap by providing genomic features for sub-tile images. STimage-1K4M contains 1,149 images derived from spatial transcriptomics data, which captures gene expression information at the level of individual spatial spots within a pathology image. Specifically, each image in the dataset is broken down into smaller sub-image tiles, with each tile paired with 15,000-30,000 dimensional gene expressions. With 4,293,195 pairs of sub-tile images and gene expressions, STimage-1K4M offers unprecedented granularity, paving the way for a wide range of advanced research in multi-modal data analysis an innovative applications in computational pathology, and beyond.
♻ ☆ The Importance of Directional Feedback for LLM-based Optimizers NeurIPS 2023
We study the potential of using large language models (LLMs) as an interactive optimizer for solving maximization problems in a text space using natural language and numerical feedback. Inspired by the classical optimization literature, we classify the natural language feedback into directional and non-directional, where the former is a generalization of the first-order feedback to the natural language space. We find that LLMs are especially capable of optimization when they are provided with {directional feedback}. Based on this insight, we design a new LLM-based optimizer that synthesizes directional feedback from the historical optimization trace to achieve reliable improvement over iterations. Empirically, we show our LLM-based optimizer is more stable and efficient in solving optimization problems, from maximizing mathematical functions to optimizing prompts for writing poems, compared with existing techniques.
comment: Accepted and Presented at Foundation Models for Decision Making at NeurIPS 2023 (December 15, 2023). Work completed from June 2023 to September 2023
♻ ☆ Transformers Can Represent $n$-gram Language Models
Existing work has analyzed the representational capacity of the transformer architecture by means of formal models of computation. However, the focus so far has been on analyzing the architecture in terms of language \emph{acceptance}. We contend that this is an ill-suited problem in the study of \emph{language models} (LMs), which are definitionally \emph{probability distributions} over strings. In this paper, we focus on the relationship between transformer LMs and $n$-gram LMs, a simple and historically relevant class of language models. We show that transformer LMs using the hard or sparse attention mechanisms can exactly represent any $n$-gram LM, giving us a concrete lower bound on their probabilistic representational capacity. This provides a first step towards understanding the mechanisms that transformer LMs can use to represent probability distributions over strings.
♻ ☆ Low-Rank Quantization-Aware Training for LLMs
Large language models (LLMs) are omnipresent, however their practical deployment is challenging due to their ever increasing computational and memory demands. Quantization is one of the most effective ways to make them more compute and memory efficient. Quantization-aware training (QAT) methods, generally produce the best quantized performance, however it comes at the cost of potentially long training time and excessive memory usage, making it impractical when applying for LLMs. Inspired by parameter-efficient fine-tuning (PEFT) and low-rank adaptation (LoRA) literature, we propose LR-QAT -- a lightweight and memory-efficient QAT algorithm for LLMs. LR-QAT employs several components to save memory without sacrificing predictive performance: (a) low-rank auxiliary weights that are aware of the quantization grid; (b) a downcasting operator using fixed-point or double-packed integers and (c) checkpointing. Unlike most related work, our method (i) is inference-efficient, leading to no additional overhead compared to traditional PTQ; (ii) can be seen as a general extended pretraining framework, meaning that the resulting model can still be utilized for any downstream task afterwards; (iii) can be applied across a wide range of quantization settings, such as different choices quantization granularity, activation quantization, and seamlessly combined with many PTQ techniques. We apply LR-QAT to LLaMA-2/3 and Mistral model families and validate its effectiveness on several downstream tasks. Our method outperforms common post-training quantization (PTQ) approaches and reaches the same model performance as full-model QAT at the fraction of its memory usage. Specifically, we can train a 7B LLM on a single consumer grade GPU with 24GB of memory.
♻ ☆ Leveraging Collection-Wide Similarities for Unsupervised Document Structure Extraction ACL 2024
Document collections of various domains, e.g., legal, medical, or financial, often share some underlying collection-wide structure, which captures information that can aid both human users and structure-aware models. We propose to identify the typical structure of document within a collection, which requires to capture recurring topics across the collection, while abstracting over arbitrary header paraphrases, and ground each topic to respective document locations. These requirements pose several challenges: headers that mark recurring topics frequently differ in phrasing, certain section headers are unique to individual documents and do not reflect the typical structure, and the order of topics can vary between documents. Subsequently, we develop an unsupervised graph-based method which leverages both inter- and intra-document similarities, to extract the underlying collection-wide structure. Our evaluations on three diverse domains in both English and Hebrew indicate that our method extracts meaningful collection-wide structure, and we hope that future work will leverage our method for multi-document applications and structure-aware models.
comment: Accepted to ACL 2024 findings
♻ ☆ Translation and Fusion Improves Zero-shot Cross-lingual Information Extraction
Large language models (LLMs) combined with instruction tuning have shown significant progress in information extraction (IE) tasks, exhibiting strong generalization capabilities to unseen datasets by following annotation guidelines. However, their applicability to low-resource languages remains limited due to lack of both labeled data for fine-tuning, and unlabeled text for pre-training. In this paper, we propose TransFusion, a framework in which models are fine-tuned to use English translations of low-resource language data, enabling more precise predictions through annotation fusion. Based on TransFusion, we introduce GoLLIE-TF, a cross-lingual instruction-tuned LLM for IE tasks, designed to close the performance gap between high and low-resource languages. Our experiments across twelve multilingual IE datasets spanning 50 languages demonstrate that GoLLIE-TF achieves better zero-shot cross-lingual transfer over the base model. In addition, we show that TransFusion significantly improves low-resource language named entity recognition when applied to proprietary models such as GPT-4 (+5 F1) with a prompting approach, or fine-tuning different language models including decoder-only (+14 F1) and encoder-only (+13 F1) architectures.
♻ ☆ All Languages Matter: On the Multilingual Safety of Large Language Models ACL 2024
Safety lies at the core of developing and deploying large language models (LLMs). However, previous safety benchmarks only concern the safety in one language, e.g. the majority language in the pretraining data such as English. In this work, we build the first multilingual safety benchmark for LLMs, XSafety, in response to the global deployment of LLMs in practice. XSafety covers 14 kinds of commonly used safety issues across 10 languages that span several language families. We utilize XSafety to empirically study the multilingual safety for 4 widely-used LLMs, including both close-API and open-source models. Experimental results show that all LLMs produce significantly more unsafe responses for non-English queries than English ones, indicating the necessity of developing safety alignment for non-English languages. In addition, we propose several simple and effective prompting methods to improve the multilingual safety of ChatGPT by evoking safety knowledge and improving cross-lingual generalization of safety alignment. Our prompting method can significantly reduce the ratio of unsafe responses from 19.1% to 9.7% for non-English queries. We release our data at https://github.com/Jarviswang94/Multilingual_safety_benchmark.
comment: Accepted by ACL 2024 Findings. The first multilingual safety benchmark for large language models
♻ ☆ Prompt Perturbation in Retrieval-Augmented Generation based Large Language Models
The robustness of large language models (LLMs) becomes increasingly important as their use rapidly grows in a wide range of domains. Retrieval-Augmented Generation (RAG) is considered as a means to improve the trustworthiness of text generation from LLMs. However, how the outputs from RAG-based LLMs are affected by slightly different inputs is not well studied. In this work, we find that the insertion of even a short prefix to the prompt leads to the generation of outputs far away from factually correct answers. We systematically evaluate the effect of such prefixes on RAG by introducing a novel optimization technique called Gradient Guided Prompt Perturbation (GGPP). GGPP achieves a high success rate in steering outputs of RAG-based LLMs to targeted wrong answers. It can also cope with instructions in the prompts requesting to ignore irrelevant context. We also exploit LLMs' neuron activation difference between prompts with and without GGPP perturbations to give a method that improves the robustness of RAG-based LLMs through a highly effective detector trained on neuron activation triggered by GGPP generated prompts. Our evaluation on open-sourced LLMs demonstrates the effectiveness of our methods.
comment: 12 pages, 9 figures
♻ ☆ Are you still on track!? Catching LLM Task Drift with Activations
Large Language Models (LLMs) are routinely used in retrieval-augmented applications to orchestrate tasks and process inputs from users and other sources. These inputs, even in a single LLM interaction, can come from a variety of sources, of varying trustworthiness and provenance. This opens the door to prompt injection attacks, where the LLM receives and acts upon instructions from supposedly data-only sources, thus deviating from the user's original instructions. We define this as task drift, and we propose to catch it by scanning and analyzing the LLM's activations. We compare the LLM's activations before and after processing the external input in order to detect whether this input caused instruction drift. We develop two probing methods and find that simply using a linear classifier can detect drift with near perfect ROC AUC on an out-of-distribution test set. We show that this approach generalizes surprisingly well to unseen task domains, such as prompt injections, jailbreaks, and malicious instructions, without being trained on any of these attacks. Our setup does not require any modification of the LLM (e.g., fine-tuning) or any text generation, thus maximizing deployability and cost efficiency and avoiding reliance on unreliable model output. To foster future research on activation-based task inspection, decoding, and interpretability, we will release our large-scale TaskTracker toolkit, comprising a dataset of over 500K instances, representations from 4 SoTA language models, and inspection tools.
♻ ☆ Can Large Multimodal Models Uncover Deep Semantics Behind Images?
Understanding the deep semantics of images is essential in the era dominated by social media. However, current research works primarily on the superficial description of images, revealing a notable deficiency in the systematic investigation of the inherent deep semantics. In this work, we introduce DEEPEVAL, a comprehensive benchmark to assess Large Multimodal Models' (LMMs) capacities of visual deep semantics. DEEPEVAL includes human-annotated dataset and three progressive subtasks: fine-grained description selection, in-depth title matching, and deep semantics understanding. Utilizing DEEPEVAL, we evaluate 9 open-source LMMs and GPT-4V(ision). Our evaluation demonstrates a substantial gap between the deep semantic comprehension capabilities of existing LMMs and humans. For example, GPT-4V is 30% behind humans in understanding deep semantics, even though it achieves human-comparable performance in image description. Further analysis reveals that LMM performance on DEEPEVAL varies according to the specific facets of deep semantics explored, indicating the fundamental challenges remaining in developing LMMs.
♻ ☆ Application of Natural Language Processing in Financial Risk Detection
This paper explores the application of Natural Language Processing (NLP) in financial risk detection. By constructing an NLP-based financial risk detection model, this study aims to identify and predict potential risks in financial documents and communications. First, the fundamental concepts of NLP and its theoretical foundation, including text mining methods, NLP model design principles, and machine learning algorithms, are introduced. Second, the process of text data preprocessing and feature extraction is described. Finally, the effectiveness and predictive performance of the model are validated through empirical research. The results show that the NLP-based financial risk detection model performs excellently in risk identification and prediction, providing effective risk management tools for financial institutions. This study offers valuable references for the field of financial risk management, utilizing advanced NLP techniques to improve the accuracy and efficiency of financial risk detection.
♻ ☆ DistALANER: Distantly Supervised Active Learning Augmented Named Entity Recognition in the Open Source Software Ecosystem ECML-PKDD 2024
With the AI revolution in place, the trend for building automated systems to support professionals in different domains such as the open source software systems, healthcare systems, banking systems, transportation systems and many others have become increasingly prominent. A crucial requirement in the automation of support tools for such systems is the early identification of named entities, which serves as a foundation for developing specialized functionalities. However, due to the specific nature of each domain, different technical terminologies and specialized languages, expert annotation of available data becomes expensive and challenging. In light of these challenges, this paper proposes a novel named entity recognition (NER) technique specifically tailored for the open-source software systems. Our approach aims to address the scarcity of annotated software data by employing a comprehensive two-step distantly supervised annotation process. This process strategically leverages language heuristics, unique lookup tables, external knowledge sources, and an active learning approach. By harnessing these powerful techniques, we not only enhance model performance but also effectively mitigate the limitations associated with cost and the scarcity of expert annotators. It is noteworthy that our model significantly outperforms the state-of-the-art LLMs by a substantial margin. We also show the effectiveness of NER in the downstream task of relation extraction.
comment: Accepted at ECML-PKDD 2024 (Long Paper)
♻ ☆ CoT-BERT: Enhancing Unsupervised Sentence Representation through Chain-of-Thought ICANN 2024
Unsupervised sentence representation learning aims to transform input sentences into fixed-length vectors enriched with intricate semantic information while obviating the reliance on labeled data. Recent strides within this domain have been significantly propelled by breakthroughs in contrastive learning and prompt engineering. Despite these advancements, the field has reached a plateau, leading some researchers to incorporate external components to enhance the quality of sentence embeddings. Such integration, though beneficial, complicates solutions and inflates demands for computational resources. In response to these challenges, this paper presents CoT-BERT, an innovative method that harnesses the progressive thinking of Chain-of-Thought reasoning to tap into the latent potential of pre-trained models like BERT. Additionally, we develop an advanced contrastive learning loss function and propose a novel template denoising strategy. Rigorous experimentation demonstrates that CoT-BERT surpasses a range of well-established baselines by relying exclusively on the intrinsic strengths of pre-trained models.
comment: Accepted by ICANN 2024
♻ ☆ CDEval: A Benchmark for Measuring the Cultural Dimensions of Large Language Models ACL 2024
As the scaling of Large Language Models (LLMs) has dramatically enhanced their capabilities, there has been a growing focus on the alignment problem to ensure their responsible and ethical use. While existing alignment efforts predominantly concentrate on universal values such as the HHH principle, the aspect of culture, which is inherently pluralistic and diverse, has not received adequate attention. This work introduces a new benchmark, CDEval, aimed at evaluating the cultural dimensions of LLMs. CDEval is constructed by incorporating both GPT-4's automated generation and human verification, covering six cultural dimensions across seven domains. Our comprehensive experiments provide intriguing insights into the culture of mainstream LLMs, highlighting both consistencies and variations across different dimensions and domains. The findings underscore the importance of integrating cultural considerations in LLM development, particularly for applications in diverse cultural settings. Through CDEval, we aim to broaden the horizon of LLM alignment research by including cultural dimensions, thus providing a more holistic framework for the future development and evaluation of LLMs. This benchmark serves as a valuable resource for cultural studies in LLMs, paving the way for more culturally aware and sensitive models.
comment: Accepted by the Cross-Cultural Considerations in NLP Workshop @ ACL 2024
♻ ☆ Advancing Abductive Reasoning in Knowledge Graphs through Complex Logical Hypothesis Generation ACL 2024
Abductive reasoning is the process of making educated guesses to provide explanations for observations. Although many applications require the use of knowledge for explanations, the utilization of abductive reasoning in conjunction with structured knowledge, such as a knowledge graph, remains largely unexplored. To fill this gap, this paper introduces the task of complex logical hypothesis generation, as an initial step towards abductive logical reasoning with KG. In this task, we aim to generate a complex logical hypothesis so that it can explain a set of observations. We find that the supervised trained generative model can generate logical hypotheses that are structurally closer to the reference hypothesis. However, when generalized to unseen observations, this training objective does not guarantee better hypothesis generation. To address this, we introduce the Reinforcement Learning from Knowledge Graph (RLF-KG) method, which minimizes differences between observations and conclusions drawn from generated hypotheses according to the KG. Experiments show that, with RLF-KG's assistance, the generated hypotheses provide better explanations, and achieve state-of-the-art results on three widely used KGs.
comment: Accepted by ACL 2024
♻ ☆ A Small and Fast BERT for Chinese Medical Punctuation Restoration INTERSPEECH 2024
In clinical dictation, utterances after automatic speech recognition (ASR) without explicit punctuation marks may lead to the misunderstanding of dictated reports. To give a precise and understandable clinical report with ASR, automatic punctuation restoration is required. Considering a practical scenario, we propose a fast and light pre-trained model for Chinese medical punctuation restoration based on 'pretraining and fine-tuning' paradigm. In this work, we distill pre-trained models by incorporating supervised contrastive learning and a novel auxiliary pre-training task (Punctuation Mark Prediction) to make it well-suited for punctuation restoration. Our experiments on various distilled models reveal that our model can achieve 95% performance while 10% model size relative to state-of-the-art Chinese RoBERTa.
comment: 5 pages, 2 figures, Accepted by INTERSPEECH 2024
♻ ☆ Is Translation All You Need? A Study on Solving Multilingual Tasks with Large Language Models
Large language models (LLMs) have demonstrated multilingual capabilities; yet, they are mostly English-centric due to the imbalanced training corpora. Existing works leverage this phenomenon to improve their multilingual performances through translation, primarily on natural language processing (NLP) tasks. This work extends the evaluation from NLP tasks to real user queries and from English-centric LLMs to non-English-centric LLMs. While translation into English can help improve the performance of multilingual NLP tasks for English-centric LLMs, it may not be optimal for all scenarios. For culture-related tasks that need deep language understanding, prompting in the native language tends to be more promising as it better captures the nuances of culture and language. Our experiments reveal varied behaviors among different LLMs and tasks in the multilingual context. Therefore, we advocate for more comprehensive multilingual evaluation and more efforts toward developing multilingual LLMs beyond English-centric ones.
comment: 19 pages
♻ ☆ RE-GAINS & EnChAnT: Intelligent Tool Manipulation Systems For Enhanced Query Responses
Large Language Models (LLMs) currently struggle with tool invocation and chaining, as they often hallucinate or miss essential steps in a sequence. We propose RE-GAINS and EnChAnT, two novel frameworks that empower LLMs to tackle complex user queries by making API calls to external tools based on tool descriptions and argument lists. Tools are chained based on the expected output, without receiving the actual results from each individual call. EnChAnT, an open-source solution, leverages an LLM format enforcer, OpenChat 3.5 (an LLM), and ToolBench's API Retriever. RE-GAINS utilizes OpenAI models and embeddings with a specialized prompt based on the $\underline{R}$easoning vi$\underline{a}$ $\underline{P}$lanning $(RAP)$ framework. Both frameworks are low cost (0.01\$ per query). Our key contribution is enabling LLMs for tool invocation and chaining using modifiable, externally described tools.
♻ ☆ Exploring Transfer Learning in Medical Image Segmentation using Vision-Language Models
Medical image segmentation allows quantifying target structure size and shape, aiding in disease diagnosis, prognosis, surgery planning, and comprehension.Building upon recent advancements in foundation Vision-Language Models (VLMs) from natural image-text pairs, several studies have proposed adapting them to Vision-Language Segmentation Models (VLSMs) that allow using language text as an additional input to segmentation models. Introducing auxiliary information via text with human-in-the-loop prompting during inference opens up unique opportunities, such as open vocabulary segmentation and potentially more robust segmentation models against out-of-distribution data. Although transfer learning from natural to medical images has been explored for image-only segmentation models, the joint representation of vision-language in segmentation problems remains underexplored. This study introduces the first systematic study on transferring VLSMs to 2D medical images, using carefully curated $11$ datasets encompassing diverse modalities and insightful language prompts and experiments. Our findings demonstrate that although VLSMs show competitive performance compared to image-only models for segmentation after finetuning in limited medical image datasets, not all VLSMs utilize the additional information from language prompts, with image features playing a dominant role. While VLSMs exhibit enhanced performance in handling pooled datasets with diverse modalities and show potential robustness to domain shifts compared to conventional segmentation models, our results suggest that novel approaches are required to enable VLSMs to leverage the various auxiliary information available through language prompts. The code and datasets are available at https://github.com/naamiinepal/medvlsm.
comment: Medical Imaging with Deep Learning (MIDL) 2024 (Oral)
♻ ☆ Various Lengths, Constant Speed: Efficient Language Modeling with Lightning Attention ICML 2024
We present Lightning Attention, the first linear attention implementation that maintains a constant training speed for various sequence lengths under fixed memory consumption. Due to the issue with cumulative summation operations (cumsum), previous linear attention implementations cannot achieve their theoretical advantage in a casual setting. However, this issue can be effectively solved by utilizing different attention calculation strategies to compute the different parts of attention. Specifically, we split the attention calculation into intra-blocks and inter-blocks and use conventional attention computation for intra-blocks and linear attention kernel tricks for inter-blocks. This eliminates the need for cumsum in the linear attention calculation. Furthermore, a tiling technique is adopted through both forward and backward procedures to take full advantage of the GPU hardware. To enhance accuracy while preserving efficacy, we introduce TransNormerLLM (TNL), a new architecture that is tailored to our lightning attention. We conduct rigorous testing on standard and self-collected datasets with varying model sizes and sequence lengths. TNL is notably more efficient than other language models. In addition, benchmark results indicate that TNL performs on par with state-of-the-art LLMs utilizing conventional transformer structures. The source code is released at github.com/OpenNLPLab/TransnormerLLM.
comment: Accepted by ICML 2024. Yiran Zhong is the corresponding author. Code is released at github.com/OpenNLPLab/TransnormerLLM
♻ ☆ Cultural Conditioning or Placebo? On the Effectiveness of Socio-Demographic Prompting
Socio-demographic prompting is a commonly employed approach to study cultural biases in LLMs as well as for aligning models to certain cultures. In this paper, we systematically probe four LLMs (Llama 3, Mistral v0.2, GPT-3.5 Turbo and GPT-4) with prompts that are conditioned on culturally sensitive and non-sensitive cues, on datasets that are supposed to be culturally sensitive (EtiCor and CALI) or neutral (MMLU and ETHICS). We observe that all models except GPT-4 show significant variations in their responses on both kinds of datasets for both kinds of prompts, casting doubt on the robustness of the culturally-conditioned prompting as a method for eliciting cultural bias in models or as an alignment strategy. The work also calls rethinking the control experiment design to tease apart the cultural conditioning of responses from "placebo effect", i.e., random perturbations of model responses due to arbitrary tokens in the prompt.
♻ ☆ Concentrate Attention: Towards Domain-Generalizable Prompt Optimization for Language Models
Recent advances in prompt optimization have notably enhanced the performance of pre-trained language models (PLMs) on downstream tasks. However, the potential of optimized prompts on domain generalization has been under-explored. To explore the nature of prompt generalization on unknown domains, we conduct pilot experiments and find that (i) Prompts gaining more attention weight from PLMs' deep layers are more generalizable and (ii) Prompts with more stable attention distributions in PLMs' deep layers are more generalizable. Thus, we offer a fresh objective towards domain-generalizable prompts optimization named "Concentration", which represents the "lookback" attention from the current decoding token to the prompt tokens, to increase the attention strength on prompts and reduce the fluctuation of attention distribution. We adapt this new objective to popular soft prompt and hard prompt optimization methods, respectively. Extensive experiments demonstrate that our idea improves comparison prompt optimization methods by 1.42% for soft prompt generalization and 2.16% for hard prompt generalization in accuracy on the multi-source domain generalization setting, while maintaining satisfying in-domain performance. The promising results validate the effectiveness of our proposed prompt optimization objective and provide key insights into domain-generalizable prompts.
comment: Preprint
♻ ☆ Are Large Language Models a Good Replacement of Taxonomies? VLDB 2024
Large language models (LLMs) demonstrate an impressive ability to internalize knowledge and answer natural language questions. Although previous studies validate that LLMs perform well on general knowledge while presenting poor performance on long-tail nuanced knowledge, the community is still doubtful about whether the traditional knowledge graphs should be replaced by LLMs. In this paper, we ask if the schema of knowledge graph (i.e., taxonomy) is made obsolete by LLMs. Intuitively, LLMs should perform well on common taxonomies and at taxonomy levels that are common to people. Unfortunately, there lacks a comprehensive benchmark that evaluates the LLMs over a wide range of taxonomies from common to specialized domains and at levels from root to leaf so that we can draw a confident conclusion. To narrow the research gap, we constructed a novel taxonomy hierarchical structure discovery benchmark named TaxoGlimpse to evaluate the performance of LLMs over taxonomies. TaxoGlimpse covers ten representative taxonomies from common to specialized domains with in-depth experiments of different levels of entities in this taxonomy from root to leaf. Our comprehensive experiments of eighteen state-of-the-art LLMs under three prompting settings validate that LLMs can still not well capture the knowledge of specialized taxonomies and leaf-level entities.
comment: Accepted by VLDB 2024
♻ ☆ White Men Lead, Black Women Help? Benchmarking Language Agency Social Biases in LLMs
Language agency is an important aspect of evaluating social biases in texts. While several studies approached agency-related bias in human-written language, very limited research has investigated such biases in Large Language Model (LLM)-generated content. In addition, previous research often relies on string-matching techniques to identify agentic and communal words within texts, which fall short of accurately classifying language agency. We introduce the novel Language Agency Bias Evaluation (LABE) benchmark, which comprehensively evaluates biases in LLMs by analyzing agency levels attributed to different demographic groups in model generations. LABE leverages 5,400 template-based prompts, an accurate agency classifier, and corresponding bias metrics to test for gender, racial, and intersectional language agency biases in LLMs on 3 text generation tasks: biographies, professor reviews, and reference letters. To build better and more accurate automated agency classifiers, we also contribute and release the Language Agency Classification (LAC) dataset, consisting of 3,724 agentic and communal sentences. Using LABE, we unveil previously under-explored language agency social biases in 3 recent LLMs: ChatGPT, Llama3, and Mistral. We observe that: (1) For the same text category, LLM generations demonstrate higher levels of gender bias than human-written texts; (2) On most generation tasks, models show remarkably higher levels of intersectional bias than the other bias aspects. Those who are at the intersection of gender and racial minority groups -- such as Black females -- are consistently described by texts with lower levels of agency; (3) Among the 3 LLMs investigated, Llama3 demonstrates greatest overall bias in language agency; (4) Not only does prompt-based mitigation fail to resolve language agency bias in LLMs, but it frequently leads to the exacerbation of biases in generated texts.
♻ ☆ ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming
When building Large Language Models (LLMs), it is paramount to bear safety in mind and protect them with guardrails. Indeed, LLMs should never generate content promoting or normalizing harmful, illegal, or unethical behavior that may contribute to harm to individuals or society. This principle applies to both normal and adversarial use. In response, we introduce ALERT, a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy. It is designed to evaluate the safety of LLMs through red teaming methodologies and consists of more than 45k instructions categorized using our novel taxonomy. By subjecting LLMs to adversarial testing scenarios, ALERT aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models. Furthermore, the fine-grained taxonomy enables researchers to perform an in-depth evaluation that also helps one to assess the alignment with various policies. In our experiments, we extensively evaluate 10 popular open- and closed-source LLMs and demonstrate that many of them still struggle to attain reasonable levels of safety.
comment: 17 pages, preprint
♻ ☆ LLM-Ensemble: Optimal Large Language Model Ensemble Method for E-commerce Product Attribute Value Extraction SIGIR 2024
Product attribute value extraction is a pivotal component in Natural Language Processing (NLP) and the contemporary e-commerce industry. The provision of precise product attribute values is fundamental in ensuring high-quality recommendations and enhancing customer satisfaction. The recently emerging Large Language Models (LLMs) have demonstrated state-of-the-art performance in numerous attribute extraction tasks, without the need for domain-specific training data. Nevertheless, varying strengths and weaknesses are exhibited by different LLMs due to the diversity in data, architectures, and hyperparameters. This variation makes them complementary to each other, with no single LLM dominating all others. Considering the diverse strengths and weaknesses of LLMs, it becomes necessary to develop an ensemble method that leverages their complementary potentials. In this paper, we propose a novel algorithm called LLM-ensemble to ensemble different LLMs' outputs for attribute value extraction. We iteratively learn the weights for different LLMs to aggregate the labels with weights to predict the final attribute value. Not only can our proposed method be proven theoretically optimal, but it also ensures efficient computation, fast convergence, and safe deployment. We have also conducted extensive experiments with various state-of-the-art LLMs, including Llama2-13B, Llama2-70B, PaLM-2, GPT-3.5, and GPT-4, on Walmart's internal data. Our offline metrics demonstrate that the LLM-ensemble method outperforms all the state-of-the-art single LLMs on Walmart's internal dataset. This method has been launched in several production models, leading to improved Gross Merchandise Volume (GMV), Click-Through Rate (CTR), Conversion Rate (CVR), and Add-to-Cart Rate (ATC).
comment: SIGIR 2024 industry track
♻ ☆ Dependency-Aware Semi-Structured Sparsity: Declining Roles of Outliers in Pruning GLU-based LLMs
The rapid growth in the scale of Large Language Models (LLMs) has led to significant computational and memory costs, making model compression techniques such as network pruning increasingly crucial for their efficient deployment. Recent LLMs such as LLaMA2 and Mistral have adopted GLU-based MLP architectures. However, current LLM pruning strategies are primarily based on insights from older LLM architectures, necessitating a reevaluation of these strategies to suit the new architectural characteristics. Contrary to traditional beliefs, we find that outliers play a diminished role in the input projections of GLU-based MLPs. Leveraging this new insight, we propose Dependency-aware Semi-structured Sparsity (DaSS), a novel pruning method for GLU-based LLMs. DaSS balances the flexibility of unstructured pruning and the structural consistency of dependency-based structured pruning by considering both of weight magnitude and corresponding intermediate activation norms in weight pruning metric. Empirical evaluations on the Mistral, Gemma, and LLaMA2 model families demonstrate the consistent effectiveness of DaSS in the prevailing GLU variants.
♻ ☆ Improving Neural Topic Models with Wasserstein Knowledge Distillation ECIR 2023
Topic modeling is a dominant method for exploring document collections on the web and in digital libraries. Recent approaches to topic modeling use pretrained contextualized language models and variational autoencoders. However, large neural topic models have a considerable memory footprint. In this paper, we propose a knowledge distillation framework to compress a contextualized topic model without loss in topic quality. In particular, the proposed distillation objective is to minimize the cross-entropy of the soft labels produced by the teacher and the student models, as well as to minimize the squared 2-Wasserstein distance between the latent distributions learned by the two models. Experiments on two publicly available datasets show that the student trained with knowledge distillation achieves topic coherence much higher than that of the original student model, and even surpasses the teacher while containing far fewer parameters than the teacher's. The distilled model also outperforms several other competitive topic models on topic coherence.
comment: Accepted at ECIR 2023
♻ ☆ mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs
Modular vision-language models (Vision-LLMs) align pretrained image encoders with (frozen) large language models (LLMs) and post-hoc condition LLMs to `understand' the image input. With the abundance of readily available high-quality English image-text data as well as strong monolingual English LLMs, the research focus has been on English-only Vision-LLMs. Multilingual vision-language models are still predominantly obtained via expensive end-to-end pretraining, resulting in comparatively smaller models, trained on limited multilingual image data supplemented with text-only multilingual corpora. We present mBLIP, the first Vision-LLM leveraging multilingual LLMs, which we obtain in a computationally efficient manner on consumer-level hardware. To this end, we \textit{re-align} an image encoder previously tuned to an English LLM to a new, multilingual LLM using only a few million multilingual training examples derived from a mix of vision-and-language tasks, which we obtain by machine-translating high-quality English data to 95 languages. On the IGLUE benchmark and XM3600, mBLIP yields results competitive with state-of-the-art models and it greatly outperforms strong English-only Vision-LLMs like Llava 1.5. We release our model, code, and train data at \url{https://github.com/gregor-ge/mBLIP}.
comment: ALVR Workshop 2024
♻ ☆ Unified Active Retrieval for Retrieval Augmented Generation
In Retrieval-Augmented Generation (RAG), retrieval is not always helpful and applying it to every instruction is sub-optimal. Therefore, determining whether to retrieve is crucial for RAG, which is usually referred to as Active Retrieval. However, existing active retrieval methods face two challenges: 1. They usually rely on a single criterion, which struggles with handling various types of instructions. 2. They depend on specialized and highly differentiated procedures, and thus combining them makes the RAG system more complicated and leads to higher response latency. To address these challenges, we propose Unified Active Retrieval (UAR). UAR contains four orthogonal criteria and casts them into plug-and-play classification tasks, which achieves multifaceted retrieval timing judgements with negligible extra inference cost. We further introduce the Unified Active Retrieval Criteria (UAR-Criteria), designed to process diverse active retrieval scenarios through a standardized procedure. Experiments on four representative types of user instructions show that UAR significantly outperforms existing work on the retrieval timing judgement and the performance of downstream tasks, which shows the effectiveness of UAR and its helpfulness to downstream tasks.
♻ ☆ What Will My Model Forget? Forecasting Forgotten Examples in Language Model Refinement ICML 2024
Language models deployed in the wild make errors. However, simply updating the model with the corrected error instances causes catastrophic forgetting -- the updated model makes errors on instances learned during the instruction tuning or upstream training phase. Randomly replaying upstream data yields unsatisfactory performance and often comes with high variance and poor controllability. To this end, we try to forecast upstream examples that will be forgotten due to a model update for improved controllability of the replay process and interpretability. We train forecasting models given a collection of online learned examples and corresponding forgotten upstream pre-training examples. We propose a partially interpretable forecasting model based on the observation that changes in pre-softmax logit scores of pretraining examples resemble that of online learned examples, which performs decently on BART but fails on T5 models. We further show a black-box classifier based on inner products of example representations achieves better forecasting performance over a series of setups. Finally, we show that we reduce forgetting of upstream pretraining examples by replaying examples that are forecasted to be forgotten, demonstrating the practical utility of forecasting example forgetting.
comment: To appear at ICML 2024 (Spotlight)
♻ ☆ Exploring ChatGPT's Capabilities on Vulnerability Management USENIX Security 2024
Recently, ChatGPT has attracted great attention from the code analysis domain. Prior works show that ChatGPT has the capabilities of processing foundational code analysis tasks, such as abstract syntax tree generation, which indicates the potential of using ChatGPT to comprehend code syntax and static behaviors. However, it is unclear whether ChatGPT can complete more complicated real-world vulnerability management tasks, such as the prediction of security relevance and patch correctness, which require an all-encompassing understanding of various aspects, including code syntax, program semantics, and related manual comments. In this paper, we explore ChatGPT's capabilities on 6 tasks involving the complete vulnerability management process with a large-scale dataset containing 70,346 samples. For each task, we compare ChatGPT against SOTA approaches, investigate the impact of different prompts, and explore the difficulties. The results suggest promising potential in leveraging ChatGPT to assist vulnerability management. One notable example is ChatGPT's proficiency in tasks like generating titles for software bug reports. Furthermore, our findings reveal the difficulties encountered by ChatGPT and shed light on promising future directions. For instance, directly providing random demonstration examples in the prompt cannot consistently guarantee good performance in vulnerability management. By contrast, leveraging ChatGPT in a self-heuristic way -- extracting expertise from demonstration examples itself and integrating the extracted expertise in the prompt is a promising research direction. Besides, ChatGPT may misunderstand and misuse the information in the prompt. Consequently, effectively guiding ChatGPT to focus on helpful information rather than the irrelevant content is still an open problem.
comment: Accepted by USENIX Security 2024
♻ ☆ SPA: Towards A Computational Friendly Cloud-Base and On-Devices Collaboration Seq2seq Personalized Generation SP
Large language models(LLMs) have shown its outperforming ability on various tasks and question answering. However, LLMs require substantial memory storage on low-resource devices. More critically, the computational speed on these devices is also severely limited. In this paper, we propose SPA(Side Plugin Adaption), a lightweight architecture for fast on-devices inference on the constraints of strict on-devices computation and memory constraints. Compared with other on-devices seq2seq generation, SPA could make a fast and stable inference on low-resource constraints, allowing it to obtain cost effiency. Our method establish an interaction between a pretrained LLMs on-cloud and additive parameters on-devices, which could provide the knowledge on both pretrained LLMs and featured personal feature. Further more, SPA provides a framework to keep feature-base parameters on low computational devices while leave the parameters containing general information on the high computational devices.
comment: 15 pages, second version of SPA(Side Plugin Adaption)
♻ ☆ MemDPT: Differential Privacy for Memory Efficient Language Models
Large language models have consistently demonstrated remarkable performance across a wide spectrum of applications. Nonetheless, the deployment of these models can inadvertently expose user privacy to potential risks. The substantial memory demands of these models during training represent a significant resource consumption challenge. The sheer size of these models imposes a considerable burden on memory resources, which is a matter of significant concern in practice. In this paper, we present an innovative training framework MemDPT that not only reduces the memory cost of large language models but also places a strong emphasis on safeguarding user data privacy. MemDPT provides edge network and reverse network designs to accommodate various differential privacy memory-efficient fine-tuning schemes. Our approach not only achieves $2 \sim 3 \times$ memory optimization but also provides robust privacy protection, ensuring that user data remains secure and confidential. Extensive experiments have demonstrated that MemDPT can effectively provide differential privacy efficient fine-tuning across various task scenarios.
comment: 12 pages first version
♻ ☆ OR-Bench: An Over-Refusal Benchmark for Large Language Models
Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs. While significant research focuses on mitigating harmful content generation, the enhanced safety often come with the side effect of over-refusal, where LLMs may reject innocuous prompts and become less helpful. Although the issue of over-refusal has been empirically observed, a systematic measurement is challenging due to the difficulty of crafting prompts that appear harmful but are benign. This study proposes a novel method for automatically generating large-scale sets of "seemingly toxic prompts" (benign prompts likely rejected by LLMs). Leveraging this technique, we introduce OR-Bench, the first large-scale over-refusal benchmark. OR-Bench comprises 80,000 seemingly toxic prompts across 10 common rejection categories, a subset of around 1,000 hard prompts that are challenging even for state-of-the-art LLMs, and an additional 600 toxic prompts to prevent indiscriminate responses. We then conduct a comprehensive study to measure the over-refusal of 25 popular LLMs across 8 model families. Our datasets are available at https://huggingface.co/datasets/bench-llm/or-bench and the demo can be found at https://huggingface.co/spaces/bench-llm/or-bench. We hope this benchmark can help the community develop better safety aligned models.
comment: version 2, 10 pages main, 22 pages total
♻ ☆ Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment
Despite the general capabilities of Large Language Models (LLM), these models still request fine-tuning or adaptation with customized data when meeting specific business demands. However, this process inevitably introduces new threats, particularly against the Fine-tuning based Jailbreak Attack (FJAttack) under the setting of Language-Model-as-a-Service (LMaaS), where the model's safety has been significantly compromised by fine-tuning users' uploaded examples contain just a few harmful examples. Though potential defenses have been proposed that the service providers can integrate safety examples into the fine-tuning dataset to reduce safety issues, such approaches require incorporating a substantial amount of data, making it inefficient. To effectively defend against the FJAttack with limited safety examples under LMaaS, we propose the Backdoor Enhanced Safety Alignment method inspired by an analogy with the concept of backdoor attacks. In particular, service providers will construct prefixed safety examples with a secret prompt, acting as a "backdoor trigger". By integrating prefixed safety examples into the fine-tuning dataset, the subsequent fine-tuning process effectively acts as the "backdoor attack", establishing a strong correlation between the secret prompt and safety generations. Consequently, safe responses are ensured once service providers prepend this secret prompt ahead of any user input during inference. Our comprehensive experiments demonstrate that through the Backdoor Enhanced Safety Alignment with adding as few as 11 prefixed safety examples, the maliciously fine-tuned LLMs will achieve similar safety performance as the original aligned models without harming the benign performance. Furthermore, we also present the effectiveness of our method in a more practical setting where the fine-tuning data consists of both FJAttack examples and the fine-tuning task data.
♻ ☆ Explainable Fake News Detection With Large Language Model via Defense Among Competing Wisdom WWW'2024
Most fake news detection methods learn latent feature representations based on neural networks, which makes them black boxes to classify a piece of news without giving any justification. Existing explainable systems generate veracity justifications from investigative journalism, which suffer from debunking delayed and low efficiency. Recent studies simply assume that the justification is equivalent to the majority opinions expressed in the wisdom of crowds. However, the opinions typically contain some inaccurate or biased information since the wisdom of crowds is uncensored. To detect fake news from a sea of diverse, crowded and even competing narratives, in this paper, we propose a novel defense-based explainable fake news detection framework. Specifically, we first propose an evidence extraction module to split the wisdom of crowds into two competing parties and respectively detect salient evidences. To gain concise insights from evidences, we then design a prompt-based module that utilizes a large language model to generate justifications by inferring reasons towards two possible veracities. Finally, we propose a defense-based inference module to determine veracity via modeling the defense among these justifications. Extensive experiments conducted on two real-world benchmarks demonstrate that our proposed method outperforms state-of-the-art baselines in terms of fake news detection and provides high-quality justifications.
comment: 12 pages, WWW'2024
♻ ☆ BlockPruner: Fine-grained Pruning for Large Language Models
With the rapid growth in the size and complexity of large language models (LLMs), the costs associated with their training and inference have escalated significantly. Research indicates that certain layers in LLMs harbor substantial redundancy, and pruning these layers has minimal impact on the overall performance. While various layer pruning methods have been developed based on this insight, they generally overlook the finer-grained redundancies within the layers themselves. In this paper, we delve deeper into the architecture of LLMs and demonstrate that finer-grained pruning can be achieved by targeting redundancies in multi-head attention (MHA) and multi-layer perceptron (MLP) blocks. We propose a novel, training-free structured pruning approach called BlockPruner. Unlike existing layer pruning methods, BlockPruner segments each Transformer layer into MHA and MLP blocks. It then assesses the importance of these blocks using perplexity measures and applies a heuristic search for iterative pruning. We applied BlockPruner to LLMs of various sizes and architectures and validated its performance across a wide range of downstream tasks. Experimental results show that BlockPruner achieves more granular and effective pruning compared to state-of-the-art baselines.
♻ ☆ OAG-Bench: A Human-Curated Benchmark for Academic Graph Mining KDD'24
With the rapid proliferation of scientific literature, versatile academic knowledge services increasingly rely on comprehensive academic graph mining. Despite the availability of public academic graphs, benchmarks, and datasets, these resources often fall short in multi-aspect and fine-grained annotations, are constrained to specific task types and domains, or lack underlying real academic graphs. In this paper, we present OAG-Bench, a comprehensive, multi-aspect, and fine-grained human-curated benchmark based on the Open Academic Graph (OAG). OAG-Bench covers 10 tasks, 20 datasets, 70+ baselines, and 120+ experimental results to date. We propose new data annotation strategies for certain tasks and offer a suite of data pre-processing codes, algorithm implementations, and standardized evaluation protocols to facilitate academic graph mining. Extensive experiments reveal that even advanced algorithms like large language models (LLMs) encounter difficulties in addressing key challenges in certain tasks, such as paper source tracing and scholar profiling. We also introduce the Open Academic Graph Challenge (OAG-Challenge) to encourage community input and sharing. We envisage that OAG-Bench can serve as a common ground for the community to evaluate and compare algorithms in academic graph mining, thereby accelerating algorithm development and advancement in this field. OAG-Bench is accessible at https://www.aminer.cn/data/.
comment: KDD'24, 9 pages, 5 appendix pages
♻ ☆ Identifying Factual Inconsistencies in Summaries: Grounding Model Inference via Task Taxonomy
Factual inconsistencies pose a significant hurdle for the faithful summarization by generative models. While a major direction to enhance inconsistency detection is to derive stronger Natural Language Inference (NLI) models, we propose an orthogonal aspect that underscores the importance of incorporating task-specific taxonomy into the inference. To this end, we consolidate key error types of inconsistent facts in summaries, and incorporate them to facilitate both the zero-shot and supervised paradigms of LLMs. Extensive experiments on ten datasets of five distinct domains suggest that, zero-shot LLM inference could benefit from the explicit solution space depicted by the error type taxonomy, and achieves state-of-the-art performance overall, surpassing specialized non-LLM baselines, as well as recent LLM baselines. We further distill models that fuse the taxonomy into parameters through our designed prompt completions and supervised training strategies, efficiently substituting state-of-the-art zero-shot inference with much larger LLMs.
♻ ☆ Quest: Query-centric Data Synthesis Approach for Long-context Scaling of Large Language Model
Large language models, initially pre-trained with a limited context length, can better handle longer texts by continuing training on a corpus with extended contexts. However, obtaining effective long-context data is challenging due to the scarcity and uneven distribution of long documents across different domains. To address this issue, we propose a Query-centric data synthesis method, abbreviated as Quest. Quest is an interpretable method based on the observation that documents retrieved by similar queries are relevant but low-redundant, thus well-suited for synthesizing long-context data. The method is also scalable and capable of constructing large amounts of long-context data. Using Quest, we synthesize a long-context dataset up to 128k context length, significantly outperforming other data synthesis methods on multiple long-context benchmark datasets. In addition, we further verify that the Quest method is predictable through scaling law experiments, making it a reliable solution for advancing long-context models.
♻ ☆ Exploring Multilingual Unseen Speaker Emotion Recognition: Leveraging Co-Attention Cues in Multitask Learning INTERSPEECH 2024
Advent of modern deep learning techniques has given rise to advancements in the field of Speech Emotion Recognition (SER). However, most systems prevalent in the field fail to generalize to speakers not seen during training. This study focuses on handling challenges of multilingual SER, specifically on unseen speakers. We introduce CAMuLeNet, a novel architecture leveraging co-attention based fusion and multitask learning to address this problem. Additionally, we benchmark pretrained encoders of Whisper, HuBERT, Wav2Vec2.0, and WavLM using 10-fold leave-speaker-out cross-validation on five existing multilingual benchmark datasets: IEMOCAP, RAVDESS, CREMA-D, EmoDB and CaFE and, release a novel dataset for SER on the Hindi language (BhavVani). CAMuLeNet shows an average improvement of approximately 8% over all benchmarks on unseen speakers determined by our cross-validation strategy.
comment: 5 pages, Accepted to INTERSPEECH 2024. The first two authors contributed equally
♻ ☆ Investigating Subtler Biases in LLMs: Ageism, Beauty, Institutional, and Nationality Bias in Generative Models ACL 2024
LLMs are increasingly powerful and widely used to assist users in a variety of tasks. This use risks the introduction of LLM biases to consequential decisions such as job hiring, human performance evaluation, and criminal sentencing. Bias in NLP systems along the lines of gender and ethnicity has been widely studied, especially for specific stereotypes (e.g., Asians are good at math). In this paper, we investigate bias along less-studied but still consequential, dimensions, such as age and beauty, measuring subtler correlated decisions that LLMs make between social groups and unrelated positive and negative attributes. We ask whether LLMs hold wide-reaching biases of positive or negative sentiment for specific social groups similar to the "what is beautiful is good" bias found in people in experimental psychology. We introduce a template-generated dataset of sentence completion tasks that asks the model to select the most appropriate attribute to complete an evaluative statement about a person described as a member of a specific social group. We also reverse the completion task to select the social group based on an attribute. We report the correlations that we find for 4 cutting-edge LLMs. This dataset can be used as a benchmark to evaluate progress in more generalized biases and the templating technique can be used to expand the benchmark with minimal additional human annotation.
comment: Camera-ready version for Findings of ACL 2024
♻ ☆ Decoding the AI Pen: Techniques and Challenges in Detecting AI-Generated Text
Large Language Models (LLMs) have revolutionized the field of Natural Language Generation (NLG) by demonstrating an impressive ability to generate human-like text. However, their widespread usage introduces challenges that necessitate thoughtful examination, ethical scrutiny, and responsible practices. In this study, we delve into these challenges, explore existing strategies for mitigating them, with a particular emphasis on identifying AI-generated text as the ultimate solution. Additionally, we assess the feasibility of detection from a theoretical perspective and propose novel research directions to address the current limitations in this domain.
♻ ☆ How are Prompts Different in Terms of Sensitivity? NAACL 2024
In-context learning (ICL) has become one of the most popular learning paradigms. While there is a growing body of literature focusing on prompt engineering, there is a lack of systematic analysis comparing the effects of prompts across different models and tasks. To address this gap, we present a comprehensive prompt analysis based on the sensitivity of a function. Our analysis reveals that sensitivity is an unsupervised proxy for model performance, as it exhibits a strong negative correlation with accuracy. We use gradient-based saliency scores to empirically demonstrate how different prompts affect the relevance of input tokens to the output, resulting in different levels of sensitivity. Furthermore, we introduce sensitivity-aware decoding which incorporates sensitivity estimation as a penalty term in the standard greedy decoding. We show that this approach is particularly helpful when information in the input is scarce. Our work provides a fresh perspective on the analysis of prompts, and contributes to a better understanding of the mechanism of ICL.
comment: NAACL 2024 Main
♻ ☆ Towards a Client-Centered Assessment of LLM Therapists by Client Simulation
Although there is a growing belief that LLMs can be used as therapists, exploring LLMs' capabilities and inefficacy, particularly from the client's perspective, is limited. This work focuses on a client-centered assessment of LLM therapists with the involvement of simulated clients, a standard approach in clinical medical education. However, there are two challenges when applying the approach to assess LLM therapists at scale. Ethically, asking humans to frequently mimic clients and exposing them to potentially harmful LLM outputs can be risky and unsafe. Technically, it can be difficult to consistently compare the performances of different LLM therapists interacting with the same client. To this end, we adopt LLMs to simulate clients and propose ClientCAST, a client-centered approach to assessing LLM therapists by client simulation. Specifically, the simulated client is utilized to interact with LLM therapists and complete questionnaires related to the interaction. Based on the questionnaire results, we assess LLM therapists from three client-centered aspects: session outcome, therapeutic alliance, and self-reported feelings. We conduct experiments to examine the reliability of ClientCAST and use it to evaluate LLMs therapists implemented by Claude-3, GPT-3.5, LLaMA3-70B, and Mixtral 8*7B. Codes are released at https://github.com/wangjs9/ClientCAST.
♻ ☆ Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems
Retrieval-Augmented Generation (RAG) improves pre-trained models by incorporating external knowledge at test time to enable customized adaptation. We study the risk of datastore leakage in Retrieval-In-Context RAG Language Models (LMs). We show that an adversary can exploit LMs' instruction-following capabilities to easily extract text data verbatim from the datastore of RAG systems built with instruction-tuned LMs via prompt injection. The vulnerability exists for a wide range of modern LMs that span Llama2, Mistral/Mixtral, Vicuna, SOLAR, WizardLM, Qwen1.5, and Platypus2, and the exploitability exacerbates as the model size scales up. Extending our study to production RAG models GPTs, we design an attack that can cause datastore leakage with a 100% success rate on 25 randomly selected customized GPTs with at most 2 queries, and we extract text data verbatim at a rate of 41% from a book of 77,000 words and 3% from a corpus of 1,569,000 words by prompting the GPTs with only 100 queries generated by themselves.
♻ ☆ Knowledge of Knowledge: Exploring Known-Unknowns Uncertainty with Large Language Models
This paper investigates the capabilities of Large Language Models (LLMs) in the context of understanding their knowledge and uncertainty over questions. Specifically, we focus on addressing known-unknown questions, characterized by high uncertainty due to the absence of definitive answers. To facilitate our study, we collect a new dataset with Known-Unknown Questions (KUQ) and establish a categorization framework to clarify the origins of uncertainty in such queries. Subsequently, we examine the performance of open-source LLMs, fine-tuned using this dataset, in distinguishing between known and unknown queries within open-ended question-answering scenarios. The fine-tuned models demonstrated a significant improvement, achieving a considerable increase in F1-score relative to their pre-fine-tuning state. Through a comprehensive analysis, we reveal insights into the models' improved uncertainty articulation and their consequent efficacy in multi-agent debates. These findings help us understand how LLMs can be trained to identify and express uncertainty, improving our knowledge of how they understand and express complex or unclear information.
♻ ☆ What Are Large Language Models Mapping to in the Brain? A Case Against Over-Reliance on Brain Scores
Given the remarkable capabilities of large language models (LLMs), there has been a growing interest in evaluating their similarity to the human brain. One approach towards quantifying this similarity is by measuring how well a model predicts neural signals, also called "brain score". Internal representations from LLMs achieve state-of-the-art brain scores, leading to speculation that they share computational principles with human language processing. This inference is only valid if the subset of neural activity predicted by LLMs reflects core elements of language processing. Here, we question this assumption by analyzing three neural datasets used in an impactful study on LLM-to-brain mappings, with a particular focus on an fMRI dataset where participants read short passages. We first find that when using shuffled train-test splits, as done in previous studies with these datasets, a trivial feature that encodes temporal autocorrelation not only outperforms LLMs but also accounts for the majority of neural variance that LLMs explain. We therefore use contiguous splits moving forward. Second, we explain the surprisingly high brain scores of untrained LLMs by showing they do not account for additional neural variance beyond two simple features: sentence length and sentence position. This undermines evidence used to claim that the transformer architecture biases computations to be more brain-like. Third, we find that brain scores of trained LLMs on this dataset can largely be explained by sentence length, position, and pronoun-dereferenced static word embeddings; a small, additional amount is explained by sense-specific embeddings and contextual representations of sentence structure. We conclude that over-reliance on brain scores can lead to over-interpretations of similarity between LLMs and brains, and emphasize the importance of deconstructing what LLMs are mapping to in neural signals.
comment: 10 pages, 4 figures in the main paper
♻ ☆ LEMMA: Towards LVLM-Enhanced Multimodal Misinformation Detection with External Knowledge Augmentation
The rise of multimodal misinformation on social platforms poses significant challenges for individuals and societies. Its increased credibility and broader impact compared to textual misinformation make detection complex, requiring robust reasoning across diverse media types and profound knowledge for accurate verification. The emergence of Large Vision Language Model (LVLM) offers a potential solution to this problem. Leveraging their proficiency in processing visual and textual information, LVLM demonstrates promising capabilities in recognizing complex information and exhibiting strong reasoning skills. In this paper, we first investigate the potential of LVLM on multimodal misinformation detection. We find that even though LVLM has a superior performance compared to LLMs, its profound reasoning may present limited power with a lack of evidence. Based on these observations, we propose LEMMA: LVLM-Enhanced Multimodal Misinformation Detection with External Knowledge Augmentation. LEMMA leverages LVLM intuition and reasoning capabilities while augmenting them with external knowledge to enhance the accuracy of misinformation detection. Our method improves the accuracy over the top baseline LVLM by 7% and 13% on Twitter and Fakeddit datasets respectively.
♻ ☆ Understanding Reasoning Ability of Language Models From the Perspective of Reasoning Paths Aggregation ICML 2024
Pre-trained language models (LMs) are able to perform complex reasoning without explicit fine-tuning. To understand how pre-training with a next-token prediction objective contributes to the emergence of such reasoning capability, we propose that we can view an LM as deriving new conclusions by aggregating indirect reasoning paths seen at pre-training time. We found this perspective effective in two important cases of reasoning: logic reasoning with knowledge graphs (KGs) and chain-of-thought (CoT) reasoning. More specifically, we formalize the reasoning paths as random walk paths on the knowledge/reasoning graphs. Analyses of learned LM distributions suggest that a weighted sum of relevant random walk path probabilities is a reasonable way to explain how LMs reason. Experiments and analysis on multiple KG and CoT datasets reveal the effect of training on random walk paths and suggest that augmenting unlabeled random walk reasoning paths can improve real-world multi-step reasoning performance. code: https://github.com/WANGXinyiLinda/LM_random_walk
comment: Accepted to ICML 2024
♻ ☆ AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters ACL
Large language models' (LLMs) abilities are drawn from their pretraining data, and model development begins with data curation. However, decisions around what data is retained or removed during this initial stage are under-scrutinized. In our work, we ground web text, which is a popular pretraining data source, to its social and geographic contexts. We create a new dataset of 10.3 million self-descriptions of website creators, and extract information about who they are and where they are from: their topical interests, social roles, and geographic affiliations. Then, we conduct the first study investigating how ten "quality" and English language identification (langID) filters affect webpages that vary along these social dimensions. Our experiments illuminate a range of implicit preferences in data curation: we show that some quality classifiers act like topical domain filters, and langID can overlook English content from some regions of the world. Overall, we hope that our work will encourage a new line of research on pretraining data curation practices and its social implications.
comment: 28 pages, 13 figures. Association for Computational Linguistics (ACL) 2024
Machine Learning 152
☆ Model Merging and Safety Alignment: One Bad Model Spoils the Bunch
Merging Large Language Models (LLMs) is a cost-effective technique for combining multiple expert LLMs into a single versatile model, retaining the expertise of the original ones. However, current approaches often overlook the importance of safety alignment during merging, leading to highly misaligned models. This work investigates the effects of model merging on alignment. We evaluate several popular model merging techniques, demonstrating that existing methods do not only transfer domain expertise but also propagate misalignment. We propose a simple two-step approach to address this problem: (i) generating synthetic safety and domain-specific data, and (ii) incorporating these generated data into the optimization process of existing data-aware model merging techniques. This allows us to treat alignment as a skill that can be maximized in the resulting merged LLM. Our experiments illustrate the effectiveness of integrating alignment-related data during merging, resulting in models that excel in both domain expertise and alignment.
comment: Under review
☆ Uncovering Latent Memories: Assessing Data Leakage and Memorization Patterns in Large Language Models
The proliferation of large language models has revolutionized natural language processing tasks, yet it raises profound concerns regarding data privacy and security. Language models are trained on extensive corpora including potentially sensitive or proprietary information, and the risk of data leakage -- where the model response reveals pieces of such information -- remains inadequately understood. This study examines susceptibility to data leakage by quantifying the phenomenon of memorization in machine learning models, focusing on the evolution of memorization patterns over training. We investigate how the statistical characteristics of training data influence the memories encoded within the model by evaluating how repetition influences memorization. We reproduce findings that the probability of memorizing a sequence scales logarithmically with the number of times it is present in the data. Furthermore, we find that sequences which are not apparently memorized after the first encounter can be uncovered throughout the course of training even without subsequent encounters. The presence of these latent memorized sequences presents a challenge for data privacy since they may be hidden at the final checkpoint of the model. To this end, we develop a diagnostic test for uncovering these latent memorized sequences by considering their cross entropy loss.
☆ Consistency Models Made Easy
Consistency models (CMs) are an emerging class of generative models that offer faster sampling than traditional diffusion models. CMs enforce that all points along a sampling trajectory are mapped to the same initial point. But this target leads to resource-intensive training: for example, as of 2024, training a SoTA CM on CIFAR-10 takes one week on 8 GPUs. In this work, we propose an alternative scheme for training CMs, vastly improving the efficiency of building such models. Specifically, by expressing CM trajectories via a particular differential equation, we argue that diffusion models can be viewed as a special case of CMs with a specific discretization. We can thus fine-tune a consistency model starting from a pre-trained diffusion model and progressively approximate the full consistency condition to stronger degrees over the training process. Our resulting method, which we term Easy Consistency Tuning (ECT), achieves vastly improved training times while indeed improving upon the quality of previous methods: for example, ECT achieves a 2-step FID of 2.73 on CIFAR10 within 1 hour on a single A100 GPU, matching Consistency Distillation trained of hundreds of GPU hours. Owing to this computational efficiency, we investigate the scaling law of CMs under ECT, showing that they seem to obey classic power law scaling, hinting at their ability to improve efficiency and performance at larger scales. Code (https://github.com/locuslab/ect) is available.
☆ Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data
One way to address safety risks from large language models (LLMs) is to censor dangerous knowledge from their training data. While this removes the explicit information, implicit information can remain scattered across various training documents. Could an LLM infer the censored knowledge by piecing together these implicit hints? As a step towards answering this question, we study inductive out-of-context reasoning (OOCR), a type of generalization in which LLMs infer latent information from evidence distributed across training documents and apply it to downstream tasks without in-context learning. Using a suite of five tasks, we demonstrate that frontier LLMs can perform inductive OOCR. In one experiment we finetune an LLM on a corpus consisting only of distances between an unknown city and other known cities. Remarkably, without in-context examples or Chain of Thought, the LLM can verbalize that the unknown city is Paris and use this fact to answer downstream questions. Further experiments show that LLMs trained only on individual coin flip outcomes can verbalize whether the coin is biased, and those trained only on pairs $(x,f(x))$ can articulate a definition of $f$ and compute inverses. While OOCR succeeds in a range of cases, we also show that it is unreliable, particularly for smaller LLMs learning complex structures. Overall, the ability of LLMs to "connect the dots" without explicit in-context learning poses a potential obstacle to monitoring and controlling the knowledge acquired by LLMs.
☆ Are LLMs Naturally Good at Synthetic Tabular Data Generation?
Large language models (LLMs) have demonstrated their prowess in generating synthetic text and images; however, their potential for generating tabular data -- arguably the most common data type in business and scientific applications -- is largely underexplored. This paper demonstrates that LLMs, used as-is, or after traditional fine-tuning, are severely inadequate as synthetic table generators. Due to the autoregressive nature of LLMs, fine-tuning with random order permutation runs counter to the importance of modeling functional dependencies, and renders LLMs unable to model conditional mixtures of distributions (key to capturing real world constraints). We showcase how LLMs can be made to overcome some of these deficiencies by making them permutation-aware.
☆ MacroHFT: Memory Augmented Context-aware Reinforcement Learning On High Frequency Trading KDD 2024
High-frequency trading (HFT) that executes algorithmic trading in short time scales, has recently occupied the majority of cryptocurrency market. Besides traditional quantitative trading methods, reinforcement learning (RL) has become another appealing approach for HFT due to its terrific ability of handling high-dimensional financial data and solving sophisticated sequential decision-making problems, \emph{e.g.,} hierarchical reinforcement learning (HRL) has shown its promising performance on second-level HFT by training a router to select only one sub-agent from the agent pool to execute the current transaction. However, existing RL methods for HFT still have some defects: 1) standard RL-based trading agents suffer from the overfitting issue, preventing them from making effective policy adjustments based on financial context; 2) due to the rapid changes in market conditions, investment decisions made by an individual agent are usually one-sided and highly biased, which might lead to significant loss in extreme markets. To tackle these problems, we propose a novel Memory Augmented Context-aware Reinforcement learning method On HFT, \emph{a.k.a.} MacroHFT, which consists of two training phases: 1) we first train multiple types of sub-agents with the market data decomposed according to various financial indicators, specifically market trend and volatility, where each agent owns a conditional adapter to adjust its trading policy according to market conditions; 2) then we train a hyper-agent to mix the decisions from these sub-agents and output a consistently profitable meta-policy to handle rapid market fluctuations, equipped with a memory mechanism to enhance the capability of decision-making. Extensive experiments on various cryptocurrency markets demonstrate that MacroHFT can achieve state-of-the-art performance on minute-level trading tasks.
comment: Accepted to KDD 2024
☆ RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold
Training on model-generated synthetic data is a promising approach for finetuning LLMs, but it remains unclear when it helps or hurts. In this paper, we investigate this question for math reasoning via an empirical study, followed by building a conceptual understanding of our observations. First, we find that while the typical approach of finetuning a model on synthetic correct or positive problem-solution pairs generated by capable models offers modest performance gains, sampling more correct solutions from the finetuned learner itself followed by subsequent fine-tuning on this self-generated data $\textbf{doubles}$ the efficiency of the same synthetic problems. At the same time, training on model-generated positives can amplify various spurious correlations, resulting in flat or even inverse scaling trends as the amount of data increases. Surprisingly, we find that several of these issues can be addressed if we also utilize negative responses, i.e., model-generated responses that are deemed incorrect by a final answer verifier. Crucially, these negatives must be constructed such that the training can appropriately recover the utility or advantage of each intermediate step in the negative response. With this per-step scheme, we are able to attain consistent gains over only positive data, attaining performance similar to amplifying the amount of synthetic data by $\mathbf{8 \times}$. We show that training on per-step negatives can help to unlearn spurious correlations in the positive data, and is equivalent to advantage-weighted reinforcement learning (RL), implying that it inherits robustness benefits of RL over imitating positive data alone.
☆ A Benchmarking Study of Kolmogorov-Arnold Networks on Tabular Data
Kolmogorov-Arnold Networks (KANs) have very recently been introduced into the world of machine learning, quickly capturing the attention of the entire community. However, KANs have mostly been tested for approximating complex functions or processing synthetic data, while a test on real-world tabular datasets is currently lacking. In this paper, we present a benchmarking study comparing KANs and Multi-Layer Perceptrons (MLPs) on tabular datasets. The study evaluates task performance and training times. From the results obtained on the various datasets, KANs demonstrate superior or comparable accuracy and F1 scores, excelling particularly in datasets with numerous instances, suggesting robust handling of complex data. We also highlight that this performance improvement of KANs comes with a higher computational cost when compared to MLPs of comparable sizes.
☆ DeciMamba: Exploring the Length Extrapolation Potential of Mamba
Long-range sequence processing poses a significant challenge for Transformers due to their quadratic complexity in input length. A promising alternative is Mamba, which demonstrates high performance and achieves Transformer-level capabilities while requiring substantially fewer computational resources. In this paper we explore the length-generalization capabilities of Mamba, which we find to be relatively limited. Through a series of visualizations and analyses we identify that the limitations arise from a restricted effective receptive field, dictated by the sequence length used during training. To address this constraint, we introduce DeciMamba, a context-extension method specifically designed for Mamba. This mechanism, built on top of a hidden filtering mechanism embedded within the S6 layer, enables the trained model to extrapolate well even without additional training. Empirical experiments over real-world long-range NLP tasks show that DeciMamba can extrapolate to context lengths that are 25x times longer than the ones seen during training, and does so without utilizing additional computational resources. We will release our code and models.
comment: Link To Official Implementation: https://github.com/assafbk/DeciMamba
☆ Towards evolution of Deep Neural Networks through contrastive Self-Supervised learning
Deep Neural Networks (DNNs) have been successfully applied to a wide range of problems. However, two main limitations are commonly pointed out. The first one is that they require long time to design. The other is that they heavily rely on labelled data, which can sometimes be costly and hard to obtain. In order to address the first problem, neuroevolution has been proved to be a plausible option to automate the design of DNNs. As for the second problem, self-supervised learning has been used to leverage unlabelled data to learn representations. Our goal is to study how neuroevolution can help self-supervised learning to bridge the gap to supervised learning in terms of performance. In this work, we propose a framework that is able to evolve deep neural networks using self-supervised learning. Our results on the CIFAR-10 dataset show that it is possible to evolve adequate neural networks while reducing the reliance on labelled data. Moreover, an analysis to the structure of the evolved networks suggests that the amount of labelled data fed to them has less effect on the structure of networks that learned via self-supervised learning, when compared to individuals that relied on supervised learning.
comment: IEEE World Congress on Computational Intelligence (WCCI) 2024; Keywords: NeuroEvolution, Deep Learning, Evolutionary Machine Learning
☆ Fantastic Copyrighted Beasts and How (Not) to Generate Them
Recent studies show that image and video generation models can be prompted to reproduce copyrighted content from their training data, raising serious legal concerns around copyright infringement. Copyrighted characters, in particular, pose a difficult challenge for image generation services, with at least one lawsuit already awarding damages based on the generation of these characters. Yet, little research has empirically examined this issue. We conduct a systematic evaluation to fill this gap. First, we build CopyCat, an evaluation suite consisting of diverse copyrighted characters and a novel evaluation pipeline. Our evaluation considers both the detection of similarity to copyrighted characters and generated image's consistency with user input. Our evaluation systematically shows that both image and video generation models can still generate characters even if characters' names are not explicitly mentioned in the prompt, sometimes with only two generic keywords (e.g., prompting with "videogame, plumber" consistently generates Nintendo's Mario character). We then introduce techniques to semi-automatically identify such keywords or descriptions that trigger character generation. Using our evaluation suite, we study runtime mitigation strategies, including both existing methods and new strategies we propose. Our findings reveal that commonly employed strategies, such as prompt rewriting in the DALL-E system, are not sufficient as standalone guardrails. These strategies must be coupled with other approaches, like negative prompting, to effectively reduce the unintended generation of copyrighted characters. Our work provides empirical grounding to the discussion of copyright mitigation strategies and offers actionable insights for model deployers actively implementing them.
☆ PostMark: A Robust Blackbox Watermark for Large Language Models
The most effective techniques to detect LLM-generated text rely on inserting a detectable signature -- or watermark -- during the model's decoding process. Most existing watermarking methods require access to the underlying LLM's logits, which LLM API providers are loath to share due to fears of model distillation. As such, these watermarks must be implemented independently by each LLM provider. In this paper, we develop PostMark, a modular post-hoc watermarking procedure in which an input-dependent set of words (determined via a semantic embedding) is inserted into the text after the decoding process has completed. Critically, PostMark does not require logit access, which means it can be implemented by a third party. We also show that PostMark is more robust to paraphrasing attacks than existing watermarking methods: our experiments cover eight baseline algorithms, five base LLMs, and three datasets. Finally, we evaluate the impact of PostMark on text quality using both automated and human assessments, highlighting the trade-off between quality and robustness to paraphrasing. We release our code, outputs, and annotations at https://github.com/lilakk/PostMark.
comment: preprint; 18 pages, 5 figures
☆ On Newton's Method to Unlearn Neural Networks
Machine unlearning facilitates personal data ownership, including the ``right to be forgotten''. The proliferation of applications of \emph{neural networks} (NNs) trained on users' personal data calls for the need to develop algorithms to unlearn an NN. Since retraining is costly, efficiency is often achieved through approximate unlearning which aims to unlearn a trained NN to be close to the retrained one (in distribution). Though the Newton's method has been used by previous works to approximately unlearn linear models, adapting it for unlearning an NN often encounters degenerate Hessians that make computing the Newton's update impossible. In this paper, we will first show that when coupled with naive yet often effective solutions to mitigate the degeneracy issue for unlearning, the Newton's method surprisingly suffers from catastrophic forgetting. To overcome this difficulty, we revise the Newton's method to include a theoretically justified regularizer and propose a cubic-regularized Newton's method for unlearning an NN. The cubic regularizer comes with the benefits of not requiring manual finetuning and affording a natural interpretation. Empirical evaluation on several models and real-world datasets shows that our method is more resilient to catastrophic forgetting and performs better than the baselines, especially in sequential unlearning.
☆ rKAN: Rational Kolmogorov-Arnold Networks
The development of Kolmogorov-Arnold networks (KANs) marks a significant shift from traditional multi-layer perceptrons in deep learning. Initially, KANs employed B-spline curves as their primary basis function, but their inherent complexity posed implementation challenges. Consequently, researchers have explored alternative basis functions such as Wavelets, Polynomials, and Fractional functions. In this research, we explore the use of rational functions as a novel basis function for KANs. We propose two different approaches based on Pade approximation and rational Jacobi functions as trainable basis functions, establishing the rational KAN (rKAN). We then evaluate rKAN's performance in various deep learning and physics-informed tasks to demonstrate its practicality and effectiveness in function approximation.
comment: The implementations are available at https://github.com/alirezaafzalaghaei/rKAN
☆ Valid Error Bars for Neural Weather Models using Conformal Prediction
Neural weather models have shown immense potential as inexpensive and accurate alternatives to physics-based models. However, most models trained to perform weather forecasting do not quantify the uncertainty associated with their forecasts. This limits the trust in the model and the usefulness of the forecasts. In this work we construct and formalise a conformal prediction framework as a post-processing method for estimating this uncertainty. The method is model-agnostic and gives calibrated error bounds for all variables, lead times and spatial locations. No modifications are required to the model and the computational cost is negligible compared to model training. We demonstrate the usefulness of the conformal prediction framework on a limited area neural weather model for the Nordic region. We further explore the advantages of the framework for deterministic and probabilistic models.
☆ Revealing Vision-Language Integration in the Brain with Multimodal Networks ICML 2024
We use (multi)modal deep neural networks (DNNs) to probe for sites of multimodal integration in the human brain by predicting stereoencephalography (SEEG) recordings taken while human subjects watched movies. We operationalize sites of multimodal integration as regions where a multimodal vision-language model predicts recordings better than unimodal language, unimodal vision, or linearly-integrated language-vision models. Our target DNN models span different architectures (e.g., convolutional networks and transformers) and multimodal training techniques (e.g., cross-attention and contrastive learning). As a key enabling step, we first demonstrate that trained vision and language models systematically outperform their randomly initialized counterparts in their ability to predict SEEG signals. We then compare unimodal and multimodal models against one another. Because our target DNN models often have different architectures, number of parameters, and training sets (possibly obscuring those differences attributable to integration), we carry out a controlled comparison of two models (SLIP and SimCLR), which keep all of these attributes the same aside from input modality. Using this approach, we identify a sizable number of neural sites (on average 141 out of 1090 total sites or 12.94%) and brain regions where multimodal integration seems to occur. Additionally, we find that among the variants of multimodal training techniques we assess, CLIP-style training is the best suited for downstream prediction of the neural activity in these sites.
comment: ICML 2024; 23 pages, 11 figures
☆ On Layer-wise Representation Similarity: Application for Multi-Exit Models with a Single Classifier
Analyzing the similarity of internal representations within and across different models has been an important technique for understanding the behavior of deep neural networks. Most existing methods for analyzing the similarity between representations of high dimensions, such as those based on Canonical Correlation Analysis (CCA) and widely used Centered Kernel Alignment (CKA), rely on statistical properties of the representations for a set of data points. In this paper, we focus on transformer models and study the similarity of representations between the hidden layers of individual transformers. In this context, we show that a simple sample-wise cosine similarity metric is capable of capturing the similarity and aligns with the complicated CKA. Our experimental results on common transformers reveal that representations across layers are positively correlated, albeit the similarity decreases when layers are far apart. We then propose an aligned training approach to enhance the similarity between internal representations, with trained models that enjoy the following properties: (1) the last-layer classifier can be directly applied right after any hidden layers, yielding intermediate layer accuracies much higher than those under standard training, (2) the layer-wise accuracies monotonically increase and reveal the minimal depth needed for the given task, (3) when served as multi-exit models, they achieve on-par performance with standard multi-exit architectures which consist of additional classifiers designed for early exiting in shallow layers. To our knowledge, our work is the first to show that one common classifier is sufficient for multi-exit models. We conduct experiments on both vision and NLP tasks to demonstrate the performance of the proposed aligned training.
☆ Toward data-driven research: preliminary study to predict surface roughness in material extrusion using previously published data with Machine Learning
Material extrusion is one of the most commonly used approaches within the additive manufacturing processes available. Despite its popularity and related technical advancements, process reliability and quality assurance remain only partially solved. In particular, the surface roughness caused by this process is a key concern. To solve this constraint, experimental plans have been exploited to optimize surface roughness in recent years. However, the latter empirical trial and error process is extremely time- and resource-consuming. Thus, this study aims to avoid using large experimental programs to optimize surface roughness in material extrusion. Methodology. This research provides an in-depth analysis of the effect of several printing parameters: layer height, printing temperature, printing speed and wall thickness. The proposed data-driven predictive modeling approach takes advantage of Machine Learning models to automatically predict surface roughness based on the data gathered from the literature and the experimental data generated for testing. Findings. Using 10-fold cross-validation of data gathered from the literature, the proposed Machine Learning solution attains a 0.93 correlation with a mean absolute percentage error of 13 %. When testing with our own data, the correlation diminishes to 0.79 and the mean absolute percentage error reduces to 8 %. Thus, the solution for predicting surface roughness in extrusion-based printing offers competitive results regarding the variability of the analyzed factors. Originality. As available manufacturing data continue to increase on a daily basis, the ability to learn from these large volumes of data is critical in future manufacturing and science. Specifically, the power of Machine Learning helps model surface roughness with limited experimental tests.
☆ Data-Centric AI in the Age of Large Language Models
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs). We start by making the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs, and yet it receives disproportionally low attention from the research community. We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization. In each scenario, we underscore the importance of data, highlight promising research directions, and articulate the potential impacts on the research community and, where applicable, the society as a whole. For instance, we advocate for a suite of data-centric benchmarks tailored to the scale and complexity of data for LLMs. These benchmarks can be used to develop new data curation methods and document research efforts and results, which can help promote openness and transparency in AI and LLM research.
comment: Preprint
☆ Fusion of Movement and Naive Predictions for Point Forecasting in Univariate Random Walks
Traditional methods for point forecasting in univariate random walks often fail to surpass naive benchmarks due to data unpredictability. This study introduces a novel forecasting method that fuses movement prediction (binary classification) with naive forecasts for accurate one-step-ahead point forecasting. The method's efficacy is demonstrated through theoretical analysis, simulations, and real-world data experiments. It reliably exceeds naive forecasts with movement prediction accuracies as low as 0.55, outperforming baseline models like ARIMA, linear regression, MLP, and LSTM networks in forecasting the S\&P 500 index and Bitcoin prices. This method is particularly advantageous when accurate point predictions are challenging but accurate movement predictions are attainable, translating movement predictions into point forecasts in random walk contexts.
☆ Centimeter Positioning Accuracy using AI/ML for 6G Applications ICML
This research looks at using AI/ML to achieve centimeter-level user positioning in 6G applications such as the Industrial Internet of Things (IIoT). Initial results show that our AI/ML-based method can estimate user positions with an accuracy of 17 cm in an indoor factory environment. In this proposal, we highlight our approaches and future directions.
comment: 2 Pages, 2 Figures, ICMLCN Conference, Stockholm, Sweden
☆ Capturing Temporal Components for Time Series Classification
Analyzing sequential data is crucial in many domains, particularly due to the abundance of data collected from the Internet of Things paradigm. Time series classification, the task of categorizing sequential data, has gained prominence, with machine learning approaches demonstrating remarkable performance on public benchmark datasets. However, progress has primarily been in designing architectures for learning representations from raw data at fixed (or ideal) time scales, which can fail to generalize to longer sequences. This work introduces a \textit{compositional representation learning} approach trained on statistically coherent components extracted from sequential data. Based on a multi-scale change space, an unsupervised approach is proposed to segment the sequential data into chunks with similar statistical properties. A sequence-based encoder model is trained in a multi-task setting to learn compositional representations from these temporal components for time series classification. We demonstrate its effectiveness through extensive experiments on publicly available time series classification benchmarks. Evaluating the coherence of segmented components shows its competitive performance on the unsupervised segmentation task.
comment: 15 pages, 2 figures, 4 tables
☆ Maintenance Required: Updating and Extending Bootstrapped Human Activity Recognition Systems for Smart Homes
Developing human activity recognition (HAR) systems for smart homes is not straightforward due to varied layouts of the homes and their personalized settings, as well as idiosyncratic behaviors of residents. As such, off-the-shelf HAR systems are effective in limited capacity for an individual home, and HAR systems often need to be derived "from scratch", which comes with substantial efforts and often is burdensome to the resident. Previous work has successfully targeted the initial phase. At the end of this initial phase, we identify seed points. We build on bootstrapped HAR systems and introduce an effective updating and extension procedure for continuous improvement of HAR systems with the aim of keeping up with ever changing life circumstances. Our method makes use of the seed points identified at the end of the initial bootstrapping phase. A contrastive learning framework is trained using these seed points and labels obtained for the same. This model is then used to improve the segmentation accuracy of the identified prominent activities. Improvements in the activity recognition system through this procedure help model the majority of the routine activities in the smart home. We demonstrate the effectiveness of our procedure through experiments on the CASAS datasets that show the practical value of our approach.
comment: 12 pages, 5 figures, accepted at The 6th International Conference on Activity and Behavior Computing, under print at IEEE Explore
☆ Graph Representation Learning Strategies for Omics Data: A Case Study on Parkinson's Disease
Omics data analysis is crucial for studying complex diseases, but its high dimensionality and heterogeneity challenge classical statistical and machine learning methods. Graph neural networks have emerged as promising alternatives, yet the optimal strategies for their design and optimization in real-world biomedical challenges remain unclear. This study evaluates various graph representation learning models for case-control classification using high-throughput biological data from Parkinson's disease and control samples. We compare topologies derived from sample similarity networks and molecular interaction networks, including protein-protein and metabolite-metabolite interactions (PPI, MMI). Graph Convolutional Network (GCNs), Chebyshev spectral graph convolution (ChebyNet), and Graph Attention Network (GAT), are evaluated alongside advanced architectures like graph transformers, the graph U-net, and simpler models like multilayer perceptron (MLP). These models are systematically applied to transcriptomics and metabolomics data independently. Our comparative analysis highlights the benefits and limitations of various architectures in extracting patterns from omics data, paving the way for more accurate and interpretable models in biomedical research.
comment: Submitted to Machine Learning in Computational Biology 2024 as an extended abstract, 2 pages + 1 appendix
☆ CollaFuse: Collaborative Diffusion Models
In the landscape of generative artificial intelligence, diffusion-based models have emerged as a promising method for generating synthetic images. However, the application of diffusion models poses numerous challenges, particularly concerning data availability, computational requirements, and privacy. Traditional approaches to address these shortcomings, like federated learning, often impose significant computational burdens on individual clients, especially those with constrained resources. In response to these challenges, we introduce a novel approach for distributed collaborative diffusion models inspired by split learning. Our approach facilitates collaborative training of diffusion models while alleviating client computational burdens during image synthesis. This reduced computational burden is achieved by retaining data and computationally inexpensive processes locally at each client while outsourcing the computationally expensive processes to shared, more efficient server resources. Through experiments on the common CelebA dataset, our approach demonstrates enhanced privacy by reducing the necessity for sharing raw data. These capabilities hold significant potential across various application areas, including the design of edge computing solutions. Thus, our work advances distributed machine learning by contributing to the evolution of collaborative diffusion models.
comment: 13 pages, 7 figures
☆ Transferable Boltzmann Generators
The generation of equilibrium samples of molecular systems has been a long-standing problem in statistical physics. Boltzmann Generators are a generative machine learning method that addresses this issue by learning a transformation via a normalizing flow from a simple prior distribution to the target Boltzmann distribution of interest. Recently, flow matching has been employed to train Boltzmann Generators for small molecular systems in Cartesian coordinates. We extend this work and propose a first framework for Boltzmann Generators that are transferable across chemical space, such that they predict zero-shot Boltzmann distributions for test molecules without being retrained for these systems. These transferable Boltzmann Generators allow approximate sampling from the target distribution of unseen systems, as well as efficient reweighting to the target Boltzmann distribution. The transferability of the proposed framework is evaluated on dipeptides, where we show that it generalizes efficiently to unseen systems. Furthermore, we demonstrate that our proposed architecture enhances the efficiency of Boltzmann Generators trained on single molecular systems.
☆ SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages
Question Answering (QA) datasets have been instrumental in developing and evaluating Large Language Model (LLM) capabilities. However, such datasets are scarce for languages other than English due to the cost and difficulties of collection and manual annotation. This means that producing novel models and measuring the performance of multilingual LLMs in low-resource languages is challenging. To mitigate this, we propose $\textbf{S}$yn$\textbf{DAR}$in, a method for generating and validating QA datasets for low-resource languages. We utilize parallel content mining to obtain $\textit{human-curated}$ paragraphs between English and the target language. We use the English data as context to $\textit{generate}$ synthetic multiple-choice (MC) question-answer pairs, which are automatically translated and further validated for quality. Combining these with their designated non-English $\textit{human-curated}$ paragraphs form the final QA dataset. The method allows to maintain the content quality, reduces the likelihood of factual errors, and circumvents the need for costly annotation. To test the method, we created a QA dataset with $1.2$K samples for the Armenian language. The human evaluation shows that $98\%$ of the generated English data maintains quality and diversity in the question types and topics, while the translation validation pipeline can filter out $\sim70\%$ of data with poor quality. We use the dataset to benchmark state-of-the-art LLMs, showing their inability to achieve human accuracy with some model performances closer to random chance. This shows that the generated dataset is non-trivial and can be used to evaluate reasoning capabilities in low-resource language.
☆ CascadeServe: Unlocking Model Cascades for Inference Serving
Machine learning (ML) models are increasingly deployed to production, calling for efficient inference serving systems. Efficient inference serving is complicated by two challenges: (i) ML models incur high computational costs, and (ii) the request arrival rates of practical applications have frequent, high, and sudden variations which make it hard to correctly provision hardware. Model cascades are positioned to tackle both of these challenges, as they (i) save work while maintaining accuracy, and (ii) expose a high-resolution trade-off between work and accuracy, allowing for fine-grained adjustments to request arrival rates. Despite their potential, model cascades haven't been used inside an online serving system. This comes with its own set of challenges, including workload adaption, model replication onto hardware, inference scheduling, request batching, and more. In this work, we propose CascadeServe, which automates and optimizes end-to-end inference serving with cascades. CascadeServe operates in an offline and online phase. In the offline phase, the system pre-computes a gear plan that specifies how to serve inferences online. In the online phase, the gear plan allows the system to serve inferences while making near-optimal adaptations to the query load at negligible decision overheads. We find that CascadeServe saves 2-3x in cost across a wide spectrum of the latency-accuracy space when compared to state-of-the-art baselines on different workloads.
comment: 17 pages, 13 figures
☆ Communication-efficient Vertical Federated Learning via Compressed Error Feedback
Communication overhead is a known bottleneck in federated learning (FL). To address this, lossy compression is commonly used on the information communicated between the server and clients during training. In horizontal FL, where each client holds a subset of the samples, such communication-compressed training methods have recently seen significant progress. However, in their vertical FL counterparts, where each client holds a subset of the features, our understanding remains limited. To address this, we propose an error feedback compressed vertical federated learning (EFVFL) method to train split neural networks. In contrast with previous communication-compressed methods for vertical FL, EFVFL does not require a vanishing compression error for the gradient norm to converge to zero for smooth nonconvex problems. By leveraging error feedback, our method can achieve a $\mathcal{O}(1/T)$ convergence rate in the full-batch case, improving over the state-of-the-art $\mathcal{O}(1/\sqrt{T})$ rate under $\mathcal{O}(1/\sqrt{T})$ compression error, and matching the rate of uncompressed methods. Further, when the objective function satisfies the Polyak-{\L}ojasiewicz inequality, our method converges linearly. In addition to improving convergence rates, our method also supports the use of private labels. Numerical experiments show that EFVFL significantly improves over the prior art, confirming our theoretical results.
☆ Vectorized Representation Dreamer (VRD): Dreaming-Assisted Multi-Agent Motion-Forecasting
For an autonomous vehicle to plan a path in its environment, it must be able to accurately forecast the trajectory of all dynamic objects in its proximity. While many traditional methods encode observations in the scene to solve this problem, there are few approaches that consider the effect of the ego vehicle's behavior on the future state of the world. In this paper, we introduce VRD, a vectorized world model-inspired approach to the multi-agent motion forecasting problem. Our method combines a traditional open-loop training regime with a novel dreamed closed-loop training pipeline that leverages a kinematic reconstruction task to imagine the trajectory of all agents, conditioned on the action of the ego vehicle. Quantitative and qualitative experiments are conducted on the Argoverse 2 multi-world forecasting evaluation dataset and the intersection drone (inD) dataset to demonstrate the performance of our proposed model. Our model achieves state-of-the-art performance on the single prediction miss rate metric on the Argoverse 2 dataset and performs on par with the leading models for the single prediction displacement metrics.
comment: Accepted for publication in IEEE Intelligent Vehicle Symposium (IV 2024)
☆ FVEL: Interactive Formal Verification Environment with Large Language Models via Theorem Proving
Formal verification (FV) has witnessed growing significance with current emerging program synthesis by the evolving large language models (LLMs). However, current formal verification mainly resorts to symbolic verifiers or hand-craft rules, resulting in limitations for extensive and flexible verification. On the other hand, formal languages for automated theorem proving, such as Isabelle, as another line of rigorous verification, are maintained with comprehensive rules and theorems. In this paper, we propose FVEL, an interactive Formal Verification Environment with LLMs. Specifically, FVEL transforms a given code to be verified into Isabelle, and then conducts verification via neural automated theorem proving with an LLM. The joined paradigm leverages the rigorous yet abundant formulated and organized rules in Isabelle and is also convenient for introducing and adjusting cutting-edge LLMs. To achieve this goal, we extract a large-scale FVELER3. The FVELER dataset includes code dependencies and verification processes that are formulated in Isabelle, containing 758 theories, 29,125 lemmas, and 200,646 proof steps in total with in-depth dependencies. We benchmark FVELER in the FVEL environment by first fine-tuning LLMs with FVELER and then evaluating them on Code2Inv and SV-COMP. The results show that FVEL with FVELER fine-tuned Llama3- 8B solves 17.39% (69 -> 81) more problems, and Mistral-7B 12% (75 -> 84) more problems in SV-COMP. And the proportion of proof errors is reduced. Project page: https://fveler.github.io/.
☆ Predicting Probabilities of Error to Combine Quantization and Early Exiting: QuEE
Machine learning models can solve complex tasks but often require significant computational resources during inference. This has led to the development of various post-training computation reduction methods that tackle this issue in different ways, such as quantization which reduces the precision of weights and arithmetic operations, and dynamic networks which adapt computation to the sample at hand. In this work, we propose a more general dynamic network that can combine both quantization and early exit dynamic network: QuEE. Our algorithm can be seen as a form of soft early exiting or input-dependent compression. Rather than a binary decision between exiting or continuing, we introduce the possibility of continuing with reduced computation. This complicates the traditionally considered early exiting problem, which we solve through a principled formulation. The crucial factor of our approach is accurate prediction of the potential accuracy improvement achievable through further computation. We demonstrate the effectiveness of our method through empirical evaluation, as well as exploring the conditions for its success on 4 classification datasets.
☆ Fair Streaming Feature Selection
Streaming feature selection techniques have become essential in processing real-time data streams, as they facilitate the identification of the most relevant attributes from continuously updating information. Despite their performance, current algorithms to streaming feature selection frequently fall short in managing biases and avoiding discrimination that could be perpetuated by sensitive attributes, potentially leading to unfair outcomes in the resulting models. To address this issue, we propose FairSFS, a novel algorithm for Fair Streaming Feature Selection, to uphold fairness in the feature selection process without compromising the ability to handle data in an online manner. FairSFS adapts to incoming feature vectors by dynamically adjusting the feature set and discerns the correlations between classification attributes and sensitive attributes from this revised set, thereby forestalling the propagation of sensitive data. Empirical evaluations show that FairSFS not only maintains accuracy that is on par with leading streaming feature selection methods and existing fair feature techniques but also significantly improves fairness metrics.
comment: 30 pages, 10 figures
☆ WEATHER-5K: A Large-scale Global Station Weather Dataset Towards Comprehensive Time-series Forecasting Benchmark
Global Station Weather Forecasting (GSWF) is crucial for various sectors, including aviation, agriculture, energy, and disaster preparedness. Recent advancements in deep learning have significantly improved the accuracy of weather predictions by optimizing models based on public meteorological data. However, existing public datasets for GSWF optimization and benchmarking still suffer from significant limitations, such as small sizes, limited temporal coverage, and a lack of comprehensive variables. These shortcomings prevent them from effectively reflecting the benchmarks of current forecasting methods and fail to support the real needs of operational weather forecasting. To address these challenges, we present the WEATHER-5K dataset. This dataset comprises a comprehensive collection of data from 5,672 weather stations worldwide, spanning a 10-year period with one-hour intervals. It includes multiple crucial weather elements, providing a more reliable and interpretable resource for forecasting. Furthermore, our WEATHER-5K dataset can serve as a benchmark for comprehensively evaluating existing well-known forecasting models, extending beyond GSWF methods to support future time-series research challenges and opportunities. The dataset and benchmark implementation are publicly available at: https://github.com/taohan10200/WEATHER-5K.
comment: 26 pages,13 figures
☆ Jailbreaking as a Reward Misspecification Problem
The widespread adoption of large language models (LLMs) has raised concerns about their safety and reliability, particularly regarding their vulnerability to adversarial attacks. In this paper, we propose a novel perspective that attributes this vulnerability to reward misspecification during the alignment process. We introduce a metric ReGap to quantify the extent of reward misspecification and demonstrate its effectiveness and robustness in detecting harmful backdoor prompts. Building upon these insights, we present ReMiss, a system for automated red teaming that generates adversarial prompts against various target aligned LLMs. ReMiss achieves state-of-the-art attack success rates on the AdvBench benchmark while preserving the human readability of the generated prompts. Detailed analysis highlights the unique advantages brought by the proposed reward misspecification objective compared to previous methods.
☆ Active Diffusion Subsampling
Subsampling is commonly used to mitigate costs associated with data acquisition, such as time or energy requirements, motivating the development of algorithms for estimating the fully-sampled signal of interest $x$ from partially observed measurements $y$. In maximum-entropy sampling, one selects measurement locations that are expected to have the highest entropy, so as to minimize uncertainty about $x$. This approach relies on an accurate model of the posterior distribution over future measurements, given the measurements observed so far. Recently, diffusion models have been shown to produce high-quality posterior samples of high-dimensional signals using guided diffusion. In this work, we propose Active Diffusion Subsampling (ADS), a method for performing active subsampling using guided diffusion in which the model tracks a distribution of beliefs over the true state of $x$ throughout the reverse diffusion process, progressively decreasing its uncertainty by choosing to acquire measurements with maximum expected entropy, and ultimately generating the posterior distribution $p(x | y)$. ADS can be applied using pre-trained diffusion models for any subsampling rate, and does not require task-specific retraining - just the specification of a measurement model. Furthermore, the maximum entropy sampling policy employed by ADS is interpretable, enhancing transparency relative to existing methods using black-box policies. Experimentally, we show that ADS outperforms fixed sampling strategies, and study an application of ADS in Magnetic Resonance Imaging acceleration using the fastMRI dataset, finding that ADS performs competitively with supervised methods. Code available at https://active-diffusion-subsampling.github.io/.
comment: 17 pages, 12 figures, preprint
☆ Estimating Treatment Effects under Recommender Interference: A Structured Neural Networks Approach
Recommender systems are essential for content-sharing platforms by curating personalized content. To evaluate updates of recommender systems targeting content creators, platforms frequently engage in creator-side randomized experiments to estimate treatment effect, defined as the difference in outcomes when a new (vs. the status quo) algorithm is deployed on the platform. We show that the standard difference-in-means estimator can lead to a biased treatment effect estimate. This bias arises because of recommender interference, which occurs when treated and control creators compete for exposure through the recommender system. We propose a "recommender choice model" that captures how an item is chosen among a pool comprised of both treated and control content items. By combining a structural choice model with neural networks, the framework directly models the interference pathway in a microfounded way while accounting for rich viewer-content heterogeneity. Using the model, we construct a double/debiased estimator of the treatment effect that is consistent and asymptotically normal. We demonstrate its empirical performance with a field experiment on Weixin short-video platform: besides the standard creator-side experiment, we carry out a costly blocked double-sided randomization design to obtain a benchmark estimate without interference bias. We show that the proposed estimator significantly reduces the bias in treatment effect estimates compared to the standard difference-in-means estimator.
☆ Computation-Efficient Semi-Supervised Learning for ECG-based Cardiovascular Diseases Detection
Label scarcity problem is the main challenge that hinders the wide application of deep learning systems in automatic cardiovascular diseases (CVDs) detection using electrocardiography (ECG). Tuning pre-trained models alleviates this problem by transferring knowledge learned from large datasets to downstream small datasets. However, bottlenecks in computational efficiency and CVDs detection performance limit its clinical applications. It is difficult to improve the detection performance without significantly sacrificing model computational efficiency. Here, we propose a computation-efficient semi-supervised learning paradigm (FastECG) for robust and computation-efficient CVDs detection using ECG. It enables a robust adaptation of pre-trained models on downstream datasets with limited supervision and high computational efficiency. First, a random-deactivation technique is developed to achieve robust and fast low-rank adaptation of pre-trained weights. Subsequently, we propose a one-shot rank allocation module to determine the optimal ranks for the update matrices of the pre-trained weights. Finally, a lightweight semi-supervised learning pipeline is introduced to enhance model performance by leveraging labeled and unlabeled data with high computational efficiency. Extensive experiments on four downstream ECG datasets demonstrate that FastECG not only outperforms the state-of-the-art methods in multi-label CVDs detection but also consumes fewer GPU footprints, training time, and parameter storage space. As such, this paradigm provides an effective solution for achieving high computational efficiency and robust detection performance in the clinical applications of pre-trained models under limited supervision.
☆ Communication-Efficient Byzantine-Resilient Federated Zero-Order Optimization
We introduce CYBER-0, the first zero-order optimization algorithm for memory-and-communication efficient Federated Learning, resilient to Byzantine faults. We show through extensive numerical experiments on the MNIST dataset and finetuning RoBERTa-Large that CYBER-0 outperforms state-of-the-art algorithms in terms of communication and memory efficiency while reaching similar accuracy. We provide theoretical guarantees on its convergence for convex loss functions.
☆ Automatic Labels are as Effective as Manual Labels in Biomedical Images Classification with Deep Learning
The increasing availability of biomedical data is helping to design more robust deep learning (DL) algorithms to analyze biomedical samples. Currently, one of the main limitations to train DL algorithms to perform a specific task is the need for medical experts to label data. Automatic methods to label data exist, however automatic labels can be noisy and it is not completely clear when automatic labels can be adopted to train DL models. This paper aims to investigate under which circumstances automatic labels can be adopted to train a DL model on the classification of Whole Slide Images (WSI). The analysis involves multiple architectures, such as Convolutional Neural Networks (CNN) and Vision Transformer (ViT), and over 10000 WSIs, collected from three use cases: celiac disease, lung cancer and colon cancer, which one including respectively binary, multiclass and multilabel data. The results allow identifying 10% as the percentage of noisy labels that lead to train competitive models for the classification of WSIs. Therefore, an algorithm generating automatic labels needs to fit this criterion to be adopted. The application of the Semantic Knowledge Extractor Tool (SKET) algorithm to generate automatic labels leads to performance comparable to the one obtained with manual labels, since it generates a percentage of noisy labels between 2-5%. Automatic labels are as effective as manual ones, reaching solid performance comparable to the one obtained training models with manual labels.
comment: pre-print of the journal paper
☆ Can you trust your explanations? A robustness test for feature attribution methods
The increase of legislative concerns towards the usage of Artificial Intelligence (AI) has recently led to a series of regulations striving for a more transparent, trustworthy and accountable AI. Along with these proposals, the field of Explainable AI (XAI) has seen a rapid growth but the usage of its techniques has at times led to unexpected results. The robustness of the approaches is, in fact, a key property often overlooked: it is necessary to evaluate the stability of an explanation (to random and adversarial perturbations) to ensure that the results are trustable. To this end, we propose a test to evaluate the robustness to non-adversarial perturbations and an ensemble approach to analyse more in depth the robustness of XAI methods applied to neural networks and tabular datasets. We will show how leveraging manifold hypothesis and ensemble approaches can be beneficial to an in-depth analysis of the robustness.
comment: 8 pages, 3 figures
☆ $\nabla^2$DFT: A Universal Quantum Chemistry Dataset of Drug-Like Molecules and a Benchmark for Neural Network Potentials
Methods of computational quantum chemistry provide accurate approximations of molecular properties crucial for computer-aided drug discovery and other areas of chemical science. However, high computational complexity limits the scalability of their applications. Neural network potentials (NNPs) are a promising alternative to quantum chemistry methods, but they require large and diverse datasets for training. This work presents a new dataset and benchmark called $\nabla^2$DFT that is based on the nablaDFT. It contains twice as much molecular structures, three times more conformations, new data types and tasks, and state-of-the-art models. The dataset includes energies, forces, 17 molecular properties, Hamiltonian and overlap matrices, and a wavefunction object. All calculations were performed at the DFT level ($\omega$B97X-D/def2-SVP) for each conformation. Moreover, $\nabla^2$DFT is the first dataset that contains relaxation trajectories for a substantial number of drug-like molecules. We also introduce a novel benchmark for evaluating NNPs in molecular property prediction, Hamiltonian prediction, and conformational optimization tasks. Finally, we propose an extendable framework for training NNPs and implement 10 models within it.
☆ HoTPP Benchmark: Are We Good at the Long Horizon Events Forecasting?
In sequential event prediction, which finds applications in finance, retail, social networks, and healthcare, a crucial task is forecasting multiple future events within a specified time horizon. Traditionally, this has been addressed through autoregressive generation using next-event prediction models, such as Marked Temporal Point Processes. However, autoregressive methods use their own output for future predictions, potentially reducing quality as the prediction horizon extends. In this paper, we challenge traditional approaches by introducing a novel benchmark, HoTPP, specifically designed to evaluate a model's ability to predict event sequences over a horizon. This benchmark features a new metric inspired by object detection in computer vision, addressing the limitations of existing metrics in assessing models with imprecise time-step predictions. Our evaluations on established datasets employing various models demonstrate that high accuracy in next-event prediction does not necessarily translate to superior horizon prediction, and vice versa. HoTPP aims to serve as a valuable tool for developing more robust event sequence prediction methods, ultimately paving the way for further advancements in the field.
☆ Learning rate adaptive stochastic gradient descent optimization methods: numerical simulations for deep learning methods for partial differential equations and convergence analyses
It is known that the standard stochastic gradient descent (SGD) optimization method, as well as accelerated and adaptive SGD optimization methods such as the Adam optimizer fail to converge if the learning rates do not converge to zero (as, for example, in the situation of constant learning rates). Numerical simulations often use human-tuned deterministic learning rate schedules or small constant learning rates. The default learning rate schedules for SGD optimization methods in machine learning implementation frameworks such as TensorFlow and Pytorch are constant learning rates. In this work we propose and study a learning-rate-adaptive approach for SGD optimization methods in which the learning rate is adjusted based on empirical estimates for the values of the objective function of the considered optimization problem (the function that one intends to minimize). In particular, we propose a learning-rate-adaptive variant of the Adam optimizer and implement it in case of several neural network learning problems, particularly, in the context of deep learning approximation methods for partial differential equations such as deep Kolmogorov methods, physics-informed neural networks, and deep Ritz methods. In each of the presented learning problems the proposed learning-rate-adaptive variant of the Adam optimizer faster reduces the value of the objective function than the Adam optimizer with the default learning rate. For a simple class of quadratic minimization problems we also rigorously prove that a learning-rate-adaptive variant of the SGD optimization method converges to the minimizer of the considered minimization problem. Our convergence proof is based on an analysis of the laws of invariant measures of the SGD method as well as on a more general convergence analysis for SGD with random but predictable learning rates which we develop in this work.
comment: 68 pages, 8 figures
☆ Adaptive Adversarial Cross-Entropy Loss for Sharpness-Aware Minimization ICIP2024
Recent advancements in learning algorithms have demonstrated that the sharpness of the loss surface is an effective measure for improving the generalization gap. Building upon this concept, Sharpness-Aware Minimization (SAM) was proposed to enhance model generalization and achieved state-of-the-art performance. SAM consists of two main steps, the weight perturbation step and the weight updating step. However, the perturbation in SAM is determined by only the gradient of the training loss, or cross-entropy loss. As the model approaches a stationary point, this gradient becomes small and oscillates, leading to inconsistent perturbation directions and also has a chance of diminishing the gradient. Our research introduces an innovative approach to further enhancing model generalization. We propose the Adaptive Adversarial Cross-Entropy (AACE) loss function to replace standard cross-entropy loss for SAM's perturbation. AACE loss and its gradient uniquely increase as the model nears convergence, ensuring consistent perturbation direction and addressing the gradient diminishing issue. Additionally, a novel perturbation-generating function utilizing AACE loss without normalization is proposed, enhancing the model's exploratory capabilities in near-optimum stages. Empirical testing confirms the effectiveness of AACE, with experiments demonstrating improved performance in image classification tasks using Wide ResNet and PyramidNet across various datasets. The reproduction code is available online
comment: Accepted in ICIP2024. The project page can be accessed at http://www.vip.sc.e.titech.ac.jp/proj/AACE
☆ Computing Within Limits: An Empirical Study of Energy Consumption in ML Training and Inference
Machine learning (ML) has seen tremendous advancements, but its environmental footprint remains a concern. Acknowledging the growing environmental impact of ML this paper investigates Green ML, examining various model architectures and hyperparameters in both training and inference phases to identify energy-efficient practices. Our study leverages software-based power measurements for ease of replication across diverse configurations, models and datasets. In this paper, we examine multiple models and hardware configurations to identify correlations across the various measurements and metrics and key contributors to energy reduction. Our analysis offers practical guidelines for constructing sustainable ML operations, emphasising energy consumption and carbon footprint reductions while maintaining performance. As identified, short-lived profiling can quantify the long-term expected energy consumption. Moreover, model parameters can also be used to accurately estimate the expected total energy without the need for extensive experimentation.
comment: Accepted for publication at ARISDE2024: 1st International Workshop on Artificial Intelligence for Sustainable Development
☆ Reproducibility in Machine Learning-based Research: Overview, Barriers and Drivers
Research in various fields is currently experiencing challenges regarding the reproducibility of results. This problem is also prevalent in machine learning (ML) research. The issue arises primarily due to unpublished data and/or source code and the sensitivity of ML training conditions. Although different solutions have been proposed to address this issue, such as using ML platforms, the level of reproducibility in ML-driven research remains unsatisfactory. Therefore, in this article, we discuss the reproducibility of ML-driven research with three main aims: (i) identify the barriers to reproducibility when applying ML in research as well as categorize the barriers to different types of reproducibility (description, code, data, and experiment reproducibility), (ii) identify potential drivers such as tools, practices, and interventions that support ML reproducibility as well as distinguish between technology-driven drivers, procedural drivers, and drivers related to awareness and education, and (iii) map the drivers to the barriers. With this work, we hope to provide insights and contribute to the decision-making process regarding the adoption of different solutions to support ML reproducibility.
comment: Pre-print of submission planned to the AI Magazine
☆ Revealing the learning process in reinforcement learning agents through attention-oriented metrics
The learning process of a reinforcement learning (RL) agent remains poorly understood beyond the mathematical formulation of its learning algorithm. To address this gap, we introduce attention-oriented metrics (ATOMs) to investigate the development of an RL agent's attention during training. We tested ATOMs on three variations of a Pong game, each designed to teach the agent distinct behaviours, complemented by a behavioural assessment. Our findings reveal that ATOMs successfully delineate the attention patterns of an agent trained on each game variation, and that these differences in attention patterns translate into differences in the agent's behaviour. Through continuous monitoring of ATOMs during training, we observed that the agent's attention developed in phases, and that these phases were consistent across games. Finally, we noted that the agent's attention to its paddle emerged relatively late in the training and coincided with a marked increase in its performance score. Overall, we believe that ATOMs could significantly enhance our understanding of RL agents' learning processes, which is essential for improving their reliability and efficiency.
☆ Mind the Privacy Unit! User-Level Differential Privacy for Language Model Fine-Tuning
Large language models (LLMs) have emerged as powerful tools for tackling complex tasks across diverse domains, but they also raise privacy concerns when fine-tuned on sensitive data due to potential memorization. While differential privacy (DP) offers a promising solution by ensuring models are `almost indistinguishable' with or without any particular privacy unit, current evaluations on LLMs mostly treat each example (text record) as the privacy unit. This leads to uneven user privacy guarantees when contributions per user vary. We therefore study user-level DP motivated by applications where it necessary to ensure uniform privacy protection across users. We present a systematic evaluation of user-level DP for LLM fine-tuning on natural language generation tasks. Focusing on two mechanisms for achieving user-level DP guarantees, Group Privacy and User-wise DP-SGD, we investigate design choices like data selection strategies and parameter tuning for the best privacy-utility tradeoff.
☆ Emerging-properties Mapping Using Spatial Embedding Statistics: EMUSES
Understanding complex phenomena often requires analyzing high-dimensional data to uncover emergent properties that arise from multifactorial interactions. Here, we present EMUSES (Emerging-properties Mapping Using Spatial Embedding Statistics), an innovative approach employing Uniform Manifold Approximation and Projection (UMAP) to create high-dimensional embeddings that reveal latent structures within data. EMUSES facilitates the exploration and prediction of emergent properties by statistically analyzing these latent spaces. Using three distinct datasets--a handwritten digits dataset from the National Institute of Standards and Technology (NIST, E. Alpaydin, 1998), the Chicago Face Database (Ma et al., 2015), and brain disconnection data post-stroke (Talozzi et al., 2023)--we demonstrate EMUSES' effectiveness in detecting and interpreting emergent properties. Our method not only predicts outcomes with high accuracy but also provides clear visualizations and statistical insights into the underlying interactions within the data. By bridging the gap between predictive accuracy and interpretability, EMUSES offers researchers a powerful tool to understand the multifactorial origins of complex phenomena.
comment: 14 pages, 3 figures
☆ FIESTA: Fourier-Based Semantic Augmentation with Uncertainty Guidance for Enhanced Domain Generalizability in Medical Image Segmentation
Single-source domain generalization (SDG) in medical image segmentation (MIS) aims to generalize a model using data from only one source domain to segment data from an unseen target domain. Despite substantial advances in SDG with data augmentation, existing methods often fail to fully consider the details and uncertain areas prevalent in MIS, leading to mis-segmentation. This paper proposes a Fourier-based semantic augmentation method called FIESTA using uncertainty guidance to enhance the fundamental goals of MIS in an SDG context by manipulating the amplitude and phase components in the frequency domain. The proposed Fourier augmentative transformer addresses semantic amplitude modulation based on meaningful angular points to induce pertinent variations and harnesses the phase spectrum to ensure structural coherence. Moreover, FIESTA employs epistemic uncertainty to fine-tune the augmentation process, improving the ability of the model to adapt to diverse augmented data and concentrate on areas with higher ambiguity. Extensive experiments across three cross-domain scenarios demonstrate that FIESTA surpasses recent state-of-the-art SDG approaches in segmentation performance and significantly contributes to boosting the applicability of the model in medical imaging modalities.
comment: 40 pages, 7 figures, 5 tables
☆ Identifiable Exchangeable Mechanisms for Causal Structure and Representation Learning
Identifying latent representations or causal structures is important for good generalization and downstream task performance. However, both fields have been developed rather independently. We observe that several methods in both representation and causal structure learning rely on the same data-generating process (DGP), namely, exchangeable but not i.i.d. (independent and identically distributed) data. We provide a unified framework, termed Identifiable Exchangeable Mechanisms (IEM), for representation and structure learning under the lens of exchangeability. IEM provides new insights that let us relax the necessary conditions for causal structure identification in exchangeable non--i.i.d. data. We also demonstrate the existence of a duality condition in identifiable representation learning, leading to new identifiability results. We hope this work will pave the way for further research in causal representation learning.
☆ Resource Optimization for Tail-Based Control in Wireless Networked Control Systems
Achieving control stability is one of the key design challenges of scalable Wireless Networked Control Systems (WNCS) under limited communication and computing resources. This paper explores the use of an alternative control concept defined as tail-based control, which extends the classical Linear Quadratic Regulator (LQR) cost function for multiple dynamic control systems over a shared wireless network. We cast the control of multiple control systems as a network-wide optimization problem and decouple it in terms of sensor scheduling, plant state prediction, and control policies. Toward this, we propose a solution consisting of a scheduling algorithm based on Lyapunov optimization for sensing, a mechanism based on Gaussian Process Regression (GPR) for state prediction and uncertainty estimation, and a control policy based on Reinforcement Learning (RL) to ensure tail-based control stability. A set of discrete time-invariant mountain car control systems is used to evaluate the proposed solution and is compared against four variants that use state-of-the-art scheduling, prediction, and control methods. The experimental results indicate that the proposed method yields 22% reduction in overall cost in terms of communication and control resource utilization compared to state-of-the-art methods.
comment: Accepted in PIMRC 2024 conference, 6 pages, 5 figures
☆ Revisiting Modularity Maximization for Graph Clustering: A Contrastive Learning Perspective KDD 2024
Graph clustering, a fundamental and challenging task in graph mining, aims to classify nodes in a graph into several disjoint clusters. In recent years, graph contrastive learning (GCL) has emerged as a dominant line of research in graph clustering and advances the new state-of-the-art. However, GCL-based methods heavily rely on graph augmentations and contrastive schemes, which may potentially introduce challenges such as semantic drift and scalability issues. Another promising line of research involves the adoption of modularity maximization, a popular and effective measure for community detection, as the guiding principle for clustering tasks. Despite the recent progress, the underlying mechanism of modularity maximization is still not well understood. In this work, we dig into the hidden success of modularity maximization for graph clustering. Our analysis reveals the strong connections between modularity maximization and graph contrastive learning, where positive and negative examples are naturally defined by modularity. In light of our results, we propose a community-aware graph clustering framework, coined MAGI, which leverages modularity maximization as a contrastive pretext task to effectively uncover the underlying information of communities in graphs, while avoiding the problem of semantic drift. Extensive experiments on multiple graph datasets verify the effectiveness of MAGI in terms of scalability and clustering performance compared to state-of-the-art graph clustering methods. Notably, MAGI easily scales a sufficiently large graph with 100M nodes while outperforming strong baselines.
comment: KDD 2024 research track. Code available at https://github.com/EdisonLeeeee/MAGI
☆ Segmentation of Non-Small Cell Lung Carcinomas: Introducing DRU-Net and Multi-Lens Distortion
Considering the increased workload in pathology laboratories today, automated tools such as artificial intelligence models can help pathologists with their tasks and ease the workload. In this paper, we are proposing a segmentation model (DRU-Net) that can provide a delineation of human non-small cell lung carcinomas and an augmentation method that can improve classification results. The proposed model is a fused combination of truncated pre-trained DenseNet201 and ResNet101V2 as a patch-wise classifier followed by a lightweight U-Net as a refinement model. We have used two datasets (Norwegian Lung Cancer Biobank and Haukeland University Hospital lung cancer cohort) to create our proposed model. The DRU-Net model achieves an average of 0.91 Dice similarity coefficient. The proposed spatial augmentation method (multi-lens distortion) improved the network performance by 3%. Our findings show that choosing image patches that specifically include regions of interest leads to better results for the patch-wise classifier compared to other sampling methods. The qualitative analysis showed that the DRU-Net model is generally successful in detecting the tumor. On the test set, some of the cases showed areas of false positive and false negative segmentation in the periphery, particularly in tumors with inflammatory and reactive changes.
comment: 16 pages, 7 figures, submitted to Scientific Reports
☆ FairX: A comprehensive benchmarking tool for model analysis using fairness, utility, and explainability
We present FairX, an open-source Python-based benchmarking tool designed for the comprehensive analysis of models under the umbrella of fairness, utility, and eXplainability (XAI). FairX enables users to train benchmarking bias-removal models and evaluate their fairness using a wide array of fairness metrics, data utility metrics, and generate explanations for model predictions, all within a unified framework. Existing benchmarking tools do not have the way to evaluate synthetic data generated from fair generative models, also they do not have the support for training fair generative models either. In FairX, we add fair generative models in the collection of our fair-model library (pre-processing, in-processing, post-processing) and evaluation metrics for evaluating the quality of synthetic fair data. This version of FairX supports both tabular and image datasets. It also allows users to provide their own custom datasets. The open-source FairX benchmarking package is publicly available at https://github.com/fahim-sikder/FairX.
☆ Learning to Discover Knowledge: A Weakly-Supervised Partial Domain Adaptation Approach SP
Domain adaptation has shown appealing performance by leveraging knowledge from a source domain with rich annotations. However, for a specific target task, it is cumbersome to collect related and high-quality source domains. In real-world scenarios, large-scale datasets corrupted with noisy labels are easy to collect, stimulating a great demand for automatic recognition in a generalized setting, i.e., weakly-supervised partial domain adaptation (WS-PDA), which transfers a classifier from a large source domain with noises in labels to a small unlabeled target domain. As such, the key issues of WS-PDA are: 1) how to sufficiently discover the knowledge from the noisy labeled source domain and the unlabeled target domain, and 2) how to successfully adapt the knowledge across domains. In this paper, we propose a simple yet effective domain adaptation approach, termed as self-paced transfer classifier learning (SP-TCL), to address the above issues, which could be regarded as a well-performing baseline for several generalized domain adaptation tasks. The proposed model is established upon the self-paced learning scheme, seeking a preferable classifier for the target domain. Specifically, SP-TCL learns to discover faithful knowledge via a carefully designed prudent loss function and simultaneously adapts the learned knowledge to the target domain by iteratively excluding source examples from training under the self-paced fashion. Extensive evaluations on several benchmark datasets demonstrate that SP-TCL significantly outperforms state-of-the-art approaches on several generalized domain adaptation tasks.
comment: Accepted to TIP 2024. Code available: https://github.com/mc-lan/SP-TCL
☆ VeriFlow: Modeling Distributions for Neural Network Verification
Formal verification has emerged as a promising method to ensure the safety and reliability of neural networks. Naively verifying a safety property amounts to ensuring the safety of a neural network for the whole input space irrespective of any training or test set. However, this also implies that the safety of the neural network is checked even for inputs that do not occur in the real-world and have no meaning at all, often resulting in spurious errors. To tackle this shortcoming, we propose the VeriFlow architecture as a flow based density model tailored to allow any verification approach to restrict its search to the some data distribution of interest. We argue that our architecture is particularly well suited for this purpose because of two major properties. First, we show that the transformation and log-density function that are defined by our model are piece-wise affine. Therefore, the model allows the usage of verifiers based on SMT with linear arithmetic. Second, upper density level sets (UDL) of the data distribution take the shape of an $L^p$-ball in the latent space. As a consequence, representations of UDLs specified by a given probability are effectively computable in latent space. This allows for SMT and abstract interpretation approaches with fine-grained, probabilistically interpretable, control regarding on how (a)typical the inputs subject to verification are.
☆ MEAT: Median-Ensemble Adversarial Training for Improving Robustness and Generalization
Self-ensemble adversarial training methods improve model robustness by ensembling models at different training epochs, such as model weight averaging (WA). However, previous research has shown that self-ensemble defense methods in adversarial training (AT) still suffer from robust overfitting, which severely affects the generalization performance. Empirically, in the late phases of training, the AT becomes more overfitting to the extent that the individuals for weight averaging also suffer from overfitting and produce anomalous weight values, which causes the self-ensemble model to continue to undergo robust overfitting due to the failure in removing the weight anomalies. To solve this problem, we aim to tackle the influence of outliers in the weight space in this work and propose an easy-to-operate and effective Median-Ensemble Adversarial Training (MEAT) method to solve the robust overfitting phenomenon existing in self-ensemble defense from the source by searching for the median of the historical model weights. Experimental results show that MEAT achieves the best robustness against the powerful AutoAttack and can effectively allievate the robust overfitting. We further demonstrate that most defense methods can improve robust generalization and robustness by combining with MEAT.
☆ Non-Negative Universal Differential Equations With Applications in Systems Biology
Universal differential equations (UDEs) leverage the respective advantages of mechanistic models and artificial neural networks and combine them into one dynamic model. However, these hybrid models can suffer from unrealistic solutions, such as negative values for biochemical quantities. We present non-negative UDE (nUDEs), a constrained UDE variant that guarantees non-negative values. Furthermore, we explore regularisation techniques to improve generalisation and interpretability of UDEs.
comment: 6 pages, This work has been submitted to IFAC for possible publication. Initial submission was March 18, 2024
☆ Enhancing robustness of data-driven SHM models: adversarial training with circle loss
Structural health monitoring (SHM) is critical to safeguarding the safety and reliability of aerospace, civil, and mechanical infrastructure. Machine learning-based data-driven approaches have gained popularity in SHM due to advancements in sensors and computational power. However, machine learning models used in SHM are vulnerable to adversarial examples -- even small changes in input can lead to different model outputs. This paper aims to address this problem by discussing adversarial defenses in SHM. In this paper, we propose an adversarial training method for defense, which uses circle loss to optimize the distance between features in training to keep examples away from the decision boundary. Through this simple yet effective constraint, our method demonstrates substantial improvements in model robustness, surpassing existing defense mechanisms.
comment: 12 pages, 9 figures
☆ aeon: a Python toolkit for learning from time series
aeon is a unified Python 3 library for all machine learning tasks involving time series. The package contains modules for time series forecasting, classification, extrinsic regression and clustering, as well as a variety of utilities, transformations and distance measures designed for time series data. aeon also has a number of experimental modules for tasks such as anomaly detection, similarity search and segmentation. aeon follows the scikit-learn API as much as possible to help new users and enable easy integration of aeon estimators with useful tools such as model selection and pipelines. It provides a broad library of time series algorithms, including efficient implementations of the very latest advances in research. Using a system of optional dependencies, aeon integrates a wide variety of packages into a single interface while keeping the core framework with minimal dependencies. The package is distributed under the 3-Clause BSD license and is available at https://github.com/ aeon-toolkit/aeon. This version was submitted to the JMLR journal on 02 Nov 2023 for v0.5.0 of aeon. At the time of this preprint aeon has released v0.9.0, and has had substantial changes.
comment: 10 pages
☆ Evaluation of Deep Learning Semantic Segmentation for Land Cover Mapping on Multispectral, Hyperspectral and High Spatial Aerial Imagery
In the rise of climate change, land cover mapping has become such an urgent need in environmental monitoring. The accuracy of land cover classification has gotten increasingly based on the improvement of remote sensing data. Land cover classification using satellite imageries has been explored and become more prevalent in recent years, but the methodologies remain some drawbacks of subjective and time-consuming. Some deep learning techniques have been utilized to overcome these limitations. However, most studies implemented just one image type to evaluate algorithms for land cover mapping. Therefore, our study conducted deep learning semantic segmentation in multispectral, hyperspectral, and high spatial aerial image datasets for landcover mapping. This research implemented a semantic segmentation method such as Unet, Linknet, FPN, and PSPnet for categorizing vegetation, water, and others (i.e., soil and impervious surface). The LinkNet model obtained high accuracy in IoU (Intersection Over Union) at 0.92 in all datasets, which is comparable with other mentioned techniques. In evaluation with different image types, the multispectral images showed higher performance with the IoU, and F1-score are 0.993 and 0.997, respectively. Our outcome highlighted the efficiency and broad applicability of LinkNet and multispectral image on land cover classification. This research contributes to establishing an approach on landcover segmentation via open source for long-term future application.
☆ Defending Against Sophisticated Poisoning Attacks with RL-based Aggregation in Federated Learning
Federated learning is highly susceptible to model poisoning attacks, especially those meticulously crafted for servers. Traditional defense methods mainly focus on updating assessments or robust aggregation against manually crafted myopic attacks. When facing advanced attacks, their defense stability is notably insufficient. Therefore, it is imperative to develop adaptive defenses against such advanced poisoning attacks. We find that benign clients exhibit significantly higher data distribution stability than malicious clients in federated learning in both CV and NLP tasks. Therefore, the malicious clients can be recognized by observing the stability of their data distribution. In this paper, we propose AdaAggRL, an RL-based Adaptive Aggregation method, to defend against sophisticated poisoning attacks. Specifically, we first utilize distribution learning to simulate the clients' data distributions. Then, we use the maximum mean discrepancy (MMD) to calculate the pairwise similarity of the current local model data distribution, its historical data distribution, and global model data distribution. Finally, we use policy learning to adaptively determine the aggregation weights based on the above similarities. Experiments on four real-world datasets demonstrate that the proposed defense model significantly outperforms widely adopted defense models for sophisticated attacks.
Self-Supervised Pretext Tasks for Alzheimer's Disease Classification using 3D Convolutional Neural Networks on Large-Scale Synthetic Neuroimaging Dataset
Structural magnetic resonance imaging (MRI) studies have shown that Alzheimer's Disease (AD) induces both localised and widespread neural degenerative changes throughout the brain. However, the absence of segmentation that highlights brain degenerative changes presents unique challenges for training CNN-based classifiers in a supervised fashion. In this work, we evaluated several unsupervised methods to train a feature extractor for downstream AD vs. CN classification. Using the 3D T1-weighted MRI data of cognitive normal (CN) subjects from the synthetic neuroimaging LDM100K dataset, lightweight 3D CNN-based models are trained for brain age prediction, brain image rotation classification, brain image reconstruction and a multi-head task combining all three tasks into one. Feature extractors trained on the LDM100K synthetic dataset achieved similar performance compared to the same model using real-world data. This supports the feasibility of utilising large-scale synthetic data for pretext task training. All the training and testing splits are performed on the subject-level to prevent data leakage issues. Alongside the simple preprocessing steps, the random cropping data augmentation technique shows consistent improvement across all experiments.
☆ LayerMatch: Do Pseudo-labels Benefit All Layers?
Deep neural networks have achieved remarkable performance across various tasks when supplied with large-scale labeled data. However, the collection of labeled data can be time-consuming and labor-intensive. Semi-supervised learning (SSL), particularly through pseudo-labeling algorithms that iteratively assign pseudo-labels for self-training, offers a promising solution to mitigate the dependency of labeled data. Previous research generally applies a uniform pseudo-labeling strategy across all model layers, assuming that pseudo-labels exert uniform influence throughout. Contrasting this, our theoretical analysis and empirical experiment demonstrate feature extraction layer and linear classification layer have distinct learning behaviors in response to pseudo-labels. Based on these insights, we develop two layer-specific pseudo-label strategies, termed Grad-ReLU and Avg-Clustering. Grad-ReLU mitigates the impact of noisy pseudo-labels by removing the gradient detrimental effects of pseudo-labels in the linear classification layer. Avg-Clustering accelerates the convergence of feature extraction layer towards stable clustering centers by integrating consistent outputs. Our approach, LayerMatch, which integrates these two strategies, can avoid the severe interference of noisy pseudo-labels in the linear classification layer while accelerating the clustering capability of the feature extraction layer. Through extensive experimentation, our approach consistently demonstrates exceptional performance on standard semi-supervised learning benchmarks, achieving a significant improvement of 10.38% over baseline method and a 2.44% increase compared to state-of-the-art methods.
☆ Temporal Knowledge Graph Question Answering: A Survey
Knowledge Base Question Answering (KBQA) has been a long-standing field to answer questions based on knowledge bases. Recently, the evolving dynamics of knowledge have attracted a growing interest in Temporal Knowledge Graph Question Answering (TKGQA), an emerging task to answer temporal questions. However, this field grapples with ambiguities in defining temporal questions and lacks a systematic categorization of existing methods for TKGQA. In response, this paper provides a thorough survey from two perspectives: the taxonomy of temporal questions and the methodological categorization for TKGQA. Specifically, we first establish a detailed taxonomy of temporal questions engaged in prior studies. Subsequently, we provide a comprehensive review of TKGQA techniques of two categories: semantic parsing-based and TKG embedding-based. Building on this review, the paper outlines potential research directions aimed at advancing the field of TKGQA. This work aims to serve as a comprehensive reference for TKGQA and to stimulate further research.
comment: 8 pages, 3 figures
☆ Latent. Functional Map
Neural models learn data representations that lie on low-dimensional manifolds, yet modeling the relation between these representational spaces is an ongoing challenge. By integrating spectral geometry principles into neural modeling, we show that this problem can be better addressed in the functional domain, mitigating complexity, while enhancing interpretability and performances on downstream tasks. To this end, we introduce a multi-purpose framework to the representation learning community, which allows to: (i) compare different spaces in an interpretable way and measure their intrinsic similarity; (ii) find correspondences between them, both in unsupervised and weakly supervised settings, and (iii) to effectively transfer representations between distinct spaces. We validate our framework on various applications, ranging from stitching to retrieval tasks, demonstrating that latent functional maps can serve as a swiss-army knife for representation alignment.
☆ Optimizing Novelty of Top-k Recommendations using Large Language Models and Reinforcement Learning KDD 2024
Given an input query, a recommendation model is trained using user feedback data (e.g., click data) to output a ranked list of items. In real-world systems, besides accuracy, an important consideration for a new model is novelty of its top-k recommendations w.r.t. an existing deployed model. However, novelty of top-k items is a difficult goal to optimize a model for, since it involves a non-differentiable sorting operation on the model's predictions. Moreover, novel items, by definition, do not have any user feedback data. Given the semantic capabilities of large language models, we address these problems using a reinforcement learning (RL) formulation where large language models provide feedback for the novel items. However, given millions of candidate items, the sample complexity of a standard RL algorithm can be prohibitively high. To reduce sample complexity, we reduce the top-k list reward to a set of item-wise rewards and reformulate the state space to consist of tuples such that the action space is reduced to a binary decision; and show that this reformulation results in a significantly lower complexity when the number of items is large. We evaluate the proposed algorithm on improving novelty for a query-ad recommendation task on a large-scale search engine. Compared to supervised finetuning on recent pairs, the proposed RL-based algorithm leads to significant novelty gains with minimal loss in recall. We obtain similar results on the ORCAS query-webpage matching dataset and a product recommendation dataset based on Amazon reviews.
comment: Accepted at KDD 2024
☆ Iterative Sizing Field Prediction for Adaptive Mesh Generation From Expert Demonstrations ICML 2024
Many engineering systems require accurate simulations of complex physical systems. Yet, analytical solutions are only available for simple problems, necessitating numerical approximations such as the Finite Element Method (FEM). The cost and accuracy of the FEM scale with the resolution of the underlying computational mesh. To balance computational speed and accuracy meshes with adaptive resolution are used, allocating more resources to critical parts of the geometry. Currently, practitioners often resort to hand-crafted meshes, which require extensive expert knowledge and are thus costly to obtain. Our approach, Adaptive Meshing By Expert Reconstruction (AMBER), views mesh generation as an imitation learning problem. AMBER combines a graph neural network with an online data acquisition scheme to predict the projected sizing field of an expert mesh on a given intermediate mesh, creating a more accurate subsequent mesh. This iterative process ensures efficient and accurate imitation of expert mesh resolutions on arbitrary new geometries during inference. We experimentally validate AMBER on heuristic 2D meshes and 3D meshes provided by a human expert, closely matching the provided demonstrations and outperforming a single-step CNN baseline.
comment: Accepted as a workshop paper in AI4Science@ICML 2024
☆ Tractable Equilibrium Computation in Markov Games through Risk Aversion
A significant roadblock to the development of principled multi-agent reinforcement learning is the fact that desired solution concepts like Nash equilibria may be intractable to compute. To overcome this obstacle, we take inspiration from behavioral economics and show that -- by imbuing agents with important features of human decision-making like risk aversion and bounded rationality -- a class of risk-averse quantal response equilibria (RQE) become tractable to compute in all $n$-player matrix and finite-horizon Markov games. In particular, we show that they emerge as the endpoint of no-regret learning in suitably adjusted versions of the games. Crucially, the class of computationally tractable RQE is independent of the underlying game structure and only depends on agents' degree of risk-aversion and bounded rationality. To validate the richness of this class of solution concepts we show that it captures peoples' patterns of play in a number of 2-player matrix games previously studied in experimental economics. Furthermore, we give a first analysis of the sample complexity of computing these equilibria in finite-horizon Markov games when one has access to a generative model and validate our findings on a simple multi-agent reinforcement learning benchmark.
comment: preprint of multi-agent RL with risk-averse equilibria
☆ Watching the Watchers: A Comparative Fairness Audit of Cloud-based Content Moderation Services
Online platforms face the challenge of moderating an ever-increasing volume of content, including harmful hate speech. In the absence of clear legal definitions and a lack of transparency regarding the role of algorithms in shaping decisions on content moderation, there is a critical need for external accountability. Our study contributes to filling this gap by systematically evaluating four leading cloud-based content moderation services through a third-party audit, highlighting issues such as biases against minorities and vulnerable groups that may arise through over-reliance on these services. Using a black-box audit approach and four benchmark data sets, we measure performance in explicit and implicit hate speech detection as well as counterfactual fairness through perturbation sensitivity analysis and present disparities in performance for certain target identity groups and data sets. Our analysis reveals that all services had difficulties detecting implicit hate speech, which relies on more subtle and codified messages. Moreover, our results point to the need to remove group-specific bias. It seems that biases towards some groups, such as Women, have been mostly rectified, while biases towards other groups, such as LGBTQ+ and PoC remain.
comment: Accepted at European Workshop on Algorithmic Fairness (EWAF'24)
☆ Multi-modal Transfer Learning between Biological Foundation Models
Biological sequences encode fundamental instructions for the building blocks of life, in the form of DNA, RNA, and proteins. Modeling these sequences is key to understand disease mechanisms and is an active research area in computational biology. Recently, Large Language Models have shown great promise in solving certain biological tasks but current approaches are limited to a single sequence modality (DNA, RNA, or protein). Key problems in genomics intrinsically involve multiple modalities, but it remains unclear how to adapt general-purpose sequence models to those cases. In this work we propose a multi-modal model that connects DNA, RNA, and proteins by leveraging information from different pre-trained modality-specific encoders. We demonstrate its capabilities by applying it to the largely unsolved problem of predicting how multiple RNA transcript isoforms originate from the same gene (i.e. same DNA sequence) and map to different transcription expression levels across various human tissues. We show that our model, dubbed IsoFormer, is able to accurately predict differential transcript expression, outperforming existing methods and leveraging the use of multiple modalities. Our framework also achieves efficient transfer knowledge from the encoders pre-training as well as in between modalities. We open-source our model, paving the way for new multi-modal gene expression approaches.
☆ CheMFi: A Multifidelity Dataset of Quantum Chemical Properties of Diverse Molecules
Progress in both Machine Learning (ML) and conventional Quantum Chemistry (QC) computational methods have resulted in high accuracy ML models for QC properties ranging from atomization energies to excitation energies. Various datasets such as MD17, MD22, and WS22, which consist of properties calculated at some level of QC method, or fidelity, have been generated to benchmark such ML models. The term fidelity refers to the accuracy of the chosen QC method to the actual real value of the property. The higher the fidelity, the more accurate the calculated property, albeit at a higher computational cost. Research in multifidelity ML (MFML) methods, where ML models are trained on data from more than one numerical QC method, has shown the effectiveness of such models over single fidelity methods. Much research is progressing in this direction for diverse applications ranging from energy band gaps to excitation energies. A major hurdle for effective research in this field of research in the community is the lack of a diverse multifidelity dataset for benchmarking. Here, we present a comprehensive multifidelity dataset drawn from the WS22 molecular conformations. We provide the quantum Chemistry MultiFidelity (CheMFi) dataset consisting of five fidelities calculated with the TD-DFT formalism. The fidelities differ in their basis set choice and are namely: STO-3G, 3-21G, 6-31G, def2-SVP, and def2-TZVP. CheMFi offers to the community a variety of QC properties including vertical excitation energies, oscillator strengths, molecular dipole moments, and ground state energies. In addition to the dataset, multifidelity benchmarks are set with state-of-the-art MFML and optimized-MFML
comment: SI not included
☆ Finding Safety Neurons in Large Language Models
Large language models (LLMs) excel in various capabilities but also pose safety risks such as generating harmful content and misinformation, even after safety alignment. In this paper, we explore the inner mechanisms of safety alignment from the perspective of mechanistic interpretability, focusing on identifying and analyzing safety neurons within LLMs that are responsible for safety behaviors. We propose generation-time activation contrasting to locate these neurons and dynamic activation patching to evaluate their causal effects. Experiments on multiple recent LLMs show that: (1) Safety neurons are sparse and effective. We can restore $90$% safety performance with intervention only on about $5$% of all the neurons. (2) Safety neurons encode transferrable mechanisms. They exhibit consistent effectiveness on different red-teaming datasets. The finding of safety neurons also interprets "alignment tax". We observe that the identified key neurons for safety and helpfulness significantly overlap, but they require different activation patterns of the shared neurons. Furthermore, we demonstrate an application of safety neurons in detecting unsafe outputs before generation. Our findings may promote further research on understanding LLM alignment. The source codes will be publicly released to facilitate future research.
☆ Geometric Self-Supervised Pretraining on 3D Protein Structures using Subgraphs
Protein representation learning aims to learn informative protein embeddings capable of addressing crucial biological questions, such as protein function prediction. Although sequence-based transformer models have shown promising results by leveraging the vast amount of protein sequence data in a self-supervised way, there is still a gap in applying these methods to 3D protein structures. In this work, we propose a pre-training scheme going beyond trivial masking methods leveraging 3D and hierarchical structures of proteins. We propose a novel self-supervised method to pretrain 3D graph neural networks on 3D protein structures, by predicting the distances between local geometric centroids of protein subgraphs and the global geometric centroid of the protein. The motivation for this method is twofold. First, the relative spatial arrangements and geometric relationships among different regions of a protein are crucial for its function. Moreover, proteins are often organized in a hierarchical manner, where smaller substructures, such as secondary structure elements, assemble into larger domains. By considering subgraphs and their relationships to the global protein structure, the model can learn to reason about these hierarchical levels of organization. We experimentally show that our proposed pertaining strategy leads to significant improvements in the performance of 3D GNNs in various protein classification tasks.
☆ Measuring Sample Importance in Data Pruning for Training LLMs from a Data Compression Perspective
Compute-efficient training of large language models (LLMs) has become an important research problem. In this work, we consider data pruning as a method of data-efficient training of LLMs, where we take a data compression view on data pruning. We argue that the amount of information of a sample, or the achievable compression on its description length, represents its sample importance. The key idea is that, less informative samples are likely to contain redundant information, and thus should be pruned first. We leverage log-likelihood function of trained models as a surrogate to measure information content of samples. Experiments reveal a surprising insight that information-based pruning can enhance the generalization capability of the model, improves upon language modeling and downstream tasks as compared to the model trained on the entire dataset.
☆ Expander Hierarchies for Normalized Cuts on Graphs KDD'24
Expander decompositions of graphs have significantly advanced the understanding of many classical graph problems and led to numerous fundamental theoretical results. However, their adoption in practice has been hindered due to their inherent intricacies and large hidden factors in their asymptotic running times. Here, we introduce the first practically efficient algorithm for computing expander decompositions and their hierarchies and demonstrate its effectiveness and utility by incorporating it as the core component in a novel solver for the normalized cut graph clustering objective. Our extensive experiments on a variety of large graphs show that our expander-based algorithm outperforms state-of-the-art solvers for normalized cut with respect to solution quality by a large margin on a variety of graph classes such as citation, e-mail, and social networks or web graphs while remaining competitive in running time.
comment: Accepted to KDD'24, August 25-29, 2024, Barcelona, Spain
☆ Graph Neural Networks for Job Shop Scheduling Problems: A Survey
Job shop scheduling problems (JSSPs) represent a critical and challenging class of combinatorial optimization problems. Recent years have witnessed a rapid increase in the application of graph neural networks (GNNs) to solve JSSPs, albeit lacking a systematic survey of the relevant literature. This paper aims to thoroughly review prevailing GNN methods for different types of JSSPs and the closely related flow-shop scheduling problems (FSPs), especially those leveraging deep reinforcement learning (DRL). We begin by presenting the graph representations of various JSSPs, followed by an introduction to the most commonly used GNN architectures. We then review current GNN-based methods for each problem type, highlighting key technical elements such as graph representations, GNN architectures, GNN tasks, and training algorithms. Finally, we summarize and analyze the advantages and limitations of GNNs in solving JSSPs and provide potential future research opportunities. We hope this survey can motivate and inspire innovative approaches for more powerful GNN-based approaches in tackling JSSPs and other scheduling problems.
☆ Memory-Efficient Gradient Unrolling for Large-Scale Bi-level Optimization
Bi-level optimization (BO) has become a fundamental mathematical framework for addressing hierarchical machine learning problems. As deep learning models continue to grow in size, the demand for scalable bi-level optimization solutions has become increasingly critical. Traditional gradient-based bi-level optimization algorithms, due to their inherent characteristics, are ill-suited to meet the demands of large-scale applications. In this paper, we introduce $\textbf{F}$orward $\textbf{G}$radient $\textbf{U}$nrolling with $\textbf{F}$orward $\textbf{F}$radient, abbreviated as $(\textbf{FG})^2\textbf{U}$, which achieves an unbiased stochastic approximation of the meta gradient for bi-level optimization. $(\text{FG})^2\text{U}$ circumvents the memory and approximation issues associated with classical bi-level optimization approaches, and delivers significantly more accurate gradient estimates than existing large-scale bi-level optimization approaches. Additionally, $(\text{FG})^2\text{U}$ is inherently designed to support parallel computing, enabling it to effectively leverage large-scale distributed computing systems to achieve significant computational efficiency. In practice, $(\text{FG})^2\text{U}$ and other methods can be strategically placed at different stages of the training process to achieve a more cost-effective two-phase paradigm. Further, $(\text{FG})^2\text{U}$ is easy to implement within popular deep learning frameworks, and can be conveniently adapted to address more challenging zeroth-order bi-level optimization scenarios. We provide a thorough convergence analysis and a comprehensive practical discussion for $(\text{FG})^2\text{U}$, complemented by extensive empirical evaluations, showcasing its superior performance in diverse large-scale bi-level optimization tasks.
☆ ReaLHF: Optimized RLHF Training for Large Language Models through Parameter Reallocation
Reinforcement Learning from Human Feedback (RLHF) stands as a pivotal technique in empowering large language model (LLM) applications. Since RLHF involves diverse computational workloads and intricate dependencies among multiple LLMs, directly adopting parallelization techniques from supervised training can result in sub-optimal performance. To overcome this limitation, we propose a novel approach named parameter ReaLlocation, which dynamically redistributes LLM parameters in the cluster and adapts parallelization strategies during training. Building upon this idea, we introduce ReaLHF, a pioneering system capable of automatically discovering and running efficient execution plans for RLHF training given the desired algorithmic and hardware configurations. ReaLHF formulates the execution plan for RLHF as an augmented dataflow graph. Based on this formulation, ReaLHF employs a tailored search algorithm with a lightweight cost estimator to discover an efficient execution plan. Subsequently, the runtime engine deploys the selected plan by effectively parallelizing computations and redistributing parameters. We evaluate ReaLHF on the LLaMA-2 models with up to $4\times70$ billion parameters and 128 GPUs. The experiment results showcase ReaLHF's substantial speedups of $2.0-10.6\times$ compared to baselines. Furthermore, the execution plans generated by ReaLHF exhibit an average of $26\%$ performance improvement over heuristic approaches based on Megatron-LM. The source code of ReaLHF is publicly available at https://github.com/openpsi-project/ReaLHF .
comment: 13 pages (15 pages with references), 13 figures
☆ Semi Supervised Heterogeneous Domain Adaptation via Disentanglement and Pseudo-Labelling
Semi-supervised domain adaptation methods leverage information from a source labelled domain with the goal of generalizing over a scarcely labelled target domain. While this setting already poses challenges due to potential distribution shifts between domains, an even more complex scenario arises when source and target data differs in modality representation (e.g. they are acquired by sensors with different characteristics). For instance, in remote sensing, images may be collected via various acquisition modes (e.g. optical or radar), different spectral characteristics (e.g. RGB or multi-spectral) and spatial resolutions. Such a setting is denoted as Semi-Supervised Heterogeneous Domain Adaptation (SSHDA) and it exhibits an even more severe distribution shift due to modality heterogeneity across domains.To cope with the challenging SSHDA setting, here we introduce SHeDD (Semi-supervised Heterogeneous Domain Adaptation via Disentanglement) an end-to-end neural framework tailored to learning a target domain classifier by leveraging both labelled and unlabelled data from heterogeneous data sources. SHeDD is designed to effectively disentangle domain-invariant representations, relevant for the downstream task, from domain-specific information, that can hinder the cross-modality transfer. Additionally, SHeDD adopts an augmentation-based consistency regularization mechanism that takes advantages of reliable pseudo-labels on the unlabelled target samples to further boost its generalization ability on the target domain. Empirical evaluations on two remote sensing benchmarks, encompassing heterogeneous data in terms of acquisition modes and spectral/spatial resolutions, demonstrate the quality of SHeDD compared to both baseline and state-of-the-art competing approaches. Our code is publicly available here: https://github.com/tanodino/SSHDA/
☆ Seg-LSTM: Performance of xLSTM for Semantic Segmentation of Remotely Sensed Images
Recent advancements in autoregressive networks with linear complexity have driven significant research progress, demonstrating exceptional performance in large language models. A representative model is the Extended Long Short-Term Memory (xLSTM), which incorporates gating mechanisms and memory structures, performing comparably to Transformer architectures in long-sequence language tasks. Autoregressive networks such as xLSTM can utilize image serialization to extend their application to visual tasks such as classification and segmentation. Although existing studies have demonstrated Vision-LSTM's impressive results in image classification, its performance in image semantic segmentation remains unverified. Our study represents the first attempt to evaluate the effectiveness of Vision-LSTM in the semantic segmentation of remotely sensed images. This evaluation is based on a specifically designed encoder-decoder architecture named Seg-LSTM, and comparisons with state-of-the-art segmentation networks. Our study found that Vision-LSTM's performance in semantic segmentation was limited and generally inferior to Vision-Transformers-based and Vision-Mamba-based models in most comparative tests. Future research directions for enhancing Vision-LSTM are recommended. The source code is available from https://github.com/zhuqinfeng1999/Seg-LSTM.
☆ FLoCoRA: Federated learning compression with low-rank adaptation
Low-Rank Adaptation (LoRA) methods have gained popularity in efficient parameter fine-tuning of models containing hundreds of billions of parameters. In this work, instead, we demonstrate the application of LoRA methods to train small-vision models in Federated Learning (FL) from scratch. We first propose an aggregation-agnostic method to integrate LoRA within FL, named FLoCoRA, showing that the method is capable of reducing communication costs by 4.8 times, while having less than 1% accuracy degradation, for a CIFAR-10 classification task with a ResNet-8. Next, we show that the same method can be extended with an affine quantization scheme, dividing the communication cost by 18.6 times, while comparing it with the standard method, with still less than 1% of accuracy loss, tested with on a ResNet-18 model. Our formulation represents a strong baseline for message size reduction, even when compared to conventional model compression works, while also reducing the training memory requirements due to the low-rank adaptation.
☆ Exploring Layerwise Adversarial Robustness Through the Lens of t-SNE
Adversarial examples, designed to trick Artificial Neural Networks (ANNs) into producing wrong outputs, highlight vulnerabilities in these models. Exploring these weaknesses is crucial for developing defenses, and so, we propose a method to assess the adversarial robustness of image-classifying ANNs. The t-distributed Stochastic Neighbor Embedding (t-SNE) technique is used for visual inspection, and a metric, which compares the clean and perturbed embeddings, helps pinpoint weak spots in the layers. Analyzing two ANNs on CIFAR-10, one designed by humans and another via NeuroEvolution, we found that differences between clean and perturbed representations emerge early on, in the feature extraction layers, affecting subsequent classification. The findings with our metric are supported by the visual analysis of the t-SNE maps.
☆ Bayesian Bandit Algorithms with Approximate Inference in Stochastic Linear Bandits
Bayesian bandit algorithms with approximate Bayesian inference have been widely used in real-world applications. Nevertheless, their theoretical justification is less investigated in the literature, especially for contextual bandit problems. To fill this gap, we propose a general theoretical framework to analyze stochastic linear bandits in the presence of approximate inference and conduct regret analysis on two Bayesian bandit algorithms, Linear Thompson sampling (LinTS) and the extension of Bayesian Upper Confidence Bound, namely Linear Bayesian Upper Confidence Bound (LinBUCB). We demonstrate that both LinTS and LinBUCB can preserve their original rates of regret upper bound but with a sacrifice of larger constant terms when applied with approximate inference. These results hold for general Bayesian inference approaches, under the assumption that the inference error measured by two different $\alpha$-divergences is bounded. Additionally, by introducing a new definition of well-behaved distributions, we show that LinBUCB improves the regret rate of LinTS from $\tilde{O}(d^{3/2}\sqrt{T})$ to $\tilde{O}(d\sqrt{T})$, matching the minimax optimal rate. To our knowledge, this work provides the first regret bounds in the setting of stochastic linear bandits with bounded approximate inference errors.
☆ Tracking solutions of time-varying variational inequalities
Tracking the solution of time-varying variational inequalities is an important problem with applications in game theory, optimization, and machine learning. Existing work considers time-varying games or time-varying optimization problems. For strongly convex optimization problems or strongly monotone games, these results provide tracking guarantees under the assumption that the variation of the time-varying problem is restrained, that is, problems with a sublinear solution path. In this work we extend existing results in two ways: In our first result, we provide tracking bounds for (1) variational inequalities with a sublinear solution path but not necessarily monotone functions, and (2) for periodic time-varying variational inequalities that do not necessarily have a sublinear solution path-length. Our second main contribution is an extensive study of the convergence behavior and trajectory of discrete dynamical systems of periodic time-varying VI. We show that these systems can exhibit provably chaotic behavior or can converge to the solution. Finally, we illustrate our theoretical results with experiments.
☆ Urban-Focused Multi-Task Offline Reinforcement Learning with Contrastive Data Sharing KDD 2024
Enhancing diverse human decision-making processes in an urban environment is a critical issue across various applications, including ride-sharing vehicle dispatching, public transportation management, and autonomous driving. Offline reinforcement learning (RL) is a promising approach to learn and optimize human urban strategies (or policies) from pre-collected human-generated spatial-temporal urban data. However, standard offline RL faces two significant challenges: (1) data scarcity and data heterogeneity, and (2) distributional shift. In this paper, we introduce MODA -- a Multi-Task Offline Reinforcement Learning with Contrastive Data Sharing approach. MODA addresses the challenges of data scarcity and heterogeneity in a multi-task urban setting through Contrastive Data Sharing among tasks. This technique involves extracting latent representations of human behaviors by contrasting positive and negative data pairs. It then shares data presenting similar representations with the target task, facilitating data augmentation for each task. Moreover, MODA develops a novel model-based multi-task offline RL algorithm. This algorithm constructs a robust Markov Decision Process (MDP) by integrating a dynamics model with a Generative Adversarial Network (GAN). Once the robust MDP is established, any online RL or planning algorithm can be applied. Extensive experiments conducted in a real-world multi-task urban setting validate the effectiveness of MODA. The results demonstrate that MODA exhibits significant improvements compared to state-of-the-art baselines, showcasing its capability in advancing urban decision-making processes. We also made our code available to the research community.
comment: KDD 2024
☆ Constrained Meta Agnostic Reinforcement Learning
Meta-Reinforcement Learning (Meta-RL) aims to acquire meta-knowledge for quick adaptation to diverse tasks. However, applying these policies in real-world environments presents a significant challenge in balancing rapid adaptability with adherence to environmental constraints. Our novel approach, Constraint Model Agnostic Meta Learning (C-MAML), merges meta learning with constrained optimization to address this challenge. C-MAML enables rapid and efficient task adaptation by incorporating task-specific constraints directly into its meta-algorithm framework during the training phase. This fusion results in safer initial parameters for learning new tasks. We demonstrate the effectiveness of C-MAML in simulated locomotion with wheeled robot tasks of varying complexity, highlighting its practicality and robustness in dynamic environments.
☆ Understanding Different Design Choices in Training Large Time Series Models
Inspired by Large Language Models (LLMs), Time Series Forecasting (TSF), a long-standing task in time series analysis, is undergoing a transition towards Large Time Series Models (LTSMs), aiming to train universal transformer-based models for TSF. However, training LTSMs on heterogeneous time series data poses unique challenges, including diverse frequencies, dimensions, and patterns across datasets. Recent endeavors have studied and evaluated various design choices aimed at enhancing LTSM training and generalization capabilities, spanning pre-processing techniques, model configurations, and dataset configurations. In this work, we comprehensively analyze these design choices and aim to identify the best practices for training LTSM. Moreover, we propose \emph{time series prompt}, a novel statistical prompting strategy tailored to time series data. Furthermore, based on the observations in our analysis, we introduce \texttt{LTSM-bundle}, which bundles the best design choices we have identified. Empirical results demonstrate that \texttt{LTSM-bundle} achieves superior zero-shot and few-shot performances compared to state-of-the-art LSTMs and traditional TSF methods on benchmark datasets.
☆ Encoder-Decoder Neural Networks in Interpretation of X-ray Spectra
Encoder-decoder neural networks (EDNN) condense information most relevant to the output of the feedforward network to activation values at a bottleneck layer. We study the use of this architecture in emulation and interpretation of simulated X-ray spectroscopic data with the aim to identify key structural characteristics for the spectra, previously studied using emulator-based component analysis (ECA). We find an EDNN to outperform ECA in covered target variable variance, but also discover complications in interpreting the latent variables in physical terms. As a compromise of the benefits of these two approaches, we develop a network where the linear projection of ECA is used, thus maintaining the beneficial characteristics of vector expansion from the latent variables for their interpretation. These results underline the necessity of information recovery after its condensation and identification of decisive structural degrees for the output spectra for a justified interpretation.
☆ A Practical Diffusion Path for Sampling
Diffusion models are state-of-the-art methods in generative modeling when samples from a target probability distribution are available, and can be efficiently sampled, using score matching to estimate score vectors guiding a Langevin process. However, in the setting where samples from the target are not available, e.g. when this target's density is known up to a normalization constant, the score estimation task is challenging. Previous approaches rely on Monte Carlo estimators that are either computationally heavy to implement or sample-inefficient. In this work, we propose a computationally attractive alternative, relying on the so-called dilation path, that yields score vectors that are available in closed-form. This path interpolates between a Dirac and the target distribution using a convolution. We propose a simple implementation of Langevin dynamics guided by the dilation path, using adaptive step-sizes. We illustrate the results of our sampling method on a range of tasks, and shows it performs better than classical alternatives.
☆ Toward Infinite-Long Prefix in Transformer
Prompting and contextual-based fine-tuning methods, which we call Prefix Learning, have been proposed to enhance the performance of language models on various downstream tasks that can match full parameter fine-tuning. There remains a limited theoretical understanding of how these methods work. In this paper, we aim to relieve this limitation by studying the learning ability of Prefix Learning from the perspective of prefix length. In particular, we approximate the infinite-long Prefix Learning optimization process by the Neural Tangent Kernel (NTK) technique. We formulate and solve it as a learning problem of the infinite-long prefix in a one-layer attention network. Our results confirm the over-parameterization property and arbitrary small loss convergence guarantee of the infinite-long Prefix Learning in attention. To the implementation end, we propose our NTK-Attention method, which is "equivalent" to attention computation with arbitrary prefix length efficiently. Its time complexity mainly depends on the sub-quadratic of input length (without prefix), and our method only requires $d^2 + d$ extra parameters for representation, where $d$ is the feature dimension. In addition, we conducted experiments that compare our NTK-Attention with full parameters fine-tuning, LoRA, and P-Tuning V2 methods across vision or natural language datasets. The results indicate our approach may be a promising parameter-efficient-fine-tuning method since it has demonstrated superior performance in numerous scenarios. Our code can be found at \url{https://github.com/ChristianYang37/chiwun/tree/main/src/NTK-Attention}.
☆ Ensembles of Probabilistic Regression Trees
Tree-based ensemble methods such as random forests, gradient-boosted trees, and Bayesianadditive regression trees have been successfully used for regression problems in many applicationsand research studies. In this paper, we study ensemble versions of probabilisticregression trees that provide smooth approximations of the objective function by assigningeach observation to each region with respect to a probability distribution. We prove thatthe ensemble versions of probabilistic regression trees considered are consistent, and experimentallystudy their bias-variance trade-off and compare them with the state-of-the-art interms of performance prediction.
☆ Demystifying Forgetting in Language Model Fine-Tuning with Statistical Analysis of Example Associations
Language models (LMs) are known to suffer from forgetting of previously learned examples when fine-tuned, breaking stability of deployed LM systems. Despite efforts on mitigating forgetting, few have investigated whether, and how forgotten upstream examples are associated with newly learned tasks. Insights on such associations enable efficient and targeted mitigation of forgetting. In this paper, we empirically analyze forgetting that occurs in $N$ upstream examples while the model learns $M$ new tasks and visualize their associations with a $M \times N$ matrix. We empirically demonstrate that the degree of forgetting can often be approximated by simple multiplicative contributions of the upstream examples and newly learned tasks. We also reveal more complicated patterns where specific subsets of examples are forgotten with statistics and visualization. Following our analysis, we predict forgetting that happens on upstream examples when learning a new task with matrix completion over the empirical associations, outperforming prior approaches that rely on trainable LMs. Project website: https://inklab.usc.edu/lm-forgetting-prediction/
comment: 5 pages
☆ Investigating the Pre-Training Dynamics of In-Context Learning: Task Recognition vs. Task Learning
The emergence of in-context learning (ICL) is potentially attributed to two major abilities: task recognition (TR) for recognizing the task from demonstrations and utilizing pre-trained priors, and task learning (TL) for learning from demonstrations. However, relationships between the two abilities and how such relationships affect the emergence of ICL is unclear. In this paper, we take the first step by examining the pre-training dynamics of the emergence of ICL. With carefully designed metrics, we find that these two abilities are, in fact, competitive during pre-training. Moreover, we observe a strong negative correlation between the competition and ICL performance. Further analysis of common pre-training factors (i.e., model size, dataset size, and data curriculum) demonstrates possible ways to manage the competition. Based on these insights, we propose a simple yet effective method to better integrate these two abilities for ICL at inference time. Through adaptive ensemble learning, the performance of ICL can be significantly boosted, enabling two small models to outperform a larger one with more than twice the parameters. The code is available at https://github.com/RUCAIBox/Competitive-ICL.
comment: work in progress
♻ ☆ MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection
Recent advances in deep learning have mainly relied on Transformers due to their data dependency and ability to learn at scale. The attention module in these architectures, however, exhibits quadratic time and space in input size, limiting their scalability for long-sequence modeling. State Space Models (SSMs), and more specifically Selective SSMs (S6), with efficient hardware-aware implementation, have shown promising potential for long causal sequence modeling. They, however, use separate blocks for each channel and fail to filter irrelevant channels and capture inter-channel dependencies. Natural attempt to mix information across channels using MLP, attention, or SSMs results in further instability in the training of SSMs for large networks and/or nearly double the number of parameters. We present the MambaMixer block, a new SSM-based architecture with data-dependent weights that uses a dual selection mechanism across tokens and channels-called Selective Token and Channel Mixer. To mitigate doubling the number of parameters, we present a new non-causal heuristic of the S6 block with a hardware-friendly implementation. We further present an efficient variant of MambaMixer, called QSMixer, that mixes information along both sequence and embedding dimensions. As a proof of concept, we design Vision MambaMixer (ViM2) and Vision QSMixer (ViQS) architectures. To enhance their ability to capture spatial information in images, we present Switch of Scans (SoS) that dynamically uses a set of useful image scans to traverse image patches. We evaluate the performance of our methods in image classification, segmentation, and object detection. Our results underline the importance of selectively mixing across both tokens and channels and show the competitive (resp. superior) performance of our methods with well-established vision models (resp. SSM-based models).
♻ ☆ [Experiments & Analysis] Evaluating the Feasibility of Sampling-Based Techniques for Training Multilayer Perceptrons
The training process of neural networks is known to be time-consuming, and having a deep architecture only aggravates the issue. This process consists mostly of matrix operations, among which matrix multiplication is the bottleneck. Several sampling-based techniques have been proposed for speeding up the training time of deep neural networks by approximating the matrix products. These techniques fall under two categories: (i) sampling a subset of nodes in every hidden layer as active at every iteration and (ii) sampling a subset of nodes from the previous layer to approximate the current layer's activations using the edges from the sampled nodes. In both cases, the matrix products are computed using only the selected samples. In this paper, we evaluate the feasibility of these approaches on CPU machines with limited computational resources. Making a connection between the two research directions as special cases of approximating matrix multiplications in the context of neural networks, we provide a negative theoretical analysis that shows feedforward approximation is an obstacle against scalability. We conduct comprehensive experimental evaluations that demonstrate the most pressing challenges and limitations associated with the studied approaches. We observe that the hashing-based node selection method is not scalable to a large number of layers, confirming our theoretical analysis. Finally, we identify directions for future research.
♻ ☆ Contextual Continuum Bandits: Static Versus Dynamic Regret
We study the contextual continuum bandits problem, where the learner sequentially receives a side information vector and has to choose an action in a convex set, minimizing a function associated to the context. The goal is to minimize all the underlying functions for the received contexts, leading to a dynamic (contextual) notion of regret, which is stronger than the standard static regret. Assuming that the objective functions are H\"older with respect to the contexts, we demonstrate that any algorithm achieving a sub-linear static regret can be extended to achieve a sub-linear dynamic regret. We further study the case of strongly convex and smooth functions when the observations are noisy. Inspired by the interior point method and employing self-concordant barriers, we propose an algorithm achieving a sub-linear dynamic regret. Lastly, we present a minimax lower bound, implying two key facts. First, no algorithm can achieve sub-linear dynamic regret over functions that are not continuous with respect to the context. Second, for strongly convex and smooth functions, the algorithm that we propose achieves, up to a logarithmic factor, the minimax optimal rate of dynamic regret as a function of the number of queries.
♻ ☆ RankCLIP: Ranking-Consistent Language-Image Pretraining
Self-supervised contrastive learning models, such as CLIP, have set new benchmarks for vision-language models in many downstream tasks. However, their dependency on rigid one-to-one mappings overlooks the complex and often multifaceted relationships between and within texts and images. To this end, we introduce RANKCLIP, a novel pretraining method that extends beyond the rigid one-to-one matching framework of CLIP and its variants. By extending the traditional pair-wise loss to list-wise, and leveraging both in-modal and cross-modal ranking consistency, RANKCLIP improves the alignment process, enabling it to capture the nuanced many-to-many relationships between and within each modality. Through comprehensive experiments, we demonstrate the effectiveness of RANKCLIP in various downstream tasks, notably achieving significant gains in zero-shot classifications over state-of-the-art methods, underscoring the importance of this enhanced learning process.
comment: 12 pages, 4 figures, 6 tables. Code and model checkpoints are available at https://github.com/Jam1ezhang/RankCLIP
♻ ☆ Pseudo-Bayesian Optimization
Bayesian Optimization is a popular approach for optimizing expensive black-box functions. Its key idea is to use a surrogate model to approximate the objective and, importantly, quantify the associated uncertainty that allows a sequential search of query points that balance exploitation-exploration. Gaussian process (GP) has been a primary candidate for the surrogate model, thanks to its Bayesian-principled uncertainty quantification power and modeling flexibility. However, its challenges have also spurred an array of alternatives whose convergence properties could be more opaque. Motivated by these, we study in this paper an axiomatic framework that elicits the minimal requirements to guarantee black-box optimization convergence that could apply beyond GP-based methods. Moreover, we leverage the design freedom in our framework, which we call Pseudo-Bayesian Optimization, to construct empirically superior algorithms. In particular, we show how using simple local regression, and a suitable "randomized prior" construction to quantify uncertainty, not only guarantees convergence but also consistently outperforms state-of-the-art benchmarks in examples ranging from high-dimensional synthetic experiments to realistic hyperparameter tuning and robotic applications.
♻ ☆ FedConPE: Efficient Federated Conversational Bandits with Heterogeneous Clients IJCAI
Conversational recommender systems have emerged as a potent solution for efficiently eliciting user preferences. These systems interactively present queries associated with "key terms" to users and leverage user feedback to estimate user preferences more efficiently. Nonetheless, most existing algorithms adopt a centralized approach. In this paper, we introduce FedConPE, a phase elimination-based federated conversational bandit algorithm, where $M$ agents collaboratively solve a global contextual linear bandit problem with the help of a central server while ensuring secure data management. To effectively coordinate all the clients and aggregate their collected data, FedConPE uses an adaptive approach to construct key terms that minimize uncertainty across all dimensions in the feature space. Furthermore, compared with existing federated linear bandit algorithms, FedConPE offers improved computational and communication efficiency as well as enhanced privacy protections. Our theoretical analysis shows that FedConPE is minimax near-optimal in terms of cumulative regret. We also establish upper bounds for communication costs and conversation frequency. Comprehensive evaluations demonstrate that FedConPE outperforms existing conversational bandit algorithms while using fewer conversations.
comment: Accepted to the 33rd International Joint Conference on Artificial Intelligence (IJCAI), 2024
♻ ☆ Physics-aware Machine Learning Revolutionizes Scientific Paradigm for Machine Learning and Process-based Hydrology
Accurate hydrological understanding and water cycle prediction are crucial for addressing scientific and societal challenges associated with the management of water resources, particularly under the dynamic influence of anthropogenic climate change. Existing reviews predominantly concentrate on the development of machine learning (ML) in this field, yet there is a clear distinction between hydrology and ML as separate paradigms. Here, we introduce physics-aware ML as a transformative approach to overcome the perceived barrier and revolutionize both fields. Specifically, we present a comprehensive review of the physics-aware ML methods, building a structured community (PaML) of existing methodologies that integrate prior physical knowledge or physics-based modeling into ML. We systematically analyze these PaML methodologies with respect to four aspects: physical data-guided ML, physics-informed ML, physics-embedded ML, and physics-aware hybrid learning. PaML facilitates ML-aided hypotheses, accelerating insights from big data and fostering scientific discoveries. We first conduct a systematic review of hydrology in PaML, including rainfall-runoff hydrological processes and hydrodynamic processes, and highlight the most promising and challenging directions for different objectives and PaML methods. Finally, a new PaML-based hydrology platform, termed HydroPML, is released as a foundation for hydrological applications. HydroPML enhances the explainability and causality of ML and lays the groundwork for the digital water cycle's realization. The HydroPML platform is publicly available at https://hydropml.github.io/.
comment: 33 pages, 6 figures
♻ ☆ Dynamic Basis Function Interpolation for Adaptive In Situ Data Integration in Ocean Modeling
We propose a new method for combining in situ buoy measurements with Earth system models (ESMs) to improve the accuracy of temperature predictions in the ocean. The technique utilizes the dynamics \textit{and} modes identified in ESMs alongside buoy measurements to improve accuracy while preserving features such as seasonality. We use this technique, which we call Dynamic Basis Function Interpolation, to correct errors in localized temperature predictions made by the Model for Prediction Across Scales Ocean component (MPAS-O) with the Global Drifter Program's in situ ocean buoy dataset.
♻ ☆ FRAPPE: A Group Fairness Framework for Post-Processing Everything ICML 2024
Despite achieving promising fairness-error trade-offs, in-processing mitigation techniques for group fairness cannot be employed in numerous practical applications with limited computation resources or no access to the training pipeline of the prediction model. In these situations, post-processing is a viable alternative. However, current methods are tailored to specific problem settings and fairness definitions and hence, are not as broadly applicable as in-processing. In this work, we propose a framework that turns any regularized in-processing method into a post-processing approach. This procedure prescribes a way to obtain post-processing techniques for a much broader range of problem settings than the prior post-processing literature. We show theoretically and through extensive experiments that our framework preserves the good fairness-error trade-offs achieved with in-processing and can improve over the effectiveness of prior post-processing methods. Finally, we demonstrate several advantages of a modular mitigation strategy that disentangles the training of the prediction model from the fairness mitigation, including better performance on tasks with partial group labels.
comment: Conference paper at ICML 2024
♻ ☆ Transformers Can Represent $n$-gram Language Models
Existing work has analyzed the representational capacity of the transformer architecture by means of formal models of computation. However, the focus so far has been on analyzing the architecture in terms of language \emph{acceptance}. We contend that this is an ill-suited problem in the study of \emph{language models} (LMs), which are definitionally \emph{probability distributions} over strings. In this paper, we focus on the relationship between transformer LMs and $n$-gram LMs, a simple and historically relevant class of language models. We show that transformer LMs using the hard or sparse attention mechanisms can exactly represent any $n$-gram LM, giving us a concrete lower bound on their probabilistic representational capacity. This provides a first step towards understanding the mechanisms that transformer LMs can use to represent probability distributions over strings.
♻ ☆ Low-Rank Quantization-Aware Training for LLMs
Large language models (LLMs) are omnipresent, however their practical deployment is challenging due to their ever increasing computational and memory demands. Quantization is one of the most effective ways to make them more compute and memory efficient. Quantization-aware training (QAT) methods, generally produce the best quantized performance, however it comes at the cost of potentially long training time and excessive memory usage, making it impractical when applying for LLMs. Inspired by parameter-efficient fine-tuning (PEFT) and low-rank adaptation (LoRA) literature, we propose LR-QAT -- a lightweight and memory-efficient QAT algorithm for LLMs. LR-QAT employs several components to save memory without sacrificing predictive performance: (a) low-rank auxiliary weights that are aware of the quantization grid; (b) a downcasting operator using fixed-point or double-packed integers and (c) checkpointing. Unlike most related work, our method (i) is inference-efficient, leading to no additional overhead compared to traditional PTQ; (ii) can be seen as a general extended pretraining framework, meaning that the resulting model can still be utilized for any downstream task afterwards; (iii) can be applied across a wide range of quantization settings, such as different choices quantization granularity, activation quantization, and seamlessly combined with many PTQ techniques. We apply LR-QAT to LLaMA-2/3 and Mistral model families and validate its effectiveness on several downstream tasks. Our method outperforms common post-training quantization (PTQ) approaches and reaches the same model performance as full-model QAT at the fraction of its memory usage. Specifically, we can train a 7B LLM on a single consumer grade GPU with 24GB of memory.
♻ ☆ Certified Robust Accuracy of Neural Networks Are Bounded due to Bayes Errors
Adversarial examples pose a security threat to many critical systems built on neural networks. While certified training improves robustness, it also decreases accuracy noticeably. Despite various proposals for addressing this issue, the significant accuracy drop remains. More importantly, it is not clear whether there is a certain fundamental limit on achieving robustness whilst maintaining accuracy. In this work, we offer a novel perspective based on Bayes errors. By adopting Bayes error to robustness analysis, we investigate the limit of certified robust accuracy, taking into account data distribution uncertainties. We first show that the accuracy inevitably decreases in the pursuit of robustness due to changed Bayes error in the altered data distribution. Subsequently, we establish an upper bound for certified robust accuracy, considering the distribution of individual classes and their boundaries. Our theoretical results are empirically evaluated on real-world datasets and are shown to be consistent with the limited success of existing certified training results, e.g., for CIFAR10, our analysis results in an upper bound (of certified robust accuracy) of 67.49\%, meanwhile existing approaches are only able to increase it from 53.89\% in 2017 to 62.84\% in 2023.
comment: accepted by CAV 2024
♻ ☆ Fast Computation of Optimal Transport via Entropy-Regularized Extragradient Methods
Efficient computation of the optimal transport distance between two distributions serves as an algorithm subroutine that empowers various applications. This paper develops a scalable first-order optimization-based method that computes optimal transport to within $\varepsilon$ additive accuracy with runtime $\widetilde{O}( n^2/\varepsilon)$, where $n$ denotes the dimension of the probability distributions of interest. Our algorithm achieves the state-of-the-art computational guarantees among all first-order methods, while exhibiting favorable numerical performance compared to classical algorithms like Sinkhorn and Greenkhorn. Underlying our algorithm designs are two key elements: (a) converting the original problem into a bilinear minimax problem over probability distributions; (b) exploiting the extragradient idea -- in conjunction with entropy regularization and adaptive learning rates -- to accelerate convergence.
♻ ☆ Multimodal Variational Autoencoder for Low-cost Cardiac Hemodynamics Instability Detection
Recent advancements in non-invasive detection of cardiac hemodynamic instability (CHDI) primarily focus on applying machine learning techniques to a single data modality, e.g. cardiac magnetic resonance imaging (MRI). Despite their potential, these approaches often fall short especially when the size of labeled patient data is limited, a common challenge in the medical domain. Furthermore, only a few studies have explored multimodal methods to study CHDI, which mostly rely on costly modalities such as cardiac MRI and echocardiogram. In response to these limitations, we propose a novel multimodal variational autoencoder ($\text{CardioVAE}_\text{X,G}$) to integrate low-cost chest X-ray (CXR) and electrocardiogram (ECG) modalities with pre-training on a large unlabeled dataset. Specifically, $\text{CardioVAE}_\text{X,G}$ introduces a novel tri-stream pre-training strategy to learn both shared and modality-specific features, thus enabling fine-tuning with both unimodal and multimodal datasets. We pre-train $\text{CardioVAE}_\text{X,G}$ on a large, unlabeled dataset of $50,982$ subjects from a subset of MIMIC database and then fine-tune the pre-trained model on a labeled dataset of $795$ subjects from the ASPIRE registry. Comprehensive evaluations against existing methods show that $\text{CardioVAE}_\text{X,G}$ offers promising performance (AUROC $=0.79$ and Accuracy $=0.77$), representing a significant step forward in non-invasive prediction of CHDI. Our model also excels in producing fine interpretations of predictions directly associated with clinical features, thereby supporting clinical decision-making.
♻ ☆ A Question-centric Multi-experts Contrastive Learning Framework for Improving the Accuracy and Interpretability of Deep Sequential Knowledge Tracing Models KDD
Knowledge tracing (KT) plays a crucial role in predicting students' future performance by analyzing their historical learning processes. Deep neural networks (DNNs) have shown great potential in solving the KT problem. However, there still exist some important challenges when applying deep learning techniques to model the KT process. The first challenge lies in taking the individual information of the question into modeling. This is crucial because, despite questions sharing the same knowledge component (KC), students' knowledge acquisition on homogeneous questions can vary significantly. The second challenge lies in interpreting the prediction results from existing deep learning-based KT models. In real-world applications, while it may not be necessary to have complete transparency and interpretability of the model parameters, it is crucial to present the model's prediction results in a manner that teachers find interpretable. This makes teachers accept the rationale behind the prediction results and utilize them to design teaching activities and tailored learning strategies for students. However, the inherent black-box nature of deep learning techniques often poses a hurdle for teachers to fully embrace the model's prediction results. To address these challenges, we propose a Question-centric Multi-experts Contrastive Learning framework for KT called Q-MCKT. We have provided all the datasets and code on our website at https://github.com/rattlesnakey/Q-MCKT.
comment: 25 pages, 9 figures, Accepted by TKDD
♻ ☆ Biology-inspired joint distribution neurons based on Hierarchical Correlation Reconstruction allowing for multidirectional neural networks
Popular artificial neural networks (ANN) optimize parameters for unidirectional value propagation, assuming some arbitrary parametrization type like Multi-Layer Perceptron (MLP) or Kolmogorov-Arnold Network (KAN). In contrast, for biological neurons e.g. "it is not uncommon for axonal propagation of action potentials to happen in both directions"~\cite{axon} - suggesting they are optimized to continuously operate in multidirectional way. Additionally, statistical dependencies a single neuron could model is not just (expected) value dependence, but entire joint distributions including also higher moments. Such more agnostic joint distribution neuron would allow for multidirectional propagation (of distributions or values) e.g. $\rho(x|y,z)$ or $\rho(y,z|x)$ by substituting to $\rho(x,y,z)$ and normalizing. There will be discussed Hierarchical Correlation Reconstruction (HCR) for such neuron model: assuming $\rho(x,y,z)=\sum_{ijk} a_{ijk} f_i(x) f_j(y) f_k(z)$ type parametrization of joint distribution in polynomial basis $f_i$, which allows for flexible, inexpensive processing including nonlinearities, direct model estimation and update, trained through standard backpropagation or novel ways for such structure up to tensor decomposition or information bottleneck approach. Using only pairwise (input-output) dependencies, its expected value prediction becomes KAN-like with trained activation functions as polynomials, can be extended by adding higher order dependencies through included products - in conscious interpretable way, allowing for multidirectional propagation of both values and probability densities.
comment: 6 pages, 4 figures
♻ ☆ GLIMPSE: Generalized Local Imaging with MLPs
Deep learning is the current de facto state of the art in tomographic imaging. A common approach is to feed the result of a simple inversion, for example the backprojection, to a convolutional neural network (CNN) which then computes the reconstruction. Despite strong results on 'in-distribution' test data similar to the training data, backprojection from sparse-view data delocalizes singularities, so these approaches require a large receptive field to perform well. As a consequence, they overfit to certain global structures which leads to poor generalization on out-of-distribution (OOD) samples. Moreover, their memory complexity and training time scale unfavorably with image resolution, making them impractical for application at realistic clinical resolutions, especially in 3D: a standard U-Net requires a substantial 140GB of memory and 2600 seconds per epoch on a research-grade GPU when training on 1024x1024 images. In this paper, we introduce GLIMPSE, a local processing neural network for computed tomography which reconstructs a pixel value by feeding only the measurements associated with the neighborhood of the pixel to a simple MLP. While achieving comparable or better performance with successful CNNs like the U-Net on in-distribution test data, GLIMPSE significantly outperforms them on OOD samples while maintaining a memory footprint almost independent of image resolution; 5GB memory suffices to train on 1024x1024 images. Further, we built GLIMPSE to be fully differentiable, which enables feats such as recovery of accurate projection angles if they are out of calibration.
comment: 12 pages, 10 figures
♻ ☆ Graph Neural Networks in Histopathology: Emerging Trends and Future Directions
Histopathological analysis of Whole Slide Images (WSIs) has seen a surge in the utilization of deep learning methods, particularly Convolutional Neural Networks (CNNs). However, CNNs often fall short in capturing the intricate spatial dependencies inherent in WSIs. Graph Neural Networks (GNNs) present a promising alternative, adept at directly modeling pairwise interactions and effectively discerning the topological tissue and cellular structures within WSIs. Recognizing the pressing need for deep learning techniques that harness the topological structure of WSIs, the application of GNNs in histopathology has experienced rapid growth. In this comprehensive review, we survey GNNs in histopathology, discuss their applications, and explore emerging trends that pave the way for future advancements in the field. We begin by elucidating the fundamentals of GNNs and their potential applications in histopathology. Leveraging quantitative literature analysis, we identify four emerging trends: Hierarchical GNNs, Adaptive Graph Structure Learning, Multimodal GNNs, and Higher-order GNNs. Through an in-depth exploration of these trends, we offer insights into the evolving landscape of GNNs in histopathological analysis. Based on our findings, we propose future directions to propel the field forward. Our analysis serves to guide researchers and practitioners towards innovative approaches and methodologies, fostering advancements in histopathological analysis through the lens of graph neural networks.
♻ ☆ WWW: What, When, Where to Compute-in-Memory
Compute-in-memory (CiM) has emerged as a highly energy efficient solution for performing matrix multiplication during Machine Learning (ML) inference. However, integrating compute in memory poses key questions, such as 1) What type of CiM to use: Given a multitude of CiM design characteristics, determining their suitability from architecture perspective is needed. 2) When to use CiM: ML inference includes workloads with a variety of memory and compute requirements, making it difficult to identify when CiM is more beneficial. 3) Where to integrate CiM: Each memory level has different bandwidth and capacity, creating different data reuse opportunities for CiM integration. To answer such questions regarding on-chip CiM integration for accelerating ML workloads, we use an analytical architecture evaluation methodology where we tailor the dataflow mapping. The mapping algorithm aims to achieve highest weight reuse and reduced data movements for a given CiM prototype and workload. Our experiments show that CiM integrated memory improves energy efficiency by up to 3.4x and throughput by up to 15.6x compared to tensor-core-like baseline architecture, with INT-8 precision under iso-area constraints. We believe the proposed work provides insights into what type of CiM to use, and when and where to optimally integrate it in the cache hierarchy for efficient matrix multiplication.
comment: updated methodology
♻ ☆ Predicting and Interpreting Energy Barriers of Metallic Glasses with Graph Neural Networks ICML 2024
Metallic Glasses (MGs) are widely used materials that are stronger than steel while being shapeable as plastic. While understanding the structure-property relationship of MGs remains a challenge in materials science, studying their energy barriers (EBs) as an intermediary step shows promise. In this work, we utilize Graph Neural Networks (GNNs) to model MGs and study EBs. We contribute a new dataset for EB prediction and a novel Symmetrized GNN (SymGNN) model that is E(3)-invariant in expectation. SymGNN handles invariance by aggregating over orthogonal transformations of the graph structure. When applied to EB prediction, SymGNN are more accurate than molecular dynamics (MD) local-sampling methods and other machine-learning models. Compared to precise MD simulations, SymGNN reduces the inference time on new MGs from roughly 41 days to less than one second. We apply explanation algorithms to reveal the relationship between structures and EBs. The structures that we identify through explanations match the medium-range order (MRO) hypothesis and possess unique topological properties. Our work enables effective prediction and interpretation of MG EBs, bolstering material science research.
comment: Accepted at ICML 2024
♻ ☆ Compressed representation of brain genetic transcription
The architecture of the brain is too complex to be intuitively surveyable without the use of compressed representations that project its variation into a compact, navigable space. The task is especially challenging with high-dimensional data, such as gene expression, where the joint complexity of anatomical and transcriptional patterns demands maximum compression. Established practice is to use standard principal component analysis (PCA), whose computational felicity is offset by limited expressivity, especially at great compression ratios. Employing whole-brain, voxel-wise Allen Brain Atlas transcription data, here we systematically compare compressed representations based on the most widely supported linear and non-linear methods-PCA, kernel PCA, non-negative matrix factorization (NMF), t-stochastic neighbour embedding (t-SNE), uniform manifold approximation and projection (UMAP), and deep auto-encoding-quantifying reconstruction fidelity, anatomical coherence, and predictive utility with respect to signalling, microstructural, and metabolic targets. We show that deep auto-encoders yield superior representations across all metrics of performance and target domains, supporting their use as the reference standard for representing transcription patterns in the human brain.
comment: 22 pages, 5 main figures, 1 supplementary figure
♻ ☆ A Single Graph Convolution Is All You Need: Efficient Grayscale Image Classification ICIP 2024
Image classifiers often rely on convolutional neural networks (CNN) for their tasks, which, for image classification, experience high latency due to the number of operations they perform, which can be problematic in real-time applications. Additionally, many image classification models work on both RGB and grayscale datasets. Classifiers that operate solely on grayscale images are much less common. Grayscale image classification has diverse applications, including but not limited to medical image classification and synthetic aperture radar (SAR) automatic target recognition (ATR). Thus, we present a novel grayscale image classification approach using a vectorized view of images. We exploit the lightweightness of MLPs by viewing images as vectors and reducing our problem setting to the grayscale image classification setting. We find that using a single graph convolutional layer batch-wise increases accuracy and reduces variance in the performance of our model. Moreover, we develop a customized accelerator on FPGA for the proposed model with several optimizations to improve its performance. Our experimental results on benchmark grayscale image datasets demonstrate the effectiveness of the proposed model, achieving vastly lower latency (up to 16$\times$ less) and competitive or leading performance compared to other state-of-the-art image classification models on various domain-specific grayscale image classification datasets.
comment: Accepted to IEEE ICIP 2024
♻ ☆ QC-Forest: a Classical-Quantum Algorithm to Provably Speedup Retraining of Random Forest
Random Forest (RF) is a popular tree-ensemble method for supervised learning, prized for its ease of use and flexibility. Online RF models require to account for new training data to maintain model accuracy. This is particularly important in applications where data is periodically and sequentially generated over time in data streams, such as auto-driving systems, and credit card payments. In this setting, performing periodic model retraining with the old and new data accumulated is beneficial as it fully captures possible drifts in the data distribution over time. However, this is unpractical with state-of-the-art classical algorithms for RF as they scale linearly with the accumulated number of samples. We propose QC-Forest, a classical-quantum algorithm designed to time-efficiently retrain RF models in the streaming setting for multi-class classification and regression, achieving a runtime poly-logarithmic in the total number of accumulated samples. QC-Forest leverages Des-q, a quantum algorithm for single tree construction and retraining proposed by Kumar et al. by expanding to multi-class classification, as the original proposal was limited to binary classes, and introducing an exact classical method to replace an underlying quantum subroutine incurring a finite error, while maintaining the same poly-logarithmic dependence. Finally, we showcase that QC-Forest achieves competitive accuracy in comparison to state-of-the-art RF methods on widely used benchmark datasets with up to 80,000 samples, while significantly speeding up the model retrain.
♻ ☆ Diffusion model for relational inference
Dynamical behaviors of complex interacting systems, including brain activities, financial price movements, and physical collective phenomena, are associated with underlying interactions between the system's components. The issue of uncovering interaction relations in such systems using observable dynamics is called relational inference. In this study, we propose a Diffusion model for Relational Inference (DiffRI), inspired by a self-supervised method for probabilistic time series imputation. DiffRI learns to infer the probability of the presence of connections between components through conditional diffusion modeling.
♻ ☆ Imputation of missing values in multi-view data
Data for which a set of objects is described by multiple distinct feature sets (called views) is known as multi-view data. When missing values occur in multi-view data, all features in a view are likely to be missing simultaneously. This may lead to very large quantities of missing data which, especially when combined with high-dimensionality, can make the application of conditional imputation methods computationally infeasible. However, the multi-view structure could be leveraged to reduce the complexity and computational load of imputation. We introduce a new imputation method based on the existing stacked penalized logistic regression (StaPLR) algorithm for multi-view learning. It performs imputation in a dimension-reduced space to address computational challenges inherent to the multi-view context. We compare the performance of the new imputation method with several existing imputation algorithms in simulated data sets and a real data application. The results show that the new imputation method leads to competitive results at a much lower computational cost, and makes the use of advanced imputation algorithms such as missForest and predictive mean matching possible in settings where they would otherwise be computationally infeasible.
comment: 49 pages, 15 figures. Accepted manuscript
♻ ☆ The Empirical Impact of Neural Parameter Symmetries, or Lack Thereof
Many algorithms and observed phenomena in deep learning appear to be affected by parameter symmetries -- transformations of neural network parameters that do not change the underlying neural network function. These include linear mode connectivity, model merging, Bayesian neural network inference, metanetworks, and several other characteristics of optimization or loss-landscapes. However, theoretical analysis of the relationship between parameter space symmetries and these phenomena is difficult. In this work, we empirically investigate the impact of neural parameter symmetries by introducing new neural network architectures that have reduced parameter space symmetries. We develop two methods, with some provable guarantees, of modifying standard neural networks to reduce parameter space symmetries. With these new methods, we conduct a comprehensive experimental study consisting of multiple tasks aimed at assessing the effect of removing parameter symmetries. Our experiments reveal several interesting observations on the empirical impact of parameter symmetries; for instance, we observe linear mode connectivity between our networks without alignment of weight spaces, and we find that our networks allow for faster and more effective Bayesian neural network training.
comment: 27 pages. Preparing code for release. v2: added / updated some citations
♻ ☆ Keep Moving: identifying task-relevant subspaces to maximise plasticity for newly learned tasks
Continual learning algorithms strive to acquire new knowledge while preserving prior information. Often, these algorithms emphasise stability and restrict network updates upon learning new tasks. In many cases, such restrictions come at a cost to the model's plasticity, i.e. the model's ability to adapt to the requirements of a new task. But is all change detrimental? Here, we approach this question by proposing that activation spaces in neural networks can be decomposed into two subspaces: a readout range in which change affects prior tasks and a null space in which change does not alter prior performance. Based on experiments with this novel technique, we show that, indeed, not all activation change is associated with forgetting. Instead, only change in the subspace visible to the readout of a task can lead to decreased stability, while restricting change outside of this subspace is associated only with a loss of plasticity. Analysing various commonly used algorithms, we show that regularisation-based techniques do not fully disentangle the two spaces and, as a result, restrict plasticity more than need be. We expand our results by investigating a linear model in which we can manipulate learning in the two subspaces directly and thus causally link activation changes to stability and plasticity. For hierarchical, nonlinear cases, we present an approximation that enables us to estimate functionally relevant subspaces at every layer of a deep nonlinear network, corroborating our previous insights. Together, this work provides novel means to derive insights into the mechanisms behind stability and plasticity in continual learning and may serve as a diagnostic tool to guide developments of future continual learning algorithms that stabilise inference while allowing maximal space for learning.
comment: 17 pages, 7 figures, Revision now accepted at COLLAS 2024
♻ ☆ Asynchronous Multi-Server Federated Learning for Geo-Distributed Clients
Federated learning (FL) systems enable multiple clients to train a machine learning model iteratively through synchronously exchanging the intermediate model weights with a single server. The scalability of such FL systems can be limited by two factors: server idle time due to synchronous communication and the risk of a single server becoming the bottleneck. In this paper, we propose a new FL architecture, to our knowledge, the first multi-server FL system that is entirely asynchronous, and therefore addresses these two limitations simultaneously. Our solution keeps both servers and clients continuously active. As in previous multi-server methods, clients interact solely with their nearest server, ensuring efficient update integration into the model. Differently, however, servers also periodically update each other asynchronously, and never postpone interactions with clients. We compare our solution to three representative baselines - FedAvg, FedAsync and HierFAVG - on the MNIST and CIFAR-10 image classification datasets and on the WikiText-2 language modeling dataset. Our solution converges to similar or higher accuracy levels than previous baselines and requires 61% less time to do so in geo-distributed settings.
♻ ☆ Precipitation Downscaling with Spatiotemporal Video Diffusion
In climate science and meteorology, high-resolution local precipitation (rain and snowfall) predictions are limited by the computational costs of simulation-based methods. Statistical downscaling, or super-resolution, is a common workaround where a low-resolution prediction is improved using statistical approaches. Unlike traditional computer vision tasks, weather and climate applications require capturing the accurate conditional distribution of high-resolution given low-resolution patterns to assure reliable ensemble averages and unbiased estimates of extreme events, such as heavy rain. This work extends recent video diffusion models to precipitation super-resolution, employing a deterministic downscaler followed by a temporally-conditioned diffusion model to capture noise characteristics and high-frequency patterns. We test our approach on FV3GFS output, an established large-scale global atmosphere model, and compare it against six state-of-the-art baselines. Our analysis, capturing CRPS, MSE, precipitation distributions, and qualitative aspects using California and the Himalayas as examples, establishes our method as a new standard for data-driven precipitation downscaling.
♻ ☆ Vlearn: Off-Policy Learning with Efficient State-Value Function Estimation
Existing off-policy reinforcement learning algorithms often rely on an explicit state-action-value function representation, which can be problematic in high-dimensional action spaces due to the curse of dimensionality. This reliance results in data inefficiency as maintaining a state-action-value function in such spaces is challenging. We present an efficient approach that utilizes only a state-value function as the critic for off-policy deep reinforcement learning. This approach, which we refer to as Vlearn, effectively circumvents the limitations of existing methods by eliminating the necessity for an explicit state-action-value function. To this end, we introduce a novel importance sampling loss for learning deep value functions from off-policy data. While this is common for linear methods, it has not been combined with deep value function networks. This transfer to deep methods is not straightforward and requires novel design choices such as robust policy updates, twin value function networks to avoid an optimization bias, and importance weight clipping. We also present a novel analysis of the variance of our estimate compared to commonly used importance sampling estimators such as V-trace. Our approach improves sample complexity as well as final performance and ensures consistent and robust performance across various benchmark tasks. Eliminating the state-action-value function in Vlearn facilitates a streamlined learning process, enabling more effective exploration and exploitation in complex environments.
♻ ☆ Self-supervised Learning for Human Activity Recognition Using 700,000 Person-days of Wearable Data
Advances in deep learning for human activity recognition have been relatively limited due to the lack of large labelled datasets. In this study, we leverage self-supervised learning techniques on the UK-Biobank activity tracker dataset--the largest of its kind to date--containing more than 700,000 person-days of unlabelled wearable sensor data. Our resulting activity recognition model consistently outperformed strong baselines across seven benchmark datasets, with an F1 relative improvement of 2.5%-100% (median 18.4%), the largest improvements occurring in the smaller datasets. In contrast to previous studies, our results generalise across external datasets, devices, and environments. Our open-source model will help researchers and developers to build customisable and generalisable activity classifiers with high performance.
♻ ☆ Lessons on Datasets and Paradigms in Machine Learning for Symbolic Computation: A Case Study on CAD
Symbolic Computation algorithms and their implementation in computer algebra systems often contain choices which do not affect the correctness of the output but can significantly impact the resources required: such choices can benefit from having them made separately for each problem via a machine learning model. This study reports lessons on such use of machine learning in symbolic computation, in particular on the importance of analysing datasets prior to machine learning and on the different machine learning paradigms that may be utilised. We present results for a particular case study, the selection of variable ordering for cylindrical algebraic decomposition, but expect that the lessons learned are applicable to other decisions in symbolic computation. We utilise an existing dataset of examples derived from applications which was found to be imbalanced with respect to the variable ordering decision. We introduce an augmentation technique for polynomial systems problems that allows us to balance and further augment the dataset, improving the machine learning results by 28\% and 38\% on average, respectively. We then demonstrate how the existing machine learning methodology used for the problem $-$ classification $-$ might be recast into the regression paradigm. While this does not have a radical change on the performance, it does widen the scope in which the methodology can be applied to make choices.
♻ ☆ Contractive Systems Improve Graph Neural Networks Against Adversarial Attacks
Graph Neural Networks (GNNs) have established themselves as a key component in addressing diverse graph-based tasks. Despite their notable successes, GNNs remain susceptible to input perturbations in the form of adversarial attacks. This paper introduces an innovative approach to fortify GNNs against adversarial perturbations through the lens of contractive dynamical systems. Our method introduces graph neural layers based on differential equations with contractive properties, which, as we show, improve the robustness of GNNs. A distinctive feature of the proposed approach is the simultaneous learned evolution of both the node features and the adjacency matrix, yielding an intrinsic enhancement of model robustness to perturbations in the input features and the connectivity of the graph. We mathematically derive the underpinnings of our novel architecture and provide theoretical insights to reason about its expected behavior. We demonstrate the efficacy of our method through numerous real-world benchmarks, reading on par or improved performance compared to existing methods.
♻ ☆ Active Few-Shot Fine-Tuning
We study the question: How can we select the right data for fine-tuning to a specific task? We call this data selection problem active fine-tuning and show that it is an instance of transductive active learning, a novel generalization of classical active learning. We propose ITL, short for information-based transductive learning, an approach which samples adaptively to maximize information gained about the specified task. We are the first to show, under general regularity assumptions, that such decision rules converge uniformly to the smallest possible uncertainty obtainable from the accessible data. We apply ITL to the few-shot fine-tuning of large neural networks and show that fine-tuning with ITL learns the task with significantly fewer examples than the state-of-the-art.
♻ ☆ Measuring and Mitigating Biases in Motor Insurance Pricing
The non-life insurance sector operates within a highly competitive and tightly regulated framework, confronting a pivotal juncture in the formulation of pricing strategies. Insurers are compelled to harness a range of statistical methodologies and available data to construct optimal pricing structures that align with the overarching corporate strategy while accommodating the dynamics of market competition. Given the fundamental societal role played by insurance, premium rates are subject to rigorous scrutiny by regulatory authorities. These rates must conform to principles of transparency, explainability, and ethical considerations. Consequently, the act of pricing transcends mere statistical calculations and carries the weight of strategic and societal factors. These multifaceted concerns may drive insurers to establish equitable premiums, taking into account various variables. For instance, regulations mandate the provision of equitable premiums, considering factors such as policyholder gender or mutualist group dynamics in accordance with respective corporate strategies. Age-based premium fairness is also mandated. In certain insurance domains, variables such as the presence of serious illnesses or disabilities are emerging as new dimensions for evaluating fairness. Regardless of the motivating factor prompting an insurer to adopt fairer pricing strategies for a specific variable, the insurer must possess the capability to define, measure, and ultimately mitigate any ethical biases inherent in its pricing practices while upholding standards of consistency and performance. This study seeks to provide a comprehensive set of tools for these endeavors and assess their effectiveness through practical application in the context of automobile insurance.
♻ ☆ On the rate of convergence of an over-parametrized Transformer classifier learned by gradient descent
One of the most recent and fascinating breakthroughs in artificial intelligence is ChatGPT, a chatbot which can simulate human conversation. ChatGPT is an instance of GPT4, which is a language model based on generative gredictive gransformers. So if one wants to study from a theoretical point of view, how powerful such artificial intelligence can be, one approach is to consider transformer networks and to study which problems one can solve with these networks theoretically. Here it is not only important what kind of models these network can approximate, or how they can generalize their knowledge learned by choosing the best possible approximation to a concrete data set, but also how well optimization of such transformer network based on concrete data set works. In this article we consider all these three different aspects simultaneously and show a theoretical upper bound on the missclassification probability of a transformer network fitted to the observed data. For simplicity we focus in this context on transformer encoder networks which can be applied to define an estimate in the context of a classification problem involving natural language.
♻ ☆ On the Inductive Biases of Demographic Parity-based Fair Learning Algorithms
Fair supervised learning algorithms assigning labels with little dependence on a sensitive attribute have attracted great attention in the machine learning community. While the demographic parity (DP) notion has been frequently used to measure a model's fairness in training fair classifiers, several studies in the literature suggest potential impacts of enforcing DP in fair learning algorithms. In this work, we analytically study the effect of standard DP-based regularization methods on the conditional distribution of the predicted label given the sensitive attribute. Our analysis shows that an imbalanced training dataset with a non-uniform distribution of the sensitive attribute could lead to a classification rule biased toward the sensitive attribute outcome holding the majority of training data. To control such inductive biases in DP-based fair learning, we propose a sensitive attribute-based distributionally robust optimization (SA-DRO) method improving robustness against the marginal distribution of the sensitive attribute. Finally, we present several numerical results on the application of DP-based learning methods to standard centralized and distributed learning problems. The empirical findings support our theoretical results on the inductive biases in DP-based fair learning algorithms and the debiasing effects of the proposed SA-DRO method.
♻ ☆ Trading Devil: Robust backdoor attack via Stochastic investment models and Bayesian approach
With the growing use of voice-activated systems and speech recognition technologies, the danger of backdoor attacks on audio data has grown significantly. This research looks at a specific type of attack, known as a Stochastic investment-based backdoor attack (MarketBack), in which adversaries strategically manipulate the stylistic properties of audio to fool speech recognition systems. The security and integrity of machine learning models are seriously threatened by backdoor attacks, in order to maintain the reliability of audio applications and systems, the identification of such attacks becomes crucial in the context of audio data. Experimental results demonstrated that MarketBack is feasible to achieve an average attack success rate close to 100% in seven victim models when poisoning less than 1% of the training data.
comment: (Last update) Stochastic investment models and a Bayesian approach to better modeling of uncertainty : adversarial machine learning or Stochastic market. arXiv admin note: substantial text overlap with arXiv:2402.05967
♻ ☆ Eigenpruning: an Interpretability-Inspired PEFT Method NAACL 2024
We introduce eigenpruning, a method that removes singular values from weight matrices in an LLM to improve its performance in a particular task. This method is inspired by interpretability methods designed to automatically find subnetworks of a model which solve a specific task. In our tests, the pruned model outperforms the original model by a large margin, while only requiring minimal computation to prune the weight matrices. In the case of a small synthetic task in integer multiplication, the Phi-2 model can improve its accuracy in the test set from 13.75% to 97.50%. Interestingly, these results seem to indicate the existence of a computation path that can solve the task very effectively, but it was not being used by the original model. Finally, we publicly release our implementation.
comment: Extended abstract accepted to LatinX at NAACL 2024
♻ ☆ Neural Collapse in Multi-label Learning with Pick-all-label Loss
We study deep neural networks for the multi-label classification (MLab) task through the lens of neural collapse (NC). Previous works have been restricted to the multi-class classification setting and discovered a prevalent NC phenomenon comprising of the following properties for the last-layer features: (i) the variability of features within every class collapses to zero, (ii) the set of feature means form an equi-angular tight frame (ETF), and (iii) the last layer classifiers collapse to the feature mean upon some scaling. We generalize the study to multi-label learning, and prove for the first time that a generalized NC phenomenon holds with the "pick-all-label" formulation, which we term as MLab NC. While the ETF geometry remains consistent for features with a single label, multi-label scenarios introduce a unique combinatorial aspect we term the "tag-wise average" property, where the means of features with multiple labels are the scaled averages of means for single-label instances. Theoretically, under proper assumptions on the features, we establish that the only global optimizer of the pick-all-label cross-entropy loss satisfy the multi-label NC. In practice, we demonstrate that our findings can lead to better test performance with more efficient training techniques for MLab learning.
♻ ☆ Exploring Transfer Learning in Medical Image Segmentation using Vision-Language Models
Medical image segmentation allows quantifying target structure size and shape, aiding in disease diagnosis, prognosis, surgery planning, and comprehension.Building upon recent advancements in foundation Vision-Language Models (VLMs) from natural image-text pairs, several studies have proposed adapting them to Vision-Language Segmentation Models (VLSMs) that allow using language text as an additional input to segmentation models. Introducing auxiliary information via text with human-in-the-loop prompting during inference opens up unique opportunities, such as open vocabulary segmentation and potentially more robust segmentation models against out-of-distribution data. Although transfer learning from natural to medical images has been explored for image-only segmentation models, the joint representation of vision-language in segmentation problems remains underexplored. This study introduces the first systematic study on transferring VLSMs to 2D medical images, using carefully curated $11$ datasets encompassing diverse modalities and insightful language prompts and experiments. Our findings demonstrate that although VLSMs show competitive performance compared to image-only models for segmentation after finetuning in limited medical image datasets, not all VLSMs utilize the additional information from language prompts, with image features playing a dominant role. While VLSMs exhibit enhanced performance in handling pooled datasets with diverse modalities and show potential robustness to domain shifts compared to conventional segmentation models, our results suggest that novel approaches are required to enable VLSMs to leverage the various auxiliary information available through language prompts. The code and datasets are available at https://github.com/naamiinepal/medvlsm.
comment: Medical Imaging with Deep Learning (MIDL) 2024 (Oral)
♻ ☆ FlowPrecision: Advancing FPGA-Based Real-Time Fluid Flow Estimation with Linear Quantization
In industrial and environmental monitoring, achieving real-time and precise fluid flow measurement remains a critical challenge. This study applies linear quantization in FPGA-based soft sensors for fluid flow estimation, significantly enhancing Neural Network model precision by overcoming the limitations of traditional fixed-point quantization. Our approach achieves up to a 10.10% reduction in Mean Squared Error and a notable 9.39% improvement in inference speed through targeted hardware optimizations. Validated across multiple data sets, our findings demonstrate that the optimized FPGA-based quantized models can provide efficient, accurate real-time inference, offering a viable alternative to cloud-based processing in pervasive autonomous systems.
comment: 6 pages, 3 figures, The 22nd International Conference on Pervasive Computing and Communications (PerCom 2024), PerConAI Workshop
♻ ☆ Fusion-PSRO: Nash Policy Fusion for Policy Space Response Oracles
A popular approach for solving zero-sum games is to maintain populations of policies to approximate the Nash Equilibrium (NE). Previous studies have shown that Policy Space Response Oracle (PSRO) algorithm is an effective multi-agent reinforcement learning framework for solving such games. However, repeatedly training new policies from scratch to approximate Best Response (BR) to opponents' mixed policies at each iteration is both inefficient and costly. While some PSRO variants initialize a new policy by inheriting from past BR policies, this approach limits the exploration of new policies, especially against challenging opponents. To address this issue, we propose Fusion-PSRO, which employs policy fusion to initialize policies for better approximation to BR. By selecting high-quality base policies from meta-NE, policy fusion fuses the base policies into a new policy through model averaging. This approach allows the initialized policies to incorporate multiple expert policies, making it easier to handle difficult opponents compared to inheriting from past BR policies or initializing from scratch. Moreover, our method only modifies the policy initialization phase, allowing its application to nearly all PSRO variants without additional training overhead. Our experiments on non-transitive matrix games, Leduc Poker, and the more complex Liars Dice demonstrate that Fusion-PSRO enhances the performance of nearly all PSRO variants, achieving lower exploitability.
comment: 20 pages, 5 figures
♻ ☆ The statistical thermodynamics of generative diffusion models: Phase transitions, symmetry breaking and critical instability
Generative diffusion models have achieved spectacular performance in many areas of machine learning and generative modeling. While the fundamental ideas behind these models come from non-equilibrium physics, variational inference and stochastic calculus, in this paper we show that many aspects of these models can be understood using the tools of equilibrium statistical mechanics. Using this reformulation, we show that generative diffusion models undergo second-order phase transitions corresponding to symmetry breaking phenomena. We show that these phase-transitions are always in a mean-field universality class, as they are the result of a self-consistency condition in the generative dynamics. We argue that the critical instability that arises from the phase transitions lies at the heart of their generative capabilities, which are characterized by a set of mean-field critical exponents. Finally, we show that the dynamic equation of the generative process can be interpreted as a stochastic adiabatic transformation that minimizes the free energy while keeping the system in thermal equilibrium.
♻ ☆ Decoupled Subgraph Federated Learning
We address the challenge of federated learning on graph-structured data distributed across multiple clients. Specifically, we focus on the prevalent scenario of interconnected subgraphs, where interconnections between different clients play a critical role. We present a novel framework for this scenario, named FedStruct, that harnesses deep structural dependencies. To uphold privacy, unlike existing methods, FedStruct eliminates the necessity of sharing or generating sensitive node features or embeddings among clients. Instead, it leverages explicit global graph structure information to capture inter-node dependencies. We validate the effectiveness of FedStruct through experimental results conducted on six datasets for semi-supervised node classification, showcasing performance close to the centralized approach across various scenarios, including different data partitioning methods, varying levels of label availability, and number of clients.
comment: Updated version. Main changes: 1. Title; 2. Added discussion on communication complexity and a pruned version of our framework; 3. Focused on our general framework for the scenario where server lacks knowledge of global graph connections, and discussed the scenario with complete knowledge in Appendix; 4. Compared FedStruct with FedPub; 5. Included results using FedStar and for federated averaging
♻ ☆ Asynchronous Byzantine Federated Learning
Federated learning (FL) enables a set of geographically distributed clients to collectively train a model through a server. Classically, the training process is synchronous, but can be made asynchronous to maintain its speed in presence of slow clients and in heterogeneous networks. The vast majority of Byzantine fault-tolerant FL systems however rely on a synchronous training process. Our solution is one of the first Byzantine-resilient and asynchronous FL algorithms that does not require an auxiliary server dataset and is not delayed by stragglers, which are shortcomings of previous works. Intuitively, the server in our solution waits to receive a minimum number of updates from clients on its latest model to safely update it, and is later able to safely leverage the updates that late clients might send. We compare the performance of our solution with state-of-the-art algorithms on both image and text datasets under gradient inversion, perturbation, and backdoor attacks. Our results indicate that our solution trains a model faster than previous synchronous FL solution, and maintains a higher accuracy, up to 1.54x and up to 1.75x for perturbation and gradient inversion attacks respectively, in the presence of Byzantine clients than previous asynchronous FL solutions.
♻ ☆ Learning-based Multi-continuum Model for Multiscale Flow Problems
Multiscale problems can usually be approximated through numerical homogenization by an equation with some effective parameters that can capture the macroscopic behavior of the original system on the coarse grid to speed up the simulation. However, this approach usually assumes scale separation and that the heterogeneity of the solution can be approximated by the solution average in each coarse block. For complex multiscale problems, the computed single effective properties/continuum might be inadequate. In this paper, we propose a novel learning-based multi-continuum model to enrich the homogenized equation and improve the accuracy of the single continuum model for multiscale problems with some given data. Without loss of generalization, we consider a two-continuum case. The first flow equation keeps the information of the original homogenized equation with an additional interaction term. The second continuum is newly introduced, and the effective permeability in the second flow equation is determined by a neural network. The interaction term between the two continua aligns with that used in the Dual-porosity model but with a learnable coefficient determined by another neural network. The new model with neural network terms is then optimized using trusted data. We discuss both direct back-propagation and the adjoint method for the PDE-constraint optimization problem. Our proposed learning-based multi-continuum model can resolve multiple interacted media within each coarse grid block and describe the mass transfer among them, and it has been demonstrated to significantly improve the simulation results through numerical experiments involving both linear and nonlinear flow equations.
comment: Corrected typos
♻ ☆ Bayesian Structural Model Updating with Multimodal Variational Autoencoder
A novel framework for Bayesian structural model updating is presented in this study. The proposed method utilizes the surrogate unimodal encoders of a multimodal variational autoencoder (VAE). The method facilitates an approximation of the likelihood when dealing with a small number of observations. It is particularly suitable for high-dimensional correlated simultaneous observations applicable to various dynamic analysis models. The proposed approach was benchmarked using a numerical model of a single-story frame building with acceleration and dynamic strain measurements. Additionally, an example involving a Bayesian update of nonlinear model parameters for a three-degree-of-freedom lumped mass model demonstrates computational efficiency when compared to using the original VAE, while maintaining adequate accuracy for practical applications.
comment: 44 pages, 21 figures
♻ ☆ Progress Measures for Grokking on Real-world Tasks
Grokking, a phenomenon where machine learning models generalize long after overfitting, has been primarily observed and studied in algorithmic tasks. This paper explores grokking in real-world datasets using deep neural networks for classification under the cross-entropy loss. We challenge the prevalent hypothesis that the $L_2$ norm of weights is the primary cause of grokking by demonstrating that grokking can occur outside the expected range of weight norms. To better understand grokking, we introduce three new progress measures: activation sparsity, absolute weight entropy, and approximate local circuit complexity. These measures are conceptually related to generalization and demonstrate a stronger correlation with grokking in real-world datasets compared to weight norms. Our findings suggest that while weight norms might usually correlate with grokking and our progress measures, they are not causative, and our proposed measures provide a better understanding of the dynamics of grokking.
comment: 5 pages
♻ ☆ SSUMamba: Spatial-Spectral Selective State Space Model for Hyperspectral Image Denoising
Denoising is a crucial preprocessing step for hyperspectral images (HSIs) due to noise arising from intraimaging mechanisms and environmental factors. Long-range spatial-spectral correlation modeling is beneficial for HSI denoising but often comes with high computational complexity. Based on the state space model (SSM), Mamba is known for its remarkable long-range dependency modeling capabilities and computational efficiency. Building on this, we introduce a memory-efficient spatial-spectral UMamba (SSUMamba) for HSI denoising, with the spatial-spectral continuous scan (SSCS) Mamba being the core component. SSCS Mamba alternates the row, column, and band in six different orders to generate the sequence and uses the bidirectional SSM to exploit long-range spatial-spectral dependencies. In each order, the images are rearranged between adjacent scans to ensure spatial-spectral continuity. Additionally, 3D convolutions are embedded into the SSCS Mamba to enhance local spatial-spectral modeling. Experiments demonstrate that SSUMamba achieves superior denoising results with lower memory consumption per batch compared to transformer-based methods. The source code is available at https://github.com/lronkitty/SSUMamba.
♻ ☆ ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming
When building Large Language Models (LLMs), it is paramount to bear safety in mind and protect them with guardrails. Indeed, LLMs should never generate content promoting or normalizing harmful, illegal, or unethical behavior that may contribute to harm to individuals or society. This principle applies to both normal and adversarial use. In response, we introduce ALERT, a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy. It is designed to evaluate the safety of LLMs through red teaming methodologies and consists of more than 45k instructions categorized using our novel taxonomy. By subjecting LLMs to adversarial testing scenarios, ALERT aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models. Furthermore, the fine-grained taxonomy enables researchers to perform an in-depth evaluation that also helps one to assess the alignment with various policies. In our experiments, we extensively evaluate 10 popular open- and closed-source LLMs and demonstrate that many of them still struggle to attain reasonable levels of safety.
comment: 17 pages, preprint
♻ ☆ Dependency-Aware Semi-Structured Sparsity: Declining Roles of Outliers in Pruning GLU-based LLMs
The rapid growth in the scale of Large Language Models (LLMs) has led to significant computational and memory costs, making model compression techniques such as network pruning increasingly crucial for their efficient deployment. Recent LLMs such as LLaMA2 and Mistral have adopted GLU-based MLP architectures. However, current LLM pruning strategies are primarily based on insights from older LLM architectures, necessitating a reevaluation of these strategies to suit the new architectural characteristics. Contrary to traditional beliefs, we find that outliers play a diminished role in the input projections of GLU-based MLPs. Leveraging this new insight, we propose Dependency-aware Semi-structured Sparsity (DaSS), a novel pruning method for GLU-based LLMs. DaSS balances the flexibility of unstructured pruning and the structural consistency of dependency-based structured pruning by considering both of weight magnitude and corresponding intermediate activation norms in weight pruning metric. Empirical evaluations on the Mistral, Gemma, and LLaMA2 model families demonstrate the consistent effectiveness of DaSS in the prevailing GLU variants.
♻ ☆ Improving Neural Topic Models with Wasserstein Knowledge Distillation ECIR 2023
Topic modeling is a dominant method for exploring document collections on the web and in digital libraries. Recent approaches to topic modeling use pretrained contextualized language models and variational autoencoders. However, large neural topic models have a considerable memory footprint. In this paper, we propose a knowledge distillation framework to compress a contextualized topic model without loss in topic quality. In particular, the proposed distillation objective is to minimize the cross-entropy of the soft labels produced by the teacher and the student models, as well as to minimize the squared 2-Wasserstein distance between the latent distributions learned by the two models. Experiments on two publicly available datasets show that the student trained with knowledge distillation achieves topic coherence much higher than that of the original student model, and even surpasses the teacher while containing far fewer parameters than the teacher's. The distilled model also outperforms several other competitive topic models on topic coherence.
comment: Accepted at ECIR 2023
♻ ☆ FAST: Feature Aware Similarity Thresholding for Weak Unlearning in Black-Box Generative Models
The heightened emphasis on the regulation of deep generative models, propelled by escalating concerns pertaining to privacy and compliance with regulatory frameworks, underscores the imperative need for precise control mechanisms over these models. This urgency is particularly underscored by instances in which generative models generate outputs that encompass objectionable, offensive, or potentially injurious content. In response, machine unlearning has emerged to selectively forget specific knowledge or remove the influence of undesirable data subsets from pre-trained models. However, modern machine unlearning approaches typically assume access to model parameters and architectural details during unlearning, which is not always feasible. In multitude of downstream tasks, these models function as black-box systems, with inaccessible pre-trained parameters, architectures, and training data. In such scenarios, the possibility of filtering undesired outputs becomes a practical alternative. The primary goal of this study is twofold: first, to elucidate the relationship between filtering and unlearning processes, and second, to formulate a methodology aimed at mitigating the display of undesirable outputs generated from models characterized as black-box systems. Theoretical analysis in this study demonstrates that, in the context of black-box models, filtering can be seen as a form of weak unlearning. Our proposed \textbf{\textit{Feature Aware Similarity Thresholding(FAST)}} method effectively suppresses undesired outputs by systematically encoding the representation of unwanted features in the latent space.
♻ ☆ Robust $Q$-learning Algorithm for Markov Decision Processes under Wasserstein Uncertainty
We present a novel $Q$-learning algorithm tailored to solve distributionally robust Markov decision problems where the corresponding ambiguity set of transition probabilities for the underlying Markov decision process is a Wasserstein ball around a (possibly estimated) reference measure. We prove convergence of the presented algorithm and provide several examples also using real data to illustrate both the tractability of our algorithm as well as the benefits of considering distributional robustness when solving stochastic optimal control problems, in particular when the estimated distributions turn out to be misspecified in practice.
♻ ☆ On the Convergence of Adaptive Gradient Methods for Nonconvex Optimization
Adaptive gradient methods are workhorses in deep learning. However, the convergence guarantees of adaptive gradient methods for nonconvex optimization have not been thoroughly studied. In this paper, we provide a fine-grained convergence analysis for a general class of adaptive gradient methods including AMSGrad, RMSProp and AdaGrad. For smooth nonconvex functions, we prove that adaptive gradient methods in expectation converge to a first-order stationary point. Our convergence rate is better than existing results for adaptive gradient methods in terms of dimension. In addition, we also prove high probability bounds on the convergence rates of AMSGrad, RMSProp as well as AdaGrad, which have not been established before. Our analyses shed light on better understanding the mechanism behind adaptive gradient methods in optimizing nonconvex objectives.
comment: 25 pages, 2 tables. Published in Transactions on Machine Learning Research (TMLR)
Systems and Control 27
☆ Spatio-temporal Patterns between ENSO and Weather-related Power Outages in the Continental United States
El Ni\~no-Southern Oscillation (ENSO) exhibits significant impacts on the frequency of extreme weather events and its socio-economic implications prevail on a global scale. However, a fundamental gap still exists in understanding the relationship between the ENSO and weather-related power outages in the continental United States. Through 24-year (2000-2023) composite and statistical analysis, our study reveals that higher power outage numbers (PONs) are observed from the developing winter to the decaying summer of La Ni\~na phases. In particular, during the decaying spring, high La Ni\~na intensity favors the occurrences of power outage over the west coast and east of the United States, by modulating the frequency of extreme precipitations and heatwaves. Furthermore, projected increasing heatwaves from the Coupled Model Intercomparison Project Phase 6 (CMIP6) indicate that spring-time PONs over the eastern United States occur about 11 times higher for the mid-term future (2041-2060) and almost 26 times higher for the long-term future (2081-2100), compared with 2000-2023. Our study provides a strong recommendation for building a more climate-resilient power system.
☆ Adaptive Deep Neural Network-Based Control Barrier Functions
Safety constraints of nonlinear control systems are commonly enforced through the use of control barrier functions (CBFs). Uncertainties in the dynamic model can disrupt forward invariance guarantees or cause the state to be restricted to an overly conservative subset of the safe set. In this paper, adaptive deep neural networks (DNNs) are combined with CBFs to produce a family of controllers that ensure safety while learning the system's dynamics in real-time without the requirement for pre-training. By basing the least squares adaptation law on a state derivative estimator-based identification error, the DNN parameter estimation error is shown to be uniformly ultimately bounded. The convergent bound on the parameter estimation error is then used to formulate CBF-constraints in an optimization-based controller to guarantee safety despite model uncertainty. Furthermore, the developed method is applicable for use under intermittent loss of state-feedback. Comparative simulation results demonstrate the ability of the developed method to ensure safety in an adaptive cruise control problem and when feedback is lost, unlike baseline methods.
comment: 7 pages, 2 figures, 28 references
☆ Ring-LWE based encrypted controller with unlimited number of recursive multiplications and effect of error growth
In this paper, we propose a method to encrypt linear dynamic controllers that enables an unlimited number of recursive homomorphic multiplications on a Ring Learning With Errors (Ring-LWE) based cryptosystem without bootstrapping. Unlike LWE based schemes, where a scalar error is injected during encryption for security, Ring-LWE based schemes are based on polynomial rings and inject error as a polynomial having multiple error coefficients. Such errors accumulate under recursive homomorphic operations, and it has been studied that their effect can be suppressed by the closed-loop stability when dynamic controllers are encrypted using LWE based schemes. We show that this also holds for the proposed controller encrypted using a Ring-LWE based scheme. Specifically, only the constant terms of the error polynomials affect the control performance, and their effect can be arbitrarily bounded even when the noneffective terms diverge. Furthermore, a novel packing algorithm is applied, resulting in reduced computation time and enhanced memory efficiency. Simulation results demonstrate the effectiveness of the proposed method.
comment: 12 pages, 3 figures
☆ Robustness Analysis of AI Models in Critical Energy Systems
This paper analyzes the robustness of state-of-the-art AI-based models for power grid operations under the $N-1$ security criterion. While these models perform well in regular grid settings, our results highlight a significant loss in accuracy following the disconnection of a line.%under this security criterion. Using graph theory-based analysis, we demonstrate the impact of node connectivity on this loss. Our findings emphasize the need for practical scenario considerations in developing AI methodologies for critical infrastructure.
☆ Adaptive Robust Controller for handling Unknown Uncertainty of Robotic Manipulators
The ability to achieve precise and smooth trajectory tracking is crucial for ensuring the successful execution of various tasks involving robotic manipulators. State-of-the-art techniques require accurate mathematical models of the robot dynamics, and robustness to model uncertainties is achieved by relying on precise bounds on the model mismatch. In this paper, we propose a novel adaptive robust feedback linearization scheme able to compensate for model uncertainties without any a-priori knowledge on them, and we provide a theoretical proof of convergence under mild assumptions. We evaluate the method on a simulated RR robot. First, we consider a nominal model with known model mismatch, which allows us to compare our strategy with state-of-the-art uncertainty-aware methods. Second, we implement the proposed control law in combination with a learned model, for which uncertainty bounds are not available. Results show that our method leads to performance comparable to uncertainty-aware methods while requiring less prior knowledge.
☆ Resource Optimization for Tail-Based Control in Wireless Networked Control Systems
Achieving control stability is one of the key design challenges of scalable Wireless Networked Control Systems (WNCS) under limited communication and computing resources. This paper explores the use of an alternative control concept defined as tail-based control, which extends the classical Linear Quadratic Regulator (LQR) cost function for multiple dynamic control systems over a shared wireless network. We cast the control of multiple control systems as a network-wide optimization problem and decouple it in terms of sensor scheduling, plant state prediction, and control policies. Toward this, we propose a solution consisting of a scheduling algorithm based on Lyapunov optimization for sensing, a mechanism based on Gaussian Process Regression (GPR) for state prediction and uncertainty estimation, and a control policy based on Reinforcement Learning (RL) to ensure tail-based control stability. A set of discrete time-invariant mountain car control systems is used to evaluate the proposed solution and is compared against four variants that use state-of-the-art scheduling, prediction, and control methods. The experimental results indicate that the proposed method yields 22% reduction in overall cost in terms of communication and control resource utilization compared to state-of-the-art methods.
comment: Accepted in PIMRC 2024 conference, 6 pages, 5 figures
☆ Enhanced Optimal Power Flow Based Droop Control in MMC-MTDC Systems
Optimizing operational set points for modular multilevel converters (MMCs) in Multi-Terminal Direct Current (MTDC) transmission systems is crucial for ensuring efficient power distribution and control. This paper presents an enhanced Optimal Power Flow (OPF) model for MMC-MTDC systems, integrating a novel adaptive voltage droop control strategy. The strategy aims to minimize generation costs and DC voltage deviations while ensuring the stable operation of the MTDC grid by dynamically adjusting the system operation points. The modified Nordic 32 test system with an embedded 4-terminal DC grid is modeled in Julia and the proposed control strategy is applied to the power model. The results demonstrate the feasibility and effectiveness of the proposed droop control strategy, affirming its potential value in enhancing the performance and reliability of hybrid AC-DC power systems.
☆ Online Learning of Weakly Coupled MDP Policies for Load Balancing and Auto Scaling
Load balancing and auto scaling are at the core of scalable, contemporary systems, addressing dynamic resource allocation and service rate adjustments in response to workload changes. This paper introduces a novel model and algorithms for tuning load balancers coupled with auto scalers, considering bursty traffic arriving at finite queues. We begin by presenting the problem as a weakly coupled Markov Decision Processes (MDP), solvable via a linear program (LP). However, as the number of control variables of such LP grows combinatorially, we introduce a more tractable relaxed LP formulation, and extend it to tackle the problem of online parameter learning and policy optimization using a two-timescale algorithm based on the LP Lagrangian.
☆ Reliable State Estimation in a Truck-Semitrailer Combination using an Artificial Neural Network-Aided Extended Kalman Filter
Advanced driver assistance systems are critically dependent on reliable and accurate information regarding a vehicles' driving state. For estimation of unknown quantities, model-based and learning-based methods exist, but both suffer from individual limitations. On the one hand, model-based estimation performance is often limited by the models' accuracy. On the other hand, learning-based estimators usually do not perform well in "unknown" conditions (bad generalization), which is particularly critical for semitrailers as their payload changes significantly in operation. To the best of the authors' knowledge, this work is the first to analyze the capability of state-of-the-art estimators for semitrailers to generalize across "unknown" loading states. Moreover, a novel hybrid Extended Kalman Filter (H-EKF) that takes advantage of accurate Artificial Neural Network (ANN) estimates while preserving reliable generalization capability is presented. It estimates the articulation angle between truck and semitrailer, lateral tire forces and the truck steering angle utilizing sensor data of a standard semitrailer only. An experimental comparison based on a full-scale truck-semitrailer combination indicates the superiority of the H-EKF compared to a state-of-the-art extended Kalman filter and an ANN estimator.
comment: 8 pages, 3 figures, accepted for publication at 2024 European Control Conference (ECC)
☆ Robust nonlinear state-feedback control of second-order systems
This note proposes a novel nonlinear state feedback controller for perturbed second-order systems. In analogy to a linear proportional-derivative (PD) output feedback control, the proposed nonlinear scheme uses the output state of interest and its time derivative for a robust finite-time regulation. The control has only one free design parameter, and the closed-loop system is shown to be uniformly asymptotically stable in the presence of matched disturbances. We derive a strict Lyapunov function for the closed control loop with a bounded exogenous perturbation, and use it for both the control tuning and analysis of the finite-time convergence. Apart from the numerical results, a revealing experimental example is also shown in favor of the proposed control and in comparison with PD and sub-optimal nonlinear damping regulators.
comment: 6 pages, 6 figures
☆ Robust Cooperative Multi-Agent Reinforcement Learning:A Mean-Field Type Game Perspective
In this paper, we study the problem of robust cooperative multi-agent reinforcement learning (RL) where a large number of cooperative agents with distributed information aim to learn policies in the presence of \emph{stochastic} and \emph{non-stochastic} uncertainties whose distributions are respectively known and unknown. Focusing on policy optimization that accounts for both types of uncertainties, we formulate the problem in a worst-case (minimax) framework, which is is intractable in general. Thus, we focus on the Linear Quadratic setting to derive benchmark solutions. First, since no standard theory exists for this problem due to the distributed information structure, we utilize the Mean-Field Type Game (MFTG) paradigm to establish guarantees on the solution quality in the sense of achieved Nash equilibrium of the MFTG. This in turn allows us to compare the performance against the corresponding original robust multi-agent control problem. Then, we propose a Receding-horizon Gradient Descent Ascent RL algorithm to find the MFTG Nash equilibrium and we prove a non-asymptotic rate of convergence. Finally, we provide numerical experiments to demonstrate the efficacy of our approach relative to a baseline algorithm.
comment: Accepted for publication in L4DC 2024
☆ Design and Control of a Low-cost Non-backdrivable End-effector Upper Limb Rehabilitation Device
This paper presents the development of an upper limb end-effector based rehabilitation device for stroke patients, offering assistance or resistance along any 2-dimensional trajectory during physical therapy. It employs a non-backdrivable ball-screw-driven mechanism for enhanced control accuracy. The control system features three novel algorithms: First, the Implicit Euler velocity control algorithm (IEVC) highlighted for its state-of-the-art accuracy, stability, efficiency and generalizability in motion restriction control. Second, an Admittance Virtual Dynamics simulation algorithm that achieves a smooth and natural human interaction with the non-backdrivable end-effector. Third, a generalized impedance force calculation algorithm allowing efficient impedance control on any trajectory or area boundary. Experimental validation demonstrated the system's effectiveness in accurate end-effector position control across various trajectories and configurations. The proposed upper limb end-effector-based rehabilitation device, with its high performance and adaptability, holds significant promise for extensive clinical application, potentially improving rehabilitation outcomes for stroke patients.
comment: 12 pages, 15 figures
☆ Optimal estimation in spatially distributed systems: how far to share measurements from?
We consider the centralized optimal estimation problem in spatially distributed systems. We use the setting of spatially invariant systems as an idealization for which concrete and detailed results are given. Such estimators are known to have a degree of spatial localization in the sense that the estimator gains decay in space, with the spatial decay rates serving as a proxy for how far measurements need to be shared in an optimal distributed estimator. In particular, we examine the dependence of spatial decay rates on problem specifications such as system dynamics, measurement and process noise variances, as well as their spatial autocorrelations. We propose non-dimensional parameters that characterize the decay rates as a function of problem specifications. In particular, we find an interesting matching condition between the characteristic lengthscale of the dynamics and the measurement noise correlation lengthscale for which the optimal centralized estimator is completely decentralized. A new technique - termed the branch point locus - is introduced to quantify spatial decay rates in terms of analyticity regions in the complex spatial frequency plane. Our results are illustrated through two case studies of systems with dynamics modeled by diffusion and the Swift-Hohenberg equation, respectively.
☆ Diffusion-Based Failure Sampling for Cyber-Physical Systems
Validating safety-critical autonomous systems in high-dimensional domains such as robotics presents a significant challenge. Existing black-box approaches based on Markov chain Monte Carlo may require an enormous number of samples, while methods based on importance sampling often rely on simple parametric families that may struggle to represent the distribution over failures. We propose to sample the distribution over failures using a conditional denoising diffusion model, which has shown success in complex high-dimensional problems such as robotic task planning. We iteratively train a diffusion model to produce state trajectories closer to failure. We demonstrate the effectiveness of our approach on high-dimensional robotic validation tasks, improving sample efficiency and mode coverage compared to existing black-box techniques.
comment: Under review at RA-L
☆ Reachability and Controllability Analysis of the State Covariance for Linear Stochastic Systems
This paper studies the set of terminal state covariances that are reachable over a finite time horizon from a given initial state covariance for a linear stochastic system with additive noise. For discrete-time systems, a complete characterization of the set of reachable state covariances is given. For continuous-time systems, we present an upper bound on the set of reachable state covariances. Moreover, for both linear discrete-time and continuous-time systems, necessary and sufficient conditions are provided for the controllability of the state covariance over a finite horizon.
comment: 15 pages, 1 figure
☆ Utilization of G-Programming Language for Educational Control Application: Case Study of Magnetic Levitation of Elastic Beam
This paper presents the practical employment of G-Programming tools to demonstrate, design, and implement traditional control algorithms upon magnetic levitation system. The complexity of controlling this type of fast dynamic and sensitive system is vital for highlighting the capabilities of LabVIEW G-programming in control education. PID and Lead-Lag controllers are designed and implemented within the LabVIEW environment, with the ability to tune and optimize the controllers utilizing the Virtual Instruments (VIs) of the control design and simulation toolkit. The paper enables the reader to understand the modelling, testing the control action, and dynamic simulation of the system. Then, deploying the control law in real time. It can be concluded that the G programming shows a suitable and easy tool for facilitating hands-on, experiential learning and validation in control systems engineering.
☆ Physically Analyzable AI-Based Nonlinear Platoon Dynamics Modeling During Traffic Oscillation: A Koopman Approach
Given the complexity and nonlinearity inherent in traffic dynamics within vehicular platoons, there exists a critical need for a modeling methodology with high accuracy while concurrently achieving physical analyzability. Currently, there are two predominant approaches: the physics model-based approach and the Artificial Intelligence (AI)--based approach. Knowing the facts that the physical-based model usually lacks sufficient modeling accuracy and potential function mismatches and the pure-AI-based method lacks analyzability, this paper innovatively proposes an AI-based Koopman approach to model the unknown nonlinear platoon dynamics harnessing the power of AI and simultaneously maintain physical analyzability, with a particular focus on periods of traffic oscillation. Specifically, this research first employs a deep learning framework to generate the embedding function that lifts the original space into the embedding space. Given the embedding space descriptiveness, the platoon dynamics can be expressed as a linear dynamical system founded by the Koopman theory. Based on that, the routine of linear dynamical system analysis can be conducted on the learned traffic linear dynamics in the embedding space. By that, the physical interpretability and analyzability of model-based methods with the heightened precision inherent in data-driven approaches can be synergized. Comparative experiments have been conducted with existing modeling approaches, which suggests our method's superiority in accuracy. Additionally, a phase plane analysis is performed, further evidencing our approach's effectiveness in replicating the complex dynamic patterns. Moreover, the proposed methodology is proven to feature the capability of analyzing the stability, attesting to the physical analyzability.
☆ Singular knee identification to support emergence recognition in physical swarm and cellular automata trajectories
After decades of attention, emergence continues to lack a centralized mathematical definition that leads to a rigorous emergence test applicable to physical flocks and swarms, particularly those containing both deterministic elements (eg, interactions) and stochastic perturbations like measurement noise. This study develops a heuristic test based on singular value curve analysis of data matrices containing deterministic and Gaussian noise signals. The minimum detection criteria are identified, and statistical and matrix space analysis developed to determine upper and lower bounds. This study applies the analysis to representative examples by using recorded trajectories of mixed deterministic and stochastic trajectories for multi-agent, cellular automata, and biological video. Examples include Cucker Smale and Vicsek flocking, Gaussian noise and its integration, recorded observations of bird flocking, and 1D cellular automata. Ensemble simulations including measurement noise are performed to compute statistical variation and discussed relative to random matrix theory noise bounds. The results indicate singular knee analysis of recorded trajectories can detect gradated levels on a continuum of structure and noise. Across the eight singular value decay metrics considered, the angle subtended at the singular value knee emerges with the most potential for supporting cross-embodiment emergence detection, the size of noise bounds is used as an indication of required sample size, and the presence of a large fraction of singular values inside noise bounds as an indication of noise.
comment: 14 pages, 11 figures; includes 2 supplementary pages
☆ Age of Information Versions: a Semantic View of Markov Source Monitoring
We consider the problem of real-time remote monitoring of a two-state Markov process, where a sensor observes the state of the source and makes a decision on whether to transmit the status updates over an unreliable channel or not. We introduce a modified randomized stationary sampling and transmission policy where the decision to perform sampling occurs probabilistically depending on the current state of the source and whether the system was in a sync state during the previous time slot or not. We then propose two new performance metrics, coined the Version Innovation Age (VIA) and the Age of Incorrect Version (AoIV) and analyze their performance under the modified randomized stationary and other state-of-the-art sampling and transmission policies. Specifically, we derive closed-form expressions for the distribution and the average of VIA, AoIV, and Age of Incorrect Information (AoII) under these policies. Furthermore, we formulate and solve three constrained optimization problems. The first optimization problem aims to minimize the average VIA subject to constraints on the time-averaged sampling cost and time-averaged reconstruction error. In the second and third problems, the objective is to minimize the average AoIV and AoII, respectively, while considering a constraint on the time-averaged sampling cost. Finally, we compare the performance of various sampling and transmission policies and identify the conditions under which each policy outperforms the others in optimizing the proposed metrics.
comment: arXiv admin note: substantial text overlap with arXiv:2401.17691
♻ ☆ Using power system modelling outputs to identify weather-induced extreme events in highly renewable systems
In highly renewable power systems the increased weather dependence can result in new resilience challenges, such as renewable energy droughts, or a lack of sufficient renewable generation at times of high demand. The weather conditions responsible for these challenges have been well-studied in the literature. However, in reality multi-day resilience challenges are triggered by complex interactions between high demand, low renewable availability, electricity transmission constraints and storage dynamics. We show these challenges cannot be rigorously understood from an exclusively power systems, or meteorological, perspective. We propose a new method that uses electricity shadow prices - obtained by a European power system model based on 40 years of reanalysis data - to identify the most difficult periods driving system investments. Such difficult periods are driven by large-scale weather conditions such as low wind and cold temperature periods of various lengths associated with stationary high pressure over Europe. However, purely meteorological approaches fail to identify which events lead to the largest system stress over the multi-decadal study period due to the influence of subtle transmission bottlenecks and storage issues across multiple regions. These extreme events also do not relate strongly to traditional weather patterns (such as Euro-Atlantic weather regimes or the North Atlantic Oscillation index). We therefore compile a new set of weather patterns to define energy system stress events which include the impacts of electricity storage and large-scale interconnection. Without interdisciplinary studies combining state-of-the-art energy meteorology and modelling, further strive for adequate renewable power systems will be hampered.
♻ ☆ Safe Connectivity Maintenance of Underactuated Multi-Agent Networks in Dynamic Oceanic Environments
Autonomous multi-agent systems are increasingly being deployed in environments where winds and ocean currents have a significant influence. Recent work has developed control policies for single agents that leverage flows to achieve their objectives in dynamic environments. However, in multi-agent systems, these flows can cause agents to collide or drift apart and lose direct inter-agent communications, especially when agents have low propulsion capabilities. To address these challenges, we propose a hierarchical multi-agent control approach that allows arbitrary single-agent performance policies that are unaware of other agents to be used in multi-agent systems while ensuring safe operation. We first develop a safety controller using potential functions, solely dedicated to avoiding collisions and maintaining inter-agent communication. Next, we design a low-interference safe interaction (LISIC) policy that trades off the performance policy and the safety control to ensure safe and performant operation. Specifically, when the agents are at an appropriate distance, LISIC prioritizes the performance policy while smoothly increasing the safety controller when necessary. We prove that under mild assumptions on the flows experienced by the agents, our approach can guarantee safety. Additionally, we demonstrate the effectiveness of our method in realistic settings through an extensive empirical analysis with simulations of fleets of underactuated autonomous surface vehicles operating in dynamic ocean currents where these assumptions do not always hold.
comment: 8 pages, Published at European Control Conference 2024 (ECC 2024) Nicolas Hoischen and Marius Wiggert contributed equally to this work
♻ ☆ The Loewner framework for parametric systems: Taming the curse of dimensionality
The Loewner framework is an interpolatory approach designed for approximating linear and nonlinear systems. The goal here is to extend this framework to linear parametric systems with an arbitrary number n of parameters. One main innovation established here is the construction of data-based realizations for any number of parameters. Equally importantly, we show how to alleviate the computational burden, by avoiding the explicit construction of large-scale n-dimensional Loewner matrices of size $N \times N$. This reduces the complexity from $O(N^3)$ to about $O(N^{1.4})$, thus taming the curse of dimensionality and making the solution scalable to very large data sets. To achieve this, a new generalized multivariate rational function realization is defined. Then, we introduce the n-dimensional multivariate Loewner matrices and show that they can be computed by solving a coupled set of Sylvester equations. The null space of these Loewner matrices then allows the construction of the multivariate barycentric transfer function. The principal result of this work is to show how the null space of the n-dimensional Loewner matrix can be computed using a sequence of 1-dimensional Loewner matrices, leading to a drastic computational burden reduction. Finally, we suggest two algorithms (one direct and one iterative) to construct, directly from data, multivariate (or parametric) realizations ensuring (approximate) interpolation. Numerical examples highlight the effectiveness and scalability of the method.
comment: 32 pages, 4 figures
♻ ☆ Log-Scale Quantization in Distributed First-Order Methods: Gradient-based Learning from Distributed Data
Decentralized strategies are of interest for learning from large-scale data over networks. This paper studies learning over a network of geographically distributed nodes/agents subject to quantization. Each node possesses a private local cost function, collectively contributing to a global cost function, which the proposed methodology aims to minimize. In contrast to many existing literature, the information exchange among nodes is quantized. We adopt a first-order computationally-efficient distributed optimization algorithm (with no extra inner consensus loop) that leverages node-level gradient correction based on local data and network-level gradient aggregation only over nearby nodes. This method only requires balanced networks with no need for stochastic weight design. It can handle log-scale quantized data exchange over possibly time-varying and switching network setups. We analyze convergence over both structured networks (for example, training over data-centers) and ad-hoc multi-agent networks (for example, training over dynamic robotic networks). Through analysis and experimental validation, we show that (i) structured networks generally result in a smaller optimality gap, and (ii) logarithmic quantization leads to smaller optimality gap compared to uniform quantization.
♻ ☆ Joint Mechanical and Electrical Adjustment of IRS-aided LEO Satellite MIMO Communications
In this letter, we propose a joint mechanical and electrical adjustment of intelligent reflecting surface (IRS) for the performance improvements of low-earth orbit (LEO) satellite multiple-input multiple-output (MIMO) communications. In particular, we construct a three-dimensional (3D) MIMO channel model for the mechanically-tilted IRS in general deployment, and consider two types of scenarios with and without the direct path of LEO-ground user link due to the orbital flight. With the aim of maximizing the end-to-end performance, we jointly optimize tilting angle and phase shift of IRS along with the transceiver beamforming, whose performance superiority is verified via simulations with the Orbcomm LEO satellite using a real orbit data.
comment: 5 pages, 6 figures
♻ ☆ Agent Coordination via Contextual Regression (AgentCONCUR) for Data Center Flexibility
A network of spatially distributed data centers can provide operational flexibility to power systems by shifting computing tasks among electrically remote locations. However, harnessing this flexibility in real-time through the standard optimization techniques is challenged by the need for sensitive operational datasets and substantial computational resources. To alleviate the data and computational requirements, this paper introduces a coordination mechanism based on contextual regression. This mechanism, abbreviated as AgentCONCUR, associates cost-optimal task shifts with public and trusted contextual data (e.g., real-time prices) and uses regression on this data as a coordination policy. Notably, regression-based coordination does not learn the optimal coordination actions from a labeled dataset. Instead, it exploits the optimization structure of the coordination problem to ensure feasible and cost-effective actions. A NYISO-based study reveals large coordination gains and the optimal features for the successful regression-based coordination.
♻ ☆ Exploring the Capabilities and Limitations of Large Language Models in the Electric Energy Sector
Large Language Models (LLMs) as chatbots have drawn remarkable attention thanks to their versatile capability in natural language processing as well as in a wide range of tasks. While there has been great enthusiasm towards adopting such foundational model-based artificial intelligence tools in all sectors possible, the capabilities and limitations of such LLMs in improving the operation of the electric energy sector need to be explored, and this article identifies fruitful directions in this regard. Key future research directions include data collection systems for fine-tuning LLMs, embedding power system-specific tools in the LLMs, and retrieval augmented generation (RAG)-based knowledge pool to improve the quality of LLM responses and LLMs in safety-critical use cases.
♻ ☆ Thin Film Reconfigurable Intelligent Surface for Harmonic Beam Steering
This letter explores an implementation of a novel thin film 1-by-4 reconfigurable intelligent surface (RIS) designed for future communication and sensing scenarios. Utilizing cost-effective inkjet printing methods and additive manufacturing, our approach significantly simplifies the RIS construction process and reduces production costs. The RIS, fabricated on a flexible and lightweight polyethylene terephthalate (PET) substrate, integrates antennas, switching circuitry, and a microcontroller unit (MCU), without a ground shield. This setup enables individual and simultaneous control of each RIS element, manipulating the captured carrier signal by reflecting and refracting its dominant harmonics. Beams of the harmonics can be steered to multiple desired directions at both front and back sides of the surface. Measurement results of the beam steering show that the RIS has the potential to enable RIS-aided communication and sensing applications.
comment: 5 pages, 4 figures, IEEE letter
Multiagent Systems 10
☆ Vahana.jl -- A framework (not only) for large-scale agent-based models
Agent-based models (ABMs) offer a powerful framework for understanding complex systems. However, their computational demands often become a significant barrier as the number of agents and complexity of the simulation increase. Traditional ABM platforms often struggle to fully exploit modern computing resources, hindering the development of large-scale simulations. This paper presents Vahana.jl, a high performance computing open source framework that aims to address these limitations. Building on the formalism of synchronous graph dynamical systems, Vahana.jl is especially well suited for models with a focus on (social) networks. The framework seamlessly supports distribution across multiple compute nodes, enabling simulations that would otherwise be beyond the capabilities of a single machine. Implemented in Julia, Vahana.jl leverages the interactive Read-Eval-Print Loop (REPL) environment, facilitating rapid model development and experimentation.
☆ Artificial Leviathan: Exploring Social Evolution of LLM Agents Through the Lens of Hobbesian Social Contract Theory
The emergence of Large Language Models (LLMs) and advancements in Artificial Intelligence (AI) offer an opportunity for computational social science research at scale. Building upon prior explorations of LLM agent design, our work introduces a simulated agent society where complex social relationships dynamically form and evolve over time. Agents are imbued with psychological drives and placed in a sandbox survival environment. We conduct an evaluation of the agent society through the lens of Thomas Hobbes's seminal Social Contract Theory (SCT). We analyze whether, as the theory postulates, agents seek to escape a brutish "state of nature" by surrendering rights to an absolute sovereign in exchange for order and security. Our experiments unveil an alignment: Initially, agents engage in unrestrained conflict, mirroring Hobbes's depiction of the state of nature. However, as the simulation progresses, social contracts emerge, leading to the authorization of an absolute sovereign and the establishment of a peaceful commonwealth founded on mutual cooperation. This congruence between our LLM agent society's evolutionary trajectory and Hobbes's theoretical account indicates LLMs' capability to model intricate social dynamics and potentially replicate forces that shape human societies. By enabling such insights into group behavior and emergent societal phenomena, LLM-driven multi-agent simulations, while unable to simulate all the nuances of human behavior, may hold potential for advancing our understanding of social structures, group dynamics, and complex human systems.
☆ Tractable Equilibrium Computation in Markov Games through Risk Aversion
A significant roadblock to the development of principled multi-agent reinforcement learning is the fact that desired solution concepts like Nash equilibria may be intractable to compute. To overcome this obstacle, we take inspiration from behavioral economics and show that -- by imbuing agents with important features of human decision-making like risk aversion and bounded rationality -- a class of risk-averse quantal response equilibria (RQE) become tractable to compute in all $n$-player matrix and finite-horizon Markov games. In particular, we show that they emerge as the endpoint of no-regret learning in suitably adjusted versions of the games. Crucially, the class of computationally tractable RQE is independent of the underlying game structure and only depends on agents' degree of risk-aversion and bounded rationality. To validate the richness of this class of solution concepts we show that it captures peoples' patterns of play in a number of 2-player matrix games previously studied in experimental economics. Furthermore, we give a first analysis of the sample complexity of computing these equilibria in finite-horizon Markov games when one has access to a generative model and validate our findings on a simple multi-agent reinforcement learning benchmark.
comment: preprint of multi-agent RL with risk-averse equilibria
☆ Robust Cooperative Multi-Agent Reinforcement Learning:A Mean-Field Type Game Perspective
In this paper, we study the problem of robust cooperative multi-agent reinforcement learning (RL) where a large number of cooperative agents with distributed information aim to learn policies in the presence of \emph{stochastic} and \emph{non-stochastic} uncertainties whose distributions are respectively known and unknown. Focusing on policy optimization that accounts for both types of uncertainties, we formulate the problem in a worst-case (minimax) framework, which is is intractable in general. Thus, we focus on the Linear Quadratic setting to derive benchmark solutions. First, since no standard theory exists for this problem due to the distributed information structure, we utilize the Mean-Field Type Game (MFTG) paradigm to establish guarantees on the solution quality in the sense of achieved Nash equilibrium of the MFTG. This in turn allows us to compare the performance against the corresponding original robust multi-agent control problem. Then, we propose a Receding-horizon Gradient Descent Ascent RL algorithm to find the MFTG Nash equilibrium and we prove a non-asymptotic rate of convergence. Finally, we provide numerical experiments to demonstrate the efficacy of our approach relative to a baseline algorithm.
comment: Accepted for publication in L4DC 2024
☆ Resource Allocation with Karma Mechanisms
Monetary markets serve as established resource allocation mechanisms, typically achieving efficient solutions with limited information. However, they are susceptible to market failures, particularly under the presence of public goods, externalities, or inequality of economic power. Moreover, in many resource allocating contexts, money faces social, ethical, and legal constraints. Consequently, research increasingly explores artificial currencies and non-monetary markets, with Karma emerging as a notable concept. Karma, a non-tradeable, resource-inherent currency for prosumer resources, operates on the principles of contribution and consumption of specific resources. It embodies fairness, near incentive compatibility, Pareto-efficiency, robustness to population heterogeneity, and can incentivize a reduction in resource scarcity. The literature on Karma is scattered across disciplines, varies in scope, and lacks of conceptual clarity and coherence. Thus, this study undertakes a comprehensive review of the Karma mechanism, systematically comparing its resource allocation applications and elucidating overlooked mechanism design elements. Through a systematic mapping study, this review situates Karma within its literature context, offers a structured design parameter framework, and develops a road-map for future research directions.
☆ MultiAgent Collaboration Attack: Investigating Adversarial Attacks in Large Language Model Collaborations via Debate
Large Language Models (LLMs) have shown exceptional results on current benchmarks when working individually. The advancement in their capabilities, along with a reduction in parameter size and inference times, has facilitated the use of these models as agents, enabling interactions among multiple models to execute complex tasks. Such collaborations offer several advantages, including the use of specialized models (e.g. coding), improved confidence through multiple computations, and enhanced divergent thinking, leading to more diverse outputs. Thus, the collaborative use of language models is expected to grow significantly in the coming years. In this work, we evaluate the behavior of a network of models collaborating through debate under the influence of an adversary. We introduce pertinent metrics to assess the adversary's effectiveness, focusing on system accuracy and model agreement. Our findings highlight the importance of a model's persuasive ability in influencing others. Additionally, we explore inference-time methods to generate more compelling arguments and evaluate the potential of prompt-based mitigation as a defensive strategy.
☆ Generating geographically and economically realistic large-scale synthetic contact networks: A general method using publicly available data
Synthetic contact networks are useful for modeling epidemic spread and social transmission, but data to infer realistic contact patterns that take account of assortative connections at the geographic and economic levels is limited. We developed a method to generate synthetic contact networks for any region of the United States based on publicly available data. First, we generate a synthetic population of individuals within households from US census data using combinatorial optimization. Then, individuals are assigned to workplaces and schools using commute data, employment statistics, and school enrollment data. The resulting population is then connected into a realistic contact network using graph generation algorithms. We test the method on two census regions and show that the synthetic populations accurately reflect the source data. We further show that the contact networks have distinct properties compared to networks generated without a synthetic population, and that those differences affect the rate of disease transmission in an epidemiological simulation. We provide open-source software to generate a synthetic population and contact network for any area within the US.
comment: 12 pages,2 figures, 4 tables
☆ Singular knee identification to support emergence recognition in physical swarm and cellular automata trajectories
After decades of attention, emergence continues to lack a centralized mathematical definition that leads to a rigorous emergence test applicable to physical flocks and swarms, particularly those containing both deterministic elements (eg, interactions) and stochastic perturbations like measurement noise. This study develops a heuristic test based on singular value curve analysis of data matrices containing deterministic and Gaussian noise signals. The minimum detection criteria are identified, and statistical and matrix space analysis developed to determine upper and lower bounds. This study applies the analysis to representative examples by using recorded trajectories of mixed deterministic and stochastic trajectories for multi-agent, cellular automata, and biological video. Examples include Cucker Smale and Vicsek flocking, Gaussian noise and its integration, recorded observations of bird flocking, and 1D cellular automata. Ensemble simulations including measurement noise are performed to compute statistical variation and discussed relative to random matrix theory noise bounds. The results indicate singular knee analysis of recorded trajectories can detect gradated levels on a continuum of structure and noise. Across the eight singular value decay metrics considered, the angle subtended at the singular value knee emerges with the most potential for supporting cross-embodiment emergence detection, the size of noise bounds is used as an indication of required sample size, and the presence of a large fraction of singular values inside noise bounds as an indication of noise.
comment: 14 pages, 11 figures; includes 2 supplementary pages
♻ ☆ Fusion-PSRO: Nash Policy Fusion for Policy Space Response Oracles
A popular approach for solving zero-sum games is to maintain populations of policies to approximate the Nash Equilibrium (NE). Previous studies have shown that Policy Space Response Oracle (PSRO) algorithm is an effective multi-agent reinforcement learning framework for solving such games. However, repeatedly training new policies from scratch to approximate Best Response (BR) to opponents' mixed policies at each iteration is both inefficient and costly. While some PSRO variants initialize a new policy by inheriting from past BR policies, this approach limits the exploration of new policies, especially against challenging opponents. To address this issue, we propose Fusion-PSRO, which employs policy fusion to initialize policies for better approximation to BR. By selecting high-quality base policies from meta-NE, policy fusion fuses the base policies into a new policy through model averaging. This approach allows the initialized policies to incorporate multiple expert policies, making it easier to handle difficult opponents compared to inheriting from past BR policies or initializing from scratch. Moreover, our method only modifies the policy initialization phase, allowing its application to nearly all PSRO variants without additional training overhead. Our experiments on non-transitive matrix games, Leduc Poker, and the more complex Liars Dice demonstrate that Fusion-PSRO enhances the performance of nearly all PSRO variants, achieving lower exploitability.
comment: 20 pages, 5 figures
♻ ☆ ArgMed-Agents: Explainable Clinical Decision Reasoning with LLM Disscusion via Argumentation Schemes
There are two main barriers to using large language models (LLMs) in clinical reasoning. Firstly, while LLMs exhibit significant promise in Natural Language Processing (NLP) tasks, their performance in complex reasoning and planning falls short of expectations. Secondly, LLMs use uninterpretable methods to make clinical decisions that are fundamentally different from the clinician's cognitive processes. This leads to user distrust. In this paper, we present a multi-agent framework called ArgMed-Agents, which aims to enable LLM-based agents to make explainable clinical decision reasoning through interaction. ArgMed-Agents performs self-argumentation iterations via Argumentation Scheme for Clinical Discussion (a reasoning mechanism for modeling cognitive processes in clinical reasoning), and then constructs the argumentation process as a directed graph representing conflicting relationships. Ultimately, use symbolic solver to identify a series of rational and coherent arguments to support decision. We construct a formal model of ArgMed-Agents and present conjectures for theoretical guarantees. ArgMed-Agents enables LLMs to mimic the process of clinical argumentative reasoning by generating explanations of reasoning in a self-directed manner. The setup experiments show that ArgMed-Agents not only improves accuracy in complex clinical decision reasoning problems compared to other prompt methods, but more importantly, it provides users with decision explanations that increase their confidence.
Artificial Intelligence 163
☆ Model Merging and Safety Alignment: One Bad Model Spoils the Bunch
Merging Large Language Models (LLMs) is a cost-effective technique for combining multiple expert LLMs into a single versatile model, retaining the expertise of the original ones. However, current approaches often overlook the importance of safety alignment during merging, leading to highly misaligned models. This work investigates the effects of model merging on alignment. We evaluate several popular model merging techniques, demonstrating that existing methods do not only transfer domain expertise but also propagate misalignment. We propose a simple two-step approach to address this problem: (i) generating synthetic safety and domain-specific data, and (ii) incorporating these generated data into the optimization process of existing data-aware model merging techniques. This allows us to treat alignment as a skill that can be maximized in the resulting merged LLM. Our experiments illustrate the effectiveness of integrating alignment-related data during merging, resulting in models that excel in both domain expertise and alignment.
comment: Under review
☆ Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities
When presented with questions involving visual thinking, humans naturally switch reasoning modalities, often forming mental images or drawing visual aids. Large language models have shown promising results in arithmetic and symbolic reasoning by expressing intermediate reasoning in text as a chain of thought, yet struggle to extend this capability to answer text queries that are easily solved by visual reasoning, even with extensive multimodal pretraining. We introduce a simple method, whiteboard-of-thought prompting, to unlock the visual reasoning capabilities of multimodal large language models across modalities. Whiteboard-of-thought prompting provides multimodal large language models with a metaphorical `whiteboard' to draw out reasoning steps as images, then returns these images back to the model for further processing. We find this can be accomplished with no demonstrations or specialized modules, instead leveraging models' existing ability to write code with libraries such as Matplotlib and Turtle. This simple approach shows state-of-the-art results on four difficult natural language tasks that involve visual and spatial reasoning. We identify multiple settings where GPT-4o using chain-of-thought fails dramatically, including more than one where it achieves $0\%$ accuracy, while whiteboard-of-thought enables up to $92\%$ accuracy in these same settings. We present a detailed exploration of where the technique succeeds as well as its sources of error.
comment: Project website: whiteboard.cs.columbia.edu/
☆ CooHOI: Learning Cooperative Human-Object Interaction with Manipulated Object Dynamics
Recent years have seen significant advancements in humanoid control, largely due to the availability of large-scale motion capture data and the application of reinforcement learning methodologies. However, many real-world tasks, such as moving large and heavy furniture, require multi-character collaboration. Given the scarcity of data on multi-character collaboration and the efficiency challenges associated with multi-agent learning, these tasks cannot be straightforwardly addressed using training paradigms designed for single-agent scenarios. In this paper, we introduce Cooperative Human-Object Interaction (CooHOI), a novel framework that addresses multi-character objects transporting through a two-phase learning paradigm: individual skill acquisition and subsequent transfer. Initially, a single agent learns to perform tasks using the Adversarial Motion Priors (AMP) framework. Following this, the agent learns to collaborate with others by considering the shared dynamics of the manipulated object during parallel training using Multi Agent Proximal Policy Optimization (MAPPO). When one agent interacts with the object, resulting in specific object dynamics changes, the other agents learn to respond appropriately, thereby achieving implicit communication and coordination between teammates. Unlike previous approaches that relied on tracking-based methods for multi-character HOI, CooHOI is inherently efficient, does not depend on motion capture data of multi-character interactions, and can be seamlessly extended to include more participants and a wide range of object types
☆ GraphReader: Building Graph-based Agent to Enhance Long-Context Abilities of Large Language Models
Long-context capabilities are essential for large language models (LLMs) to tackle complex and long-input tasks. Despite numerous efforts made to optimize LLMs for long contexts, challenges persist in robustly processing long inputs. In this paper, we introduce GraphReader, a graph-based agent system designed to handle long texts by structuring them into a graph and employing an agent to explore this graph autonomously. Upon receiving a question, the agent first undertakes a step-by-step analysis and devises a rational plan. It then invokes a set of predefined functions to read node content and neighbors, facilitating a coarse-to-fine exploration of the graph. Throughout the exploration, the agent continuously records new insights and reflects on current circumstances to optimize the process until it has gathered sufficient information to generate an answer. Experimental results on the LV-Eval dataset reveal that GraphReader, using a 4k context window, consistently outperforms GPT-4-128k across context lengths from 16k to 256k by a large margin. Additionally, our approach demonstrates superior performance on four challenging single-hop and multi-hop benchmarks.
comment: The first four authors contributed equally, 27 pages
☆ Connecting the Dots: LLMs can Infer and Verbalize Latent Structure from Disparate Training Data
One way to address safety risks from large language models (LLMs) is to censor dangerous knowledge from their training data. While this removes the explicit information, implicit information can remain scattered across various training documents. Could an LLM infer the censored knowledge by piecing together these implicit hints? As a step towards answering this question, we study inductive out-of-context reasoning (OOCR), a type of generalization in which LLMs infer latent information from evidence distributed across training documents and apply it to downstream tasks without in-context learning. Using a suite of five tasks, we demonstrate that frontier LLMs can perform inductive OOCR. In one experiment we finetune an LLM on a corpus consisting only of distances between an unknown city and other known cities. Remarkably, without in-context examples or Chain of Thought, the LLM can verbalize that the unknown city is Paris and use this fact to answer downstream questions. Further experiments show that LLMs trained only on individual coin flip outcomes can verbalize whether the coin is biased, and those trained only on pairs $(x,f(x))$ can articulate a definition of $f$ and compute inverses. While OOCR succeeds in a range of cases, we also show that it is unreliable, particularly for smaller LLMs learning complex structures. Overall, the ability of LLMs to "connect the dots" without explicit in-context learning poses a potential obstacle to monitoring and controlling the knowledge acquired by LLMs.
☆ IRASim: Learning Interactive Real-Robot Action Simulators
Scalable robot learning in the real world is limited by the cost and safety issues of real robots. In addition, rolling out robot trajectories in the real world can be time-consuming and labor-intensive. In this paper, we propose to learn an interactive real-robot action simulator as an alternative. We introduce a novel method, IRASim, which leverages the power of generative models to generate extremely realistic videos of a robot arm that executes a given action trajectory, starting from an initial given frame. To validate the effectiveness of our method, we create a new benchmark, IRASim Benchmark, based on three real-robot datasets and perform extensive experiments on the benchmark. Results show that IRASim outperforms all the baseline methods and is more preferable in human evaluations. We hope that IRASim can serve as an effective and scalable approach to enhance robot learning in the real world. To promote research for generative real-robot action simulators, we open-source code, benchmark, and checkpoints at https: //gen-irasim.github.io.
comment: Opensource, project website: https://gen-irasim.github.io
☆ A Benchmarking Study of Kolmogorov-Arnold Networks on Tabular Data
Kolmogorov-Arnold Networks (KANs) have very recently been introduced into the world of machine learning, quickly capturing the attention of the entire community. However, KANs have mostly been tested for approximating complex functions or processing synthetic data, while a test on real-world tabular datasets is currently lacking. In this paper, we present a benchmarking study comparing KANs and Multi-Layer Perceptrons (MLPs) on tabular datasets. The study evaluates task performance and training times. From the results obtained on the various datasets, KANs demonstrate superior or comparable accuracy and F1 scores, excelling particularly in datasets with numerous instances, suggesting robust handling of complex data. We also highlight that this performance improvement of KANs comes with a higher computational cost when compared to MLPs of comparable sizes.
☆ DeciMamba: Exploring the Length Extrapolation Potential of Mamba
Long-range sequence processing poses a significant challenge for Transformers due to their quadratic complexity in input length. A promising alternative is Mamba, which demonstrates high performance and achieves Transformer-level capabilities while requiring substantially fewer computational resources. In this paper we explore the length-generalization capabilities of Mamba, which we find to be relatively limited. Through a series of visualizations and analyses we identify that the limitations arise from a restricted effective receptive field, dictated by the sequence length used during training. To address this constraint, we introduce DeciMamba, a context-extension method specifically designed for Mamba. This mechanism, built on top of a hidden filtering mechanism embedded within the S6 layer, enables the trained model to extrapolate well even without additional training. Empirical experiments over real-world long-range NLP tasks show that DeciMamba can extrapolate to context lengths that are 25x times longer than the ones seen during training, and does so without utilizing additional computational resources. We will release our code and models.
comment: Link To Official Implementation: https://github.com/assafbk/DeciMamba
☆ Towards evolution of Deep Neural Networks through contrastive Self-Supervised learning
Deep Neural Networks (DNNs) have been successfully applied to a wide range of problems. However, two main limitations are commonly pointed out. The first one is that they require long time to design. The other is that they heavily rely on labelled data, which can sometimes be costly and hard to obtain. In order to address the first problem, neuroevolution has been proved to be a plausible option to automate the design of DNNs. As for the second problem, self-supervised learning has been used to leverage unlabelled data to learn representations. Our goal is to study how neuroevolution can help self-supervised learning to bridge the gap to supervised learning in terms of performance. In this work, we propose a framework that is able to evolve deep neural networks using self-supervised learning. Our results on the CIFAR-10 dataset show that it is possible to evolve adequate neural networks while reducing the reliance on labelled data. Moreover, an analysis to the structure of the evolved networks suggests that the amount of labelled data fed to them has less effect on the structure of networks that learned via self-supervised learning, when compared to individuals that relied on supervised learning.
comment: IEEE World Congress on Computational Intelligence (WCCI) 2024; Keywords: NeuroEvolution, Deep Learning, Evolutionary Machine Learning
☆ Fantastic Copyrighted Beasts and How (Not) to Generate Them
Recent studies show that image and video generation models can be prompted to reproduce copyrighted content from their training data, raising serious legal concerns around copyright infringement. Copyrighted characters, in particular, pose a difficult challenge for image generation services, with at least one lawsuit already awarding damages based on the generation of these characters. Yet, little research has empirically examined this issue. We conduct a systematic evaluation to fill this gap. First, we build CopyCat, an evaluation suite consisting of diverse copyrighted characters and a novel evaluation pipeline. Our evaluation considers both the detection of similarity to copyrighted characters and generated image's consistency with user input. Our evaluation systematically shows that both image and video generation models can still generate characters even if characters' names are not explicitly mentioned in the prompt, sometimes with only two generic keywords (e.g., prompting with "videogame, plumber" consistently generates Nintendo's Mario character). We then introduce techniques to semi-automatically identify such keywords or descriptions that trigger character generation. Using our evaluation suite, we study runtime mitigation strategies, including both existing methods and new strategies we propose. Our findings reveal that commonly employed strategies, such as prompt rewriting in the DALL-E system, are not sufficient as standalone guardrails. These strategies must be coupled with other approaches, like negative prompting, to effectively reduce the unintended generation of copyrighted characters. Our work provides empirical grounding to the discussion of copyright mitigation strategies and offers actionable insights for model deployers actively implementing them.
☆ PostMark: A Robust Blackbox Watermark for Large Language Models
The most effective techniques to detect LLM-generated text rely on inserting a detectable signature -- or watermark -- during the model's decoding process. Most existing watermarking methods require access to the underlying LLM's logits, which LLM API providers are loath to share due to fears of model distillation. As such, these watermarks must be implemented independently by each LLM provider. In this paper, we develop PostMark, a modular post-hoc watermarking procedure in which an input-dependent set of words (determined via a semantic embedding) is inserted into the text after the decoding process has completed. Critically, PostMark does not require logit access, which means it can be implemented by a third party. We also show that PostMark is more robust to paraphrasing attacks than existing watermarking methods: our experiments cover eight baseline algorithms, five base LLMs, and three datasets. Finally, we evaluate the impact of PostMark on text quality using both automated and human assessments, highlighting the trade-off between quality and robustness to paraphrasing. We release our code, outputs, and annotations at https://github.com/lilakk/PostMark.
comment: preprint; 18 pages, 5 figures
☆ Solving a Stackelberg Game on Transportation Networks in a Dynamic Crime Scenario: A Mixed Approach on Multi-Layer Networks
Interdicting a criminal with limited police resources is a challenging task as the criminal changes location over time. The size of the large transportation network further adds to the difficulty of this scenario. To tackle this issue, we consider the concept of a layered graph. At each time stamp, we create a copy of the entire transportation network to track the possible movements of both players, the attacker and the defenders. We consider a Stackelberg game in a dynamic crime scenario where the attacker changes location over time while the defenders attempt to interdict the attacker on his escape route. Given a set of defender strategies, the optimal attacker strategy is determined by applying Dijkstra's algorithm on the layered networks. Here, the attacker aims to minimize while the defenders aim to maximize the probability of interdiction. We develop an approximation algorithm on the layered networks to find near-optimal strategy for defenders. The efficacy of the developed approach is compared with the adopted MILP approach. We compare the results in terms of computational time and solution quality. The quality of the results demonstrates the need for the developed approach, as it effectively solves the complex problem within a short amount of time.
☆ V-LASIK: Consistent Glasses-Removal from Videos Using Synthetic Data
Diffusion-based generative models have recently shown remarkable image and video editing capabilities. However, local video editing, particularly removal of small attributes like glasses, remains a challenge. Existing methods either alter the videos excessively, generate unrealistic artifacts, or fail to perform the requested edit consistently throughout the video. In this work, we focus on consistent and identity-preserving removal of glasses in videos, using it as a case study for consistent local attribute removal in videos. Due to the lack of paired data, we adopt a weakly supervised approach and generate synthetic imperfect data, using an adjusted pretrained diffusion model. We show that despite data imperfection, by learning from our generated data and leveraging the prior of pretrained diffusion models, our model is able to perform the desired edit consistently while preserving the original video content. Furthermore, we exemplify the generalization ability of our method to other local video editing tasks by applying it successfully to facial sticker-removal. Our approach demonstrates significant improvement over existing methods, showcasing the potential of leveraging synthetic data and strong video priors for local video editing tasks.
☆ Evidence of a log scaling law for political persuasion with large language models
Large language models can now generate political messages as persuasive as those written by humans, raising concerns about how far this persuasiveness may continue to increase with model size. Here, we generate 720 persuasive messages on 10 U.S. political issues from 24 language models spanning several orders of magnitude in size. We then deploy these messages in a large-scale randomized survey experiment (N = 25,982) to estimate the persuasive capability of each model. Our findings are twofold. First, we find evidence of a log scaling law: model persuasiveness is characterized by sharply diminishing returns, such that current frontier models are barely more persuasive than models smaller in size by an order of magnitude or more. Second, mere task completion (coherence, staying on topic) appears to account for larger models' persuasive advantage. These findings suggest that further scaling model size will not much increase the persuasiveness of static LLM-generated messages.
comment: 16 pages, 4 figures
☆ On Newton's Method to Unlearn Neural Networks
Machine unlearning facilitates personal data ownership, including the ``right to be forgotten''. The proliferation of applications of \emph{neural networks} (NNs) trained on users' personal data calls for the need to develop algorithms to unlearn an NN. Since retraining is costly, efficiency is often achieved through approximate unlearning which aims to unlearn a trained NN to be close to the retrained one (in distribution). Though the Newton's method has been used by previous works to approximately unlearn linear models, adapting it for unlearning an NN often encounters degenerate Hessians that make computing the Newton's update impossible. In this paper, we will first show that when coupled with naive yet often effective solutions to mitigate the degeneracy issue for unlearning, the Newton's method surprisingly suffers from catastrophic forgetting. To overcome this difficulty, we revise the Newton's method to include a theoretically justified regularizer and propose a cubic-regularized Newton's method for unlearning an NN. The cubic regularizer comes with the benefits of not requiring manual finetuning and affording a natural interpretation. Empirical evaluation on several models and real-world datasets shows that our method is more resilient to catastrophic forgetting and performs better than the baselines, especially in sequential unlearning.
☆ Proceedings of The second international workshop on eXplainable AI for the Arts (XAIxArts)
This second international workshop on explainable AI for the Arts (XAIxArts) brought together a community of researchers in HCI, Interaction Design, AI, explainable AI (XAI), and digital arts to explore the role of XAI for the Arts. Workshop held at the 16th ACM Conference on Creativity and Cognition (C&C 2024), Chicago, USA.
☆ Revealing Vision-Language Integration in the Brain with Multimodal Networks ICML 2024
We use (multi)modal deep neural networks (DNNs) to probe for sites of multimodal integration in the human brain by predicting stereoencephalography (SEEG) recordings taken while human subjects watched movies. We operationalize sites of multimodal integration as regions where a multimodal vision-language model predicts recordings better than unimodal language, unimodal vision, or linearly-integrated language-vision models. Our target DNN models span different architectures (e.g., convolutional networks and transformers) and multimodal training techniques (e.g., cross-attention and contrastive learning). As a key enabling step, we first demonstrate that trained vision and language models systematically outperform their randomly initialized counterparts in their ability to predict SEEG signals. We then compare unimodal and multimodal models against one another. Because our target DNN models often have different architectures, number of parameters, and training sets (possibly obscuring those differences attributable to integration), we carry out a controlled comparison of two models (SLIP and SimCLR), which keep all of these attributes the same aside from input modality. Using this approach, we identify a sizable number of neural sites (on average 141 out of 1090 total sites or 12.94%) and brain regions where multimodal integration seems to occur. Additionally, we find that among the variants of multimodal training techniques we assess, CLIP-style training is the best suited for downstream prediction of the neural activity in these sites.
comment: ICML 2024; 23 pages, 11 figures
☆ On Layer-wise Representation Similarity: Application for Multi-Exit Models with a Single Classifier
Analyzing the similarity of internal representations within and across different models has been an important technique for understanding the behavior of deep neural networks. Most existing methods for analyzing the similarity between representations of high dimensions, such as those based on Canonical Correlation Analysis (CCA) and widely used Centered Kernel Alignment (CKA), rely on statistical properties of the representations for a set of data points. In this paper, we focus on transformer models and study the similarity of representations between the hidden layers of individual transformers. In this context, we show that a simple sample-wise cosine similarity metric is capable of capturing the similarity and aligns with the complicated CKA. Our experimental results on common transformers reveal that representations across layers are positively correlated, albeit the similarity decreases when layers are far apart. We then propose an aligned training approach to enhance the similarity between internal representations, with trained models that enjoy the following properties: (1) the last-layer classifier can be directly applied right after any hidden layers, yielding intermediate layer accuracies much higher than those under standard training, (2) the layer-wise accuracies monotonically increase and reveal the minimal depth needed for the given task, (3) when served as multi-exit models, they achieve on-par performance with standard multi-exit architectures which consist of additional classifiers designed for early exiting in shallow layers. To our knowledge, our work is the first to show that one common classifier is sufficient for multi-exit models. We conduct experiments on both vision and NLP tasks to demonstrate the performance of the proposed aligned training.
☆ SafeSora: Towards Safety Alignment of Text2Video Generation via a Human Preference Dataset
To mitigate the risk of harmful outputs from large vision models (LVMs), we introduce the SafeSora dataset to promote research on aligning text-to-video generation with human values. This dataset encompasses human preferences in text-to-video generation tasks along two primary dimensions: helpfulness and harmlessness. To capture in-depth human preferences and facilitate structured reasoning by crowdworkers, we subdivide helpfulness into 4 sub-dimensions and harmlessness into 12 sub-categories, serving as the basis for pilot annotations. The SafeSora dataset includes 14,711 unique prompts, 57,333 unique videos generated by 4 distinct LVMs, and 51,691 pairs of preference annotations labeled by humans. We further demonstrate the utility of the SafeSora dataset through several applications, including training the text-video moderation model and aligning LVMs with human preference by fine-tuning a prompt augmentation module or the diffusion model. These applications highlight its potential as the foundation for text-to-video alignment research, such as human preference modeling and the development and validation of alignment algorithms.
☆ Learning telic-controllable state representations
Computational accounts of purposeful behavior consist of descriptive and normative aspects. The former enable agents to ascertain the current (or future) state of affairs in the world and the latter to evaluate the desirability, or lack thereof, of these states with respect to the agent's goals. In Reinforcement Learning, the normative aspect (reward and value functions) is assumed to depend on a pre-defined and fixed descriptive one (state representation). Alternatively, these two aspects may emerge interdependently: goals can be, and indeed often are, expressed in terms of state representation features, but they may also serve to shape state representations themselves. Here, we illustrate a novel theoretical framing of state representation learning in bounded agents, coupling descriptive and normative aspects via the notion of goal-directed, or telic, states. We define a new controllability property of telic state representations to characterize the tradeoff between their granularity and the policy complexity capacity required to reach all telic states. We propose an algorithm for learning controllable state representations and demonstrate it using a simple navigation task with changing goals. Our framework highlights the crucial role of deliberate ignorance - knowing what to ignore - for learning state representations that are both goal-flexible and simple. More broadly, our work provides a concrete step towards a unified theoretical view of natural and artificial learning through the lens of goals.
☆ Fusion of Movement and Naive Predictions for Point Forecasting in Univariate Random Walks
Traditional methods for point forecasting in univariate random walks often fail to surpass naive benchmarks due to data unpredictability. This study introduces a novel forecasting method that fuses movement prediction (binary classification) with naive forecasts for accurate one-step-ahead point forecasting. The method's efficacy is demonstrated through theoretical analysis, simulations, and real-world data experiments. It reliably exceeds naive forecasts with movement prediction accuracies as low as 0.55, outperforming baseline models like ARIMA, linear regression, MLP, and LSTM networks in forecasting the S\&P 500 index and Bitcoin prices. This method is particularly advantageous when accurate point predictions are challenging but accurate movement predictions are attainable, translating movement predictions into point forecasts in random walk contexts.
☆ Centimeter Positioning Accuracy using AI/ML for 6G Applications ICML
This research looks at using AI/ML to achieve centimeter-level user positioning in 6G applications such as the Industrial Internet of Things (IIoT). Initial results show that our AI/ML-based method can estimate user positions with an accuracy of 17 cm in an indoor factory environment. In this proposal, we highlight our approaches and future directions.
comment: 2 Pages, 2 Figures, ICMLCN Conference, Stockholm, Sweden
☆ Rewarding What Matters: Step-by-Step Reinforcement Learning for Task-Oriented Dialogue
Reinforcement learning (RL) is a powerful approach to enhance task-oriented dialogue (TOD) systems. However, existing RL methods tend to mainly focus on generation tasks, such as dialogue policy learning (DPL) or response generation (RG), while neglecting dialogue state tracking (DST) for understanding. This narrow focus limits the systems to achieve globally optimal performance by overlooking the interdependence between understanding and generation. Additionally, RL methods face challenges with sparse and delayed rewards, which complicates training and optimization. To address these issues, we extend RL into both understanding and generation tasks by introducing step-by-step rewards throughout the token generation. The understanding reward increases as more slots are correctly filled in DST, while the generation reward grows with the accurate inclusion of user requests. Our approach provides a balanced optimization aligned with task completion. Experimental results demonstrate that our approach effectively enhances the performance of TOD systems and achieves new state-of-the-art results on three widely used datasets, including MultiWOZ2.0, MultiWOZ2.1, and In-Car. Our approach also shows superior few-shot ability in low-resource settings compared to current models.
☆ APEER: Automatic Prompt Engineering Enhances Large Language Model Reranking
Large Language Models (LLMs) have significantly enhanced Information Retrieval (IR) across various modules, such as reranking. Despite impressive performance, current zero-shot relevance ranking with LLMs heavily relies on human prompt engineering. Existing automatic prompt engineering algorithms primarily focus on language modeling and classification tasks, leaving the domain of IR, particularly reranking, underexplored. Directly applying current prompt engineering algorithms to relevance ranking is challenging due to the integration of query and long passage pairs in the input, where the ranking complexity surpasses classification tasks. To reduce human effort and unlock the potential of prompt optimization in reranking, we introduce a novel automatic prompt engineering algorithm named APEER. APEER iteratively generates refined prompts through feedback and preference optimization. Extensive experiments with four LLMs and ten datasets demonstrate the substantial performance improvement of APEER over existing state-of-the-art (SoTA) manual prompts. Furthermore, we find that the prompts generated by APEER exhibit better transferability across diverse tasks and LLMs. Code is available at https://github.com/jincan333/APEER.
☆ Graph Representation Learning Strategies for Omics Data: A Case Study on Parkinson's Disease
Omics data analysis is crucial for studying complex diseases, but its high dimensionality and heterogeneity challenge classical statistical and machine learning methods. Graph neural networks have emerged as promising alternatives, yet the optimal strategies for their design and optimization in real-world biomedical challenges remain unclear. This study evaluates various graph representation learning models for case-control classification using high-throughput biological data from Parkinson's disease and control samples. We compare topologies derived from sample similarity networks and molecular interaction networks, including protein-protein and metabolite-metabolite interactions (PPI, MMI). Graph Convolutional Network (GCNs), Chebyshev spectral graph convolution (ChebyNet), and Graph Attention Network (GAT), are evaluated alongside advanced architectures like graph transformers, the graph U-net, and simpler models like multilayer perceptron (MLP). These models are systematically applied to transcriptomics and metabolomics data independently. Our comparative analysis highlights the benefits and limitations of various architectures in extracting patterns from omics data, paving the way for more accurate and interpretable models in biomedical research.
comment: Submitted to Machine Learning in Computational Biology 2024 as an extended abstract, 2 pages + 1 appendix
☆ CollaFuse: Collaborative Diffusion Models
In the landscape of generative artificial intelligence, diffusion-based models have emerged as a promising method for generating synthetic images. However, the application of diffusion models poses numerous challenges, particularly concerning data availability, computational requirements, and privacy. Traditional approaches to address these shortcomings, like federated learning, often impose significant computational burdens on individual clients, especially those with constrained resources. In response to these challenges, we introduce a novel approach for distributed collaborative diffusion models inspired by split learning. Our approach facilitates collaborative training of diffusion models while alleviating client computational burdens during image synthesis. This reduced computational burden is achieved by retaining data and computationally inexpensive processes locally at each client while outsourcing the computationally expensive processes to shared, more efficient server resources. Through experiments on the common CelebA dataset, our approach demonstrates enhanced privacy by reducing the necessity for sharing raw data. These capabilities hold significant potential across various application areas, including the design of edge computing solutions. Thus, our work advances distributed machine learning by contributing to the evolution of collaborative diffusion models.
comment: 13 pages, 7 figures
☆ Control when confidence is costly NeurIPS 2024
We develop a version of stochastic control that accounts for computational costs of inference. Past studies identified efficient coding without control, or efficient control that neglects the cost of synthesizing information. Here we combine these concepts into a framework where agents rationally approximate inference for efficient control. Specifically, we study Linear Quadratic Gaussian (LQG) control with an added internal cost on the relative precision of the posterior probability over the world state. This creates a trade-off: an agent can obtain more utility overall by sacrificing some task performance, if doing so saves enough bits during inference. We discover that the rational strategy that solves the joint inference and control problem goes through phase transitions depending on the task demands, switching from a costly but optimal inference to a family of suboptimal inferences related by rotation transformations, each misestimate the stability of the world. In all cases, the agent moves more to think less. This work provides a foundation for a new type of rational computations that could be used by both brains and machines for efficient but computationally constrained control.
comment: 9 pages, 4 figures, submitted to NeurIPS 2024
☆ SynDARin: Synthesising Datasets for Automated Reasoning in Low-Resource Languages
Question Answering (QA) datasets have been instrumental in developing and evaluating Large Language Model (LLM) capabilities. However, such datasets are scarce for languages other than English due to the cost and difficulties of collection and manual annotation. This means that producing novel models and measuring the performance of multilingual LLMs in low-resource languages is challenging. To mitigate this, we propose $\textbf{S}$yn$\textbf{DAR}$in, a method for generating and validating QA datasets for low-resource languages. We utilize parallel content mining to obtain $\textit{human-curated}$ paragraphs between English and the target language. We use the English data as context to $\textit{generate}$ synthetic multiple-choice (MC) question-answer pairs, which are automatically translated and further validated for quality. Combining these with their designated non-English $\textit{human-curated}$ paragraphs form the final QA dataset. The method allows to maintain the content quality, reduces the likelihood of factual errors, and circumvents the need for costly annotation. To test the method, we created a QA dataset with $1.2$K samples for the Armenian language. The human evaluation shows that $98\%$ of the generated English data maintains quality and diversity in the question types and topics, while the translation validation pipeline can filter out $\sim70\%$ of data with poor quality. We use the dataset to benchmark state-of-the-art LLMs, showing their inability to achieve human accuracy with some model performances closer to random chance. This shows that the generated dataset is non-trivial and can be used to evaluate reasoning capabilities in low-resource language.
☆ FutureNet-LOF: Joint Trajectory Prediction and Lane Occupancy Field Prediction with Future Context Encoding
Most prior motion prediction endeavors in autonomous driving have inadequately encoded future scenarios, leading to predictions that may fail to accurately capture the diverse movements of agents (e.g., vehicles or pedestrians). To address this, we propose FutureNet, which explicitly integrates initially predicted trajectories into the future scenario and further encodes these future contexts to enhance subsequent forecasting. Additionally, most previous motion forecasting works have focused on predicting independent futures for each agent. However, safe and smooth autonomous driving requires accurately predicting the diverse future behaviors of numerous surrounding agents jointly in complex dynamic environments. Given that all agents occupy certain potential travel spaces and possess lane driving priority, we propose Lane Occupancy Field (LOF), a new representation with lane semantics for motion forecasting in autonomous driving. LOF can simultaneously capture the joint probability distribution of all road participants' future spatial-temporal positions. Due to the high compatibility between lane occupancy field prediction and trajectory prediction, we propose a novel network with future context encoding for the joint prediction of these two tasks. Our approach ranks 1st on two large-scale motion forecasting benchmarks: Argoverse 1 and Argoverse 2.
comment: 10 pages
☆ FVEL: Interactive Formal Verification Environment with Large Language Models via Theorem Proving
Formal verification (FV) has witnessed growing significance with current emerging program synthesis by the evolving large language models (LLMs). However, current formal verification mainly resorts to symbolic verifiers or hand-craft rules, resulting in limitations for extensive and flexible verification. On the other hand, formal languages for automated theorem proving, such as Isabelle, as another line of rigorous verification, are maintained with comprehensive rules and theorems. In this paper, we propose FVEL, an interactive Formal Verification Environment with LLMs. Specifically, FVEL transforms a given code to be verified into Isabelle, and then conducts verification via neural automated theorem proving with an LLM. The joined paradigm leverages the rigorous yet abundant formulated and organized rules in Isabelle and is also convenient for introducing and adjusting cutting-edge LLMs. To achieve this goal, we extract a large-scale FVELER3. The FVELER dataset includes code dependencies and verification processes that are formulated in Isabelle, containing 758 theories, 29,125 lemmas, and 200,646 proof steps in total with in-depth dependencies. We benchmark FVELER in the FVEL environment by first fine-tuning LLMs with FVELER and then evaluating them on Code2Inv and SV-COMP. The results show that FVEL with FVELER fine-tuned Llama3- 8B solves 17.39% (69 -> 81) more problems, and Mistral-7B 12% (75 -> 84) more problems in SV-COMP. And the proportion of proof errors is reduced. Project page: https://fveler.github.io/.
☆ Fair Streaming Feature Selection
Streaming feature selection techniques have become essential in processing real-time data streams, as they facilitate the identification of the most relevant attributes from continuously updating information. Despite their performance, current algorithms to streaming feature selection frequently fall short in managing biases and avoiding discrimination that could be perpetuated by sensitive attributes, potentially leading to unfair outcomes in the resulting models. To address this issue, we propose FairSFS, a novel algorithm for Fair Streaming Feature Selection, to uphold fairness in the feature selection process without compromising the ability to handle data in an online manner. FairSFS adapts to incoming feature vectors by dynamically adjusting the feature set and discerns the correlations between classification attributes and sensitive attributes from this revised set, thereby forestalling the propagation of sensitive data. Empirical evaluations show that FairSFS not only maintains accuracy that is on par with leading streaming feature selection methods and existing fair feature techniques but also significantly improves fairness metrics.
comment: 30 pages, 10 figures
☆ Computation-Efficient Semi-Supervised Learning for ECG-based Cardiovascular Diseases Detection
Label scarcity problem is the main challenge that hinders the wide application of deep learning systems in automatic cardiovascular diseases (CVDs) detection using electrocardiography (ECG). Tuning pre-trained models alleviates this problem by transferring knowledge learned from large datasets to downstream small datasets. However, bottlenecks in computational efficiency and CVDs detection performance limit its clinical applications. It is difficult to improve the detection performance without significantly sacrificing model computational efficiency. Here, we propose a computation-efficient semi-supervised learning paradigm (FastECG) for robust and computation-efficient CVDs detection using ECG. It enables a robust adaptation of pre-trained models on downstream datasets with limited supervision and high computational efficiency. First, a random-deactivation technique is developed to achieve robust and fast low-rank adaptation of pre-trained weights. Subsequently, we propose a one-shot rank allocation module to determine the optimal ranks for the update matrices of the pre-trained weights. Finally, a lightweight semi-supervised learning pipeline is introduced to enhance model performance by leveraging labeled and unlabeled data with high computational efficiency. Extensive experiments on four downstream ECG datasets demonstrate that FastECG not only outperforms the state-of-the-art methods in multi-label CVDs detection but also consumes fewer GPU footprints, training time, and parameter storage space. As such, this paradigm provides an effective solution for achieving high computational efficiency and robust detection performance in the clinical applications of pre-trained models under limited supervision.
☆ Artificial Leviathan: Exploring Social Evolution of LLM Agents Through the Lens of Hobbesian Social Contract Theory
The emergence of Large Language Models (LLMs) and advancements in Artificial Intelligence (AI) offer an opportunity for computational social science research at scale. Building upon prior explorations of LLM agent design, our work introduces a simulated agent society where complex social relationships dynamically form and evolve over time. Agents are imbued with psychological drives and placed in a sandbox survival environment. We conduct an evaluation of the agent society through the lens of Thomas Hobbes's seminal Social Contract Theory (SCT). We analyze whether, as the theory postulates, agents seek to escape a brutish "state of nature" by surrendering rights to an absolute sovereign in exchange for order and security. Our experiments unveil an alignment: Initially, agents engage in unrestrained conflict, mirroring Hobbes's depiction of the state of nature. However, as the simulation progresses, social contracts emerge, leading to the authorization of an absolute sovereign and the establishment of a peaceful commonwealth founded on mutual cooperation. This congruence between our LLM agent society's evolutionary trajectory and Hobbes's theoretical account indicates LLMs' capability to model intricate social dynamics and potentially replicate forces that shape human societies. By enabling such insights into group behavior and emergent societal phenomena, LLM-driven multi-agent simulations, while unable to simulate all the nuances of human behavior, may hold potential for advancing our understanding of social structures, group dynamics, and complex human systems.
☆ PoseBench: Benchmarking the Robustness of Pose Estimation Models under Corruptions
Pose estimation aims to accurately identify anatomical keypoints in humans and animals using monocular images, which is crucial for various applications such as human-machine interaction, embodied AI, and autonomous driving. While current models show promising results, they are typically trained and tested on clean data, potentially overlooking the corruption during real-world deployment and thus posing safety risks in practical scenarios. To address this issue, we introduce PoseBench, a comprehensive benchmark designed to evaluate the robustness of pose estimation models against real-world corruption. We evaluated 60 representative models, including top-down, bottom-up, heatmap-based, regression-based, and classification-based methods, across three datasets for human and animal pose estimation. Our evaluation involves 10 types of corruption in four categories: 1) blur and noise, 2) compression and color loss, 3) severe lighting, and 4) masks. Our findings reveal that state-of-the-art models are vulnerable to common real-world corruptions and exhibit distinct behaviors when tackling human and animal pose estimation tasks. To improve model robustness, we delve into various design considerations, including input resolution, pre-training datasets, backbone capacity, post-processing, and data augmentations. We hope that our benchmark will serve as a foundation for advancing research in robust pose estimation. The benchmark and source code will be released at https://xymsh.github.io/PoseBench
comment: Technical report. Project page: https://xymsh.github.io/PoseBench/
☆ Communication-Efficient Byzantine-Resilient Federated Zero-Order Optimization
We introduce CYBER-0, the first zero-order optimization algorithm for memory-and-communication efficient Federated Learning, resilient to Byzantine faults. We show through extensive numerical experiments on the MNIST dataset and finetuning RoBERTa-Large that CYBER-0 outperforms state-of-the-art algorithms in terms of communication and memory efficiency while reaching similar accuracy. We provide theoretical guarantees on its convergence for convex loss functions.
☆ Robustness Analysis of AI Models in Critical Energy Systems
This paper analyzes the robustness of state-of-the-art AI-based models for power grid operations under the $N-1$ security criterion. While these models perform well in regular grid settings, our results highlight a significant loss in accuracy following the disconnection of a line.%under this security criterion. Using graph theory-based analysis, we demonstrate the impact of node connectivity on this loss. Our findings emphasize the need for practical scenario considerations in developing AI methodologies for critical infrastructure.
☆ The neural correlates of logical-mathematical symbol systems processing resemble that of spatial cognition more than natural language processing
The ability to manipulate logical-mathematical symbols (LMS), encompassing tasks such as calculation, reasoning, and programming, is a cognitive skill arguably unique to humans. Considering the relatively recent emergence of this ability in human evolutionary history, it has been suggested that LMS processing may build upon more fundamental cognitive systems, possibly through neuronal recycling. Previous studies have pinpointed two primary candidates, natural language processing and spatial cognition. Existing comparisons between these domains largely relied on task-level comparison, which may be confounded by task idiosyncrasy. The present study instead compared the neural correlates at the domain level with both automated meta-analysis and synthesized maps based on three representative LMS tasks, reasoning, calculation, and mental programming. Our results revealed a more substantial cortical overlap between LMS processing and spatial cognition, in contrast to language processing. Furthermore, in regions activated by both spatial and language processing, the multivariate activation pattern for LMS processing exhibited greater multivariate similarity to spatial cognition than to language processing. A hierarchical clustering analysis further indicated that typical LMS tasks were indistinguishable from spatial cognition tasks at the neural level, suggesting an inherent connection between these two cognitive processes. Taken together, our findings support the hypothesis that spatial cognition is likely the basis of LMS processing, which may shed light on the limitations of large language models in logical reasoning, particularly those trained exclusively on textual data without explicit emphasis on spatial content.
☆ Automatic Labels are as Effective as Manual Labels in Biomedical Images Classification with Deep Learning
The increasing availability of biomedical data is helping to design more robust deep learning (DL) algorithms to analyze biomedical samples. Currently, one of the main limitations to train DL algorithms to perform a specific task is the need for medical experts to label data. Automatic methods to label data exist, however automatic labels can be noisy and it is not completely clear when automatic labels can be adopted to train DL models. This paper aims to investigate under which circumstances automatic labels can be adopted to train a DL model on the classification of Whole Slide Images (WSI). The analysis involves multiple architectures, such as Convolutional Neural Networks (CNN) and Vision Transformer (ViT), and over 10000 WSIs, collected from three use cases: celiac disease, lung cancer and colon cancer, which one including respectively binary, multiclass and multilabel data. The results allow identifying 10% as the percentage of noisy labels that lead to train competitive models for the classification of WSIs. Therefore, an algorithm generating automatic labels needs to fit this criterion to be adopted. The application of the Semantic Knowledge Extractor Tool (SKET) algorithm to generate automatic labels leads to performance comparable to the one obtained with manual labels, since it generates a percentage of noisy labels between 2-5%. Automatic labels are as effective as manual ones, reaching solid performance comparable to the one obtained training models with manual labels.
comment: pre-print of the journal paper
☆ iWISDM: Assessing instruction following in multimodal models at scale
The ability to perform complex tasks from detailed instructions is a key to many remarkable achievements of our species. As humans, we are not only capable of performing a wide variety of tasks but also very complex ones that may entail hundreds or thousands of steps to complete. Large language models and their more recent multimodal counterparts that integrate textual and visual inputs have achieved unprecedented success in performing complex tasks. Yet, most existing benchmarks are largely confined to single-modality inputs (either text or vision), narrowing the scope of multimodal assessments, particularly for instruction-following in multimodal contexts. To bridge this gap, we introduce the instructed-Virtual VISual Decision Making (iWISDM) environment engineered to generate a limitless array of vision-language tasks of varying complexity. Using iWISDM, we compiled three distinct benchmarks of instruction following visual tasks across varying complexity levels and evaluated several newly developed multimodal models on these benchmarks. Our findings establish iWISDM as a robust benchmark for assessing the instructional adherence of both existing and emergent multimodal models and highlight a large gap between these models' ability to precisely follow instructions with that of humans.
☆ Exploring Spatial Representations in the Historical Lake District Texts with LLM-based Relation Extraction
Navigating historical narratives poses a challenge in unveiling the spatial intricacies of past landscapes. The proposed work addresses this challenge within the context of the English Lake District, employing the Corpus of the Lake District Writing. The method utilizes a generative pre-trained transformer model to extract spatial relations from the textual descriptions in the corpus. The study applies this large language model to understand the spatial dimensions inherent in historical narratives comprehensively. The outcomes are presented as semantic triples, capturing the nuanced connections between entities and locations, and visualized as a network, offering a graphical representation of the spatial narrative. The study contributes to a deeper comprehension of the English Lake District's spatial tapestry and provides an approach to uncovering spatial relations within diverse historical contexts.
Self-supervised Interpretable Concept-based Models for Text Classification
Despite their success, Large-Language Models (LLMs) still face criticism as their lack of interpretability limits their controllability and reliability. Traditional post-hoc interpretation methods, based on attention and gradient-based analysis, offer limited insight into the model's decision-making processes. In the image field, Concept-based models have emerged as explainable-by-design architectures, employing human-interpretable features as intermediate representations. However, these methods have not been yet adapted to textual data, mainly because they require expensive concept annotations, which are impractical for real-world text data. This paper addresses this challenge by proposing a self-supervised Interpretable Concept Embedding Models (ICEMs). We leverage the generalization abilities of LLMs to predict the concepts labels in a self-supervised way, while we deliver the final predictions with an interpretable function. The results of our experiments show that ICEMs can be trained in a self-supervised way achieving similar performance to fully supervised concept-based models and end-to-end black-box ones. Additionally, we show that our models are (i) interpretable, offering meaningful logical explanations for their predictions; (ii) interactable, allowing humans to modify intermediate predictions through concept interventions; and (iii) controllable, guiding the LLMs' decoding process to follow a required decision-making path.
☆ LiveMind: Low-latency Large Language Models with Simultaneous Inference
In this paper, we introduce a novel low-latency inference framework for large language models (LLMs) inference which enables LLMs to perform inferences with incomplete prompts. By reallocating computational processes to prompt input phase, we achieve a substantial reduction in latency, thereby significantly enhancing the interactive experience for users of LLMs. The framework adeptly manages the visibility of the streaming prompt to the model, allowing it to infer from incomplete prompts or await additional prompts. Compared with traditional inference methods that utilize complete prompts, our approach demonstrates an average reduction of 59% in response latency on the MMLU-Pro dataset, while maintaining comparable accuracy. Additionally, our framework facilitates collaborative inference and output across different models. By employing an LLM for inference and a small language model (SLM) for output, we achieve an average 68% reduction in response latency, alongside a 5.5% improvement in accuracy on the MMLU-Pro dataset compared with the SLM baseline. For long prompts exceeding 20 sentences, the response latency can be reduced by up to 93%.
☆ The Fire Thief Is Also the Keeper: Balancing Usability and Privacy in Prompts
The rapid adoption of online chatbots represents a significant advancement in artificial intelligence. However, this convenience brings considerable privacy concerns, as prompts can inadvertently contain sensitive information exposed to large language models (LLMs). Limited by high computational costs, reduced task usability, and excessive system modifications, previous works based on local deployment, embedding perturbation, and homomorphic encryption are inapplicable to online prompt-based LLM applications. To address these issues, this paper introduces Prompt Privacy Sanitizer (i.e., ProSan), an end-to-end prompt privacy protection framework that can produce anonymized prompts with contextual privacy removed while maintaining task usability and human readability. It can also be seamlessly integrated into the online LLM service pipeline. To achieve high usability and dynamic anonymity, ProSan flexibly adjusts its protection targets and strength based on the importance of the words and the privacy leakage risk of the prompts. Additionally, ProSan is capable of adapting to diverse computational resource conditions, ensuring privacy protection even for mobile devices with limited computing power. Our experiments demonstrate that ProSan effectively removes private information across various tasks, including question answering, text summarization, and code generation, with minimal reduction in task performance.
☆ Identifying User Goals from UI Trajectories
Autonomous agents that interact with graphical user interfaces (GUIs) hold significant potential for enhancing user experiences. To further improve these experiences, agents need to be personalized and proactive. By effectively comprehending user intentions through their actions and interactions with GUIs, agents will be better positioned to achieve these goals. This paper introduces the task of goal identification from observed UI trajectories, aiming to infer the user's intended task based on their GUI interactions. We propose a novel evaluation metric to assess whether two task descriptions are paraphrases within a specific UI environment. By Leveraging the inverse relation with the UI automation task, we utilized the Android-In-The-Wild and Mind2Web datasets for our experiments. Using our metric and these datasets, we conducted several experiments comparing the performance of humans and state-of-the-art models, specifically GPT-4 and Gemini-1.5 Pro. Our results show that Gemini performs better than GPT but still underperforms compared to humans, indicating significant room for improvement.
☆ Robust Few-shot Transfer Learning for Knowledge Base Question Answering with Unanswerable Questions
Real-world KBQA applications require models that are (1) robust -- e.g., can differentiate between answerable and unanswerable questions, and (2) low-resource -- do not require large training data. Towards this goal, we propose the novel task of few-shot transfer for KBQA with unanswerable questions. We present FUn-FuSIC that extends the state-of-the-art (SoTA) few-shot transfer model for answerable-only KBQA to handle unanswerability. It iteratively prompts an LLM to generate logical forms for the question by providing feedback using a diverse suite of syntactic, semantic and execution guided checks, and adapts self-consistency to assess confidence of the LLM to decide answerability. Experiments over newly constructed datasets show that FUn-FuSIC outperforms suitable adaptations of the SoTA model for KBQA with unanswerability, and the SoTA model for answerable-only few-shot-transfer KBQA.
☆ Infusing clinical knowledge into tokenisers for language models
This study introduces a novel knowledge enhanced tokenisation mechanism, K-Tokeniser, for clinical text processing. Technically, at initialisation stage, K-Tokeniser populates global representations of tokens based on semantic types of domain concepts (such as drugs or diseases) from either a domain ontology like Unified Medical Language System or the training data of the task related corpus. At training or inference stage, sentence level localised context will be utilised for choosing the optimal global token representation to realise the semantic-based tokenisation. To avoid pretraining using the new tokeniser, an embedding initialisation approach is proposed to generate representations for new tokens. Using three transformer-based language models, a comprehensive set of experiments are conducted on four real-world datasets for evaluating K-Tokeniser in a wide range of clinical text analytics tasks including clinical concept and relation extraction, automated clinical coding, clinical phenotype identification, and clinical research article classification. Overall, our models demonstrate consistent improvements over their counterparts in all tasks. In particular, substantial improvements are observed in the automated clinical coding task with 13\% increase on Micro $F_1$ score. Furthermore, K-Tokeniser also shows significant capacities in facilitating quicker converge of language models. Specifically, using K-Tokeniser, the language models would only require 50\% of the training data to achieve the best performance of the baseline tokeniser using all training data in the concept extraction task and less than 20\% of the data for the automated coding task. It is worth mentioning that all these improvements require no pre-training process, making the approach generalisable.
comment: 18 pages, 6 figures
☆ Cross-level Requirement Traceability: A Novel Approach Integrating Bag-of-Words and Word Embedding for Enhanced Similarity Functionality
Requirement traceability is the process of identifying the inter-dependencies between requirements. It poses a significant challenge when conducted manually, especially when dealing with requirements at various levels of abstraction. In this work, we propose a novel approach to automate the task of linking high-level business requirements with more technical system requirements. The proposed approach begins by representing each requirement using a Bag of-Words (BOW) model combined with the Term Frequency-Inverse Document Frequency (TF-IDF) scoring function. Then, we suggested an enhanced cosine similarity that uses recent advances in word embedding representation to correct traditional cosine similarity function limitations. To evaluate the effectiveness of our approach, we conducted experiments on three well-known datasets: COEST, WARC(NFR), and WARC(FRS). The results demonstrate that our approach significantly improves efficiency compared to existing methods. We achieved better results with an increase of approximately 18.4% in one of the datasets, as measured by the F2 score.
☆ Identifiable Exchangeable Mechanisms for Causal Structure and Representation Learning
Identifying latent representations or causal structures is important for good generalization and downstream task performance. However, both fields have been developed rather independently. We observe that several methods in both representation and causal structure learning rely on the same data-generating process (DGP), namely, exchangeable but not i.i.d. (independent and identically distributed) data. We provide a unified framework, termed Identifiable Exchangeable Mechanisms (IEM), for representation and structure learning under the lens of exchangeability. IEM provides new insights that let us relax the necessary conditions for causal structure identification in exchangeable non--i.i.d. data. We also demonstrate the existence of a duality condition in identifiable representation learning, leading to new identifiability results. We hope this work will pave the way for further research in causal representation learning.
☆ AI in Space for Scientific Missions: Strategies for Minimizing Neural-Network Model Upload
Artificial Intelligence (AI) has the potential to revolutionize space exploration by delegating several spacecraft decisions to an onboard AI instead of relying on ground control and predefined procedures. It is likely that there will be an AI/ML Processing Unit onboard the spacecraft running an inference engine. The neural-network will have pre-installed parameters that can be updated onboard by uploading, by telecommands, parameters obtained by training on the ground. However, satellite uplinks have limited bandwidth and transmissions can be costly. Furthermore, a mission operating with a suboptimal neural network will miss out on valuable scientific data. Smaller networks can thereby decrease the uplink cost, while increasing the value of the scientific data that is downloaded. In this work, we evaluate and discuss the use of reduced-precision and bare-minimum neural networks to reduce the time for upload. As an example of an AI use case, we focus on the NASA's Magnetosperic MultiScale (MMS) mission. We show how an AI onboard could be used in the Earth's magnetosphere to classify data to selectively downlink higher value data or to recognize a region-of-interest to trigger a burst-mode, collecting data at a high-rate. Using a simple filtering scheme and algorithm, we show how the start and end of a region-of-interest can be detected in on a stream of classifications. To provide the classifications, we use an established Convolutional Neural Network (CNN) trained to an accuracy >94%. We also show how the network can be reduced to a single linear layer and trained to the same accuracy as the established CNN. Thereby, reducing the overall size of the model by up to 98.9%. We further show how each network can be reduced by up to 75% of its original size, by using lower-precision formats to represent the network parameters, with a change in accuracy of less than 0.6 percentage points.
☆ DASB -- Discrete Audio and Speech Benchmark
Discrete audio tokens have recently gained considerable attention for their potential to connect audio and language processing, enabling the creation of modern multimodal large language models. Ideal audio tokens must effectively preserve phonetic and semantic content along with paralinguistic information, speaker identity, and other details. While several types of audio tokens have been recently proposed, identifying the optimal tokenizer for various tasks is challenging due to the inconsistent evaluation settings in existing studies. To address this gap, we release the Discrete Audio and Speech Benchmark (DASB), a comprehensive leaderboard for benchmarking discrete audio tokens across a wide range of discriminative tasks, including speech recognition, speaker identification and verification, emotion recognition, keyword spotting, and intent classification, as well as generative tasks such as speech enhancement, separation, and text-to-speech. Our results show that, on average, semantic tokens outperform compression tokens across most discriminative and generative tasks. However, the performance gap between semantic tokens and standard continuous representations remains substantial, highlighting the need for further research in this field.
comment: 9 pages, 5 tables
☆ Revisiting Modularity Maximization for Graph Clustering: A Contrastive Learning Perspective KDD 2024
Graph clustering, a fundamental and challenging task in graph mining, aims to classify nodes in a graph into several disjoint clusters. In recent years, graph contrastive learning (GCL) has emerged as a dominant line of research in graph clustering and advances the new state-of-the-art. However, GCL-based methods heavily rely on graph augmentations and contrastive schemes, which may potentially introduce challenges such as semantic drift and scalability issues. Another promising line of research involves the adoption of modularity maximization, a popular and effective measure for community detection, as the guiding principle for clustering tasks. Despite the recent progress, the underlying mechanism of modularity maximization is still not well understood. In this work, we dig into the hidden success of modularity maximization for graph clustering. Our analysis reveals the strong connections between modularity maximization and graph contrastive learning, where positive and negative examples are naturally defined by modularity. In light of our results, we propose a community-aware graph clustering framework, coined MAGI, which leverages modularity maximization as a contrastive pretext task to effectively uncover the underlying information of communities in graphs, while avoiding the problem of semantic drift. Extensive experiments on multiple graph datasets verify the effectiveness of MAGI in terms of scalability and clustering performance compared to state-of-the-art graph clustering methods. Notably, MAGI easily scales a sufficiently large graph with 100M nodes while outperforming strong baselines.
comment: KDD 2024 research track. Code available at https://github.com/EdisonLeeeee/MAGI
☆ VAIYAKARANA : A Benchmark for Automatic Grammar Correction in Bangla
Bangla (Bengali) is the fifth most spoken language globally and, yet, the problem of automatic grammar correction in Bangla is still in its nascent stage. This is mostly due to the need for a large corpus of grammatically incorrect sentences, with their corresponding correct counterparts. The present state-of-the-art techniques to curate a corpus for grammatically wrong sentences involve random swapping, insertion and deletion of words. However,these steps may not always generate grammatically wrong sentences in Bangla. In this work, we propose a pragmatic approach to generate grammatically wrong sentences in Bangla. We first categorize the different kinds of errors in Bangla into 5 broad classes and 12 finer classes. We then use these to generate grammatically wrong sentences systematically from a correct sentence. This approach can generate a large number of wrong sentences and can, thus, mitigate the challenge of lacking a large corpus for neural networks. We provide a dataset, Vaiyakarana, consisting of 92,830 grammatically incorrect sentences as well as 18,426 correct sentences. We also collected 619 human-generated sentences from essays written by Bangla native speakers. This helped us to understand errors that are more frequent. We evaluated our corpus against neural models and LLMs and also benchmark it against human evaluators who are native speakers of Bangla. Our analysis shows that native speakers are far more accurate than state-of-the-art models to detect whether the sentence is grammatically correct. Our methodology of generating erroneous sentences can be applied for most other Indian languages as well.
☆ Q*: Improving Multi-step Reasoning for LLMs with Deliberative Planning
Large Language Models (LLMs) have demonstrated impressive capability in many nature language tasks. However, the auto-regressive generation process makes LLMs prone to produce errors, hallucinations and inconsistent statements when performing multi-step reasoning. In this paper, we aim to alleviate the pathology by introducing Q*, a general, versatile and agile framework for guiding LLMs decoding process with deliberative planning. By learning a plug-and-play Q-value model as heuristic function, our Q* can effectively guide LLMs to select the most promising next step without fine-tuning LLMs for each task, which avoids the significant computational overhead and potential risk of performance degeneration on other tasks. Extensive experiments on GSM8K, MATH and MBPP confirm the superiority of our method.
☆ Learning to Plan for Retrieval-Augmented Large Language Models from Knowledge Graphs
Improving the performance of large language models (LLMs) in complex question-answering (QA) scenarios has always been a research focal point. Recent studies have attempted to enhance LLMs' performance by combining step-wise planning with external retrieval. While effective for advanced models like GPT-3.5, smaller LLMs face challenges in decomposing complex questions, necessitating supervised fine-tuning. Previous work has relied on manual annotation and knowledge distillation from teacher LLMs, which are time-consuming and not accurate enough. In this paper, we introduce a novel framework for enhancing LLMs' planning capabilities by using planning data derived from knowledge graphs (KGs). LLMs fine-tuned with this data have improved planning capabilities, better equipping them to handle complex QA tasks that involve retrieval. Evaluations on multiple datasets, including our newly proposed benchmark, highlight the effectiveness of our framework and the benefits of KG-derived planning data.
comment: Work in progress
☆ FairX: A comprehensive benchmarking tool for model analysis using fairness, utility, and explainability
We present FairX, an open-source Python-based benchmarking tool designed for the comprehensive analysis of models under the umbrella of fairness, utility, and eXplainability (XAI). FairX enables users to train benchmarking bias-removal models and evaluate their fairness using a wide array of fairness metrics, data utility metrics, and generate explanations for model predictions, all within a unified framework. Existing benchmarking tools do not have the way to evaluate synthetic data generated from fair generative models, also they do not have the support for training fair generative models either. In FairX, we add fair generative models in the collection of our fair-model library (pre-processing, in-processing, post-processing) and evaluation metrics for evaluating the quality of synthetic fair data. This version of FairX supports both tabular and image datasets. It also allows users to provide their own custom datasets. The open-source FairX benchmarking package is publicly available at https://github.com/fahim-sikder/FairX.
☆ Augmenting Query and Passage for Retrieval-Augmented Generation using LLMs for Open-Domain Question Answering
Retrieval-augmented generation (RAG) has received much attention for Open-domain question-answering (ODQA) tasks as a means to compensate for the parametric knowledge of large language models (LLMs). While previous approaches focused on processing retrieved passages to remove irrelevant context, they still rely heavily on the quality of retrieved passages which can degrade if the question is ambiguous or complex. In this paper, we propose a simple yet efficient method called question and passage augmentation via LLMs for open-domain QA. Our method first decomposes the original questions into multiple-step sub-questions. By augmenting the original question with detailed sub-questions and planning, we are able to make the query more specific on what needs to be retrieved, improving the retrieval performance. In addition, to compensate for the case where the retrieved passages contain distracting information or divided opinions, we augment the retrieved passages with self-generated passages by LLMs to guide the answer extraction. Experimental results show that the proposed scheme outperforms the previous state-of-the-art and achieves significant performance gain over existing RAG methods.
☆ Step-Back Profiling: Distilling User History for Personalized Scientific Writing
Large language models (LLMs) excel at a variety of natural language processing tasks, yet they struggle to generate personalized content for individuals, particularly in real-world scenarios like scientific writing. Addressing this challenge, we introduce Step-Back Profiling to personalize LLMs by distilling user history into concise profiles, including essential traits and preferences of users. Regarding our experiments, we construct a Personalized Scientific Writing (PSW) dataset to study multiuser personalization. PSW requires the models to write scientific papers given specialized author groups with diverse academic backgrounds. As for the results, we demonstrate the effectiveness of capturing user characteristics via Step-Back Profiling for collaborative writing. Moreover, our approach outperforms the baselines by up to 3.6 points on the general personalization benchmark (LaMP), including 7 personalization LLM tasks. Our extensive ablation studies validate the contributions of different components in our method and provide insights into our task definition. Our dataset and code are available at \url{https://github.com/gersteinlab/step-back-profiling}.
☆ The Impact of AI on Perceived Job Decency and Meaningfulness: A Case Study
The proliferation of Artificial Intelligence (AI) in workplaces stands to change the way humans work, with job satisfaction intrinsically linked to work life. Existing research on human-AI collaboration tends to prioritize performance over the experiential aspects of work. In contrast, this paper explores the impact of AI on job decency and meaningfulness in workplaces. Through interviews in the Information Technology (IT) domain, we not only examined the current work environment, but also explored the perceived evolution of the workplace ecosystem with the introduction of an AI. Findings from the preliminary exploratory study reveal that respondents tend to visualize a workplace where humans continue to play a dominant role, even with the introduction of advanced AIs. In this prospective scenario, AI is seen as serving as a complement rather than replacing the human workforce. Furthermore, respondents believe that the introduction of AI will maintain or potentially increase overall job satisfaction.
☆ On the Evaluation Practices in Multilingual NLP: Can Machine Translation Offer an Alternative to Human Translations?
While multilingual language models (MLMs) have been trained on 100+ languages, they are typically only evaluated across a handful of them due to a lack of available test data in most languages. This is particularly problematic when assessing MLM's potential for low-resource and unseen languages. In this paper, we present an analysis of existing evaluation frameworks in multilingual NLP, discuss their limitations, and propose several directions for more robust and reliable evaluation practices. Furthermore, we empirically study to what extent machine translation offers a {reliable alternative to human translation} for large-scale evaluation of MLMs across a wide set of languages. We use a SOTA translation model to translate test data from 4 tasks to 198 languages and use them to evaluate three MLMs. We show that while the selected subsets of high-resource test languages are generally sufficiently representative of a wider range of high-resource languages, we tend to overestimate MLMs' ability on low-resource languages. Finally, we show that simpler baselines can achieve relatively strong performance without having benefited from large-scale multilingual pretraining.
☆ Intelligent Interface: Enhancing Lecture Engagement with Didactic Activity Summaries
Recently, multiple applications of machine learning have been introduced. They include various possibilities arising when image analysis methods are applied to, broadly understood, video streams. In this context, a novel tool, developed for academic educators to enhance the teaching process by automating, summarizing, and offering prompt feedback on conducting lectures, has been developed. The implemented prototype utilizes machine learning-based techniques to recognise selected didactic and behavioural teachers' features within lecture video recordings. Specifically, users (teachers) can upload their lecture videos, which are preprocessed and analysed using machine learning models. Next, users can view summaries of recognized didactic features through interactive charts and tables. Additionally, stored ML-based prediction results support comparisons between lectures based on their didactic content. In the developed application text-based models trained on lecture transcriptions, with enhancements to the transcription quality, by adopting an automatic speech recognition solution are applied. Furthermore, the system offers flexibility for (future) integration of new/additional machine-learning models and software modules for image and video analysis.
comment: 9 pages, 6 figures
☆ VeriFlow: Modeling Distributions for Neural Network Verification
Formal verification has emerged as a promising method to ensure the safety and reliability of neural networks. Naively verifying a safety property amounts to ensuring the safety of a neural network for the whole input space irrespective of any training or test set. However, this also implies that the safety of the neural network is checked even for inputs that do not occur in the real-world and have no meaning at all, often resulting in spurious errors. To tackle this shortcoming, we propose the VeriFlow architecture as a flow based density model tailored to allow any verification approach to restrict its search to the some data distribution of interest. We argue that our architecture is particularly well suited for this purpose because of two major properties. First, we show that the transformation and log-density function that are defined by our model are piece-wise affine. Therefore, the model allows the usage of verifiers based on SMT with linear arithmetic. Second, upper density level sets (UDL) of the data distribution take the shape of an $L^p$-ball in the latent space. As a consequence, representations of UDLs specified by a given probability are effectively computable in latent space. This allows for SMT and abstract interpretation approaches with fine-grained, probabilistically interpretable, control regarding on how (a)typical the inputs subject to verification are.
☆ CityNav: Language-Goal Aerial Navigation Dataset with Geographic Information
Vision-and-language navigation (VLN) aims to guide autonomous agents through real-world environments by integrating visual and linguistic cues. While substantial progress has been made in understanding these interactive modalities in ground-level navigation, aerial navigation remains largely underexplored. This is primarily due to the scarcity of resources suitable for real-world, city-scale aerial navigation studies. To bridge this gap, we introduce CityNav, a new dataset for language-goal aerial navigation using a 3D point cloud representation from real-world cities. CityNav includes 32,637 natural language descriptions paired with human demonstration trajectories, collected from participants via a new web-based 3D simulator developed for this research. Each description specifies a navigation goal, leveraging the names and locations of landmarks within real-world cities. We also provide baseline models of navigation agents that incorporate an internal 2D spatial map representing landmarks referenced in the descriptions. We benchmark the latest aerial navigation baselines and our proposed model on the CityNav dataset. The results using this dataset reveal the following key findings: (i) Our aerial agent models trained on human demonstration trajectories outperform those trained on shortest path trajectories, highlighting the importance of human-driven navigation strategies; (ii) The integration of a 2D spatial map significantly enhances navigation efficiency at city scale. Our dataset and code are available at https://water-cookie.github.io/city-nav-proj/
comment: The first two authors are equally contributed
☆ Enhancing robustness of data-driven SHM models: adversarial training with circle loss
Structural health monitoring (SHM) is critical to safeguarding the safety and reliability of aerospace, civil, and mechanical infrastructure. Machine learning-based data-driven approaches have gained popularity in SHM due to advancements in sensors and computational power. However, machine learning models used in SHM are vulnerable to adversarial examples -- even small changes in input can lead to different model outputs. This paper aims to address this problem by discussing adversarial defenses in SHM. In this paper, we propose an adversarial training method for defense, which uses circle loss to optimize the distance between features in training to keep examples away from the decision boundary. Through this simple yet effective constraint, our method demonstrates substantial improvements in model robustness, surpassing existing defense mechanisms.
comment: 12 pages, 9 figures
☆ Raising the Bar: Investigating the Values of Large Language Models via Generative Evolving Testing
Warning: this paper contains model outputs exhibiting unethical information. Large Language Models (LLMs) have achieved significant breakthroughs, but their generated unethical content poses potential risks. Measuring value alignment of LLMs becomes crucial for their regulation and responsible deployment. Numerous datasets have been constructed to assess social bias, toxicity, and ethics in LLMs, but they suffer from evaluation chronoeffect, that is, as models rapidly evolve, existing data becomes leaked or undemanding, overestimating ever-developing LLMs. To tackle this problem, we propose GETA, a novel generative evolving testing approach that dynamically probes the underlying moral baselines of LLMs. Distinct from previous adaptive testing methods that rely on static datasets with limited difficulty, GETA incorporates an iteratively-updated item generator which infers each LLM's moral boundaries and generates difficulty-tailored testing items, accurately reflecting the true alignment extent. This process theoretically learns a joint distribution of item and model response, with item difficulty and value conformity as latent variables, where the generator co-evolves with the LLM, addressing chronoeffect. We evaluate various popular LLMs with diverse capabilities and demonstrate that GETA can create difficulty-matching testing items and more accurately assess LLMs' values, better consistent with their performance on unseen OOD and i.i.d. items, laying the groundwork for future evaluation paradigms.
comment: Work in progress
☆ EvoAgent: Towards Automatic Multi-Agent Generation via Evolutionary Algorithms
The rise of powerful large language models (LLMs) has spurred a new trend in building LLM-based autonomous agents for solving complex tasks, especially multi-agent systems. Despite the remarkable progress, we notice that existing works are heavily dependent on human-designed frameworks, which greatly limits the functional scope and scalability of agent systems. How to automatically extend the specialized agent to multi-agent systems to improve task-solving capability still remains a significant challenge. In this paper, we introduce EvoAgent, a generic method to automatically extend expert agents to multi-agent systems via the evolutionary algorithm, thereby improving the effectiveness of LLM-based agents in solving tasks. Specifically, we consider the existing agent frameworks as the initial individual and then apply a series of evolutionary operators (e.g., mutation, crossover, selection, etc.) to generate multiple agents with diverse agent settings. EvoAgent can be generalized to any LLM-based agent framework, and can automatically extend the existing agent framework to multi-agent systems without any extra human designs. Experimental results across various tasks have shown that EvoAgent can automatically generate multiple expert agents and significantly enhance the task-solving capabilities of LLM-based agents.
comment: Work in process
☆ Proving Olympiad Algebraic Inequalities without Human Demonstrations
Solving Olympiad-level mathematical problems represents a significant advancement in machine intelligence and automated reasoning. Current machine learning methods, however, struggle to solve Olympiad-level problems beyond Euclidean plane geometry due to a lack of large-scale, high-quality datasets. The challenge is even greater in algebraic systems, which involve infinite reasoning spaces within finite conditions. To address these issues, we propose AIPS, an Algebraic Inequality Proving System capable of autonomously generating complex inequality theorems and effectively solving Olympiad-level inequality problems without requiring human demonstrations. During proof search in a mixed reasoning manner, a value curriculum learning strategy on generated datasets is implemented to improve proving performance, demonstrating strong mathematical intuitions. On a test set of 20 International Mathematical Olympiad-level inequality problems, AIPS successfully solved 10, outperforming state-of-the-art methods. Furthermore, AIPS automatically generated a vast array of non-trivial theorems without human intervention, some of which have been evaluated by professional contestants and deemed to reach the level of the International Mathematical Olympiad. Notably, one theorem was selected as a competition problem in a major city 2024 Mathematical Olympiad.
comment: 36 pages, 32 figures, 2 tables
☆ REVEAL-IT: REinforcement learning with Visibility of Evolving Agent poLicy for InTerpretability
Understanding the agent's learning process, particularly the factors that contribute to its success or failure post-training, is crucial for comprehending the rationale behind the agent's decision-making process. Prior methods clarify the learning process by creating a structural causal model (SCM) or visually representing the distribution of value functions. Nevertheless, these approaches have constraints as they exclusively function in 2D-environments or with uncomplicated transition dynamics. Understanding the agent's learning process in complicated environments or tasks is more challenging. In this paper, we propose REVEAL-IT, a novel framework for explaining the learning process of an agent in complex environments. Initially, we visualize the policy structure and the agent's learning process for various training tasks. By visualizing these findings, we can understand how much a particular training task or stage affects the agent's performance in test. Then, a GNN-based explainer learns to highlight the most important section of the policy, providing a more clear and robust explanation of the agent's learning process. The experiments demonstrate that explanations derived from this framework can effectively help in the optimization of the
☆ SeCoKD: Aligning Large Language Models for In-Context Learning with Fewer Shots
Previous studies have shown that demonstrations can significantly help Large Language Models (LLMs ) perform better on the given tasks. However, this so-called In-Context Learning ( ICL ) ability is very sensitive to the presenting context, and often dozens of demonstrations are needed. In this work, we investigate if we can reduce the shot number while still maintaining a competitive performance. We present SeCoKD, a self-Knowledge Distillation ( KD ) training framework that aligns the student model with a heavily prompted variation, thereby increasing the utilization of a single demonstration. We experiment with the SeCoKD across three LLMs and six benchmarks focusing mainly on reasoning tasks. Results show that our method outperforms the base model and Supervised Fine-tuning ( SFT ), especially in zero-shot and one-shot settings by 30% and 10%, respectively. Moreover, SeCoKD brings little negative artifacts when evaluated on new tasks, which is more robust than Supervised Fine-tuning.
comment: preprint
☆ VLBiasBench: A Comprehensive Benchmark for Evaluating Bias in Large Vision-Language Model
The emergence of Large Vision-Language Models (LVLMs) marks significant strides towards achieving general artificial intelligence. However, these advancements are tempered by the outputs that often reflect biases, a concern not yet extensively investigated. Existing benchmarks are not sufficiently comprehensive in evaluating biases due to their limited data scale, single questioning format and narrow sources of bias. To address this problem, we introduce VLBiasBench, a benchmark aimed at evaluating biases in LVLMs comprehensively. In VLBiasBench, we construct a dataset encompassing nine distinct categories of social biases, including age, disability status, gender, nationality, physical appearance, race, religion, profession, social economic status and two intersectional bias categories (race x gender, and race x social economic status). To create a large-scale dataset, we use Stable Diffusion XL model to generate 46,848 high-quality images, which are combined with different questions to form 128,342 samples. These questions are categorized into open and close ended types, fully considering the sources of bias and comprehensively evaluating the biases of LVLM from multiple perspectives. We subsequently conduct extensive evaluations on 15 open-source models as well as one advanced closed-source model, providing some new insights into the biases revealing from these models. Our benchmark is available at https://github.com/Xiangkui-Cao/VLBiasBench.
☆ Timo: Towards Better Temporal Reasoning for Language Models
Reasoning about time is essential for Large Language Models (LLMs) to understand the world. Previous works focus on solving specific tasks, primarily on time-sensitive question answering. While these methods have proven effective, they cannot generalize to a wider spectrum of temporal reasoning tasks. Therefore, we propose a crucial question: Can we build a universal framework to handle a variety of temporal reasoning tasks? To that end, we systematically study 38 temporal reasoning tasks. Based on the observation that 19 tasks are directly related to mathematics, we first leverage the available mathematical dataset to set a solid foundation for temporal reasoning. However, the in-depth study indicates that focusing solely on mathematical enhancement falls short of addressing pure temporal reasoning tasks. To mitigate this limitation, we propose a simple but effective self-critic temporal optimization method to enhance the model's temporal reasoning capabilities without sacrificing general task abilities. Finally, we develop Timo, a model designed to excel in temporal reasoning at the 7B and 13B scales. Notably, Timo outperforms the counterpart LLMs by 10.0 and 7.6 in average accuracy scores and achieves the new state-of-the-art (SOTA) performance of comparable size. Extensive experiments further validate our framework's effectiveness and its generalization across diverse temporal tasks. The code is available at https://github.com/zhaochen0110/Timo.
comment: Under review
☆ Temporal Knowledge Graph Question Answering: A Survey
Knowledge Base Question Answering (KBQA) has been a long-standing field to answer questions based on knowledge bases. Recently, the evolving dynamics of knowledge have attracted a growing interest in Temporal Knowledge Graph Question Answering (TKGQA), an emerging task to answer temporal questions. However, this field grapples with ambiguities in defining temporal questions and lacks a systematic categorization of existing methods for TKGQA. In response, this paper provides a thorough survey from two perspectives: the taxonomy of temporal questions and the methodological categorization for TKGQA. Specifically, we first establish a detailed taxonomy of temporal questions engaged in prior studies. Subsequently, we provide a comprehensive review of TKGQA techniques of two categories: semantic parsing-based and TKG embedding-based. Building on this review, the paper outlines potential research directions aimed at advancing the field of TKGQA. This work aims to serve as a comprehensive reference for TKGQA and to stimulate further research.
comment: 8 pages, 3 figures
☆ Failure-Resilient Distributed Inference with Model Compression over Heterogeneous Edge Devices
The distributed inference paradigm enables the computation workload to be distributed across multiple devices, facilitating the implementations of deep learning based intelligent services on extremely resource-constrained Internet of Things (IoT) scenarios. Yet it raises great challenges to perform complicated inference tasks relying on a cluster of IoT devices that are heterogeneous in their computing/communication capacity and prone to crash or timeout failures. In this paper, we present RoCoIn, a robust cooperative inference mechanism for locally distributed execution of deep neural network-based inference tasks over heterogeneous edge devices. It creates a set of independent and compact student models that are learned from a large model using knowledge distillation for distributed deployment. In particular, the devices are strategically grouped to redundantly deploy and execute the same student model such that the inference process is resilient to any local failures, while a joint knowledge partition and student model assignment scheme are designed to minimize the response latency of the distributed inference system in the presence of devices with diverse capacities. Extensive simulations are conducted to corroborate the superior performance of our RoCoIn for distributed inference compared to several baselines, and the results demonstrate its efficacy in timely inference and failure resiliency.
☆ SimulSeamless: FBK at IWSLT 2024 Simultaneous Speech Translation
This paper describes the FBK's participation in the Simultaneous Translation Evaluation Campaign at IWSLT 2024. For this year's submission in the speech-to-text translation (ST) sub-track, we propose SimulSeamless, which is realized by combining AlignAtt and SeamlessM4T in its medium configuration. The SeamlessM4T model is used "off-the-shelf" and its simultaneous inference is enabled through the adoption of AlignAtt, a SimulST policy based on cross-attention that can be applied without any retraining or adaptation of the underlying model for the simultaneous task. We participated in all the Shared Task languages (English->{German, Japanese, Chinese}, and Czech->English), achieving acceptable or even better results compared to last year's submissions. SimulSeamless, covering more than 143 source languages and 200 target languages, is released at: https://github.com/hlt-mt/FBK-fairseq/.
☆ A Multi-Stream Fusion Approach with One-Class Learning for Audio-Visual Deepfake Detection
This paper addresses the challenge of developing a robust audio-visual deepfake detection model. In practical use cases, new generation algorithms are continually emerging, and these algorithms are not encountered during the development of detection methods. This calls for the generalization ability of the method. Additionally, to ensure the credibility of detection methods, it is beneficial for the model to interpret which cues from the video indicate it is fake. Motivated by these considerations, we then propose a multi-stream fusion approach with one-class learning as a representation-level regularization technique. We study the generalization problem of audio-visual deepfake detection by creating a new benchmark by extending and re-splitting the existing FakeAVCeleb dataset. The benchmark contains four categories of fake video(Real Audio-Fake Visual, Fake Audio-Fake Visual, Fake Audio-Real Visual, and unsynchronized video). The experimental results show that our approach improves the model's detection of unseen attacks by an average of 7.31% across four test sets, compared to the baseline model. Additionally, our proposed framework offers interpretability, indicating which modality the model identifies as fake.
☆ Ranking LLMs by compression
We conceptualize the process of understanding as information compression, and propose a method for ranking large language models (LLMs) based on lossless data compression. We demonstrate the equivalence of compression length under arithmetic coding with cumulative negative log probabilities when using a large language model as a prior, that is, the pre-training phase of the model is essentially the process of learning the optimal coding length. At the same time, the evaluation metric compression ratio can be obtained without actual compression, which greatly saves overhead. In this paper, we use five large language models as priors for compression, then compare their performance on challenging natural language processing tasks, including sentence completion, question answering, and coreference resolution. Experimental results show that compression ratio and model performance are positively correlated, so it can be used as a general metric to evaluate large language models.
comment: 7 pages, 4 tables
☆ A Data-Driven Guided Decoding Mechanism for Diagnostic Captioning ACL
Diagnostic Captioning (DC) automatically generates a diagnostic text from one or more medical images (e.g., X-rays, MRIs) of a patient. Treated as a draft, the generated text may assist clinicians, by providing an initial estimation of the patient's condition, speeding up and helping safeguard the diagnostic process. The accuracy of a diagnostic text, however, strongly depends on how well the key medical conditions depicted in the images are expressed. We propose a new data-driven guided decoding method that incorporates medical information, in the form of existing tags capturing key conditions of the image(s), into the beam search of the diagnostic text generation process. We evaluate the proposed method on two medical datasets using four DC systems that range from generic image-to-text systems with CNN encoders and RNN decoders to pre-trained Large Language Models. The latter can also be used in few- and zero-shot learning scenarios. In most cases, the proposed mechanism improves performance with respect to all evaluation measures. We provide an open-source implementation of the proposed method at https://github.com/nlpaueb/dmmcs.
comment: [Pre-print] ACL Findings 2024, 17 pages, 7 figures, 7 tables
☆ DIRAS: Efficient LLM-Assisted Annotation of Document Relevance in Retrieval Augmented Generation
Retrieval Augmented Generation (RAG) is widely employed to ground responses to queries on domain-specific documents. But do RAG implementations leave out important information or excessively include irrelevant information? To allay these concerns, it is necessary to annotate domain-specific benchmarks to evaluate information retrieval (IR) performance, as relevance definitions vary across queries and domains. Furthermore, such benchmarks should be cost-efficiently annotated to avoid annotation selection bias. In this paper, we propose DIRAS (Domain-specific Information Retrieval Annotation with Scalability), a manual-annotation-free schema that fine-tunes open-sourced LLMs to annotate relevance labels with calibrated relevance probabilities. Extensive evaluation shows that DIRAS fine-tuned models achieve GPT-4-level performance on annotating and ranking unseen (query, document) pairs, and is helpful for real-world RAG development.
☆ Finding Safety Neurons in Large Language Models
Large language models (LLMs) excel in various capabilities but also pose safety risks such as generating harmful content and misinformation, even after safety alignment. In this paper, we explore the inner mechanisms of safety alignment from the perspective of mechanistic interpretability, focusing on identifying and analyzing safety neurons within LLMs that are responsible for safety behaviors. We propose generation-time activation contrasting to locate these neurons and dynamic activation patching to evaluate their causal effects. Experiments on multiple recent LLMs show that: (1) Safety neurons are sparse and effective. We can restore $90$% safety performance with intervention only on about $5$% of all the neurons. (2) Safety neurons encode transferrable mechanisms. They exhibit consistent effectiveness on different red-teaming datasets. The finding of safety neurons also interprets "alignment tax". We observe that the identified key neurons for safety and helpfulness significantly overlap, but they require different activation patterns of the shared neurons. Furthermore, we demonstrate an application of safety neurons in detecting unsafe outputs before generation. Our findings may promote further research on understanding LLM alignment. The source codes will be publicly released to facilitate future research.
☆ Online Learning of Weakly Coupled MDP Policies for Load Balancing and Auto Scaling
Load balancing and auto scaling are at the core of scalable, contemporary systems, addressing dynamic resource allocation and service rate adjustments in response to workload changes. This paper introduces a novel model and algorithms for tuning load balancers coupled with auto scalers, considering bursty traffic arriving at finite queues. We begin by presenting the problem as a weakly coupled Markov Decision Processes (MDP), solvable via a linear program (LP). However, as the number of control variables of such LP grows combinatorially, we introduce a more tractable relaxed LP formulation, and extend it to tackle the problem of online parameter learning and policy optimization using a two-timescale algorithm based on the LP Lagrangian.
☆ Autonomous Robotic Drilling System for Mice Cranial Window Creation
Robotic assistance for experimental manipulation in the life sciences is expected to enable favorable outcomes, regardless of the skill of the scientist. Experimental specimens in the life sciences are subject to individual variability hence require intricate algorithms for successful autonomous robotic control. As a use case, we are studying the creation of cranial windows in mice. This operation requires the removal of an 8-mm-circular patch of the skull, which is approximately 300 um thick, but the shape and thickness of the mouse skull significantly varies depending on the strain of mouse, sex, and age. In this work, we propose an autonomous robotic drilling method with no offline planning, consisting of a trajectory planning block with execution-time feedback with completion level recognition based on image and force information. The force information allows for completion-level resolution to increase 10 fold. We evaluate the proposed method in two ways. First, in an eggshell drilling task and achieved a success rate of 95% and average drilling time of 7.1 min out of 20 trials. Second, in postmortem mice and with a success rate of 70% and average drilling time of 9.3 min out of 20 trials.
comment: 15 pages, 14 figures, to be submitted to IEEE
☆ Enhancing Monotonic Modeling with Spatio-Temporal Adaptive Awareness in Diverse Marketing
In the mobile internet era, the Online Food Ordering Service (OFOS) emerges as an integral component of inclusive finance owing to the convenience it brings to people. OFOS platforms offer dynamic allocation incentives to users and merchants through diverse marketing campaigns to encourage payments while maintaining the platforms' budget efficiency. Despite significant progress, the marketing domain continues to face two primary challenges: (i) how to allocate a limited budget with greater efficiency, demanding precision in predicting users' monotonic response (i.e. sensitivity) to incentives, and (ii) ensuring spatio-temporal adaptability and robustness in diverse marketing campaigns across different times and locations. To address these issues, we propose a Constrained Monotonic Adaptive Network (CoMAN) method for spatio-temporal perception within marketing pricing. Specifically, we capture spatio-temporal preferences within attribute features through two foundational spatio-temporal perception modules. To further enhance catching the user sensitivity differentials to incentives across varied times and locations, we design modules for learning spatio-temporal convexity and concavity as well as for expressing sensitivity functions. CoMAN can achieve a more efficient allocation of incentive investments during pricing, thus increasing the conversion rate and orders while maintaining budget efficiency. Extensive offline and online experimental results within our diverse marketing campaigns demonstrate the effectiveness of the proposed approach while outperforming the monotonic state-of-the-art method.
comment: 7 pages
☆ Measuring Sample Importance in Data Pruning for Training LLMs from a Data Compression Perspective
Compute-efficient training of large language models (LLMs) has become an important research problem. In this work, we consider data pruning as a method of data-efficient training of LLMs, where we take a data compression view on data pruning. We argue that the amount of information of a sample, or the achievable compression on its description length, represents its sample importance. The key idea is that, less informative samples are likely to contain redundant information, and thus should be pruned first. We leverage log-likelihood function of trained models as a surrogate to measure information content of samples. Experiments reveal a surprising insight that information-based pruning can enhance the generalization capability of the model, improves upon language modeling and downstream tasks as compared to the model trained on the entire dataset.
☆ EduQate: Generating Adaptive Curricula through RMABs in Education Settings NeurIPS 2024
There has been significant interest in the development of personalized and adaptive educational tools that cater to a student's individual learning progress. A crucial aspect in developing such tools is in exploring how mastery can be achieved across a diverse yet related range of content in an efficient manner. While Reinforcement Learning and Multi-armed Bandits have shown promise in educational settings, existing works often assume the independence of learning content, neglecting the prevalent interdependencies between such content. In response, we introduce Education Network Restless Multi-armed Bandits (EdNetRMABs), utilizing a network to represent the relationships between interdependent arms. Subsequently, we propose EduQate, a method employing interdependency-aware Q-learning to make informed decisions on arm selection at each time step. We establish the optimality guarantee of EduQate and demonstrate its efficacy compared to baseline policies, using students modeled from both synthetic and real-world data.
comment: 9 pages, NeurIPS 2024 pre-print
☆ EasyECR: A Library for Easy Implementation and Evaluation of Event Coreference Resolution Models
Event Coreference Resolution (ECR) is the task of clustering event mentions that refer to the same real-world event. Despite significant advancements, ECR research faces two main challenges: limited generalizability across domains due to narrow dataset evaluations, and difficulties in comparing models within diverse ECR pipelines. To address these issues, we develop EasyECR, the first open-source library designed to standardize data structures and abstract ECR pipelines for easy implementation and fair evaluation. More specifically, EasyECR integrates seven representative pipelines and ten popular benchmark datasets, enabling model evaluations on various datasets and promoting the development of robust ECR pipelines. By conducting extensive evaluation via our EasyECR, we find that, \lowercase\expandafter{\romannumeral1}) the representative ECR pipelines cannot generalize across multiple datasets, hence evaluating ECR pipelines on multiple datasets is necessary, \lowercase\expandafter{\romannumeral2}) all models in ECR pipelines have a great effect on pipeline performance, therefore, when one model in ECR pipelines are compared, it is essential to ensure that the other models remain consistent. Additionally, reproducing ECR results is not trivial, and the developed library can help reduce this discrepancy. The experimental results provide valuable baselines for future research.
comment: 14 pages, 4 figures, 12 tables
☆ Two-Stage Depth Enhanced Learning with Obstacle Map For Object Navigation
The task that requires an agent to navigate to a given object through only visual observation is called visual object navigation (VON). The main bottlenecks of VON are strategies exploration and prior knowledge exploitation. Traditional strategies exploration ignores the differences of searching and navigating stages, using the same reward in two stages, which reduces navigation performance and training efficiency. Our study enables the agent to explore larger area in searching stage and seek the optimal path in navigating stage, improving the success rate of navigation. Traditional prior knowledge exploitation focused on learning and utilizing object association, which ignored the depth and obstacle information in the environment. This paper uses the RGB and depth information of the training scene to pretrain the feature extractor, which improves navigation efficiency. The obstacle information is memorized by the agent during the navigation, reducing the probability of collision and deadlock. Depth, obstacle and other prior knowledge are concatenated and input into the policy network, and navigation actions are output under the training of two-stage rewards. We evaluated our method on AI2-Thor and RoboTHOR and demonstrated that it significantly outperforms state-of-the-art (SOTA) methods on success rate and navigation efficiency.
☆ Enhancing the LLM-Based Robot Manipulation Through Human-Robot Collaboration
Large Language Models (LLMs) are gaining popularity in the field of robotics. However, LLM-based robots are limited to simple, repetitive motions due to the poor integration between language models, robots, and the environment. This paper proposes a novel approach to enhance the performance of LLM-based autonomous manipulation through Human-Robot Collaboration (HRC). The approach involves using a prompted GPT-4 language model to decompose high-level language commands into sequences of motions that can be executed by the robot. The system also employs a YOLO-based perception algorithm, providing visual cues to the LLM, which aids in planning feasible motions within the specific environment. Additionally, an HRC method is proposed by combining teleoperation and Dynamic Movement Primitives (DMP), allowing the LLM-based robot to learn from human guidance. Real-world experiments have been conducted using the Toyota Human Support Robot for manipulation tasks. The outcomes indicate that tasks requiring complex trajectory planning and reasoning over environments can be efficiently accomplished through the incorporation of human demonstrations.
comment: IEEE Robotics and Automation Letters
☆ Graph Neural Networks for Job Shop Scheduling Problems: A Survey
Job shop scheduling problems (JSSPs) represent a critical and challenging class of combinatorial optimization problems. Recent years have witnessed a rapid increase in the application of graph neural networks (GNNs) to solve JSSPs, albeit lacking a systematic survey of the relevant literature. This paper aims to thoroughly review prevailing GNN methods for different types of JSSPs and the closely related flow-shop scheduling problems (FSPs), especially those leveraging deep reinforcement learning (DRL). We begin by presenting the graph representations of various JSSPs, followed by an introduction to the most commonly used GNN architectures. We then review current GNN-based methods for each problem type, highlighting key technical elements such as graph representations, GNN architectures, GNN tasks, and training algorithms. Finally, we summarize and analyze the advantages and limitations of GNNs in solving JSSPs and provide potential future research opportunities. We hope this survey can motivate and inspire innovative approaches for more powerful GNN-based approaches in tackling JSSPs and other scheduling problems.
☆ Memory-Efficient Gradient Unrolling for Large-Scale Bi-level Optimization
Bi-level optimization (BO) has become a fundamental mathematical framework for addressing hierarchical machine learning problems. As deep learning models continue to grow in size, the demand for scalable bi-level optimization solutions has become increasingly critical. Traditional gradient-based bi-level optimization algorithms, due to their inherent characteristics, are ill-suited to meet the demands of large-scale applications. In this paper, we introduce $\textbf{F}$orward $\textbf{G}$radient $\textbf{U}$nrolling with $\textbf{F}$orward $\textbf{F}$radient, abbreviated as $(\textbf{FG})^2\textbf{U}$, which achieves an unbiased stochastic approximation of the meta gradient for bi-level optimization. $(\text{FG})^2\text{U}$ circumvents the memory and approximation issues associated with classical bi-level optimization approaches, and delivers significantly more accurate gradient estimates than existing large-scale bi-level optimization approaches. Additionally, $(\text{FG})^2\text{U}$ is inherently designed to support parallel computing, enabling it to effectively leverage large-scale distributed computing systems to achieve significant computational efficiency. In practice, $(\text{FG})^2\text{U}$ and other methods can be strategically placed at different stages of the training process to achieve a more cost-effective two-phase paradigm. Further, $(\text{FG})^2\text{U}$ is easy to implement within popular deep learning frameworks, and can be conveniently adapted to address more challenging zeroth-order bi-level optimization scenarios. We provide a thorough convergence analysis and a comprehensive practical discussion for $(\text{FG})^2\text{U}$, complemented by extensive empirical evaluations, showcasing its superior performance in diverse large-scale bi-level optimization tasks.
☆ Personalized Music Recommendation with a Heterogeneity-aware Deep Bayesian Network
Music recommender systems are crucial in music streaming platforms, providing users with music they would enjoy. Recent studies have shown that user emotions can affect users' music mood preferences. However, existing emotion-aware music recommender systems (EMRSs) explicitly or implicitly assume that users' actual emotional states expressed by an identical emotion word are homogeneous. They also assume that users' music mood preferences are homogeneous under an identical emotional state. In this article, we propose four types of heterogeneity that an EMRS should consider: emotion heterogeneity across users, emotion heterogeneity within a user, music mood preference heterogeneity across users, and music mood preference heterogeneity within a user. We further propose a Heterogeneity-aware Deep Bayesian Network (HDBN) to model these assumptions. The HDBN mimics a user's decision process to choose music with four components: personalized prior user emotion distribution modeling, posterior user emotion distribution modeling, user grouping, and Bayesian neural network-based music mood preference prediction. We constructed a large-scale dataset called EmoMusicLJ to validate our method. Extensive experiments demonstrate that our method significantly outperforms baseline approaches on widely used HR and NDCG recommendation metrics. Ablation experiments and case studies further validate the effectiveness of our HDBN. The source code is available at https://github.com/jingrk/HDBN.
comment: 34 pages, 19 figures
☆ ReaLHF: Optimized RLHF Training for Large Language Models through Parameter Reallocation
Reinforcement Learning from Human Feedback (RLHF) stands as a pivotal technique in empowering large language model (LLM) applications. Since RLHF involves diverse computational workloads and intricate dependencies among multiple LLMs, directly adopting parallelization techniques from supervised training can result in sub-optimal performance. To overcome this limitation, we propose a novel approach named parameter ReaLlocation, which dynamically redistributes LLM parameters in the cluster and adapts parallelization strategies during training. Building upon this idea, we introduce ReaLHF, a pioneering system capable of automatically discovering and running efficient execution plans for RLHF training given the desired algorithmic and hardware configurations. ReaLHF formulates the execution plan for RLHF as an augmented dataflow graph. Based on this formulation, ReaLHF employs a tailored search algorithm with a lightweight cost estimator to discover an efficient execution plan. Subsequently, the runtime engine deploys the selected plan by effectively parallelizing computations and redistributing parameters. We evaluate ReaLHF on the LLaMA-2 models with up to $4\times70$ billion parameters and 128 GPUs. The experiment results showcase ReaLHF's substantial speedups of $2.0-10.6\times$ compared to baselines. Furthermore, the execution plans generated by ReaLHF exhibit an average of $26\%$ performance improvement over heuristic approaches based on Megatron-LM. The source code of ReaLHF is publicly available at https://github.com/openpsi-project/ReaLHF .
comment: 13 pages (15 pages with references), 13 figures
☆ Semi Supervised Heterogeneous Domain Adaptation via Disentanglement and Pseudo-Labelling
Semi-supervised domain adaptation methods leverage information from a source labelled domain with the goal of generalizing over a scarcely labelled target domain. While this setting already poses challenges due to potential distribution shifts between domains, an even more complex scenario arises when source and target data differs in modality representation (e.g. they are acquired by sensors with different characteristics). For instance, in remote sensing, images may be collected via various acquisition modes (e.g. optical or radar), different spectral characteristics (e.g. RGB or multi-spectral) and spatial resolutions. Such a setting is denoted as Semi-Supervised Heterogeneous Domain Adaptation (SSHDA) and it exhibits an even more severe distribution shift due to modality heterogeneity across domains.To cope with the challenging SSHDA setting, here we introduce SHeDD (Semi-supervised Heterogeneous Domain Adaptation via Disentanglement) an end-to-end neural framework tailored to learning a target domain classifier by leveraging both labelled and unlabelled data from heterogeneous data sources. SHeDD is designed to effectively disentangle domain-invariant representations, relevant for the downstream task, from domain-specific information, that can hinder the cross-modality transfer. Additionally, SHeDD adopts an augmentation-based consistency regularization mechanism that takes advantages of reliable pseudo-labels on the unlabelled target samples to further boost its generalization ability on the target domain. Empirical evaluations on two remote sensing benchmarks, encompassing heterogeneous data in terms of acquisition modes and spectral/spatial resolutions, demonstrate the quality of SHeDD compared to both baseline and state-of-the-art competing approaches. Our code is publicly available here: https://github.com/tanodino/SSHDA/
☆ Seg-LSTM: Performance of xLSTM for Semantic Segmentation of Remotely Sensed Images
Recent advancements in autoregressive networks with linear complexity have driven significant research progress, demonstrating exceptional performance in large language models. A representative model is the Extended Long Short-Term Memory (xLSTM), which incorporates gating mechanisms and memory structures, performing comparably to Transformer architectures in long-sequence language tasks. Autoregressive networks such as xLSTM can utilize image serialization to extend their application to visual tasks such as classification and segmentation. Although existing studies have demonstrated Vision-LSTM's impressive results in image classification, its performance in image semantic segmentation remains unverified. Our study represents the first attempt to evaluate the effectiveness of Vision-LSTM in the semantic segmentation of remotely sensed images. This evaluation is based on a specifically designed encoder-decoder architecture named Seg-LSTM, and comparisons with state-of-the-art segmentation networks. Our study found that Vision-LSTM's performance in semantic segmentation was limited and generally inferior to Vision-Transformers-based and Vision-Mamba-based models in most comparative tests. Future research directions for enhancing Vision-LSTM are recommended. The source code is available from https://github.com/zhuqinfeng1999/Seg-LSTM.
☆ Teaching Models To Survive: Proper Scoring Rule and Stochastic Optimization with Competing Risks
When data are right-censored, i.e. some outcomes are missing due to a limited period of observation, survival analysis can compute the "time to event". Multiple classes of outcomes lead to a classification variant: predicting the most likely event, known as competing risks, which has been less studied. To build a loss that estimates outcome probabilities for such settings, we introduce a strictly proper censoring-adjusted separable scoring rule that can be optimized on a subpart of the data because the evaluation is made independently of observations. It enables stochastic optimization for competing risks which we use to train gradient boosting trees. Compared to 11 state-of-the-art models, this model, MultiIncidence, performs best in estimating the probability of outcomes in survival and competing risks. It can predict at any time horizon and is much faster than existing alternatives.
☆ Exploring Layerwise Adversarial Robustness Through the Lens of t-SNE
Adversarial examples, designed to trick Artificial Neural Networks (ANNs) into producing wrong outputs, highlight vulnerabilities in these models. Exploring these weaknesses is crucial for developing defenses, and so, we propose a method to assess the adversarial robustness of image-classifying ANNs. The t-distributed Stochastic Neighbor Embedding (t-SNE) technique is used for visual inspection, and a metric, which compares the clean and perturbed embeddings, helps pinpoint weak spots in the layers. Analyzing two ANNs on CIFAR-10, one designed by humans and another via NeuroEvolution, we found that differences between clean and perturbed representations emerge early on, in the feature extraction layers, affecting subsequent classification. The findings with our metric are supported by the visual analysis of the t-SNE maps.
☆ Optimizing Speculative Decoding for Serving Large Language Models Using Goodput
Reducing the inference latency of large language models (LLMs) is crucial, and speculative decoding (SD) stands out as one of the most effective techniques. Rather than letting the LLM generate all tokens directly, speculative decoding employs effective proxies to predict potential outputs, which are then verified by the LLM without compromising the generation quality. Yet, deploying SD in real online LLM serving systems (with continuous batching) does not always yield improvement -- under higher request rates or low speculation accuracy, it paradoxically increases latency. Furthermore, there is no best speculation length work for all workloads under different system loads. Based on the observations, we develop a dynamic framework SmartSpec. SmartSpec dynamically determines the best speculation length for each request (from 0, i.e., no speculation, to many tokens) -- hence the associated speculative execution costs -- based on a new metric called goodput, which characterizes the current observed load of the entire system and the speculation accuracy. We show that SmartSpec consistently reduces average request latency by up to 3.2x compared to non-speculative decoding baselines across different sizes of target models, draft models, request rates, and datasets. Moreover, SmartSpec can be applied to different styles of speculative decoding, including traditional, model-based approaches as well as model-free methods like prompt lookup and tree-style decoding.
☆ How Many Parameters Does it Take to Change a Light Bulb? Evaluating Performance in Self-Play of Conversational Games as a Function of Model Characteristics
What makes a good Large Language Model (LLM)? That it performs well on the relevant benchmarks -- which hopefully measure, with some validity, the presence of capabilities that are also challenged in real application. But what makes the model perform well? What gives a model its abilities? We take a recently introduced type of benchmark that is meant to challenge capabilities in a goal-directed, agentive context through self-play of conversational games, and analyse how performance develops as a function of model characteristics like number of parameters, or type of training. We find that while there is a clear relationship between number of parameters and performance, there is still a wide spread of performance points within a given size bracket, which is to be accounted for by training parameters such as fine-tuning data quality and method. From a more practical angle, we also find a certain degree of unpredictability about performance across access methods, possible due to unexposed sampling parameters, and a, very welcome, performance stability against at least moderate weight quantisation during inference.
comment: under review
☆ Understanding Different Design Choices in Training Large Time Series Models
Inspired by Large Language Models (LLMs), Time Series Forecasting (TSF), a long-standing task in time series analysis, is undergoing a transition towards Large Time Series Models (LTSMs), aiming to train universal transformer-based models for TSF. However, training LTSMs on heterogeneous time series data poses unique challenges, including diverse frequencies, dimensions, and patterns across datasets. Recent endeavors have studied and evaluated various design choices aimed at enhancing LTSM training and generalization capabilities, spanning pre-processing techniques, model configurations, and dataset configurations. In this work, we comprehensively analyze these design choices and aim to identify the best practices for training LTSM. Moreover, we propose \emph{time series prompt}, a novel statistical prompting strategy tailored to time series data. Furthermore, based on the observations in our analysis, we introduce \texttt{LTSM-bundle}, which bundles the best design choices we have identified. Empirical results demonstrate that \texttt{LTSM-bundle} achieves superior zero-shot and few-shot performances compared to state-of-the-art LSTMs and traditional TSF methods on benchmark datasets.
☆ CryptoGPT: a 7B model rivaling GPT-4 in the task of analyzing and classifying real-time financial news
CryptoGPT: a 7B model competing with GPT-4 in a specific task -- The Impact of Automatic Annotation and Strategic Fine-Tuning via QLoRAIn this article, we present a method aimed at refining a dedicated LLM of reasonable quality with limited resources in an industrial setting via CryptoGPT. It is an LLM designed for financial news analysis for the cryptocurrency market in real-time. This project was launched in an industrial context. This model allows not only for the classification of financial information but also for providing comprehensive analysis. We refined different LLMs of the same size such as Mistral-7B and LLama-7B using semi-automatic annotation and compared them with various LLMs such as GPT-3.5 and GPT-4. Our goal is to find a balance among several needs: 1. Protecting data (by avoiding their transfer to external servers), 2. Limiting annotation cost and time, 3. Controlling the model's size (to manage deployment costs), and 4. Maintaining better analysis quality.
comment: Journ{\'e}e Nationale sur la Fouille de Textes, Pascal CUXAC; Adrien GUILLE; C{\'e}dric LOPEZ, Jun 2024, Lyon (Universit{\'e} Lumi{\`e}re Lyon 2), France
☆ Resource-efficient Medical Image Analysis with Self-adapting Forward-Forward Networks
We introduce a fast Self-adapting Forward-Forward Network (SaFF-Net) for medical imaging analysis, mitigating power consumption and resource limitations, which currently primarily stem from the prevalent reliance on back-propagation for model training and fine-tuning. Building upon the recently proposed Forward-Forward Algorithm (FFA), we introduce the Convolutional Forward-Forward Algorithm (CFFA), a parameter-efficient reformulation that is suitable for advanced image analysis and overcomes the speed and generalisation constraints of the original FFA. To address hyper-parameter sensitivity of FFAs we are also introducing a self-adapting framework SaFF-Net fine-tuning parameters during warmup and training in parallel. Our approach enables more effective model training and eliminates the previously essential requirement for an arbitrarily chosen Goodness function in FFA. We evaluate our approach on several benchmarking datasets in comparison with standard Back-Propagation (BP) neural networks showing that FFA-based networks with notably fewer parameters and function evaluations can compete with standard models, especially, in one-shot scenarios and large batch sizes. The code will be available at the time of the conference.
comment: Under Review
☆ Toward Infinite-Long Prefix in Transformer
Prompting and contextual-based fine-tuning methods, which we call Prefix Learning, have been proposed to enhance the performance of language models on various downstream tasks that can match full parameter fine-tuning. There remains a limited theoretical understanding of how these methods work. In this paper, we aim to relieve this limitation by studying the learning ability of Prefix Learning from the perspective of prefix length. In particular, we approximate the infinite-long Prefix Learning optimization process by the Neural Tangent Kernel (NTK) technique. We formulate and solve it as a learning problem of the infinite-long prefix in a one-layer attention network. Our results confirm the over-parameterization property and arbitrary small loss convergence guarantee of the infinite-long Prefix Learning in attention. To the implementation end, we propose our NTK-Attention method, which is "equivalent" to attention computation with arbitrary prefix length efficiently. Its time complexity mainly depends on the sub-quadratic of input length (without prefix), and our method only requires $d^2 + d$ extra parameters for representation, where $d$ is the feature dimension. In addition, we conducted experiments that compare our NTK-Attention with full parameters fine-tuning, LoRA, and P-Tuning V2 methods across vision or natural language datasets. The results indicate our approach may be a promising parameter-efficient-fine-tuning method since it has demonstrated superior performance in numerous scenarios. Our code can be found at \url{https://github.com/ChristianYang37/chiwun/tree/main/src/NTK-Attention}.
☆ Two Giraffes in a Dirt Field: Using Game Play to Investigate Situation Modelling in Large Multimodal Models
While the situation has improved for text-only models, it again seems to be the case currently that multimodal (text and image) models develop faster than ways to evaluate them. In this paper, we bring a recently developed evaluation paradigm from text models to multimodal models, namely evaluation through the goal-oriented game (self) play, complementing reference-based and preference-based evaluation. Specifically, we define games that challenge a model's capability to represent a situation from visual information and align such representations through dialogue. We find that the largest closed models perform rather well on the games that we define, while even the best open-weight models struggle with them. On further analysis, we find that the exceptional deep captioning capabilities of the largest models drive some of the performance. There is still room to grow for both kinds of models, ensuring the continued relevance of the benchmark.
comment: under review
☆ How to design a dataset compliant with an ML-based system ODD?
This paper focuses on a Vision-based Landing task and presents the design and the validation of a dataset that would comply with the Operational Design Domain (ODD) of a Machine-Learning (ML) system. Relying on emerging certification standards, we describe the process for establishing ODDs at both the system and image levels. In the process, we present the translation of high-level system constraints into actionable image-level properties, allowing for the definition of verifiable Data Quality Requirements (DQRs). To illustrate this approach, we use the Landing Approach Runway Detection (LARD) dataset which combines synthetic imagery and real footage, and we focus on the steps required to verify the DQRs. The replicable framework presented in this paper addresses the challenges of designing a dataset compliant with the stringent needs of ML-based systems certification in safety-critical applications.
comment: 12th European Congress on Embedded Real Time Software and Systems, Jun 2024, Toulouse, France. arXiv admin note: text overlap with arXiv:2304.09938
☆ Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective
As Large Language Models (LLMs) become an important way of information seeking, there have been increasing concerns about the unethical content LLMs may generate. In this paper, we conduct a rigorous evaluation of LLMs' implicit bias towards certain groups by attacking them with carefully crafted instructions to elicit biased responses. Our attack methodology is inspired by psychometric principles in cognitive and social psychology. We propose three attack approaches, i.e., Disguise, Deception, and Teaching, based on which we built evaluation datasets for four common bias types. Each prompt attack has bilingual versions. Extensive evaluation of representative LLMs shows that 1) all three attack methods work effectively, especially the Deception attacks; 2) GLM-3 performs the best in defending our attacks, compared to GPT-3.5 and GPT-4; 3) LLMs could output content of other bias types when being taught with one type of bias. Our methodology provides a rigorous and effective way of evaluating LLMs' implicit bias and will benefit the assessments of LLMs' potential ethical risks.
comment: Code and datasets are available at https://github.com/wen112358/ImplicitBiasPsychometricEvaluation
☆ Leveraging eBPF and AI for Ransomware Nose Out
In this work, we propose a two-phased approach for real-time detection and deterrence of ransomware. To achieve this, we leverage the capabilities of eBPF (Extended Berkeley Packet Filter) and artificial intelligence to develop both proactive and reactive methods. In the first phase, we utilize signature based detection, where we employ custom eBPF programs to trace the execution of new processes and perform hash-based analysis against a known ransomware dataset. In the second, we employ a behavior-based technique that focuses on monitoring the process activities using a custom eBPF program and the creation of ransom notes, a prominent indicator of ransomware activity through the use of Natural Language Processing (NLP). By leveraging low-level tracing capabilities of eBPF and integrating NLP based machine learning algorithms, our solution achieves an impressive 99.76% accuracy in identifying ransomware incidents within a few seconds on the onset of zero-day attacks.
comment: 7 pages
☆ Feature Fusion Based on Mutual-Cross-Attention Mechanism for EEG Emotion Recognition MICCAI 2024
An objective and accurate emotion diagnostic reference is vital to psychologists, especially when dealing with patients who are difficult to communicate with for pathological reasons. Nevertheless, current systems based on Electroencephalography (EEG) data utilized for sentiment discrimination have some problems, including excessive model complexity, mediocre accuracy, and limited interpretability. Consequently, we propose a novel and effective feature fusion mechanism named Mutual-Cross-Attention (MCA). Combining with a specially customized 3D Convolutional Neural Network (3D-CNN), this purely mathematical mechanism adeptly discovers the complementary relationship between time-domain and frequency-domain features in EEG data. Furthermore, the new designed Channel-PSD-DE 3D feature also contributes to the high performance. The proposed method eventually achieves 99.49% (valence) and 99.30% (arousal) accuracy on DEAP dataset.
comment: The work has been accepted by MICCAI 2024. The uploaded one is preprint which has not undergone peer review (when applicable) or any post-submission improvements or corrections. The official DOI link will be provided once available
☆ Seeing Through AI's Lens: Enhancing Human Skepticism Towards LLM-Generated Fake News
LLMs offer valuable capabilities, yet they can be utilized by malicious users to disseminate deceptive information and generate fake news. The growing prevalence of LLMs poses difficulties in crafting detection approaches that remain effective across various text domains. Additionally, the absence of precautionary measures for AI-generated news on online social platforms is concerning. Therefore, there is an urgent need to improve people's ability to differentiate between news articles written by humans and those produced by LLMs. By providing cues in human-written and LLM-generated news, we can help individuals increase their skepticism towards fake LLM-generated news. This paper aims to elucidate simple markers that help individuals distinguish between articles penned by humans and those created by LLMs. To achieve this, we initially collected a dataset comprising 39k news articles authored by humans or generated by four distinct LLMs with varying degrees of fake. We then devise a metric named Entropy-Shift Authorship Signature (ESAS) based on the information theory and entropy principles. The proposed ESAS ranks terms or entities, like POS tagging, within news articles based on their relevance in discerning article authorship. We demonstrate the effectiveness of our metric by showing the high accuracy attained by a basic method, i.e., TF-IDF combined with logistic regression classifier, using a small set of terms with the highest ESAS score. Consequently, we introduce and scrutinize these top ESAS-ranked terms to aid individuals in strengthening their skepticism towards LLM-generated fake news.
☆ Information Guided Regularization for Fine-tuning Language Models
The pretraining-fine-tuning paradigm has been the de facto strategy for transfer learning in modern language modeling. With the understanding that task adaptation in LMs is often a function of parameters shared across tasks, we argue that a more surgical approach to regularization needs to exist for smoother transfer learning. Towards this end, we investigate how the pretraining loss landscape is affected by these task-sensitive parameters through an information-theoretic lens. We then leverage the findings from our investigations to devise a novel approach to dropout for improved model regularization and better downstream generalization. This approach, named guided dropout, is both task & architecture agnostic and adds no computational overhead to the fine-tuning process. Through empirical evaluations, we showcase that our approach to regularization yields consistently better performance, even in scenarios of data paucity, compared to standardized baselines.
☆ ACR: A Benchmark for Automatic Cohort Retrieval
Identifying patient cohorts is fundamental to numerous healthcare tasks, including clinical trial recruitment and retrospective studies. Current cohort retrieval methods in healthcare organizations rely on automated queries of structured data combined with manual curation, which are time-consuming, labor-intensive, and often yield low-quality results. Recent advancements in large language models (LLMs) and information retrieval (IR) offer promising avenues to revolutionize these systems. Major challenges include managing extensive eligibility criteria and handling the longitudinal nature of unstructured Electronic Medical Records (EMRs) while ensuring that the solution remains cost-effective for real-world application. This paper introduces a new task, Automatic Cohort Retrieval (ACR), and evaluates the performance of LLMs and commercial, domain-specific neuro-symbolic approaches. We provide a benchmark task, a query dataset, an EMR dataset, and an evaluation framework. Our findings underscore the necessity for efficient, high-quality ACR systems capable of longitudinal reasoning across extensive patient databases.
☆ Learning to Select Goals in Automated Planning with Deep-Q Learning
In this work we propose a planning and acting architecture endowed with a module which learns to select subgoals with Deep Q-Learning. This allows us to decrease the load of a planner when faced with scenarios with real-time restrictions. We have trained this architecture on a video game environment used as a standard test-bed for intelligent systems applications, testing it on different levels of the same game to evaluate its generalization abilities. We have measured the performance of our approach as more training data is made available, as well as compared it with both a state-of-the-art, classical planner and the standard Deep Q-Learning algorithm. The results obtained show our model performs better than the alternative methods considered, when both plan quality (plan length) and time requirements are taken into account. On the one hand, it is more sample-efficient than standard Deep Q-Learning, and it is able to generalize better across levels. On the other hand, it reduces problem-solving time when compared with a state-of-the-art automated planner, at the expense of obtaining plans with only 9% more actions.
comment: 25 pages, 4 figures
☆ How critically can an AI think? A framework for evaluating the quality of thinking of generative artificial intelligence
Generative AI such as those with large language models have created opportunities for innovative assessment design practices. Due to recent technological developments, there is a need to know the limits and capabilities of generative AI in terms of simulating cognitive skills. Assessing student critical thinking skills has been a feature of assessment for time immemorial, but the demands of digital assessment create unique challenges for equity, academic integrity and assessment authorship. Educators need a framework for determining their assessments vulnerability to generative AI to inform assessment design practices. This paper presents a framework that explores the capabilities of the LLM ChatGPT4 application, which is the current industry benchmark. This paper presents the Mapping of questions, AI vulnerability testing, Grading, Evaluation (MAGE) framework to methodically critique their assessments within their own disciplinary contexts. This critique will provide specific and targeted indications of their questions vulnerabilities in terms of the critical thinking skills. This can go on to form the basis of assessment design for their tasks.
☆ ChatGPT as Research Scientist: Probing GPT's Capabilities as a Research Librarian, Research Ethicist, Data Generator and Data Predictor
How good a research scientist is ChatGPT? We systematically probed the capabilities of GPT-3.5 and GPT-4 across four central components of the scientific process: as a Research Librarian, Research Ethicist, Data Generator, and Novel Data Predictor, using psychological science as a testing field. In Study 1 (Research Librarian), unlike human researchers, GPT-3.5 and GPT-4 hallucinated, authoritatively generating fictional references 36.0% and 5.4% of the time, respectively, although GPT-4 exhibited an evolving capacity to acknowledge its fictions. In Study 2 (Research Ethicist), GPT-4 (though not GPT-3.5) proved capable of detecting violations like p-hacking in fictional research protocols, correcting 88.6% of blatantly presented issues, and 72.6% of subtly presented issues. In Study 3 (Data Generator), both models consistently replicated patterns of cultural bias previously discovered in large language corpora, indicating that ChatGPT can simulate known results, an antecedent to usefulness for both data generation and skills like hypothesis generation. Contrastingly, in Study 4 (Novel Data Predictor), neither model was successful at predicting new results absent in their training data, and neither appeared to leverage substantially new information when predicting more versus less novel outcomes. Together, these results suggest that GPT is a flawed but rapidly improving librarian, a decent research ethicist already, capable of data generation in simple domains with known characteristics but poor at predicting novel patterns of empirical data to aid future experimentation.
comment: Main article is 14 pages, 1 table. Includes SI Appendix: 26 pages, 12 tables, 2 figures. Total: 40 pages, 13 tables, 2 figures. Under revised review at PNAS
☆ RE-AdaptIR: Improving Information Retrieval through Reverse Engineered Adaptation
Large language models (LLMs) fine-tuned for text-retrieval have demonstrated state-of-the-art results across several information retrieval (IR) benchmarks. However, supervised training for improving these models requires numerous labeled examples, which are generally unavailable or expensive to acquire. In this work, we explore the effectiveness of extending reverse engineered adaptation to the context of information retrieval (RE-AdaptIR). We use RE-AdaptIR to improve LLM-based IR models using only unlabeled data. We demonstrate improved performance both in training domains as well as zero-shot in domains where the models have seen no queries. We analyze performance changes in various fine-tuning scenarios and offer findings of immediate use to practitioners.
☆ A Learn-Then-Reason Model Towards Generalization in Knowledge Base Question Answering
Large-scale knowledge bases (KBs) like Freebase and Wikidata house millions of structured knowledge. Knowledge Base Question Answering (KBQA) provides a user-friendly way to access these valuable KBs via asking natural language questions. In order to improve the generalization capabilities of KBQA models, extensive research has embraced a retrieve-then-reason framework to retrieve relevant evidence for logical expression generation. These multi-stage efforts prioritize acquiring external sources but overlook the incorporation of new knowledge into their model parameters. In effect, even advanced language models and retrievers have knowledge boundaries, thereby limiting the generalization capabilities of previous KBQA models. Therefore, this paper develops KBLLaMA, which follows a learn-then-reason framework to inject new KB knowledge into a large language model for flexible end-to-end KBQA. At the core of KBLLaMA, we study (1) how to organize new knowledge about KBQA and (2) how to facilitate the learning of the organized knowledge. Extensive experiments on various KBQA generalization tasks showcase the state-of-the-art performance of KBLLaMA. Especially on the general benchmark GrailQA and domain-specific benchmark Bio-chemical, KBLLaMA respectively derives a performance gain of up to 3.8% and 9.8% compared to the baselines.
☆ Diffusion-Based Failure Sampling for Cyber-Physical Systems
Validating safety-critical autonomous systems in high-dimensional domains such as robotics presents a significant challenge. Existing black-box approaches based on Markov chain Monte Carlo may require an enormous number of samples, while methods based on importance sampling often rely on simple parametric families that may struggle to represent the distribution over failures. We propose to sample the distribution over failures using a conditional denoising diffusion model, which has shown success in complex high-dimensional problems such as robotic task planning. We iteratively train a diffusion model to produce state trajectories closer to failure. We demonstrate the effectiveness of our approach on high-dimensional robotic validation tasks, improving sample efficiency and mode coverage compared to existing black-box techniques.
comment: Under review at RA-L
☆ An LLM Feature-based Framework for Dialogue Constructiveness Assessment
Research on dialogue constructiveness assessment focuses on (i) analysing conversational factors that influence individuals to take specific actions, win debates, change their perspectives or broaden their open-mindedness and (ii) predicting constructive outcomes following dialogues for such use cases. These objectives can be achieved by training either interpretable feature-based models (which often involve costly human annotations) or neural models such as pre-trained language models (which have empirically shown higher task accuracy but lack interpretability). We propose a novel LLM feature-based framework that combines the strengths of feature-based and neural approaches while mitigating their downsides, in assessing dialogue constructiveness. The framework first defines a set of dataset-independent and interpretable linguistic features, which can be extracted by both prompting an LLM and simple heuristics. Such features are then used to train LLM feature-based models. We apply this framework to three datasets of dialogue constructiveness and find that our LLM feature-based models significantly outperform standard feature-based models and neural models, and tend to learn more robust prediction rules instead of relying on superficial shortcuts (as seen with neural models). Further, we demonstrate that interpreting these LLM feature-based models can yield valuable insights into what makes a dialogue constructive.
☆ Compliance Cards: Computational Artifacts for Automated AI Regulation Compliance
As the artificial intelligence (AI) supply chain grows more complex, AI systems and models are increasingly likely to incorporate externally-sourced ingredients such as datasets and other models. In such cases, determining whether or not an AI system or model complies with the EU AI Act will require gathering compliance-related metadata about both the AI system or model at-large as well as those externally-supplied ingredients. There must then be an analysis that looks across all of this metadata to render a prediction about the compliance of the overall AI system or model. Up until now, this process has not been automated. Thus, it has not been possible to make real-time compliance determinations in scenarios where doing so would be advantageous, such as the iterative workflows of today's AI developers, search and acquisition of AI ingredients on communities like Hugging Face, federated and continuous learning, and more. To address this shortcoming, we introduce a highly automated system for AI Act compliance analysis. This system has two key elements. First is an interlocking set of computational artifacts that capture compliance-related metadata about both: (1) the AI system or model at-large; (2) any constituent ingredients such as datasets and models. Second is an automated analysis algorithm that operates across those computational artifacts to render a run-time prediction about whether or not the overall AI system or model complies with the AI Act. Working together, these elements promise to enhance and accelerate AI Act compliance assessments.
☆ A Large Language Model Outperforms Other Computational Approaches to the High-Throughput Phenotyping of Physician Notes
High-throughput phenotyping, the automated mapping of patient signs and symptoms to standardized ontology concepts, is essential to gaining value from electronic health records (EHR) in the support of precision medicine. Despite technological advances, high-throughput phenotyping remains a challenge. This study compares three computational approaches to high-throughput phenotyping: a Large Language Model (LLM) incorporating generative AI, a Natural Language Processing (NLP) approach utilizing deep learning for span categorization, and a hybrid approach combining word vectors with machine learning. The approach that implemented GPT-4 (a Large Language Model) demonstrated superior performance, suggesting that Large Language Models are poised to be the preferred method for high-throughput phenotyping of physician notes.
comment: Submitted to AMIA Annual Symposium 2024, San Francisco CA
☆ SciDMT: A Large-Scale Corpus for Detecting Scientific Mentions LREC
We present SciDMT, an enhanced and expanded corpus for scientific mention detection, offering a significant advancement over existing related resources. SciDMT contains annotated scientific documents for datasets (D), methods (M), and tasks (T). The corpus consists of two components: 1) the SciDMT main corpus, which includes 48 thousand scientific articles with over 1.8 million weakly annotated mention annotations in the format of in-text span, and 2) an evaluation set, which comprises 100 scientific articles manually annotated for evaluation purposes. To the best of our knowledge, SciDMT is the largest corpus for scientific entity mention detection. The corpus's scale and diversity are instrumental in developing and refining models for tasks such as indexing scientific papers, enhancing information retrieval, and improving the accessibility of scientific knowledge. We demonstrate the corpus's utility through experiments with advanced deep learning architectures like SciBERT and GPT-3.5. Our findings establish performance baselines and highlight unresolved challenges in scientific mention detection. SciDMT serves as a robust benchmark for the research community, encouraging the development of innovative models to further the field of scientific information extraction.
comment: LREC/COLING 2024
♻ ☆ MambaMixer: Efficient Selective State Space Models with Dual Token and Channel Selection
Recent advances in deep learning have mainly relied on Transformers due to their data dependency and ability to learn at scale. The attention module in these architectures, however, exhibits quadratic time and space in input size, limiting their scalability for long-sequence modeling. State Space Models (SSMs), and more specifically Selective SSMs (S6), with efficient hardware-aware implementation, have shown promising potential for long causal sequence modeling. They, however, use separate blocks for each channel and fail to filter irrelevant channels and capture inter-channel dependencies. Natural attempt to mix information across channels using MLP, attention, or SSMs results in further instability in the training of SSMs for large networks and/or nearly double the number of parameters. We present the MambaMixer block, a new SSM-based architecture with data-dependent weights that uses a dual selection mechanism across tokens and channels-called Selective Token and Channel Mixer. To mitigate doubling the number of parameters, we present a new non-causal heuristic of the S6 block with a hardware-friendly implementation. We further present an efficient variant of MambaMixer, called QSMixer, that mixes information along both sequence and embedding dimensions. As a proof of concept, we design Vision MambaMixer (ViM2) and Vision QSMixer (ViQS) architectures. To enhance their ability to capture spatial information in images, we present Switch of Scans (SoS) that dynamically uses a set of useful image scans to traverse image patches. We evaluate the performance of our methods in image classification, segmentation, and object detection. Our results underline the importance of selectively mixing across both tokens and channels and show the competitive (resp. superior) performance of our methods with well-established vision models (resp. SSM-based models).
♻ ☆ Action2Sound: Ambient-Aware Generation of Action Sounds from Egocentric Videos
Generating realistic audio for human interactions is important for many applications, such as creating sound effects for films or virtual reality games. Existing approaches implicitly assume total correspondence between the video and audio during training, yet many sounds happen off-screen and have weak to no correspondence with the visuals -- resulting in uncontrolled ambient sounds or hallucinations at test time. We propose a novel ambient-aware audio generation model, AV-LDM. We devise a novel audio-conditioning mechanism to learn to disentangle foreground action sounds from the ambient background sounds in in-the-wild training videos. Given a novel silent video, our model uses retrieval-augmented generation to create audio that matches the visual content both semantically and temporally. We train and evaluate our model on two in-the-wild egocentric video datasets Ego4D and EPIC-KITCHENS. Our model outperforms an array of existing methods, allows controllable generation of the ambient sound, and even shows promise for generalizing to computer graphics game clips. Overall, our work is the first to focus video-to-audio generation faithfully on the observed visual content despite training from uncurated clips with natural background sounds.
comment: Project page: https://vision.cs.utexas.edu/projects/action2sound
♻ ☆ [Experiments & Analysis] Evaluating the Feasibility of Sampling-Based Techniques for Training Multilayer Perceptrons
The training process of neural networks is known to be time-consuming, and having a deep architecture only aggravates the issue. This process consists mostly of matrix operations, among which matrix multiplication is the bottleneck. Several sampling-based techniques have been proposed for speeding up the training time of deep neural networks by approximating the matrix products. These techniques fall under two categories: (i) sampling a subset of nodes in every hidden layer as active at every iteration and (ii) sampling a subset of nodes from the previous layer to approximate the current layer's activations using the edges from the sampled nodes. In both cases, the matrix products are computed using only the selected samples. In this paper, we evaluate the feasibility of these approaches on CPU machines with limited computational resources. Making a connection between the two research directions as special cases of approximating matrix multiplications in the context of neural networks, we provide a negative theoretical analysis that shows feedforward approximation is an obstacle against scalability. We conduct comprehensive experimental evaluations that demonstrate the most pressing challenges and limitations associated with the studied approaches. We observe that the hashing-based node selection method is not scalable to a large number of layers, confirming our theoretical analysis. Finally, we identify directions for future research.
♻ ☆ RankCLIP: Ranking-Consistent Language-Image Pretraining
Self-supervised contrastive learning models, such as CLIP, have set new benchmarks for vision-language models in many downstream tasks. However, their dependency on rigid one-to-one mappings overlooks the complex and often multifaceted relationships between and within texts and images. To this end, we introduce RANKCLIP, a novel pretraining method that extends beyond the rigid one-to-one matching framework of CLIP and its variants. By extending the traditional pair-wise loss to list-wise, and leveraging both in-modal and cross-modal ranking consistency, RANKCLIP improves the alignment process, enabling it to capture the nuanced many-to-many relationships between and within each modality. Through comprehensive experiments, we demonstrate the effectiveness of RANKCLIP in various downstream tasks, notably achieving significant gains in zero-shot classifications over state-of-the-art methods, underscoring the importance of this enhanced learning process.
comment: 12 pages, 4 figures, 6 tables. Code and model checkpoints are available at https://github.com/Jam1ezhang/RankCLIP
♻ ☆ FedConPE: Efficient Federated Conversational Bandits with Heterogeneous Clients IJCAI
Conversational recommender systems have emerged as a potent solution for efficiently eliciting user preferences. These systems interactively present queries associated with "key terms" to users and leverage user feedback to estimate user preferences more efficiently. Nonetheless, most existing algorithms adopt a centralized approach. In this paper, we introduce FedConPE, a phase elimination-based federated conversational bandit algorithm, where $M$ agents collaboratively solve a global contextual linear bandit problem with the help of a central server while ensuring secure data management. To effectively coordinate all the clients and aggregate their collected data, FedConPE uses an adaptive approach to construct key terms that minimize uncertainty across all dimensions in the feature space. Furthermore, compared with existing federated linear bandit algorithms, FedConPE offers improved computational and communication efficiency as well as enhanced privacy protections. Our theoretical analysis shows that FedConPE is minimax near-optimal in terms of cumulative regret. We also establish upper bounds for communication costs and conversation frequency. Comprehensive evaluations demonstrate that FedConPE outperforms existing conversational bandit algorithms while using fewer conversations.
comment: Accepted to the 33rd International Joint Conference on Artificial Intelligence (IJCAI), 2024
♻ ☆ The Importance of Directional Feedback for LLM-based Optimizers NeurIPS 2023
We study the potential of using large language models (LLMs) as an interactive optimizer for solving maximization problems in a text space using natural language and numerical feedback. Inspired by the classical optimization literature, we classify the natural language feedback into directional and non-directional, where the former is a generalization of the first-order feedback to the natural language space. We find that LLMs are especially capable of optimization when they are provided with {directional feedback}. Based on this insight, we design a new LLM-based optimizer that synthesizes directional feedback from the historical optimization trace to achieve reliable improvement over iterations. Empirically, we show our LLM-based optimizer is more stable and efficient in solving optimization problems, from maximizing mathematical functions to optimizing prompts for writing poems, compared with existing techniques.
comment: Accepted and Presented at Foundation Models for Decision Making at NeurIPS 2023 (December 15, 2023). Work completed from June 2023 to September 2023
♻ ☆ Physics-aware Machine Learning Revolutionizes Scientific Paradigm for Machine Learning and Process-based Hydrology
Accurate hydrological understanding and water cycle prediction are crucial for addressing scientific and societal challenges associated with the management of water resources, particularly under the dynamic influence of anthropogenic climate change. Existing reviews predominantly concentrate on the development of machine learning (ML) in this field, yet there is a clear distinction between hydrology and ML as separate paradigms. Here, we introduce physics-aware ML as a transformative approach to overcome the perceived barrier and revolutionize both fields. Specifically, we present a comprehensive review of the physics-aware ML methods, building a structured community (PaML) of existing methodologies that integrate prior physical knowledge or physics-based modeling into ML. We systematically analyze these PaML methodologies with respect to four aspects: physical data-guided ML, physics-informed ML, physics-embedded ML, and physics-aware hybrid learning. PaML facilitates ML-aided hypotheses, accelerating insights from big data and fostering scientific discoveries. We first conduct a systematic review of hydrology in PaML, including rainfall-runoff hydrological processes and hydrodynamic processes, and highlight the most promising and challenging directions for different objectives and PaML methods. Finally, a new PaML-based hydrology platform, termed HydroPML, is released as a foundation for hydrological applications. HydroPML enhances the explainability and causality of ML and lays the groundwork for the digital water cycle's realization. The HydroPML platform is publicly available at https://hydropml.github.io/.
comment: 33 pages, 6 figures
♻ ☆ Transformers Can Represent $n$-gram Language Models
Existing work has analyzed the representational capacity of the transformer architecture by means of formal models of computation. However, the focus so far has been on analyzing the architecture in terms of language \emph{acceptance}. We contend that this is an ill-suited problem in the study of \emph{language models} (LMs), which are definitionally \emph{probability distributions} over strings. In this paper, we focus on the relationship between transformer LMs and $n$-gram LMs, a simple and historically relevant class of language models. We show that transformer LMs using the hard or sparse attention mechanisms can exactly represent any $n$-gram LM, giving us a concrete lower bound on their probabilistic representational capacity. This provides a first step towards understanding the mechanisms that transformer LMs can use to represent probability distributions over strings.
♻ ☆ Low-Rank Quantization-Aware Training for LLMs
Large language models (LLMs) are omnipresent, however their practical deployment is challenging due to their ever increasing computational and memory demands. Quantization is one of the most effective ways to make them more compute and memory efficient. Quantization-aware training (QAT) methods, generally produce the best quantized performance, however it comes at the cost of potentially long training time and excessive memory usage, making it impractical when applying for LLMs. Inspired by parameter-efficient fine-tuning (PEFT) and low-rank adaptation (LoRA) literature, we propose LR-QAT -- a lightweight and memory-efficient QAT algorithm for LLMs. LR-QAT employs several components to save memory without sacrificing predictive performance: (a) low-rank auxiliary weights that are aware of the quantization grid; (b) a downcasting operator using fixed-point or double-packed integers and (c) checkpointing. Unlike most related work, our method (i) is inference-efficient, leading to no additional overhead compared to traditional PTQ; (ii) can be seen as a general extended pretraining framework, meaning that the resulting model can still be utilized for any downstream task afterwards; (iii) can be applied across a wide range of quantization settings, such as different choices quantization granularity, activation quantization, and seamlessly combined with many PTQ techniques. We apply LR-QAT to LLaMA-2/3 and Mistral model families and validate its effectiveness on several downstream tasks. Our method outperforms common post-training quantization (PTQ) approaches and reaches the same model performance as full-model QAT at the fraction of its memory usage. Specifically, we can train a 7B LLM on a single consumer grade GPU with 24GB of memory.
♻ ☆ A Question-centric Multi-experts Contrastive Learning Framework for Improving the Accuracy and Interpretability of Deep Sequential Knowledge Tracing Models KDD
Knowledge tracing (KT) plays a crucial role in predicting students' future performance by analyzing their historical learning processes. Deep neural networks (DNNs) have shown great potential in solving the KT problem. However, there still exist some important challenges when applying deep learning techniques to model the KT process. The first challenge lies in taking the individual information of the question into modeling. This is crucial because, despite questions sharing the same knowledge component (KC), students' knowledge acquisition on homogeneous questions can vary significantly. The second challenge lies in interpreting the prediction results from existing deep learning-based KT models. In real-world applications, while it may not be necessary to have complete transparency and interpretability of the model parameters, it is crucial to present the model's prediction results in a manner that teachers find interpretable. This makes teachers accept the rationale behind the prediction results and utilize them to design teaching activities and tailored learning strategies for students. However, the inherent black-box nature of deep learning techniques often poses a hurdle for teachers to fully embrace the model's prediction results. To address these challenges, we propose a Question-centric Multi-experts Contrastive Learning framework for KT called Q-MCKT. We have provided all the datasets and code on our website at https://github.com/rattlesnakey/Q-MCKT.
comment: 25 pages, 9 figures, Accepted by TKDD
♻ ☆ VisLingInstruct: Elevating Zero-Shot Learning in Multi-Modal Language Models with Autonomous Instruction Optimization NAACL2024
This paper presents VisLingInstruct, a novel approach to advancing Multi-Modal Language Models (MMLMs) in zero-shot learning. Current MMLMs show impressive zero-shot abilities in multi-modal tasks, but their performance depends heavily on the quality of instructions. VisLingInstruct tackles this by autonomously evaluating and optimizing instructional texts through In-Context Learning, improving the synergy between visual perception and linguistic expression in MMLMs. Alongside this instructional advancement, we have also optimized the visual feature extraction modules in MMLMs, further augmenting their responsiveness to textual content. Our comprehensive experiments on MMLMs, based on FlanT5 and Vicuna, show that VisLingInstruct significantly improves zero-shot performance in visual multi-modal tasks. Notably, it achieves a 13.1% and 9% increase in accuracy over the prior state-of-the-art on the TextVQA and HatefulMemes datasets. Our main code is available at https://github.com/Zhudongsheng75/VisLingInstruct.
comment: Accepted to NAACL2024 main conference
♻ ☆ Graph Neural Networks in Histopathology: Emerging Trends and Future Directions
Histopathological analysis of Whole Slide Images (WSIs) has seen a surge in the utilization of deep learning methods, particularly Convolutional Neural Networks (CNNs). However, CNNs often fall short in capturing the intricate spatial dependencies inherent in WSIs. Graph Neural Networks (GNNs) present a promising alternative, adept at directly modeling pairwise interactions and effectively discerning the topological tissue and cellular structures within WSIs. Recognizing the pressing need for deep learning techniques that harness the topological structure of WSIs, the application of GNNs in histopathology has experienced rapid growth. In this comprehensive review, we survey GNNs in histopathology, discuss their applications, and explore emerging trends that pave the way for future advancements in the field. We begin by elucidating the fundamentals of GNNs and their potential applications in histopathology. Leveraging quantitative literature analysis, we identify four emerging trends: Hierarchical GNNs, Adaptive Graph Structure Learning, Multimodal GNNs, and Higher-order GNNs. Through an in-depth exploration of these trends, we offer insights into the evolving landscape of GNNs in histopathological analysis. Based on our findings, we propose future directions to propel the field forward. Our analysis serves to guide researchers and practitioners towards innovative approaches and methodologies, fostering advancements in histopathological analysis through the lens of graph neural networks.
♻ ☆ Fairness Concerns in App Reviews: A Study on AI-based Mobile Apps
Fairness is one of the socio-technical concerns that must be addressed in AI-based systems. Unfair AI-based systems, particularly unfair AI-based mobile apps, can pose difficulties for a significant proportion of the global population. This paper aims to analyze fairness concerns in AI-based app reviews. We first manually constructed a ground-truth dataset, including 1,132 fairness and 1,473 non-fairness reviews. Leveraging the ground-truth dataset, we developed and evaluated a set of machine learning and deep learning models that distinguish fairness reviews from non-fairness reviews. Our experiments show that our best-performing model can detect fairness reviews with a precision of 94%. We then applied the best-performing model on approximately 9.5M reviews collected from 108 AI-based apps and identified around 92K fairness reviews. Next, applying the K-means clustering technique to the 92K fairness reviews, followed by manual analysis, led to the identification of six distinct types of fairness concerns (e.g., 'receiving different quality of features and services in different platforms and devices' and 'lack of transparency and fairness in dealing with user-generated content'). Finally, the manual analysis of 2,248 app owners' responses to the fairness reviews identified six root causes (e.g., 'copyright issues') that app owners report to justify fairness concerns.
comment: 30 pages, 5 images, 6 tables, Manuscript submitted to a Journal (2024)
♻ ☆ All Languages Matter: On the Multilingual Safety of Large Language Models ACL 2024
Safety lies at the core of developing and deploying large language models (LLMs). However, previous safety benchmarks only concern the safety in one language, e.g. the majority language in the pretraining data such as English. In this work, we build the first multilingual safety benchmark for LLMs, XSafety, in response to the global deployment of LLMs in practice. XSafety covers 14 kinds of commonly used safety issues across 10 languages that span several language families. We utilize XSafety to empirically study the multilingual safety for 4 widely-used LLMs, including both close-API and open-source models. Experimental results show that all LLMs produce significantly more unsafe responses for non-English queries than English ones, indicating the necessity of developing safety alignment for non-English languages. In addition, we propose several simple and effective prompting methods to improve the multilingual safety of ChatGPT by evoking safety knowledge and improving cross-lingual generalization of safety alignment. Our prompting method can significantly reduce the ratio of unsafe responses from 19.1% to 9.7% for non-English queries. We release our data at https://github.com/Jarviswang94/Multilingual_safety_benchmark.
comment: Accepted by ACL 2024 Findings. The first multilingual safety benchmark for large language models
♻ ☆ NNPP: A Learning-Based Heuristic Model for Accelerating Optimal Path Planning on Uneven Terrain
Intelligent autonomous path planning is essential for enhancing the exploration efficiency of mobile robots operating in uneven terrains like planetary surfaces and off-road environments.In this paper, we propose the NNPP model for computing the heuristic region, enabling foundation algorithms like Astar to find the optimal path solely within this reduced search space, effectively decreasing the search time. The NNPP model learns semantic information about start and goal locations, as well as map representations, from numerous pre-annotated optimal path demonstrations, and produces a probabilistic distribution over each pixel representing the likelihood of it belonging to an optimal path on the map. More specifically, the paper computes the traversal cost for each grid cell from the slope, roughness and elevation difference obtained from the digital elevation model. Subsequently, the start and goal locations are encoded using a Gaussian distribution and different location encoding parameters are analyzed for their effect on model performance. After training, the NNPP model is able to \textcolor{revision}{accelerate} path planning on novel maps.
♻ ☆ Diffusion model for relational inference
Dynamical behaviors of complex interacting systems, including brain activities, financial price movements, and physical collective phenomena, are associated with underlying interactions between the system's components. The issue of uncovering interaction relations in such systems using observable dynamics is called relational inference. In this study, we propose a Diffusion model for Relational Inference (DiffRI), inspired by a self-supervised method for probabilistic time series imputation. DiffRI learns to infer the probability of the presence of connections between components through conditional diffusion modeling.
♻ ☆ FewFedPIT: Towards Privacy-preserving and Few-shot Federated Instruction Tuning
Instruction tuning has been identified as a crucial technique for optimizing the performance of large language models (LLMs) in generating human-aligned responses. Nonetheless, gathering diversified and superior-quality instruction data for such tuning presents notable obstacles, especially in domains with rigid privacy provisions. Federated instruction tuning (FedIT) has emerged as a promising solution, by consolidating collaborative training across multiple data owners, thereby resulting in a privacy-preserving learning model. However, FedIT encounters limitations such as scarcity of instructional data and risk of exposure to training data extraction attacks. In this paper, we propose a novel federated algorithm, FewFedPIT, designed to simultaneously enhance privacy protection and model performance of federated few-shot learning. FewFedPITcomprises three vital components on the client side: (1) synthetic data generation, which utilizes LLMs' in-context learning capacity to generate synthetic data autonomously, thus expanding the local database; (2) parameter isolation training, which individually updates the public parameters in the synthetic data and the private parameters in the local data, consequently mitigating the noise impact of the synthetic data; (3) local aggregation sharing, which mixes public and private parameters before uploading, effectively preventing data extraction attacks. Extensive experiments on three open-source datasets demonstrate the effectiveness of FewFedPITin, enhancing privacy preservation and improving federated few-shot performance.
comment: Work in progress
♻ ☆ CoT-BERT: Enhancing Unsupervised Sentence Representation through Chain-of-Thought ICANN 2024
Unsupervised sentence representation learning aims to transform input sentences into fixed-length vectors enriched with intricate semantic information while obviating the reliance on labeled data. Recent strides within this domain have been significantly propelled by breakthroughs in contrastive learning and prompt engineering. Despite these advancements, the field has reached a plateau, leading some researchers to incorporate external components to enhance the quality of sentence embeddings. Such integration, though beneficial, complicates solutions and inflates demands for computational resources. In response to these challenges, this paper presents CoT-BERT, an innovative method that harnesses the progressive thinking of Chain-of-Thought reasoning to tap into the latent potential of pre-trained models like BERT. Additionally, we develop an advanced contrastive learning loss function and propose a novel template denoising strategy. Rigorous experimentation demonstrates that CoT-BERT surpasses a range of well-established baselines by relying exclusively on the intrinsic strengths of pre-trained models.
comment: Accepted by ICANN 2024
♻ ☆ The Empirical Impact of Neural Parameter Symmetries, or Lack Thereof
Many algorithms and observed phenomena in deep learning appear to be affected by parameter symmetries -- transformations of neural network parameters that do not change the underlying neural network function. These include linear mode connectivity, model merging, Bayesian neural network inference, metanetworks, and several other characteristics of optimization or loss-landscapes. However, theoretical analysis of the relationship between parameter space symmetries and these phenomena is difficult. In this work, we empirically investigate the impact of neural parameter symmetries by introducing new neural network architectures that have reduced parameter space symmetries. We develop two methods, with some provable guarantees, of modifying standard neural networks to reduce parameter space symmetries. With these new methods, we conduct a comprehensive experimental study consisting of multiple tasks aimed at assessing the effect of removing parameter symmetries. Our experiments reveal several interesting observations on the empirical impact of parameter symmetries; for instance, we observe linear mode connectivity between our networks without alignment of weight spaces, and we find that our networks allow for faster and more effective Bayesian neural network training.
comment: 27 pages. Preparing code for release. v2: added / updated some citations
♻ ☆ Advancing Abductive Reasoning in Knowledge Graphs through Complex Logical Hypothesis Generation ACL 2024
Abductive reasoning is the process of making educated guesses to provide explanations for observations. Although many applications require the use of knowledge for explanations, the utilization of abductive reasoning in conjunction with structured knowledge, such as a knowledge graph, remains largely unexplored. To fill this gap, this paper introduces the task of complex logical hypothesis generation, as an initial step towards abductive logical reasoning with KG. In this task, we aim to generate a complex logical hypothesis so that it can explain a set of observations. We find that the supervised trained generative model can generate logical hypotheses that are structurally closer to the reference hypothesis. However, when generalized to unseen observations, this training objective does not guarantee better hypothesis generation. To address this, we introduce the Reinforcement Learning from Knowledge Graph (RLF-KG) method, which minimizes differences between observations and conclusions drawn from generated hypotheses according to the KG. Experiments show that, with RLF-KG's assistance, the generated hypotheses provide better explanations, and achieve state-of-the-art results on three widely used KGs.
comment: Accepted by ACL 2024
♻ ☆ Self-supervised Learning for Human Activity Recognition Using 700,000 Person-days of Wearable Data
Advances in deep learning for human activity recognition have been relatively limited due to the lack of large labelled datasets. In this study, we leverage self-supervised learning techniques on the UK-Biobank activity tracker dataset--the largest of its kind to date--containing more than 700,000 person-days of unlabelled wearable sensor data. Our resulting activity recognition model consistently outperformed strong baselines across seven benchmark datasets, with an F1 relative improvement of 2.5%-100% (median 18.4%), the largest improvements occurring in the smaller datasets. In contrast to previous studies, our results generalise across external datasets, devices, and environments. Our open-source model will help researchers and developers to build customisable and generalisable activity classifiers with high performance.
♻ ☆ Active Few-Shot Fine-Tuning
We study the question: How can we select the right data for fine-tuning to a specific task? We call this data selection problem active fine-tuning and show that it is an instance of transductive active learning, a novel generalization of classical active learning. We propose ITL, short for information-based transductive learning, an approach which samples adaptively to maximize information gained about the specified task. We are the first to show, under general regularity assumptions, that such decision rules converge uniformly to the smallest possible uncertainty obtainable from the accessible data. We apply ITL to the few-shot fine-tuning of large neural networks and show that fine-tuning with ITL learns the task with significantly fewer examples than the state-of-the-art.
♻ ☆ Embedding Large Language Models into Extended Reality: Opportunities and Challenges for Inclusion, Engagement, and Privacy
Advances in artificial intelligence and human-computer interaction will likely lead to extended reality (XR) becoming pervasive. While XR can provide users with interactive, engaging, and immersive experiences, non-player characters are often utilized in pre-scripted and conventional ways. This paper argues for using large language models (LLMs) in XR by embedding them in avatars or as narratives to facilitate inclusion through prompt engineering and fine-tuning the LLMs. We argue that this inclusion will promote diversity for XR use. Furthermore, the versatile conversational capabilities of LLMs will likely increase engagement in XR, helping XR become ubiquitous. Lastly, we speculate that combining the information provided to LLM-powered spaces by users and the biometric data obtained might lead to novel privacy invasions. While exploring potential privacy breaches, examining user privacy concerns and preferences is also essential. Therefore, despite challenges, LLM-powered XR is a promising area with several opportunities.
comment: ACM Conversational User Interfaces 2024
♻ ☆ On the Inductive Biases of Demographic Parity-based Fair Learning Algorithms
Fair supervised learning algorithms assigning labels with little dependence on a sensitive attribute have attracted great attention in the machine learning community. While the demographic parity (DP) notion has been frequently used to measure a model's fairness in training fair classifiers, several studies in the literature suggest potential impacts of enforcing DP in fair learning algorithms. In this work, we analytically study the effect of standard DP-based regularization methods on the conditional distribution of the predicted label given the sensitive attribute. Our analysis shows that an imbalanced training dataset with a non-uniform distribution of the sensitive attribute could lead to a classification rule biased toward the sensitive attribute outcome holding the majority of training data. To control such inductive biases in DP-based fair learning, we propose a sensitive attribute-based distributionally robust optimization (SA-DRO) method improving robustness against the marginal distribution of the sensitive attribute. Finally, we present several numerical results on the application of DP-based learning methods to standard centralized and distributed learning problems. The empirical findings support our theoretical results on the inductive biases in DP-based fair learning algorithms and the debiasing effects of the proposed SA-DRO method.
♻ ☆ Eigenpruning: an Interpretability-Inspired PEFT Method NAACL 2024
We introduce eigenpruning, a method that removes singular values from weight matrices in an LLM to improve its performance in a particular task. This method is inspired by interpretability methods designed to automatically find subnetworks of a model which solve a specific task. In our tests, the pruned model outperforms the original model by a large margin, while only requiring minimal computation to prune the weight matrices. In the case of a small synthetic task in integer multiplication, the Phi-2 model can improve its accuracy in the test set from 13.75% to 97.50%. Interestingly, these results seem to indicate the existence of a computation path that can solve the task very effectively, but it was not being used by the original model. Finally, we publicly release our implementation.
comment: Extended abstract accepted to LatinX at NAACL 2024
♻ ☆ Exploring Transfer Learning in Medical Image Segmentation using Vision-Language Models
Medical image segmentation allows quantifying target structure size and shape, aiding in disease diagnosis, prognosis, surgery planning, and comprehension.Building upon recent advancements in foundation Vision-Language Models (VLMs) from natural image-text pairs, several studies have proposed adapting them to Vision-Language Segmentation Models (VLSMs) that allow using language text as an additional input to segmentation models. Introducing auxiliary information via text with human-in-the-loop prompting during inference opens up unique opportunities, such as open vocabulary segmentation and potentially more robust segmentation models against out-of-distribution data. Although transfer learning from natural to medical images has been explored for image-only segmentation models, the joint representation of vision-language in segmentation problems remains underexplored. This study introduces the first systematic study on transferring VLSMs to 2D medical images, using carefully curated $11$ datasets encompassing diverse modalities and insightful language prompts and experiments. Our findings demonstrate that although VLSMs show competitive performance compared to image-only models for segmentation after finetuning in limited medical image datasets, not all VLSMs utilize the additional information from language prompts, with image features playing a dominant role. While VLSMs exhibit enhanced performance in handling pooled datasets with diverse modalities and show potential robustness to domain shifts compared to conventional segmentation models, our results suggest that novel approaches are required to enable VLSMs to leverage the various auxiliary information available through language prompts. The code and datasets are available at https://github.com/naamiinepal/medvlsm.
comment: Medical Imaging with Deep Learning (MIDL) 2024 (Oral)
♻ ☆ Fusion-PSRO: Nash Policy Fusion for Policy Space Response Oracles
A popular approach for solving zero-sum games is to maintain populations of policies to approximate the Nash Equilibrium (NE). Previous studies have shown that Policy Space Response Oracle (PSRO) algorithm is an effective multi-agent reinforcement learning framework for solving such games. However, repeatedly training new policies from scratch to approximate Best Response (BR) to opponents' mixed policies at each iteration is both inefficient and costly. While some PSRO variants initialize a new policy by inheriting from past BR policies, this approach limits the exploration of new policies, especially against challenging opponents. To address this issue, we propose Fusion-PSRO, which employs policy fusion to initialize policies for better approximation to BR. By selecting high-quality base policies from meta-NE, policy fusion fuses the base policies into a new policy through model averaging. This approach allows the initialized policies to incorporate multiple expert policies, making it easier to handle difficult opponents compared to inheriting from past BR policies or initializing from scratch. Moreover, our method only modifies the policy initialization phase, allowing its application to nearly all PSRO variants without additional training overhead. Our experiments on non-transitive matrix games, Leduc Poker, and the more complex Liars Dice demonstrate that Fusion-PSRO enhances the performance of nearly all PSRO variants, achieving lower exploitability.
comment: 20 pages, 5 figures
♻ ☆ Are Large Language Models a Good Replacement of Taxonomies? VLDB 2024
Large language models (LLMs) demonstrate an impressive ability to internalize knowledge and answer natural language questions. Although previous studies validate that LLMs perform well on general knowledge while presenting poor performance on long-tail nuanced knowledge, the community is still doubtful about whether the traditional knowledge graphs should be replaced by LLMs. In this paper, we ask if the schema of knowledge graph (i.e., taxonomy) is made obsolete by LLMs. Intuitively, LLMs should perform well on common taxonomies and at taxonomy levels that are common to people. Unfortunately, there lacks a comprehensive benchmark that evaluates the LLMs over a wide range of taxonomies from common to specialized domains and at levels from root to leaf so that we can draw a confident conclusion. To narrow the research gap, we constructed a novel taxonomy hierarchical structure discovery benchmark named TaxoGlimpse to evaluate the performance of LLMs over taxonomies. TaxoGlimpse covers ten representative taxonomies from common to specialized domains with in-depth experiments of different levels of entities in this taxonomy from root to leaf. Our comprehensive experiments of eighteen state-of-the-art LLMs under three prompting settings validate that LLMs can still not well capture the knowledge of specialized taxonomies and leaf-level entities.
comment: Accepted by VLDB 2024
♻ ☆ White Men Lead, Black Women Help? Benchmarking Language Agency Social Biases in LLMs
Language agency is an important aspect of evaluating social biases in texts. While several studies approached agency-related bias in human-written language, very limited research has investigated such biases in Large Language Model (LLM)-generated content. In addition, previous research often relies on string-matching techniques to identify agentic and communal words within texts, which fall short of accurately classifying language agency. We introduce the novel Language Agency Bias Evaluation (LABE) benchmark, which comprehensively evaluates biases in LLMs by analyzing agency levels attributed to different demographic groups in model generations. LABE leverages 5,400 template-based prompts, an accurate agency classifier, and corresponding bias metrics to test for gender, racial, and intersectional language agency biases in LLMs on 3 text generation tasks: biographies, professor reviews, and reference letters. To build better and more accurate automated agency classifiers, we also contribute and release the Language Agency Classification (LAC) dataset, consisting of 3,724 agentic and communal sentences. Using LABE, we unveil previously under-explored language agency social biases in 3 recent LLMs: ChatGPT, Llama3, and Mistral. We observe that: (1) For the same text category, LLM generations demonstrate higher levels of gender bias than human-written texts; (2) On most generation tasks, models show remarkably higher levels of intersectional bias than the other bias aspects. Those who are at the intersection of gender and racial minority groups -- such as Black females -- are consistently described by texts with lower levels of agency; (3) Among the 3 LLMs investigated, Llama3 demonstrates greatest overall bias in language agency; (4) Not only does prompt-based mitigation fail to resolve language agency bias in LLMs, but it frequently leads to the exacerbation of biases in generated texts.
♻ ☆ Progress Measures for Grokking on Real-world Tasks
Grokking, a phenomenon where machine learning models generalize long after overfitting, has been primarily observed and studied in algorithmic tasks. This paper explores grokking in real-world datasets using deep neural networks for classification under the cross-entropy loss. We challenge the prevalent hypothesis that the $L_2$ norm of weights is the primary cause of grokking by demonstrating that grokking can occur outside the expected range of weight norms. To better understand grokking, we introduce three new progress measures: activation sparsity, absolute weight entropy, and approximate local circuit complexity. These measures are conceptually related to generalization and demonstrate a stronger correlation with grokking in real-world datasets compared to weight norms. Our findings suggest that while weight norms might usually correlate with grokking and our progress measures, they are not causative, and our proposed measures provide a better understanding of the dynamics of grokking.
comment: 5 pages
♻ ☆ Xmodel-VLM: A Simple Baseline for Multimodal Vision Language Model
We introduce Xmodel-VLM, a cutting-edge multimodal vision language model. It is designed for efficient deployment on consumer GPU servers. Our work directly confronts a pivotal industry issue by grappling with the prohibitive service costs that hinder the broad adoption of large-scale multimodal systems. Through rigorous training, we have developed a 1B-scale language model from the ground up, employing the LLaVA paradigm for modal alignment. The result, which we call Xmodel-VLM, is a lightweight yet powerful multimodal vision language model. Extensive testing across numerous classic multimodal benchmarks has revealed that despite its smaller size and faster execution, Xmodel-VLM delivers performance comparable to that of larger models. Our model checkpoints and code are publicly available on GitHub at https://github.com/XiaoduoAILab/XmodelVLM.
♻ ☆ MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation ACL 2024
Embodied agents equipped with GPT as their brains have exhibited extraordinary decision-making and generalization abilities across various tasks. However, existing zero-shot agents for vision-and-language navigation (VLN) only prompt GPT-4 to select potential locations within localized environments, without constructing an effective "global-view" for the agent to understand the overall environment. In this work, we present a novel map-guided GPT-based agent, dubbed MapGPT, which introduces an online linguistic-formed map to encourage global exploration. Specifically, we build an online map and incorporate it into the prompts that include node information and topological relationships, to help GPT understand the spatial environment. Benefiting from this design, we further propose an adaptive planning mechanism to assist the agent in performing multi-step path planning based on a map, systematically exploring multiple candidate nodes or sub-goals step by step. Extensive experiments demonstrate that our MapGPT is applicable to both GPT-4 and GPT-4V, achieving state-of-the-art zero-shot performance on R2R and REVERIE simultaneously (~10% and ~12% improvements in SR), and showcasing the newly emergent global thinking and path planning abilities of the GPT.
comment: LLM/VLM-based VLN Agents. Accepted to ACL 2024. Project: https://chen-judge.github.io/MapGPT/
♻ ☆ LLM-Ensemble: Optimal Large Language Model Ensemble Method for E-commerce Product Attribute Value Extraction SIGIR 2024
Product attribute value extraction is a pivotal component in Natural Language Processing (NLP) and the contemporary e-commerce industry. The provision of precise product attribute values is fundamental in ensuring high-quality recommendations and enhancing customer satisfaction. The recently emerging Large Language Models (LLMs) have demonstrated state-of-the-art performance in numerous attribute extraction tasks, without the need for domain-specific training data. Nevertheless, varying strengths and weaknesses are exhibited by different LLMs due to the diversity in data, architectures, and hyperparameters. This variation makes them complementary to each other, with no single LLM dominating all others. Considering the diverse strengths and weaknesses of LLMs, it becomes necessary to develop an ensemble method that leverages their complementary potentials. In this paper, we propose a novel algorithm called LLM-ensemble to ensemble different LLMs' outputs for attribute value extraction. We iteratively learn the weights for different LLMs to aggregate the labels with weights to predict the final attribute value. Not only can our proposed method be proven theoretically optimal, but it also ensures efficient computation, fast convergence, and safe deployment. We have also conducted extensive experiments with various state-of-the-art LLMs, including Llama2-13B, Llama2-70B, PaLM-2, GPT-3.5, and GPT-4, on Walmart's internal data. Our offline metrics demonstrate that the LLM-ensemble method outperforms all the state-of-the-art single LLMs on Walmart's internal dataset. This method has been launched in several production models, leading to improved Gross Merchandise Volume (GMV), Click-Through Rate (CTR), Conversion Rate (CVR), and Add-to-Cart Rate (ATC).
comment: SIGIR 2024 industry track
♻ ☆ The logic behind desirable sets of things, and its filter representation
We identify the (filter representation of the) logic behind the recent theory of coherent sets of desirable (sets of) things, which generalise coherent sets of desirable (sets of) gambles as well as coherent choice functions, and show that this identification allows us to establish various representation results for such coherent models in terms of simpler ones.
comment: 38 pages, 1 figure, 2 tables
♻ ☆ Dependency-Aware Semi-Structured Sparsity: Declining Roles of Outliers in Pruning GLU-based LLMs
The rapid growth in the scale of Large Language Models (LLMs) has led to significant computational and memory costs, making model compression techniques such as network pruning increasingly crucial for their efficient deployment. Recent LLMs such as LLaMA2 and Mistral have adopted GLU-based MLP architectures. However, current LLM pruning strategies are primarily based on insights from older LLM architectures, necessitating a reevaluation of these strategies to suit the new architectural characteristics. Contrary to traditional beliefs, we find that outliers play a diminished role in the input projections of GLU-based MLPs. Leveraging this new insight, we propose Dependency-aware Semi-structured Sparsity (DaSS), a novel pruning method for GLU-based LLMs. DaSS balances the flexibility of unstructured pruning and the structural consistency of dependency-based structured pruning by considering both of weight magnitude and corresponding intermediate activation norms in weight pruning metric. Empirical evaluations on the Mistral, Gemma, and LLaMA2 model families demonstrate the consistent effectiveness of DaSS in the prevailing GLU variants.
♻ ☆ FAST: Feature Aware Similarity Thresholding for Weak Unlearning in Black-Box Generative Models
The heightened emphasis on the regulation of deep generative models, propelled by escalating concerns pertaining to privacy and compliance with regulatory frameworks, underscores the imperative need for precise control mechanisms over these models. This urgency is particularly underscored by instances in which generative models generate outputs that encompass objectionable, offensive, or potentially injurious content. In response, machine unlearning has emerged to selectively forget specific knowledge or remove the influence of undesirable data subsets from pre-trained models. However, modern machine unlearning approaches typically assume access to model parameters and architectural details during unlearning, which is not always feasible. In multitude of downstream tasks, these models function as black-box systems, with inaccessible pre-trained parameters, architectures, and training data. In such scenarios, the possibility of filtering undesired outputs becomes a practical alternative. The primary goal of this study is twofold: first, to elucidate the relationship between filtering and unlearning processes, and second, to formulate a methodology aimed at mitigating the display of undesirable outputs generated from models characterized as black-box systems. Theoretical analysis in this study demonstrates that, in the context of black-box models, filtering can be seen as a form of weak unlearning. Our proposed \textbf{\textit{Feature Aware Similarity Thresholding(FAST)}} method effectively suppresses undesired outputs by systematically encoding the representation of unwanted features in the latent space.
♻ ☆ Scorch: A Library for Sparse Deep Learning
The rapid growth in the size of deep learning models strains the capabilities of traditional dense computation paradigms. Leveraging sparse computation has become increasingly popular for training and deploying large-scale models, but existing deep learning frameworks lack extensive support for sparse operations. To bridge this gap, we introduce Scorch, a library that seamlessly integrates efficient sparse tensor computation into the PyTorch ecosystem, with an initial focus on inference workloads on CPUs. Scorch provides a flexible and intuitive interface for sparse tensors, supporting diverse sparse data structures. Scorch introduces a compiler stack that automates key optimizations, including automatic loop ordering, tiling, and format inference. Combined with a runtime that adapts its execution to both dense and sparse data, Scorch delivers substantial speedups over hand-written PyTorch Sparse (torch.sparse) operations without sacrificing usability. More importantly, Scorch enables efficient computation of complex sparse operations that lack hand-optimized PyTorch implementations. This flexibility is crucial for exploring novel sparse architectures. We demonstrate Scorch's ease of use and performance gains on diverse deep learning models across multiple domains. With only minimal code changes, Scorch achieves 1.05-5.78x speedups over PyTorch Sparse on end-to-end tasks. Scorch's seamless integration and performance gains make it a valuable addition to the PyTorch ecosystem. We believe Scorch will enable wider exploration of sparsity as a tool for scaling deep learning and inform the development of other sparse libraries.
comment: 25 pages, 8 figures
♻ ☆ EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning ICML 2024
Recent advancements in self-supervised audio-visual representation learning have demonstrated its potential to capture rich and comprehensive representations. However, despite the advantages of data augmentation verified in many learning methods, audio-visual learning has struggled to fully harness these benefits, as augmentations can easily disrupt the correspondence between input pairs. To address this limitation, we introduce EquiAV, a novel framework that leverages equivariance for audio-visual contrastive learning. Our approach begins with extending equivariance to audio-visual learning, facilitated by a shared attention-based transformation predictor. It enables the aggregation of features from diverse augmentations into a representative embedding, providing robust supervision. Notably, this is achieved with minimal computational overhead. Extensive ablation studies and qualitative results verify the effectiveness of our method. EquiAV outperforms previous works across various audio-visual benchmarks. The code is available on https://github.com/JongSuk1/EquiAV.
comment: 15 pages, 3 figures; Accepted to ICML 2024 (camera ready version)
♻ ☆ Exploring ChatGPT's Capabilities on Vulnerability Management USENIX Security 2024
Recently, ChatGPT has attracted great attention from the code analysis domain. Prior works show that ChatGPT has the capabilities of processing foundational code analysis tasks, such as abstract syntax tree generation, which indicates the potential of using ChatGPT to comprehend code syntax and static behaviors. However, it is unclear whether ChatGPT can complete more complicated real-world vulnerability management tasks, such as the prediction of security relevance and patch correctness, which require an all-encompassing understanding of various aspects, including code syntax, program semantics, and related manual comments. In this paper, we explore ChatGPT's capabilities on 6 tasks involving the complete vulnerability management process with a large-scale dataset containing 70,346 samples. For each task, we compare ChatGPT against SOTA approaches, investigate the impact of different prompts, and explore the difficulties. The results suggest promising potential in leveraging ChatGPT to assist vulnerability management. One notable example is ChatGPT's proficiency in tasks like generating titles for software bug reports. Furthermore, our findings reveal the difficulties encountered by ChatGPT and shed light on promising future directions. For instance, directly providing random demonstration examples in the prompt cannot consistently guarantee good performance in vulnerability management. By contrast, leveraging ChatGPT in a self-heuristic way -- extracting expertise from demonstration examples itself and integrating the extracted expertise in the prompt is a promising research direction. Besides, ChatGPT may misunderstand and misuse the information in the prompt. Consequently, effectively guiding ChatGPT to focus on helpful information rather than the irrelevant content is still an open problem.
comment: Accepted by USENIX Security 2024
♻ ☆ MemDPT: Differential Privacy for Memory Efficient Language Models
Large language models have consistently demonstrated remarkable performance across a wide spectrum of applications. Nonetheless, the deployment of these models can inadvertently expose user privacy to potential risks. The substantial memory demands of these models during training represent a significant resource consumption challenge. The sheer size of these models imposes a considerable burden on memory resources, which is a matter of significant concern in practice. In this paper, we present an innovative training framework MemDPT that not only reduces the memory cost of large language models but also places a strong emphasis on safeguarding user data privacy. MemDPT provides edge network and reverse network designs to accommodate various differential privacy memory-efficient fine-tuning schemes. Our approach not only achieves $2 \sim 3 \times$ memory optimization but also provides robust privacy protection, ensuring that user data remains secure and confidential. Extensive experiments have demonstrated that MemDPT can effectively provide differential privacy efficient fine-tuning across various task scenarios.
comment: 12 pages first version
♻ ☆ TrustSQL: Benchmarking Text-to-SQL Reliability with Penalty-Based Scoring
Text-to-SQL enables users to interact with databases using natural language, simplifying the retrieval and synthesis of information. Despite the remarkable success of large language models (LLMs) in translating natural language questions into SQL queries, widespread deployment remains limited due to two primary challenges. First, the effective use of text-to-SQL models depends on users' understanding of the model's capabilities-the scope of questions the model can correctly answer. Second, the absence of abstention mechanisms can lead to incorrect SQL generation going unnoticed, thereby undermining trust in the model's output. To enable wider deployment, it is crucial to address these challenges in model design and enhance model evaluation to build trust in the model's output. To this end, we introduce TrustSQL, a novel comprehensive benchmark designed to evaluate text-to-SQL reliability-defined as a model's ability to correctly handle any type of input question by generating correct SQL queries for feasible questions and abstaining from generating infeasible ones (e.g., due to schema incompatibility or functionalities beyond SQL). We evaluate existing methods using a novel penalty-based scoring metric with two modeling approaches: (1) pipeline-based methods combining SQL generators with infeasible question detectors and SQL error detectors for abstention; and (2) unified methods using a single model for the entire task. Our experimental results reveal that achieving high scores under severe penalties requires significant effort and provide a new perspective on developing text-to-SQL models for safer deployment. TrustSQL is available at https://github.com/glee4810/TrustSQL.
comment: under review
♻ ☆ Repeat-Aware Neighbor Sampling for Dynamic Graph Learning KDD 2024
Dynamic graph learning equips the edges with time attributes and allows multiple links between two nodes, which is a crucial technology for understanding evolving data scenarios like traffic prediction and recommendation systems. Existing works obtain the evolving patterns mainly depending on the most recent neighbor sequences. However, we argue that whether two nodes will have interaction with each other in the future is highly correlated with the same interaction that happened in the past. Only considering the recent neighbors overlooks the phenomenon of repeat behavior and fails to accurately capture the temporal evolution of interactions. To fill this gap, this paper presents RepeatMixer, which considers evolving patterns of first and high-order repeat behavior in the neighbor sampling strategy and temporal information learning. Firstly, we define the first-order repeat-aware nodes of the source node as the destination nodes that have interacted historically and extend this concept to high orders as nodes in the destination node's high-order neighbors. Then, we extract neighbors of the source node that interacted before the appearance of repeat-aware nodes with a slide window strategy as its neighbor sequence. Next, we leverage both the first and high-order neighbor sequences of source and destination nodes to learn temporal patterns of interactions via an MLP-based encoder. Furthermore, considering the varying temporal patterns on different orders, we introduce a time-aware aggregation mechanism that adaptively aggregates the temporal representations from different orders based on the significance of their interaction time sequences. Experimental results demonstrate the superiority of RepeatMixer over state-of-the-art models in link prediction tasks, underscoring the effectiveness of the proposed repeat-aware neighbor sampling strategy.
comment: Accepted by KDD 2024, Research Track
♻ ☆ OR-Bench: An Over-Refusal Benchmark for Large Language Models
Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs. While significant research focuses on mitigating harmful content generation, the enhanced safety often come with the side effect of over-refusal, where LLMs may reject innocuous prompts and become less helpful. Although the issue of over-refusal has been empirically observed, a systematic measurement is challenging due to the difficulty of crafting prompts that appear harmful but are benign. This study proposes a novel method for automatically generating large-scale sets of "seemingly toxic prompts" (benign prompts likely rejected by LLMs). Leveraging this technique, we introduce OR-Bench, the first large-scale over-refusal benchmark. OR-Bench comprises 80,000 seemingly toxic prompts across 10 common rejection categories, a subset of around 1,000 hard prompts that are challenging even for state-of-the-art LLMs, and an additional 600 toxic prompts to prevent indiscriminate responses. We then conduct a comprehensive study to measure the over-refusal of 25 popular LLMs across 8 model families. Our datasets are available at https://huggingface.co/datasets/bench-llm/or-bench and the demo can be found at https://huggingface.co/spaces/bench-llm/or-bench. We hope this benchmark can help the community develop better safety aligned models.
comment: version 2, 10 pages main, 22 pages total
♻ ☆ Robust $Q$-learning Algorithm for Markov Decision Processes under Wasserstein Uncertainty
We present a novel $Q$-learning algorithm tailored to solve distributionally robust Markov decision problems where the corresponding ambiguity set of transition probabilities for the underlying Markov decision process is a Wasserstein ball around a (possibly estimated) reference measure. We prove convergence of the presented algorithm and provide several examples also using real data to illustrate both the tractability of our algorithm as well as the benefits of considering distributional robustness when solving stochastic optimal control problems, in particular when the estimated distributions turn out to be misspecified in practice.
♻ ☆ Identification of quantum entanglement with Siamese convolutional neural networks and semi-supervised learning
Quantum entanglement is a fundamental property commonly used in various quantum information protocols and algorithms. Nonetheless, the problem of identifying entanglement has still not reached a general solution for systems larger than $2\times3$. In this study, we use deep convolutional NNs, a type of supervised machine learning, to identify quantum entanglement for any bipartition in a 3-qubit system. We demonstrate that training the model on synthetically generated datasets of random density matrices excluding challenging positive-under-partial-transposition entangled states (PPTES), which cannot be identified (and correctly labeled) in general, leads to good model accuracy even for PPTES states, that were outside the training data. Our aim is to enhance the model's generalization on PPTES. By applying entanglement-preserving symmetry operations through a triple Siamese network trained in a semi-supervised manner, we improve the model's accuracy and ability to recognize PPTES. Moreover, by constructing an ensemble of Siamese models, even better generalization is observed, in analogy with the idea of finding separate types of entanglement witnesses for different classes of states.
comment: Updated version with improved models; 12 pages, 7 figures, 4 tables
Information Retrieval 22
☆ LARP: Language Audio Relational Pre-training for Cold-Start Playlist Continuation
As online music consumption increasingly shifts towards playlist-based listening, the task of playlist continuation, in which an algorithm suggests songs to extend a playlist in a personalized and musically cohesive manner, has become vital to the success of music streaming. Currently, many existing playlist continuation approaches rely on collaborative filtering methods to perform recommendation. However, such methods will struggle to recommend songs that lack interaction data, an issue known as the cold-start problem. Current approaches to this challenge design complex mechanisms for extracting relational signals from sparse collaborative data and integrating them into content representations. However, these approaches leave content representation learning out of scope and utilize frozen, pre-trained content models that may not be aligned with the distribution or format of a specific musical setting. Furthermore, even the musical state-of-the-art content modules are either (1) incompatible with the cold-start setting or (2) unable to effectively integrate cross-modal and relational signals. In this paper, we introduce LARP, a multi-modal cold-start playlist continuation model, to effectively overcome these limitations. LARP is a three-stage contrastive learning framework that integrates both multi-modal and relational signals into its learned representations. Our framework uses increasing stages of task-specific abstraction: within-track (language-audio) contrastive loss, track-track contrastive loss, and track-playlist contrastive loss. Experimental results on two publicly available datasets demonstrate the efficacy of LARP over uni-modal and multi-modal models for playlist continuation in a cold-start setting. Code and dataset are released at: https://github.com/Rsalganik1123/LARP.
☆ Reproducibility in Machine Learning-based Research: Overview, Barriers and Drivers
Research in various fields is currently experiencing challenges regarding the reproducibility of results. This problem is also prevalent in machine learning (ML) research. The issue arises primarily due to unpublished data and/or source code and the sensitivity of ML training conditions. Although different solutions have been proposed to address this issue, such as using ML platforms, the level of reproducibility in ML-driven research remains unsatisfactory. Therefore, in this article, we discuss the reproducibility of ML-driven research with three main aims: (i) identify the barriers to reproducibility when applying ML in research as well as categorize the barriers to different types of reproducibility (description, code, data, and experiment reproducibility), (ii) identify potential drivers such as tools, practices, and interventions that support ML reproducibility as well as distinguish between technology-driven drivers, procedural drivers, and drivers related to awareness and education, and (iii) map the drivers to the barriers. With this work, we hope to provide insights and contribute to the decision-making process regarding the adoption of different solutions to support ML reproducibility.
comment: Pre-print of submission planned to the AI Magazine
☆ Optimizing Novelty of Top-k Recommendations using Large Language Models and Reinforcement Learning KDD 2024
Given an input query, a recommendation model is trained using user feedback data (e.g., click data) to output a ranked list of items. In real-world systems, besides accuracy, an important consideration for a new model is novelty of its top-k recommendations w.r.t. an existing deployed model. However, novelty of top-k items is a difficult goal to optimize a model for, since it involves a non-differentiable sorting operation on the model's predictions. Moreover, novel items, by definition, do not have any user feedback data. Given the semantic capabilities of large language models, we address these problems using a reinforcement learning (RL) formulation where large language models provide feedback for the novel items. However, given millions of candidate items, the sample complexity of a standard RL algorithm can be prohibitively high. To reduce sample complexity, we reduce the top-k list reward to a set of item-wise rewards and reformulate the state space to consist of tuples such that the action space is reduced to a binary decision; and show that this reformulation results in a significantly lower complexity when the number of items is large. We evaluate the proposed algorithm on improving novelty for a query-ad recommendation task on a large-scale search engine. Compared to supervised finetuning on recent pairs, the proposed RL-based algorithm leads to significant novelty gains with minimal loss in recall. We obtain similar results on the ORCAS query-webpage matching dataset and a product recommendation dataset based on Amazon reviews.
comment: Accepted at KDD 2024
☆ DIRAS: Efficient LLM-Assisted Annotation of Document Relevance in Retrieval Augmented Generation
Retrieval Augmented Generation (RAG) is widely employed to ground responses to queries on domain-specific documents. But do RAG implementations leave out important information or excessively include irrelevant information? To allay these concerns, it is necessary to annotate domain-specific benchmarks to evaluate information retrieval (IR) performance, as relevance definitions vary across queries and domains. Furthermore, such benchmarks should be cost-efficiently annotated to avoid annotation selection bias. In this paper, we propose DIRAS (Domain-specific Information Retrieval Annotation with Scalability), a manual-annotation-free schema that fine-tunes open-sourced LLMs to annotate relevance labels with calibrated relevance probabilities. Extensive evaluation shows that DIRAS fine-tuned models achieve GPT-4-level performance on annotating and ranking unseen (query, document) pairs, and is helpful for real-world RAG development.
☆ An Investigation of Prompt Variations for Zero-shot LLM-based Rankers
We provide a systematic understanding of the impact of specific components and wordings used in prompts on the effectiveness of rankers based on zero-shot Large Language Models (LLMs). Several zero-shot ranking methods based on LLMs have recently been proposed. Among many aspects, methods differ across (1) the ranking algorithm they implement, e.g., pointwise vs. listwise, (2) the backbone LLMs used, e.g., GPT3.5 vs. FLAN-T5, (3) the components and wording used in prompts, e.g., the use or not of role-definition (role-playing) and the actual words used to express this. It is currently unclear whether performance differences are due to the underlying ranking algorithm, or because of spurious factors such as better choice of words used in prompts. This confusion risks to undermine future research. Through our large-scale experimentation and analysis, we find that ranking algorithms do contribute to differences between methods for zero-shot LLM ranking. However, so do the LLM backbones -- but even more importantly, the choice of prompt components and wordings affect the ranking. In fact, in our experiments, we find that, at times, these latter elements have more impact on the ranker's effectiveness than the actual ranking algorithms, and that differences among ranking methods become more blurred when prompt variations are considered.
☆ Taxonomy-Guided Zero-Shot Recommendations with LLMs
With the emergence of large language models (LLMs) and their ability to perform a variety of tasks, their application in recommender systems (RecSys) has shown promise. However, we are facing significant challenges when deploying LLMs into RecSys, such as limited prompt length, unstructured item information, and un-constrained generation of recommendations, leading to sub-optimal performance. To address these issues, we propose a novel method using a taxonomy dictionary. This method provides a systematic framework for categorizing and organizing items, improving the clarity and structure of item information. By incorporating the taxonomy dictionary into LLM prompts, we achieve efficient token utilization and controlled feature generation, leading to more accurate and contextually relevant recommendations. Our Taxonomy-guided Recommendation (TaxRec) approach features a two-step process: one-time taxonomy categorization and LLM-based recommendation, enabling zero-shot recommendations without the need for domain-specific fine-tuning. Experimental results demonstrate TaxRec significantly enhances recommendation quality compared to traditional zero-shot approaches, showcasing its efficacy as personal recommender with LLMs. Code is available at https://github.com/yueqingliang1/TaxRec.
☆ EAGER: Two-Stream Generative Recommender with Behavior-Semantic Collaboration KDD 2024
Generative retrieval has recently emerged as a promising approach to sequential recommendation, framing candidate item retrieval as an autoregressive sequence generation problem. However, existing generative methods typically focus solely on either behavioral or semantic aspects of item information, neglecting their complementary nature and thus resulting in limited effectiveness. To address this limitation, we introduce EAGER, a novel generative recommendation framework that seamlessly integrates both behavioral and semantic information. Specifically, we identify three key challenges in combining these two types of information: a unified generative architecture capable of handling two feature types, ensuring sufficient and independent learning for each type, and fostering subtle interactions that enhance collaborative information utilization. To achieve these goals, we propose (1) a two-stream generation architecture leveraging a shared encoder and two separate decoders to decode behavior tokens and semantic tokens with a confidence-based ranking strategy; (2) a global contrastive task with summary tokens to achieve discriminative decoding for each type of information; and (3) a semantic-guided transfer task designed to implicitly promote cross-interactions through reconstruction and estimation objectives. We validate the effectiveness of EAGER on four public benchmarks, demonstrating its superior performance compared to existing methods.
comment: Accepted by KDD 2024. Source code available at https://reczoo.github.io/EAGER
☆ Do Not Wait: Learning Re-Ranking Model Without User Feedback At Serving Time in E-Commerce
Recommender systems have been widely used in e-commerce, and re-ranking models are playing an increasingly significant role in the domain, which leverages the inter-item influence and determines the final recommendation lists. Online learning methods keep updating a deployed model with the latest available samples to capture the shifting of the underlying data distribution in e-commerce. However, they depend on the availability of real user feedback, which may be delayed by hours or even days, such as item purchases, leading to a lag in model enhancement. In this paper, we propose a novel extension of online learning methods for re-ranking modeling, which we term LAST, an acronym for Learning At Serving Time. It circumvents the requirement of user feedback by using a surrogate model to provide the instructional signal needed to steer model improvement. Upon receiving an online request, LAST finds and applies a model modification on the fly before generating a recommendation result for the request. The modification is request-specific and transient. It means the modification is tailored to and only to the current request to capture the specific context of the request. After a request, the modification is discarded, which helps to prevent error propagation and stabilizes the online learning procedure since the predictions of the surrogate model may be inaccurate. Most importantly, as a complement to feedback-based online learning methods, LAST can be seamlessly integrated into existing online learning systems to create a more adaptive and responsive recommendation experience. Comprehensive experiments, both offline and online, affirm that LAST outperforms state-of-the-art re-ranking models.
☆ Unifying Graph Convolution and Contrastive Learning in Collaborative Filtering
Graph-based models and contrastive learning have emerged as prominent methods in Collaborative Filtering (CF). While many existing models in CF incorporate these methods in their design, there seems to be a limited depth of analysis regarding the foundational principles behind them. This paper bridges graph convolution, a pivotal element of graph-based models, with contrastive learning through a theoretical framework. By examining the learning dynamics and equilibrium of the contrastive loss, we offer a fresh lens to understand contrastive learning via graph theory, emphasizing its capability to capture high-order connectivity. Building on this analysis, we further show that the graph convolutional layers often used in graph-based models are not essential for high-order connectivity modeling and might contribute to the risk of oversmoothing. Stemming from our findings, we introduce Simple Contrastive Collaborative Filtering (SCCF), a simple and effective algorithm based on a naive embedding model and a modified contrastive loss. The efficacy of the algorithm is demonstrated through extensive experiments across four public datasets. The experiment code is available at \url{https://github.com/wu1hong/SCCF}. \end{abstract}
☆ UpDLRM: Accelerating Personalized Recommendation using Real-World PIM Architecture
Deep Learning Recommendation Models (DLRMs) have gained popularity in recommendation systems due to their effectiveness in handling large-scale recommendation tasks. The embedding layers of DLRMs have become the performance bottleneck due to their intensive needs on memory capacity and memory bandwidth. In this paper, we propose UpDLRM, which utilizes real-world processingin-memory (PIM) hardware, UPMEM DPU, to boost the memory bandwidth and reduce recommendation latency. The parallel nature of the DPU memory can provide high aggregated bandwidth for the large number of irregular memory accesses in embedding lookups, thus offering great potential to reduce the inference latency. To fully utilize the DPU memory bandwidth, we further studied the embedding table partitioning problem to achieve good workload-balance and efficient data caching. Evaluations using real-world datasets show that, UpDLRM achieves much lower inference time for DLRM compared to both CPU-only and CPU-GPU hybrid counterparts.
☆ Evaluating RAG-Fusion with RAGElo: an Automated Elo-based Framework SIGIR24
Challenges in the automated evaluation of Retrieval-Augmented Generation (RAG) Question-Answering (QA) systems include hallucination problems in domain-specific knowledge and the lack of gold standard benchmarks for company internal tasks. This results in difficulties in evaluating RAG variations, like RAG-Fusion (RAGF), in the context of a product QA task at Infineon Technologies. To solve these problems, we propose a comprehensive evaluation framework, which leverages Large Language Models (LLMs) to generate large datasets of synthetic queries based on real user queries and in-domain documents, uses LLM-as-a-judge to rate retrieved documents and answers, evaluates the quality of answers, and ranks different variants of Retrieval-Augmented Generation (RAG) agents with RAGElo's automated Elo-based competition. LLM-as-a-judge rating of a random sample of synthetic queries shows a moderate, positive correlation with domain expert scoring in relevance, accuracy, completeness, and precision. While RAGF outperformed RAG in Elo score, a significance analysis against expert annotations also shows that RAGF significantly outperforms RAG in completeness, but underperforms in precision. In addition, Infineon's RAGF assistant demonstrated slightly higher performance in document relevance based on MRR@5 scores. We find that RAGElo positively aligns with the preferences of human annotators, though due caution is still required. Finally, RAGF's approach leads to more complete answers based on expert annotations and better answers overall based on RAGElo's evaluation criteria.
comment: Accepted to LLM4Eval @ SIGIR24
☆ ChatGPT as Research Scientist: Probing GPT's Capabilities as a Research Librarian, Research Ethicist, Data Generator and Data Predictor
How good a research scientist is ChatGPT? We systematically probed the capabilities of GPT-3.5 and GPT-4 across four central components of the scientific process: as a Research Librarian, Research Ethicist, Data Generator, and Novel Data Predictor, using psychological science as a testing field. In Study 1 (Research Librarian), unlike human researchers, GPT-3.5 and GPT-4 hallucinated, authoritatively generating fictional references 36.0% and 5.4% of the time, respectively, although GPT-4 exhibited an evolving capacity to acknowledge its fictions. In Study 2 (Research Ethicist), GPT-4 (though not GPT-3.5) proved capable of detecting violations like p-hacking in fictional research protocols, correcting 88.6% of blatantly presented issues, and 72.6% of subtly presented issues. In Study 3 (Data Generator), both models consistently replicated patterns of cultural bias previously discovered in large language corpora, indicating that ChatGPT can simulate known results, an antecedent to usefulness for both data generation and skills like hypothesis generation. Contrastingly, in Study 4 (Novel Data Predictor), neither model was successful at predicting new results absent in their training data, and neither appeared to leverage substantially new information when predicting more versus less novel outcomes. Together, these results suggest that GPT is a flawed but rapidly improving librarian, a decent research ethicist already, capable of data generation in simple domains with known characteristics but poor at predicting novel patterns of empirical data to aid future experimentation.
comment: Main article is 14 pages, 1 table. Includes SI Appendix: 26 pages, 12 tables, 2 figures. Total: 40 pages, 13 tables, 2 figures. Under revised review at PNAS
☆ RE-AdaptIR: Improving Information Retrieval through Reverse Engineered Adaptation
Large language models (LLMs) fine-tuned for text-retrieval have demonstrated state-of-the-art results across several information retrieval (IR) benchmarks. However, supervised training for improving these models requires numerous labeled examples, which are generally unavailable or expensive to acquire. In this work, we explore the effectiveness of extending reverse engineered adaptation to the context of information retrieval (RE-AdaptIR). We use RE-AdaptIR to improve LLM-based IR models using only unlabeled data. We demonstrate improved performance both in training domains as well as zero-shot in domains where the models have seen no queries. We analyze performance changes in various fine-tuning scenarios and offer findings of immediate use to practitioners.
☆ TTQA-RS- A break-down prompting approach for Multi-hop Table-Text Question Answering with Reasoning and Summarization
Question answering (QA) over tables and text has gained much popularity over the years. Multi-hop table-text QA requires multiple hops between the table and text, making it a challenging QA task. Although several works have attempted to solve the table-text QA task, most involve training the models and requiring labeled data. In this paper, we have proposed a model - TTQA-RS: A break-down prompting approach for Multi-hop Table-Text Question Answering with Reasoning and Summarization. Our model uses augmented knowledge including table-text summary with decomposed sub-question with answer for a reasoning-based table-text QA. Using open-source language models our model outperformed all existing prompting methods for table-text QA tasks on existing table-text QA datasets like HybridQA and OTT-QA's development set. Our results are comparable with the training-based state-of-the-art models, demonstrating the potential of prompt-based approaches using open-source LLMs. Additionally, by using GPT-4 with LLaMA3-70B, our model achieved state-of-the-art performance for prompting-based methods on multi-hop table-text QA.
♻ ☆ Prompt Perturbation in Retrieval-Augmented Generation based Large Language Models
The robustness of large language models (LLMs) becomes increasingly important as their use rapidly grows in a wide range of domains. Retrieval-Augmented Generation (RAG) is considered as a means to improve the trustworthiness of text generation from LLMs. However, how the outputs from RAG-based LLMs are affected by slightly different inputs is not well studied. In this work, we find that the insertion of even a short prefix to the prompt leads to the generation of outputs far away from factually correct answers. We systematically evaluate the effect of such prefixes on RAG by introducing a novel optimization technique called Gradient Guided Prompt Perturbation (GGPP). GGPP achieves a high success rate in steering outputs of RAG-based LLMs to targeted wrong answers. It can also cope with instructions in the prompts requesting to ignore irrelevant context. We also exploit LLMs' neuron activation difference between prompts with and without GGPP perturbations to give a method that improves the robustness of RAG-based LLMs through a highly effective detector trained on neuron activation triggered by GGPP generated prompts. Our evaluation on open-sourced LLMs demonstrates the effectiveness of our methods.
comment: 12 pages, 9 figures
♻ ☆ LLM-enhanced Reranking in Recommender Systems
Reranking is a critical component in recommender systems, playing an essential role in refining the output of recommendation algorithms. Traditional reranking models have focused predominantly on accuracy, but modern applications demand consideration of additional criteria such as diversity and fairness. Existing reranking approaches often fail to harmonize these diverse criteria effectively at the model level. Moreover, these models frequently encounter challenges with scalability and personalization due to their complexity and the varying significance of different reranking criteria in diverse scenarios. In response, we introduce a comprehensive reranking framework enhanced by LLM, designed to seamlessly integrate various reranking criteria while maintaining scalability and facilitating personalized recommendations. This framework employs a fully connected graph structure, allowing the LLM to simultaneously consider multiple aspects such as accuracy, diversity, and fairness through a coherent Chain-of-Thought (CoT) process. A customizable input mechanism is also integrated, enabling the tuning of the language model's focus to meet specific reranking needs. We validate our approach using three popular public datasets, where our framework demonstrates superior performance over existing state-of-the-art reranking models in balancing multiple criteria. The code for this implementation is publicly available.
♻ ☆ Evaluating Impact of User-Cluster Targeted Attacks in Matrix Factorisation Recommenders
In practice, users of a Recommender System (RS) fall into a few clusters based on their preferences. In this work, we conduct a systematic study on user-cluster targeted data poisoning attacks on Matrix Factorisation (MF) based RS, where an adversary injects fake users with falsely crafted user-item feedback to promote an item to a specific user cluster. We analyse how user and item feature matrices change after data poisoning attacks and identify the factors that influence the effectiveness of the attack on these feature matrices. We demonstrate that the adversary can easily target specific user clusters with minimal effort and that some items are more susceptible to attacks than others. Our theoretical analysis has been validated by the experimental results obtained from two real-world datasets. Our observations from the study could serve as a motivating point to design a more robust RS.
♻ ☆ Finetuning Large Language Model for Personalized Ranking
Large Language Models (LLMs) have demonstrated remarkable performance across various domains, motivating researchers to investigate their potential use in recommendation systems. However, directly applying LLMs to recommendation tasks has proven challenging due to the significant disparity between the data used for pre-training LLMs and the specific requirements of recommendation tasks. In this study, we introduce Direct Multi-Preference Optimization (DMPO), a streamlined framework designed to bridge the gap and enhance the alignment of LLMs for recommendation tasks. DMPO enhances the performance of LLM-based recommenders by simultaneously maximizing the probability of positive samples and minimizing the probability of multiple negative samples. We conducted experimental evaluations to compare DMPO against traditional recommendation methods and other LLM-based recommendation approaches. The results demonstrate that DMPO significantly improves the recommendation capabilities of LLMs across three real-world public datasets in few-shot scenarios. Additionally, the experiments indicate that DMPO exhibits superior generalization ability in cross-domain recommendations. A case study elucidates the reasons behind these consistent improvements and also underscores DMPO's potential as an explainable recommendation system.
♻ ☆ LLM-Ensemble: Optimal Large Language Model Ensemble Method for E-commerce Product Attribute Value Extraction SIGIR 2024
Product attribute value extraction is a pivotal component in Natural Language Processing (NLP) and the contemporary e-commerce industry. The provision of precise product attribute values is fundamental in ensuring high-quality recommendations and enhancing customer satisfaction. The recently emerging Large Language Models (LLMs) have demonstrated state-of-the-art performance in numerous attribute extraction tasks, without the need for domain-specific training data. Nevertheless, varying strengths and weaknesses are exhibited by different LLMs due to the diversity in data, architectures, and hyperparameters. This variation makes them complementary to each other, with no single LLM dominating all others. Considering the diverse strengths and weaknesses of LLMs, it becomes necessary to develop an ensemble method that leverages their complementary potentials. In this paper, we propose a novel algorithm called LLM-ensemble to ensemble different LLMs' outputs for attribute value extraction. We iteratively learn the weights for different LLMs to aggregate the labels with weights to predict the final attribute value. Not only can our proposed method be proven theoretically optimal, but it also ensures efficient computation, fast convergence, and safe deployment. We have also conducted extensive experiments with various state-of-the-art LLMs, including Llama2-13B, Llama2-70B, PaLM-2, GPT-3.5, and GPT-4, on Walmart's internal data. Our offline metrics demonstrate that the LLM-ensemble method outperforms all the state-of-the-art single LLMs on Walmart's internal dataset. This method has been launched in several production models, leading to improved Gross Merchandise Volume (GMV), Click-Through Rate (CTR), Conversion Rate (CVR), and Add-to-Cart Rate (ATC).
comment: SIGIR 2024 industry track
♻ ☆ Improving Neural Topic Models with Wasserstein Knowledge Distillation ECIR 2023
Topic modeling is a dominant method for exploring document collections on the web and in digital libraries. Recent approaches to topic modeling use pretrained contextualized language models and variational autoencoders. However, large neural topic models have a considerable memory footprint. In this paper, we propose a knowledge distillation framework to compress a contextualized topic model without loss in topic quality. In particular, the proposed distillation objective is to minimize the cross-entropy of the soft labels produced by the teacher and the student models, as well as to minimize the squared 2-Wasserstein distance between the latent distributions learned by the two models. Experiments on two publicly available datasets show that the student trained with knowledge distillation achieves topic coherence much higher than that of the original student model, and even surpasses the teacher while containing far fewer parameters than the teacher's. The distilled model also outperforms several other competitive topic models on topic coherence.
comment: Accepted at ECIR 2023
♻ ☆ Multi-Behavior Collaborative Filtering with Partial Order Graph Convolutional Networks KDD2024
Representing information of multiple behaviors in the single graph collaborative filtering (CF) vector has been a long-standing challenge. This is because different behaviors naturally form separate behavior graphs and learn separate CF embeddings. Existing models merge the separate embeddings by appointing the CF embeddings for some behaviors as the primary embedding and utilizing other auxiliaries to enhance the primary embedding. However, this approach often results in the joint embedding performing well on the main tasks but poorly on the auxiliary ones. To address the problem arising from the separate behavior graphs, we propose the concept of Partial Order Recommendation Graphs (POG). POG defines the partial order relation of multiple behaviors and models behavior combinations as weighted edges to merge separate behavior graphs into a joint POG. Theoretical proof verifies that POG can be generalized to any given set of multiple behaviors. Based on POG, we propose the tailored Partial Order Graph Convolutional Networks (POGCN) that convolute neighbors' information while considering the behavior relations between users and items. POGCN also introduces a partial-order BPR sampling strategy for efficient and effective multiple-behavior CF training. POGCN has been successfully deployed on the homepage of Alibaba for two months, providing recommendation services for over one billion users. Extensive offline experiments conducted on three public benchmark datasets demonstrate that POGCN outperforms state-of-the-art multi-behavior baselines across all types of behaviors. Furthermore, online A/B tests confirm the superiority of POGCN in billion-scale recommender systems.
comment: Accepted by KDD2024
♻ ☆ Multi-Level Sequence Denoising with Cross-Signal Contrastive Learning for Sequential Recommendation
Sequential recommender systems (SRSs) aim to suggest next item for a user based on her historical interaction sequences. Recently, many research efforts have been devoted to attenuate the influence of noisy items in sequences by either assigning them with lower attention weights or discarding them directly. The major limitation of these methods is that the former would still prone to overfit noisy items while the latter may overlook informative items. To the end, in this paper, we propose a novel model named Multi-level Sequence Denoising with Cross-signal Contrastive Learning (MSDCCL) for sequential recommendation. To be specific, we first introduce a target-aware user interest extractor to simultaneously capture users' long and short term interest with the guidance of target items. Then, we develop a multi-level sequence denoising module to alleviate the impact of noisy items by employing both soft and hard signal denoising strategies. Additionally, we extend existing curriculum learning by simulating the learning pattern of human beings. It is worth noting that our proposed model can be seamlessly integrated with a majority of existing recommendation models and significantly boost their effectiveness. Experimental studies on five public datasets are conducted and the results demonstrate that the proposed MSDCCL is superior to the state-of-the-art baselines. The source code is publicly available at https://github.com/lalunex/MSDCCL/tree/main.
Multimedia 8
☆ MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding
The advent of large vision-language models (LVLMs) has spurred research into their applications in multi-modal contexts, particularly in video understanding. Traditional VideoQA benchmarks, despite providing quantitative metrics, often fail to encompass the full spectrum of video content and inadequately assess models' temporal comprehension. To address these limitations, we introduce MMBench-Video, a quantitative benchmark designed to rigorously evaluate LVLMs' proficiency in video understanding. MMBench-Video incorporates lengthy videos from YouTube and employs free-form questions, mirroring practical use cases. The benchmark is meticulously crafted to probe the models' temporal reasoning skills, with all questions human-annotated according to a carefully constructed ability taxonomy. We employ GPT-4 for automated assessment, demonstrating superior accuracy and robustness over earlier LLM-based evaluations. Utilizing MMBench-Video, we have conducted comprehensive evaluations that include both proprietary and open-source LVLMs for images and videos. MMBench-Video stands as a valuable resource for the research community, facilitating improved evaluation of LVLMs and catalyzing progress in the field of video understanding. The evalutation code of MMBench-Video will be integrated into VLMEvalKit: https://github.com/open-compass/VLMEvalKit.
☆ Proceedings of The second international workshop on eXplainable AI for the Arts (XAIxArts)
This second international workshop on explainable AI for the Arts (XAIxArts) brought together a community of researchers in HCI, Interaction Design, AI, explainable AI (XAI), and digital arts to explore the role of XAI for the Arts. Workshop held at the 16th ACM Conference on Creativity and Cognition (C&C 2024), Chicago, USA.
☆ A Multi-Stream Fusion Approach with One-Class Learning for Audio-Visual Deepfake Detection
This paper addresses the challenge of developing a robust audio-visual deepfake detection model. In practical use cases, new generation algorithms are continually emerging, and these algorithms are not encountered during the development of detection methods. This calls for the generalization ability of the method. Additionally, to ensure the credibility of detection methods, it is beneficial for the model to interpret which cues from the video indicate it is fake. Motivated by these considerations, we then propose a multi-stream fusion approach with one-class learning as a representation-level regularization technique. We study the generalization problem of audio-visual deepfake detection by creating a new benchmark by extending and re-splitting the existing FakeAVCeleb dataset. The benchmark contains four categories of fake video(Real Audio-Fake Visual, Fake Audio-Fake Visual, Fake Audio-Real Visual, and unsynchronized video). The experimental results show that our approach improves the model's detection of unseen attacks by an average of 7.31% across four test sets, compared to the baseline model. Additionally, our proposed framework offers interpretability, indicating which modality the model identifies as fake.
☆ Towards Event-oriented Long Video Understanding
With the rapid development of video Multimodal Large Language Models (MLLMs), numerous benchmarks have been proposed to assess their video understanding capability. However, due to the lack of rich events in the videos, these datasets may suffer from the short-cut bias that the answers can be deduced from a few frames, without the need to watch the entire video. To address this issue, we introduce Event-Bench, an event-oriented long video understanding benchmark built on existing datasets and human annotations. Event-Bench includes six event-related tasks and 2,190 test instances to comprehensively evaluate video event understanding ability. Additionally, we propose Video Instruction Merging~(VIM), a cost-effective method that enhances video MLLMs using merged, event-intensive video instructions, addressing the scarcity of human-annotated, event-intensive data. Extensive experiments show that the best-performing model, GPT-4o, achieves an overall accuracy of 53.33, significantly outperforming the best open-source model by 41.42%. Leveraging an effective instruction synthesis method and an adaptive model architecture, VIM surpasses both state-of-the-art open-source models and GPT-4V on the Event-Bench. All code, data, and models are publicly available at https://github.com/RUCAIBox/Event-Bench.
comment: Work on progress
☆ PIN: A Knowledge-Intensive Dataset for Paired and Interleaved Multimodal Documents
Recent advancements in Large Multimodal Models (LMMs) have leveraged extensive multimodal datasets to enhance capabilities in complex knowledge-driven tasks. However, persistent challenges in perceptual and reasoning errors limit their efficacy, particularly in interpreting intricate visual data and deducing multimodal relationships. Addressing these issues, we introduce a novel dataset format, PIN (Paired and INterleaved multimodal documents), designed to significantly improve both the depth and breadth of multimodal training. The PIN format is built on three foundational principles: knowledge intensity, scalability, and support for diverse training modalities. This innovative format combines markdown files and comprehensive images to enrich training data with a dense knowledge structure and versatile training strategies. We present PIN-14M, an open-source dataset comprising 14 million samples derived from a diverse range of Chinese and English sources, tailored to include complex web and scientific content. This dataset is constructed meticulously to ensure data quality and ethical integrity, aiming to facilitate advanced training strategies and improve model robustness against common multimodal training pitfalls. Our initial results, forming the basis of this technical report, suggest significant potential for the PIN format in refining LMM performance, with plans for future expansions and detailed evaluations of its impact on model capabilities.
♻ ☆ EquiAV: Leveraging Equivariance for Audio-Visual Contrastive Learning ICML 2024
Recent advancements in self-supervised audio-visual representation learning have demonstrated its potential to capture rich and comprehensive representations. However, despite the advantages of data augmentation verified in many learning methods, audio-visual learning has struggled to fully harness these benefits, as augmentations can easily disrupt the correspondence between input pairs. To address this limitation, we introduce EquiAV, a novel framework that leverages equivariance for audio-visual contrastive learning. Our approach begins with extending equivariance to audio-visual learning, facilitated by a shared attention-based transformation predictor. It enables the aggregation of features from diverse augmentations into a representative embedding, providing robust supervision. Notably, this is achieved with minimal computational overhead. Extensive ablation studies and qualitative results verify the effectiveness of our method. EquiAV outperforms previous works across various audio-visual benchmarks. The code is available on https://github.com/JongSuk1/EquiAV.
comment: 15 pages, 3 figures; Accepted to ICML 2024 (camera ready version)
♻ ☆ AdaMesh: Personalized Facial Expressions and Head Poses for Adaptive Speech-Driven 3D Facial Animation
Speech-driven 3D facial animation aims at generating facial movements that are synchronized with the driving speech, which has been widely explored recently. Existing works mostly neglect the person-specific talking style in generation, including facial expression and head pose styles. Several works intend to capture the personalities by fine-tuning modules. However, limited training data leads to the lack of vividness. In this work, we propose AdaMesh, a novel adaptive speech-driven facial animation approach, which learns the personalized talking style from a reference video of about 10 seconds and generates vivid facial expressions and head poses. Specifically, we propose mixture-of-low-rank adaptation (MoLoRA) to fine-tune the expression adapter, which efficiently captures the facial expression style. For the personalized pose style, we propose a pose adapter by building a discrete pose prior and retrieving the appropriate style embedding with a semantic-aware pose style matrix without fine-tuning. Extensive experimental results show that our approach outperforms state-of-the-art methods, preserves the talking style in the reference video, and generates vivid facial animation. The supplementary video and code will be available at https://adamesh.github.io.
comment: Project Page: https://adamesh.github.io
♻ ☆ YouTube SFV+HDR Quality Dataset
The popularity of Short form videos (SFV) has grown dramatically in the past few years, and has become a phenomenal video category with billions of viewers. Meanwhile, High Dynamic Range (HDR) as an advanced feature also becomes more and more popular on video sharing platforms. As a hot topic with huge impact, SFV and HDR bring new questions to video quality research: 1) is SFV+HDR quality assessment significantly different from traditional User Generated Content (UGC) quality assessment? 2) do objective quality metrics designed for traditional UGC still work well for SFV+HDR? To answer the above questions, we created the first large scale SFV+HDR dataset with reliable subjective quality scores, covering 10 popular content categories. Further, we also introduce a general sampling framework to maximize the representativeness of the dataset. We provided a comprehensive analysis of subjective quality scores for Short form SDR and HDR videos, and discuss the reliability of state-of-the-art UGC quality metrics and potential improvements.
comment: Accepted by 2024 IEEE International Conference on Image Processing Dataset link: https://media.withyoutube.com/sfv-hdr
Computer Vision and Pattern Recognition 31
☆ Simultaneous Map and Object Reconstruction
In this paper, we present a method for dynamic surface reconstruction of large-scale urban scenes from LiDAR. Depth-based reconstructions tend to focus on small-scale objects or large-scale SLAM reconstructions that treat moving objects as outliers. We take a holistic perspective and optimize a compositional model of a dynamic scene that decomposes the world into rigidly moving objects and the background. To achieve this, we take inspiration from recent novel view synthesis methods and pose the reconstruction problem as a global optimization, minimizing the distance between our predicted surface and the input LiDAR scans. We show how this global optimization can be decomposed into registration and surface reconstruction steps, which are handled well by off-the-shelf methods without any re-training. By careful modeling of continuous-time motion, our reconstructions can compensate for the rolling shutter effects of rotating LiDAR sensors. This allows for the first system (to our knowledge) that properly motion compensates LiDAR scans for rigidly-moving objects, complementing widely-used techniques for motion compensation of static scenes. Beyond pursuing dynamic reconstruction as a goal in and of itself, we also show that such a system can be used to auto-label partially annotated sequences and produce ground truth annotation for hard-to-label problems such as depth completion and scene flow.
☆ Using Multimodal Large Language Models for Automated Detection of Traffic Safety Critical Events
Traditional approaches to safety event analysis in autonomous systems have relied on complex machine learning models and extensive datasets for high accuracy and reliability. However, the advent of Multimodal Large Language Models (MLLMs) offers a novel approach by integrating textual, visual, and audio modalities, thereby providing automated analyses of driving videos. Our framework leverages the reasoning power of MLLMs, directing their output through context-specific prompts to ensure accurate, reliable, and actionable insights for hazard detection. By incorporating models like Gemini-Pro-Vision 1.5 and Llava, our methodology aims to automate the safety critical events and mitigate common issues such as hallucinations in MLLM outputs. Preliminary results demonstrate the framework's potential in zero-shot learning and accurate scenario analysis, though further validation on larger datasets is necessary. Furthermore, more investigations are required to explore the performance enhancements of the proposed framework through few-shot learning and fine-tuned models. This research underscores the significance of MLLMs in advancing the analysis of the naturalistic driving videos by improving safety-critical event detecting and understanding the interaction with complex environments.
☆ DPO: Dual-Perturbation Optimization for Test-time Adaptation in 3D Object Detection
LiDAR-based 3D object detection has seen impressive advances in recent times. However, deploying trained 3D detectors in the real world often yields unsatisfactory performance when the distribution of the test data significantly deviates from the training data due to different weather conditions, object sizes, \textit{etc}. A key factor in this performance degradation is the diminished generalizability of pre-trained models, which creates a sharp loss landscape during training. Such sharpness, when encountered during testing, can precipitate significant performance declines, even with minor data variations. To address the aforementioned challenges, we propose \textbf{dual-perturbation optimization (DPO)} for \textbf{\underline{T}est-\underline{t}ime \underline{A}daptation in \underline{3}D \underline{O}bject \underline{D}etection (TTA-3OD)}. We minimize the sharpness to cultivate a flat loss landscape to ensure model resiliency to minor data variations, thereby enhancing the generalization of the adaptation process. To fully capture the inherent variability of the test point clouds, we further introduce adversarial perturbation to the input BEV features to better simulate the noisy test environment. As the dual perturbation strategy relies on trustworthy supervision signals, we utilize a reliable Hungarian matcher to filter out pseudo-labels sensitive to perturbations. Additionally, we introduce early Hungarian cutoff to avoid error accumulation from incorrect pseudo-labels by halting the adaptation process. Extensive experiments across three types of transfer tasks demonstrate that the proposed DPO significantly surpasses previous state-of-the-art approaches, specifically on Waymo $\rightarrow$ KITTI, outperforming the most competitive baseline by 57.72\% in $\text{AP}_\text{3D}$ and reaching 91\% of the fully supervised upper bound.
comment: 13 pages, 7 figures
☆ WATT: Weight Average Test-Time Adaption of CLIP
Vision-Language Models (VLMs) such as CLIP have yielded unprecedented performance for zero-shot image classification, yet their generalization capability may still be seriously challenged when confronted to domain shifts. In response, we present Weight Average Test-Time Adaptation (WATT) of CLIP, a pioneering approach facilitating full test-time adaptation (TTA) of this VLM. Our method employs a diverse set of templates for text prompts, augmenting the existing framework of CLIP. Predictions are utilized as pseudo labels for model updates, followed by weight averaging to consolidate the learned information globally. Furthermore, we introduce a text ensemble strategy, enhancing overall test performance by aggregating diverse textual cues. Our findings underscore the efficacy of WATT in enhancing performance across diverse datasets, including CIFAR-10-C, CIFAR-10.1, CIFAR-100-C, VisDA-C, and several other challenging datasets, effectively covering a wide range of domain shifts. Notably, these enhancements are achieved without necessitating additional model transformations or trainable modules. Moreover, compared to other Test-Time Adaptation methods, our approach can operate effectively with just a single image. Highlighting the potential of innovative test-time strategies, this research emphasizes their role in fortifying the adaptability of VLMs. The implementation is available at: \url{https://github.com/Mehrdad-Noori/WATT.git}.
☆ Splatter a Video: Video Gaussian Representation for Versatile Processing
Video representation is a long-standing problem that is crucial for various down-stream tasks, such as tracking,depth prediction,segmentation,view synthesis,and editing. However, current methods either struggle to model complex motions due to the absence of 3D structure or rely on implicit 3D representations that are ill-suited for manipulation tasks. To address these challenges, we introduce a novel explicit 3D representation-video Gaussian representation -- that embeds a video into 3D Gaussians. Our proposed representation models video appearance in a 3D canonical space using explicit Gaussians as proxies and associates each Gaussian with 3D motions for video motion. This approach offers a more intrinsic and explicit representation than layered atlas or volumetric pixel matrices. To obtain such a representation, we distill 2D priors, such as optical flow and depth, from foundation models to regularize learning in this ill-posed setting. Extensive applications demonstrate the versatility of our new video representation. It has been proven effective in numerous video processing tasks, including tracking, consistent video depth and feature refinement, motion and appearance editing, and stereoscopic video generation. Project page: https://sunyangtian.github.io/spatter_a_video_web/
☆ Liveness Detection in Computer Vision: Transformer-based Self-Supervised Learning for Face Anti-Spoofing
Face recognition systems are increasingly used in biometric security for convenience and effectiveness. However, they remain vulnerable to spoofing attacks, where attackers use photos, videos, or masks to impersonate legitimate users. This research addresses these vulnerabilities by exploring the Vision Transformer (ViT) architecture, fine-tuned with the DINO framework. The DINO framework facilitates self-supervised learning, enabling the model to learn distinguishing features from unlabeled data. We compared the performance of the proposed fine-tuned ViT model using the DINO framework against a traditional CNN model, EfficientNet b2, on the face anti-spoofing task. Numerous tests on standard datasets show that the ViT model performs better than the CNN model in terms of accuracy and resistance to different spoofing methods. Additionally, we collected our own dataset from a biometric application to validate our findings further. This study highlights the superior performance of transformer-based architecture in identifying complex spoofing cues, leading to significant advancements in biometric security.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
☆ Locating and measuring marine aquaculture production from space: a computer vision approach in the French Mediterranean
Aquaculture production -- the cultivation of aquatic plants and animals -- has grown rapidly since the 1990s, but sparse, self-reported and aggregate production data limits the effective understanding and monitoring of the industry's trends and potential risks. Building on a manual survey of aquaculture production from remote sensing imagery, we train a computer vision model to identify marine aquaculture cages from aerial and satellite imagery, and generate a spatially explicit dataset of finfish production locations in the French Mediterranean from 2000-2021 that includes 4,010 cages (69m2 average cage area). We demonstrate the value of our method as an easily adaptable, cost-effective approach that can improve the speed and reliability of aquaculture surveys, and enables downstream analyses relevant to researchers and regulators. We illustrate its use to compute independent estimates of production, and develop a flexible framework to quantify uncertainty in these estimates. Overall, our study presents an efficient, scalable and highly adaptable method for monitoring aquaculture production from remote sensing imagery.
☆ MAMA-MIA: A Large-Scale Multi-Center Breast Cancer DCE-MRI Benchmark Dataset with Expert Segmentations
Current research in breast cancer Magnetic Resonance Imaging (MRI), especially with Artificial Intelligence (AI), faces challenges due to the lack of expert segmentations. To address this, we introduce the MAMA-MIA dataset, comprising 1506 multi-center dynamic contrast-enhanced MRI cases with expert segmentations of primary tumors and non-mass enhancement areas. These cases were sourced from four publicly available collections in The Cancer Imaging Archive (TCIA). Initially, we trained a deep learning model to automatically segment the cases, generating preliminary segmentations that significantly reduced expert segmentation time. Sixteen experts, averaging 9 years of experience in breast cancer, then corrected these segmentations, resulting in the final expert segmentations. Additionally, two radiologists conducted a visual inspection of the automatic segmentations to support future quality control studies. Alongside the expert segmentations, we provide 49 harmonized demographic and clinical variables and the pretrained weights of the well-known nnUNet architecture trained using the DCE-MRI full-images and expert segmentations. This dataset aims to accelerate the development and benchmarking of deep learning models and foster innovation in breast cancer diagnostics and treatment planning.
comment: 15 paes, 7 figures, 3 tables
☆ IG-CFAT: An Improved GAN-Based Framework for Effectively Exploiting Transformers in Real-World Image Super-Resolution
In the field of single image super-resolution (SISR), transformer-based models, have demonstrated significant advancements. However, the potential and efficiency of these models in applied fields such as real-world image super-resolution are less noticed and there are substantial opportunities for improvement. Recently, composite fusion attention transformer (CFAT), outperformed previous state-of-the-art (SOTA) models in classic image super-resolution. This paper extends the CFAT model to an improved GAN-based model called IG-CFAT to effectively exploit the performance of transformers in real-world image super-resolution. IG-CFAT incorporates a semantic-aware discriminator to reconstruct image details more accurately, significantly improving perceptual quality. Moreover, our model utilizes an adaptive degradation model to better simulate real-world degradations. Our methodology adds wavelet losses to conventional loss functions of GAN-based super-resolution models to reconstruct high-frequency details more efficiently. Empirical results demonstrate that IG-CFAT sets new benchmarks in real-world image super-resolution, outperforming SOTA models in both quantitative and qualitative metrics.
☆ Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset
A more robust and holistic language-video representation is the key to pushing video understanding forward. Despite the improvement in training strategies, the quality of the language-video dataset is less attention to. The current plain and simple text descriptions and the visual-only focus for the language-video tasks result in a limited capacity in real-world natural language video retrieval tasks where queries are much more complex. This paper introduces a method to automatically enhance video-language datasets, making them more modality and context-aware for more sophisticated representation learning needs, hence helping all downstream tasks. Our multifaceted video captioning method captures entities, actions, speech transcripts, aesthetics, and emotional cues, providing detailed and correlating information from the text side to the video side for training. We also develop an agent-like strategy using language models to generate high-quality, factual textual descriptions, reducing human intervention and enabling scalability. The method's effectiveness in improving language-video representation is evaluated through text-video retrieval using the MSR-VTT dataset and several multi-modal retrieval models.
☆ AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding
AI personal assistants deployed via robots or wearables require embodied understanding to collaborate with humans effectively. However, current Vision-Language Models (VLMs) primarily focus on third-person view videos, neglecting the richness of egocentric perceptual experience. To address this gap, we propose three key contributions. First, we introduce the Egocentric Video Understanding Dataset (EVUD) for training VLMs on video captioning and question answering tasks specific to egocentric videos. Second, we present AlanaVLM, a 7B parameter VLM trained using parameter-efficient methods on EVUD. Finally, we evaluate AlanaVLM's capabilities on OpenEQA, a challenging benchmark for embodied video question answering. Our model achieves state-of-the-art performance, outperforming open-source models including strong Socratic models using GPT-4 as a planner by 3.6%. Additionally, we outperform Claude 3 and Gemini Pro Vision 1.0 and showcase competitive results compared to Gemini Pro 1.5 and GPT-4V, even surpassing the latter in spatial reasoning. This research paves the way for building efficient VLMs that can be deployed in robots or wearables, leveraging embodied video understanding to collaborate seamlessly with humans in everyday tasks, contributing to the next generation of Embodied AI
comment: Code available https://github.com/alanaai/EVUD
☆ NeRF-Feat: 6D Object Pose Estimation using Feature Rendering 3DV 2024
Object Pose Estimation is a crucial component in robotic grasping and augmented reality. Learning based approaches typically require training data from a highly accurate CAD model or labeled training data acquired using a complex setup. We address this by learning to estimate pose from weakly labeled data without a known CAD model. We propose to use a NeRF to learn object shape implicitly which is later used to learn view-invariant features in conjunction with CNN using a contrastive loss. While NeRF helps in learning features that are view-consistent, CNN ensures that the learned features respect symmetry. During inference, CNN is used to predict view-invariant features which can be used to establish correspondences with the implicit 3d model in NeRF. The correspondences are then used to estimate the pose in the reference frame of NeRF. Our approach can also handle symmetric objects unlike other approaches using a similar training setup. Specifically, we learn viewpoint invariant, discriminative features using NeRF which are later used for pose estimation. We evaluated our approach on LM, LM-Occlusion, and T-Less dataset and achieved benchmark accuracy despite using weakly labeled data.
comment: 3DV 2024
☆ LIT: Large Language Model Driven Intention Tracking for Proactive Human-Robot Collaboration -- A Robot Sous-Chef Application CVPR 2024
Large Language Models (LLM) and Vision Language Models (VLM) enable robots to ground natural language prompts into control actions to achieve tasks in an open world. However, when applied to a long-horizon collaborative task, this formulation results in excessive prompting for initiating or clarifying robot actions at every step of the task. We propose Language-driven Intention Tracking (LIT), leveraging LLMs and VLMs to model the human user's long-term behavior and to predict the next human intention to guide the robot for proactive collaboration. We demonstrate smooth coordination between a LIT-based collaborative robot and the human user in collaborative cooking tasks.
comment: Spotlight Presentation at the 3rd Workshop on Computer Vision in the Wild at CVPR 2024. Also accepted by the 5th Annual Embodied AI Workshop at CVPR 2024
☆ A Primal-Dual Framework for Transformers and Neural Networks ICLR 2023
Self-attention is key to the remarkable success of transformers in sequence modeling tasks including many applications in natural language processing and computer vision. Like neural network layers, these attention mechanisms are often developed by heuristics and experience. To provide a principled framework for constructing attention layers in transformers, we show that the self-attention corresponds to the support vector expansion derived from a support vector regression problem, whose primal formulation has the form of a neural network layer. Using our framework, we derive popular attention layers used in practice and propose two new attentions: 1) the Batch Normalized Attention (Attention-BN) derived from the batch normalization layer and 2) the Attention with Scaled Head (Attention-SH) derived from using less training data to fit the SVR model. We empirically demonstrate the advantages of the Attention-BN and Attention-SH in reducing head redundancy, increasing the model's accuracy, and improving the model's efficiency in a variety of practical applications including image and time-series classification.
comment: Accepted to ICLR 2023, 26 pages, 4 figures, 14 tables
☆ Elliptical Attention
Pairwise dot-product self-attention is key to the success of transformers that achieve state-of-the-art performance across a variety of applications in language and vision. This dot-product self-attention computes attention weights among the input tokens using Euclidean distance, which makes the model prone to representation collapse and vulnerable to contaminated samples. In this paper, we propose using a Mahalanobis distance metric for computing the attention weights to stretch the underlying feature space in directions of high contextual relevance. In particular, we define a hyper-ellipsoidal neighborhood around each query to increase the attention weights of the tokens lying in the contextually important directions. We term this novel class of attention Elliptical Attention. Our Elliptical Attention provides two benefits: 1) reducing representation collapse and 2) enhancing the model's robustness as the Elliptical Attention pays more attention to contextually relevant information rather than focusing on some small subset of informative features. We empirically demonstrate the advantages of Elliptical Attention over the baseline dot-product attention and state-of-the-art attention methods on various practical tasks, including object classification, image segmentation, and language modeling across different data modalities.
comment: 38 pages, 7 figures, 12 tables
☆ Through the Theory of Mind's Eye: Reading Minds with Multimodal Video Large Language Models
Can large multimodal models have a human-like ability for emotional and social reasoning, and if so, how does it work? Recent research has discovered emergent theory-of-mind (ToM) reasoning capabilities in large language models (LLMs). LLMs can reason about people's mental states by solving various text-based ToM tasks that ask questions about the actors' ToM (e.g., human belief, desire, intention). However, human reasoning in the wild is often grounded in dynamic scenes across time. Thus, we consider videos a new medium for examining spatio-temporal ToM reasoning ability. Specifically, we ask explicit probing questions about videos with abundant social and emotional reasoning content. We develop a pipeline for multimodal LLM for ToM reasoning using video and text. We also enable explicit ToM reasoning by retrieving key frames for answering a ToM question, which reveals how multimodal LLMs reason about ToM.
☆ Unveiling the Hidden Structure of Self-Attention via Kernel Principal Component Analysis
The remarkable success of transformers in sequence modeling tasks, spanning various applications in natural language processing and computer vision, is attributed to the critical role of self-attention. Similar to the development of most deep learning models, the construction of these attention mechanisms rely on heuristics and experience. In our work, we derive self-attention from kernel principal component analysis (kernel PCA) and show that self-attention projects its query vectors onto the principal component axes of its key matrix in a feature space. We then formulate the exact formula for the value matrix in self-attention, theoretically and empirically demonstrating that this value matrix captures the eigenvectors of the Gram matrix of the key vectors in self-attention. Leveraging our kernel PCA framework, we propose Attention with Robust Principal Components (RPC-Attention), a novel class of robust attention that is resilient to data contamination. We empirically demonstrate the advantages of RPC-Attention over softmax attention on the ImageNet-1K object classification, WikiText-103 language modeling, and ADE20K image segmentation task.
comment: 33 pages, 5 figures, 12 tables
☆ Empowering Tuberculosis Screening with Explainable Self-Supervised Deep Neural Networks
Tuberculosis persists as a global health crisis, especially in resource-limited populations and remote regions, with more than 10 million individuals newly infected annually. It stands as a stark symbol of inequity in public health. Tuberculosis impacts roughly a quarter of the global populace, with the majority of cases concentrated in eight countries, accounting for two-thirds of all tuberculosis infections. Although a severe ailment, tuberculosis is both curable and manageable. However, early detection and screening of at-risk populations are imperative. Chest x-ray stands as the predominant imaging technique utilized in tuberculosis screening efforts. However, x-ray screening necessitates skilled radiologists, a resource often scarce, particularly in remote regions with limited resources. Consequently, there is a pressing need for artificial intelligence (AI)-powered systems to support clinicians and healthcare providers in swift screening. However, training a reliable AI model necessitates large-scale high-quality data, which can be difficult and costly to acquire. Inspired by these challenges, in this work, we introduce an explainable self-supervised self-train learning network tailored for tuberculosis case screening. The network achieves an outstanding overall accuracy of 98.14% and demonstrates high recall and precision rates of 95.72% and 99.44%, respectively, in identifying tuberculosis cases, effectively capturing clinically significant features.
comment: 9 pages, 3 figures
☆ GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation
While text-to-visual models now produce photo-realistic images and videos, they struggle with compositional text prompts involving attributes, relationships, and higher-order reasoning such as logic and comparison. In this work, we conduct an extensive human study on GenAI-Bench to evaluate the performance of leading image and video generation models in various aspects of compositional text-to-visual generation. We also compare automated evaluation metrics against our collected human ratings and find that VQAScore -- a metric measuring the likelihood that a VQA model views an image as accurately depicting the prompt -- significantly outperforms previous metrics such as CLIPScore. In addition, VQAScore can improve generation in a black-box manner (without finetuning) via simply ranking a few (3 to 9) candidate images. Ranking by VQAScore is 2x to 3x more effective than other scoring methods like PickScore, HPSv2, and ImageReward at improving human alignment ratings for DALL-E 3 and Stable Diffusion, especially on compositional prompts that require advanced visio-linguistic reasoning. We will release a new GenAI-Rank benchmark with over 40,000 human ratings to evaluate scoring metrics on ranking images generated from the same prompt. Lastly, we discuss promising areas for improvement in VQAScore, such as addressing fine-grained visual details. We will release all human ratings (over 80,000) to facilitate scientific benchmarking of both generative models and automated metrics.
comment: We open-source our dataset, model, and code at: https://linzhiqiu.github.io/papers/genai_bench ; Project page: https://linzhiqiu.github.io/papers/genai_bench ;. arXiv admin note: substantial text overlap with arXiv:2404.01291
☆ StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images
Understanding the semantics of visual scenes is a fundamental challenge in Computer Vision. A key aspect of this challenge is that objects sharing similar semantic meanings or functions can exhibit striking visual differences, making accurate identification and categorization difficult. Recent advancements in text-to-image frameworks have led to models that implicitly capture natural scene statistics. These frameworks account for the visual variability of objects, as well as complex object co-occurrences and sources of noise such as diverse lighting conditions. By leveraging large-scale datasets and cross-attention conditioning, these models generate detailed and contextually rich scene representations. This capability opens new avenues for improving object recognition and scene understanding in varied and challenging environments. Our work presents StableSemantics, a dataset comprising 224 thousand human-curated prompts, processed natural language captions, over 2 million synthetic images, and 10 million attention maps corresponding to individual noun chunks. We explicitly leverage human-generated prompts that correspond to visually interesting stable diffusion generations, provide 10 generations per phrase, and extract cross-attention maps for each image. We explore the semantic distribution of generated images, examine the distribution of objects within images, and benchmark captioning and open vocabulary segmentation methods on our data. To the best of our knowledge, we are the first to release a diffusion dataset with semantic attributions. We expect our proposed dataset to catalyze advances in visual semantic understanding and provide a foundation for developing more sophisticated and effective visual models. Website: https://stablesemantics.github.io/StableSemantics
comment: Dataset website: https://stablesemantics.github.io/StableSemantics
☆ GUI Action Narrator: Where and When Did That Action Take Place?
The advent of Multimodal LLMs has significantly enhanced image OCR recognition capabilities, making GUI automation a viable reality for increasing efficiency in digital tasks. One fundamental aspect of developing a GUI automation system is understanding primitive GUI actions. This comprehension is crucial as it enables agents to learn from user demonstrations, an essential element of automation. To rigorously evaluate such capabilities, we developed a video captioning benchmark for GUI actions, comprising 4,189 diverse video captioning samples. This task presents unique challenges compared to natural scene video captioning: 1) GUI screenshots typically contain denser information than natural scenes, and 2) events within GUIs are subtler and occur more rapidly, requiring precise attention to the appropriate time span and spatial region for accurate understanding. To address these challenges, we introduce our GUI action dataset \textbf{Act2Cap} as well as a simple yet effective framework, \textbf{GUI Narrator}, for GUI video captioning that utilizes the cursor as a visual prompt to enhance the interpretation of high-resolution screenshots. Specifically, a cursor detector is trained on our dataset, and a multimodal LLM model with mechanisms for selecting keyframes and key regions generates the captions. Experimental results indicate that even for today's most advanced multimodal models, such as GPT-4o, the task remains highly challenging. Additionally, our evaluations show that our strategy effectively enhances model performance, whether integrated into the fine-tuning of open-source models or employed as a prompting strategy in closed-source models.
☆ AlanaVLM: A Multimodal Embodied AI Foundation Model for Egocentric Video Understanding
AI personal assistants deployed via robots or wearables require embodied understanding to collaborate with humans effectively. However, current Vision-Language Models (VLMs) primarily focus on third-person view videos, neglecting the richness of egocentric perceptual experience. To address this gap, we propose three key contributions. First, we introduce the Egocentric Video Understanding Dataset (EVUD) for training VLMs on video captioning and question answering tasks specific to egocentric videos. Second, we present AlanaVLM, a 7B parameter VLM trained using parameter-efficient methods on EVUD. Finally, we evaluate AlanaVLM's capabilities on OpenEQA, a challenging benchmark for embodied video question answering. Our model achieves state-of-the-art performance, outperforming open-source models including strong Socratic models using GPT-4 as a planner by 3.6%. Additionally, we outperform Claude 3 and Gemini Pro Vision 1.0 and showcase competitive results compared to Gemini Pro 1.5 and GPT-4V, even surpassing the latter in spatial reasoning. This research paves the way for building efficient VLMs that can be deployed in robots or wearables, leveraging embodied video understanding to collaborate seamlessly with humans in everyday tasks, contributing to the next generation of Embodied AI
comment: Code available https://github.com/alanaai/EVUD
♻ ☆ Spatialyze: A Geospatial Video Analytics System with Spatial-Aware Optimizations
Videos that are shot using commodity hardware such as phones and surveillance cameras record various metadata such as time and location. We encounter such geospatial videos on a daily basis and such videos have been growing in volume significantly. Yet, we do not have data management systems that allow users to interact with such data effectively. In this paper, we describe Spatialyze, a new framework for end-to-end querying of geospatial videos. Spatialyze comes with a domain-specific language where users can construct geospatial video analytic workflows using a 3-step, declarative, build-filter-observe paradigm. Internally, Spatialyze leverages the declarative nature of such workflows, the temporal-spatial metadata stored with videos, and physical behavior of real-world objects to optimize the execution of workflows. Our results using real-world videos and workflows show that Spatialyze can reduce execution time by up to 5.3x, while maintaining up to 97.1% accuracy compared to unoptimized execution.
comment: Project Page: https://spatialyze.github.io
♻ ☆ Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model CVPR 2024
The ability of large language models (LLMs) to process visual inputs has given rise to general-purpose vision systems, unifying various vision-language (VL) tasks by instruction tuning. However, due to the enormous diversity in input-output formats in the vision domain, existing general-purpose models fail to successfully integrate segmentation and multi-image inputs with coarse-level tasks into a single framework. In this work, we introduce VistaLLM, a powerful visual system that addresses coarse- and fine-grained VL tasks over single and multiple input images using a unified framework. VistaLLM utilizes an instruction-guided image tokenizer that filters global embeddings using task descriptions to extract compressed and refined features from numerous images. Moreover, VistaLLM employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences, significantly improving over previously used uniform sampling. To bolster the desired capability of VistaLLM, we curate CoinIt, a comprehensive coarse-to-fine instruction tuning dataset with 6.8M samples. We also address the lack of multi-image grounding datasets by introducing a novel task, AttCoSeg (Attribute-level Co-Segmentation), which boosts the model's reasoning and grounding capability over multiple input images. Extensive experiments on a wide range of V- and VL tasks demonstrate the effectiveness of VistaLLM by achieving consistent state-of-the-art performance over strong baselines across all downstream tasks. Our project page can be found at https://shramanpramanick.github.io/VistaLLM/.
comment: CVPR 2024 Highlight
♻ ☆ TroL: Traversal of Layers for Large Language and Vision Models
Large language and vision models (LLVMs) have been driven by the generalization power of large language models (LLMs) and the advent of visual instruction tuning. Along with scaling them up directly, these models enable LLVMs to showcase powerful vision language (VL) performances by covering diverse tasks via natural language instructions. However, existing open-source LLVMs that perform comparably to closed-source LLVMs such as GPT-4V are often considered too large (e.g., 26B, 34B, and 110B parameters), having a larger number of layers. These large models demand costly, high-end resources for both training and inference. To address this issue, we present a new efficient LLVM family with 1.8B, 3.8B, and 7B LLM model sizes, Traversal of Layers (TroL), which enables the reuse of layers in a token-wise manner. This layer traversing technique simulates the effect of looking back and retracing the answering stream while increasing the number of forward propagation layers without physically adding more layers. We demonstrate that TroL employs a simple layer traversing approach yet efficiently outperforms the open-source LLVMs with larger model sizes and rivals the performances of the closed-source LLVMs with substantial sizes.
comment: Code is available in https://github.com/ByungKwanLee/TroL
♻ ☆ Anatomically-Controllable Medical Image Generation with Segmentation-Guided Diffusion Models MICCAI 2024
Diffusion models have enabled remarkably high-quality medical image generation, yet it is challenging to enforce anatomical constraints in generated images. To this end, we propose a diffusion model-based method that supports anatomically-controllable medical image generation, by following a multi-class anatomical segmentation mask at each sampling step. We additionally introduce a random mask ablation training algorithm to enable conditioning on a selected combination of anatomical constraints while allowing flexibility in other anatomical areas. We compare our method ("SegGuidedDiff") to existing methods on breast MRI and abdominal/neck-to-pelvis CT datasets with a wide range of anatomical objects. Results show that our method reaches a new state-of-the-art in the faithfulness of generated images to input anatomical masks on both datasets, and is on par for general anatomical realism. Finally, our model also enjoys the extra benefit of being able to adjust the anatomical similarity of generated images to real images of choice through interpolation in its latent space. SegGuidedDiff has many applications, including cross-modality translation, and the generation of paired or counterfactual data. Our code is available at https://github.com/mazurowski-lab/segmentation-guided-diffusion.
comment: Accepted at MICCAI 2024. Code and synthetic dataset: https://github.com/mazurowski-lab/segmentation-guided-diffusion
♻ ☆ VLPrompt: Vision-Language Prompting for Panoptic Scene Graph Generation
Panoptic Scene Graph Generation (PSG) aims at achieving a comprehensive image understanding by simultaneously segmenting objects and predicting relations among objects. However, the long-tail problem among relations leads to unsatisfactory results in real-world applications. Prior methods predominantly rely on vision information or utilize limited language information, such as object or relation names, thereby overlooking the utility of language information. Leveraging the recent progress in Large Language Models (LLMs), we propose to use language information to assist relation prediction, particularly for rare relations. To this end, we propose the Vision-Language Prompting (VLPrompt) model, which acquires vision information from images and language information from LLMs. Then, through a prompter network based on attention mechanism, it achieves precise relation prediction. Our extensive experiments show that VLPrompt significantly outperforms previous state-of-the-art methods on the PSG dataset, proving the effectiveness of incorporating language information and alleviating the long-tail problem of relations. Code is available at \url{https://github.com/franciszzj/TP-SIS}.
comment: 22 pages, 9 figures
♻ ☆ RSAdapter: Adapting Multimodal Models for Remote Sensing Visual Question Answering
In recent years, with the rapid advancement of transformer models, transformer-based multimodal architectures have found wide application in various downstream tasks, including but not limited to Image Captioning, Visual Question Answering (VQA), and Image-Text Generation. However, contemporary approaches to Remote Sensing (RS) VQA often involve resource-intensive techniques, such as full fine-tuning of large models or the extraction of image-text features from pre-trained multimodal models, followed by modality fusion using decoders. These approaches demand significant computational resources and time, and a considerable number of trainable parameters are introduced. To address these challenges, we introduce a novel method known as RSAdapter, which prioritizes runtime and parameter efficiency. RSAdapter comprises two key components: the Parallel Adapter and an additional linear transformation layer inserted after each fully connected (FC) layer within the Adapter. This approach not only improves adaptation to pre-trained multimodal models but also allows the parameters of the linear transformation layer to be integrated into the preceding FC layers during inference, reducing inference costs. To demonstrate the effectiveness of RSAdapter, we conduct an extensive series of experiments using three distinct RS-VQA datasets and achieve state-of-the-art results on all three datasets. The code for RSAdapter is available online at https://github.com/Y-D-Wang/RSAdapter.
♻ ☆ An image speaks a thousand words, but can everyone listen? On image transcreation for cultural relevance
Given the rise of multimedia content, human translators increasingly focus on culturally adapting not only words but also other modalities such as images to convey the same meaning. While several applications stand to benefit from this, machine translation systems remain confined to dealing with language in speech and text. In this work, we take a first step towards translating images to make them culturally relevant. First, we build three pipelines comprising state-of-the-art generative models to do the task. Next, we build a two-part evaluation dataset: i) concept: comprising 600 images that are cross-culturally coherent, focusing on a single concept per image, and ii) application: comprising 100 images curated from real-world applications. We conduct a multi-faceted human evaluation of translated images to assess for cultural relevance and meaning preservation. We find that as of today, image-editing models fail at this task, but can be improved by leveraging LLMs and retrievers in the loop. Best pipelines can only translate 5% of images for some countries in the easier concept dataset and no translation is successful for some countries in the application dataset, highlighting the challenging nature of the task. Our code and data is released here: https://github.com/simran-khanuja/image-transcreation.
♻ ☆ The Male CEO and the Female Assistant: Gender Biases in Text-To-Image Generation of Dual Subjects
Recent large-scale T2I models like DALLE-3 have made progress on improving fairness in single-subject generation, i.e. generating a one-person image. However, we reveal that these improved models still demonstrate considerable biases when simply generating two people. To systematically evaluate T2I models in this challenging generation setting, we propose the Paired Stereotype Test (PST) framework, established as a dual-subject generation task, i.e. generating two people in the same image. The setting in PST is especially challenging, as the two individuals are described with social identities that are male-stereotyped and female-stereotyped, respectively, e.g. "a CEO" and "an Assistant". It is easy for T2I models to unfairly follow gender stereotypes in this contrastive setting. We establish a metric, Stereotype Score (SS), to quantitatively measure the adherence to gender stereotypes in generated images. Using PST, we evaluate two aspects of gender biases in DALLE-3 -- the widely-identified bias in gendered occupation, as well as a novel aspect: bias in organizational power. Results show that despite generating seemingly fair or even anti-stereotype single-person images, DALLE-3 still shows notable biases under PST -- for instance, in experiments on gender-occupational stereotypes, over 74% model generations demonstrate biases. Moreover, compared to single-person settings, DALLE-3 is more likely to perpetuate male-associated stereotypes under PST. Our work pioneers the research on bias in dual-subject generation, and our proposed PST framework can be easily extended for further experiments, establishing a valuable contribution.
♻ ☆ Fine-tuned Generative Adversarial Network-based Model for Medical Image Super-Resolution
In the field of medical image analysis, there is a substantial need for high-resolution (HR) images to improve diagnostic accuracy. However, it is a challenging task to obtain HR medical images, as it requires advanced instruments and significant time. Deep learning-based super-resolution methods can help to improve the resolution and perceptual quality of low-resolution (LR) medical images. Recently, Generative Adversarial Network (GAN) based methods have shown remarkable performance among deep learning-based super-resolution methods. Real-Enhanced Super-Resolution Generative Adversarial Network (Real-ESRGAN) is a practical model for recovering HR images from real-world LR images. In our proposed approach, we use transfer learning technique and fine-tune the pre-trained Real-ESRGAN model using medical image datasets. This technique helps in improving the performance of the model. We employ the high-order degradation model of the Real-ESRGAN which better simulates real-world image degradations. This adaptation allows for generating more realistic degraded medical images, resulting in improved performance. The focus of this paper is on enhancing the resolution and perceptual quality of chest X-ray and retinal images. We use the Tuberculosis chest X-ray (Shenzhen) dataset and the STARE dataset of retinal images for fine-tuning the model. The proposed model achieves superior perceptual quality compared to the Real-ESRGAN model, effectively preserving fine details and generating images with more realistic textures.
Computation and Language 5
☆ Open Generative Large Language Models for Galician
Large language models (LLMs) have transformed natural language processing. Yet, their predominantly English-centric training has led to biases and performance disparities across languages. This imbalance marginalizes minoritized languages, making equitable access to NLP technologies more difficult for languages with lower resources, such as Galician. We present the first two generative LLMs focused on Galician to bridge this gap. These models, freely available as open-source resources, were trained using a GPT architecture with 1.3B parameters on a corpus of 2.1B words. Leveraging continual pretraining, we adapt to Galician two existing LLMs trained on larger corpora, thus mitigating the data constraints that would arise if the training were performed from scratch. The models were evaluated using human judgments and task-based datasets from standardized benchmarks. These evaluations reveal a promising performance, underscoring the importance of linguistic diversity in generative models.
comment: 12 pages, 1 figure
☆ Adaptable Logical Control for Large Language Models
Despite the success of Large Language Models (LLMs) on various tasks following human instructions, controlling model generation at inference time poses a persistent challenge. In this paper, we introduce Ctrl-G, an adaptable framework that facilitates tractable and flexible control of LLM generation to reliably follow logical constraints. Ctrl-G combines any production-ready LLM with a Hidden Markov Model, enabling LLM outputs to adhere to logical constraints represented as deterministic finite automata. We show that Ctrl-G, when applied to a TULU2-7B model, outperforms GPT3.5 and GPT4 on the task of interactive text editing: specifically, for the task of generating text insertions/continuations following logical constraints, Ctrl-G achieves over 30% higher satisfaction rate in human evaluation compared to GPT4. When applied to medium-size language models (e.g., GPT2-large), Ctrl-G also beats its counterparts for constrained generation by large margins on standard benchmarks. Additionally, as a proof-of-concept study, we experiment Ctrl-G on the Grade School Math benchmark to assist LLM reasoning, foreshadowing the application of Ctrl-G, as well as other constrained generation approaches, beyond traditional language generation tasks.
☆ ClinicalLab: Aligning Agents for Multi-Departmental Clinical Diagnostics in the Real World
LLMs have achieved significant performance progress in various NLP applications. However, LLMs still struggle to meet the strict requirements for accuracy and reliability in the medical field and face many challenges in clinical applications. Existing clinical diagnostic evaluation benchmarks for evaluating medical agents powered by LLMs have severe limitations. Firstly, most existing medical evaluation benchmarks face the risk of data leakage or contamination. Secondly, existing benchmarks often neglect the characteristics of multiple departments and specializations in modern medical practice. Thirdly, existing evaluation methods are limited to multiple-choice questions, which do not align with the real-world diagnostic scenarios. Lastly, existing evaluation methods lack comprehensive evaluations of end-to-end real clinical scenarios. These limitations in benchmarks in turn obstruct advancements of LLMs and agents for medicine. To address these limitations, we introduce ClinicalLab, a comprehensive clinical diagnosis agent alignment suite. ClinicalLab includes ClinicalBench, an end-to-end multi-departmental clinical diagnostic evaluation benchmark for evaluating medical agents and LLMs. ClinicalBench is based on real cases that cover 24 departments and 150 diseases. ClinicalLab also includes four novel metrics (ClinicalMetrics) for evaluating the effectiveness of LLMs in clinical diagnostic tasks. We evaluate 17 LLMs and find that their performance varies significantly across different departments. Based on these findings, in ClinicalLab, we propose ClinicalAgent, an end-to-end clinical agent that aligns with real-world clinical diagnostic practices. We systematically investigate the performance and applicable scenarios of variants of ClinicalAgent on ClinicalBench. Our findings demonstrate the importance of aligning with modern medical practices in designing medical agents.
☆ Knowledge Tagging System on Math Questions via LLMs with Flexible Demonstration Retriever
Knowledge tagging for questions plays a crucial role in contemporary intelligent educational applications, including learning progress diagnosis, practice question recommendations, and course content organization. Traditionally, these annotations are always conducted by pedagogical experts, as the task requires not only a strong semantic understanding of both question stems and knowledge definitions but also deep insights into connecting question-solving logic with corresponding knowledge concepts. With the recent emergence of advanced text encoding algorithms, such as pre-trained language models, many researchers have developed automatic knowledge tagging systems based on calculating the semantic similarity between the knowledge and question embeddings. In this paper, we explore automating the task using Large Language Models (LLMs), in response to the inability of prior encoding-based methods to deal with the hard cases which involve strong domain knowledge and complicated concept definitions. By showing the strong performance of zero- and few-shot results over math questions knowledge tagging tasks, we demonstrate LLMs' great potential in conquering the challenges faced by prior methods. Furthermore, by proposing a reinforcement learning-based demonstration retriever, we successfully exploit the great potential of different-sized LLMs in achieving better performance results while keeping the in-context demonstration usage efficiency high.
comment: 13 pages, 6 figures
♻ ☆ Retrieval Augmented Generation using Engineering Design Knowledge
Aiming to support Retrieval Augmented Generation (RAG) in the design process, we present a method to identify explicit, engineering design facts - {head entity :: relationship :: tail entity} from patented artefact descriptions. Given a sentence with a pair of entities (based on noun phrases) marked in a unique manner, our method extracts the relationship that is explicitly communicated in the sentence. For this task, we create a dataset of 375,084 examples and fine-tune language models for relation identification (token classification) and elicitation (sequence-to-sequence). The token classification approach achieves up to 99.7% accuracy. Upon applying the method to a domain of 4,870 fan system patents, we populate a knowledge base of over 2.93 million facts. Using this knowledge base, we demonstrate how Large Language Models (LLMs) are guided by explicit facts to synthesise knowledge and generate technical and cohesive responses when sought out for knowledge retrieval tasks in the design process.
comment: Resources: Dataset - https://huggingface.co/datasets/siddharthl1293/engineering_design_facts Training Infrastructure - https://zenodo.org/records/12012131 Trained model - https://huggingface.co/siddharthl1293/albert-albert-large-v2 Application - https://github.com/siddharthl93/engineering-design-knowledge
Systems and Control 27
☆ SkyGrid: Energy-Flow Optimization at Harmonized Aerial Intersections
The rapid evolution of urban air mobility (UAM) is reshaping the future of transportation by integrating aerial vehicles into urban transit systems. The design of aerial intersections plays a critical role in the phased development of UAM systems to ensure safe and efficient operations in air corridors. This work adapts the concept of rhythmic control of connected and automated vehicles (CAVs) at unsignalized intersections to address complex traffic control problems. This control framework assigns UAM vehicles to different movement groups and significantly reduces the computation of routing strategies to avoid conflicts. In contrast to ground traffic, the objective is to balance three measures: minimizing energy utilization, maximizing intersection flow (throughput), and maintaining safety distances. This optimization method dynamically directs traffic with various demands, considering path assignment distributions and segment-level trajectory coefficients for straight and curved paths as control variables. To the best of our knowledge, this is the first work to consider a multi-objective optimization approach for unsignalized intersection control in the air and to propose such optimization in a rhythmic control setting with time arrival and UAM operational constraints. A sensitivity analysis with respect to inter-platoon safety and straight/left demand balance demonstrates the effectiveness of our method in handling traffic under various scenarios.
comment: 8 pages, 12 figures - Submitted to IEEE VTC Fall 2024 - Under Review
☆ Adaptive Curves for Optimally Efficient Market Making
Automated Market Makers (AMMs) are essential in Decentralized Finance (DeFi) as they match liquidity supply with demand. They function through liquidity providers (LPs) who deposit assets into liquidity pools. However, the asset trading prices in these pools often trail behind those in more dynamic, centralized exchanges, leading to potential arbitrage losses for LPs. This issue is tackled by adapting market maker bonding curves to trader behavior, based on the classical market microstructure model of Glosten and Milgrom. Our approach ensures a zero-profit condition for the market maker's prices. We derive the differential equation that an optimal adaptive curve should follow to minimize arbitrage losses while remaining competitive. Solutions to this optimality equation are obtained for standard Gaussian and Lognormal price models using Kalman filtering. A key feature of our method is its ability to estimate the external market price without relying on price or loss oracles. We also provide an equivalent differential equation for the implied dynamics of canonical static bonding curves and establish conditions for their optimality. Our algorithms demonstrate robustness to changing market conditions and adversarial perturbations, and we offer an on-chain implementation using Uniswap v4 alongside off-chain AI co-processors.
☆ COAC: Cross-layer Optimization of Accelerator Configurability for Efficient CNN Processing
To achieve high accuracy, convolutional neural networks (CNNs) are increasingly growing in complexity and diversity in layer types and topologies. This makes it very challenging to efficiently deploy such networks on custom processor architectures for resource-scarce edge devices. Existing mapping exploration frameworks enable searching for the optimal execution schedules or hardware mappings of individual network layers, by optimizing each layer's spatial (dataflow parallelization) and temporal unrolling (execution order). However, these tools fail to take into account the overhead of supporting different unrolling schemes within a common hardware architecture. Using a fixed unrolling scheme across all layers is also not ideal, as this misses significant opportunities for energy and latency savings from optimizing the mapping of diverse layer types. A balanced approach assesses the right amount of mapping flexibility needed across target neural networks, while taking into account the overhead to support multiple unrollings. This paper, therefore, presents COAC, a cross-layer design space exploration and mapping framework to optimize the flexibility of neural processing architectures by balancing configurability overhead against resulting energy and latency savings for end-to-end inference. COAC does not only provide a systematical analysis of the architectural overhead in function of the supported spatial unrollings, but also builds an automated flow to find the best unrolling combination(s) for efficient end-to-end inference with limited hardware overhead. Results demonstrate that architectures with carefully optimized flexibility can achieve up to 38% EDP (energy-delay-product) savings for a set of six neural networks at the expense of a relative area increase of 9.5%.
comment: 14 pages,17 figures.Journal IEEE Transactions on Very Large Scale Integration (VLSI) Systems
☆ Safety-Critical Formation Control of Non-Holonomic Multi-Robot Systems in Communication-Limited Environments
This paper presents a robust estimator-based safety-critical controller for formation control of non-holonomic mobile robots in communication-limited environments. The proposed decentralized framework integrates a robust state estimator with a formation tracking control law that guarantees inter-agent collision avoidance using control barrier functions. String stability is incorporated into the control design to maintain stability against noise from predecessors in leader-follower formations. Rigorous stability analysis using Lyapunov functions ensures the stability of estimation errors and the convergence of the formation to desired configurations. The effectiveness and robustness of the proposed approach are validated through numerical simulations of various maneuvers and realistic Gazebo experiments involving formations in a warehouse environment. The results demonstrate the controller's ability to maintain safety, achieve precise formation control, and mitigate disturbances in scenarios without inter-robot communication.
comment: Under review
☆ Advanced Maximum Adhesion Tracking Strategies in Railway Traction Drives
Modern railway traction systems are often equipped with anti-slip control strategies to comply with performance and safety requirements. A certain amount of slip is needed to increase the torque transferred by the traction motors onto the rail. Commonly, constant slip control is used to limit the slip velocity between the wheel and rail avoiding excessive slippage and vehicle derailment. This is at the price of not fully utilizing the train's traction and braking capabilities. Finding the slip at which maximum traction force occurs is challenging due to the non-linear relationship between slip and wheel-rail adhesion coefficient, as well as to its dependence on rail and wheel conditions. Perturb and observe (P\&O) and steepest gradient (SG) methods have been reported for the Maximum Adhesion Tracking (MAT) search. However, both methods exhibit weaknesses. Two new MAT strategies are proposed in this paper which overcome the limitations of existing methods, using Fuzzy Logic Controller (FLC) and Particle Swarm Optimization (PSO) respectively. Existing and proposed methods are first simulated and further validated experimentally using a scaled roller rig under identical conditions. The results show that the proposed methods improve the traction capability with lower searching time and oscillations compared to existing solutions. Tuning complexity and computational requirements will also be shown to be favorable to the proposed methods.
comment: 16 pages, 21 figures
☆ On Computation of Approximate Solutions to Large-Scale Backstepping Kernel Equations via Continuum Approximation
We provide two methods for computation of continuum backstepping kernels that arise in control of continua (ensembles) of linear hyperbolic PDEs and which can approximate backstepping kernels arising in control of a large-scale, PDE system counterpart (with computational complexity that does not grow with the number of state components of the large-scale system). In the first method, we identify a class of systems for which the solution to the continuum (and hence, also an approximate solution to the respective large-scale) kernel equations can be constructed in closed form. In the second method, we provide explicit formulae for the solution to the continuum kernels PDEs, employing a (triple) power series representation of the continuum kernel and establishing its convergence properties. In this case, we also provide means for reducing computational complexity by properly truncating the power series (in the powers of the ensemble variable). We also present numerical examples to illustrate computational efficiency/accuracy of the approaches, as well as to validate the stabilization properties of the approximate control kernels, constructed based on the continuum.
comment: 13 pages, 5 figures, submitted to Systems & Control Letters
☆ Measured-state conditioned recursive feasibility for stochastic model predictive control
In this paper, we address the problem of designing stochastic model predictive control (MPC) schemes for linear systems affected by unbounded disturbances. The contribution of the paper is twofold. First, motivated by the difficulty of guaranteeing recursive feasibility in this framework, due to the nonzero probability of violating chance-constraints in the case of unbounded noise, we introduce the novel definition of measured-state conditioned recursive feasibility in expectation. Second, we construct a stochastic MPC scheme, based on the introduction of ellipsoidal probabilistic reachable sets, which implements a closed-loop initialization strategy, i.e., the current measured-state is employed for initializing the optimization problem. This new scheme is proven to satisfy the novel definition of recursive feasibility, and its superiority with respect to open-loop initialization schemes, arising from the fact that one never neglects the information brought by the current measurement, is shown through numerical examples.
☆ Design of Phase Locked Loop in 180 nm Technology
The presented paper introduces a design for a phase-locked loop (PLL) that is utilized in frequency synthesis and modulation-demodulation within communication systems and in VLSI applications. The CMOS PLL is designed using 180 nm Fabrication Technology on Cadence Virtuoso Tool with a supply voltage of 1.8 V. The performance is evaluated through simulations and measurements, which demonstrate its ability to track and lock onto the input frequency. The PLL is a frequency synthesizer implemented to generate 2.4 GHz frequency. The input reference clock from a crystal oscillator is 150 MHz square wave. Negative feedback is given by divide-by-16 frequency divider, ensuring the phase and frequency synchronization between the divided signal and the reference signal. The design has essential components such as a phase frequency detector, charge pump, loop filter, current-starved voltage-controlled oscillator (CSVCO), and frequency divider. Through their collaborative operation, the system generates an output frequency that is 16 times the input frequency. The centre frequency of the 3-stage CSVCO is 3.208 GHz at 900 mV input voltage. With an input voltage ranging from 0.4 V to 1.8 V, the VCO offers a tuning range that spans from 1.066 GHz to 3.731 GHz.PLL demonstrates a lock-in range spanning from 70.4 MHz to 173 MHz, with an output frequency range of 1.12 GHz to 2.78 GHz. It achieves a lock time of 260.03 ns and consumes a maximum power of 5.15 mW at 2.4 GHz.
☆ The effect of control barrier functions on energy transfers in controlled physical systems
Using a port-Hamiltonian formalism, we show the qualitative and quantitative effect of safety-critical control implemented with control barrier functions (CBFs) on the power balance of controlled physical systems. The presented results will provide novel tools to design CBFs inducing desired energetic behaviors of the closed-loop system, including nontrivial damping injection effects and non-passive control actions, effectively injecting energy in the system in a controlled manner. Simulations validate the stated results.
☆ Safe and Non-Conservative Trajectory Planning for Autonomous Driving Handling Unanticipated Behaviors of Traffic Participants
Trajectory planning for autonomous driving is challenging because the unknown future motion of traffic participants must be accounted for, yielding large uncertainty. Stochastic Model Predictive Control (SMPC)-based planners provide non-conservative planning, but do not rule out a (small) probability of collision. We propose a control scheme that yields an efficient trajectory based on SMPC when the traffic scenario allows, still avoiding that the vehicle causes collisions with traffic participants if the latter move according to the prediction assumptions. If some traffic participant does not behave as anticipated, no safety guarantee can be given. Then, our approach yields a trajectory which minimizes the probability of collision, using Constraint Violation Probability Minimization techniques. Our algorithm can also be adapted to minimize the anticipated harm caused by a collision. We provide a thorough discussion of the benefits of our novel control scheme and compare it to a previous approach through numerical simulations from the CommonRoad database.
☆ State Anti-windup: A New Methodology for Tackling State Constraints at Both Synthesis and Implementation Levels
The anti-windup compensation typically addresses strict control limitations in control systems. However, there is a clear need for an equivalent solution for the states/outputs of the system. This paper introduces a novel methodology for the state anti-windup compensator. Unlike state-constrained control methods, which often focus on incorporating soft constraints into the design or fail to react adequately to constraint violations in practical settings, the proposed methodology treats state constraints as implement-oriented soft-hard constraints. This is achieved by integrating a saturation block within the structure of the safety compensator, referred to as the state anti-windup (SANTW) compensator. Similar to input anti-windup schemes, the SANTW design is separated from the nominal controller design. The problem is formulated as a disturbance rejection one to directly minimize the saturation. The paper develops two Hinf optimization frameworks using frequency-domain solutions and linear matrix inequalities. It then addresses constraints on both inputs and states, resulting in a unified Input-State Anti-windup (IS-ANTW) compensator synthesized using non-smooth Hinf optimization. This method also offers the flexibility of having a fixed-order compensator, crucial in many practical applications. Additionally, the study evaluates the proposed compensator's performance in managing current fluctuations from renewable energy sources during grid faults, demonstrating its effectiveness through detailed Electromagnetic Transient (EMT) simulations of grid-connected DC-AC converters.
comment: 15 pages 21 Figures
☆ GPU-Accelerated DCOPF using Gradient-Based Optimization
DC Optimal Power Flow (DCOPF) is a key operational tool for power system operators, and it is embedded as a subproblem in many challenging optimization problems (e.g., line switching). However, traditional CPU-based solve routines (e.g., simplex) have saturated in speed and are hard to parallelize. This paper focuses on solving DCOPF problems using gradient-based routines on Graphics Processing Units (GPUs), which have massive parallelization capability. To formulate these problems, we pose a Lagrange dual associated with DCOPF (linear and quadratic cost curves), and then we explicitly solve the inner (primal) minimization problem with a dual norm. The resulting dual problem can be efficiently iterated using projected gradient ascent. After solving the dual problem on both CPUs and GPUs to find tight lower bounds, we benchmark against Gurobi and MOSEK, comparing convergence speed and tightness on the IEEE 2000 and 10000 bus systems. We provide reliable and tight lower bounds for these problems with, at best, 5.4x speedup over a conventional solver.
☆ Constructing and Evaluating Digital Twins: An Intelligent Framework for DT Development
The development of Digital Twins (DTs) represents a transformative advance for simulating and optimizing complex systems in a controlled digital space. Despite their potential, the challenge of constructing DTs that accurately replicate and predict the dynamics of real-world systems remains substantial. This paper introduces an intelligent framework for the construction and evaluation of DTs, specifically designed to enhance the accuracy and utility of DTs in testing algorithmic performance. We propose a novel construction methodology that integrates deep learning-based policy gradient techniques to dynamically tune the DT parameters, ensuring high fidelity in the digital replication of physical systems. Moreover, the Mean STate Error (MSTE) is proposed as a robust metric for evaluating the performance of algorithms within these digital space. The efficacy of our framework is demonstrated through extensive simulations that show our DT not only accurately mirrors the physical reality but also provides a reliable platform for algorithm evaluation. This work lays a foundation for future research into DT technologies, highlighting pathways for both theoretical enhancements and practical implementations in various industries.
☆ Thruster-Assisted Incline Walking
In this study, our aim is to evaluate the effectiveness of thruster-assisted steep slope walking for the Husky Carbon, a quadrupedal robot equipped with custom-designed actuators and plural electric ducted fans, through simulation prior to conducting experimental trials. Thruster-assisted steep slope walking draws inspiration from wing-assisted incline running (WAIR) observed in birds, and intriguingly incorporates posture manipulation and thrust vectoring, a locomotion technique not previously explored in the animal kingdom. Our approach involves developing a reduced-order model of the Husky robot, followed by the application of an optimization-based controller utilizing collocation methods and dynamics interpolation to determine control actions. Through simulation testing, we demonstrate the feasibility of hardware implementation of our controller.
comment: 7 pages, 7 figures, submitted to CDC 2024 conference. arXiv admin note: text overlap with arXiv:2405.06070
♻ ☆ Optimal Control of Logically Constrained Partially Observable and Multi-Agent Markov Decision Processes
Autonomous systems often have logical constraints arising, for example, from safety, operational, or regulatory requirements. Such constraints can be expressed using temporal logic specifications. The system state is often partially observable. Moreover, it could encompass a team of multiple agents with a common objective but disparate information structures and constraints. In this paper, we first introduce an optimal control theory for partially observable Markov decision processes (POMDPs) with finite linear temporal logic constraints. We provide a structured methodology for synthesizing policies that maximize a cumulative reward while ensuring that the probability of satisfying a temporal logic constraint is sufficiently high. Our approach comes with guarantees on approximate reward optimality and constraint satisfaction. We then build on this approach to design an optimal control framework for logically constrained multi-agent settings with information asymmetry. We illustrate the effectiveness of our approach by implementing it on several case studies.
comment: arXiv admin note: substantial text overlap with arXiv:2203.09038
♻ ☆ Passive Obstacle Aware Control to Follow Desired Velocities
Evaluating and updating the obstacle avoidance velocity for an autonomous robot in real-time ensures robustness against noise and disturbances. A passive damping controller can obtain the desired motion with a torque-controlled robot, which remains compliant and ensures a safe response to external perturbations. Here, we propose a novel approach for designing the passive control policy. Our algorithm complies with obstacle-free zones while transitioning to increased damping near obstacles to ensure collision avoidance. This approach ensures stability across diverse scenarios, effectively mitigating disturbances. Validation on a 7DoF robot arm demonstrates superior collision rejection capabilities compared to the baseline, underlining its practicality for real-world applications. Our obstacle-aware damping controller represents a substantial advancement in secure robot control within complex and uncertain environments.
♻ ☆ Fast Geometric Learning of MIMO Signal Detection over Grassmannian Manifolds
Domain or statistical distribution shifts are a key staple of the wireless communication channel, because of the dynamics of the environment. Deep learning (DL) models for detecting multiple-input multiple-output (MIMO) signals in dynamic communication require large training samples (in the order of hundreds of thousands to millions) and online retraining to adapt to domain shift. Some dynamic networks, such as vehicular networks, cannot tolerate the waiting time associated with gathering a large number of training samples or online fine-tuning which incurs significant end-to-end delay. In this paper, a novel classification technique based on the concept of geodesic flow kernel (GFK) is proposed for MIMO signal detection. In particular, received MIMO signals are first represented as points on Grassmannian manifolds by formulating basis of subspaces spanned by the rows vectors of the received signal. Then, the domain shift is modeled using a geodesic flow kernel integrating the subspaces that lie on the geodesic to characterize changes in geometric and statistical properties of the received signals. The kernel derives low-dimensional representations of the received signals over the Grassman manifolds that are invariant to domain shift and is used in a geometric support vector machine (G-SVM) algorithm for MIMO signal detection in an unsupervised manner. Simulation results reveal that the proposed method achieves promising performance against the existing baselines like OAMPnet and MMNet with only 1,200 training samples and without online retraining.
♻ ☆ Resource-Aware Hierarchical Federated Learning in Wireless Video Caching Networks
Backhaul traffic congestion caused by the video traffic of a few popular files can be alleviated by storing the to-be-requested content at various levels in wireless video caching networks. Typically, content service providers (CSPs) own the content, and the users request their preferred content from the CSPs using their (wireless) internet service providers (ISPs). As these parties do not reveal their private information and business secrets, traditional techniques may not be readily used to predict the dynamic changes in users' future demands. Motivated by this, we propose a novel resource-aware hierarchical federated learning (RawHFL) solution for predicting user's future content requests. A practical data acquisition technique is used that allows the user to update its local training dataset based on its requested content. Besides, since networking and other computational resources are limited, considering that only a subset of the users participate in the model training, we derive the convergence bound of the proposed algorithm. Based on this bound, we minimize a weighted utility function for jointly configuring the controllable parameters to train the RawHFL energy efficiently under practical resource constraints. Our extensive simulation results validate the proposed algorithm's superiority, in terms of test accuracy and energy cost, over existing baselines.
comment: Under review for possible publication in IEEE Transactions on Wireless Communications
♻ ☆ Immersion and Invariance-based Disturbance Observer and Its Application to Safe Control
When the disturbance input matrix is nonlinear, existing disturbance observer design methods rely on the solvability of a partial differential equation or the existence of an output function with a uniformly well-defined disturbance relative degree, which can pose significant limitations. This note introduces a systematic approach for designing an Immersion and Invariance-based Disturbance Observer (IIDOB) that circumvents these strong assumptions. The proposed IIDOB ensures the disturbance estimation error is globally uniformly ultimately bounded by approximately solving a partial differential equation while compensating for the approximation error. Furthermore, by integrating IIDOB into the framework of control barrier functions, a filter-based safe control design method for control-affine systems with disturbances is established where the filter is used to generate an alternative disturbance estimation signal with a known derivative. Sufficient conditions are established to guarantee the safety of the disturbed systems. Simulation results demonstrate the effectiveness of the proposed method.
comment: Accepted to IEEE Transactions on Automatic Control
♻ ☆ Vessim: A Testbed for Carbon-Aware Applications and Systems
To reduce the carbon footprint of computing and stabilize electricity grids, there is an increasing focus on approaches that align the power usage of IT infrastructure with the availability of clean energy. Unfortunately, research on energy-aware and carbon-aware applications, as well as the interfaces between computing and energy systems, remains complex due to the scarcity of available testing environments. To this day, almost all new approaches are evaluated on custom simulation testbeds, which leads to repeated development efforts and limited comparability of results. In this paper, we present Vessim, a co-simulation environment for testing applications and computing systems that interact with their energy systems. Our testbed connects domain-specific simulators for renewable power generation and energy storage, and enables users to implement interfaces to integrate real systems through software and hardware-in-the-loop simulation. Vessim offers an easy-to-use interface, is extendable to new simulators, and provides direct access to historical datasets. We aim to not only accelerate research in carbon-aware computing but also facilitate development and operation, as in continuous testing or digital twins. Vessim is publicly available: https://github.com/dos-group/vessim.
comment: HotCarbon'24
♻ ☆ Asymptotic Analysis of Near-Field Coupling in Massive MISO and Massive SIMO Systems
This paper studies the receiver to transmitter antenna coupling in near-field communications with massive arrays. Although most works in the literature consider that it is negligible and approximate it by zero, there is no rigorous analysis on its relevance for practical systems. In this work, we leverage multiport communication theory to obtain conditions for the aforementioned approximation to be valid in MISO and SIMO systems. These conditions are then particularized for arrays with fixed inter-element spacing and arrays with fixed size.
comment: Accepted version of the article published in IEEE Communications Letters, 2024. DOI: 10.1109/LCOMM.2024.3416044
♻ ☆ Control of Unknown Quadrotors from a Single Throw IROS 2024
This paper presents a method to recover quadrotor UAV from a throw, when no control parameters are known before the throw. We leverage the availability of high-frequency rotor speed feedback available in racing drone hardware and software to find control effectiveness values and fit a motor model using recursive least squares (RLS) estimation. Furthermore, we propose an excitation sequence that provides large actuation commands while guaranteeing to stay within gyroscope sensing limits. After 450ms of excitation, an INDI attitude controller uses the 52 fitted parameters to arrest rotational motion and recover an upright attitude. Finally, a NDI position controller drives the craft to a position setpoint. The proposed algorithm runs efficiently on microcontrollers found in common UAV flight controllers, and was shown to recover an agile quadrotor every time in 57 live experiments with as low as 3.5m throw height, demonstrating robustness against initial rotations and noise. We also demonstrate control of randomized quadrotors in simulated throws, where the parameter fitting RMS error is typically within 10% of the true value.
comment: 7 pages, 5 figures, 2 tables. Submitted to the IROS 2024 conference
♻ ☆ Decentralized PI-control and Anti-windup in Resource Sharing Networks
We consider control of multiple stable first-order \review{agents} which have a control coupling described by an M-matrix. These agents are subject to incremental sector-bounded \review{input} nonlinearities. We show that such plants can be globally asymptotically stabilized to a unique equilibrium using fully decentralized proportional-integral controllers equipped with anti-windup and subject to local tuning rules. In addition, we show that when the nonlinearities correspond to the saturation function, the closed loop asymptotically minimizes a weighted 1-norm of the agents state mismatch. The control strategy is finally compared to other state-of-the-art controllers on a numerical district heating example.
♻ ☆ Reachability Analysis Using Constrained Polynomial Logical Zonotopes
In this paper, we propose reachability analysis using constrained polynomial logical zonotopes. We perform reachability analysis to compute the set of states that could be reached. To do this, we utilize a recently introduced set representation called polynomial logical zonotopes for performing computationally efficient and exact reachability analysis on logical systems. Notably, polynomial logical zonotopes address the "curse of dimensionality" when analyzing the reachability of logical systems since the set representation can represent $2^h$ binary vectors using $h$ generators. After finishing the reachability analysis, the formal verification involves verifying whether the intersection of the calculated reachable set and the unsafe set is empty or not. Polynomial logical zonotopes lack closure under intersections, prompting the formulation of constrained polynomial logical zonotopes, which preserve the computational efficiency and exactness of polynomial logical zonotopes for reachability analysis while enabling exact intersections. Additionally, an extensive empirical study is presented to demonstrate and validate the advantages of constrained polynomial logical zonotopes.
comment: IEEE Control Systems Letters (2024)
♻ ☆ Game-theoretic Energy Management Strategies With Interacting Agents in Formula 1
This paper presents an interaction-aware energy management optimization framework for Formula 1 racing. The considered scenario involves two agents and a drag reduction model. Strategic interactions between the agents are captured by a Stackelberg game formulated as a bilevel program. To address the computational challenges associated with bilevel optimization, the problem is reformulated as a single-level nonlinear program employing the Karush-Kuhn-Tucker conditions. The proposed framework contributes towards the development of new energy management and allocation strategies, caused by the presence of another agent. For instance, it provides valuable insights on how to redistribute the energy in order to optimally exploit the wake effect, showcasing a notable difference with the behavior studied in previous works. Robust energy allocations can be identified to reduce the lap time loss associated with unexpected choices of the other agent. It allows to recognize the boundary conditions for the interaction to become relevant, impacting the system's behavior, and to assess if overtaking is possible and beneficial. Overall, the framework provides a comprehensive approach for a two-agent Formula 1 racing problem with strategic interactions, offering physically intuitive and practical results.
♻ ☆ Tradeoffs between convergence rate and noise amplification for momentum-based accelerated optimization algorithms
We study momentum-based first-order optimization algorithms in which the iterations utilize information from the two previous steps and are subject to an additive white noise. This setup uses noise to account for uncertainty in either gradient evaluation or iteration updates, and it includes Polyak's heavy-ball and Nesterov's accelerated methods as special cases. For strongly convex quadratic problems, we use the steady-state variance of the error in the optimization variable to quantify noise amplification and identify fundamental stochastic performance tradeoffs. Our approach utilizes the Jury stability criterion to provide a novel geometric characterization of conditions for linear convergence, and it reveals the relation between the noise amplification and convergence rate as well as their dependence on the condition number and the constant algorithmic parameters. This geometric insight leads to simple alternative proofs of standard convergence results and allows us to establish ``uncertainty principle'' of strongly convex optimization: for the two-step momentum method with linear convergence rate, the lower bound on the product between the settling time and noise amplification scales quadratically with the condition number. Our analysis also identifies a key difference between the gradient and iterate noise models: while the amplification of gradient noise can be made arbitrarily small by sufficiently decelerating the algorithm, the best achievable variance for the iterate noise model increases linearly with the settling time in the decelerating regime. Finally, we introduce two parameterized families of algorithms that strike a balance between noise amplification and settling time while preserving order-wise Pareto optimality for both noise models.
comment: 23 pages; 7 figures
♻ ☆ Signatures Meet Dynamic Programming: Generalizing Bellman Equations for Trajectory Following
Path signatures have been proposed as a powerful representation of paths that efficiently captures the path's analytic and geometric characteristics, having useful algebraic properties including fast concatenation of paths through tensor products. Signatures have recently been widely adopted in machine learning problems for time series analysis. In this work we establish connections between value functions typically used in optimal control and intriguing properties of path signatures. These connections motivate our novel control framework with signature transforms that efficiently generalizes the Bellman equation to the space of trajectories. We analyze the properties and advantages of the framework, termed signature control. In particular, we demonstrate that (i) it can naturally deal with varying/adaptive time steps; (ii) it propagates higher-level information more efficiently than value function updates; (iii) it is robust to dynamical system misspecification over long rollouts. As a specific case of our framework, we devise a model predictive control method for path tracking. This method generalizes integral control, being suitable for problems with unknown disturbances. The proposed algorithms are tested in simulation, with differentiable physics models including typical control and robotics tasks such as point-mass, curve following for an ant model, and a robotic manipulator.
comment: 48 pages, 21 figures
Multiagent Systems 5
☆ Tactical Game-theoretic Decision-making with Homotopy Class Constraints
We propose a tactical homotopy-aware decision-making framework for game-theoretic motion planning in urban environments. We model urban driving as a generalized Nash equilibrium problem and employ a mixed-integer approach to tame the combinatorial aspect of motion planning. More specifically, by utilizing homotopy classes, we partition the high-dimensional solution space into finite, well-defined subregions. Each subregion (homotopy) corresponds to a high-level tactical decision, such as the passing order between pairs of players. The proposed formulation allows to find global optimal Nash equilibria in a computationally tractable manner by solving a mixed-integer quadratic program. Each homotopy decision is represented by a binary variable that activates different sets of linear collision avoidance constraints. This extra homotopic constraint allows to find solutions in a more efficient way (on a roundabout scenario on average 5-times faster). We experimentally validate the proposed approach on scenarios taken from the rounD dataset. Simulation-based testing in receding horizon fashion demonstrates the capability of the framework in achieving globally optimal solutions while yielding a 78% average decrease in the computational time with respect to an implementation without the homotopic constraints.
☆ Minimalist exploration strategies for robot swarms at the edge of chaos
Effective exploration abilities are fundamental for robot swarms, especially when small, inexpensive robots are employed (e.g., micro- or nano-robots). Random walks are often the only viable choice if robots are too constrained regarding sensors and computation to implement state-of-the-art solutions. However, identifying the best random walk parameterisation may not be trivial. Additionally, variability among robots in terms of motion abilities-a very common condition when precise calibration is not possible-introduces the need for flexible solutions. This study explores how random walks that present chaotic or edge-of-chaos dynamics can be generated. We also evaluate their effectiveness for a simple exploration task performed by a swarm of simulated Kilobots. First, we show how Random Boolean Networks can be used as controllers for the Kilobots, achieving a significant performance improvement compared to the best parameterisation of a L\'evy-modulated Correlated Random Walk. Second, we demonstrate how chaotic dynamics are beneficial to maximise exploration effectiveness. Finally, we demonstrate how the exploration behavior produced by Boolean Networks can be optimized through an Evolutionary Robotics approach while maintaining the chaotic dynamics of the networks.
comment: This work has been submitted to the IEEE RA-L for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
☆ A Simulation Environment for the Neuroevolution of Ant Colony Dynamics
We introduce a simulation environment to facilitate research into emergent collective behaviour, with a focus on replicating the dynamics of ant colonies. By leveraging real-world data, the environment simulates a target ant trail that a controllable agent must learn to replicate, using sensory data observed by the target ant. This work aims to contribute to the neuroevolution of models for collective behaviour, focusing on evolving neural architectures that encode domain-specific behaviours in the network topology. By evolving models that can be modified and studied in a controlled environment, we can uncover the necessary conditions required for collective behaviours to emerge. We hope this environment will be useful to those studying the role of interactions in emergent behaviour within collective systems.
comment: Accepted for publication at The 2024 Conference on Artificial Life. 2 page extended abstract
♻ ☆ Generation of Asset Administration Shell with Large Language Model Agents: Towards Semantic Interoperability in Digital Twins in the Context of Industry 4.0
This research introduces a novel approach for achieving semantic interoperability in digital twins and assisting the creation of Asset Administration Shell (AAS) as digital twin model within the context of Industry 4.0. The foundational idea of our research is that the communication based on semantics and the generation of meaningful textual data are directly linked, and we posit that these processes are equivalent if the exchanged information can be serialized in text form. Based on this, we construct a "semantic node" data structure in our research to capture the semantic essence of textual data. Then, a system powered by large language models is designed and implemented to process the "semantic node" and generate standardized digital twin models from raw textual data collected from datasheets describing technical assets. Our evaluation demonstrates an effective generation rate of 62-79%, indicating a substantial proportion of the information from the source text can be translated error-free to the target digital twin instance model with the generative capability of large language models. This result has a direct application in the context of Industry 4.0, and the designed system is implemented as a data model generation tool for reducing the manual effort in creating AAS model. In our evaluation, a comparative analysis of different LLMs and an in-depth ablation study of Retrieval-Augmented Generation (RAG) mechanisms provide insights into the effectiveness of LLM systems for interpreting technical concepts and translating data. Our findings emphasize LLMs' capability to automate AAS instance creation and contribute to the broader field of semantic interoperability for digital twins in industrial applications. The prototype implementation and evaluation results are presented on our GitHub Repository: https://github.com/YuchenXia/AASbyLLM.
comment: accepted in journal IEEE ACCESS https://www.doi.org/10.1109/ACCESS.2024.3415470
♻ ☆ CoSLight: Co-optimizing Collaborator Selection and Decision-making to Enhance Traffic Signal Control KDD 2024
Effective multi-intersection collaboration is pivotal for reinforcement-learning-based traffic signal control to alleviate congestion. Existing work mainly chooses neighboring intersections as collaborators. However, quite an amount of congestion, even some wide-range congestion, is caused by non-neighbors failing to collaborate. To address these issues, we propose to separate the collaborator selection as a second policy to be learned, concurrently being updated with the original signal-controlling policy. Specifically, the selection policy in real-time adaptively selects the best teammates according to phase- and intersection-level features. Empirical results on both synthetic and real-world datasets provide robust validation for the superiority of our approach, offering significant improvements over existing state-of-the-art methods. The code is available at https://github.com/bonaldli/CoSLight.
comment: Accepted by KDD 2024
Information Retrieval 20
☆ Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset
A more robust and holistic language-video representation is the key to pushing video understanding forward. Despite the improvement in training strategies, the quality of the language-video dataset is less attention to. The current plain and simple text descriptions and the visual-only focus for the language-video tasks result in a limited capacity in real-world natural language video retrieval tasks where queries are much more complex. This paper introduces a method to automatically enhance video-language datasets, making them more modality and context-aware for more sophisticated representation learning needs, hence helping all downstream tasks. Our multifaceted video captioning method captures entities, actions, speech transcripts, aesthetics, and emotional cues, providing detailed and correlating information from the text side to the video side for training. We also develop an agent-like strategy using language models to generate high-quality, factual textual descriptions, reducing human intervention and enabling scalability. The method's effectiveness in improving language-video representation is evaluated through text-video retrieval using the MSR-VTT dataset and several multi-modal retrieval models.
☆ Converging Dimensions: Information Extraction and Summarization through Multisource, Multimodal, and Multilingual Fusion
Recent advances in large language models (LLMs) have led to new summarization strategies, offering an extensive toolkit for extracting important information. However, these approaches are frequently limited by their reliance on isolated sources of data. The amount of information that can be gathered is limited and covers a smaller range of themes, which introduces the possibility of falsified content and limited support for multilingual and multimodal data. The paper proposes a novel approach to summarization that tackles such challenges by utilizing the strength of multiple sources to deliver a more exhaustive and informative understanding of intricate topics. The research progresses beyond conventional, unimodal sources such as text documents and integrates a more diverse range of data, including YouTube playlists, pre-prints, and Wikipedia pages. The aforementioned varied sources are then converted into a unified textual representation, enabling a more holistic analysis. This multifaceted approach to summary generation empowers us to extract pertinent information from a wider array of sources. The primary tenet of this approach is to maximize information gain while minimizing information overlap and maintaining a high level of informativeness, which encourages the generation of highly coherent summaries.
comment: 11 pages, 3 figures
☆ Mining United Nations General Assembly Debates
This project explores the application of Natural Language Processing (NLP) techniques to analyse United Nations General Assembly (UNGA) speeches. Using NLP allows for the efficient processing and analysis of large volumes of textual data, enabling the extraction of semantic patterns, sentiment analysis, and topic modelling. Our goal is to deliver a comprehensive dataset and a tool (interface with descriptive statistics and automatically extracted topics) from which political scientists can derive insights into international relations and have the opportunity to have a nuanced understanding of global diplomatic discourse.
comment: 4 pages, 1 figure, 2 tables
☆ CLIP-Branches: Interactive Fine-Tuning for Text-Image Retrieval
The advent of text-image models, most notably CLIP, has significantly transformed the landscape of information retrieval. These models enable the fusion of various modalities, such as text and images. One significant outcome of CLIP is its capability to allow users to search for images using text as a query, as well as vice versa. This is achieved via a joint embedding of images and text data that can, for instance, be used to search for similar items. Despite efficient query processing techniques such as approximate nearest neighbor search, the results may lack precision and completeness. We introduce CLIP-Branches, a novel text-image search engine built upon the CLIP architecture. Our approach enhances traditional text-image search engines by incorporating an interactive fine-tuning phase, which allows the user to further concretize the search query by iteratively defining positive and negative examples. Our framework involves training a classification model given the additional user feedback and essentially outputs all positively classified instances of the entire data catalog. By building upon recent techniques, this inference phase, however, is not implemented by scanning the entire data catalog, but by employing efficient index structures pre-built for the data. Our results show that the fine-tuned results can improve the initial search outputs in terms of relevance and accuracy while maintaining swift response times
☆ R^2AG: Incorporating Retrieval Information into Retrieval Augmented Generation
Retrieval augmented generation (RAG) has been applied in many scenarios to augment large language models (LLMs) with external documents provided by retrievers. However, a semantic gap exists between LLMs and retrievers due to differences in their training objectives and architectures. This misalignment forces LLMs to passively accept the documents provided by the retrievers, leading to incomprehension in the generation process, where the LLMs are burdened with the task of distinguishing these documents using their inherent knowledge. This paper proposes R$^2$AG, a novel enhanced RAG framework to fill this gap by incorporating Retrieval information into Retrieval Augmented Generation. Specifically, R$^2$AG utilizes the nuanced features from the retrievers and employs a R$^2$-Former to capture retrieval information. Then, a retrieval-aware prompting strategy is designed to integrate retrieval information into LLMs' generation. Notably, R$^2$AG suits low-source scenarios where LLMs and retrievers are frozen. Extensive experiments across five datasets validate the effectiveness, robustness, and efficiency of R$^2$AG. Our analysis reveals that retrieval information serves as an anchor to aid LLMs in the generation process, thereby filling the semantic gap.
☆ Enhancing Collaborative Semantics of Language Model-Driven Recommendations via Graph-Aware Learning
Large Language Models (LLMs) are increasingly prominent in the recommendation systems domain. Existing studies usually utilize in-context learning or supervised fine-tuning on task-specific data to align LLMs into recommendations. However, the substantial bias in semantic spaces between language processing tasks and recommendation tasks poses a nonnegligible challenge. Specifically, without the adequate capturing ability of collaborative information, existing modeling paradigms struggle to capture behavior patterns within community groups, leading to LLMs' ineffectiveness in discerning implicit interaction semantic in recommendation scenarios. To address this, we consider enhancing the learning capability of language model-driven recommendation models for structured data, specifically by utilizing interaction graphs rich in collaborative semantics. We propose a Graph-Aware Learning for Language Model-Driven Recommendations (GAL-Rec). GAL-Rec enhances the understanding of user-item collaborative semantics by imitating the intent of Graph Neural Networks (GNNs) to aggregate multi-hop information, thereby fully exploiting the substantial learning capacity of LLMs to independently address the complex graphs in the recommendation system. Sufficient experimental results on three real-world datasets demonstrate that GAL-Rec significantly enhances the comprehension of collaborative semantics, and improves recommendation performance.
comment: 10pages
☆ Towards Robust Evaluation: A Comprehensive Taxonomy of Datasets and Metrics for Open Domain Question Answering in the Era of Large Language Models
Open Domain Question Answering (ODQA) within natural language processing involves building systems that answer factual questions using large-scale knowledge corpora. Recent advances stem from the confluence of several factors, such as large-scale training datasets, deep learning techniques, and the rise of large language models. High-quality datasets are used to train models on realistic scenarios and enable the evaluation of the system on potentially unseen data. Standardized metrics facilitate comparisons between different ODQA systems, allowing researchers to objectively track advancements in the field. Our study presents a thorough examination of the current landscape of ODQA benchmarking by reviewing 52 datasets and 20 evaluation techniques across textual and multimodal modalities. We introduce a novel taxonomy for ODQA datasets that incorporates both the modality and difficulty of the question types. Additionally, we present a structured organization of ODQA evaluation metrics along with a critical analysis of their inherent trade-offs. Our study aims to empower researchers by providing a framework for the robust evaluation of modern question-answering systems. We conclude by identifying the current challenges and outlining promising avenues for future research and development.
comment: 22 pages, 13 tables, 7 figures
☆ Communication-Efficient Federated Knowledge Graph Embedding with Entity-Wise Top-K Sparsification
Federated Knowledge Graphs Embedding learning (FKGE) encounters challenges in communication efficiency stemming from the considerable size of parameters and extensive communication rounds. However, existing FKGE methods only focus on reducing communication rounds by conducting multiple rounds of local training in each communication round, and ignore reducing the size of parameters transmitted within each communication round. To tackle the problem, we first find that universal reduction in embedding precision across all entities during compression can significantly impede convergence speed, underscoring the importance of maintaining embedding precision. We then propose bidirectional communication-efficient FedS based on Entity-Wise Top-K Sparsification strategy. During upload, clients dynamically identify and upload only the Top-K entity embeddings with the greater changes to the server. During download, the server first performs personalized embedding aggregation for each client. It then identifies and transmits the Top-K aggregated embeddings to each client. Besides, an Intermittent Synchronization Mechanism is used by FedS to mitigate negative effect of embedding inconsistency among shared entities of clients caused by heterogeneity of Federated Knowledge Graph. Extensive experiments across three datasets showcase that FedS significantly enhances communication efficiency with negligible (even no) performance degradation.
☆ Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More?
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases. Leveraging LCLMs' ability to natively ingest and process entire corpora of information offers numerous advantages. It enhances user-friendliness by eliminating the need for specialized knowledge of tools, provides robust end-to-end modeling that minimizes cascading errors in complex pipelines, and allows for the application of sophisticated prompting techniques across the entire system. To assess this paradigm shift, we introduce LOFT, a benchmark of real-world tasks requiring context up to millions of tokens designed to evaluate LCLMs' performance on in-context retrieval and reasoning. Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks. However, LCLMs still face challenges in areas like compositional reasoning that are required in SQL-like tasks. Notably, prompting strategies significantly influence performance, emphasizing the need for continued research as context lengths grow. Overall, LOFT provides a rigorous testing ground for LCLMs, showcasing their potential to supplant existing paradigms and tackle novel tasks as model capabilities scale.
comment: 29 pages. Dataset available at https://github.com/google-deepmind/loft
♻ ☆ Retrieval Augmented Generation using Engineering Design Knowledge
Aiming to support Retrieval Augmented Generation (RAG) in the design process, we present a method to identify explicit, engineering design facts - {head entity :: relationship :: tail entity} from patented artefact descriptions. Given a sentence with a pair of entities (based on noun phrases) marked in a unique manner, our method extracts the relationship that is explicitly communicated in the sentence. For this task, we create a dataset of 375,084 examples and fine-tune language models for relation identification (token classification) and elicitation (sequence-to-sequence). The token classification approach achieves up to 99.7% accuracy. Upon applying the method to a domain of 4,870 fan system patents, we populate a knowledge base of over 2.93 million facts. Using this knowledge base, we demonstrate how Large Language Models (LLMs) are guided by explicit facts to synthesise knowledge and generate technical and cohesive responses when sought out for knowledge retrieval tasks in the design process.
comment: Resources: Dataset - https://huggingface.co/datasets/siddharthl1293/engineering_design_facts Training Infrastructure - https://zenodo.org/records/12012131 Trained model - https://huggingface.co/siddharthl1293/albert-albert-large-v2 Application - https://github.com/siddharthl93/engineering-design-knowledge
♻ ☆ Advancing continual lifelong learning in neural information retrieval: definition, dataset, framework, and empirical evaluation
Continual learning refers to the capability of a machine learning model to learn and adapt to new information, without compromising its performance on previously learned tasks. Although several studies have investigated continual learning methods for information retrieval tasks, a well-defined task formulation is still lacking, and it is unclear how typical learning strategies perform in this context. To address this challenge, a systematic task formulation of continual neural information retrieval is presented, along with a multiple-topic dataset that simulates continuous information retrieval. A comprehensive continual neural information retrieval framework consisting of typical retrieval models and continual learning strategies is then proposed. Empirical evaluations illustrate that the proposed framework can successfully prevent catastrophic forgetting in neural information retrieval and enhance performance on previously learned tasks. The results indicate that embedding-based retrieval models experience a decline in their continual learning performance as the topic shift distance and dataset volume of new tasks increase. In contrast, pretraining-based models do not show any such correlation. Adopting suitable learning strategies can mitigate the effects of topic shift and data augmentation.
comment: Submitted to Information Sciences
♻ ☆ KU-DMIS at EHRSQL 2024:Generating SQL query via question templatization in EHR NAACL 2024
Transforming natural language questions into SQL queries is crucial for precise data retrieval from electronic health record (EHR) databases. A significant challenge in this process is detecting and rejecting unanswerable questions that request information beyond the database's scope or exceed the system's capabilities. In this paper, we introduce a novel text-to-SQL framework that robustly handles out-of-domain questions and verifies the generated queries with query execution.Our framework begins by standardizing the structure of questions into a templated format. We use a powerful large language model (LLM), fine-tuned GPT-3.5 with detailed prompts involving the table schemas of the EHR database system. Our experimental results demonstrate the effectiveness of our framework on the EHRSQL-2024 benchmark benchmark, a shared task in the ClinicalNLP workshop. Although a straightforward fine-tuning of GPT shows promising results on the development set, it struggled with the out-of-domain questions in the test set. With our framework, we improve our system's adaptability and achieve competitive performances in the official leaderboard of the EHRSQL-2024 challenge.
comment: Published at ClinicalNLP workshop @ NAACL 2024
♻ ☆ MSynFD: Multi-hop Syntax aware Fake News Detection
The proliferation of social media platforms has fueled the rapid dissemination of fake news, posing threats to our real-life society. Existing methods use multimodal data or contextual information to enhance the detection of fake news by analyzing news content and/or its social context. However, these methods often overlook essential textual news content (articles) and heavily rely on sequential modeling and global attention to extract semantic information. These existing methods fail to handle the complex, subtle twists in news articles, such as syntax-semantics mismatches and prior biases, leading to lower performance and potential failure when modalities or social context are missing. To bridge these significant gaps, we propose a novel multi-hop syntax aware fake news detection (MSynFD) method, which incorporates complementary syntax information to deal with subtle twists in fake news. Specifically, we introduce a syntactical dependency graph and design a multi-hop subgraph aggregation mechanism to capture multi-hop syntax. It extends the effect of word perception, leading to effective noise filtering and adjacent relation enhancement. Subsequently, a sequential relative position-aware Transformer is designed to capture the sequential information, together with an elaborate keyword debiasing module to mitigate the prior bias. Extensive experimental results on two public benchmark datasets verify the effectiveness and superior performance of our proposed MSynFD over state-of-the-art detection models.
comment: 10 pages
♻ ☆ ERASE: Benchmarking Feature Selection Methods for Deep Recommender Systems KDD 2024
Deep Recommender Systems (DRS) are increasingly dependent on a large number of feature fields for more precise recommendations. Effective feature selection methods are consequently becoming critical for further enhancing the accuracy and optimizing storage efficiencies to align with the deployment demands. This research area, particularly in the context of DRS, is nascent and faces three core challenges. Firstly, variant experimental setups across research papers often yield unfair comparisons, obscuring practical insights. Secondly, the existing literature's lack of detailed analysis on selection attributes, based on large-scale datasets and a thorough comparison among selection techniques and DRS backbones, restricts the generalizability of findings and impedes deployment on DRS. Lastly, research often focuses on comparing the peak performance achievable by feature selection methods, an approach that is typically computationally infeasible for identifying the optimal hyperparameters and overlooks evaluating the robustness and stability of these methods. To bridge these gaps, this paper presents ERASE, a comprehensive bEnchmaRk for feAture SElection for DRS. ERASE comprises a thorough evaluation of eleven feature selection methods, covering both traditional and deep learning approaches, across four public datasets, private industrial datasets, and a real-world commercial platform, achieving significant enhancement. Our code is available online for ease of reproduction.
comment: Accepted to KDD 2024
♻ ☆ Generation of Asset Administration Shell with Large Language Model Agents: Towards Semantic Interoperability in Digital Twins in the Context of Industry 4.0
This research introduces a novel approach for achieving semantic interoperability in digital twins and assisting the creation of Asset Administration Shell (AAS) as digital twin model within the context of Industry 4.0. The foundational idea of our research is that the communication based on semantics and the generation of meaningful textual data are directly linked, and we posit that these processes are equivalent if the exchanged information can be serialized in text form. Based on this, we construct a "semantic node" data structure in our research to capture the semantic essence of textual data. Then, a system powered by large language models is designed and implemented to process the "semantic node" and generate standardized digital twin models from raw textual data collected from datasheets describing technical assets. Our evaluation demonstrates an effective generation rate of 62-79%, indicating a substantial proportion of the information from the source text can be translated error-free to the target digital twin instance model with the generative capability of large language models. This result has a direct application in the context of Industry 4.0, and the designed system is implemented as a data model generation tool for reducing the manual effort in creating AAS model. In our evaluation, a comparative analysis of different LLMs and an in-depth ablation study of Retrieval-Augmented Generation (RAG) mechanisms provide insights into the effectiveness of LLM systems for interpreting technical concepts and translating data. Our findings emphasize LLMs' capability to automate AAS instance creation and contribute to the broader field of semantic interoperability for digital twins in industrial applications. The prototype implementation and evaluation results are presented on our GitHub Repository: https://github.com/YuchenXia/AASbyLLM.
comment: accepted in journal IEEE ACCESS https://www.doi.org/10.1109/ACCESS.2024.3415470
♻ ☆ Recent advances in text embedding: A Comprehensive Review of Top-Performing Methods on the MTEB Benchmark
Text embedding methods have become increasingly popular in both industrial and academic fields due to their critical role in a variety of natural language processing tasks. The significance of universal text embeddings has been further highlighted with the rise of Large Language Models (LLMs) applications such as Retrieval-Augmented Systems (RAGs). While previous models have attempted to be general-purpose, they often struggle to generalize across tasks and domains. However, recent advancements in training data quantity, quality and diversity; synthetic data generation from LLMs as well as using LLMs as backbones encourage great improvements in pursuing universal text embeddings. In this paper, we provide an overview of the recent advances in universal text embedding models with a focus on the top performing text embeddings on Massive Text Embedding Benchmark (MTEB). Through detailed comparison and analysis, we highlight the key contributions and limitations in this area, and propose potentially inspiring future research directions.
comment: 21 pages
♻ ☆ Multi-Resolution Diffusion for Privacy-Sensitive Recommender Systems
While recommender systems have become an integral component of the Web experience, their heavy reliance on user data raises privacy and security concerns. Substituting user data with synthetic data can address these concerns, but accurately replicating these real-world datasets has been a notoriously challenging problem. Recent advancements in generative AI have demonstrated the impressive capabilities of diffusion models in generating realistic data across various domains. In this work we introduce a Score-based Diffusion Recommendation Module (SDRM), which captures the intricate patterns of real-world datasets required for training highly accurate recommender systems. SDRM allows for the generation of synthetic data that can replace existing datasets to preserve user privacy, or augment existing datasets to address excessive data sparsity. Our method outperforms competing baselines such as generative adversarial networks, variational autoencoders, and recently proposed diffusion models in synthesizing various datasets to replace or augment the original data by an average improvement of 4.30% in Recall@k and 4.65% in NDCG@k.
comment: 13 pages, 3 figures
♻ ☆ Human Action Co-occurrence in Lifestyle Vlogs using Graph Link Prediction
We introduce the task of automatic human action co-occurrence identification, i.e., determine whether two human actions can co-occur in the same interval of time. We create and make publicly available the ACE (Action Co-occurrencE) dataset, consisting of a large graph of ~12k co-occurring pairs of visual actions and their corresponding video clips. We describe graph link prediction models that leverage visual and textual information to automatically infer if two actions are co-occurring. We show that graphs are particularly well suited to capture relations between human actions, and the learned graph representations are effective for our task and capture novel and relevant information across different data domains. The ACE dataset and the code introduced in this paper are publicly available at https://github.com/MichiganNLP/vlog_action_co-occurrence.
♻ ☆ Diffusion-EXR: Controllable Review Generation for Explainable Recommendation via Diffusion Models
Denoising Diffusion Probabilistic Model (DDPM) has shown great competence in image and audio generation tasks. However, there exist few attempts to employ DDPM in the text generation, especially review generation under recommendation systems. Fueled by the predicted reviews explainability that justifies recommendations could assist users better understand the recommended items and increase the transparency of recommendation system, we propose a Diffusion Model-based Review Generation towards EXplainable Recommendation named Diffusion-EXR. Diffusion-EXR corrupts the sequence of review embeddings by incrementally introducing varied levels of Gaussian noise to the sequence of word embeddings and learns to reconstruct the original word representations in the reverse process. The nature of DDPM enables our lightweight Transformer backbone to perform excellently in the recommendation review generation task. Extensive experimental results have demonstrated that Diffusion-EXR can achieve state-of-the-art review generation for recommendation on two publicly available benchmark datasets.
comment: I request to withdraw my paper due to the discovery of significant errors in terms of experimental results in the manuscript that affect the validity of the paper. These errors are necessary to correct, and the current version should not be used or cited in its present form
♻ ☆ A Gradient Accumulation Method for Dense Retriever under Memory Constraint
InfoNCE loss is commonly used to train dense retriever in information retrieval tasks. It is well known that a large batch is essential to stable and effective training with InfoNCE loss, which requires significant hardware resources. Due to the dependency of large batch, dense retriever has bottleneck of application and research. Recently, memory reduction methods have been broadly adopted to resolve the hardware bottleneck by decomposing forward and backward or using a memory bank. However, current methods still suffer from slow and unstable training. To address these issues, we propose Contrastive Accumulation (ContAccum), a stable and efficient memory reduction method for dense retriever trains that uses a dual memory bank structure to leverage previously generated query and passage representations. Experiments on widely used five information retrieval datasets indicate that ContAccum can surpass not only existing memory reduction methods but also high-resource scenario. Moreover, theoretical analysis and experimental results confirm that ContAccum provides more stable dual-encoder training than current memory bank utilization methods.
Multimedia 10
☆ Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset
A more robust and holistic language-video representation is the key to pushing video understanding forward. Despite the improvement in training strategies, the quality of the language-video dataset is less attention to. The current plain and simple text descriptions and the visual-only focus for the language-video tasks result in a limited capacity in real-world natural language video retrieval tasks where queries are much more complex. This paper introduces a method to automatically enhance video-language datasets, making them more modality and context-aware for more sophisticated representation learning needs, hence helping all downstream tasks. Our multifaceted video captioning method captures entities, actions, speech transcripts, aesthetics, and emotional cues, providing detailed and correlating information from the text side to the video side for training. We also develop an agent-like strategy using language models to generate high-quality, factual textual descriptions, reducing human intervention and enabling scalability. The method's effectiveness in improving language-video representation is evaluated through text-video retrieval using the MSR-VTT dataset and several multi-modal retrieval models.
☆ GenAI-Bench: Evaluating and Improving Compositional Text-to-Visual Generation
While text-to-visual models now produce photo-realistic images and videos, they struggle with compositional text prompts involving attributes, relationships, and higher-order reasoning such as logic and comparison. In this work, we conduct an extensive human study on GenAI-Bench to evaluate the performance of leading image and video generation models in various aspects of compositional text-to-visual generation. We also compare automated evaluation metrics against our collected human ratings and find that VQAScore -- a metric measuring the likelihood that a VQA model views an image as accurately depicting the prompt -- significantly outperforms previous metrics such as CLIPScore. In addition, VQAScore can improve generation in a black-box manner (without finetuning) via simply ranking a few (3 to 9) candidate images. Ranking by VQAScore is 2x to 3x more effective than other scoring methods like PickScore, HPSv2, and ImageReward at improving human alignment ratings for DALL-E 3 and Stable Diffusion, especially on compositional prompts that require advanced visio-linguistic reasoning. We will release a new GenAI-Rank benchmark with over 40,000 human ratings to evaluate scoring metrics on ranking images generated from the same prompt. Lastly, we discuss promising areas for improvement in VQAScore, such as addressing fine-grained visual details. We will release all human ratings (over 80,000) to facilitate scientific benchmarking of both generative models and automated metrics.
comment: We open-source our dataset, model, and code at: https://linzhiqiu.github.io/papers/genai_bench ; Project page: https://linzhiqiu.github.io/papers/genai_bench ;. arXiv admin note: substantial text overlap with arXiv:2404.01291
☆ Convex-hull Estimation using XPSNR for Versatile Video Coding ICIP
As adaptive streaming becomes crucial for delivering high-quality video content across diverse network conditions, accurate metrics to assess perceptual quality are essential. This paper explores using the eXtended Peak Signal-to-Noise Ratio (XPSNR) metric as an alternative to the popular Video Multimethod Assessment Fusion (VMAF) metric for determining optimized bitrate-resolution pairs in the context of Versatile Video Coding (VVC). Our study is rooted in the observation that XPSNR shows a superior correlation with subjective quality scores for VVC-coded Ultra-High Definition (UHD) content compared to VMAF. We predict the average XPSNR of VVC-coded bitstreams using spatiotemporal complexity features of the video and the target encoding configuration and then determine the convex-hull online. On average, the proposed convex-hull using XPSNR (VEXUS) achieves an overall quality improvement of 5.84 dB PSNR and 0.62 dB XPSNR while maintaining the same bitrate, compared to the default UHD encoding using the VVenC encoder, accompanied by an encoding time reduction of 44.43% and a decoding time reduction of 65.46%. This shift towards XPSNR as a guiding metric shall enhance the effectiveness of adaptive streaming algorithms, ensuring an optimal balance between bitrate efficiency and perceptual fidelity with advanced video coding standards.
comment: Accepted at 2024 IEEE International Conference on Image Processing (ICIP)
☆ PetalView: Fine-grained Location and Orientation Extraction of Street-view Images via Cross-view Local Search with Supplementary Materials
Satellite-based street-view information extraction by cross-view matching refers to a task that extracts the location and orientation information of a given street-view image query by using one or multiple geo-referenced satellite images. Recent work has initiated a new research direction to find accurate information within a local area covered by one satellite image centered at a location prior (e.g., from GPS). It can be used as a standalone solution or complementary step following a large-scale search with multiple satellite candidates. However, these existing works require an accurate initial orientation (angle) prior (e.g., from IMU) and/or do not efficiently search through all possible poses. To allow efficient search and to give accurate prediction regardless of the existence or the accuracy of the angle prior, we present PetalView extractors with multi-scale search. The PetalView extractors give semantically meaningful features that are equivalent across two drastically different views, and the multi-scale search strategy efficiently inspects the satellite image from coarse to fine granularity to provide sub-meter and sub-degree precision extraction. Moreover, when an angle prior is given, we propose a learnable prior angle mixer to utilize this information. Our method obtains the best performance on the VIGOR dataset and successfully improves the performance on KITTI dataset test 1 set with the recall within 1 meter (r@1m) for location estimation to 68.88% and recall within 1 degree (r@1d) 21.10% when no angle prior is available, and with angle prior achieves stable estimations at r@1m and r@1d above 70% and 21%, up to a 40-degree noise level.
comment: This paper has been accepted by ACM Multimedia 2023. This version contains additional supplementary materials
☆ Straight Through Gumbel Softmax Estimator based Bimodal Neural Architecture Search for Audio-Visual Deepfake Detection
Deepfakes are a major security risk for biometric authentication. This technology creates realistic fake videos that can impersonate real people, fooling systems that rely on facial features and voice patterns for identification. Existing multimodal deepfake detectors rely on conventional fusion methods, such as majority rule and ensemble voting, which often struggle to adapt to changing data characteristics and complex patterns. In this paper, we introduce the Straight-through Gumbel-Softmax (STGS) framework, offering a comprehensive approach to search multimodal fusion model architectures. Using a two-level search approach, the framework optimizes the network architecture, parameters, and performance. Initially, crucial features were efficiently identified from backbone networks, whereas within the cell structure, a weighted fusion operation integrated information from various sources. An architecture that maximizes the classification performance is derived by varying parameters such as temperature and sampling time. The experimental results on the FakeAVCeleb and SWAN-DF datasets demonstrated an impressive AUC value 94.4\% achieved with minimal model parameters.
☆ Reinforcing Pre-trained Models Using Counterfactual Images
This paper proposes a novel framework to reinforce classification models using language-guided generated counterfactual images. Deep learning classification models are often trained using datasets that mirror real-world scenarios. In this training process, because learning is based solely on correlations with labels, there is a risk that models may learn spurious relationships, such as an overreliance on features not central to the subject, like background elements in images. However, due to the black-box nature of the decision-making process in deep learning models, identifying and addressing these vulnerabilities has been particularly challenging. We introduce a novel framework for reinforcing the classification models, which consists of a two-stage process. First, we identify model weaknesses by testing the model using the counterfactual image dataset, which is generated by perturbed image captions. Subsequently, we employ the counterfactual images as an augmented dataset to fine-tune and reinforce the classification model. Through extensive experiments on several classification models across various datasets, we revealed that fine-tuning with a small set of counterfactual images effectively strengthens the model.
comment: 6 pages, 4 figures
☆ Integration of Policy and Reputation based Trust Mechanisms in e-Commerce Industry
The e-commerce systems are being tackled from commerce behavior and internet technologies. Therefore, trust aspect between buyer-seller transactions is a potential element which needs to be addressed in competitive e-commerce industry. The e-commerce industry is currently handling two different trust approaches. First approach consists on centralized mechanism where digital credentials/set of rules assembled, called Policy based trust mechanisms . Second approach consists on decentralized trust mechanisms where reputation, points assembled and shared, called Reputation based trust mechanisms. The difference between reputation and policy based trust mechanism will be analyzed and recommendations would be proposed to increase trust between buyer and seller in e-commerce industry. The integration of trust mechanism is proposed through mapping process, strength of one mechanism with the weakness of other. The proposed model for integrated mechanism will be presented and illustrated how the proposed model will be used in real world e-commerce industry.
☆ Media Forensics and Deepfake Systematic Survey
Deepfake is a generative deep learning algorithm that creates or changes facial features in a very realistic way making it hard to differentiate the real from the fake features It can be used to make movies look better as well as to spread false information by imitating famous people In this paper many different ways to make a Deepfake are explained analyzed and separated categorically Using Deepfake datasets models are trained and tested for reliability through experiments Deepfakes are a type of facial manipulation that allow people to change their entire faces identities attributes and expressions The trends in the available Deepfake datasets are also discussed with a focus on how they have changed Using Deep learning a general Deepfake detection model is made Moreover the problems in making and detecting Deepfakes are also mentioned As a result of this survey it is expected that the development of new Deepfake based imaging tools will speed up in the future This survey gives indepth review of methods for manipulating images of face and various techniques to spot altered face images Four types of facial manipulation are specifically discussed which are attribute manipulation expression swap entire face synthesis and identity swap Across every manipulation category we yield information on manipulation techniques significant benchmarks for technical evaluation of counterfeit detection techniques available public databases and a summary of the outcomes of all such analyses From all of the topics in the survey we focus on the most recent development of Deepfake showing its advances and obstacles in detecting fake images
comment: 46 pages, 9 figures, 5 tables
☆ Enhancing Cross-Prompt Transferability in Vision-Language Models through Contextual Injection of Target Tokens
Vision-language models (VLMs) seamlessly integrate visual and textual data to perform tasks such as image classification, caption generation, and visual question answering. However, adversarial images often struggle to deceive all prompts effectively in the context of cross-prompt migration attacks, as the probability distribution of the tokens in these images tends to favor the semantics of the original image rather than the target tokens. To address this challenge, we propose a Contextual-Injection Attack (CIA) that employs gradient-based perturbation to inject target tokens into both visual and textual contexts, thereby improving the probability distribution of the target tokens. By shifting the contextual semantics towards the target tokens instead of the original image semantics, CIA enhances the cross-prompt transferability of adversarial images.Extensive experiments on the BLIP2, InstructBLIP, and LLaVA models show that CIA outperforms existing methods in cross-prompt transferability, demonstrating its potential for more effective adversarial strategies in VLMs.
comment: 13 pages
♻ ☆ A Survey on Image-text Multimodal Models
With the significant advancements of Large Language Models (LLMs) in the field of Natural Language Processing (NLP), the development of image-text multimodal models has garnered widespread attention. Current surveys on image-text multimodal models mainly focus on representative models or application domains, but lack a review on how general technical models influence the development of domain-specific models, which is crucial for domain researchers. Based on this, this paper first reviews the technological evolution of image-text multimodal models, from early explorations of feature space to visual language encoding structures, and then to the latest large model architectures. Next, from the perspective of technological evolution, we explain how the development of general image-text multimodal technologies promotes the progress of multimodal technologies in the biomedical field, as well as the importance and complexity of specific datasets in the biomedical domain. Then, centered on the tasks of image-text multimodal models, we analyze their common components and challenges. After that, we summarize the architecture, components, and data of general image-text multimodal models, and introduce the applications and improvements of image-text multimodal models in the biomedical field. Finally, we categorize the challenges faced in the development and application of general models into external factors and intrinsic factors, further refining them into 2 external factors and 5 intrinsic factors, and propose targeted solutions, providing guidance for future research directions. For more details and data, please visit our GitHub page: \url{https://github.com/i2vec/A-survey-on-image-text-multimodal-models}.
Computer Vision and Pattern Recognition 140
☆ Depth Anywhere: Enhancing 360 Monocular Depth Estimation via Perspective Distillation and Unlabeled Data Augmentation
Accurately estimating depth in 360-degree imagery is crucial for virtual reality, autonomous navigation, and immersive media applications. Existing depth estimation methods designed for perspective-view imagery fail when applied to 360-degree images due to different camera projections and distortions, whereas 360-degree methods perform inferior due to the lack of labeled data pairs. We propose a new depth estimation framework that utilizes unlabeled 360-degree data effectively. Our approach uses state-of-the-art perspective depth estimation models as teacher models to generate pseudo labels through a six-face cube projection technique, enabling efficient labeling of depth in 360-degree images. This method leverages the increasing availability of large datasets. Our approach includes two main stages: offline mask generation for invalid regions and an online semi-supervised joint training regime. We tested our approach on benchmark datasets such as Matterport3D and Stanford2D3D, showing significant improvements in depth estimation accuracy, particularly in zero-shot scenarios. Our proposed training pipeline can enhance any 360 monocular depth estimator and demonstrates effective knowledge transfer across different camera projections and data types. See our project page for results: https://albert100121.github.io/Depth-Anywhere/
comment: Project page: https://albert100121.github.io/Depth-Anywhere/
☆ ChangeViT: Unleashing Plain Vision Transformers for Change Detection
Change detection in remote sensing images is essential for tracking environmental changes on the Earth's surface. Despite the success of vision transformers (ViTs) as backbones in numerous computer vision applications, they remain underutilized in change detection, where convolutional neural networks (CNNs) continue to dominate due to their powerful feature extraction capabilities. In this paper, our study uncovers ViTs' unique advantage in discerning large-scale changes, a capability where CNNs fall short. Capitalizing on this insight, we introduce ChangeViT, a framework that adopts a plain ViT backbone to enhance the performance of large-scale changes. This framework is supplemented by a detail-capture module that generates detailed spatial features and a feature injector that efficiently integrates fine-grained spatial information into high-level semantic learning. The feature integration ensures that ChangeViT excels in both detecting large-scale changes and capturing fine-grained details, providing comprehensive change detection across diverse scales. Without bells and whistles, ChangeViT achieves state-of-the-art performance on three popular high-resolution datasets (i.e., LEVIR-CD, WHU-CD, and CLCD) and one low-resolution dataset (i.e., OSCD), which underscores the unleashed potential of plain ViTs for change detection. Furthermore, thorough quantitative and qualitative analyses validate the efficacy of the introduced modules, solidifying the effectiveness of our approach. The source code is available at https://github.com/zhuduowang/ChangeViT.
☆ DrVideo: Document Retrieval Based Long Video Understanding
Existing methods for long video understanding primarily focus on videos only lasting tens of seconds, with limited exploration of techniques for handling longer videos. The increased number of frames in longer videos presents two main challenges: difficulty in locating key information and performing long-range reasoning. Thus, we propose DrVideo, a document-retrieval-based system designed for long video understanding. Our key idea is to convert the long-video understanding problem into a long-document understanding task so as to effectively leverage the power of large language models. Specifically, DrVideo transforms a long video into a text-based long document to initially retrieve key frames and augment the information of these frames, which is used this as the system's starting point. It then employs an agent-based iterative loop to continuously search for missing information, augment relevant data, and provide final predictions in a chain-of-thought manner once sufficient question-related information is gathered. Extensive experiments on long video benchmarks confirm the effectiveness of our method. DrVideo outperforms existing state-of-the-art methods with +3.8 accuracy on EgoSchema benchmark (3 minutes), +17.9 in MovieChat-1K break mode, +38.0 in MovieChat-1K global mode (10 minutes), and +30.2 on the LLama-Vid QA dataset (over 60 minutes).
comment: 11 pages
☆ LayerMerge: Neural Network Depth Compression through Layer Pruning and Merging ICML 2024
Recent works show that reducing the number of layers in a convolutional neural network can enhance efficiency while maintaining the performance of the network. Existing depth compression methods remove redundant non-linear activation functions and merge the consecutive convolution layers into a single layer. However, these methods suffer from a critical drawback; the kernel size of the merged layers becomes larger, significantly undermining the latency reduction gained from reducing the depth of the network. We show that this problem can be addressed by jointly pruning convolution layers and activation functions. To this end, we propose LayerMerge, a novel depth compression method that selects which activation layers and convolution layers to remove, to achieve a desired inference speed-up while minimizing performance loss. Since the corresponding selection problem involves an exponential search space, we formulate a novel surrogate optimization problem and efficiently solve it via dynamic programming. Empirical results demonstrate that our method consistently outperforms existing depth compression and layer pruning methods on various network architectures, both on image classification and generation tasks. We release the code at https://github.com/snu-mllab/LayerMerge.
comment: ICML 2024
☆ GroPrompt: Efficient Grounded Prompting and Adaptation for Referring Video Object Segmentation CVPR
Referring Video Object Segmentation (RVOS) aims to segment the object referred to by the query sentence throughout the entire video. Most existing methods require end-to-end training with dense mask annotations, which could be computation-consuming and less scalable. In this work, we aim to efficiently adapt foundation segmentation models for addressing RVOS from weak supervision with the proposed Grounded Prompting (GroPrompt) framework. More specifically, we propose Text-Aware Prompt Contrastive Learning (TAP-CL) to enhance the association between the position prompts and the referring sentences with only box supervisions, including Text-Contrastive Prompt Learning (TextCon) and Modality-Contrastive Prompt Learning (ModalCon) at frame level and video level, respectively. With the proposed TAP-CL, our GroPrompt framework can generate temporal-consistent yet text-aware position prompts describing locations and movements for the referred object from the video. The experimental results in the standard RVOS benchmarks (Ref-YouTube-VOS, Ref-DAVIS17, A2D-Sentences, and JHMDB-Sentences) demonstrate the competitive performance of our proposed GroPrompt framework given only bounding box weak supervisions.
comment: CVPR Workshop 2024
☆ VIA: A Spatiotemporal Video Adaptation Framework for Global and Local Video Editing
Video editing stands as a cornerstone of digital media, from entertainment and education to professional communication. However, previous methods often overlook the necessity of comprehensively understanding both global and local contexts, leading to inaccurate and inconsistency edits in the spatiotemporal dimension, especially for long videos. In this paper, we introduce VIA, a unified spatiotemporal VIdeo Adaptation framework for global and local video editing, pushing the limits of consistently editing minute-long videos. First, to ensure local consistency within individual frames, the foundation of VIA is a novel test-time editing adaptation method, which adapts a pre-trained image editing model for improving consistency between potential editing directions and the text instruction, and adapts masked latent variables for precise local control. Furthermore, to maintain global consistency over the video sequence, we introduce spatiotemporal adaptation that adapts consistent attention variables in key frames and strategically applies them across the whole sequence to realize the editing effects. Extensive experiments demonstrate that, compared to baseline methods, our VIA approach produces edits that are more faithful to the source videos, more coherent in the spatiotemporal context, and more precise in local control. More importantly, we show that VIA can achieve consistent long video editing in minutes, unlocking the potentials for advanced video editing tasks over long video sequences.
comment: 13 pages, 11 figures
☆ Neural Approximate Mirror Maps for Constrained Diffusion Models
Diffusion models excel at creating visually-convincing images, but they often struggle to meet subtle constraints inherent in the training data. Such constraints could be physics-based (e.g., satisfying a PDE), geometric (e.g., respecting symmetry), or semantic (e.g., including a particular number of objects). When the training data all satisfy a certain constraint, enforcing this constraint on a diffusion model not only improves its distribution-matching accuracy but also makes it more reliable for generating valid synthetic data and solving constrained inverse problems. However, existing methods for constrained diffusion models are inflexible with different types of constraints. Recent work proposed to learn mirror diffusion models (MDMs) in an unconstrained space defined by a mirror map and to impose the constraint with an inverse mirror map, but analytical mirror maps are challenging to derive for complex constraints. We propose neural approximate mirror maps (NAMMs) for general constraints. Our approach only requires a differentiable distance function from the constraint set. We learn an approximate mirror map that pushes data into an unconstrained space and a corresponding approximate inverse that maps data back to the constraint set. A generative model, such as an MDM, can then be trained in the learned mirror space and its samples restored to the constraint set by the inverse map. We validate our approach on a variety of constraints, showing that compared to an unconstrained diffusion model, a NAMM-based MDM substantially improves constraint satisfaction. We also demonstrate how existing diffusion-based inverse-problem solvers can be easily applied in the learned mirror space to solve constrained inverse problems.
☆ Adversarial Attacks on Multimodal Agents
Vision-enabled language models (VLMs) are now used to build autonomous multimodal agents capable of taking actions in real environments. In this paper, we show that multimodal agents raise new safety risks, even though attacking agents is more challenging than prior attacks due to limited access to and knowledge about the environment. Our attacks use adversarial text strings to guide gradient-based perturbation over one trigger image in the environment: (1) our captioner attack attacks white-box captioners if they are used to process images into captions as additional inputs to the VLM; (2) our CLIP attack attacks a set of CLIP models jointly, which can transfer to proprietary VLMs. To evaluate the attacks, we curated VisualWebArena-Adv, a set of adversarial tasks based on VisualWebArena, an environment for web-based multimodal agent tasks. Within an L-infinity norm of $16/256$ on a single image, the captioner attack can make a captioner-augmented GPT-4V agent execute the adversarial goals with a 75% success rate. When we remove the captioner or use GPT-4V to generate its own captions, the CLIP attack can achieve success rates of 21% and 43%, respectively. Experiments on agents based on other VLMs, such as Gemini-1.5, Claude-3, and GPT-4o, show interesting differences in their robustness. Further analysis reveals several key factors contributing to the attack's success, and we also discuss the implications for defenses as well. Project page: https://chenwu.io/attack-agent Code and data: https://github.com/ChenWu98/agent-attack
comment: 19 pages
☆ Graph Neural Networks in Histopathology: Emerging Trends and Future Directions
Histopathological analysis of Whole Slide Images (WSIs) has seen a surge in the utilization of deep learning methods, particularly Convolutional Neural Networks (CNNs). However, CNNs often fall short in capturing the intricate spatial dependencies inherent in WSIs. Graph Neural Networks (GNNs) present a promising alternative, adept at directly modeling pairwise interactions and effectively discerning the topological tissue and cellular structures within WSIs. Recognizing the pressing need for deep learning techniques that harness the topological structure of WSIs, the application of GNNs in histopathology has experienced rapid growth. In this comprehensive review, we survey GNNs in histopathology, discuss their applications, and exploring emerging trends that pave the way for future advancements in the field. We begin by elucidating the fundamentals of GNNs and their potential applications in histopathology. Leveraging quantitative literature analysis, we identify four emerging trends: Hierarchical GNNs, Adaptive Graph Structure Learning, Multimodal GNNs, and Higher-order GNNs. Through an in-depth exploration of these trends, we offer insights into the evolving landscape of GNNs in histopathological analysis. Based on our findings, we propose future directions to propel the field forward. Our analysis serves to guide researchers and practitioners towards innovative approaches and methodologies, fostering advancements in histopathological analysis through the lens of graph neural networks.
☆ AITTI: Learning Adaptive Inclusive Token for Text-to-Image Generation
Despite the high-quality results of text-to-image generation, stereotypical biases have been spotted in their generated contents, compromising the fairness of generative models. In this work, we propose to learn adaptive inclusive tokens to shift the attribute distribution of the final generative outputs. Unlike existing de-biasing approaches, our method requires neither explicit attribute specification nor prior knowledge of the bias distribution. Specifically, the core of our method is a lightweight adaptive mapping network, which can customize the inclusive tokens for the concepts to be de-biased, making the tokens generalizable to unseen concepts regardless of their original bias distributions. This is achieved by tuning the adaptive mapping network with a handful of balanced and inclusive samples using an anchor loss. Experimental results demonstrate that our method outperforms previous bias mitigation methods without attribute specification while preserving the alignment between generative results and text descriptions. Moreover, our method achieves comparable performance to models that require specific attributes or editing directions for generation. Extensive experiments showcase the effectiveness of our adaptive inclusive tokens in mitigating stereotypical bias in text-to-image generation. The code will be available at https://github.com/itsmag11/AITTI.
☆ Latent Intuitive Physics: Learning to Transfer Hidden Physics from A 3D Video ICLR 2024
We introduce latent intuitive physics, a transfer learning framework for physics simulation that can infer hidden properties of fluids from a single 3D video and simulate the observed fluid in novel scenes. Our key insight is to use latent features drawn from a learnable prior distribution conditioned on the underlying particle states to capture the invisible and complex physical properties. To achieve this, we train a parametrized prior learner given visual observations to approximate the visual posterior of inverse graphics, and both the particle states and the visual posterior are obtained from a learned neural renderer. The converged prior learner is embedded in our probabilistic physics engine, allowing us to perform novel simulations on unseen geometries, boundaries, and dynamics without knowledge of the true physical parameters. We validate our model in three ways: (i) novel scene simulation with the learned visual-world physics, (ii) future prediction of the observed fluid dynamics, and (iii) supervised particle simulation. Our model demonstrates strong performance in all three tasks.
comment: Published as a conference paper at ICLR 2024
☆ MAC: A Benchmark for Multiple Attributes Compositional Zero-Shot Learning
Compositional Zero-Shot Learning (CZSL) aims to learn semantic primitives (attributes and objects) from seen compositions and recognize unseen attribute-object compositions. Existing CZSL datasets focus on single attributes, neglecting the fact that objects naturally exhibit multiple interrelated attributes. Real-world objects often possess multiple interrelated attributes, and current datasets' narrow attribute scope and single attribute labeling introduce annotation biases, undermining model performance and evaluation. To address these limitations, we introduce the Multi-Attribute Composition (MAC) dataset, encompassing 18,217 images and 11,067 compositions with comprehensive, representative, and diverse attribute annotations. MAC includes an average of 30.2 attributes per object and 65.4 objects per attribute, facilitating better multi-attribute composition predictions. Our dataset supports deeper semantic understanding and higher-order attribute associations, providing a more realistic and challenging benchmark for the CZSL task. We also develop solutions for multi-attribute compositional learning and propose the MM-encoder to disentangling the attributes and objects.
comment: 13pages,5figures
☆ GFM4MPM: Towards Geospatial Foundation Models for Mineral Prospectivity Mapping
Machine Learning (ML) for Mineral Prospectivity Mapping (MPM) remains a challenging problem as it requires the analysis of associations between large-scale multi-modal geospatial data and few historical mineral commodity observations (positive labels). Recent MPM works have explored Deep Learning (DL) as a modeling tool with more representation capacity. However, these overparameterized methods may be more prone to overfitting due to their reliance on scarce labeled data. While a large quantity of unlabeled geospatial data exists, no prior MPM works have considered using such information in a self-supervised manner. Our MPM approach uses a masked image modeling framework to pretrain a backbone neural network in a self-supervised manner using unlabeled geospatial data alone. After pretraining, the backbone network provides feature extraction for downstream MPM tasks. We evaluated our approach alongside existing methods to assess mineral prospectivity of Mississippi Valley Type (MVT) and Clastic-Dominated (CD) Lead-Zinc deposits in North America and Australia. Our results demonstrate that self-supervision promotes robustness in learned features, improving prospectivity predictions. Additionally, we leverage explainable artificial intelligence techniques to demonstrate that individual predictions can be interpreted from a geological perspective.
comment: 12 pages, 16 figures, 7 tables
☆ Extracting Training Data from Unconditional Diffusion Models
As diffusion probabilistic models (DPMs) are being employed as mainstream models for generative artificial intelligence (AI), the study of their memorization of the raw training data has attracted growing attention. Existing works in this direction aim to establish an understanding of whether or to what extent DPMs learn by memorization. Such an understanding is crucial for identifying potential risks of data leakage and copyright infringement in diffusion models and, more importantly, for more controllable generation and trustworthy application of Artificial Intelligence Generated Content (AIGC). While previous works have made important observations of when DPMs are prone to memorization, these findings are mostly empirical, and the developed data extraction methods only work for conditional diffusion models. In this work, we aim to establish a theoretical understanding of memorization in DPMs with 1) a memorization metric for theoretical analysis, 2) an analysis of conditional memorization with informative and random labels, and 3) two better evaluation metrics for measuring memorization. Based on the theoretical analysis, we further propose a novel data extraction method called \textbf{Surrogate condItional Data Extraction (SIDE)} that leverages a classifier trained on generated data as a surrogate condition to extract training data directly from unconditional diffusion models. Our empirical results demonstrate that SIDE can extract training data from diffusion models where previous methods fail, and it is on average over 50\% more effective across different scales of the CelebA dataset.
☆ Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning
The advancement of large language models (LLMs) has significantly broadened the scope of applications in natural language processing, with multi-modal LLMs extending these capabilities to integrate and interpret visual data. However, existing benchmarks for visual language models (VLMs) predominantly focus on single-image inputs, neglecting the crucial aspect of multi-image understanding. In this paper, we introduce a Multi-Image Relational Benchmark MIRB, designed to evaluate VLMs' ability to compare, analyze, and reason across multiple images. Our benchmark encompasses four categories: perception, visual world knowledge, reasoning, and multi-hop reasoning. Through a comprehensive evaluation of a wide range of open-source and closed-source models, we demonstrate that while open-source VLMs were shown to approach the performance of GPT-4V in single-image tasks, a significant performance gap remains in multi-image reasoning tasks. Our findings also reveal that even the state-of-the-art GPT-4V model struggles with our benchmark, underscoring the need for further research and development in this area. We believe our contribution of MIRB could serve as a testbed for developing the next-generation multi-modal models.
comment: First three authors contributed equally. Dataset: https://huggingface.co/datasets/VLLMs/MIRB
☆ Beyond Visual Appearances: Privacy-sensitive Objects Identification via Hybrid Graph Reasoning
The Privacy-sensitive Object Identification (POI) task allocates bounding boxes for privacy-sensitive objects in a scene. The key to POI is settling an object's privacy class (privacy-sensitive or non-privacy-sensitive). In contrast to conventional object classes which are determined by the visual appearance of an object, one object's privacy class is derived from the scene contexts and is subject to various implicit factors beyond its visual appearance. That is, visually similar objects may be totally opposite in their privacy classes. To explicitly derive the objects' privacy class from the scene contexts, in this paper, we interpret the POI task as a visual reasoning task aimed at the privacy of each object in the scene. Following this interpretation, we propose the PrivacyGuard framework for POI. PrivacyGuard contains three stages. i) Structuring: an unstructured image is first converted into a structured, heterogeneous scene graph that embeds rich scene contexts. ii) Data Augmentation: a contextual perturbation oversampling strategy is proposed to create slightly perturbed privacy-sensitive objects in a scene graph, thereby balancing the skewed distribution of privacy classes. iii) Hybrid Graph Generation & Reasoning: the balanced, heterogeneous scene graph is then transformed into a hybrid graph by endowing it with extra "node-node" and "edge-edge" homogeneous paths. These homogeneous paths allow direct message passing between nodes or edges, thereby accelerating reasoning and facilitating the capturing of subtle context changes. Based on this hybrid graph... **For the full abstract, see the original paper.**
comment: 15 pages
☆ AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention
Despite their great success across various multimodal tasks, Large Vision-Language Models (LVLMs) are facing a prevalent problem with object hallucinations, where the generated textual responses are inconsistent with ground-truth objects in the given image. This paper investigates various LVLMs and pinpoints attention deficiency toward discriminative local image features as one root cause of object hallucinations. Specifically, LVLMs predominantly attend to prompt-independent global image features, while failing to capture prompt-relevant local features, consequently undermining the visual grounding capacity of LVLMs and leading to hallucinations. To this end, we propose Assembly of Global and Local Attention (AGLA), a training-free and plug-and-play approach that mitigates object hallucinations by exploring an ensemble of global features for response generation and local features for visual discrimination simultaneously. Our approach exhibits an image-prompt matching scheme that captures prompt-relevant local features from images, leading to an augmented view of the input image where prompt-relevant content is reserved while irrelevant distractions are masked. With the augmented view, a calibrated decoding distribution can be derived by integrating generative global features from the original image and discriminative local features from the augmented image. Extensive experiments show that AGLA consistently mitigates object hallucinations and enhances general perception capability for LVLMs across various discriminative and generative benchmarks. Our code will be released at https://github.com/Lackel/AGLA.
☆ Self-Localized Collaborative Perception
Collaborative perception has garnered considerable attention due to its capacity to address several inherent challenges in single-agent perception, including occlusion and out-of-range issues. However, existing collaborative perception systems heavily rely on precise localization systems to establish a consistent spatial coordinate system between agents. This reliance makes them susceptible to large pose errors or malicious attacks, resulting in substantial reductions in perception performance. To address this, we propose~$\mathtt{CoBEVGlue}$, a novel self-localized collaborative perception system, which achieves more holistic and robust collaboration without using an external localization system. The core of~$\mathtt{CoBEVGlue}$ is a novel spatial alignment module, which provides the relative poses between agents by effectively matching co-visible objects across agents. We validate our method on both real-world and simulated datasets. The results show that i) $\mathtt{CoBEVGlue}$ achieves state-of-the-art detection performance under arbitrary localization noises and attacks; and ii) the spatial alignment module can seamlessly integrate with a majority of previous methods, enhancing their performance by an average of $57.7\%$. Code is available at https://github.com/VincentNi0107/CoBEVGlue
☆ Coarse-Fine Spectral-Aware Deformable Convolution For Hyperspectral Image Reconstruction ICIP2024
We study the inverse problem of Coded Aperture Snapshot Spectral Imaging (CASSI), which captures a spatial-spectral data cube using snapshot 2D measurements and uses algorithms to reconstruct 3D hyperspectral images (HSI). However, current methods based on Convolutional Neural Networks (CNNs) struggle to capture long-range dependencies and non-local similarities. The recently popular Transformer-based methods are poorly deployed on downstream tasks due to the high computational cost caused by self-attention. In this paper, we propose Coarse-Fine Spectral-Aware Deformable Convolution Network (CFSDCN), applying deformable convolutional networks (DCN) to this task for the first time. Considering the sparsity of HSI, we design a deformable convolution module that exploits its deformability to capture long-range dependencies and non-local similarities. In addition, we propose a new spectral information interaction module that considers both coarse-grained and fine-grained spectral similarities. Extensive experiments demonstrate that our CFSDCN significantly outperforms previous state-of-the-art (SOTA) methods on both simulated and real HSI datasets.
comment: 7 pages, 5 figures, Accepted by ICIP2024
☆ SUPER: Selfie Undistortion and Head Pose Editing with Identity Preservation
Self-portraits captured from a short distance might look unnatural or even unattractive due to heavy distortions making facial features malformed, and ill-placed head poses. In this paper, we propose SUPER, a novel method of eliminating distortions and adjusting head pose in a close-up face crop. We perform 3D GAN inversion for a facial image by optimizing camera parameters and face latent code, which gives a generated image. Besides, we estimate depth from the obtained latent code, create a depth-induced 3D mesh, and render it with updated camera parameters to obtain a warped portrait. Finally, we apply the visibility-based blending so that visible regions are reprojected, and occluded parts are restored with a generative model. Experiments on face undistortion benchmarks and on our self-collected Head Rotation dataset (HeRo), show that SUPER outperforms previous approaches both qualitatively and quantitatively, opening new possibilities for photorealistic selfie editing.
☆ Online-Adaptive Anomaly Detection for Defect Identification in Aircraft Assembly
Anomaly detection deals with detecting deviations from established patterns within data. It has various applications like autonomous driving, predictive maintenance, and medical diagnosis. To improve anomaly detection accuracy, transfer learning can be applied to large, pre-trained models and adapt them to the specific application context. In this paper, we propose a novel framework for online-adaptive anomaly detection using transfer learning. The approach adapts to different environments by selecting visually similar training images and online fitting a normality model to EfficientNet features extracted from the training subset. Anomaly detection is then performed by computing the Mahalanobis distance between the normality model and the test image features. Different similarity measures (SIFT/FLANN, Cosine) and normality models (MVG, OCSVM) are employed and compared with each other. We evaluate the approach on different anomaly detection benchmarks and data collected in controlled laboratory settings. Experimental results showcase a detection accuracy exceeding 0.975, outperforming the state-of-the-art ET-NET approach.
comment: This is preprint for the accepted paper
☆ Spatial Sequence Attention Network for Schizophrenia Classification from Structural Brain MR Images
Schizophrenia is a debilitating, chronic mental disorder that significantly impacts an individual's cognitive abilities, behavior, and social interactions. It is characterized by subtle morphological changes in the brain, particularly in the gray matter. These changes are often imperceptible through manual observation, demanding an automated approach to diagnosis. This study introduces a deep learning methodology for the classification of individuals with Schizophrenia. We achieve this by implementing a diversified attention mechanism known as Spatial Sequence Attention (SSA) which is designed to extract and emphasize significant feature representations from structural MRI (sMRI). Initially, we employ the transfer learning paradigm by leveraging pre-trained DenseNet to extract initial feature maps from the final convolutional block which contains morphological alterations associated with Schizophrenia. These features are further processed by the proposed SSA to capture and emphasize intricate spatial interactions and relationships across volumes within the brain. Our experimental studies conducted on a clinical dataset have revealed that the proposed attention mechanism outperforms the existing Squeeze & Excitation Network for Schizophrenia classification.
comment: This paper has been accepted for the 21st IEEE International Symposium on Biomedical Imaging (ISBI 2024)
☆ GeoBench: Benchmarking and Analyzing Monocular Geometry Estimation Models
Recent advances in discriminative and generative pretraining have yielded geometry estimation models with strong generalization capabilities. While discriminative monocular geometry estimation methods rely on large-scale fine-tuning data to achieve zero-shot generalization, several generative-based paradigms show the potential of achieving impressive generalization performance on unseen scenes by leveraging pre-trained diffusion models and fine-tuning on even a small scale of synthetic training data. Frustratingly, these models are trained with different recipes on different datasets, making it hard to find out the critical factors that determine the evaluation performance. Besides, current geometry evaluation benchmarks have two main drawbacks that may prevent the development of the field, i.e., limited scene diversity and unfavorable label quality. To resolve the above issues, (1) we build fair and strong baselines in a unified codebase for evaluating and analyzing the geometry estimation models; (2) we evaluate monocular geometry estimators on more challenging benchmarks for geometry estimation task with diverse scenes and high-quality annotations. Our results reveal that pre-trained using large data, discriminative models such as DINOv2, can outperform generative counterparts with a small amount of high-quality synthetic data under the same training configuration, which suggests that fine-tuning data quality is a more important factor than the data scale and model architecture. Our observation also raises a question: if simply fine-tuning a general vision model such as DINOv2 using a small amount of synthetic depth data produces SOTA results, do we really need complex generative models for depth estimation? We believe this work can propel advancements in geometry estimation tasks as well as a wide range of downstream applications.
comment: https://github.com/aim-uofa/GeoBench
☆ Disturbing Image Detection Using LMM-Elicited Emotion Embeddings ICIP 2024
In this paper we deal with the task of Disturbing Image Detection (DID), exploiting knowledge encoded in Large Multimodal Models (LMMs). Specifically, we propose to exploit LMM knowledge in a two-fold manner: first by extracting generic semantic descriptions, and second by extracting elicited emotions. Subsequently, we use the CLIP's text encoder in order to obtain the text embeddings of both the generic semantic descriptions and LMM-elicited emotions. Finally, we use the aforementioned text embeddings along with the corresponding CLIP's image embeddings for performing the DID task. The proposed method significantly improves the baseline classification accuracy, achieving state-of-the-art performance on the augmented Disturbing Image Detection dataset.
comment: Accepted for publication, LVLM Workshop @ IEEE Int. Conf. on Image Processing (ICIP 2024), Abu Dhabi, United Arab Emirates, Oct. 2024. This is the authors' "accepted version"
☆ Do More Details Always Introduce More Hallucinations in LVLM-based Image Captioning?
Large Vision-Language Models (LVLMs) excel in integrating visual and linguistic contexts to produce detailed content, facilitating applications such as image captioning. However, using LVLMs to generate descriptions often faces the challenge of object hallucination (OH), where the output text misrepresents actual objects in the input image. While previous studies attribute the occurrence of OH to the inclusion of more details, our study finds technical flaws in existing metrics, leading to unreliable evaluations of models and conclusions about OH. This has sparked a debate on the question: Do more details always introduce more hallucinations in LVLM-based image captioning? In this paper, we address this debate by proposing a novel decoding strategy, Differentiated Beam Decoding (DBD), along with a reliable new set of evaluation metrics: CLIP-Precision, CLIP-Recall, and CLIP-F1. DBD decodes the wealth of information hidden in visual input into distinct language representations called unit facts in parallel. This decoding is achieved via a well-designed differential score that guides the parallel search and candidate screening. The selected unit facts are then aggregated to generate the final caption. Our proposed metrics evaluate the comprehensiveness and accuracy of image captions by comparing the embedding groups of ground-truth image regions and generated text partitions. Extensive experiments on the Visual Genome dataset validate the effectiveness of our approach, demonstrating that it produces detailed descriptions while maintaining low hallucination levels.
☆ Online Anchor-based Training for Image Classification Tasks ICIP 2024
In this paper, we aim to improve the performance of a deep learning model towards image classification tasks, proposing a novel anchor-based training methodology, named \textit{Online Anchor-based Training} (OAT). The OAT method, guided by the insights provided in the anchor-based object detection methodologies, instead of learning directly the class labels, proposes to train a model to learn percentage changes of the class labels with respect to defined anchors. We define as anchors the batch centers at the output of the model. Then, during the test phase, the predictions are converted back to the original class label space, and the performance is evaluated. The effectiveness of the OAT method is validated on four datasets.
comment: Accepted for publication, IEEE Int. Conf. on Image Processing (ICIP 2024), Abu Dhabi, United Arab Emirates, Oct. 2024. This is the authors' "accepted version"
☆ Federated Learning with a Single Shared Image CVPR
Federated Learning (FL) enables multiple machines to collaboratively train a machine learning model without sharing of private training data. Yet, especially for heterogeneous models, a key bottleneck remains the transfer of knowledge gained from each client model with the server. One popular method, FedDF, uses distillation to tackle this task with the use of a common, shared dataset on which predictions are exchanged. However, in many contexts such a dataset might be difficult to acquire due to privacy and the clients might not allow for storage of a large shared dataset. To this end, in this paper, we introduce a new method that improves this knowledge distillation method to only rely on a single shared image between clients and server. In particular, we propose a novel adaptive dataset pruning algorithm that selects the most informative crops generated from only a single image. With this, we show that federated learning with distillation under a limited shared dataset budget works better by using a single image compared to multiple individual ones. Finally, we extend our approach to allow for training heterogeneous client architectures by incorporating a non-uniform distillation schedule and client-model mirroring on the server side.
comment: 8 Pages, 3 Figures, Appendix 4 Pages, CVPRW 2024
☆ Probabilistic Conceptual Explainers: Trustworthy Conceptual Explanations for Vision Foundation Models
Vision transformers (ViTs) have emerged as a significant area of focus, particularly for their capacity to be jointly trained with large language models and to serve as robust vision foundation models. Yet, the development of trustworthy explanation methods for ViTs has lagged, particularly in the context of post-hoc interpretations of ViT predictions. Existing sub-image selection approaches, such as feature-attribution and conceptual models, fall short in this regard. This paper proposes five desiderata for explaining ViTs -- faithfulness, stability, sparsity, multi-level structure, and parsimony -- and demonstrates the inadequacy of current methods in meeting these criteria comprehensively. We introduce a variational Bayesian explanation framework, dubbed ProbAbilistic Concept Explainers (PACE), which models the distributions of patch embeddings to provide trustworthy post-hoc conceptual explanations. Our qualitative analysis reveals the distributions of patch-level concepts, elucidating the effectiveness of ViTs by modeling the joint distribution of patch embeddings and ViT's predictions. Moreover, these patch-level explanations bridge the gap between image-level and dataset-level explanations, thus completing the multi-level structure of PACE. Through extensive experiments on both synthetic and real-world datasets, we demonstrate that PACE surpasses state-of-the-art methods in terms of the defined desiderata.
☆ An Empirical Study on the Fairness of Foundation Models for Multi-Organ Image Segmentation MICCAI-2024
The segmentation foundation model, e.g., Segment Anything Model (SAM), has attracted increasing interest in the medical image community. Early pioneering studies primarily concentrated on assessing and improving SAM's performance from the perspectives of overall accuracy and efficiency, yet little attention was given to the fairness considerations. This oversight raises questions about the potential for performance biases that could mirror those found in task-specific deep learning models like nnU-Net. In this paper, we explored the fairness dilemma concerning large segmentation foundation models. We prospectively curate a benchmark dataset of 3D MRI and CT scans of the organs including liver, kidney, spleen, lung and aorta from a total of 1056 healthy subjects with expert segmentations. Crucially, we document demographic details such as gender, age, and body mass index (BMI) for each subject to facilitate a nuanced fairness analysis. We test state-of-the-art foundation models for medical image segmentation, including the original SAM, medical SAM and SAT models, to evaluate segmentation efficacy across different demographic groups and identify disparities. Our comprehensive analysis, which accounts for various confounding factors, reveals significant fairness concerns within these foundational models. Moreover, our findings highlight not only disparities in overall segmentation metrics, such as the Dice Similarity Coefficient but also significant variations in the spatial distribution of segmentation errors, offering empirical evidence of the nuanced challenges in ensuring fairness in medical image segmentation.
comment: Accepted to MICCAI-2024
☆ Efficient and Long-Tailed Generalization for Pre-trained Vision-Language Model KDD 2024
Pre-trained vision-language models like CLIP have shown powerful zero-shot inference ability via image-text matching and prove to be strong few-shot learners in various downstream tasks. However, in real-world scenarios, adapting CLIP to downstream tasks may encounter the following challenges: 1) data may exhibit long-tailed data distributions and might not have abundant samples for all the classes; 2) There might be emerging tasks with new classes that contain no samples at all. To overcome them, we propose a novel framework to achieve efficient and long-tailed generalization, which can be termed as Candle. During the training process, we propose compensating logit-adjusted loss to encourage large margins of prototypes and alleviate imbalance both within the base classes and between the base and new classes. For efficient adaptation, we treat the CLIP model as a black box and leverage the extracted features to obtain visual and textual prototypes for prediction. To make full use of multi-modal information, we also propose cross-modal attention to enrich the features from both modalities. For effective generalization, we introduce virtual prototypes for new classes to make up for their lack of training images. Candle achieves state-of-the-art performance over extensive experiments on 11 diverse datasets while substantially reducing the training time, demonstrating the superiority of our approach. The source code is available at https://github.com/shijxcs/Candle.
comment: Accepted by KDD 2024
☆ Cyclic 2.5D Perceptual Loss for Cross-Modal 3D Image Synthesis: T1 MRI to Tau-PET
Alzheimer's Disease (AD) is the most common form of dementia, characterised by cognitive decline and biomarkers such as tau-proteins. Tau-positron emission tomography (tau-PET), which employs a radiotracer to selectively bind, detect, and visualise tau protein aggregates within the brain, is valuable for early AD diagnosis but is less accessible due to high costs, limited availability, and its invasive nature. Image synthesis with neural networks enables the generation of tau-PET images from more accessible T1-weighted magnetic resonance imaging (MRI) images. To ensure high-quality image synthesis, we propose a cyclic 2.5D perceptual loss combined with mean squared error and structural similarity index measure (SSIM) losses. The cyclic 2.5D perceptual loss sequentially calculates the axial 2D average perceptual loss for a specified number of epochs, followed by the coronal and sagittal planes for the same number of epochs. This sequence is cyclically performed, with intervals reducing as the cycles repeat. We conduct supervised synthesis of tau-PET images from T1w MRI images using 516 paired T1w MRI and tau-PET 3D images from the ADNI database. For the collected data, we perform preprocessing, including intensity standardisation for tau-PET images from each manufacturer. The proposed loss, applied to generative 3D U-Net and its variants, outperformed those with 2.5D and 3D perceptual losses in SSIM and peak signal-to-noise ratio (PSNR). In addition, including the cyclic 2.5D perceptual loss to the original losses of GAN-based image synthesis models such as CycleGAN and Pix2Pix improves SSIM and PSNR by at least 2% and 3%. Furthermore, by-manufacturer PET standardisation helps the models in synthesising high-quality images than min-max PET normalisation.
comment: 24 pages, 5 figures
☆ SeTAR: Out-of-Distribution Detection with Selective Low-Rank Approximation
Out-of-distribution (OOD) detection is crucial for the safe deployment of neural networks. Existing CLIP-based approaches perform OOD detection by devising novel scoring functions or sophisticated fine-tuning methods. In this work, we propose SeTAR, a novel, training-free OOD detection method that leverages selective low-rank approximation of weight matrices in vision-language and vision-only models. SeTAR enhances OOD detection via post-hoc modification of the model's weight matrices using a simple greedy search algorithm. Based on SeTAR, we further propose SeTAR+FT, a fine-tuning extension optimizing model performance for OOD detection tasks. Extensive evaluations on ImageNet1K and Pascal-VOC benchmarks show SeTAR's superior performance, reducing the false positive rate by up to 18.95% and 36.80% compared to zero-shot and fine-tuning baselines. Ablation studies further validate our approach's effectiveness, robustness, and generalizability across different model backbones. Our work offers a scalable, efficient solution for OOD detection, setting a new state-of-the-art in this area.
comment: Code are available at \url{https://github.com/X1AOX1A/SeTAR}
☆ Learned Image Compression for HE-stained Histopathological Images via Stain Deconvolution
Processing histopathological Whole Slide Images (WSI) leads to massive storage requirements for clinics worldwide. Even after lossy image compression during image acquisition, additional lossy compression is frequently possible without substantially affecting the performance of deep learning-based (DL) downstream tasks. In this paper, we show that the commonly used JPEG algorithm is not best suited for further compression and we propose Stain Quantized Latent Compression (SQLC ), a novel DL based histopathology data compression approach. SQLC compresses staining and RGB channels before passing it through a compression autoencoder (CAE ) in order to obtain quantized latent representations for maximizing the compression. We show that our approach yields superior performance in a classification downstream task, compared to traditional approaches like JPEG, while image quality metrics like the Multi-Scale Structural Similarity Index (MS-SSIM) is largely preserved. Our method is online available.
☆ Unmasking the Veil: An Investigation into Concept Ablation for Privacy and Copyright Protection in Images
In this paper, we extend the study of concept ablation within pre-trained models as introduced in 'Ablating Concepts in Text-to-Image Diffusion Models' by (Kumari et al.,2022). Our work focuses on reproducing the results achieved by the different variants of concept ablation proposed and validated through predefined metrics. We also introduce a novel variant of concept ablation, namely 'trademark ablation'. This variant combines the principles of memorization and instance ablation to tackle the nuanced influence of proprietary or branded elements in model outputs. Further, our research contributions include an observational analysis of the model's limitations. Moreover, we investigate the model's behavior in response to ablation leakage-inducing prompts, which aim to indirectly ablate concepts, revealing insights into the model's resilience and adaptability. We also observe the model's performance degradation on images generated by concepts far from its target ablation concept, documented in the appendix.
☆ Restorer: Solving Multiple Image Restoration Tasks with One Set of Parameters
Although there are many excellent solutions in image restoration, the fact that they are specifically designed for a single image restoration task may prevent them from being state-of-the-art (SOTA) in other types of image restoration tasks. While some approaches require considering multiple image restoration tasks, they are still not sufficient for the requirements of the real world and may suffer from the task confusion issue. In this work, we focus on designing a unified and effective solution for multiple image restoration tasks including deraining, desnowing, defogging, deblurring, denoising, and low-light enhancement. Based on the above purpose, we propose a Transformer network Restorer with U-Net architecture. In order to effectively deal with degraded information in multiple image restoration tasks, we need a more comprehensive attention mechanism. Thus, we design all-axis attention (AAA) through stereo embedding and 3D convolution, which can simultaneously model the long-range dependencies in both spatial and channel dimensions, capturing potential correlations among all axis. Moreover, we propose a Restorer based on textual prompts. Compared to previous methods that employ learnable queries, textual prompts bring explicit task priors to solve the task confusion issue arising from learnable queries and introduce interactivity. Based on these designs, Restorer demonstrates SOTA or comparable performance in multiple image restoration tasks compared to universal image restoration frameworks and methods specifically designed for these individual tasks. Meanwhile, Restorer is faster during inference. The above results along with the real-world test results show that Restorer has the potential to serve as a backbone for multiple real-world image restoration tasks.
☆ Cephalometric Landmark Detection across Ages with Prototypical Network MICCAI 2024
Automated cephalometric landmark detection is crucial in real-world orthodontic diagnosis. Current studies mainly focus on only adult subjects, neglecting the clinically crucial scenario presented by adolescents whose landmarks often exhibit significantly different appearances compared to adults. Hence, an open question arises about how to develop a unified and effective detection algorithm across various age groups, including adolescents and adults. In this paper, we propose CeLDA, the first work for Cephalometric Landmark Detection across Ages. Our method leverages a prototypical network for landmark detection by comparing image features with landmark prototypes. To tackle the appearance discrepancy of landmarks between age groups, we design new strategies for CeLDA to improve prototype alignment and obtain a holistic estimation of landmark prototypes from a large set of training images. Moreover, a novel prototype relation mining paradigm is introduced to exploit the anatomical relations between the landmark prototypes. Extensive experiments validate the superiority of CeLDA in detecting cephalometric landmarks on both adult and adolescent subjects. To our knowledge, this is the first effort toward developing a unified solution and dataset for cephalometric landmark detection across age groups. Our code and dataset will be made public on https://github.com/ShanghaiTech-IMPACT/Cephalometric-Landmark-Detection-across-Ages-with-Prototypical-Network
comment: MICCAI 2024
☆ A Super-human Vision-based Reinforcement Learning Agent for Autonomous Racing in Gran Turismo
Racing autonomous cars faster than the best human drivers has been a longstanding grand challenge for the fields of Artificial Intelligence and robotics. Recently, an end-to-end deep reinforcement learning agent met this challenge in a high-fidelity racing simulator, Gran Turismo. However, this agent relied on global features that require instrumentation external to the car. This paper introduces, to the best of our knowledge, the first super-human car racing agent whose sensor input is purely local to the car, namely pixels from an ego-centric camera view and quantities that can be sensed from on-board the car, such as the car's velocity. By leveraging global features only at training time, the learned agent is able to outperform the best human drivers in time trial (one car on the track at a time) races using only local input features. The resulting agent is evaluated in Gran Turismo 7 on multiple tracks and cars. Detailed ablation experiments demonstrate the agent's strong reliance on visual inputs, making it the first vision-based super-human car racing agent.
comment: Accepted at the Reinforcement Learning Conference (RLC) 2024
☆ Integrating Representational Gestures into Automatically Generated Embodied Explanations and its Effects on Understanding and Interaction Quality
In human interaction, gestures serve various functions such as marking speech rhythm, highlighting key elements, and supplementing information. These gestures are also observed in explanatory contexts. However, the impact of gestures on explanations provided by virtual agents remains underexplored. A user study was carried out to investigate how different types of gestures influence perceived interaction quality and listener understanding. This study addresses the effect of gestures in explanation by developing an embodied virtual explainer integrating both beat gestures and iconic gestures to enhance its automatically generated verbal explanations. Our model combines beat gestures generated by a learned speech-driven synthesis module with manually captured iconic gestures, supporting the agent's verbal expressions about the board game Quarto! as an explanation scenario. Findings indicate that neither the use of iconic gestures alone nor their combination with beat gestures outperforms the baseline or beat-only conditions in terms of understanding. Nonetheless, compared to prior research, the embodied agent significantly enhances understanding.
☆ ViDSOD-100: A New Dataset and a Baseline Model for RGB-D Video Salient Object Detection
With the rapid development of depth sensor, more and more RGB-D videos could be obtained. Identifying the foreground in RGB-D videos is a fundamental and important task. However, the existing salient object detection (SOD) works only focus on either static RGB-D images or RGB videos, ignoring the collaborating of RGB-D and video information. In this paper, we first collect a new annotated RGB-D video SOD (ViDSOD-100) dataset, which contains 100 videos within a total of 9,362 frames, acquired from diverse natural scenes. All the frames in each video are manually annotated to a high-quality saliency annotation. Moreover, we propose a new baseline model, named attentive triple-fusion network (ATF-Net), for RGB-D video salient object detection. Our method aggregates the appearance information from an input RGB image, spatio-temporal information from an estimated motion map, and the geometry information from the depth map by devising three modality-specific branches and a multi-modality integration branch. The modality-specific branches extract the representation of different inputs, while the multi-modality integration branch combines the multi-level modality-specific features by introducing the encoder feature aggregation (MEA) modules and decoder feature aggregation (MDA) modules. The experimental findings conducted on both our newly introduced ViDSOD-100 dataset and the well-established DAVSOD dataset highlight the superior performance of the proposed ATF-Net. This performance enhancement is demonstrated both quantitatively and qualitatively, surpassing the capabilities of current state-of-the-art techniques across various domains, including RGB-D saliency detection, video saliency detection, and video object segmentation. Our data and our code are available at github.com/jhl-Det/RGBD_Video_SOD.
☆ Reparameterizable Dual-Resolution Network for Real-time Semantic Segmentation
Semantic segmentation plays a key role in applications such as autonomous driving and medical image. Although existing real-time semantic segmentation models achieve a commendable balance between accuracy and speed, their multi-path blocks still affect overall speed. To address this issue, this study proposes a Reparameterizable Dual-Resolution Network (RDRNet) dedicated to real-time semantic segmentation. Specifically, RDRNet employs a two-branch architecture, utilizing multi-path blocks during training and reparameterizing them into single-path blocks during inference, thereby enhancing both accuracy and inference speed simultaneously. Furthermore, we propose the Reparameterizable Pyramid Pooling Module (RPPM) to enhance the feature representation of the pyramid pooling module without increasing its inference time. Experimental results on the Cityscapes, CamVid, and Pascal VOC 2012 datasets demonstrate that RDRNet outperforms existing state-of-the-art models in terms of both performance and speed. The code is available at https://github.com/gyyang23/RDRNet.
☆ RS-GPT4V: A Unified Multimodal Instruction-Following Dataset for Remote Sensing Image Understanding
The remote sensing image intelligence understanding model is undergoing a new profound paradigm shift which has been promoted by multi-modal large language model (MLLM), i.e. from the paradigm learning a domain model (LaDM) shifts to paradigm learning a pre-trained general foundation model followed by an adaptive domain model (LaGD). Under the new LaGD paradigm, the old datasets, which have led to advances in RSI intelligence understanding in the last decade, are no longer suitable for fire-new tasks. We argued that a new dataset must be designed to lighten tasks with the following features: 1) Generalization: training model to learn shared knowledge among tasks and to adapt to different tasks; 2) Understanding complex scenes: training model to understand the fine-grained attribute of the objects of interest, and to be able to describe the scene with natural language; 3) Reasoning: training model to be able to realize high-level visual reasoning. In this paper, we designed a high-quality, diversified, and unified multimodal instruction-following dataset for RSI understanding produced by GPT-4V and existing datasets, which we called RS-GPT4V. To achieve generalization, we used a (Question, Answer) which was deduced from GPT-4V via instruction-following to unify the tasks such as captioning and localization; To achieve complex scene, we proposed a hierarchical instruction description with local strategy in which the fine-grained attributes of the objects and their spatial relationships are described and global strategy in which all the local information are integrated to yield detailed instruction descript; To achieve reasoning, we designed multiple-turn QA pair to provide the reasoning ability for a model. The empirical results show that the fine-tuned MLLMs by RS-GPT4V can describe fine-grained information. The dataset is available at: https://github.com/GeoX-Lab/RS-GPT4V.
comment: 14 pages, 6 figures, 4 tables
☆ LFMamba: Light Field Image Super-Resolution with State Space Model
Recent years have witnessed significant advancements in light field image super-resolution (LFSR) owing to the progress of modern neural networks. However, these methods often face challenges in capturing long-range dependencies (CNN-based) or encounter quadratic computational complexities (Transformer-based), which limit their performance. Recently, the State Space Model (SSM) with selective scanning mechanism (S6), exemplified by Mamba, has emerged as a superior alternative in various vision tasks compared to traditional CNN- and Transformer-based approaches, benefiting from its effective long-range sequence modeling capability and linear-time complexity. Therefore, integrating S6 into LFSR becomes compelling, especially considering the vast data volume of 4D light fields. However, the primary challenge lies in \emph{designing an appropriate scanning method for 4D light fields that effectively models light field features}. To tackle this, we employ SSMs on the informative 2D slices of 4D LFs to fully explore spatial contextual information, complementary angular information, and structure information. To achieve this, we carefully devise a basic SSM block characterized by an efficient SS2D mechanism that facilitates more effective and efficient feature learning on these 2D slices. Based on the above two designs, we further introduce an SSM-based network for LFSR termed LFMamba. Experimental results on LF benchmarks demonstrate the superior performance of LFMamba. Furthermore, extensive ablation studies are conducted to validate the efficacy and generalization ability of our proposed method. We expect that our LFMamba shed light on effective representation learning of LFs with state space models.
☆ HumanSplat: Generalizable Single-Image Human Gaussian Splatting with Structure Priors
Despite recent advancements in high-fidelity human reconstruction techniques, the requirements for densely captured images or time-consuming per-instance optimization significantly hinder their applications in broader scenarios. To tackle these issues, we present HumanSplat which predicts the 3D Gaussian Splatting properties of any human from a single input image in a generalizable manner. In particular, HumanSplat comprises a 2D multi-view diffusion model and a latent reconstruction transformer with human structure priors that adeptly integrate geometric priors and semantic features within a unified framework. A hierarchical loss that incorporates human semantic information is further designed to achieve high-fidelity texture modeling and better constrain the estimated multiple views. Comprehensive experiments on standard benchmarks and in-the-wild images demonstrate that HumanSplat surpasses existing state-of-the-art methods in achieving photorealistic novel-view synthesis.
☆ PCA-Relax: Deep-learning-based groupwise registration for motion correction of cardiac $T_1$ mapping MICCAI 2024
Quantitative MRI (qMRI) is an increasingly important tool for clinical assessment of cardiovascular diseases. Quantitative maps are derived by fitting a known signal model to a series of baseline images, while the quality of the map can be deteriorated by involuntary respiratory and cardiac motion. To correct motion, a template image is often needed to register all baseline images, but the choice of template is nontrivial, leading to inconsistent performance sensitive to image contrast. In this work, we propose a novel deep-learning-based groupwise registration framework, which omits the need for a template, and registers all baseline images simultaneously. We design two groupwise losses for this registration framework: the first is a linear principal component analysis (PCA) loss that enforces alignment of baseline images irrespective of the intensity variation, and the second is an auxiliary relaxometry loss that enforces adherence of intensity profile to the signal model. We extensively evaluated our method, termed ``PCA-Relax'', and other baseline methods on an in-house cardiac MRI dataset including both pre- and post-contrast $T_1$ sequences. All methods were evaluated under three distinct training-and-evaluation strategies, namely, standard, one-shot, and test-time-adaptation. The proposed PCA-Relax showed further improved performance of registration and mapping over well-established baselines. The proposed groupwise framework is generic and can be adapted to applications involving multiple images.
comment: MICCAI 2024. Contents may slightly differ from the camera-ready version
☆ Insect Identification in the Wild: The AMI Dataset
Insects represent half of all global biodiversity, yet many of the world's insects are disappearing, with severe implications for ecosystems and agriculture. Despite this crisis, data on insect diversity and abundance remain woefully inadequate, due to the scarcity of human experts and the lack of scalable tools for monitoring. Ecologists have started to adopt camera traps to record and study insects, and have proposed computer vision algorithms as an answer for scalable data processing. However, insect monitoring in the wild poses unique challenges that have not yet been addressed within computer vision, including the combination of long-tailed data, extremely similar classes, and significant distribution shifts. We provide the first large-scale machine learning benchmarks for fine-grained insect recognition, designed to match real-world tasks faced by ecologists. Our contributions include a curated dataset of images from citizen science platforms and museums, and an expert-annotated dataset drawn from automated camera traps across multiple continents, designed to test out-of-distribution generalization under field conditions. We train and evaluate a variety of baseline algorithms and introduce a combination of data augmentation techniques that enhance generalization across geographies and hardware setups. Code and datasets are made publicly available.
☆ Automated MRI Quality Assessment of Brain T1-weighted MRI in Clinical Data Warehouses: A Transfer Learning Approach Relying on Artefact Simulation
The emergence of clinical data warehouses (CDWs), which contain the medical data of millions of patients, has paved the way for vast data sharing for research. The quality of MRIs gathered in CDWs differs greatly from what is observed in research settings and reflects a certain clinical reality. Consequently, a significant proportion of these images turns out to be unusable due to their poor quality. Given the massive volume of MRIs contained in CDWs, the manual rating of image quality is impossible. Thus, it is necessary to develop an automated solution capable of effectively identifying corrupted images in CDWs. This study presents an innovative transfer learning method for automated quality control of 3D gradient echo T1-weighted brain MRIs within a CDW, leveraging artefact simulation. We first intentionally corrupt images from research datasets by inducing poorer contrast, adding noise and introducing motion artefacts. Subsequently, three artefact-specific models are pre-trained using these corrupted images to detect distinct types of artefacts. Finally, the models are generalised to routine clinical data through a transfer learning technique, utilising 3660 manually annotated images. The overall image quality is inferred from the results of the three models, each designed to detect a specific type of artefact. Our method was validated on an independent test set of 385 3D gradient echo T1-weighted MRIs. Our proposed approach achieved excellent results for the detection of bad quality MRIs, with a balanced accuracy of over 87%, surpassing our previous approach by 3.5 percent points. Additionally, we achieved a satisfactory balanced accuracy of 79% for the detection of moderate quality MRIs, outperforming our previous performance by 5 percent points. Our framework provides a valuable tool for exploiting the potential of MRIs in CDWs.
comment: Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2024:012
☆ Cycle-Correspondence Loss: Learning Dense View-Invariant Visual Features from Unlabeled and Unordered RGB Images
Robot manipulation relying on learned object-centric descriptors became popular in recent years. Visual descriptors can easily describe manipulation task objectives, they can be learned efficiently using self-supervision, and they can encode actuated and even non-rigid objects. However, learning robust, view-invariant keypoints in a self-supervised approach requires a meticulous data collection approach involving precise calibration and expert supervision. In this paper we introduce Cycle-Correspondence Loss (CCL) for view-invariant dense descriptor learning, which adopts the concept of cycle-consistency, enabling a simple data collection pipeline and training on unpaired RGB camera views. The key idea is to autonomously detect valid pixel correspondences by attempting to use a prediction over a new image to predict the original pixel in the original image, while scaling error terms based on the estimated confidence. Our evaluation shows that we outperform other self-supervised RGB-only methods, and approach performance of supervised methods, both with respect to keypoint tracking as well as for a robot grasping downstream task.
comment: 8 pages, 4 figures, 2 tables
☆ Deep self-supervised learning with visualisation for automatic gesture recognition
Gesture is an important mean of non-verbal communication, with visual modality allows human to convey information during interaction, facilitating peoples and human-machine interactions. However, it is considered difficult to automatically recognise gestures. In this work, we explore three different means to recognise hand signs using deep learning: supervised learning based methods, self-supervised methods and visualisation based techniques applied to 3D moving skeleton data. Self-supervised learning used to train fully connected, CNN and LSTM method. Then, reconstruction method is applied to unlabelled data in simulated settings using CNN as a backbone where we use the learnt features to perform the prediction in the remaining labelled data. Lastly, Grad-CAM is applied to discover the focus of the models. Our experiments results show that supervised learning method is capable to recognise gesture accurately, with self-supervised learning increasing the accuracy in simulated settings. Finally, Grad-CAM visualisation shows that indeed the models focus on relevant skeleton joints on the associated gesture.
comment: Student research project with company collaboration
☆ Recognition of Dynamic Hand Gestures in Long Distance using a Web-Camera for Robot Guidance
Dynamic gestures enable the transfer of directive information to a robot. Moreover, the ability of a robot to recognize them from a long distance makes communication more effective and practical. However, current state-of-the-art models for dynamic gestures exhibit limitations in recognition distance, typically achieving effective performance only within a few meters. In this work, we propose a model for recognizing dynamic gestures from a long distance of up to 20 meters. The model integrates the SlowFast and Transformer architectures (SFT) to effectively process and classify complex gesture sequences captured in video frames. SFT demonstrates superior performance over existing models.
☆ TADM: Temporally-Aware Diffusion Model for Neurodegenerative Progression on Brain MRI
Generating realistic images to accurately predict changes in the structure of brain MRI is a crucial tool for clinicians. Such applications help assess patients' outcomes and analyze how diseases progress at the individual level. However, existing methods for this task present some limitations. Some approaches attempt to model the distribution of MRI scans directly by conditioning the model on patients' ages, but they fail to explicitly capture the relationship between structural changes in the brain and time intervals, especially on age-unbalanced datasets. Other approaches simply rely on interpolation between scans, which limits their clinical application as they do not predict future MRIs. To address these challenges, we propose a Temporally-Aware Diffusion Model (TADM), which introduces a novel approach to accurately infer progression in brain MRIs. TADM learns the distribution of structural changes in terms of intensity differences between scans and combines the prediction of these changes with the initial baseline scans to generate future MRIs. Furthermore, during training, we propose to leverage a pre-trained Brain-Age Estimator (BAE) to refine the model's training process, enhancing its ability to produce accurate MRIs that match the expected age gap between baseline and generated scans. Our assessment, conducted on the OASIS-3 dataset, uses similarity metrics and region sizes computed by comparing predicted and real follow-up scans on 3 relevant brain regions. TADM achieves large improvements over existing approaches, with an average decrease of 24% in region size error and an improvement of 4% in similarity metrics. These evaluations demonstrate the improvement of our model in mimicking temporal brain neurodegenerative progression compared to existing methods. Our approach will benefit applications, such as predicting patient outcomes or improving treatments for patients.
☆ LOOC: Localizing Organs using Occupancy Networks and Body Surface Depth Images
We introduce a novel method employing occupancy networks for the precise localization of 67 anatomical structures from single depth images captured from the exterior of the human body. This method considers the anatomical diversity across individuals. Our contributions include the application of occupancy networks for occluded structure localization, a robust method for estimating anatomical positions from depth images, and the creation of detailed, individualized 3D anatomical atlases. This approach promises improvements in medical imaging and automated diagnostic procedures by offering accurate, non-invasive localization of critical anatomical features.
☆ Scan-to-BIM for As-built Roads: Automatic Road Digital Twinning from Semantically Labeled Point Cloud Data
Creating geometric digital twins (gDT) for as-built roads still faces many challenges, such as low automation level and accuracy, limited asset types and shapes, and reliance on engineering experience. A novel scan-to-building information modeling (scan-to-BIM) framework is proposed for automatic road gDT creation based on semantically labeled point cloud data (PCD), which considers six asset types: Road Surface, Road Side (Slope), Road Lane (Marking), Road Sign, Road Light, and Guardrail. The framework first segments the semantic PCD into spatially independent instances or parts, then extracts the sectional polygon contours as their representative geometric information, stored in JavaScript Object Notation (JSON) files using a new data structure. Primitive gDTs are finally created from JSON files using corresponding conversion algorithms. The proposed method achieves an average distance error of 1.46 centimeters and a processing speed of 6.29 meters per second on six real-world road segments with a total length of 1,200 meters.
☆ SDNIA-YOLO: A Robust Object Detection Model for Extreme Weather Conditions
Though current object detection models based on deep learning have achieved excellent results on many conventional benchmark datasets, their performance will dramatically decline on real-world images taken under extreme conditions. Existing methods either used image augmentation based on traditional image processing algorithms or applied customized and scene-limited image adaptation technologies for robust modeling. This study thus proposes a stylization data-driven neural-image-adaptive YOLO (SDNIA-YOLO), which improves the model's robustness by enhancing image quality adaptively and learning valuable information related to extreme weather conditions from images synthesized by neural style transfer (NST). Experiments show that the developed SDNIA-YOLOv3 achieves significant mAP@.5 improvements of at least 15% on the real-world foggy (RTTS) and lowlight (ExDark) test sets compared with the baseline model. Besides, the experiments also highlight the outstanding potential of stylization data in simulating extreme weather conditions. The developed SDNIA-YOLO remains excellent characteristics of the native YOLO to a great extent, such as end-to-end one-stage, data-driven, and fast.
☆ VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding
We introduce a new benchmark designed to advance the development of general-purpose, large-scale vision-language models for remote sensing images. Although several vision-language datasets in remote sensing have been proposed to pursue this goal, existing datasets are typically tailored to single tasks, lack detailed object information, or suffer from inadequate quality control. Exploring these improvement opportunities, we present a Versatile vision-language Benchmark for Remote Sensing image understanding, termed VRSBench. This benchmark comprises 29,614 images, with 29,614 human-verified detailed captions, 52,472 object references, and 123,221 question-answer pairs. It facilitates the training and evaluation of vision-language models across a broad spectrum of remote sensing image understanding tasks. We further evaluated state-of-the-art models on this benchmark for three vision-language tasks: image captioning, visual grounding, and visual question answering. Our work aims to significantly contribute to the development of advanced vision-language models in the field of remote sensing. The data and code can be accessed at https://github.com/lx709/VRSBench.
comment: Submitted for consideration at a conference
☆ UAV-based Intelligent Information Systems on Winter Road Safety for Autonomous Vehicles
As autonomous vehicles continue to revolutionize transportation, addressing challenges posed by adverse weather conditions, particularly during winter, becomes paramount for ensuring safe and efficient operations. One of the most important aspects of a road safety inspection during adverse weather is when a limited lane width can reduce the capacity of the road and raise the risk of serious accidents involving autonomous vehicles. In this research, a method for improving driving challenges on roads in winter conditions, with a model that segments and estimates the width of the road from the perspectives of Uncrewed aerial vehicles and autonomous vehicles. The proposed approach in this article is needed to empower self-driving cars with up-to-date and accurate insights, enhancing their adaptability and decision-making capabilities in winter landscapes.
☆ A Comparative Study of Continuous Sign Language Recognition Techniques
Continuous Sign Language Recognition (CSLR) focuses on the interpretation of a sequence of sign language gestures performed continually without pauses. In this study, we conduct an empirical evaluation of recent deep learning CSLR techniques and assess their performance across various datasets and sign languages. The models selected for analysis implement a range of approaches for extracting meaningful features and employ distinct training strategies. To determine their efficacy in modeling different sign languages, these models were evaluated using multiple datasets, specifically RWTH-PHOENIX-Weather-2014, ArabSign, and GrSL, each representing a unique sign language. The performance of the models was further tested with unseen signers and sentences. The conducted experiments establish new benchmarks on the selected datasets and provide valuable insights into the robustness and generalization of the evaluated techniques under challenging scenarios.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
☆ Mixing Natural and Synthetic Images for Robust Self-Supervised Representations
This paper introduces DiffMix, a new self-supervised learning (SSL) pre-training framework that combines real and synthetic images. Unlike traditional SSL methods that predominantly use real images, DiffMix uses a variant of Stable Diffusion to replace an augmented instance of a real image, facilitating the learning of cross real-synthetic image representations. The key insight is that while SSL methods trained solely on synthetic images underperform compared to those trained on real images, a blended training approach using both real and synthetic images leads to more robust and adaptable representations. Experiments demonstrate that DiffMix enhances the SSL methods SimCLR, BarlowTwins, and DINO, across various robustness datasets and domain transfer tasks. DiffMix boosts SimCLR's accuracy on ImageNet-1K by 4.56\%. These results challenge the notion that high-quality real images are crucial for SSL pre-training by showing that lower quality synthetic images can also produce strong representations. DiffMix also reduces the need for image augmentations in SSL, offering new optimization strategies.
☆ Competitive Learning for Achieving Content-specific Filters in Video Coding for Machines ICIP 2024
This paper investigates the efficacy of jointly optimizing content-specific post-processing filters to adapt a human oriented video/image codec into a codec suitable for machine vision tasks. By observing that artifacts produced by video/image codecs are content-dependent, we propose a novel training strategy based on competitive learning principles. This strategy assigns training samples to filters dynamically, in a fuzzy manner, which further optimizes the winning filter on the given sample. Inspired by simulated annealing optimization techniques, we employ a softmax function with a temperature variable as the weight allocation function to mitigate the effects of random initialization. Our evaluation, conducted on a system utilizing multiple post-processing filters within a Versatile Video Coding (VVC) codec framework, demonstrates the superiority of content-specific filters trained with our proposed strategies, specifically, when images are processed in blocks. Using VVC reference software VTM 12.0 as the anchor, experiments on the OpenImages dataset show an improvement in the BD-rate reduction from -41.3% and -44.6% to -42.3% and -44.7% for object detection and instance segmentation tasks, respectively, compared to independently trained filters. The statistics of the filter usage align with our hypothesis and underscore the importance of jointly optimizing filters for both content and reconstruction quality. Our findings pave the way for further improving the performance of video/image codecs.
comment: Accepted to be preseneted in ICIP 2024
☆ LiCAF: LiDAR-Camera Asymmetric Fusion for Gait Recognition ICIP2024
Gait recognition is a biometric technology that identifies individuals by using walking patterns. Due to the significant achievements of multimodal fusion in gait recognition, we consider employing LiDAR-camera fusion to obtain robust gait representations. However, existing methods often overlook intrinsic characteristics of modalities, and lack fine-grained fusion and temporal modeling. In this paper, we introduce a novel modality-sensitive network LiCAF for LiDAR-camera fusion, which employs an asymmetric modeling strategy. Specifically, we propose Asymmetric Cross-modal Channel Attention (ACCA) and Interlaced Cross-modal Temporal Modeling (ICTM) for cross-modal valuable channel information selection and powerful temporal modeling. Our method achieves state-of-the-art performance (93.9% in Rank-1 and 98.8% in Rank-5) on the SUSTech1K dataset, demonstrating its effectiveness.
comment: Accepted by ICIP2024
☆ Encoding Matching Criteria for Cross-domain Deformable Image Registration
Most existing deep learning-based registration methods are trained on single-type images to address same-domain tasks.However, cross-domain deformable registration remains challenging.We argue that the tailor-made matching criteria in traditional registration methods is one of the main reason they are applicable in different domains.Motivated by this, we devise a registration-oriented encoder to model the matching criteria of image features and structural features, which is beneficial to boost registration accuracy and adaptability.Specifically, a general feature encoder (Encoder-G) is proposed to capture comprehensive medical image features, while a structural feature encoder (Encoder-S) is designed to encode the structural self-similarity into the global representation.Extensive experiments on images from three different domains prove the efficacy of the proposed method. Moreover, by updating Encoder-S using one-shot learning, our method can effectively adapt to different domains.The code is publicly available at https://github.com/JuliusWang-7/EncoderReg.
comment: 10 pages, 6 figures
☆ Automatic benchmarking of large multimodal models via iterative experiment programming
Assessing the capabilities of large multimodal models (LMMs) often requires the creation of ad-hoc evaluations. Currently, building new benchmarks requires tremendous amounts of manual work for each specific analysis. This makes the evaluation process tedious and costly. In this paper, we present APEx, Automatic Programming of Experiments, the first framework for automatic benchmarking of LMMs. Given a research question expressed in natural language, APEx leverages a large language model (LLM) and a library of pre-specified tools to generate a set of experiments for the model at hand, and progressively compile a scientific report. The report drives the testing procedure: based on the current status of the investigation, APEx chooses which experiments to perform and whether the results are sufficient to draw conclusions. Finally, the LLM refines the report, presenting the results to the user in natural language. Thanks to its modularity, our framework is flexible and extensible as new tools become available. Empirically, APEx reproduces the findings of existing studies while allowing for arbitrary analyses and hypothesis testing.
comment: 31 pages, 6 figures, code is available at https://github.com/altndrr/apex
☆ Enhancing Visible-Infrared Person Re-identification with Modality- and Instance-aware Visual Prompt Learning ICMR'24
The Visible-Infrared Person Re-identification (VI ReID) aims to match visible and infrared images of the same pedestrians across non-overlapped camera views. These two input modalities contain both invariant information, such as shape, and modality-specific details, such as color. An ideal model should utilize valuable information from both modalities during training for enhanced representational capability. However, the gap caused by modality-specific information poses substantial challenges for the VI ReID model to handle distinct modality inputs simultaneously. To address this, we introduce the Modality-aware and Instance-aware Visual Prompts (MIP) network in our work, designed to effectively utilize both invariant and specific information for identification. Specifically, our MIP model is built on the transformer architecture. In this model, we have designed a series of modality-specific prompts, which could enable our model to adapt to and make use of the specific information inherent in different modality inputs, thereby reducing the interference caused by the modality gap and achieving better identification. Besides, we also employ each pedestrian feature to construct a group of instance-specific prompts. These customized prompts are responsible for guiding our model to adapt to each pedestrian instance dynamically, thereby capturing identity-level discriminative clues for identification. Through extensive experiments on SYSU-MM01 and RegDB datasets, the effectiveness of both our designed modules is evaluated. Additionally, our proposed MIP performs better than most state-of-the-art methods.
comment: Accepyed by ACM International Conference on Multimedia Retrieval (ICMR'24)
☆ Immiscible Diffusion: Accelerating Diffusion Training with Noise Assignment
In this paper, we point out suboptimal noise-data mapping leads to slow training of diffusion models. During diffusion training, current methods diffuse each image across the entire noise space, resulting in a mixture of all images at every point in the noise layer. We emphasize that this random mixture of noise-data mapping complicates the optimization of the denoising function in diffusion models. Drawing inspiration from the immiscible phenomenon in physics, we propose Immiscible Diffusion, a simple and effective method to improve the random mixture of noise-data mapping. In physics, miscibility can vary according to various intermolecular forces. Thus, immiscibility means that the mixing of the molecular sources is distinguishable. Inspired by this, we propose an assignment-then-diffusion training strategy. Specifically, prior to diffusing the image data into noise, we assign diffusion target noise for the image data by minimizing the total image-noise pair distance in a mini-batch. The assignment functions analogously to external forces to separate the diffuse-able areas of images, thus mitigating the inherent difficulties in diffusion training. Our approach is remarkably simple, requiring only one line of code to restrict the diffuse-able area for each image while preserving the Gaussian distribution of noise. This ensures that each image is projected only to nearby noise. To address the high complexity of the assignment algorithm, we employ a quantized-assignment method to reduce the computational overhead to a negligible level. Experiments demonstrate that our method achieve up to 3x faster training for consistency models and DDIM on the CIFAR dataset, and up to 1.3x faster on CelebA datasets for consistency models. Besides, we conduct thorough analysis about the Immiscible Diffusion, which sheds lights on how it improves diffusion training speed while improving the fidelity.
☆ IR2QSM: Quantitative Susceptibility Mapping via Deep Neural Networks with Iterative Reverse Concatenations and Recurrent Modules
Quantitative susceptibility mapping (QSM) is an MRI phase-based post-processing technique to extract the distribution of tissue susceptibilities, demonstrating significant potential in studying neurological diseases. However, the ill-conditioned nature of dipole inversion makes QSM reconstruction from the tissue field prone to noise and artifacts. In this work, we propose a novel deep learning-based IR2QSM method for QSM reconstruction. It is designed by iterating four times of a reverse concatenations and middle recurrent modules enhanced U-net, which could dramatically improve the efficiency of latent feature utilization. Simulated and in vivo experiments were conducted to compare IR2QSM with several traditional algorithms (MEDI and iLSQR) and state-of-the-art deep learning methods (U-net, xQSM, and LPCNN). The results indicated that IR2QSM was able to obtain QSM images with significantly increased accuracy and mitigated artifacts over other methods. Particularly, IR2QSM demonstrated on average the best NRMSE (27.59%) in simulated experiments, which is 15.48%, 7.86%, 17.24%, 9.26%, and 29.13% lower than iLSQR, MEDI, U-net, xQSM, LPCNN, respectively, and led to improved QSM results with fewer artifacts for the in vivo data.
comment: 10 pages, 9 figures
☆ Unleashing the Potential of Open-set Noisy Samples Against Label Noise for Medical Image Classification
The challenge of addressing mixed closed-set and open-set label noise in medical image classification remains largely unexplored. Unlike natural image classification where there is a common practice of segregation and separate processing of closed-set and open-set noisy samples from clean ones, medical image classification faces difficulties due to high inter-class similarity which complicates the identification of open-set noisy samples. Moreover, prevailing methods do not leverage the full potential of open-set noisy samples for label noise mitigation, often leading to their exclusion or application of uniform soft labels. To address these issues, we propose an Extended Noise-robust Contrastive and Open-set Feature Augmentation (ENCOFA) framework. ENCOFA includes the Extended Noise-robust Supervised Contrastive (ENSC) Loss, which aids in distinguishing features across classes. The ENSC loss regards open-set noisy samples as an extended class and mitigates label noise by weighting contrastive pairs with label reliability. Furthermore, we develop an Open-set Feature Augmentation (OSFeatAug) module that enriches the features of open-set samples, utilizing the model's extra capacity to prevent overfitting to noisy data. We conducted experiments on a synthetic noisy dataset and a real-world noisy dataset. Our results indicate the superiority of ENCOFA and the effectiveness of leveraging the open-set noisy samples to combat label noise.
comment: 10 pages, 1 figure
☆ VIRL: Volume-Informed Representation Learning towards Few-shot Manufacturability Estimation
Designing for manufacturing poses significant challenges in part due to the computation bottleneck of Computer-Aided Manufacturing (CAM) simulations. Although deep learning as an alternative offers fast inference, its performance is dependently bounded by the need for abundant training data. Representation learning, particularly through pre-training, offers promise for few-shot learning, aiding in manufacturability tasks where data can be limited. This work introduces VIRL, a Volume-Informed Representation Learning approach to pre-train a 3D geometric encoder. The pretrained model is evaluated across four manufacturability indicators obtained from CAM simulations: subtractive machining (SM) time, additive manufacturing (AM) time, residual von Mises stress, and blade collisions during Laser Power Bed Fusion process. Across all case studies, the model pre-trained by VIRL shows substantial enhancements on demonstrating improved generalizability with limited data and superior performance with larger datasets. Regarding deployment strategy, case-specific phenomenon exists where finetuning VIRL-pretrained models adversely affects AM tasks with limited data but benefits SM time prediction. Moreover, the efficacy of Low-rank adaptation (LoRA), which balances between probing and finetuning, is explored. LoRA shows stable performance akin to probing with limited data, while achieving a higher upper bound than probing as data size increases, without the computational costs of finetuning. Furthermore, static normalization of manufacturing indicators consistently performs well across tasks, while dynamic normalization enhances performance when a reliable task dependent input is available.
☆ DASSF: Dynamic-Attention Scale-Sequence Fusion for Aerial Object Detection
The detection of small objects in aerial images is a fundamental task in the field of computer vision. Moving objects in aerial photography have problems such as different shapes and sizes, dense overlap, occlusion by the background, and object blur, however, the original YOLO algorithm has low overall detection accuracy due to its weak ability to perceive targets of different scales. In order to improve the detection accuracy of densely overlapping small targets and fuzzy targets, this paper proposes a dynamic-attention scale-sequence fusion algorithm (DASSF) for small target detection in aerial images. First, we propose a dynamic scale sequence feature fusion (DSSFF) module that improves the up-sampling mechanism and reduces computational load. Secondly, a x-small object detection head is specially added to enhance the detection capability of small targets. Finally, in order to improve the expressive ability of targets of different types and sizes, we use the dynamic head (DyHead). The model we proposed solves the problem of small target detection in aerial images and can be applied to multiple different versions of the YOLO algorithm, which is universal. Experimental results show that when the DASSF method is applied to YOLOv8, compared to YOLOv8n, on the VisDrone-2019 and DIOR datasets, the model shows an increase of 9.2% and 2.4% in the mean average precision (mAP), respectively, and outperforms the current mainstream methods.
☆ VoCo-LLaMA: Towards Vision Compression with Large Language Models
Vision-Language Models (VLMs) have achieved remarkable success in various multi-modal tasks, but they are often bottlenecked by the limited context window and high computational cost of processing high-resolution image inputs and videos. Vision compression can alleviate this problem by reducing the vision token count. Previous approaches compress vision tokens with external modules and force LLMs to understand the compressed ones, leading to visual information loss. However, the LLMs' understanding paradigm of vision tokens is not fully utilised in the compression learning process. We propose VoCo-LLaMA, the first approach to compress vision tokens using LLMs. By introducing Vision Compression tokens during the vision instruction tuning phase and leveraging attention distillation, our method distill how LLMs comprehend vision tokens into their processing of VoCo tokens. VoCo-LLaMA facilitates effective vision compression and improves the computational efficiency during the inference stage. Specifically, our method achieves minimal performance loss with a compression ratio of 576$\times$, resulting in up to 94.8$\%$ fewer FLOPs and 69.6$\%$ acceleration in inference time. Furthermore, through continuous training using time-series compressed token sequences of video frames, VoCo-LLaMA demonstrates the ability to understand temporal correlations, outperforming previous methods on popular video question-answering benchmarks. Our approach presents a promising way to unlock the full potential of VLMs' contextual window, enabling more scalable multi-modal applications. The project page, along with the associated code, can be accessed via $\href{https://yxxxb.github.io/VoCo-LLaMA-page/}{\text{this https URL}}$.
comment: 18 pages, 5 figures
☆ Agriculture-Vision Challenge 2024 -- The Runner-Up Solution for Agricultural Pattern Recognition via Class Balancing and Model Ensemble
The Agriculture-Vision Challenge at CVPR 2024 aims at leveraging semantic segmentation models to produce pixel level semantic segmentation labels within regions of interest for multi-modality satellite images. It is one of the most famous and competitive challenges for global researchers to break the boundary between computer vision and agriculture sectors. However, there is a serious class imbalance problem in the agriculture-vision dataset, which hinders the semantic segmentation performance. To solve this problem, firstly, we propose a mosaic data augmentation with a rare class sampling strategy to enrich long-tail class samples. Secondly, we employ an adaptive class weight scheme to suppress the contribution of the common classes while increasing the ones of rare classes. Thirdly, we propose a probability post-process to increase the predicted value of the rare classes. Our methodology achieved a mean Intersection over Union (mIoU) score of 0.547 on the test set, securing second place in this challenge.
☆ Advancing Cross-Domain Generalizability in Face Anti-Spoofing: Insights, Design, and Metrics CVPR
This paper presents a novel perspective for enhancing anti-spoofing performance in zero-shot data domain generalization. Unlike traditional image classification tasks, face anti-spoofing datasets display unique generalization characteristics, necessitating novel zero-shot data domain generalization. One step forward to the previous frame-wise spoofing prediction, we introduce a nuanced metric calculation that aggregates frame-level probabilities for a video-wise prediction, to tackle the gap between the reported frame-wise accuracy and instability in real-world use-case. This approach enables the quantification of bias and variance in model predictions, offering a more refined analysis of model generalization. Our investigation reveals that simply scaling up the backbone of models does not inherently improve the mentioned instability, leading us to propose an ensembled backbone method from a Bayesian perspective. The probabilistically ensembled backbone both improves model robustness measured from the proposed metric and spoofing accuracy, and also leverages the advantages of measuring uncertainty, allowing for enhanced sampling during training that contributes to model generalization across new datasets. We evaluate the proposed method from the benchmark OMIC dataset and also the public CelebA-Spoof and SiW-Mv2. Our final model outperforms existing state-of-the-art methods across the datasets, showcasing advancements in Bias, Variance, HTER, and AUC metrics.
comment: 10 pages with 4 figures, Accepted by CVPRW 2024
☆ Symmetric Multi-Similarity Loss for EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge 2024
In this report, we present our champion solution for EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge in CVPR 2024. Essentially, this challenge differs from traditional visual-text retrieval tasks by providing a correlation matrix that acts as a set of soft labels for video-text clip combinations. However, existing loss functions have not fully exploited this information. Motivated by this, we propose a novel loss function, Symmetric Multi-Similarity Loss, which offers a more precise learning objective. Together with tricks and ensemble learning, the model achieves 63.76% average mAP and 74.25% average nDCG on the public leaderboard, demonstrating the effectiveness of our approach. Our code will be released at: https://github.com/xqwang14/SMS-Loss/tree/main
comment: The champion solution for Epic-Kitchen-100 Multi-Instance Retrieval Challenge
☆ Enhancing Single-Slice Segmentation with 3D-to-2D Unpaired Scan Distillation
2D single-slice abdominal computed tomography (CT) enables the assessment of body habitus and organ health with low radiation exposure. However, single-slice data necessitates the use of 2D networks for segmentation, but these networks often struggle to capture contextual information effectively. Consequently, even when trained on identical datasets, 3D networks typically achieve superior segmentation results. In this work, we propose a novel 3D-to-2D distillation framework, leveraging pre-trained 3D models to enhance 2D single-slice segmentation. Specifically, we extract the prediction distribution centroid from the 3D representations, to guide the 2D student by learning intra- and inter-class correlation. Unlike traditional knowledge distillation methods that require the same data input, our approach employs unpaired 3D CT scans with any contrast to guide the 2D student model. Experiments conducted on 707 subjects from the single-slice Baltimore Longitudinal Study of Aging (BLSA) dataset demonstrate that state-of-the-art 2D multi-organ segmentation methods can benefit from the 3D teacher model, achieving enhanced performance in single-slice multi-organ segmentation. Notably, our approach demonstrates considerable efficacy in low-data regimes, outperforming the model trained with all available training subjects even when utilizing only 200 training subjects. Thus, this work underscores the potential to alleviate manual annotation burdens.
☆ TroL: Traversal of Layers for Large Language and Vision Models
Large language and vision models (LLVMs) have been driven by the generalization power of large language models (LLMs) and the advent of visual instruction tuning. Along with scaling them up directly, these models enable LLVMs to showcase powerful vision language (VL) performances by covering diverse tasks via natural language instructions. However, existing open-source LLVMs that perform comparably to closed-source LLVMs such as GPT-4V are often considered too large (e.g., 26B, 34B, and 110B parameters), having a larger number of layers. These large models demand costly, high-end resources for both training and inference. To address this issue, we present a new efficient LLVM family with 1.8B, 3.8B, and 7B LLM model sizes, Traversal of Layers (TroL), which enables the reuse of layers in a token-wise manner. This layer traversing technique simulates the effect of looking back and retracing the answering stream while increasing the number of forward propagation layers without physically adding more layers. We demonstrate that TroL employs a simple layer traversing approach yet efficiently outperforms the open-source LLVMs with larger model sizes and rivals the performances of the closed-source LLVMs with substantial sizes.
comment: Code is available in https://github.com/ByungKwanLee/TroL
☆ Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM
Towards open-ended Video Anomaly Detection (VAD), existing methods often exhibit biased detection when faced with challenging or unseen events and lack interpretability. To address these drawbacks, we propose Holmes-VAD, a novel framework that leverages precise temporal supervision and rich multimodal instructions to enable accurate anomaly localization and comprehensive explanations. Firstly, towards unbiased and explainable VAD system, we construct the first large-scale multimodal VAD instruction-tuning benchmark, i.e., VAD-Instruct50k. This dataset is created using a carefully designed semi-automatic labeling paradigm. Efficient single-frame annotations are applied to the collected untrimmed videos, which are then synthesized into high-quality analyses of both abnormal and normal video clips using a robust off-the-shelf video captioner and a large language model (LLM). Building upon the VAD-Instruct50k dataset, we develop a customized solution for interpretable video anomaly detection. We train a lightweight temporal sampler to select frames with high anomaly response and fine-tune a multimodal large language model (LLM) to generate explanatory content. Extensive experimental results validate the generality and interpretability of the proposed Holmes-VAD, establishing it as a novel interpretable technique for real-world video anomaly analysis. To support the community, our benchmark and model will be publicly available at https://github.com/pipixin321/HolmesVAD.
comment: 19 pages, 9 figures
☆ SyncVSR: Data-Efficient Visual Speech Recognition with End-to-End Crossmodal Audio Token Synchronization
Visual Speech Recognition (VSR) stands at the intersection of computer vision and speech recognition, aiming to interpret spoken content from visual cues. A prominent challenge in VSR is the presence of homophenes-visually similar lip gestures that represent different phonemes. Prior approaches have sought to distinguish fine-grained visemes by aligning visual and auditory semantics, but often fell short of full synchronization. To address this, we present SyncVSR, an end-to-end learning framework that leverages quantized audio for frame-level crossmodal supervision. By integrating a projection layer that synchronizes visual representation with acoustic data, our encoder learns to generate discrete audio tokens from a video sequence in a non-autoregressive manner. SyncVSR shows versatility across tasks, languages, and modalities at the cost of a forward pass. Our empirical evaluations show that it not only achieves state-of-the-art results but also reduces data usage by up to ninefold.
☆ The Solution for CVPR2024 Foundational Few-Shot Object Detection Challenge CVPR2024
This report introduces an enhanced method for the Foundational Few-Shot Object Detection (FSOD) task, leveraging the vision-language model (VLM) for object detection. However, on specific datasets, VLM may encounter the problem where the detected targets are misaligned with the target concepts of interest. This misalignment hinders the zero-shot performance of VLM and the application of fine-tuning methods based on pseudo-labels. To address this issue, we propose the VLM+ framework, which integrates the multimodal large language model (MM-LLM). Specifically, we use MM-LLM to generate a series of referential expressions for each category. Based on the VLM predictions and the given annotations, we select the best referential expression for each category by matching the maximum IoU. Subsequently, we use these referential expressions to generate pseudo-labels for all images in the training set and then combine them with the original labeled data to fine-tune the VLM. Additionally, we employ iterative pseudo-label generation and optimization to further enhance the performance of the VLM. Our approach achieve 32.56 mAP in the final test.
comment: CVPR2024 Foundational Few-Shot Object Detection Challenge
☆ Hierarchical Associative Memory, Parallelized MLP-Mixer, and Symmetry Breaking
Transformers have established themselves as the leading neural network model in natural language processing and are increasingly foundational in various domains. In vision, the MLP-Mixer model has demonstrated competitive performance, suggesting that attention mechanisms might not be indispensable. Inspired by this, recent research has explored replacing attention modules with other mechanisms, including those described by MetaFormers. However, the theoretical framework for these models remains underdeveloped. This paper proposes a novel perspective by integrating Krotov's hierarchical associative memory with MetaFormers, enabling a comprehensive representation of the entire Transformer block, encompassing token-/channel-mixing modules, layer normalization, and skip connections, as a single Hopfield network. This approach yields a parallelized MLP-Mixer derived from a three-layer Hopfield network, which naturally incorporates symmetric token-/channel-mixing modules and layer normalization. Empirical studies reveal that symmetric interaction matrices in the model hinder performance in image recognition tasks. Introducing symmetry-breaking effects transitions the performance of the symmetric parallelized MLP-Mixer to that of the vanilla MLP-Mixer. This indicates that during standard training, weight matrices of the vanilla MLP-Mixer spontaneously acquire a symmetry-breaking configuration, enhancing their effectiveness. These findings offer insights into the intrinsic properties of Transformers and MLP-Mixers and their theoretical underpinnings, providing a robust framework for future model design and optimization.
comment: 18 pages
☆ PCIE_EgoHandPose Solution for EgoExo4D Hand Pose Challenge
This report presents our team's 'PCIE_EgoHandPose' solution for the EgoExo4D Hand Pose Challenge at CVPR2024. The main goal of the challenge is to accurately estimate hand poses, which involve 21 3D joints, using an RGB egocentric video image provided for the task. This task is particularly challenging due to the subtle movements and occlusions. To handle the complexity of the task, we propose the Hand Pose Vision Transformer (HP-ViT). The HP-ViT comprises a ViT backbone and transformer head to estimate joint positions in 3D, utilizing MPJPE and RLE loss function. Our approach achieved the 1st position in the Hand Pose challenge with 25.51 MPJPE and 8.49 PA-MPJPE. Code is available at https://github.com/KanokphanL/PCIE_EgoHandPose
☆ Is Your HD Map Constructor Reliable under Sensor Corruptions?
Driving systems often rely on high-definition (HD) maps for precise environmental information, which is crucial for planning and navigation. While current HD map constructors perform well under ideal conditions, their resilience to real-world challenges, \eg, adverse weather and sensor failures, is not well understood, raising safety concerns. This work introduces MapBench, the first comprehensive benchmark designed to evaluate the robustness of HD map construction methods against various sensor corruptions. Our benchmark encompasses a total of 29 types of corruptions that occur from cameras and LiDAR sensors. Extensive evaluations across 31 HD map constructors reveal significant performance degradation of existing methods under adverse weather conditions and sensor failures, underscoring critical safety concerns. We identify effective strategies for enhancing robustness, including innovative approaches that leverage multi-modal fusion, advanced data augmentation, and architectural techniques. These insights provide a pathway for developing more reliable HD map construction methods, which are essential for the advancement of autonomous driving technology. The benchmark toolkit and affiliated code and model checkpoints have been made publicly accessible.
comment: project url: https://mapbench.github.io/
☆ PCIE_LAM Solution for Ego4D Looking At Me Challenge
This report presents our team's 'PCIE_LAM' solution for the Ego4D Looking At Me Challenge at CVPR2024. The main goal of the challenge is to accurately determine if a person in the scene is looking at the camera wearer, based on a video where the faces of social partners have been localized. Our proposed solution, InternLSTM, consists of an InternVL image encoder and a Bi-LSTM network. The InternVL extracts spatial features, while the Bi-LSTM extracts temporal features. However, this task is highly challenging due to the distance between the person in the scene and the camera movement, which results in significant blurring in the face image. To address the complexity of the task, we implemented a Gaze Smoothing filter to eliminate noise or spikes from the output. Our approach achieved the 1st position in the looking at me challenge with 0.81 mAP and 0.93 accuracy rate. Code is available at https://github.com/KanokphanL/Ego4D_LAM_InternLSTM
☆ Knowledge Fusion By Evolving Weights of Language Models ACL2024
Fine-tuning pre-trained language models, particularly large language models, demands extensive computing resources and can result in varying performance outcomes across different domains and datasets. This paper examines the approach of integrating multiple models from diverse training scenarios into a unified model. This unified model excels across various data domains and exhibits the ability to generalize well on out-of-domain data. We propose a knowledge fusion method named Evolver, inspired by evolutionary algorithms, which does not need further training or additional training data. Specifically, our method involves aggregating the weights of different language models into a population and subsequently generating offspring models through mutation and crossover operations. These offspring models are then evaluated against their parents, allowing for the preservation of those models that show enhanced performance on development datasets. Importantly, our model evolving strategy can be seamlessly integrated with existing model merging frameworks, offering a versatile tool for model enhancement. Experimental results on mainstream language models (i.e., encoder-only, decoder-only, encoder-decoder) reveal that Evolver outperforms previous state-of-the-art models by large margins. The code is publicly available at {https://github.com/duguodong7/model-evolution}.
comment: Accepted by ACL2024 Findings
☆ Unlocking the Potential of Early Epochs: Uncertainty-aware CT Metal Artifact Reduction
In computed tomography (CT), the presence of metallic implants in patients often leads to disruptive artifacts in the reconstructed images, hindering accurate diagnosis. Recently, a large amount of supervised deep learning-based approaches have been proposed for metal artifact reduction (MAR). However, these methods neglect the influence of initial training weights. In this paper, we have discovered that the uncertainty image computed from the restoration result of initial training weights can effectively highlight high-frequency regions, including metal artifacts. This observation can be leveraged to assist the MAR network in removing metal artifacts. Therefore, we propose an uncertainty constraint (UC) loss that utilizes the uncertainty image as an adaptive weight to guide the MAR network to focus on the metal artifact region, leading to improved restoration. The proposed UC loss is designed to be a plug-and-play method, compatible with any MAR framework, and easily adoptable. To validate the effectiveness of the UC loss, we conduct extensive experiments on the public available Deeplesion and CLINIC-metal dataset. Experimental results demonstrate that the UC loss further optimizes the network training process and significantly improves the removal of metal artifacts.
☆ The Wisdom of a Crowd of Brains: A Universal Brain Encoder
Image-to-fMRI encoding is important for both neuroscience research and practical applications. However, such "Brain-Encoders" have been typically trained per-subject and per fMRI-dataset, thus restricted to very limited training data. In this paper we propose a Universal Brain-Encoder, which can be trained jointly on data from many different subjects/datasets/machines. What makes this possible is our new voxel-centric Encoder architecture, which learns a unique "voxel-embedding" per brain-voxel. Our Encoder trains to predict the response of each brain-voxel on every image, by directly computing the cross-attention between the brain-voxel embedding and multi-level deep image features. This voxel-centric architecture allows the functional role of each brain-voxel to naturally emerge from the voxel-image cross-attention. We show the power of this approach to (i) combine data from multiple different subjects (a "Crowd of Brains") to improve each individual brain-encoding, (ii) quick & effective Transfer-Learning across subjects, datasets, and machines (e.g., 3-Tesla, 7-Tesla), with few training examples, and (iii) use the learned voxel-embeddings as a powerful tool to explore brain functionality (e.g., what is encoded where in the brain).
☆ FCA-RAC: First Cycle Annotated Repetitive Action Counting
Repetitive action counting quantifies the frequency of specific actions performed by individuals. However, existing action-counting datasets have limited action diversity, potentially hampering model performance on unseen actions. To address this issue, we propose a framework called First Cycle Annotated Repetitive Action Counting (FCA-RAC). This framework contains 4 parts: 1) a labeling technique that annotates each training video with the start and end of the first action cycle, along with the total action count. This technique enables the model to capture the correlation between the initial action cycle and subsequent actions; 2) an adaptive sampling strategy that maximizes action information retention by adjusting to the speed of the first annotated action cycle in videos; 3) a Multi-Temporal Granularity Convolution (MTGC) module, that leverages the muli-scale first action as a kernel to convolve across the entire video. This enables the model to capture action variations at different time scales within the video; 4) a strategy called Training Knowledge Augmentation (TKA) that exploits the annotated first action cycle information from the entire dataset. This allows the network to harness shared characteristics across actions effectively, thereby enhancing model performance and generalizability to unseen actions. Experimental results demonstrate that our approach achieves superior outcomes on RepCount-A and related datasets, highlighting the efficacy of our framework in improving model performance on seen and unseen actions. Our paper makes significant contributions to the field of action counting by addressing the limitations of existing datasets and proposing novel techniques for improving model generalizability.
☆ Location-based Radiology Report-Guided Semi-supervised Learning for Prostate Cancer Detection
Prostate cancer is one of the most prevalent malignancies in the world. While deep learning has potential to further improve computer-aided prostate cancer detection on MRI, its efficacy hinges on the exhaustive curation of manually annotated images. We propose a novel methodology of semisupervised learning (SSL) guided by automatically extracted clinical information, specifically the lesion locations in radiology reports, allowing for use of unannotated images to reduce the annotation burden. By leveraging lesion locations, we refined pseudo labels, which were then used to train our location-based SSL model. We show that our SSL method can improve prostate lesion detection by utilizing unannotated images, with more substantial impacts being observed when larger proportions of unannotated images are used.
comment: 4 page paper accepted to IEEE International Symposium on Biomedical Imaging (ISBI 2024)
☆ MiSuRe is all you need to explain your image segmentation
The last decade of computer vision has been dominated by Deep Learning architectures, thanks to their unparalleled success. Their performance, however, often comes at the cost of explainability owing to their highly non-linear nature. Consequently, a parallel field of eXplainable Artificial Intelligence (XAI) has developed with the aim of generating insights regarding the decision making process of deep learning models. An important problem in XAI is that of the generation of saliency maps. These are regions in an input image which contributed most towards the model's final decision. Most work in this regard, however, has been focused on image classification, and image segmentation - despite being a ubiquitous task - has not received the same attention. In the present work, we propose MiSuRe (Minimally Sufficient Region) as an algorithm to generate saliency maps for image segmentation. The goal of the saliency maps generated by MiSuRe is to get rid of irrelevant regions, and only highlight those regions in the input image which are crucial to the image segmentation decision. We perform our analysis on 3 datasets: Triangle (artificially constructed), COCO-2017 (natural images), and the Synapse multi-organ (medical images). Additionally, we identify a potential usecase of these post-hoc saliency maps in order to perform post-hoc reliability of the segmentation model.
♻ ☆ Voxel Mamba: Group-Free State Space Models for Point Cloud based 3D Object Detection
Serialization-based methods, which serialize the 3D voxels and group them into multiple sequences before inputting to Transformers, have demonstrated their effectiveness in 3D object detection. However, serializing 3D voxels into 1D sequences will inevitably sacrifice the voxel spatial proximity. Such an issue is hard to be addressed by enlarging the group size with existing serialization-based methods due to the quadratic complexity of Transformers with feature sizes. Inspired by the recent advances of state space models (SSMs), we present a Voxel SSM, termed as Voxel Mamba, which employs a group-free strategy to serialize the whole space of voxels into a single sequence. The linear complexity of SSMs encourages our group-free design, alleviating the loss of spatial proximity of voxels. To further enhance the spatial proximity, we propose a Dual-scale SSM Block to establish a hierarchical structure, enabling a larger receptive field in the 1D serialization curve, as well as more complete local regions in 3D space. Moreover, we implicitly apply window partition under the group-free framework by positional encoding, which further enhances spatial proximity by encoding voxel positional information. Our experiments on Waymo Open Dataset and nuScenes dataset show that Voxel Mamba not only achieves higher accuracy than state-of-the-art methods, but also demonstrates significant advantages in computational efficiency.
comment: 10 pages, 4 figures
♻ ☆ Zero-Shot Neural Architecture Search: Challenges, Solutions, and Opportunities
Recently, zero-shot (or training-free) Neural Architecture Search (NAS) approaches have been proposed to liberate NAS from the expensive training process. The key idea behind zero-shot NAS approaches is to design proxies that can predict the accuracy of some given networks without training the network parameters. The proxies proposed so far are usually inspired by recent progress in theoretical understanding of deep learning and have shown great potential on several datasets and NAS benchmarks. This paper aims to comprehensively review and compare the state-of-the-art (SOTA) zero-shot NAS approaches, with an emphasis on their hardware awareness. To this end, we first review the mainstream zero-shot proxies and discuss their theoretical underpinnings. We then compare these zero-shot proxies through large-scale experiments and demonstrate their effectiveness in both hardware-aware and hardware-oblivious NAS scenarios. Finally, we point out several promising ideas to design better proxies. Our source code and the list of related papers are available on https://github.com/SLDGroup/survey-zero-shot-nas.
comment: IEEE T-PAMI
♻ ☆ MeshXL: Neural Coordinate Field for Generative 3D Foundation Models
The polygon mesh representation of 3D data exhibits great flexibility, fast rendering speed, and storage efficiency, which is widely preferred in various applications. However, given its unstructured graph representation, the direct generation of high-fidelity 3D meshes is challenging. Fortunately, with a pre-defined ordering strategy, 3D meshes can be represented as sequences, and the generation process can be seamlessly treated as an auto-regressive problem. In this paper, we validate the Neural Coordinate Field (NeurCF), an explicit coordinate representation with implicit neural embeddings, is a simple-yet-effective representation for large-scale sequential mesh modeling. After that, we present MeshXL, a family of generative pre-trained auto-regressive models, which addresses the process of 3D mesh generation with modern large language model approaches. Extensive experiments show that MeshXL is able to generate high-quality 3D meshes, and can also serve as foundation models for various down-stream applications.
♻ ☆ The Lie Derivative for Measuring Learned Equivariance ICLR 2023
Equivariance guarantees that a model's predictions capture key symmetries in data. When an image is translated or rotated, an equivariant model's representation of that image will translate or rotate accordingly. The success of convolutional neural networks has historically been tied to translation equivariance directly encoded in their architecture. The rising success of vision transformers, which have no explicit architectural bias towards equivariance, challenges this narrative and suggests that augmentations and training data might also play a significant role in their performance. In order to better understand the role of equivariance in recent vision models, we introduce the Lie derivative, a method for measuring equivariance with strong mathematical foundations and minimal hyperparameters. Using the Lie derivative, we study the equivariance properties of hundreds of pretrained models, spanning CNNs, transformers, and Mixer architectures. The scale of our analysis allows us to separate the impact of architecture from other factors like model size or training method. Surprisingly, we find that many violations of equivariance can be linked to spatial aliasing in ubiquitous network layers, such as pointwise non-linearities, and that as models get larger and more accurate they tend to display more equivariance, regardless of architecture. For example, transformers can be more equivariant than convolutional neural networks after training.
comment: ICLR 2023. Code available at: https://github.com/ngruver/lie-deriv
♻ ☆ MedThink: Inducing Medical Large-scale Visual Language Models to Hallucinate Less by Thinking More
When Large Vision Language Models (LVLMs) are applied to multimodal medical generative tasks, they suffer from significant model hallucination issues. This severely impairs the model's generative accuracy, making it challenging for LVLMs to be implemented in real-world medical scenarios to assist doctors in diagnosis. Enhancing the training data for downstream medical generative tasks is an effective way to address model hallucination. Moreover, the limited availability of training data in the medical field and privacy concerns greatly hinder the model's accuracy and generalization capabilities. In this paper, we introduce a method that mimics human cognitive processes to construct fine-grained instruction pairs and apply the concept of chain-of-thought (CoT) from inference scenarios to training scenarios, thereby proposing a method called MedThink. Our experiments on various LVLMs demonstrate that our novel data construction method tailored for the medical domain significantly improves the model's performance in medical image report generation tasks and substantially mitigates the hallucinations. All resources of this work will be released soon.
♻ ☆ MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models
The security concerns surrounding Large Language Models (LLMs) have been extensively explored, yet the safety of Multimodal Large Language Models (MLLMs) remains understudied. In this paper, we observe that Multimodal Large Language Models (MLLMs) can be easily compromised by query-relevant images, as if the text query itself were malicious. To address this, we introduce MM-SafetyBench, a comprehensive framework designed for conducting safety-critical evaluations of MLLMs against such image-based manipulations. We have compiled a dataset comprising 13 scenarios, resulting in a total of 5,040 text-image pairs. Our analysis across 12 state-of-the-art models reveals that MLLMs are susceptible to breaches instigated by our approach, even when the equipped LLMs have been safety-aligned. In response, we propose a straightforward yet effective prompting strategy to enhance the resilience of MLLMs against these types of attacks. Our work underscores the need for a concerted effort to strengthen and enhance the safety measures of open-source MLLMs against potential malicious exploits. The resource is available at https://github.com/isXinLiu/MM-SafetyBench
comment: The datasets were incomplete as they did not include all the necessary copyrights
♻ ☆ A Single Graph Convolution Is All You Need: Efficient Grayscale Image Classification ICIP 2024
Image classifiers often rely on convolutional neural networks (CNN) for their tasks, which are inherently more heavyweight than multilayer perceptrons (MLPs), which can be problematic in real-time applications. Additionally, many image classification models work on both RGB and grayscale datasets. Classifiers that operate solely on grayscale images are much less common. Grayscale image classification has diverse applications, including but not limited to medical image classification and synthetic aperture radar (SAR) automatic target recognition (ATR). Thus, we present a novel grayscale (single channel) image classification approach using a vectorized view of images. We exploit the lightweightness of MLPs by viewing images as a vector and reducing our problem setting to the grayscale image classification setting. We find that using a single graph convolutional layer batch-wise increases accuracy and reduces variance in the performance of our model. Moreover, we develop a customized accelerator on FPGA for the proposed model with several optimizations to improve its performance. Our experimental results on benchmark grayscale image datasets demonstrate the effectiveness of the proposed model, achieving vastly lower latency (up to 16$\times$ less) and competitive or leading performance compared to other state-of-the-art image classification models on various domain-specific grayscale image classification datasets.
comment: Accepted to IEEE ICIP 2024
♻ ☆ Erase to Enhance: Data-Efficient Machine Unlearning in MRI Reconstruction
Machine unlearning is a promising paradigm for removing unwanted data samples from a trained model, towards ensuring compliance with privacy regulations and limiting harmful biases. Although unlearning has been shown in, e.g., classification and recommendation systems, its potential in medical image-to-image translation, specifically in image recon-struction, has not been thoroughly investigated. This paper shows that machine unlearning is possible in MRI tasks and has the potential to benefit for bias removal. We set up a protocol to study how much shared knowledge exists between datasets of different organs, allowing us to effectively quantify the effect of unlearning. Our study reveals that combining training data can lead to hallucinations and reduced image quality in the reconstructed data. We use unlearning to remove hallucinations as a proxy exemplar of undesired data removal. Indeed, we show that machine unlearning is possible without full retraining. Furthermore, our observations indicate that maintaining high performance is feasible even when using only a subset of retain data. We have made our code publicly accessible.
comment: The paper is accpeted by MIDL 2024
♻ ☆ Retrieval-Enhanced Visual Prompt Learning for Few-shot Classification
Prompt learning has become a popular approach for adapting large vision-language models, such as CLIP, to downstream tasks. Typically, prompt learning relies on a fixed prompt token or an input-conditional token to fit a small amount of data under full supervision. While this paradigm can generalize to a certain range of unseen classes, it may struggle when domain gap increases, such as in fine-grained classification and satellite image segmentation. To address this limitation, we propose Retrieval-enhanced Prompt learning (RePrompt), which introduces retrieval mechanisms to cache the knowledge representations from downstream tasks. we first construct a retrieval database from training examples, or from external examples when available. We then integrate this retrieval-enhanced mechanism into various stages of a simple prompt learning baseline. By referencing similar samples in the training set, the enhanced model is better able to adapt to new tasks with few samples. Our extensive experiments over 15 vision datasets, including 11 downstream tasks with few-shot setting and 4 domain generalization benchmarks, demonstrate that RePrompt achieves considerably improved performance. Our proposed approach provides a promising solution to the challenges faced by prompt learning when domain gap increases. The code and models will be available.
♻ ☆ The Effectiveness of a Simplified Model Structure for Crowd Counting
In the field of crowd counting research, many recent deep learning based methods have demonstrated robust capabilities for accurately estimating crowd sizes. However, the enhancement in their performance often arises from an increase in the complexity of the model structure. This paper discusses how to construct high-performance crowd counting models using only simple structures. We proposes the Fuss-Free Network (FFNet) that is characterized by its simple and efficieny structure, consisting of only a backbone network and a multi-scale feature fusion structure. The multi-scale feature fusion structure is a simple structure consisting of three branches, each only equipped with a focus transition module, and combines the features from these branches through the concatenation operation. Our proposed crowd counting model is trained and evaluated on four widely used public datasets, and it achieves accuracy that is comparable to that of existing complex models. Furthermore, we conduct a comprehensive evaluation by replacing the existing backbones of various models such as FFNet and CCTrans with different networks, including MobileNet-v3, ConvNeXt-Tiny, and Swin-Transformer-Small. The experimental results further indicate that excellent crowd counting performance can be achieved with the simplied structure proposed by us.
♻ ☆ Benchmarking Label Noise in Instance Segmentation: Spatial Noise Matters
Obtaining accurate labels for instance segmentation is particularly challenging due to the complex nature of the task. Each image necessitates multiple annotations, encompassing not only the object's class but also its precise spatial boundaries. These requirements elevate the likelihood of errors and inconsistencies in both manual and automated annotation processes. By simulating different noise conditions, we provide a realistic scenario for assessing the robustness and generalization capabilities of instance segmentation models in different segmentation tasks, introducing COCO-N and Cityscapes-N. We also propose a benchmark for weakly annotation noise, dubbed COCO-WAN, which utilizes foundation models and weak annotations to simulate semi-automated annotation tools and their noisy labels. This study sheds light on the quality of segmentation masks produced by various models and challenges the efficacy of popular methods designed to address learning with label noise.
♻ ☆ How Well Do Multi-modal LLMs Interpret CT Scans? An Auto-Evaluation Framework for Analyses
Automatically interpreting CT scans can ease the workload of radiologists. However, this is challenging mainly due to the scarcity of adequate datasets and reference standards for evaluation. This study aims to bridge this gap by introducing a novel evaluation framework, named ``GPTRadScore''. This framework assesses the capabilities of multi-modal LLMs, such as GPT-4 with Vision (GPT-4V), Gemini Pro Vision, LLaVA-Med, and RadFM, in generating descriptions for prospectively-identified findings. By employing a decomposition technique based on GPT-4, GPTRadScore compares these generated descriptions with gold-standard report sentences, analyzing their accuracy in terms of body part, location, and type of finding. Evaluations demonstrated a high correlation with clinician assessments and highlighted its potential over traditional metrics, such as BLEU, METEOR, and ROUGE. Furthermore, to contribute to future studies, we plan to release a benchmark dataset annotated by clinicians. Using GPTRadScore, we found that while GPT-4V and Gemini Pro Vision fare better, their performance revealed significant areas for improvement, primarily due to limitations in the dataset used for training these models. To demonstrate this potential, RadFM was fine-tuned and it resulted in significant accuracy improvements: location accuracy rose from 3.41\% to 12.8\%, body part accuracy from 29.12\% to 53\%, and type accuracy from 9.24\% to 30\%, thereby validating our hypothesis.
♻ ☆ A generalizable framework for low-rank tensor completion with numerical priors
Low-Rank Tensor Completion, a method which exploits the inherent structure of tensors, has been studied extensively as an effective approach to tensor completion. Whilst such methods attained great success, none have systematically considered exploiting the numerical priors of tensor elements. Ignoring numerical priors causes loss of important information regarding the data, and therefore prevents the algorithms from reaching optimal accuracy. Despite the existence of some individual works which consider ad hoc numerical priors for specific tasks, no generalizable frameworks for incorporating numerical priors have appeared. We present the Generalized CP Decomposition Tensor Completion (GCDTC) framework, the first generalizable framework for low-rank tensor completion that takes numerical priors of the data into account. We test GCDTC by further proposing the Smooth Poisson Tensor Completion (SPTC) algorithm, an instantiation of the GCDTC framework, whose performance exceeds current state-of-the-arts by considerable margins in the task of non-negative tensor completion, exemplifying GCDTC's effectiveness. Our code is open-source.
comment: Accepted to Pattern Recognition
♻ ☆ How structured are the representations in transformer-based vision encoders? An analysis of multi-object representations in vision-language models
Forming and using symbol-like structured representations for reasoning has been considered essential for generalising over novel inputs. The primary tool that allows generalisation outside training data distribution is the ability to abstract away irrelevant information into a compact form relevant to the task. An extreme form of such abstract representations is symbols. Humans make use of symbols to bind information while abstracting away irrelevant parts to utilise the information consistently and meaningfully. This work estimates the state of such structured representations in vision encoders. Specifically, we evaluate image encoders in large vision-language pre-trained models to address the question of which desirable properties their representations lack by applying the criteria of symbolic structured reasoning described for LLMs to the image models. We test the representation space of image encoders like VIT, BLIP, CLIP, and FLAVA to characterise the distribution of the object representations in these models. In particular, we create decoding tasks using multi-object scenes from the COCO dataset, relating the token space to its input content for various objects in the scene. We use these tasks to characterise the network's token and layer-wise information modelling. Our analysis highlights that the CLS token, used for the downstream task, only focuses on a few objects necessary for the trained downstream task. Still, other individual objects are well-modelled separately by the tokens in the network originating from those objects. We further observed a widespread distribution of scene information. This demonstrates that information is far more entangled in tokens than optimal for representing objects similar to symbols. Given these symbolic properties, we show the network dynamics that cause failure modes of these models on basic downstream tasks in a multi-object scene.
♻ ☆ ROCOv2: Radiology Objects in COntext Version 2, an Updated Multimodal Image Dataset
Automated medical image analysis systems often require large amounts of training data with high quality labels, which are difficult and time consuming to generate. This paper introduces Radiology Object in COntext version 2 (ROCOv2), a multimodal dataset consisting of radiological images and associated medical concepts and captions extracted from the PMC Open Access subset. It is an updated version of the ROCO dataset published in 2018, and adds 35,705 new images added to PMC since 2018. It further provides manually curated concepts for imaging modalities with additional anatomical and directional concepts for X-rays. The dataset consists of 79,789 images and has been used, with minor modifications, in the concept detection and caption prediction tasks of ImageCLEFmedical Caption 2023. The dataset is suitable for training image annotation models based on image-caption pairs, or for multi-label image classification using Unified Medical Language System (UMLS) concepts provided with each image. In addition, it can serve for pre-training of medical domain models, and evaluation of deep learning models for multi-task learning.
comment: Accepted for Scientific Data
♻ ☆ What If We Recaption Billions of Web Images with LLaMA-3?
Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations in this area remain predominantly closed-source. Our paper aims to bridge this community effort, leveraging the powerful and \textit{open-sourced} LLaMA-3, a GPT-4 level LLM. Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B powered LLaVA-1.5 and then employ it to recaption 1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe enhanced zero-shot performance in cross-modal retrieval tasks. For generative models like text-to-image Diffusion Transformers, the generated images exhibit a significant improvement in alignment with users' text instructions, especially in following complex queries. Our project page is https://www.haqtu.me/Recap-Datacomp-1B/
comment: First five authors contributed equally
♻ ☆ Advances in 3D Neural Stylization: A Survey
Modern artificial intelligence offers a novel and transformative approach to creating digital art across diverse styles and modalities like images, videos and 3D data, unleashing the power of creativity and revolutionizing the way that we perceive and interact with visual content. This paper reports on recent advances in stylized 3D asset creation and manipulation with the expressive power of neural networks. We establish a taxonomy for neural stylization, considering crucial design choices such as scene representation, guidance data, optimization strategies, and output styles. Building on such taxonomy, our survey first revisits the background of neural stylization on 2D images, and then presents in-depth discussions on recent neural stylization methods for 3D data, accompanied by a mini-benchmark evaluating selected neural field stylization methods. Based on the insights gained from the survey, we highlight the practical significance, open challenges, future research, and potential impacts of neural stylization, which facilitates researchers and practitioners to navigate the rapidly evolving landscape of 3D content creation using modern artificial intelligence.
♻ ☆ SRGS: Super-Resolution 3D Gaussian Splatting
Recently, 3D Gaussian Splatting (3DGS) has gained popularity as a novel explicit 3D representation. This approach relies on the representation power of Gaussian primitives to provide a high-quality rendering. However, primitives optimized at low resolution inevitably exhibit sparsity and texture deficiency, posing a challenge for achieving high-resolution novel view synthesis (HRNVS). To address this problem, we propose Super-Resolution 3D Gaussian Splatting (SRGS) to perform the optimization in a high-resolution (HR) space. The sub-pixel constraint is introduced for the increased viewpoints in HR space, exploiting the sub-pixel cross-view information of the multiple low-resolution (LR) views. The gradient accumulated from more viewpoints will facilitate the densification of primitives. Furthermore, a pre-trained 2D super-resolution model is integrated with the sub-pixel constraint, enabling these dense primitives to learn faithful texture features. In general, our method focuses on densification and texture learning to effectively enhance the representation ability of primitives. Experimentally, our method achieves high rendering quality on HRNVS only with LR inputs, outperforming state-of-the-art methods on challenging datasets such as Mip-NeRF 360 and Tanks & Temples. Related codes will be released upon acceptance.
comment: The first to focus on the HRNVS of 3DGS
♻ ☆ Segmentation and Characterization of Macerated Fibers and Vessels Using Deep Learning
Wood comprises different cell types, such as fibers, tracheids and vessels, defining its properties. Studying cells' shape, size, and arrangement in microscopy images is crucial for understanding wood characteristics. Typically, this involves macerating (soaking) samples in a solution to separate cells, then spreading them on slides for imaging with a microscope that covers a wide area, capturing thousands of cells. However, these cells often cluster and overlap in images, making the segmentation difficult and time-consuming using standard image-processing methods. In this work, we developed an automatic deep learning segmentation approach that utilizes the one-stage YOLOv8 model for fast and accurate segmentation and characterization of macerated fiber and vessel form aspen trees in microscopy images. The model can analyze 32,640 x 25,920 pixels images and demonstrate effective cell detection and segmentation, achieving a mAP_{0.5-0.95} of 78 %. To assess the model's robustness, we examined fibers from a genetically modified tree line known for longer fibers. The outcomes were comparable to previous manual measurements. Additionally, we created a user-friendly web application for image analysis and provided the code for use on Google Colab. By leveraging YOLOv8's advances, this work provides a deep learning solution to enable efficient quantification and analysis of wood cells suitable for practical applications.
comment: 7 figures
♻ ☆ Enhancing Zero-Shot Facial Expression Recognition by LLM Knowledge Transfer
Current facial expression recognition (FER) models are often designed in a supervised learning manner and thus are constrained by the lack of large-scale facial expression images with high-quality annotations. Consequently, these models often fail to generalize well, performing poorly on unseen images in inference. Vision-language-based zero-shot models demonstrate a promising potential for addressing such challenges. However, these models lack task-specific knowledge and therefore are not optimized for the nuances of recognizing facial expressions. To bridge this gap, this work proposes a novel method, Exp-CLIP, to enhance zero-shot FER by transferring the task knowledge from large language models (LLMs). Specifically, based on the pre-trained vision-language encoders, we incorporate a projection head designed to map the initial joint vision-language space into a space that captures representations of facial actions. To train this projection head for subsequent zero-shot predictions, we propose to align the projected visual representations with task-specific semantic meanings derived from the LLM encoder, and the text instruction-based strategy is employed to customize the LLM knowledge. Given unlabelled facial data and efficient training of the projection head, Exp-CLIP achieves superior zero-shot results to the CLIP models and several other large vision-language models (LVLMs) on seven in-the-wild FER datasets.
comment: The code and pre-trained models are available at https://github.com/zengqunzhao/Exp-CLIP
♻ ☆ Semi-Supervised Coupled Thin-Plate Spline Model for Rotation Correction and Beyond
Thin-plate spline (TPS) is a principal warp that allows for representing elastic, nonlinear transformation with control point motions. With the increase of control points, the warp becomes increasingly flexible but usually encounters a bottleneck caused by undesired issues, e.g., content distortion. In this paper, we explore generic applications of TPS in single-image-based warping tasks, such as rotation correction, rectangling, and portrait correction. To break this bottleneck, we propose the coupled thin-plate spline model (CoupledTPS), which iteratively couples multiple TPS with limited control points into a more flexible and powerful transformation. Concretely, we first design an iterative search to predict new control points according to the current latent condition. Then, we present the warping flow as a bridge for the coupling of different TPS transformations, effectively eliminating interpolation errors caused by multiple warps. Besides, in light of the laborious annotation cost, we develop a semi-supervised learning scheme to improve warping quality by exploiting unlabeled data. It is formulated through dual transformation between the searched control points of unlabeled data and its graphic augmentation, yielding an implicit correction consistency constraint. Finally, we collect massive unlabeled data to exhibit the benefit of our semi-supervised scheme in rotation correction. Extensive experiments demonstrate the superiority and universality of CoupledTPS over the existing state-of-the-art (SoTA) solutions for rotation correction and beyond. The code and data are available at https://github.com/nie-lang/CoupledTPS.
comment: Accepted to TPAMI2024
♻ ☆ Uncertainty-Aware AB3DMOT by Variational 3D Object Detection ICIP 2024
Autonomous driving needs to rely on high-quality 3D object detection to ensure safe navigation in the world. Uncertainty estimation is an effective tool to provide statistically accurate predictions, while the associated detection uncertainty can be used to implement a more safe navigation protocol or include the user in the loop. In this paper, we propose a Variational Neural Network-based TANet 3D object detector to generate 3D object detections with uncertainty and introduce these detections to an uncertainty-aware AB3DMOT tracker. This is done by applying a linear transformation to the estimated uncertainty matrix, which is subsequently used as a measurement noise for the adopted Kalman filter. We implement two ways to estimate output uncertainty, i.e., internally, by computing the variance of the CNN outputs and then propagating the uncertainty through the post-processing, and externally, by associating the final predictions of different samples and computing the covariance of each predicted box. In experiments, we show that the external uncertainty estimation leads to better results, outperforming both internal uncertainty estimation and classical tracking approaches. Furthermore, we propose a method to initialize the Variational 3D object detector with a pretrained TANet model, which leads to the best performing models.
comment: 5 pages, 1 figure, accepted to ICIP 2024
♻ ☆ High-Performance Inference Graph Convolutional Networks for Skeleton-Based Action Recognition
Recently, the significant achievements have been made in skeleton-based human action recognition with the emergence of graph convolutional networks (GCNs). However, the state-of-the-art (SOTA) models used for this task focus on constructing more complex higher-order connections between joint nodes to describe skeleton information, which leads to complex inference processes and high computational costs. To address the slow inference speed caused by overly complex model structures, we introduce re-parameterization and over-parameterization techniques to GCNs and propose two novel high-performance inference GCNs, namely HPI-GCN-RP and HPI-GCN-OP. After the completion of model training, model parameters are fixed. HPI-GCN-RP adopts re-parameterization technique to transform high-performance training model into fast inference model through linear transformations, which achieves a higher inference speed with competitive model performance. HPI-GCN-OP further utilizes over-parameterization technique to achieve higher performance improvement by introducing additional inference parameters, albeit with slightly decreased inference speed. The experimental results on the two skeleton-based action recognition datasets demonstrate the effectiveness of our approach. Our HPI-GCN-OP achieves performance comparable to the current SOTA models, with inference speeds five times faster. Specifically, our HPI-GCN-OP achieves an accuracy of 93\% on the cross-subject split of the NTU-RGB+D 60 dataset, and 90.1\% on the cross-subject benchmark of the NTU-RGB+D 120 dataset. Code is available at github.com/lizaowo/HPI-GCN.
♻ ☆ Data Set Terminology of Deep Learning in Medicine: A Historical Review and Recommendation
Medicine and deep learning-based artificial intelligence (AI) engineering represent two distinct fields each with decades of published history. With such history comes a set of terminology that has a specific way in which it is applied. However, when two distinct fields with overlapping terminology start to collaborate, miscommunication and misunderstandings can occur. This narrative review aims to give historical context for these terms, accentuate the importance of clarity when these terms are used in medical AI contexts, and offer solutions to mitigate misunderstandings by readers from either field. Through an examination of historical documents, including articles, writing guidelines, and textbooks, this review traces the divergent evolution of terms for data sets and their impact. Initially, the discordant interpretations of the word 'validation' in medical and AI contexts are explored. Then the data sets used for AI evaluation are classified, namely random splitting, cross-validation, temporal, geographic, internal, and external sets. The accurate and standardized description of these data sets is crucial for demonstrating the robustness and generalizability of AI applications in medicine. This review clarifies existing literature to provide a comprehensive understanding of these classifications and their implications in AI evaluation. This review then identifies often misunderstood terms and proposes pragmatic solutions to mitigate terminological confusion. Among these solutions are the use of standardized terminology such as 'training set,' 'validation (or tuning) set,' and 'test set,' and explicit definition of data set splitting terminologies in each medical AI research publication. This review aspires to enhance the precision of communication in medical AI, thereby fostering more effective and transparent research methodologies in this interdisciplinary field.
comment: 20 pages, 3 figures, 3 tables
♻ ☆ YOLO9tr: A Lightweight Model for Pavement Damage Detection Utilizing a Generalized Efficient Layer Aggregation Network and Attention Mechanism
Maintaining road pavement integrity is crucial for ensuring safe and efficient transportation. Conventional methods for assessing pavement condition are often laborious and susceptible to human error. This paper proposes YOLO9tr, a novel lightweight object detection model for pavement damage detection, leveraging the advancements of deep learning. YOLO9tr is based on the YOLOv9 architecture, incorporating a partial attention block that enhances feature extraction and attention mechanisms, leading to improved detection performance in complex scenarios. The model is trained on a comprehensive dataset comprising road damage images from multiple countries, including an expanded set of damage categories beyond the standard four. This broadened classification range allows for a more accurate and realistic assessment of pavement conditions. Comparative analysis demonstrates YOLO9tr's superior precision and inference speed compared to state-of-the-art models like YOLO8, YOLO9 and YOLO10, achieving a balance between computational efficiency and detection accuracy. The model achieves a high frame rate of up to 136 FPS, making it suitable for real-time applications such as video surveillance and automated inspection systems. The research presents an ablation study to analyze the impact of architectural modifications and hyperparameter variations on model performance, further validating the effectiveness of the partial attention block. The results highlight YOLO9tr's potential for practical deployment in real-time pavement condition monitoring, contributing to the development of robust and efficient solutions for maintaining safe and functional road infrastructure.
♻ ☆ Domain adaptation strategies for 3D reconstruction of the lumbar spine using real fluoroscopy data
This study tackles key obstacles in adopting surgical navigation in orthopedic surgeries, including time, cost, radiation, and workflow integration challenges. Recently, our work X23D showed an approach for generating 3D anatomical models of the spine from only a few intraoperative fluoroscopic images. This negates the need for conventional registration-based surgical navigation by creating a direct intraoperative 3D reconstruction of the anatomy. Despite these strides, the practical application of X23D has been limited by a domain gap between synthetic training data and real intraoperative images. In response, we devised a novel data collection protocol for a paired dataset consisting of synthetic and real fluoroscopic images from the same perspectives. Utilizing this dataset, we refined our deep learning model via transfer learning, effectively bridging the domain gap between synthetic and real X-ray data. A novel style transfer mechanism also allows us to convert real X-rays to mirror the synthetic domain, enabling our in-silico-trained X23D model to achieve high accuracy in real-world settings. Our results demonstrated that the refined model can rapidly generate accurate 3D reconstructions of the entire lumbar spine from as few as three intraoperative fluoroscopic shots. It achieved an 84% F1 score, matching the accuracy of our previous synthetic data-based research. Additionally, with a computational time of only 81.1 ms, our approach provides real-time capabilities essential for surgery integration. Through examining ideal imaging setups and view angle dependencies, we've further confirmed our system's practicality and dependability in clinical settings. Our research marks a significant step forward in intraoperative 3D reconstruction, offering enhancements to surgical planning, navigation, and robotics.
♻ ☆ A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing
This paper presents a video inversion approach for zero-shot video editing, which models the input video with low-rank representation during the inversion process. The existing video editing methods usually apply the typical 2D DDIM inversion or naive spatial-temporal DDIM inversion before editing, which leverages time-varying representation for each frame to derive noisy latent. Unlike most existing approaches, we propose a Spatial-Temporal Expectation-Maximization (STEM) inversion, which formulates the dense video feature under an expectation-maximization manner and iteratively estimates a more compact basis set to represent the whole video. Each frame applies the fixed and global representation for inversion, which is more friendly for temporal consistency during reconstruction and editing. Extensive qualitative and quantitative experiments demonstrate that our STEM inversion can achieve consistent improvement on two state-of-the-art video editing methods. Project page: https://stem-inv.github.io/page/.
comment: 14 pages, Project page: https://stem-inv.github.io/page/
♻ ☆ SparseRadNet: Sparse Perception Neural Network on Subsampled Radar Data
Radar-based perception has gained increasing attention in autonomous driving, yet the inherent sparsity of radars poses challenges. Radar raw data often contains excessive noise, whereas radar point clouds retain only limited information. In this work, we holistically treat the sparse nature of radar data by introducing an adaptive subsampling method together with a tailored network architecture that exploits the sparsity patterns to discover global and local dependencies in the radar signal. Our subsampling module selects a subset of pixels from range-doppler (RD) spectra that contribute most to the downstream perception tasks. To improve the feature extraction on sparse subsampled data, we propose a new way of applying graph neural networks on radar data and design a novel two-branch backbone to capture both global and local neighbor information. An attentive fusion module is applied to combine features from both branches. Experiments on the RADIal dataset show that our SparseRadNet exceeds state-of-the-art (SOTA) performance in object detection and achieves close to SOTA accuracy in freespace segmentation, meanwhile using sparse subsampled input data.
comment: 18 pages, 4 figures, 5 tables
♻ ☆ Similarity Metrics for MR Image-To-Image Translation
Image-to-image translation can create large impact in medical imaging, for instance the possibility to synthetically transform images to other modalities, sequence types, higher resolutions or lower noise levels. In order to assure a high level of patient safety, these methods are mostly validated by human reader studies, which require a considerable amount of time and costs. Quantitative metrics have been used to complement such studies and to provide reproducible and objective assessment of synthetic images. Even though the SSIM and PSNR metrics are extensively used, they do not detect all types of errors in synthetic images as desired. Other metrics could provide additional useful evaluation. In this study, we give an overview and a quantitative analysis of 15 metrics for assessing the quality of synthetically generated images. We include 11 full-reference metrics (SSIM, MS-SSIM, CW-SSIM, PSNR, MSE, NMSE, MAE, LPIPS, DISTS, NMI and PCC), three non-reference metrics (BLUR, MLC, MSLC) and one downstream task segmentation metric (DICE) to detect 11 kinds of typical distortions and artifacts that occur in MR images. In addition, we analyze the influence of four prominent normalization methods (Minmax, cMinmax, Zscore and Quantile) on the different metrics and distortions. Finally, we provide adverse examples to highlight pitfalls in metric assessment and derive recommendations for effective usage of the analyzed similarity metrics for evaluation of image-to-image translation models.
comment: 36 pages, 8 figures, appendix with 7 figures
♻ ☆ CM2-Net: Continual Cross-Modal Mapping Network for Driver Action Recognition
Driver action recognition has significantly advanced in enhancing driver-vehicle interactions and ensuring driving safety by integrating multiple modalities, such as infrared and depth. Nevertheless, compared to RGB modality only, it is always laborious and costly to collect extensive data for all types of non-RGB modalities in car cabin environments. Therefore, previous works have suggested independently learning each non-RGB modality by fine-tuning a model pre-trained on RGB videos, but these methods are less effective in extracting informative features when faced with newly-incoming modalities due to large domain gaps. In contrast, we propose a Continual Cross-Modal Mapping Network (CM2-Net) to continually learn each newly-incoming modality with instructive prompts from the previously-learned modalities. Specifically, we have developed Accumulative Cross-modal Mapping Prompting (ACMP), to map the discriminative and informative features learned from previous modalities into the feature space of newly-incoming modalities. Then, when faced with newly-incoming modalities, these mapped features are able to provide effective prompts for which features should be extracted and prioritized. These prompts are accumulating throughout the continual learning process, thereby boosting further recognition performances. Extensive experiments conducted on the Drive&Act dataset demonstrate the performance superiority of CM2-Net on both uni- and multi-modal driver action recognition.
♻ ☆ First Multi-Dimensional Evaluation of Flowchart Comprehension for Multimodal Large Language Models
With the development of Multimodal Large Language Models (MLLMs) technology, its general capabilities are increasingly powerful. To evaluate the various abilities of MLLMs, numerous evaluation systems have emerged. But now there is still a lack of a comprehensive method to evaluate MLLMs in the tasks related to flowcharts, which are very important in daily life and work. We propose the first comprehensive method, FlowCE, to assess MLLMs across various dimensions for tasks related to flowcharts. It encompasses evaluating MLLMs' abilities in Reasoning, Localization Recognition, Information Extraction, Logical Verification, and Summarization on flowcharts. However, we find that even the GPT4o model achieves only a score of 56.63. Among open-source models, Phi-3-Vision obtained the highest score of 49.97. We hope that FlowCE can contribute to future research on MLLMs for tasks based on flowcharts. \url{https://github.com/360AILAB-NLP/FlowCE} \end{abstract}
♻ ☆ A Comprehensive Taxonomy and Analysis of Talking Head Synthesis: Techniques for Portrait Generation, Driving Mechanisms, and Editing
Talking head synthesis, an advanced method for generating portrait videos from a still image driven by specific content, has garnered widespread attention in virtual reality, augmented reality and game production. Recently, significant breakthroughs have been made with the introduction of novel models such as the transformer and the diffusion model. Current methods can not only generate new content but also edit the generated material. This survey systematically reviews the technology, categorizing it into three pivotal domains: portrait generation, driven mechanisms, and editing techniques. We summarize milestone studies and critically analyze their innovations and shortcomings within each domain. Additionally, we organize an extensive collection of datasets and provide a thorough performance analysis of current methodologies based on various evaluation metrics, aiming to furnish a clear framework and robust data support for future research. Finally, we explore application scenarios of talking head synthesis, illustrate them with specific cases, and examine potential future directions.
♻ ☆ W-Net: A Facial Feature-Guided Face Super-Resolution Network
Face Super-Resolution (FSR) aims to recover high-resolution (HR) face images from low-resolution (LR) ones. Despite the progress made by convolutional neural networks in FSR, the results of existing approaches are not ideal due to their low reconstruction efficiency and insufficient utilization of prior information. Considering that faces are highly structured objects, effectively leveraging facial priors to improve FSR results is a worthwhile endeavor. This paper proposes a novel network architecture called W-Net to address this challenge. W-Net leverages meticulously designed Parsing Block to fully exploit the resolution potential of LR image. We use this parsing map as an attention prior, effectively integrating information from both the parsing map and LR images. Simultaneously, we perform multiple fusions in various dimensions through the W-shaped network structure combined with the LPF(LR-Parsing Map Fusion Module). Additionally, we utilize a facial parsing graph as a mask, assigning different weights and loss functions to key facial areas to balance the performance of our reconstructed facial images between perceptual quality and pixel accuracy. We conducted extensive comparative experiments, not only limited to conventional facial super-resolution metrics but also extending to downstream tasks such as facial recognition and facial keypoint detection. The experiments demonstrate that W-Net exhibits outstanding performance in quantitative metrics, visual quality, and downstream tasks.
comment: 15 pages,9 figures
♻ ☆ Language-Driven Active Learning for Diverse Open-Set 3D Object Detection
Object detection is crucial for ensuring safe autonomous driving. However, data-driven approaches face challenges when encountering minority or novel objects in the 3D driving scene. In this paper, we propose VisLED, a language-driven active learning framework for diverse open-set 3D Object Detection. Our method leverages active learning techniques to query diverse and informative data samples from an unlabeled pool, enhancing the model's ability to detect underrepresented or novel objects. Specifically, we introduce the Vision-Language Embedding Diversity Querying (VisLED-Querying) algorithm, which operates in both open-world exploring and closed-world mining settings. In open-world exploring, VisLED-Querying selects data points most novel relative to existing data, while in closed-world mining, it mines novel instances of known classes. We evaluate our approach on the nuScenes dataset and demonstrate its efficiency compared to random sampling and entropy-querying methods. Our results show that VisLED-Querying consistently outperforms random sampling and offers competitive performance compared to entropy-querying despite the latter's model-optimality, highlighting the potential of VisLED for improving object detection in autonomous driving scenarios. We make our code publicly available at https://github.com/Bjork-crypto/VisLED-Querying
♻ ☆ GaitGS: Temporal Feature Learning in Granularity and Span Dimension for Gait Recognition ICIP2024
Gait recognition, a growing field in biological recognition technology, utilizes distinct walking patterns for accurate individual identification. However, existing methods lack the incorporation of temporal information. To reach the full potential of gait recognition, we advocate for the consideration of temporal features at varying granularities and spans. This paper introduces a novel framework, GaitGS, which aggregates temporal features simultaneously in both granularity and span dimensions. Specifically, the Multi-Granularity Feature Extractor (MGFE) is designed to capture micro-motion and macro-motion information at fine and coarse levels respectively, while the Multi-Span Feature Extractor (MSFE) generates local and global temporal representations. Through extensive experiments on two datasets, our method demonstrates state-of-the-art performance, achieving Rank-1 accuracy of 98.2%, 96.5%, and 89.7% on CASIA-B under different conditions, and 97.6% on OU-MVLP. The source code will be available at https://github.com/Haijun-Xiong/GaitGS.
comment: Accepted by ICIP2024
♻ ☆ Benchmarking Zero-Shot Recognition with Vision-Language Models: Challenges on Granularity and Specificity CVPR2024
This paper presents novel benchmarks for evaluating vision-language models (VLMs) in zero-shot recognition, focusing on granularity and specificity. Although VLMs excel in tasks like image captioning, they face challenges in open-world settings. Our benchmarks test VLMs' consistency in understanding concepts across semantic granularity levels and their response to varying text specificity. Findings show that VLMs favor moderately fine-grained concepts and struggle with specificity, often misjudging texts that differ from their training data. Extensive evaluations reveal limitations in current VLMs, particularly in distinguishing between correct and subtly incorrect descriptions. While fine-tuning offers some improvements, it doesn't fully address these issues, highlighting the need for VLMs with enhanced generalization capabilities for real-world applications. This study provides insights into VLM limitations and suggests directions for developing more robust models.
comment: CVPR2024 MMFM workshop
♻ ☆ Evaluating Text-to-Visual Generation with Image-to-Text Generation
Despite significant progress in generative AI, comprehensive evaluation remains challenging because of the lack of effective metrics and standardized benchmarks. For instance, the widely-used CLIPScore measures the alignment between a (generated) image and text prompt, but it fails to produce reliable scores for complex prompts involving compositions of objects, attributes, and relations. One reason is that text encoders of CLIP can notoriously act as a "bag of words", conflating prompts such as "the horse is eating the grass" with "the grass is eating the horse". To address this, we introduce the VQAScore, which uses a visual-question-answering (VQA) model to produce an alignment score by computing the probability of a "Yes" answer to a simple "Does this figure show '{text}'?" question. Though simpler than prior art, VQAScore computed with off-the-shelf models produces state-of-the-art results across many (8) image-text alignment benchmarks. We also compute VQAScore with an in-house model that follows best practices in the literature. For example, we use a bidirectional image-question encoder that allows image embeddings to depend on the question being asked (and vice versa). Our in-house model, CLIP-FlanT5, outperforms even the strongest baselines that make use of the proprietary GPT-4V. Interestingly, although we train with only images, VQAScore can also align text with video and 3D models. VQAScore allows researchers to benchmark text-to-visual generation using complex texts that capture the compositional structure of real-world prompts. We introduce GenAI-Bench, a more challenging benchmark with 1,600 compositional text prompts that require parsing scenes, objects, attributes, relationships, and high-order reasoning like comparison and logic. GenAI-Bench also offers over 15,000 human ratings for leading image and video generation models such as Stable Diffusion, DALL-E 3, and Gen2.
comment: We open-source our data, model, and code at: https://github.com/linzhiqiu/t2v_metrics ; Project page: https://linzhiqiu.github.io/papers/vqascore
♻ ☆ A Comprehensive Survey on Underwater Image Enhancement Based on Deep Learning
Underwater image enhancement (UIE) is a challenging research task in the field of computer vision. Although hundreds of UIE algorithms have been proposed, a comprehensive and systematic review is still lacking. To promote future research, we summarize the UIE task from multiple perspectives. First, the physical models, data construction processes, evaluation metrics, and loss functions are introduced. Second, according to the contributions brought by different literatures, recent proposed algorithms are discussed and classified from six perspectives, namely network architecture, learning strategy, learning stage, assistance task, domain perspective and disentanglement fusion, respectively. Third, considering the inconsistencies in experimental settings in different literatures, a comprehensive and fair comparison does not yet exist. To this end, we quantitatively and qualitatively evaluate state-of-the-art algorithms on multiple benchmark datasets. Finally, issues worthy of further research in the UIE task are raised. A collection of useful materials is available at https://github.com/YuZhao1999/UIE.
comment: A survey on the underwater image enhancement task
♻ ☆ LMPT: Prompt Tuning with Class-Specific Embedding Loss for Long-tailed Multi-Label Visual Recognition ACL 2024
Long-tailed multi-label visual recognition (LTML) task is a highly challenging task due to the label co-occurrence and imbalanced data distribution. In this work, we propose a unified framework for LTML, namely prompt tuning with class-specific embedding loss (LMPT), capturing the semantic feature interactions between categories by combining text and image modality data and improving the performance synchronously on both head and tail classes. Specifically, LMPT introduces the embedding loss function with class-aware soft margin and re-weighting to learn class-specific contexts with the benefit of textual descriptions (captions), which could help establish semantic relationships between classes, especially between the head and tail classes. Furthermore, taking into account the class imbalance, the distribution-balanced loss is adopted as the classification loss function to further improve the performance on the tail classes without compromising head classes. Extensive experiments are conducted on VOC-LT and COCO-LT datasets, which demonstrates that our method significantly surpasses the previous state-of-the-art methods and zero-shot CLIP in LTML. Our codes are fully public at https://github.com/richard-peng-xia/LMPT.
comment: Accepted by 3rd Workshop on Advances in Language and Vision Research (ALVR) @ ACL 2024
♻ ☆ Effective Rank Analysis and Regularization for Enhanced 3D Gaussian Splatting
3D reconstruction from multi-view images is one of the fundamental challenges in computer vision and graphics. Recently, 3D Gaussian Splatting (3DGS) has emerged as a promising technique capable of real-time rendering with high-quality 3D reconstruction. This method utilizes 3D Gaussian representation and tile-based splatting techniques, bypassing the expensive neural field querying. Despite its potential, 3DGS encounters challenges, including needle-like artifacts, suboptimal geometries, and inaccurate normals, due to the Gaussians converging into anisotropic Gaussians with one dominant variance. We propose using effective rank analysis to examine the shape statistics of 3D Gaussian primitives, and identify the Gaussians indeed converge into needle-like shapes with the effective rank 1. To address this, we introduce effective rank as a regularization, which constrains the structure of the Gaussians. Our new regularization method enhances normal and geometry reconstruction while reducing needle-like artifacts. The approach can be integrated as an add-on module to other 3DGS variants, improving their quality without compromising visual fidelity.
comment: project page: https://junhahyung.github.io/erankgs.github.io
♻ ☆ High-Resolution Document Shadow Removal via A Large-Scale Real-World Dataset and A Frequency-Aware Shadow Erasing Net ICCV 2023
Shadows often occur when we capture the documents with casual equipment, which influences the visual quality and readability of the digital copies. Different from the algorithms for natural shadow removal, the algorithms in document shadow removal need to preserve the details of fonts and figures in high-resolution input. Previous works ignore this problem and remove the shadows via approximate attention and small datasets, which might not work in real-world situations. We handle high-resolution document shadow removal directly via a larger-scale real-world dataset and a carefully designed frequency-aware network. As for the dataset, we acquire over 7k couples of high-resolution (2462 x 3699) images of real-world document pairs with various samples under different lighting circumstances, which is 10 times larger than existing datasets. As for the design of the network, we decouple the high-resolution images in the frequency domain, where the low-frequency details and high-frequency boundaries can be effectively learned via the carefully designed network structure. Powered by our network and dataset, the proposed method clearly shows a better performance than previous methods in terms of visual quality and numerical results. The code, models, and dataset are available at: https://github.com/CXH-Research/DocShadow-SD7K
comment: Accepted by International Conference on Computer Vision 2023 (ICCV 2023)
♻ ☆ LSKNet: A Foundation Lightweight Backbone for Remote Sensing
Remote sensing images pose distinct challenges for downstream tasks due to their inherent complexity. While a considerable amount of research has been dedicated to remote sensing classification, object detection and semantic segmentation, most of these studies have overlooked the valuable prior knowledge embedded within remote sensing scenarios. Such prior knowledge can be useful because remote sensing objects may be mistakenly recognized without referencing a sufficiently long-range context, which can vary for different objects. This paper considers these priors and proposes a lightweight Large Selective Kernel Network (LSKNet) backbone. LSKNet can dynamically adjust its large spatial receptive field to better model the ranging context of various objects in remote sensing scenarios. To our knowledge, large and selective kernel mechanisms have not been previously explored in remote sensing images. Without bells and whistles, our lightweight LSKNet sets new state-of-the-art scores on standard remote sensing classification, object detection and semantic segmentation benchmarks. Our comprehensive analysis further validated the significance of the identified priors and the effectiveness of LSKNet. The code is available at https://github.com/zcablii/LSKNet.
comment: arXiv admin note: substantial text overlap with arXiv:2303.09030
♻ ☆ Event3DGS: Event-Based 3D Gaussian Splatting for High-Speed Robot Egomotion
By combining differentiable rendering with explicit point-based scene representations, 3D Gaussian Splatting (3DGS) has demonstrated breakthrough 3D reconstruction capabilities. However, to date 3DGS has had limited impact on robotics, where high-speed egomotion is pervasive: Egomotion introduces motion blur and leads to artifacts in existing frame-based 3DGS reconstruction methods. To address this challenge, we introduce Event3DGS, an {\em event-based} 3DGS framework. By exploiting the exceptional temporal resolution of event cameras, Event3GDS can reconstruct high-fidelity 3D structure and appearance under high-speed egomotion. Extensive experiments on multiple synthetic and real-world datasets demonstrate the superiority of Event3DGS compared with existing event-based dense 3D scene reconstruction frameworks; Event3DGS substantially improves reconstruction quality (+3dB) while reducing computational costs by 95\%. Our framework also allows one to incorporate a few motion-blurred frame-based measurements into the reconstruction process to further improve appearance fidelity without loss of structural accuracy.
♻ ☆ HistGen: Histopathology Report Generation via Local-Global Feature Encoding and Cross-modal Context Interaction MICCAI2024
Histopathology serves as the gold standard in cancer diagnosis, with clinical reports being vital in interpreting and understanding this process, guiding cancer treatment and patient care. The automation of histopathology report generation with deep learning stands to significantly enhance clinical efficiency and lessen the labor-intensive, time-consuming burden on pathologists in report writing. In pursuit of this advancement, we introduce HistGen, a multiple instance learning-empowered framework for histopathology report generation together with the first benchmark dataset for evaluation. Inspired by diagnostic and report-writing workflows, HistGen features two delicately designed modules, aiming to boost report generation by aligning whole slide images (WSIs) and diagnostic reports from local and global granularity. To achieve this, a local-global hierarchical encoder is developed for efficient visual feature aggregation from a region-to-slide perspective. Meanwhile, a cross-modal context module is proposed to explicitly facilitate alignment and interaction between distinct modalities, effectively bridging the gap between the extensive visual sequences of WSIs and corresponding highly summarized reports. Experimental results on WSI report generation show the proposed model outperforms state-of-the-art (SOTA) models by a large margin. Moreover, the results of fine-tuning our model on cancer subtyping and survival analysis tasks further demonstrate superior performance compared to SOTA methods, showcasing strong transfer learning capability. Dataset, model weights, and source code are available in https://github.com/dddavid4real/HistGen.
comment: Accepted by MICCAI2024
♻ ☆ Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models
The rise of Multimodal Large Language Models (MLLMs), renowned for their advanced instruction-following and reasoning capabilities, has significantly propelled the field of visual reasoning. However, due to limitations in their image tokenization processes, most MLLMs struggle to capture fine details of text and objects in images, especially in high-resolution samples. To overcome this limitation, we introduce P2G, a novel framework for plug-and-play grounding in MLLMs. P2G utilizes the tool-usage potential of MLLMs to employ expert agents for on-the-fly grounding of reasoning into critical visual and textual elements in images, thereby enabling deliberate reasoning through multimodal prompting. Additionally, we develop P2GB, a benchmark designed to evaluate MLLMs' proficiency in understanding inter-object relationships and textual content in challenging high-resolution images. Extensive experiments on visual reasoning tasks demonstrate the superiority of P2G, achieving performance comparable to GPT-4V on P2GB with a 7B backbone. Our work underscores the potential of grounding reasoning with external agents in MLLMs, presenting a promising alternative to mere model scaling.
comment: 15 pages, 8 figures
♻ ☆ BaFTA: Backprop-Free Test-Time Adaptation For Zero-Shot Vision-Language Models ICLR 2024
Large-scale pretrained vision-language models like CLIP have demonstrated remarkable zero-shot image classification capabilities across diverse domains. To enhance CLIP's performance while preserving the zero-shot paradigm, various test-time prompt tuning methods have been introduced to refine class embeddings through unsupervised learning objectives during inference. However, these methods often encounter challenges in selecting appropriate learning rates to prevent collapsed training in the absence of validation data during test-time adaptation. In this study, we propose a novel backpropagation-free algorithm BaFTA for test-time adaptation of vision-language models. Instead of fine-tuning text prompts to refine class embeddings, our approach directly estimates class centroids using online clustering within a projected embedding space that aligns text and visual embeddings. We dynamically aggregate predictions from both estimated and original class embeddings, as well as from distinct augmented views, by assessing the reliability of each prediction using R\'enyi Entropy. Through extensive experiments, we demonstrate that BaFTA consistently outperforms state-of-the-art test-time adaptation methods in both effectiveness and efficiency.
comment: Preprint updated from our earlier manuscript submitted to ICLR 2024 (https://openreview.net/forum?id=KNtcoAM5Gy)
♻ ☆ MegActor: Harness the Power of Raw Video for Vivid Portrait Animation
Despite raw driving videos contain richer information on facial expressions than intermediate representations such as landmarks in the field of portrait animation, they are seldom the subject of research. This is due to two challenges inherent in portrait animation driven with raw videos: 1) significant identity leakage; 2) Irrelevant background and facial details such as wrinkles degrade performance. To harnesses the power of the raw videos for vivid portrait animation, we proposed a pioneering conditional diffusion model named as MegActor. First, we introduced a synthetic data generation framework for creating videos with consistent motion and expressions but inconsistent IDs to mitigate the issue of ID leakage. Second, we segmented the foreground and background of the reference image and employed CLIP to encode the background details. This encoded information is then integrated into the network via a text embedding module, thereby ensuring the stability of the background. Finally, we further style transfer the appearance of the reference image to the driving video to eliminate the influence of facial details in the driving videos. Our final model was trained solely on public datasets, achieving results comparable to commercial models. We hope this will help the open-source community.The code is available at https://github.com/megvii-research/MegFaceAnimate.
♻ ☆ Multi-Modal and Multi-Agent Systems Meet Rationality: A Survey
Rationality is the quality of being guided by reason, characterized by logical thinking and decision-making that align with evidence and logical rules. This quality is essential for effective problem-solving, as it ensures that solutions are well-founded and systematically derived. Despite the advancements of large language models (LLMs) in generating human-like text with remarkable accuracy, they present biases inherited from the training data, inconsistency across different contexts, and difficulty understanding complex scenarios involving multiple layers of context. Therefore, recent research attempts to leverage the strength of multiple agents working collaboratively with various types of data and tools for enhanced consistency and reliability. To that end, this paper aims to understand whether multi-modal and multi-agent systems are advancing toward rationality by surveying the state-of-the-art works, identifying advancements over single-agent and single-modal systems in terms of rationality, and discussing open problems and future directions. We maintain an open repository at https://github.com/bowen-upenn/MMMA_Rationality.
♻ ☆ BEV-TSR: Text-Scene Retrieval in BEV Space for Autonomous Driving
The rapid development of the autonomous driving industry has led to a significant accumulation of autonomous driving data. Consequently, there comes a growing demand for retrieving data to provide specialized optimization. However, directly applying previous image retrieval methods faces several challenges, such as the lack of global feature representation and inadequate text retrieval ability for complex driving scenes. To address these issues, firstly, we propose the BEV-TSR framework which leverages descriptive text as an input to retrieve corresponding scenes in the Bird's Eye View (BEV) space. Then to facilitate complex scene retrieval with extensive text descriptions, we employ a large language model (LLM) to extract the semantic features of the text inputs and incorporate knowledge graph embeddings to enhance the semantic richness of the language embedding. To achieve feature alignment between the BEV feature and language embedding, we propose Shared Cross-modal Embedding with a set of shared learnable embeddings to bridge the gap between these two modalities, and employ a caption generation task to further enhance the alignment. Furthermore, there lack of well-formed retrieval datasets for effective evaluation. To this end, we establish a multi-level retrieval dataset, nuScenes-Retrieval, based on the widely adopted nuScenes dataset. Experimental results on the multi-level nuScenes-Retrieval show that BEV-TSR achieves state-of-the-art performance, e.g., 85.78% and 87.66% top-1 accuracy on scene-to-text and text-to-scene retrieval respectively. Codes and datasets will be available.
♻ ☆ TriSAM: Tri-Plane SAM for zero-shot cortical blood vessel segmentation in VEM images
While imaging techniques at macro and mesoscales have garnered substantial attention and resources, microscale Volume Electron Microscopy (vEM) imaging, capable of revealing intricate vascular details, has lacked the necessary benchmarking infrastructure. In this paper, we address a significant gap in this field of neuroimaging by introducing the first-in-class public benchmark, BvEM, designed specifically for cortical blood vessel segmentation in vEM images. Our BvEM benchmark is based on vEM image volumes from three mammals: adult mouse, macaque, and human. We standardized the resolution, addressed imaging variations, and meticulously annotated blood vessels through semi-automatic, manual, and quality control processes, ensuring high-quality 3D segmentation. Furthermore, we developed a zero-shot cortical blood vessel segmentation method named TriSAM, which leverages the powerful segmentation model SAM for 3D segmentation. To extend SAM from 2D to 3D volume segmentation, TriSAM employs a multi-seed tracking framework, leveraging the reliability of certain image planes for tracking while using others to identify potential turning points. This approach effectively achieves long-term 3D blood vessel segmentation without model training or fine-tuning. Experimental results show that TriSAM achieved superior performances on the BvEM benchmark across three species. Our dataset, code, and model are available online at \url{https://jia-wan.github.io/bvem}.
comment: BvEM-Mouse can be visualized at: https://tinyurl.com/yc2s38x9
♻ ☆ Self-supervised Graph Neural Network for Mechanical CAD Retrieval
CAD (Computer-Aided Design) plays a crucial role in mechanical industry, where large numbers of similar-shaped CAD parts are often created. Efficiently reusing these parts is key to reducing design and production costs for enterprises. Retrieval systems are vital for achieving CAD reuse, but the complex shapes of CAD models are difficult to accurately describe using text or keywords, making traditional retrieval methods ineffective. While existing representation learning approaches have been developed for CAD, manually labeling similar samples in these methods is expensive. Additionally, CAD models' unique parameterized data structure presents challenges for applying existing 3D shape representation learning techniques directly. In this work, we propose GC-CAD, a self-supervised contrastive graph neural network-based method for mechanical CAD retrieval that directly models parameterized CAD raw files. GC-CAD consists of two key modules: structure-aware representation learning and contrastive graph learning framework. The method leverages graph neural networks to extract both geometric and topological information from CAD models, generating feature representations. We then introduce a simple yet effective contrastive graph learning framework approach, enabling the model to train without manual labels and generate retrieval-ready representations. Experimental results on four datasets including human evaluation demonstrate that the proposed method achieves significant accuracy improvements and up to 100 times efficiency improvement over the baseline methods.
♻ ☆ Segment Anything Model is a Good Teacher for Local Feature Learning
Local feature detection and description play an important role in many computer vision tasks, which are designed to detect and describe keypoints in "any scene" and "any downstream task". Data-driven local feature learning methods need to rely on pixel-level correspondence for training, which is challenging to acquire at scale, thus hindering further improvements in performance. In this paper, we propose SAMFeat to introduce SAM (segment anything model), a fundamental model trained on 11 million images, as a teacher to guide local feature learning and thus inspire higher performance on limited datasets. To do so, first, we construct an auxiliary task of Attention-weighted Semantic Relation Distillation (ASRD), which distillates feature relations with category-agnostic semantic information learned by the SAM encoder into a local feature learning network, to improve local feature description using semantic discrimination. Second, we develop a technique called Weakly Supervised Contrastive Learning Based on Semantic Grouping (WSC), which utilizes semantic groupings derived from SAM as weakly supervised signals, to optimize the metric space of local descriptors. Third, we design an Edge Attention Guidance (EAG) to further improve the accuracy of local feature detection and description by prompting the network to pay more attention to the edge region guided by SAM. SAMFeat's performance on various tasks such as image matching on HPatches, and long-term visual localization on Aachen Day-Night showcases its superiority over previous local features. The release code is available at https://github.com/vignywang/SAMFeat.
♻ ☆ Learning Discriminative Features for Crowd Counting
Crowd counting models in highly congested areas confront two main challenges: weak localization ability and difficulty in differentiating between foreground and background, leading to inaccurate estimations. The reason is that objects in highly congested areas are normally small and high level features extracted by convolutional neural networks are less discriminative to represent small objects. To address these problems, we propose a learning discriminative features framework for crowd counting, which is composed of a masked feature prediction module (MPM) and a supervised pixel-level contrastive learning module (CLM). The MPM randomly masks feature vectors in the feature map and then reconstructs them, allowing the model to learn about what is present in the masked regions and improving the model's ability to localize objects in high density regions. The CLM pulls targets close to each other and pushes them far away from background in the feature space, enabling the model to discriminate foreground objects from background. Additionally, the proposed modules can be beneficial in various computer vision tasks, such as crowd counting and object detection, where dense scenes or cluttered environments pose challenges to accurate localization. The proposed two modules are plug-and-play, incorporating the proposed modules into existing models can potentially boost their performance in these scenarios.
♻ ☆ Attack on Scene Flow using Point Clouds
Deep neural networks have made significant advancements in accurately estimating scene flow using point clouds, which is vital for many applications like video analysis, action recognition, and navigation. The robustness of these techniques, however, remains a concern, particularly in the face of adversarial attacks that have been proven to deceive state-of-the-art deep neural networks in many domains. Surprisingly, the robustness of scene flow networks against such attacks has not been thoroughly investigated. To address this problem, the proposed approach aims to bridge this gap by introducing adversarial white-box attacks specifically tailored for scene flow networks. Experimental results show that the generated adversarial examples obtain up to 33.7 relative degradation in average end-point error on the KITTI and FlyingThings3D datasets. The study also reveals the significant impact that attacks targeting point clouds in only one dimension or color channel have on average end-point error. Analyzing the success and failure of these attacks on the scene flow networks and their 2D optical flow network variants shows a higher vulnerability for the optical flow networks.
Computation and Language 150
☆ Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts
Reinforcement learning from human feedback (RLHF) has emerged as the primary method for aligning large language models (LLMs) with human preferences. The RLHF process typically starts by training a reward model (RM) using human preference data. Conventional RMs are trained on pairwise responses to the same user request, with relative ratings indicating which response humans prefer. The trained RM serves as a proxy for human preferences. However, due to the black-box nature of RMs, their outputs lack interpretability, as humans cannot intuitively understand why an RM thinks a response is good or not. As RMs act as human preference proxies, we believe they should be human-interpretable to ensure that their internal decision processes are consistent with human preferences and to prevent reward hacking in LLM alignment. To build RMs with interpretable preferences, we propose a two-stage approach: i) train an Absolute-Rating Multi-Objective Reward Model (ArmoRM) with multi-dimensional absolute-rating data, each dimension corresponding to a human-interpretable objective (e.g., honesty, verbosity, safety); ii) employ a Mixture-of-Experts (MoE) strategy with a gating network that automatically selects the most suitable reward objectives based on the context. We efficiently trained an ArmoRM with Llama-3 8B and a gating network consisting of a shallow MLP on top of the ArmoRM. Our trained model, ArmoRM-Llama3-8B, obtains state-of-the-art performance on RewardBench, a benchmark evaluating RMs for language modeling. Notably, the performance of our model surpasses the LLM-as-a-judge method with GPT-4 judges by a margin, and approaches the performance of the much larger Nemotron-4 340B reward model.
comment: Technical report v1. Code and model are released at https://github.com/RLHFlow/RLHF-Reward-Modeling/
☆ LaMDA: Large Model Fine-Tuning via Spectrally Decomposed Low-Dimensional Adaptation
Low-rank adaptation (LoRA) has become the default approach to fine-tune large language models (LLMs) due to its significant reduction in trainable parameters. However, trainable parameter demand for LoRA increases with increasing model embedding dimensions, leading to high compute costs. Additionally, its backward updates require storing high-dimensional intermediate activations and optimizer states, demanding high peak GPU memory. In this paper, we introduce large model fine-tuning via spectrally decomposed low-dimensional adaptation (LaMDA), a novel approach to fine-tuning large language models, which leverages low-dimensional adaptation to achieve significant reductions in trainable parameters and peak GPU memory footprint. LaMDA freezes a first projection matrix (PMA) in the adaptation path while introducing a low-dimensional trainable square matrix, resulting in substantial reductions in trainable parameters and peak GPU memory usage. LaMDA gradually freezes a second projection matrix (PMB) during the early fine-tuning stages, reducing the compute cost associated with weight updates to enhance parameter efficiency further. We also present an enhancement, LaMDA++, incorporating a ``lite-weight" adaptive rank allocation for the LoRA path via normalized spectrum analysis of pre-trained model weights. We evaluate LaMDA/LaMDA++ across various tasks, including natural language understanding with the GLUE benchmark, text summarization, natural language generation, and complex reasoning on different LLMs. Results show that LaMDA matches or surpasses the performance of existing alternatives while requiring up to 17.7x fewer parameter updates and up to 1.32x lower peak GPU memory usage during fine-tuning. Code will be publicly available.
☆ What Are the Odds? Language Models Are Capable of Probabilistic Reasoning
Language models (LM) are capable of remarkably complex linguistic tasks; however, numerical reasoning is an area in which they frequently struggle. An important but rarely evaluated form of reasoning is understanding probability distributions. In this paper, we focus on evaluating the probabilistic reasoning capabilities of LMs using idealized and real-world statistical distributions. We perform a systematic evaluation of state-of-the-art LMs on three tasks: estimating percentiles, drawing samples, and calculating probabilities. We evaluate three ways to provide context to LMs 1) anchoring examples from within a distribution or family of distributions, 2) real-world context, 3) summary statistics on which to base a Normal approximation. Models can make inferences about distributions, and can be further aided by the incorporation of real-world context, example shots and simplified assumptions, even if these assumptions are incorrect or misspecified. To conduct this work, we developed a comprehensive benchmark distribution dataset with associated question-answer pairs that we will release publicly.
comment: 21 pages, 9 figures, 2 tables
☆ From RAGs to rich parameters: Probing how language models utilize external knowledge over parametric information for factual queries
Retrieval Augmented Generation (RAG) enriches the ability of language models to reason using external context to augment responses for a given user prompt. This approach has risen in popularity due to practical applications in various applications of language models in search, question/answering, and chat-bots. However, the exact nature of how this approach works isn't clearly understood. In this paper, we mechanistically examine the RAG pipeline to highlight that language models take shortcut and have a strong bias towards utilizing only the context information to answer the question, while relying minimally on their parametric memory. We probe this mechanistic behavior in language models with: (i) Causal Mediation Analysis to show that the parametric memory is minimally utilized when answering a question and (ii) Attention Contributions and Knockouts to show that the last token residual stream do not get enriched from the subject token in the question, but gets enriched from other informative tokens in the context. We find this pronounced shortcut behaviour true across both LLaMa and Phi family of models.
☆ Is It Good Data for Multilingual Instruction Tuning or Just Bad Multilingual Evaluation for Large Language Models?
Large language models, particularly multilingual ones, are designed, claimed, and expected to cater to native speakers of varied languages. We hypothesise that the current practices of fine-tuning and evaluating these models may mismatch this intention owing to a heavy reliance on translation, which can introduce translation artefacts and defects. It remains unknown whether the nature of the instruction data has an impact on the model output; on the other hand, it remains questionable whether translated test sets can capture such nuances. Due to the often coupled practices of using translated data in both stages, such imperfections could have been overlooked. This work investigates these issues by using controlled native or translated data during instruction tuning and evaluation stages and observing model results. Experiments on eight base models and eight different benchmarks reveal that native or generation benchmarks display a notable difference between native and translated instruction data especially when model performance is high, whereas other types of test sets cannot. Finally, we demonstrate that regularization is beneficial to bridging this gap on structured but not generative tasks.
☆ Adversarial Attacks on Multimodal Agents
Vision-enabled language models (VLMs) are now used to build autonomous multimodal agents capable of taking actions in real environments. In this paper, we show that multimodal agents raise new safety risks, even though attacking agents is more challenging than prior attacks due to limited access to and knowledge about the environment. Our attacks use adversarial text strings to guide gradient-based perturbation over one trigger image in the environment: (1) our captioner attack attacks white-box captioners if they are used to process images into captions as additional inputs to the VLM; (2) our CLIP attack attacks a set of CLIP models jointly, which can transfer to proprietary VLMs. To evaluate the attacks, we curated VisualWebArena-Adv, a set of adversarial tasks based on VisualWebArena, an environment for web-based multimodal agent tasks. Within an L-infinity norm of $16/256$ on a single image, the captioner attack can make a captioner-augmented GPT-4V agent execute the adversarial goals with a 75% success rate. When we remove the captioner or use GPT-4V to generate its own captions, the CLIP attack can achieve success rates of 21% and 43%, respectively. Experiments on agents based on other VLMs, such as Gemini-1.5, Claude-3, and GPT-4o, show interesting differences in their robustness. Further analysis reveals several key factors contributing to the attack's success, and we also discuss the implications for defenses as well. Project page: https://chenwu.io/attack-agent Code and data: https://github.com/ChenWu98/agent-attack
comment: 19 pages
☆ Can Large Language Models Always Solve Easy Problems if They Can Solve Harder Ones?
Large language models (LLMs) have demonstrated impressive capabilities, but still suffer from inconsistency issues (e.g. LLMs can react differently to disturbances like rephrasing or inconsequential order change). In addition to these inconsistencies, we also observe that LLMs, while capable of solving hard problems, can paradoxically fail at easier ones. To evaluate this hard-to-easy inconsistency, we develop the ConsisEval benchmark, where each entry comprises a pair of questions with a strict order of difficulty. Furthermore, we introduce the concept of consistency score to quantitatively measure this inconsistency and analyze the potential for improvement in consistency by relative consistency score. Based on comprehensive experiments across a variety of existing models, we find: (1) GPT-4 achieves the highest consistency score of 92.2\% but is still inconsistent to specific questions due to distraction by redundant information, misinterpretation of questions, etc.; (2) models with stronger capabilities typically exhibit higher consistency, but exceptions also exist; (3) hard data enhances consistency for both fine-tuning and in-context learning. Our data and code will be publicly available on GitHub.
comment: 25 pages, 12 figures, 10 tables
☆ ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools
We introduce ChatGLM, an evolving family of large language models that we have been developing over time. This report primarily focuses on the GLM-4 language series, which includes GLM-4, GLM-4-Air, and GLM-4-9B. They represent our most capable models that are trained with all the insights and lessons gained from the preceding three generations of ChatGLM. To date, the GLM-4 models are pre-trained on ten trillions of tokens mostly in Chinese and English, along with a small set of corpus from 24 languages, and aligned primarily for Chinese and English usage. The high-quality alignment is achieved via a multi-stage post-training process, which involves supervised fine-tuning and learning from human feedback. Evaluations show that GLM-4 1) closely rivals or outperforms GPT-4 in terms of general metrics such as MMLU, GSM8K, MATH, BBH, GPQA, and HumanEval, 2) gets close to GPT-4-Turbo in instruction following as measured by IFEval, 3) matches GPT-4 Turbo (128K) and Claude 3 for long context tasks, and 4) outperforms GPT-4 in Chinese alignments as measured by AlignBench. The GLM-4 All Tools model is further aligned to understand user intent and autonomously decide when and which tool(s) touse -- including web browser, Python interpreter, text-to-image model, and user-defined functions -- to effectively complete complex tasks. In practical applications, it matches and even surpasses GPT-4 All Tools in tasks like accessing online information via web browsing and solving math problems using Python interpreter. Over the course, we have open-sourced a series of models, including ChatGLM-6B (three generations), GLM-4-9B (128K, 1M), GLM-4V-9B, WebGLM, and CodeGeeX, attracting over 10 million downloads on Hugging face in the year 2023 alone. The open models can be accessed through https://github.com/THUDM and https://huggingface.co/THUDM.
☆ Generating Educational Materials with Different Levels of Readability using LLMs
This study introduces the leveled-text generation task, aiming to rewrite educational materials to specific readability levels while preserving meaning. We assess the capability of GPT-3.5, LLaMA-2 70B, and Mixtral 8x7B, to generate content at various readability levels through zero-shot and few-shot prompting. Evaluating 100 processed educational materials reveals that few-shot prompting significantly improves performance in readability manipulation and information preservation. LLaMA-2 70B performs better in achieving the desired difficulty range, while GPT-3.5 maintains original meaning. However, manual inspection highlights concerns such as misinformation introduction and inconsistent edit distribution. These findings emphasize the need for further research to ensure the quality of generated educational content.
comment: In2Writing 2024
☆ UBENCH: Benchmarking Uncertainty in Large Language Models with Multiple Choice Questions
The rapid development of large language models (LLMs) has shown promising practical results. However, their low interpretability often leads to errors in unforeseen circumstances, limiting their utility. Many works have focused on creating comprehensive evaluation systems, but previous benchmarks have primarily assessed problem-solving abilities while neglecting the response's uncertainty, which may result in unreliability. Recent methods for measuring LLM reliability are resource-intensive and unable to test black-box models. To address this, we propose UBENCH, a comprehensive benchmark for evaluating LLM reliability. UBENCH includes 3,978 multiple-choice questions covering knowledge, language, understanding, and reasoning abilities. Experimental results show that UBENCH has achieved state-of-the-art performance, while its single-sampling method significantly saves computational resources compared to baseline methods that require multiple samplings. Additionally, based on UBENCH, we evaluate the reliability of 15 popular LLMs, finding GLM4 to be the most outstanding, closely followed by GPT-4. We also explore the impact of Chain-of-Thought prompts, role-playing prompts, option order, and temperature on LLM reliability, analyzing the varying effects on different LLMs.
comment: Under review
☆ Composited-Nested-Learning with Data Augmentation for Nested Named Entity Recognition SC
Nested Named Entity Recognition (NNER) focuses on addressing overlapped entity recognition. Compared to Flat Named Entity Recognition (FNER), annotated resources are scarce in the corpus for NNER. Data augmentation is an effective approach to address the insufficient annotated corpus. However, there is a significant lack of exploration in data augmentation methods for NNER. Due to the presence of nested entities in NNER, existing data augmentation methods cannot be directly applied to NNER tasks. Therefore, in this work, we focus on data augmentation for NNER and resort to more expressive structures, Composited-Nested-Label Classification (CNLC) in which constituents are combined by nested-word and nested-label, to model nested entities. The dataset is augmented using the Composited-Nested-Learning (CNL). In addition, we propose the Confidence Filtering Mechanism (CFM) for a more efficient selection of generated data. Experimental results demonstrate that this approach results in improvements in ACE2004 and ACE2005 and alleviates the impact of sample imbalance.
comment: Accepted by CSCWD 2024
☆ Hopping Too Late: Exploring the Limitations of Large Language Models on Multi-Hop Queries
Large language models (LLMs) can solve complex multi-step problems, but little is known about how these computations are implemented internally. Motivated by this, we study how LLMs answer multi-hop queries such as "The spouse of the performer of Imagine is". These queries require two information extraction steps: a latent one for resolving the first hop ("the performer of Imagine") into the bridge entity (John Lennon), and one for resolving the second hop ("the spouse of John Lennon") into the target entity (Yoko Ono). Understanding how the latent step is computed internally is key to understanding the overall computation. By carefully analyzing the internal computations of transformer-based LLMs, we discover that the bridge entity is resolved in the early layers of the model. Then, only after this resolution, the two-hop query is solved in the later layers. Because the second hop commences in later layers, there could be cases where these layers no longer encode the necessary knowledge for correctly predicting the answer. Motivated by this, we propose a novel "back-patching" analysis method whereby a hidden representation from a later layer is patched back to an earlier layer. We find that in up to 57% of previously incorrect cases there exists a back-patch that results in the correct generation of the answer, showing that the later layers indeed sometimes lack the needed functionality. Overall our methods and findings open further opportunities for understanding and improving latent reasoning in transformer-based LLMs.
☆ Formatics & dairy industry coalition: AI trends and present challenges
Artificial Intelligence (AI) can potentially transform the industry, enhancing the production process and minimizing manual, repetitive tasks. Accordingly, the synergy between high-performance computing and powerful mathematical models enables the application of sophisticated data analysis procedures like Machine Learning. However, challenges exist regarding effective, efficient, and flexible processing to generate valuable knowledge. Consequently, this work comprehensively describes industrial challenges where AI can be exploited, focusing on the dairy industry. The conclusions presented can help researchers apply novel approaches for cattle monitoring and farmers by proposing advanced technological solutions to their needs.
☆ Chumor 1.0: A Truly Funny and Challenging Chinese Humor Understanding Dataset from Ruo Zhi Ba
Existing humor datasets and evaluations predominantly focus on English, lacking resources for culturally nuanced humor in non-English languages like Chinese. To address this gap, we construct Chumor, a dataset sourced from Ruo Zhi Ba (RZB), a Chinese Reddit-like platform dedicated to sharing intellectually challenging and culturally specific jokes. We annotate explanations for each joke and evaluate human explanations against two state-of-the-art LLMs, GPT-4o and ERNIE Bot, through A/B testing by native Chinese speakers. Our evaluation shows that Chumor is challenging even for SOTA LLMs, and the human explanations for Chumor jokes are significantly better than explanations generated by the LLMs.
☆ OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI
The evolution of Artificial Intelligence (AI) has been significantly accelerated by advancements in Large Language Models (LLMs) and Large Multimodal Models (LMMs), gradually showcasing potential cognitive reasoning abilities in problem-solving and scientific discovery (i.e., AI4Science) once exclusive to human intellect. To comprehensively evaluate current models' performance in cognitive reasoning abilities, we introduce OlympicArena, which includes 11,163 bilingual problems across both text-only and interleaved text-image modalities. These challenges encompass a wide range of disciplines spanning seven fields and 62 international Olympic competitions, rigorously examined for data leakage. We argue that the challenges in Olympic competition problems are ideal for evaluating AI's cognitive reasoning due to their complexity and interdisciplinary nature, which are essential for tackling complex scientific challenges and facilitating discoveries. Beyond evaluating performance across various disciplines using answer-only criteria, we conduct detailed experiments and analyses from multiple perspectives. We delve into the models' cognitive reasoning abilities, their performance across different modalities, and their outcomes in process-level evaluations, which are vital for tasks requiring complex reasoning with lengthy solutions. Our extensive evaluations reveal that even advanced models like GPT-4o only achieve a 39.97% overall accuracy, illustrating current AI limitations in complex reasoning and multimodal integration. Through the OlympicArena, we aim to advance AI towards superintelligence, equipping it to address more complex challenges in science and beyond. We also provide a comprehensive set of resources to support AI research, including a benchmark dataset, an open-source annotation platform, a detailed evaluation tool, and a leaderboard with automatic submission features.
comment: 44 pages
☆ Rationale-based Ensemble of Multiple QA Strategies for Zero-shot Knowledge-based VQA
Knowledge-based Visual Qustion-answering (K-VQA) necessitates the use of background knowledge beyond what is depicted in the image. Current zero-shot K-VQA methods usually translate an image to a single type of textual decision context and use a text-based model to answer the question based on it, which conflicts with the fact that K-VQA questions often require the combination of multiple question-answering strategies. In light of this, we propose Rationale-based Ensemble of Answer Context Tactics (REACT) to achieve a dynamic ensemble of multiple question-answering tactics, comprising Answer Candidate Generation (ACG) and Rationale-based Strategy Fusion (RSF). In ACG, we generate three distinctive decision contexts to provide different strategies for each question, resulting in the generation of three answer candidates. RSF generates automatic and mechanistic rationales from decision contexts for each candidate, allowing the model to select the correct answer from all candidates. We conduct comprehensive experiments on the OK-VQA and A-OKVQA datasets, and our method significantly outperforms state-of-the-art LLM-based baselines on all datasets.
☆ Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning
The advancement of large language models (LLMs) has significantly broadened the scope of applications in natural language processing, with multi-modal LLMs extending these capabilities to integrate and interpret visual data. However, existing benchmarks for visual language models (VLMs) predominantly focus on single-image inputs, neglecting the crucial aspect of multi-image understanding. In this paper, we introduce a Multi-Image Relational Benchmark MIRB, designed to evaluate VLMs' ability to compare, analyze, and reason across multiple images. Our benchmark encompasses four categories: perception, visual world knowledge, reasoning, and multi-hop reasoning. Through a comprehensive evaluation of a wide range of open-source and closed-source models, we demonstrate that while open-source VLMs were shown to approach the performance of GPT-4V in single-image tasks, a significant performance gap remains in multi-image reasoning tasks. Our findings also reveal that even the state-of-the-art GPT-4V model struggles with our benchmark, underscoring the need for further research and development in this area. We believe our contribution of MIRB could serve as a testbed for developing the next-generation multi-modal models.
comment: First three authors contributed equally. Dataset: https://huggingface.co/datasets/VLLMs/MIRB
☆ Self-Distillation for Model Stacking Unlocks Cross-Lingual NLU in 200+ Languages
LLMs have become a go-to solution not just for text generation, but also for natural language understanding (NLU) tasks. Acquiring extensive knowledge through language modeling on web-scale corpora, they excel on English NLU, yet struggle to extend their NLU capabilities to underrepresented languages. In contrast, machine translation models (MT) produce excellent multilingual representations, resulting in strong translation performance even for low-resource languages. MT encoders, however, lack the knowledge necessary for comprehensive NLU that LLMs obtain through language modeling training on immense corpora. In this work, we get the best both worlds by integrating MT encoders directly into LLM backbones via sample-efficient self-distillation. The resulting MT-LLMs preserve the inherent multilingual representational alignment from the MT encoder, allowing lower-resource languages to tap into the rich knowledge embedded in English-centric LLMs. Merging the MT encoder and LLM in a single model, we mitigate the propagation of translation errors and inference overhead of MT decoding inherent to discrete translation-based cross-lingual transfer (e.g., translate-test). Evaluation spanning three prominent NLU tasks and 127 predominantly low-resource languages renders MT-LLMs highly effective in cross-lingual transfer. MT-LLMs substantially and consistently outperform translate-test based on the same MT model, showing that we truly unlock multilingual language understanding for LLMs.
☆ Large Language Model as a Universal Clinical Multi-task Decoder
The development of effective machine learning methodologies for enhancing the efficiency and accuracy of clinical systems is crucial. Despite significant research efforts, managing a plethora of diversified clinical tasks and adapting to emerging new tasks remain significant challenges. This paper presents a novel paradigm that employs a pre-trained large language model as a universal clinical multi-task decoder. This approach leverages the flexibility and diversity of language expressions to handle task topic variations and associated arguments. The introduction of a new task simply requires the addition of a new instruction template. We validate this framework across hundreds of tasks, demonstrating its robustness in facilitating multi-task predictions, performing on par with traditional multi-task learning and single-task learning approaches. Moreover, it shows exceptional adaptability to new tasks, with impressive zero-shot performance in some instances and superior data efficiency in few-shot scenarios. This novel approach offers a unified solution to manage a wide array of new and emerging tasks in clinical applications.
comment: Work in progress
☆ Can Large Language Models Code Like a Linguist?: A Case Study in Low Resource Sound Law Induction
Historical linguists have long written a kind of incompletely formalized ''program'' that converts reconstructed words in an ancestor language into words in one of its attested descendants that consist of a series of ordered string rewrite functions (called sound laws). They do this by observing pairs of words in the reconstructed language (protoforms) and the descendent language (reflexes) and constructing a program that transforms protoforms into reflexes. However, writing these programs is error-prone and time-consuming. Prior work has successfully scaffolded this process computationally, but fewer researchers have tackled Sound Law Induction (SLI), which we approach in this paper by casting it as Programming by Examples. We propose a language-agnostic solution that utilizes the programming ability of Large Language Models (LLMs) by generating Python sound law programs from sound change examples. We evaluate the effectiveness of our approach for various LLMs, propose effective methods to generate additional language-agnostic synthetic data to fine-tune LLMs for SLI, and compare our method with existing automated SLI methods showing that while LLMs lag behind them they can complement some of their weaknesses.
☆ On the Robustness of Language Models for Tabular Question Answering
Large Language Models (LLMs), originally shown to ace various text comprehension tasks have also remarkably been shown to tackle table comprehension tasks without specific training. While previous research has explored LLM capabilities with tabular dataset tasks, our study assesses the influence of $\textit{in-context learning}$,$ \textit{model scale}$, $\textit{instruction tuning}$, and $\textit{domain biases}$ on Tabular Question Answering (TQA). We evaluate the robustness of LLMs on Wikipedia-based $\textbf{WTQ}$ and financial report-based $\textbf{TAT-QA}$ TQA datasets, focusing on their ability to robustly interpret tabular data under various augmentations and perturbations. Our findings indicate that instructions significantly enhance performance, with recent models like Llama3 exhibiting greater robustness over earlier versions. However, data contamination and practical reliability issues persist, especially with WTQ. We highlight the need for improved methodologies, including structure-aware self-attention mechanisms and better handling of domain-specific tabular data, to develop more reliable LLMs for table comprehension.
☆ AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention
Despite their great success across various multimodal tasks, Large Vision-Language Models (LVLMs) are facing a prevalent problem with object hallucinations, where the generated textual responses are inconsistent with ground-truth objects in the given image. This paper investigates various LVLMs and pinpoints attention deficiency toward discriminative local image features as one root cause of object hallucinations. Specifically, LVLMs predominantly attend to prompt-independent global image features, while failing to capture prompt-relevant local features, consequently undermining the visual grounding capacity of LVLMs and leading to hallucinations. To this end, we propose Assembly of Global and Local Attention (AGLA), a training-free and plug-and-play approach that mitigates object hallucinations by exploring an ensemble of global features for response generation and local features for visual discrimination simultaneously. Our approach exhibits an image-prompt matching scheme that captures prompt-relevant local features from images, leading to an augmented view of the input image where prompt-relevant content is reserved while irrelevant distractions are masked. With the augmented view, a calibrated decoding distribution can be derived by integrating generative global features from the original image and discriminative local features from the augmented image. Extensive experiments show that AGLA consistently mitigates object hallucinations and enhances general perception capability for LVLMs across various discriminative and generative benchmarks. Our code will be released at https://github.com/Lackel/AGLA.
☆ AgentReview: Exploring Peer Review Dynamics with LLM Agents
Peer review is fundamental to the integrity and advancement of scientific publication. Traditional methods of peer review analyses often rely on exploration and statistics of existing peer review data, which do not adequately address the multivariate nature of the process, account for the latent variables, and are further constrained by privacy concerns due to the sensitive nature of the data. We introduce AgentReview, the first large language model (LLM) based peer review simulation framework, which effectively disentangles the impacts of multiple latent factors and addresses the privacy issue. Our study reveals significant insights, including a notable 37.1% variation in paper decisions due to reviewers' biases, supported by sociological theories such as the social influence theory, altruism fatigue, and authority bias. We believe that this study could offer valuable insights to improve the design of peer review mechanisms.
comment: 22 pages, 10 figures
☆ Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction ACL24
Large Language Model (LLM)-enhanced agents become increasingly prevalent in Human-AI communication, offering vast potential from entertainment to professional domains. However, current multi-modal dialogue systems overlook the acoustic information present in speech, which is crucial for understanding human communication nuances. This oversight can lead to misinterpretations of speakers' intentions, resulting in inconsistent or even contradictory responses within dialogues. To bridge this gap, in this paper, we propose PerceptiveAgent, an empathetic multi-modal dialogue system designed to discern deeper or more subtle meanings beyond the literal interpretations of words through the integration of speech modality perception. Employing LLMs as a cognitive core, PerceptiveAgent perceives acoustic information from input speech and generates empathetic responses based on speaking styles described in natural language. Experimental results indicate that PerceptiveAgent excels in contextual understanding by accurately discerning the speakers' true intentions in scenarios where the linguistic meaning is either contrary to or inconsistent with the speaker's true feelings, producing more nuanced and expressive spoken dialogues. Code is publicly available at: \url{https://github.com/Haoqiu-Yan/PerceptiveAgent}.
comment: 9 pages, 3 figures, ACL24 accepted
☆ Jailbreak Paradox: The Achilles' Heel of LLMs
We introduce two paradoxes concerning jailbreak of foundation models: First, it is impossible to construct a perfect jailbreak classifier, and second, a weaker model cannot consistently detect whether a stronger (in a pareto-dominant sense) model is jailbroken or not. We provide formal proofs for these paradoxes and a short case study on Llama and GPT4-o to demonstrate this. We discuss broader theoretical and practical repercussions of these results.
☆ MAGIC: Generating Self-Correction Guideline for In-Context Text-to-SQL
Self-correction in text-to-SQL is the process of prompting large language model (LLM) to revise its previously incorrectly generated SQL, and commonly relies on manually crafted self-correction guidelines by human experts that are not only labor-intensive to produce but also limited by the human ability in identifying all potential error patterns in LLM responses. We introduce MAGIC, a novel multi-agent method that automates the creation of the self-correction guideline. MAGIC uses three specialized agents: a manager, a correction, and a feedback agent. These agents collaborate on the failures of an LLM-based method on the training set to iteratively generate and refine a self-correction guideline tailored to LLM mistakes, mirroring human processes but without human involvement. Our extensive experiments show that MAGIC's guideline outperforms expert human's created ones. We empirically find out that the guideline produced by MAGIC enhance the interpretability of the corrections made, providing insights in analyzing the reason behind the failures and successes of LLMs in self-correction. We make all agent interactions publicly available to the research community, to foster further research in this area, offering a synthetic dataset for future explorations into automatic self-correction guideline generation.
comment: 20 pages, 17 figures
☆ Using LLMs to Aid Annotation and Collection of Clinically-Enriched Data in Bipolar Disorder and Schizophrenia
NLP in mental health has been primarily social media focused. Real world practitioners also have high case loads and often domain specific variables, of which modern LLMs lack context. We take a dataset made by recruiting 644 participants, including individuals diagnosed with Bipolar Disorder (BD), Schizophrenia (SZ), and Healthy Controls (HC). Participants undertook tasks derived from a standardized mental health instrument, and the resulting data were transcribed and annotated by experts across five clinical variables. This paper demonstrates the application of contemporary language models in sequence-to-sequence tasks to enhance mental health research. Specifically, we illustrate how these models can facilitate the deployment of mental health instruments, data collection, and data annotation with high accuracy and scalability. We show that small models are capable of annotation for domain-specific clinical variables, data collection for mental-health instruments, and perform better then commercial large models.
☆ Measuring Psychological Depth in Language Models
Evaluations of creative stories generated by large language models (LLMs) often focus on objective properties of the text, such as its style, coherence, and toxicity. While these metrics are indispensable, they do not speak to a story's subjective, psychological impact from a reader's perspective. We introduce the Psychological Depth Scale (PDS), a novel framework rooted in literary theory that measures an LLM's ability to produce authentic and narratively complex stories that provoke emotion, empathy, and engagement. We empirically validate our framework by showing that humans can consistently evaluate stories based on PDS (0.72 Krippendorff's alpha). We also explore techniques for automating the PDS to easily scale future analyses. GPT-4o, combined with a novel Mixture-of-Personas (MoP) prompting strategy, achieves an average Spearman correlation of $0.51$ with human judgment while Llama-3-70B scores as high as 0.68 for empathy. Finally, we compared the depth of stories authored by both humans and LLMs. Surprisingly, GPT-4 stories either surpassed or were statistically indistinguishable from highly-rated human-written stories sourced from Reddit. By shifting the focus from text to reader, the Psychological Depth Scale is a validated, automated, and systematic means of measuring the capacity of LLMs to connect with humans through the stories they tell.
comment: Preprint. Under Review
☆ Vernacular? I Barely Know Her: Challenges with Style Control and Stereotyping
Large Language Models (LLMs) are increasingly being used in educational and learning applications. Research has demonstrated that controlling for style, to fit the needs of the learner, fosters increased understanding, promotes inclusion, and helps with knowledge distillation. To understand the capabilities and limitations of contemporary LLMs in style control, we evaluated five state-of-the-art models: GPT-3.5, GPT-4, GPT-4o, Llama-3, and Mistral-instruct- 7B across two style control tasks. We observed significant inconsistencies in the first task, with model performances averaging between 5th and 8th grade reading levels for tasks intended for first-graders, and standard deviations up to 27.6. For our second task, we observed a statistically significant improvement in performance from 0.02 to 0.26. However, we find that even without stereotypes in reference texts, LLMs often generated culturally insensitive content during their tasks. We provide a thorough analysis and discussion of the results.
☆ Estimating Knowledge in Large Language Models Without Generating a Single Token
To evaluate knowledge in large language models (LLMs), current methods query the model and then evaluate its generated responses. In this work, we ask whether evaluation can be done $\textit{before}$ the model has generated any text. Concretely, is it possible to estimate how knowledgeable a model is about a certain entity, only from its internal computation? We study this question with two tasks: given a subject entity, the goal is to predict (a) the ability of the model to answer common questions about the entity, and (b) the factuality of responses generated by the model about the entity. Experiments with a variety of LLMs show that KEEN, a simple probe trained over internal subject representations, succeeds at both tasks - strongly correlating with both the QA accuracy of the model per-subject and FActScore, a recent factuality metric in open-ended generation. Moreover, KEEN naturally aligns with the model's hedging behavior and faithfully reflects changes in the model's knowledge after fine-tuning. Lastly, we show a more interpretable yet equally performant variant of KEEN, which highlights a small set of tokens that correlates with the model's lack of knowledge. Being simple and lightweight, KEEN can be leveraged to identify gaps and clusters of entity knowledge in LLMs, and guide decisions such as augmenting queries with retrieval.
☆ CollabStory: Multi-LLM Collaborative Story Generation and Authorship Analysis
The rise of unifying frameworks that enable seamless interoperability of Large Language Models (LLMs) has made LLM-LLM collaboration for open-ended tasks a possibility. Despite this, there have not been efforts to explore such collaborative writing. We take the next step beyond human-LLM collaboration to explore this multi-LLM scenario by generating the first exclusively LLM-generated collaborative stories dataset called CollabStory. We focus on single-author ($N=1$) to multi-author (up to $N=5$) scenarios, where multiple LLMs co-author stories. We generate over 32k stories using open-source instruction-tuned LLMs. Further, we take inspiration from the PAN tasks that have set the standard for human-human multi-author writing tasks and analysis. We extend their authorship-related tasks for multi-LLM settings and present baselines for LLM-LLM collaboration. We find that current baselines are not able to handle this emerging scenario. Thus, CollabStory is a resource that could help propel an understanding as well as the development of techniques to discern the use of multiple LLMs. This is crucial to study in the context of writing tasks since LLM-LLM collaboration could potentially overwhelm ongoing challenges related to plagiarism detection, credit assignment, maintaining academic integrity in educational settings, and addressing copyright infringement concerns. We make our dataset and code available at \texttt{\url{https://github.com/saranya-venkatraman/multi_llm_story_writing}}.
☆ Transforming Surgical Interventions with Embodied Intelligence for Ultrasound Robotics MICCAI 2024
Ultrasonography has revolutionized non-invasive diagnostic methodologies, significantly enhancing patient outcomes across various medical domains. Despite its advancements, integrating ultrasound technology with robotic systems for automated scans presents challenges, including limited command understanding and dynamic execution capabilities. To address these challenges, this paper introduces a novel Ultrasound Embodied Intelligence system that synergistically combines ultrasound robots with large language models (LLMs) and domain-specific knowledge augmentation, enhancing ultrasound robots' intelligence and operational efficiency. Our approach employs a dual strategy: firstly, integrating LLMs with ultrasound robots to interpret doctors' verbal instructions into precise motion planning through a comprehensive understanding of ultrasound domain knowledge, including APIs and operational manuals; secondly, incorporating a dynamic execution mechanism, allowing for real-time adjustments to scanning plans based on patient movements or procedural errors. We demonstrate the effectiveness of our system through extensive experiments, including ablation studies and comparisons across various models, showcasing significant improvements in executing medical procedures from verbal commands. Our findings suggest that the proposed system improves the efficiency and quality of ultrasound scans and paves the way for further advancements in autonomous medical scanning technologies, with the potential to transform non-invasive diagnostics and streamline medical workflows.
comment: This work has been accepted by MICCAI 2024
☆ Evaluating Transparency of Machine Generated Fact Checking Explanations
An important factor when it comes to generating fact-checking explanations is the selection of evidence: intuitively, high-quality explanations can only be generated given the right evidence. In this work, we investigate the impact of human-curated vs. machine-selected evidence for explanation generation using large language models. To assess the quality of explanations, we focus on transparency (whether an explanation cites sources properly) and utility (whether an explanation is helpful in clarifying a claim). Surprisingly, we found that large language models generate similar or higher quality explanations using machine-selected evidence, suggesting carefully curated evidence (by humans) may not be necessary. That said, even with the best model, the generated explanations are not always faithful to the sources, suggesting further room for improvement in explanation generation for fact-checking.
☆ Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models
Assessing the effectiveness of large language models (LLMs) in addressing diverse tasks is essential for comprehending their strengths and weaknesses. Conventional evaluation techniques typically apply a single prompting strategy uniformly across datasets, not considering the varying degrees of task complexity. We introduce the Hierarchical Prompting Taxonomy (HPT), a taxonomy that employs a Hierarchical Prompt Framework (HPF) composed of five unique prompting strategies, arranged from the simplest to the most complex, to assess LLMs more precisely and to offer a clearer perspective. This taxonomy assigns a score, called the Hierarchical Prompting Score (HP-Score), to datasets as well as LLMs based on the rules of the taxonomy, providing a nuanced understanding of their ability to solve diverse tasks and offering a universal measure of task complexity. Additionally, we introduce the Adaptive Hierarchical Prompt framework, which automates the selection of appropriate prompting strategies for each task. This study compares manual and adaptive hierarchical prompt frameworks using four instruction-tuned LLMs, namely Llama 3 8B, Phi 3 3.8B, Mistral 7B, and Gemma 7B, across four datasets: BoolQ, CommonSenseQA (CSQA), IWSLT-2017 en-fr (IWSLT), and SamSum. Experiments demonstrate the effectiveness of HPT, providing a reliable way to compare different tasks and LLM capabilities. This paper leads to the development of a universal evaluation metric that can be used to evaluate both the complexity of the datasets and the capabilities of LLMs. The implementation of both manual HPF and adaptive HPF is publicly available.
☆ DetectBench: Can Large Language Model Detect and Piece Together Implicit Evidence?
Detecting evidence within the context is a key step in the process of reasoning task. Evaluating and enhancing the capabilities of LLMs in evidence detection will strengthen context-based reasoning performance. This paper proposes a benchmark called DetectBench for verifying the ability to detect and piece together implicit evidence within a long context. DetectBench contains 3,928 multiple-choice questions, with an average of 994 tokens per question. Each question contains an average of 4.55 pieces of implicit evidence, and solving the problem typically requires 7.62 logical jumps to find the correct answer. To enhance the performance of LLMs in evidence detection, this paper proposes Detective Reasoning Prompt and Finetune. Experiments demonstrate that the existing LLMs' abilities to detect evidence in long contexts are far inferior to humans. However, the Detective Reasoning Prompt effectively enhances the capability of powerful LLMs in evidence detection, while the Finetuning method shows significant effects in enhancing the performance of weaker LLMs. Moreover, when the abilities of LLMs in evidence detection are improved, their final reasoning performance is also enhanced accordingly.
☆ Ask-before-Plan: Proactive Language Agents for Real-World Planning
The evolution of large language models (LLMs) has enhanced the planning capabilities of language agents in diverse real-world scenarios. Despite these advancements, the potential of LLM-powered agents to comprehend ambiguous user instructions for reasoning and decision-making is still under exploration. In this work, we introduce a new task, Proactive Agent Planning, which requires language agents to predict clarification needs based on user-agent conversation and agent-environment interaction, invoke external tools to collect valid information, and generate a plan to fulfill the user's demands. To study this practical problem, we establish a new benchmark dataset, Ask-before-Plan. To tackle the deficiency of LLMs in proactive planning, we propose a novel multi-agent framework, Clarification-Execution-Planning (\texttt{CEP}), which consists of three agents specialized in clarification, execution, and planning. We introduce the trajectory tuning scheme for the clarification agent and static execution agent, as well as the memory recollection mechanism for the dynamic execution agent. Extensive evaluations and comprehensive analyses conducted on the Ask-before-Plan dataset validate the effectiveness of our proposed framework.
☆ SeTAR: Out-of-Distribution Detection with Selective Low-Rank Approximation
Out-of-distribution (OOD) detection is crucial for the safe deployment of neural networks. Existing CLIP-based approaches perform OOD detection by devising novel scoring functions or sophisticated fine-tuning methods. In this work, we propose SeTAR, a novel, training-free OOD detection method that leverages selective low-rank approximation of weight matrices in vision-language and vision-only models. SeTAR enhances OOD detection via post-hoc modification of the model's weight matrices using a simple greedy search algorithm. Based on SeTAR, we further propose SeTAR+FT, a fine-tuning extension optimizing model performance for OOD detection tasks. Extensive evaluations on ImageNet1K and Pascal-VOC benchmarks show SeTAR's superior performance, reducing the false positive rate by up to 18.95% and 36.80% compared to zero-shot and fine-tuning baselines. Ablation studies further validate our approach's effectiveness, robustness, and generalizability across different model backbones. Our work offers a scalable, efficient solution for OOD detection, setting a new state-of-the-art in this area.
comment: Code are available at \url{https://github.com/X1AOX1A/SeTAR}
☆ Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Offering a promising solution to the scalability challenges associated with human evaluation, the LLM-as-a-judge paradigm is rapidly gaining traction as an approach to evaluating large language models (LLMs). However, there are still many open questions about the strengths and weaknesses of this paradigm, and what potential biases it may hold. In this paper, we present a comprehensive study of the performance of various LLMs acting as judges. We leverage TriviaQA as a benchmark for assessing objective knowledge reasoning of LLMs and evaluate them alongside human annotations which we found to have a high inter-annotator agreement. Our study includes 9 judge models and 9 exam taker models -- both base and instruction-tuned. We assess the judge model's alignment across different model sizes, families, and judge prompts. Among other results, our research rediscovers the importance of using Cohen's kappa as a metric of alignment as opposed to simple percent agreement, showing that judges with high percent agreement can still assign vastly different scores. We find that both Llama-3 70B and GPT-4 Turbo have an excellent alignment with humans, but in terms of ranking exam taker models, they are outperformed by both JudgeLM-7B and the lexical judge Contains, which have up to 34 points lower human alignment. Through error analysis and various other studies, including the effects of instruction length and leniency bias, we hope to provide valuable lessons for using LLMs as judges in the future.
☆ Growing Trees on Sounds: Assessing Strategies for End-to-End Dependency Parsing of Speech ACL 2024
Direct dependency parsing of the speech signal -- as opposed to parsing speech transcriptions -- has recently been proposed as a task (Pupier et al. 2022), as a way of incorporating prosodic information in the parsing system and bypassing the limitations of a pipeline approach that would consist of using first an Automatic Speech Recognition (ASR) system and then a syntactic parser. In this article, we report on a set of experiments aiming at assessing the performance of two parsing paradigms (graph-based parsing and sequence labeling based parsing) on speech parsing. We perform this evaluation on a large treebank of spoken French, featuring realistic spontaneous conversations. Our findings show that (i) the graph based approach obtain better results across the board (ii) parsing directly from speech outperforms a pipeline approach, despite having 30% fewer parameters.
comment: Accepted at ACL 2024
☆ What makes two models think alike?
Do architectural differences significantly affect the way models represent and process language? We propose a new approach, based on metric-learning encoding models (MLEMs), as a first step to answer this question. The approach provides a feature-based comparison of how any two layers of any two models represent linguistic information. We apply the method to BERT, GPT-2 and Mamba. Unlike previous methods, MLEMs offer a transparent comparison, by identifying the specific linguistic features responsible for similarities and differences. More generally, the method uses formal, symbolic descriptions of a domain, and use these to compare neural representations. As such, the approach can straightforwardly be extended to other domains, such as speech and vision, and to other neural systems, including human brains.
comment: 7 pages, 6 figures
☆ From Insights to Actions: The Impact of Interpretability and Analysis Research on NLP
Interpretability and analysis (IA) research is a growing subfield within NLP with the goal of developing a deeper understanding of the behavior or inner workings of NLP systems and methods. Despite growing interest in the subfield, a commonly voiced criticism is that it lacks actionable insights and therefore has little impact on NLP. In this paper, we seek to quantify the impact of IA research on the broader field of NLP. We approach this with a mixed-methods analysis of: (1) a citation graph of 185K+ papers built from all papers published at ACL and EMNLP conferences from 2018 to 2023, and (2) a survey of 138 members of the NLP community. Our quantitative results show that IA work is well-cited outside of IA, and central in the NLP citation graph. Through qualitative analysis of survey responses and manual annotation of 556 papers, we find that NLP researchers build on findings from IA work and perceive it is important for progress in NLP, multiple subfields, and rely on its findings and terminology for their own work. Many novel methods are proposed based on IA findings and highly influenced by them, but highly influential non-IA work cites IA findings without being driven by them. We end by summarizing what is missing in IA work today and provide a call to action, to pave the way for a more impactful future of IA research.
☆ EUvsDisinfo: a Dataset for Multilingual Detection of Pro-Kremlin Disinformation in News Articles
This work introduces EUvsDisinfo, a multilingual dataset of trustworthy and disinformation articles related to pro-Kremlin themes. It is sourced directly from the debunk articles written by experts leading the EUvsDisinfo project. Our dataset is the largest to-date resource in terms of the overall number of articles and distinct languages. It also provides the largest topical and temporal coverage. Using this dataset, we investigate the dissemination of pro-Kremlin disinformation across different languages, uncovering language-specific patterns targeting specific disinformation topics. We further analyse the evolution of topic distribution over an eight-year period, noting a significant surge in disinformation content before the full-scale invasion of Ukraine in 2022. Lastly, we demonstrate the dataset's applicability in training models to effectively distinguish between disinformation and trustworthy content in multilingual settings.
comment: 4 pages, 3 figures, 2 tables
☆ Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting INTERSPEECH 2024
End-to-end multilingual speech recognition models handle multiple languages through a single model, often incorporating language identification to automatically detect the language of incoming speech. Since the common scenario is where the language is already known, these models can perform as language-specific by using language information as prompts, which is particularly beneficial for attention-based encoder-decoder architectures. However, the Connectionist Temporal Classification (CTC) approach, which enhances recognition via joint decoding and multi-task training, does not normally incorporate language prompts due to its conditionally independent output tokens. To overcome this, we introduce an encoder prompting technique within the self-conditioned CTC framework, enabling language-specific adaptation of the CTC model in a zero-shot manner. Our method has shown to significantly reduce errors by 28% on average and by 41% on low-resource languages.
comment: Accepted by INTERSPEECH 2024
☆ Bridging Local Details and Global Context in Text-Attributed Graphs
Representation learning on text-attributed graphs (TAGs) is vital for real-world applications, as they combine semantic textual and contextual structural information. Research in this field generally consist of two main perspectives: local-level encoding and global-level aggregating, respectively refer to textual node information unification (e.g., using Language Models) and structure-augmented modeling (e.g., using Graph Neural Networks). Most existing works focus on combining different information levels but overlook the interconnections, i.e., the contextual textual information among nodes, which provides semantic insights to bridge local and global levels. In this paper, we propose GraphBridge, a multi-granularity integration framework that bridges local and global perspectives by leveraging contextual textual information, enhancing fine-grained understanding of TAGs. Besides, to tackle scalability and efficiency challenges, we introduce a graphaware token reduction module. Extensive experiments across various models and datasets show that our method achieves state-of-theart performance, while our graph-aware token reduction module significantly enhances efficiency and solves scalability issues.
☆ Low-Redundant Optimization for Large Language Model Alignment
Large language models (LLMs) are still struggling in aligning with human preference in complex tasks and scenarios. They are prone to overfit into the unexpected patterns or superficial styles in the training data. We conduct an empirical study that only selects the top-10\% most updated parameters in LLMs for alignment training, and see improvements in the convergence process and final performance. It indicates the existence of redundant neurons in LLMs for alignment training. To reduce its influence, we propose a low-redundant alignment method named \textbf{ALLO}, focusing on optimizing the most related neurons with the most useful supervised signals. Concretely, we first identify the neurons that are related to the human preference data by a gradient-based strategy, then identify the alignment-related key tokens by reward models for computing loss. Besides, we also decompose the alignment process into the forgetting and learning stages, where we first forget the tokens with unaligned knowledge and then learn aligned knowledge, by updating different ratios of neurons, respectively. Experimental results on 10 datasets have shown the effectiveness of ALLO. Our code and data are available at \url{https://github.com/RUCAIBox/ALLO}.
comment: 14 pages, working in progress
PromptDSI: Prompt-based Rehearsal-free Instance-wise Incremental Learning for Document Retrieval
Differentiable Search Index (DSI) utilizes Pre-trained Language Models (PLMs) for efficient document retrieval without relying on external indexes. However, DSIs need full re-training to handle updates in dynamic corpora, causing significant computational inefficiencies. We introduce PromptDSI, a rehearsal-free, prompt-based approach for instance-wise incremental learning in document retrieval. PromptDSI attaches prompts to the frozen PLM's encoder of DSI, leveraging its powerful representation to efficiently index new corpora while maintaining a balance between stability and plasticity. We eliminate the initial forward pass of prompt-based continual learning methods that doubles training and inference time. Moreover, we propose a topic-aware prompt pool that employs neural topic embeddings as fixed keys. This strategy ensures diverse and effective prompt usage, addressing the challenge of parameter underutilization caused by the collapse of the query-key matching mechanism. Our empirical evaluations demonstrate that PromptDSI matches IncDSI in managing forgetting while significantly enhancing recall by over 4% on new corpora.
comment: 21 pages
☆ Breaking the Ceiling of the LLM Community by Treating Token Generation as a Classification for Ensembling
Ensembling multiple models has always been an effective approach to push the limits of existing performance and is widely used in classification tasks by simply averaging the classification probability vectors from multiple classifiers to achieve better accuracy. However, in the thriving open-source Large Language Model (LLM) community, ensembling methods are rare and typically limited to ensembling the full-text outputs of LLMs, such as selecting the best output using a ranker, which leads to underutilization of token-level probability information. In this paper, we treat the Generation of each token by LLMs as a Classification (GaC) for ensembling. This approach fully exploits the probability information at each generation step and better prevents LLMs from producing early incorrect tokens that lead to snowballing errors. In experiments, we ensemble state-of-the-art LLMs on several benchmarks, including exams, mathematics and reasoning, and observe that our method breaks the existing community performance ceiling. Furthermore, we observed that most of the tokens in the answer are simple and do not affect the correctness of the final answer. Therefore, we also experimented with ensembling only key tokens, and the results showed better performance with lower latency across benchmarks.
☆ Mathador-LM: A Dynamic Benchmark for Mathematical Reasoning on Large Language Models
We introduce Mathador-LM, a new benchmark for evaluating the mathematical reasoning on large language models (LLMs), combining ruleset interpretation, planning, and problem-solving. This benchmark is inspired by the Mathador game, where the objective is to reach a target number using basic arithmetic operations on a given set of base numbers, following a simple set of rules. We show that, across leading LLMs, we obtain stable average performance while generating benchmark instances dynamically, following a target difficulty level. Thus, our benchmark alleviates concerns about test-set leakage into training data, an issue that often undermines popular benchmarks. Additionally, we conduct a comprehensive evaluation of both open and closed-source state-of-the-art LLMs on Mathador-LM. Our findings reveal that contemporary models struggle with Mathador-LM, scoring significantly lower than average 5th graders. This stands in stark contrast to their strong performance on popular mathematical reasoning benchmarks.
☆ Applying Ensemble Methods to Model-Agnostic Machine-Generated Text Detection
In this paper, we study the problem of detecting machine-generated text when the large language model (LLM) it is possibly derived from is unknown. We do so by apply ensembling methods to the outputs from DetectGPT classifiers (Mitchell et al. 2023), a zero-shot model for machine-generated text detection which is highly accurate when the generative (or base) language model is the same as the discriminative (or scoring) language model. We find that simple summary statistics of DetectGPT sub-model outputs yield an AUROC of 0.73 (relative to 0.61) while retaining its zero-shot nature, and that supervised learning methods sharply boost the accuracy to an AUROC of 0.94 but require a training dataset. This suggests the possibility of further generalisation to create a highly-accurate, model-agnostic machine-generated text detector.
☆ RichRAG: Crafting Rich Responses for Multi-faceted Queries in Retrieval-Augmented Generation
Retrieval-augmented generation (RAG) effectively addresses issues of static knowledge and hallucination in large language models. Existing studies mostly focus on question scenarios with clear user intents and concise answers. However, it is prevalent that users issue broad, open-ended queries with diverse sub-intents, for which they desire rich and long-form answers covering multiple relevant aspects. To tackle this important yet underexplored problem, we propose a novel RAG framework, namely RichRAG. It includes a sub-aspect explorer to identify potential sub-aspects of input questions, a multi-faceted retriever to build a candidate pool of diverse external documents related to these sub-aspects, and a generative list-wise ranker, which is a key module to provide the top-k most valuable documents for the final generator. These ranked documents sufficiently cover various query aspects and are aware of the generator's preferences, hence incentivizing it to produce rich and comprehensive responses for users. The training of our ranker involves a supervised fine-tuning stage to ensure the basic coverage of documents, and a reinforcement learning stage to align downstream LLM's preferences to the ranking of documents. Experimental results on two publicly available datasets prove that our framework effectively and efficiently provides comprehensive and satisfying responses to users.
☆ Low-Resource Machine Translation through the Lens of Personalized Federated Learning
We present a new approach based on the Personalized Federated Learning algorithm MeritFed that can be applied to Natural Language Tasks with heterogeneous data. We evaluate it on the Low-Resource Machine Translation task, using the dataset from the Large-Scale Multilingual Machine Translation Shared Task (Small Track #2) and the subset of Sami languages from the multilingual benchmark for Finno-Ugric languages. In addition to its effectiveness, MeritFed is also highly interpretable, as it can be applied to track the impact of each language used for training. Our analysis reveals that target dataset size affects weight distribution across auxiliary languages, that unrelated languages do not interfere with the training, and auxiliary optimizer parameters have minimal impact. Our approach is easy to apply with a few lines of code, and we provide scripts for reproducing the experiments at https://github.com/VityaVitalich/MeritFed
comment: 18 pages, 7 figures
☆ MultiSocial: Multilingual Benchmark of Machine-Generated Text Detection of Social-Media Texts
Recent LLMs are able to generate high-quality multilingual texts, indistinguishable for humans from authentic human-written ones. Research in machine-generated text detection is however mostly focused on the English language and longer texts, such as news articles, scientific papers or student essays. Social-media texts are usually much shorter and often feature informal language, grammatical errors, or distinct linguistic items (e.g., emoticons, hashtags). There is a gap in studying the ability of existing methods in detection of such texts, reflected also in the lack of existing multilingual benchmark datasets. To fill this gap we propose the first multilingual (22 languages) and multi-platform (5 social media platforms) dataset for benchmarking machine-generated text detection in the social-media domain, called MultiSocial. It contains 472,097 texts, of which about 58k are human-written and approximately the same amount is generated by each of 7 multilingual LLMs. We use this benchmark to compare existing detection methods in zero-shot as well as fine-tuned form. Our results indicate that the fine-tuned detectors have no problem to be trained on social-media texts and that the platform selection for training matters.
☆ P-Tailor: Customizing Personality Traits for Language Models via Mixture of Specialized LoRA Experts
Personalized large language models (LLMs) have attracted great attention in many applications, such as intelligent education and emotional support. Most work focuses on controlling the character settings based on the profile (e.g., age, skill, experience, and so on). Conversely, the psychological theory-based personality traits with implicit expression and behavior are not well modeled, limiting their potential application in more specialized fields such as the psychological counseling agents. In this paper, we propose a mixture of experts (MoE)-based personalized LLMs, named P-tailor, to model the Big Five Personality Traits. Particularly, we learn specialized LoRA experts to represent various traits, such as openness, conscientiousness, extraversion, agreeableness and neuroticism. Then, we integrate P-Tailor with a personality specialization loss, promoting experts to specialize in distinct personality traits, thereby enhancing the efficiency of model parameter utilization. Due to the lack of datasets, we also curate a high-quality personality crafting dataset (PCD) to learn and develop the ability to exhibit different personality traits across various topics. We conduct extensive experiments to verify the great performance and effectiveness of P-Tailor in manipulation of the fine-grained personality traits of LLMs.
☆ Liar, Liar, Logical Mire: A Benchmark for Suppositional Reasoning in Large Language Models
Knights and knaves problems represent a classic genre of logical puzzles where characters either tell the truth or lie. The objective is to logically deduce each character's identity based on their statements. The challenge arises from the truth-telling or lying behavior, which influences the logical implications of each statement. Solving these puzzles requires not only direct deductions from individual statements, but the ability to assess the truthfulness of statements by reasoning through various hypothetical scenarios. As such, knights and knaves puzzles serve as compelling examples of suppositional reasoning. In this paper, we introduce $\textit{TruthQuest}$, a benchmark for suppositional reasoning based on the principles of knights and knaves puzzles. Our benchmark presents problems of varying complexity, considering both the number of characters and the types of logical statements involved. Evaluations on $\textit{TruthQuest}$ show that large language models like Llama 3 and Mixtral-8x7B exhibit significant difficulties solving these tasks. A detailed error analysis of the models' output reveals that lower-performing models exhibit a diverse range of reasoning errors, frequently failing to grasp the concept of truth and lies. In comparison, more proficient models primarily struggle with accurately inferring the logical implications of potentially false statements.
comment: 22 pages, 19 figures
☆ Unified Active Retrieval for Retrieval Augmented Generation
In Retrieval-Augmented Generation (RAG), retrieval is not always helpful and applying it to every instruction is sub-optimal. Therefore, determining whether to retrieve is crucial for RAG, which is usually referred to as Active Retrieval. However, existing active retrieval methods face two challenges: 1. They usually rely on a single criterion, which struggles with handling various types of instructions. 2. They depend on specialized and highly differentiated procedures, and thus combining them makes the RAG system more complicated and leads to higher response latency. To address these challenges, we propose Unified Active Retrieval (UAR). UAR contains four orthogonal criteria and casts them into plug-and-play classification tasks, which achieves multifaceted retrieval timing judgements with negligible extra inference cost. We further introduce the Unified Active Retrieval Criteria (UAR-Criteria), designed to process diverse active retrieval scenarios through a standardized procedure. Experiments on four representative types of user instructions show that UAR significantly outperforms existing work on the retrieval timing judgement and the performance of downstream tasks, which shows the effectiveness of UAR and its helpfulness to downstream tasks.
☆ FuseGen: PLM Fusion for Data-generation based Zero-shot Learning
Data generation-based zero-shot learning, although effective in training Small Task-specific Models (STMs) via synthetic datasets generated by Pre-trained Language Models (PLMs), is often limited by the low quality of such synthetic datasets. Previous solutions have primarily focused on single PLM settings, where synthetic datasets are typically restricted to specific sub-spaces and often deviate from real-world distributions, leading to severe distribution bias. To mitigate such bias, we propose FuseGen, a novel data generation-based zero-shot learning framework that introduces a new criteria for subset selection from synthetic datasets via utilizing multiple PLMs and trained STMs. The chosen subset provides in-context feedback to each PLM, enhancing dataset quality through iterative data generation. Trained STMs are then used for sample re-weighting as well, further improving data quality. Extensive experiments across diverse tasks demonstrate that FuseGen substantially outperforms existing methods, highly effective in boosting STM performance in a PLM-agnostic way. Code is provided in https://github.com/LindaLydia/FuseGen.
comment: 17 pages, 8 figures, 12 tabels
☆ Code-Optimise: Self-Generated Preference Data for Correctness and Efficiency EMNLP 2024
Code Language Models have been trained to generate accurate solutions, typically with no regard for runtime. On the other hand, previous works that explored execution optimisation have observed corresponding drops in functional correctness. To that end, we introduce Code-Optimise, a framework that incorporates both correctness (passed, failed) and runtime (quick, slow) as learning signals via self-generated preference data. Our framework is both lightweight and robust as it dynamically selects solutions to reduce overfitting while avoiding a reliance on larger models for learning signals. Code-Optimise achieves significant improvements in pass@k while decreasing the competitive baseline runtimes by an additional 6% for in-domain data and up to 3% for out-of-domain data. As a byproduct, the average length of the generated solutions is reduced by up to 48% on MBPP and 23% on HumanEval, resulting in faster and cheaper inference. The generated data and codebase will be open-sourced at www.open-source.link.
comment: Under review at ARR (for EMNLP 2024)
☆ LightPAL: Lightweight Passage Retrieval for Open Domain Multi-Document Summarization
Open-Domain Multi-Document Summarization (ODMDS) is crucial for addressing diverse information needs, which aims to generate a summary as answer to user's query, synthesizing relevant content from multiple documents in a large collection. Existing approaches that first find relevant passages and then generate a summary using a language model are inadequate for ODMDS. This is because open-ended queries often require additional context for the retrieved passages to cover the topic comprehensively, making it challenging to retrieve all relevant passages initially. While iterative retrieval methods have been explored for multi-hop question answering (MQA), they are impractical for ODMDS due to high latency from repeated large language model (LLM) inference for reasoning. To address this issue, we propose LightPAL, a lightweight passage retrieval method for ODMDS that constructs a graph representing passage relationships using an LLM during indexing and employs random walk instead of iterative reasoning and retrieval at inference time. Experiments on ODMDS benchmarks show that LightPAL outperforms baseline retrievers in summary quality while being significantly more efficient than an iterative MQA approach.
comment: 13 pages, 3 figures
☆ The Power of LLM-Generated Synthetic Data for Stance Detection in Online Political Discussions
Stance detection holds great potential for enhancing the quality of online political discussions, as it has shown to be useful for summarizing discussions, detecting misinformation, and evaluating opinion distributions. Usually, transformer-based models are used directly for stance detection, which require large amounts of data. However, the broad range of debate questions in online political discussion creates a variety of possible scenarios that the model is faced with and thus makes data acquisition for model training difficult. In this work, we show how to leverage LLM-generated synthetic data to train and improve stance detection agents for online political discussions:(i) We generate synthetic data for specific debate questions by prompting a Mistral-7B model and show that fine-tuning with the generated synthetic data can substantially improve the performance of stance detection. (ii) We examine the impact of combining synthetic data with the most informative samples from an unlabelled dataset. First, we use the synthetic data to select the most informative samples, second, we combine both these samples and the synthetic data for fine-tuning. This approach reduces labelling effort and consistently surpasses the performance of the baseline model that is trained with fully labeled data. Overall, we show in comprehensive experiments that LLM-generated data greatly improves stance detection performance for online political discussions.
☆ Exploring Intra and Inter-language Consistency in Embeddings with ICA
Word embeddings represent words as multidimensional real vectors, facilitating data analysis and processing, but are often challenging to interpret. Independent Component Analysis (ICA) creates clearer semantic axes by identifying independent key features. Previous research has shown ICA's potential to reveal universal semantic axes across languages. However, it lacked verification of the consistency of independent components within and across languages. We investigated the consistency of semantic axes in two ways: both within a single language and across multiple languages. We first probed into intra-language consistency, focusing on the reproducibility of axes by performing ICA multiple times and clustering the outcomes. Then, we statistically examined inter-language consistency by verifying those axes' correspondences using statistical tests. We newly applied statistical methods to establish a robust framework that ensures the reliability and universality of semantic axes.
☆ Fighting Randomness with Randomness: Mitigating Optimisation Instability of Fine-Tuning using Delayed Ensemble and Noisy Interpolation
While fine-tuning of pre-trained language models generally helps to overcome the lack of labelled training samples, it also displays model performance instability. This instability mainly originates from randomness in initialisation or data shuffling. To address this, researchers either modify the training process or augment the available samples, which typically results in increased computational costs. We propose a new mitigation strategy, called Delayed Ensemble with Noisy Interpolation (DENI), that leverages the strengths of ensembling, noise regularisation and model interpolation, while retaining computational efficiency. We compare DENI with 9 representative mitigation strategies across 3 models, 4 tuning strategies and 7 text classification datasets. We show that: 1) DENI outperforms the best performing mitigation strategy (Ensemble), while using only a fraction of its cost; 2) the mitigation strategies are beneficial for parameter-efficient fine-tuning (PEFT) methods, outperforming full fine-tuning in specific cases; and 3) combining DENI with data augmentation often leads to even more effective instability mitigation.
☆ Adaptive Token Biaser: Knowledge Editing via Biasing Key Entities
The parametric knowledge memorized by large language models (LLMs) becomes outdated quickly. In-context editing (ICE) is currently the most effective method for updating the knowledge of LLMs. Recent advancements involve enhancing ICE by modifying the decoding strategy, obviating the need for altering internal model structures or adjusting external prompts. However, this enhancement operates across the entire sequence generation, encompassing a plethora of non-critical tokens. In this work, we introduce $\textbf{A}$daptive $\textbf{T}$oken $\textbf{Bias}$er ($\textbf{ATBias}$), a new decoding technique designed to enhance ICE. It focuses on the tokens that are mostly related to knowledge during decoding, biasing their logits by matching key entities related to new and parametric knowledge. Experimental results show that ATBias significantly enhances ICE performance, achieving up to a 32.3% improvement over state-of-the-art ICE methods while incurring only half the latency. ATBias not only improves the knowledge editing capabilities of ICE but can also be widely applied to LLMs with negligible cost.
☆ Abstraction-of-Thought Makes Language Models Better Reasoners
Abstract reasoning, the ability to reason from the abstract essence of a problem, serves as a key to generalization in human reasoning. However, eliciting language models to perform reasoning with abstraction remains unexplored. This paper seeks to bridge this gap by introducing a novel structured reasoning format called Abstraction-of-Thought (AoT). The uniqueness of AoT lies in its explicit requirement for varying levels of abstraction within the reasoning process. This approach could elicit language models to first contemplate on the abstract level before incorporating concrete details, which is overlooked by the prevailing step-by-step Chain-of-Thought (CoT) method. To align models with the AoT format, we present AoT Collection, a generic finetuning dataset consisting of 348k high-quality samples with AoT reasoning processes, collected via an automated and scalable pipeline. We finetune a wide range of language models with AoT Collection and conduct extensive evaluations on 23 unseen tasks from the challenging benchmark Big-Bench Hard. Experimental results indicate that models aligned to AoT reasoning format substantially outperform those aligned to CoT in many reasoning tasks.
comment: Work in Process
☆ PlanRAG: A Plan-then-Retrieval Augmented Generation for Generative Large Language Models as Decision Makers NAACL 2024
In this paper, we conduct a study to utilize LLMs as a solution for decision making that requires complex data analysis. We define Decision QA as the task of answering the best decision, $d_{best}$, for a decision-making question $Q$, business rules $R$ and a database $D$. Since there is no benchmark that can examine Decision QA, we propose Decision QA benchmark, DQA. It has two scenarios, Locating and Building, constructed from two video games (Europa Universalis IV and Victoria 3) that have almost the same goal as Decision QA. To address Decision QA effectively, we also propose a new RAG technique called the iterative plan-then-retrieval augmented generation (PlanRAG). Our PlanRAG-based LM generates the plan for decision making as the first step, and the retriever generates the queries for data analysis as the second step. The proposed method outperforms the state-of-the-art iterative RAG method by 15.8% in the Locating scenario and by 7.4% in the Building scenario, respectively. We release our code and benchmark at https://github.com/myeon9h/PlanRAG.
comment: NAACL 2024
☆ PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems
Multimodal language models that process both text and speech have a potential for applications in spoken dialogue systems. However, current models face two major challenges in response generation latency: (1) generating a spoken response requires the prior generation of a written response, and (2) speech sequences are significantly longer than text sequences. This study addresses these issues by extending the input and output sequences of the language model to support the parallel generation of text and speech. Our experiments on spoken question answering tasks demonstrate that our approach improves latency while maintaining the quality of response content. Additionally, we show that latency can be further reduced by generating speech in multiple sequences. Demo samples are available at https://rinnakk.github.io/research/publications/PSLM.
comment: 8 pages, 4 figures, 4 tables, demo samples: https://rinnakk.github.io/research/publications/PSLM
☆ Open-Source Web Service with Morphological Dictionary-Supplemented Deep Learning for Morphosyntactic Analysis of Czech
We present an open-source web service for Czech morphosyntactic analysis. The system combines a deep learning model with rescoring by a high-precision morphological dictionary at inference time. We show that our hybrid method surpasses two competitive baselines: While the deep learning model ensures generalization for out-of-vocabulary words and better disambiguation, an improvement over an existing morphological analyser MorphoDiTa, at the same time, the deep learning model benefits from inference-time guidance of a manually curated morphological dictionary. We achieve 50% error reduction in lemmatization and 58% error reduction in POS tagging over MorphoDiTa, while also offering dependency parsing. The model is trained on one of the currently largest Czech morphosyntactic corpora, the PDT-C 1.0, with the trained models available at https://hdl.handle.net/11234/1-5293. We provide the tool as a web service deployed at https://lindat.mff.cuni.cz/services/udpipe/. The source code is available at GitHub (https://github.com/ufal/udpipe/tree/udpipe-2), along with a Python client for a simple use. The documentation for the models can be found at https://ufal.mff.cuni.cz/udpipe/2/models#czech_pdtc1.0_model.
comment: Accepted to TSD 2024
☆ MMUTF: Multimodal Multimedia Event Argument Extraction with Unified Template Filling
With the advancement of multimedia technologies, news documents and user-generated content are often represented as multiple modalities, making Multimedia Event Extraction (MEE) an increasingly important challenge. However, recent MEE methods employ weak alignment strategies and data augmentation with simple classification models, which ignore the capabilities of natural language-formulated event templates for the challenging Event Argument Extraction (EAE) task. In this work, we focus on EAE and address this issue by introducing a unified template filling model that connects the textual and visual modalities via textual prompts. This approach enables the exploitation of cross-ontology transfer and the incorporation of event-specific semantics. Experiments on the M2E2 benchmark demonstrate the effectiveness of our approach. Our system surpasses the current SOTA on textual EAE by +7% F1, and performs generally better than the second-best systems for multimedia EAE.
☆ AI-Assisted Human Evaluation of Machine Translation
Annually, research teams spend large amounts of money to evaluate the quality of machine translation systems (WMT, inter alia). This is expensive because it requires detailed human labor. The recently proposed annotation protocol, Error Span Annotation (ESA), has annotators marking erroneous parts of the translation. In our work, we help the annotators by pre-filling the span annotations with automatic quality estimation. With AI assistance, we obtain more detailed annotations while cutting down the time per span annotation by half (71s/error span $\rightarrow$ 31s/error span). The biggest advantage of ESA$^\mathrm{AI}$ protocol is an accurate priming of annotators (pre-filled error spans) before they assign the final score as opposed to starting from scratch. In addition, the annotation budget can be reduced by up to 24% with filtering of examples that the AI deems to be very likely to be correct.
☆ Beyond Under-Alignment: Atomic Preference Enhanced Factuality Tuning for Large Language Models
Large language models (LLMs) have achieved remarkable success but still tend to generate factually erroneous responses, a phenomenon known as hallucination. A recent trend is to use preference learning to fine-tune models to align with factuality. However, existing work primarily evaluates fine-tuned models on in-domain (ID) datasets and the factuality on out-of-domain (OOD) datasets remains underexplored. In this paper, we conduct a comprehensive evaluation of the factuality of different models tuned by various preference learning algorithms and demonstrate that their performance on OOD datasets either increases minimally or decreases. Subsequently, we reveal that the main cause of model's failure to uphold factuality under a distribution shift is \textbf{under-alignment}, rather than \textbf{over-alignment}, by analyzing the token distribution shift of the models before and after tuning. Finally, we propose \textbf{APEFT} (\textbf{A}tomic \textbf{P}reference \textbf{E}nhanced \textbf{F}actuality \textbf{T}uning), a framework that enhances model's awareness of factuality at the granularity of individual facts. Extensive experiments demonstrate that APEFT improves model performance by an average of $\boldsymbol{3.45\%}$ on both ID and OOD datasets, which is highly effective.
☆ PDSS: A Privacy-Preserving Framework for Step-by-Step Distillation of Large Language Models
In the context of real-world applications, leveraging large language models (LLMs) for domain-specific tasks often faces two major challenges: domain-specific knowledge privacy and constrained resources. To address these issues, we propose PDSS, a privacy-preserving framework for step-by-step distillation of LLMs. PDSS works on a server-client architecture, wherein client transmits perturbed prompts to the server's LLM for rationale generation. The generated rationales are then decoded by the client and used to enrich the training of task-specific small language model(SLM) within a multi-task learning paradigm. PDSS introduces two privacy protection strategies: the Exponential Mechanism Strategy and the Encoder-Decoder Strategy, balancing prompt privacy and rationale usability. Experiments demonstrate the effectiveness of PDSS in various text generation tasks, enabling the training of task-specific SLM with enhanced performance while prioritizing data privacy protection.
☆ Flee the Flaw: Annotating the Underlying Logic of Fallacious Arguments Through Templates and Slot-filling
Prior research in computational argumentation has mainly focused on scoring the quality of arguments, with less attention on explicating logical errors. In this work, we introduce four sets of explainable templates for common informal logical fallacies designed to explicate a fallacy's implicit logic. Using our templates, we conduct an annotation study on top of 400 fallacious arguments taken from LOGIC dataset and achieve a high agreement score (Krippendorf's alpha of 0.54) and reasonable coverage (0.83). Finally, we conduct an experiment for detecting the structure of fallacies and discover that state-of-the-art language models struggle with detecting fallacy templates (0.47 accuracy). To facilitate research on fallacies, we make our dataset and guidelines publicly available.
☆ QueerBench: Quantifying Discrimination in Language Models Toward Queer Identities
With the increasing role of Natural Language Processing (NLP) in various applications, challenges concerning bias and stereotype perpetuation are accentuated, which often leads to hate speech and harm. Despite existing studies on sexism and misogyny, issues like homophobia and transphobia remain underexplored and often adopt binary perspectives, putting the safety of LGBTQIA+ individuals at high risk in online spaces. In this paper, we assess the potential harm caused by sentence completions generated by English large language models (LLMs) concerning LGBTQIA+ individuals. This is achieved using QueerBench, our new assessment framework, which employs a template-based approach and a Masked Language Modeling (MLM) task. The analysis indicates that large language models tend to exhibit discriminatory behaviour more frequently towards individuals within the LGBTQIA+ community, reaching a difference gap of 7.2% in the QueerBench score of harmfulness.
☆ Unveiling the Flaws: Exploring Imperfections in Synthetic Data and Mitigation Strategies for Large Language Models
Synthetic data has been proposed as a solution to address the issue of high-quality data scarcity in the training of large language models (LLMs). Studies have shown that synthetic data can effectively improve the performance of LLMs on downstream benchmarks. However, despite its potential benefits, our analysis suggests that there may be inherent flaws in synthetic data. The uniform format of synthetic data can lead to pattern overfitting and cause significant shifts in the output distribution, thereby reducing the model's instruction-following capabilities. Our work delves into these specific flaws associated with question-answer (Q-A) pairs, a prevalent type of synthetic data, and presents a method based on unlearning techniques to mitigate these flaws. The empirical results demonstrate the effectiveness of our approach, which can reverse the instruction-following issues caused by pattern overfitting without compromising performance on benchmarks at relatively low cost. Our work has yielded key insights into the effective use of synthetic data, aiming to promote more robust and efficient LLM training.
comment: 15 pages
☆ EMO-KNOW: A Large Scale Dataset on Emotion and Emotion-cause EMNLP 2023
Emotion-Cause analysis has attracted the attention of researchers in recent years. However, most existing datasets are limited in size and number of emotion categories. They often focus on extracting parts of the document that contain the emotion cause and fail to provide more abstractive, generalizable root cause. To bridge this gap, we introduce a large-scale dataset of emotion causes, derived from 9.8 million cleaned tweets over 15 years. We describe our curation process, which includes a comprehensive pipeline for data gathering, cleaning, labeling, and validation, ensuring the dataset's reliability and richness. We extract emotion labels and provide abstractive summarization of the events causing emotions. The final dataset comprises over 700,000 tweets with corresponding emotion-cause pairs spanning 48 emotion classes, validated by human evaluators. The novelty of our dataset stems from its broad spectrum of emotion classes and the abstractive emotion cause that facilitates the development of an emotion-cause knowledge graph for nuanced reasoning. Our dataset will enable the design of emotion-aware systems that account for the diverse emotional responses of different people for the same event.
comment: Accepted to Findings of EMNLP 2023
☆ Performant ASR Models for Medical Entities in Accented Speech
Recent strides in automatic speech recognition (ASR) have accelerated their application in the medical domain where their performance on accented medical named entities (NE) such as drug names, diagnoses, and lab results, is largely unknown. We rigorously evaluate multiple ASR models on a clinical English dataset of 93 African accents. Our analysis reveals that despite some models achieving low overall word error rates (WER), errors in clinical entities are higher, potentially posing substantial risks to patient safety. To empirically demonstrate this, we extract clinical entities from transcripts, develop a novel algorithm to align ASR predictions with these entities, and compute medical NE Recall, medical WER, and character error rate. Our results show that fine-tuning on accented clinical speech improves medical WER by a wide margin (25-34 % relative), improving their practical applicability in healthcare environments.
comment: Accepted at Interspeech 2024
☆ IPEval: A Bilingual Intellectual Property Agency Consultation Evaluation Benchmark for Large Language Models
The rapid development of Large Language Models (LLMs) in vertical domains, including intellectual property (IP), lacks a specific evaluation benchmark for assessing their understanding, application, and reasoning abilities. To fill this gap, we introduce IPEval, the first evaluation benchmark tailored for IP agency and consulting tasks. IPEval comprises 2657 multiple-choice questions across four major dimensions: creation, application, protection, and management of IP. These questions span patent rights (inventions, utility models, designs), trademarks, copyrights, trade secrets, and other related laws. Evaluation methods include zero-shot, 5-few-shot, and Chain of Thought (CoT) for seven LLM types, predominantly in English or Chinese. Results show superior English performance by models like GPT series and Qwen series, while Chinese-centric LLMs excel in Chinese tests, albeit specialized IP LLMs lag behind general-purpose ones. Regional and temporal aspects of IP underscore the need for LLMs to grasp legal nuances and evolving laws. IPEval aims to accurately gauge LLM capabilities in IP and spur development of specialized models. Website: \url{https://ipeval.github.io/}
☆ From Instance Training to Instruction Learning: Task Adapters Generation from Instructions
Large language models (LLMs) have acquired the ability to solve general tasks by utilizing instruction finetuning (IFT). However, IFT still relies heavily on instance training of extensive task data, which greatly limits the adaptability of LLMs to real-world scenarios where labeled task instances are scarce and broader task generalization becomes paramount. Contrary to LLMs, humans acquire skills and complete tasks not merely through repeated practice but also by understanding and following instructional guidelines. This paper is dedicated to simulating human learning to address the shortcomings of instance training, focusing on instruction learning to enhance cross-task generalization. Within this context, we introduce Task Adapters Generation from Instructions (TAGI), which automatically constructs the task-specific model in a parameter generation manner based on the given task instructions without retraining for unseen tasks. Specifically, we utilize knowledge distillation to enhance the consistency between TAGI developed through Learning with Instruction and task-specific models developed through Training with Instance, by aligning the labels, output logits, and adapter parameters between them. TAGI is endowed with cross-task generalization capabilities through a two-stage training process that includes hypernetwork pretraining and finetuning. We evaluate TAGI on the Super-Natural Instructions and P3 datasets. The experimental results demonstrate that TAGI can match or even outperform traditional meta-trained models and other hypernetwork models, while significantly reducing computational requirements.
☆ QOG:Question and Options Generation based on Language Model
Question-Options Generation (QOG) is a task that involves generating a set of question-options pairs given context. This task has various applications, including fine-tuning large models, information retrieval, and automated multiple-choice question generation for education. In this paper, we develop QOG models using three different methods based on fine-tuning sequence-to-sequence language models (LMs). Experiments demonstrate that the end-to-end QOG model is computationally efficient and stable during both training and inference, outperforming other methods. Furthermore, our analysis indicates that our QOG models are competitive on the QOG task compared to the large language model Llama 3-8B.
comment: 8 pages, 3 figures, 4 tables
☆ WebCanvas: Benchmarking Web Agents in Online Environments
For web agents to be practically useful, they must adapt to the continuously evolving web environment characterized by frequent updates to user interfaces and content. However, most existing benchmarks only capture the static aspects of the web. To bridge this gap, we introduce WebCanvas, an innovative online evaluation framework for web agents that effectively addresses the dynamic nature of web interactions. WebCanvas contains three main components to facilitate realistic assessments: (1) A novel evaluation metric which reliably capture critical intermediate actions or states necessary for task completions while disregarding noise caused by insignificant events or changed web-elements. (2) A benchmark dataset called Mind2Web-Live, a refined version of original Mind2Web static dataset containing 542 tasks with 2439 intermediate evaluation states; (3) Lightweight and generalizable annotation tools and testing pipelines that enables the community to collect and maintain the high-quality, up-to-date dataset. Building on WebCanvas, we open-source an agent framework with extensible modules for reasoning, providing a foundation for the community to conduct online inference and evaluations. Our best-performing agent achieves a task success rate of 23.1% and a task completion rate of 48.8% on the Mind2Web-Live test set. Additionally, we analyze the performance discrepancies across various websites, domains, and experimental environments. We encourage the community to contribute further insights on online agent evaluation, thereby advancing this field of research.
comment: Our platform, tool and dataset are publically available at https://www.imean.ai/web-canvas/ and https://huggingface.co/datasets/iMeanAI/Mind2Web-Live/
☆ A Comparative Study of Continuous Sign Language Recognition Techniques
Continuous Sign Language Recognition (CSLR) focuses on the interpretation of a sequence of sign language gestures performed continually without pauses. In this study, we conduct an empirical evaluation of recent deep learning CSLR techniques and assess their performance across various datasets and sign languages. The models selected for analysis implement a range of approaches for extracting meaningful features and employ distinct training strategies. To determine their efficacy in modeling different sign languages, these models were evaluated using multiple datasets, specifically RWTH-PHOENIX-Weather-2014, ArabSign, and GrSL, each representing a unique sign language. The performance of the models was further tested with unseen signers and sentences. The conducted experiments establish new benchmarks on the selected datasets and provide valuable insights into the robustness and generalization of the evaluated techniques under challenging scenarios.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
☆ Does Context Help Mitigate Gender Bias in Neural Machine Translation?
Neural Machine Translation models tend to perpetuate gender bias present in their training data distribution. Context-aware models have been previously suggested as a means to mitigate this type of bias. In this work, we examine this claim by analysing in detail the translation of stereotypical professions in English to German, and translation with non-informative context in Basque to Spanish. Our results show that, although context-aware models can significantly enhance translation accuracy for feminine terms, they can still maintain or even amplify gender bias. These results highlight the need for more fine-grained approaches to bias mitigation in Neural Machine Translation.
☆ Cross-Lingual Unlearning of Selective Knowledge in Multilingual Language Models
Pretrained language models memorize vast amounts of information, including private and copyrighted data, raising significant safety concerns. Retraining these models after excluding sensitive data is prohibitively expensive, making machine unlearning a viable, cost-effective alternative. Previous research has focused on machine unlearning for monolingual models, but we find that unlearning in one language does not necessarily transfer to others. This vulnerability makes models susceptible to low-resource language attacks, where sensitive information remains accessible in less dominant languages. This paper presents a pioneering approach to machine unlearning for multilingual language models, selectively erasing information across different languages while maintaining overall performance. Specifically, our method employs an adaptive unlearning scheme that assigns language-dependent weights to address different language performances of multilingual language models. Empirical results demonstrate the effectiveness of our framework compared to existing unlearning baselines, setting a new standard for secure and adaptable multilingual language models.
comment: 15 pages, 5 figures
☆ Interpreting Bias in Large Language Models: A Feature-Based Approach
Large Language Models (LLMs) such as Mistral and LLaMA have showcased remarkable performance across various natural language processing (NLP) tasks. Despite their success, these models inherit social biases from the diverse datasets on which they are trained. This paper investigates the propagation of biases within LLMs through a novel feature-based analytical approach. Drawing inspiration from causal mediation analysis, we hypothesize the evolution of bias-related features and validate them using interpretability techniques like activation and attribution patching. Our contributions are threefold: (1) We introduce and empirically validate a feature-based method for bias analysis in LLMs, applied to LLaMA-2-7B, LLaMA-3-8B, and Mistral-7B-v0.3 with templates from a professions dataset. (2) We extend our method to another form of gender bias, demonstrating its generalizability. (3) We differentiate the roles of MLPs and attention heads in bias propagation and implement targeted debiasing using a counterfactual dataset. Our findings reveal the complex nature of bias in LLMs and emphasize the necessity for tailored debiasing strategies, offering a deeper understanding of bias mechanisms and pathways for effective mitigation.
☆ A Compass for Navigating the World of Sentence Embeddings for the Telecom Domain
A plethora of sentence embedding models makes it challenging to choose one, especially for domains such as telecom, rich with specialized vocabulary. We evaluate multiple embeddings obtained from publicly available models and their domain-adapted variants, on both point retrieval accuracies as well as their (95\%) confidence intervals. We establish a systematic method to obtain thresholds for similarity scores for different embeddings. We observe that fine-tuning improves mean bootstrapped accuracies as well as tightens confidence intervals. The pre-training combined with fine-tuning makes confidence intervals even tighter. To understand these variations, we analyse and report significant correlations between the distributional overlap between top-$K$, correct and random sentence similarities with retrieval accuracies and similarity thresholds. Following current literature, we analyze if retrieval accuracy variations can be attributed to isotropy of embeddings. Our conclusions are that isotropy of embeddings (as measured by two independent state-of-the-art isotropy metric definitions) cannot be attributed to better retrieval performance. However, domain adaptation which improves retrieval accuracies also improves isotropy. We establish that domain adaptation moves domain specific embeddings further away from general domain embeddings.
comment: 10 pages, 3 figures, 4 tables
☆ Attention Score is not All You Need for Token Importance Indicator in KV Cache Reduction: Value Also Matters
Scaling the context size of large language models (LLMs) enables them to perform various new tasks, e.g., book summarization. However, the memory cost of the Key and Value (KV) cache in attention significantly limits the practical applications of LLMs. Recent works have explored token pruning for KV cache reduction in LLMs, relying solely on attention scores as a token importance indicator. However, our investigation into value vector norms revealed a notably non-uniform pattern questioning their reliance only on attention scores. Inspired by this, we propose a new method: Value-Aware Token Pruning (VATP) which uses both attention scores and the $ \ell_{1} $ norm of value vectors to evaluate token importance. Extensive experiments on LLaMA2-7B-chat and Vicuna-v1.5-7B across 16 LongBench tasks demonstrate VATP's superior performance.
☆ Retrieval Meets Reasoning: Dynamic In-Context Editing for Long-Text Understanding
Current Large Language Models (LLMs) face inherent limitations due to their pre-defined context lengths, which impede their capacity for multi-hop reasoning within extensive textual contexts. While existing techniques like Retrieval-Augmented Generation (RAG) have attempted to bridge this gap by sourcing external information, they fall short when direct answers are not readily available. We introduce a novel approach that re-imagines information retrieval through dynamic in-context editing, inspired by recent breakthroughs in knowledge editing. By treating lengthy contexts as malleable external knowledge, our method interactively gathers and integrates relevant information, thereby enabling LLMs to perform sophisticated reasoning steps. Experimental results demonstrate that our method effectively empowers context-limited LLMs, such as Llama2, to engage in multi-hop reasoning with improved performance, which outperforms state-of-the-art context window extrapolation methods and even compares favorably to more advanced commercial long-context models. Our interactive method not only enhances reasoning capabilities but also mitigates the associated training and computational costs, making it a pragmatic solution for enhancing LLMs' reasoning within expansive contexts.
☆ SNAP: Unlearning Selective Knowledge in Large Language Models with Negative Instructions
Instruction-following large language models (LLMs), such as ChatGPT, have become increasingly popular with the general audience, many of whom are incorporating them into their daily routines. However, these LLMs inadvertently disclose personal or copyrighted information, which calls for a machine unlearning method to remove selective knowledge. Previous attempts sought to forget the link between the target information and its associated entities, but it rather led to generating undesirable responses about the target, compromising the end-user experience. In this work, we propose SNAP, an innovative framework designed to selectively unlearn information by 1) training an LLM with negative instructions to generate obliterated responses, 2) augmenting hard positives to retain the original LLM performance, and 3) applying the novel Wasserstein regularization to ensure adequate deviation from the initial weights of the LLM. We evaluate our framework on various NLP benchmarks and demonstrate that our approach retains the original LLM capabilities, while successfully unlearning the specified information.
comment: 16 pages, 5 figures
☆ Automatic benchmarking of large multimodal models via iterative experiment programming
Assessing the capabilities of large multimodal models (LMMs) often requires the creation of ad-hoc evaluations. Currently, building new benchmarks requires tremendous amounts of manual work for each specific analysis. This makes the evaluation process tedious and costly. In this paper, we present APEx, Automatic Programming of Experiments, the first framework for automatic benchmarking of LMMs. Given a research question expressed in natural language, APEx leverages a large language model (LLM) and a library of pre-specified tools to generate a set of experiments for the model at hand, and progressively compile a scientific report. The report drives the testing procedure: based on the current status of the investigation, APEx chooses which experiments to perform and whether the results are sufficient to draw conclusions. Finally, the LLM refines the report, presenting the results to the user in natural language. Thanks to its modularity, our framework is flexible and extensible as new tools become available. Empirically, APEx reproduces the findings of existing studies while allowing for arbitrary analyses and hypothesis testing.
comment: 31 pages, 6 figures, code is available at https://github.com/altndrr/apex
☆ PRePair: Pointwise Reasoning Enhance Pairwise Evaluating for Robust Instruction-Following Assessments
Pairwise evaluation using large language models (LLMs) is widely used for evaluating natural language generation (NLG) tasks. However, the reliability of LLMs is often compromised by biases, such as favoring verbosity and authoritative tone. In the study, we focus on the comparison of two LLM-based evaluation approaches, pointwise and pairwise. Our findings demonstrate that pointwise evaluators exhibit more robustness against undesirable preferences. Further analysis reveals that pairwise evaluators can accurately identify the shortcomings of low-quality outputs even when their judgment is incorrect. These results indicate that LLMs are more severely influenced by their bias in a pairwise evaluation setup. To mitigate this, we propose a hybrid method that integrates pointwise reasoning into pairwise evaluation. Experimental results show that our method enhances the robustness of pairwise evaluators against adversarial samples while preserving accuracy on normal samples.
☆ Finding Task-specific Subnetworks in Multi-task Spoken Language Understanding Model
Recently, multi-task spoken language understanding (SLU) models have emerged, designed to address various speech processing tasks. However, these models often rely on a large number of parameters. Also, they often encounter difficulties in adapting to new data for a specific task without experiencing catastrophic forgetting of previously trained tasks. In this study, we propose finding task-specific subnetworks within a multi-task SLU model via neural network pruning. In addition to model compression, we expect that the forgetting of previously trained tasks can be mitigated by updating only a task-specific subnetwork. We conduct experiments on top of the state-of-the-art multi-task SLU model ``UniverSLU'', trained for several tasks such as emotion recognition (ER), intent classification (IC), and automatic speech recognition (ASR). We show that pruned models were successful in adapting to additional ASR or IC data with minimal performance degradation on previously trained tasks.
comment: Accepted to Interspeech2024
☆ Can Tool-augmented Large Language Models be Aware of Incomplete Conditions?
Recent advancements in integrating large language models (LLMs) with tools have allowed the models to interact with real-world environments. However, these tool-augmented LLMs often encounter incomplete scenarios when users provide partial information or the necessary tools are unavailable. Recognizing and managing such scenarios is crucial for LLMs to ensure their reliability, but this exploration remains understudied. This study examines whether LLMs can identify incomplete conditions and appropriately determine when to refrain from using tools. To this end, we address a dataset by manipulating instances from two datasets by removing necessary tools or essential information for tool invocation. We confirm that most LLMs are challenged to identify the additional information required to utilize specific tools and the absence of appropriate tools. Our research can contribute to advancing reliable LLMs by addressing scenarios that commonly arise during interactions between humans and LLMs.
☆ COT: A Generative Approach for Hate Speech Counter-Narratives via Contrastive Optimal Transport
Counter-narratives, which are direct responses consisting of non-aggressive fact-based arguments, have emerged as a highly effective approach to combat the proliferation of hate speech. Previous methodologies have primarily focused on fine-tuning and post-editing techniques to ensure the fluency of generated contents, while overlooking the critical aspects of individualization and relevance concerning the specific hatred targets, such as LGBT groups, immigrants, etc. This research paper introduces a novel framework based on contrastive optimal transport, which effectively addresses the challenges of maintaining target interaction and promoting diversification in generating counter-narratives. Firstly, an Optimal Transport Kernel (OTK) module is leveraged to incorporate hatred target information in the token representations, in which the comparison pairs are extracted between original and transported features. Secondly, a self-contrastive learning module is employed to address the issue of model degeneration. This module achieves this by generating an anisotropic distribution of token representations. Finally, a target-oriented search method is integrated as an improved decoding strategy to explicitly promote domain relevance and diversification in the inference process. This strategy modifies the model's confidence score by considering both token similarity and target relevance. Quantitative and qualitative experiments have been evaluated on two benchmark datasets, which demonstrate that our proposed model significantly outperforms current methods evaluated by metrics from multiple aspects.
comment: IEEE jounrnals
☆ Fast and Slow Generating: An Empirical Study on Large and Small Language Models Collaborative Decoding
Large Language Models (LLMs) demonstrate impressive performance in diverse applications, yet they face significant drawbacks, including high inference latency, expensive training cost, and generation of hallucination. Collaborative decoding between large and small language models (SLMs) offers a novel approach to address these challenges. Inspired by dual-process cognitive theory, we integrate these methods into a unified framework termed Fast and Slow Generating (FS-GEN). This paper explores several techniques within the FS-GEN framework, including speculative decoding, contrastive decoding, and emulator or proxy fine-tuning. We provide a comprehensive analysis of these methodologies, offering insights into their similarities and differences under this framework. Our study delves into the differential knowledge capabilities of LLMs versus SLMs through the FS-GEN lens, revealing that fewer than 20% of collaborative interactions are required across various methods. These interactions adhere to a scaling law relative to the parameter ratios, thereby facilitating predictable collaboration. Furthermore, we investigate the specific positions where collaboration is most effective from an uncertainty perspective, yielding novel insights that could refine FS-GEN methods. Our findings reveal that the essential difference between models of different sizes lies in the uncertainty of the next token prediction, where interventions by larger models are most needed to assist the smaller ones. Code for Reproduction: https://github.com/TsinghuaC3I/FS-GEN
♻ ☆ OccamLLM: Fast and Exact Language Model Arithmetic in a Single Step
Despite significant advancements in text generation and reasoning, Large Language Models (LLMs) still face challenges in accurately performing complex arithmetic operations. To achieve accurate calculations, language model systems often enable LLMs to generate code for arithmetic operations. However, this approach compromises speed and security and, if finetuning is involved, risks the language model losing prior capabilities. We propose a framework that enables exact arithmetic in \textit{a single autoregressive step}, providing faster, more secure, and more interpretable LLM systems with arithmetic capabilities. We use the hidden states of an LLM to control a symbolic architecture which performs arithmetic. Our implementation using Llama 3 8B Instruct with OccamNet as a symbolic model (OccamLlama) achieves 100\% accuracy on single arithmetic operations ($+,-,\times,\div,\sin{},\cos{},\log{},\exp{},\sqrt{}$), outperforming GPT 4o and on par with GPT 4o using a code interpreter. OccamLlama also outperforms GPT 4o both with and without a code interpreter on mathematical problem solving benchmarks involving challenging arithmetic, thus enabling small LLMs to match the arithmetic performance of even much larger models. We will make our code public shortly.
♻ ☆ Scalable MatMul-free Language Modeling
Matrix multiplication (MatMul) typically dominates the overall computational cost of large language models (LLMs). This cost only grows as LLMs scale to larger embedding dimensions and context lengths. In this work, we show that MatMul operations can be completely eliminated from LLMs while maintaining strong performance at billion-parameter scales. Our experiments show that our proposed MatMul-free models achieve performance on-par with state-of-the-art Transformers that require far more memory during inference at a scale up to at least 2.7B parameters. We investigate the scaling laws and find that the performance gap between our MatMul-free models and full precision Transformers narrows as the model size increases. We also provide a GPU-efficient implementation of this model which reduces memory usage by up to 61% over an unoptimized baseline during training. By utilizing an optimized kernel during inference, our model's memory consumption can be reduced by more than 10x compared to unoptimized models. To properly quantify the efficiency of our architecture, we build a custom hardware solution on an FPGA which exploits lightweight operations beyond what GPUs are capable of. We processed billion-parameter scale models at 13W beyond human readable throughput, moving LLMs closer to brain-like efficiency. This work not only shows how far LLMs can be stripped back while still performing effectively, but also points at the types of operations future accelerators should be optimized for in processing the next generation of lightweight LLMs. Our code implementation is available at https://github.com/ridgerchu/matmulfreellm.
♻ ☆ Let Me Teach You: Pedagogical Foundations of Feedback for Language Models
Natural Language Feedback (NLF) is an increasingly popular mechanism for aligning Large Language Models (LLMs) to human preferences. Despite the diversity of the information it can convey, NLF methods are often hand-designed and arbitrary, with little systematic grounding. At the same time, research in learning sciences has long established several effective feedback models. In this opinion piece, we compile ideas from pedagogy to introduce FELT, a feedback framework for LLMs that outlines various characteristics of the feedback space, and a feedback content taxonomy based on these variables, providing a general mapping of the feedback space. In addition to streamlining NLF designs, FELT also brings out new, unexplored directions for research in NLF. We make our taxonomy available to the community, providing guides and examples for mapping our categorizations to future research.
comment: 8 pages, 2 figures
♻ ☆ Getting More from Less: Large Language Models are Good Spontaneous Multilingual Learners
Recently, Large Language Models (LLMs) have shown impressive language capabilities. While most of the existing LLMs have very unbalanced performance across different languages, multilingual alignment based on translation parallel data is an effective method to enhance the LLMs' multilingual capabilities. In this work, we discover and comprehensively investigate the spontaneous multilingual alignment improvement of LLMs. We find that LLMs instruction-tuned on the question translation data (i.e. without annotated answers) are able to encourage the alignment between English and a wide range of languages, even including those unseen during instruction-tuning. Additionally, we utilize different settings and mechanistic interpretability methods to analyze the LLM's performance in the multilingual scenario comprehensively. Our work suggests that LLMs have enormous potential for improving multilingual alignment efficiently with great language and task generalization.
♻ ☆ DataComp-LM: In search of the next generation of training sets for language models
We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set. The resulting dataset, DCLM-Baseline enables training a 7B parameter language model from scratch to 64% 5-shot accuracy on MMLU with 2.6T training tokens. Compared to MAP-Neo, the previous state-of-the-art in open-data language models, DCLM-Baseline represents a 6.6 percentage point improvement on MMLU while being trained with 40% less compute. Our baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% & 66%), and performs similarly on an average of 53 natural language understanding tasks while being trained with 6.6x less compute than Llama 3 8B. Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation.
comment: Project page: https://www.datacomp.ai/dclm/
♻ ☆ Unleashing the potential of prompt engineering in Large Language Models: a comprehensive review
This paper delves into the pivotal role of prompt engineering in unleashing the capabilities of Large Language Models (LLMs). Prompt engineering is the process of structuring input text for LLMs and is a technique integral to optimizing the efficacy of LLMs. This survey elucidates foundational principles of prompt engineering, such as role-prompting, one-shot, and few-shot prompting, as well as more advanced methodologies such as the chain-of-thought and tree-of-thoughts prompting. The paper sheds light on how external assistance in the form of plugins can assist in this task, and reduce machine hallucination by retrieving external knowledge. We subsequently delineate prospective directions in prompt engineering research, emphasizing the need for a deeper understanding of structures and the role of agents in Artificial Intelligence-Generated Content (AIGC) tools. We discuss how to assess the efficacy of prompt methods from different perspectives and using different methods. Finally, we gather information about the application of prompt engineering in such fields as education and programming, showing its transformative potential. This comprehensive survey aims to serve as a friendly guide for anyone venturing through the big world of LLMs and prompt engineering.
♻ ☆ Challenges to Evaluating the Generalization of Coreference Resolution Models: A Measurement Modeling Perspective ACL
It is increasingly common to evaluate the same coreference resolution (CR) model on multiple datasets. Do these multi-dataset evaluations allow us to draw meaningful conclusions about model generalization? Or, do they rather reflect the idiosyncrasies of a particular experimental setup (e.g., the specific datasets used)? To study this, we view evaluation through the lens of measurement modeling, a framework commonly used in the social sciences for analyzing the validity of measurements. By taking this perspective, we show how multi-dataset evaluations risk conflating different factors concerning what, precisely, is being measured. This in turn makes it difficult to draw more generalizable conclusions from these evaluations. For instance, we show that across seven datasets, measurements intended to reflect CR model generalization are often correlated with differences in both how coreference is defined and how it is operationalized; this limits our ability to draw conclusions regarding the ability of CR models to generalize across any singular dimension. We believe the measurement modeling framework provides the needed vocabulary for discussing challenges surrounding what is actually being measured by CR evaluations.
comment: ACL Findings 2024
♻ ☆ Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller
We propose Self-Control, a novel method utilizing suffix gradients to control the behavior of large language models (LLMs) without explicit human annotations. Given a guideline expressed in suffix string and the model's self-assessment of adherence, Self-Control computes the gradient of this self-judgment concerning the model's hidden states, directly influencing the auto-regressive generation process towards desired behaviors. To enhance efficiency, we introduce Self-Control_{prefix}, a compact module that encapsulates the learned representations from suffix gradients into a Prefix Controller, facilitating inference-time control for various LLM behaviors. Our experiments demonstrate Self-Control's efficacy across multiple domains, including emotional modulation, ensuring harmlessness, and enhancing complex reasoning. Especially, Self-Control_{prefix} enables a plug-and-play control and jointly controls multiple attributes, improving model outputs without altering model parameters or increasing inference-time costs.
comment: 41 pages, 12 figures, 41 tables; Website: https://llm-self-control.github.io/
♻ ☆ Lower Bounds on the Expressivity of Recurrent Neural Language Models
The recent successes and spread of large neural language models (LMs) call for a thorough understanding of their computational ability. Describing their computational abilities through LMs' \emph{representational capacity} is a lively area of research. However, investigation into the representational capacity of neural LMs has predominantly focused on their ability to \emph{recognize} formal languages. For example, recurrent neural networks (RNNs) with Heaviside activations are tightly linked to regular languages, i.e., languages defined by finite-state automata (FSAs). Such results, however, fall short of describing the capabilities of RNN \emph{language models} (LMs), which are definitionally \emph{distributions} over strings. We take a fresh look at the representational capacity of RNN LMs by connecting them to \emph{probabilistic} FSAs and demonstrate that RNN LMs with linearly bounded precision can express arbitrary regular LMs.
♻ ☆ MBBQ: A Dataset for Cross-Lingual Comparison of Stereotypes in Generative LLMs
Generative large language models (LLMs) have been shown to exhibit harmful biases and stereotypes. While safety fine-tuning typically takes place in English, if at all, these models are being used by speakers of many different languages. There is existing evidence that the performance of these models is inconsistent across languages and that they discriminate based on demographic factors of the user. Motivated by this, we investigate whether the social stereotypes exhibited by LLMs differ as a function of the language used to prompt them, while controlling for cultural differences and task accuracy. To this end, we present MBBQ (Multilingual Bias Benchmark for Question-answering), a carefully curated version of the English BBQ dataset extended to Dutch, Spanish, and Turkish, which measures stereotypes commonly held across these languages. We further complement MBBQ with a parallel control dataset to measure task performance on the question-answering task independently of bias. Our results based on several open-source and proprietary LLMs confirm that some non-English languages suffer from bias more than English, even when controlling for cultural shifts. Moreover, we observe significant cross-lingual differences in bias behaviour for all except the most accurate models. With the release of MBBQ, we hope to encourage further research on bias in multilingual settings. The dataset and code are available at https://github.com/Veranep/MBBQ.
♻ ☆ BLoB: Bayesian Low-Rank Adaptation by Backpropagation for Large Language Models
Large Language Models (LLMs) often suffer from overconfidence during inference, particularly when adapted to downstream domain-specific tasks with limited data. Previous work addresses this issue by employing approximate Bayesian estimation after the LLMs are trained, enabling them to quantify uncertainty. However, such post-training approaches' performance is severely limited by the parameters learned during training. In this paper, we go beyond post-training Bayesianization and propose Bayesian Low-Rank Adaptation by Backpropagation (BLoB), an algorithm that continuously and jointly adjusts both the mean and covariance of LLM parameters throughout the whole fine-tuning process. Our empirical results verify the effectiveness of BLoB in terms of generalization and uncertainty estimation, when evaluated on both in-distribution and out-of-distribution data.
comment: 27 pages, 3 figures, 9 tables; preprint, work in progress
♻ ☆ On Efficiently Representing Regular Languages as RNNs
Recent work by Hewitt et al. (2020) provides an interpretation of the empirical success of recurrent neural networks (RNNs) as language models (LMs). It shows that RNNs can efficiently represent bounded hierarchical structures that are prevalent in human language. This suggests that RNNs' success might be linked to their ability to model hierarchy. However, a closer inspection of Hewitt et al.'s (2020) construction shows that it is not inherently limited to hierarchical structures. This poses a natural question: What other classes of LMs can RNNs efficiently represent? To this end, we generalize Hewitt et al.'s (2020) construction and show that RNNs can efficiently represent a larger class of LMs than previously claimed -- specifically, those that can be represented by a pushdown automaton with a bounded stack and a specific stack update function. Altogether, the efficiency of representing this diverse class of LMs with RNN LMs suggests novel interpretations of their inductive bias.
♻ ☆ LoRA-drop: Efficient LoRA Parameter Pruning based on Output Evaluation
Low-Rank Adaptation (LoRA) is currently the most commonly used Parameter-efficient fine-tuning (PEFT) method, it introduces auxiliary parameters for each layer to fine-tune the pre-trained model under limited computing resources. However, it still faces resource consumption challenges during training when scaling up to larger models. Most previous studies have tackled this issue by using pruning techniques, which involve removing LoRA parameters deemed unimportant. Nonetheless, these efforts only analyze LoRA parameter features to evaluate their importance, such as parameter count, size, and gradient. In fact, the output of LoRA (product of LoRA parameter and hidden state), directly impacts the final results. Preliminary experiments indicate that a fraction of LoRA elements possesses significantly high output values, substantially influencing the layer output. Motivated by the observation, we propose LoRA-drop. Concretely, LoRA-drop evaluates the importance of LoRA based on the LoRA output. Then we retain LoRA for important layers and the other layers share the same LoRA. We conduct abundant experiments with models of different scales on NLU and NLG tasks. Results demonstrate that LoRA-drop can achieve performance comparable to full fine-tuning and LoRA, while retaining 50\% of the LoRA parameters on average.
comment: 15 pages, 12 figures
♻ ☆ Evaluating the Data Model Robustness of Text-to-SQL Systems Based on Real User Queries
Text-to-SQL systems (also known as NL-to-SQL systems) have become an increasingly popular solution for bridging the gap between user capabilities and SQL-based data access. These systems translate user requests in natural language to valid SQL statements for a specific database. Recent Text-to-SQL systems have benefited from the rapid improvement of transformer-based language models. However, while Text-to-SQL systems that incorporate such models continuously reach new high scores on -- often synthetic -- benchmark datasets, a systematic exploration of their robustness towards different data models in a real-world, realistic scenario is notably missing. This paper provides the first in-depth evaluation of the data model robustness of Text-to-SQL systems in practice based on a multi-year international project focused on Text-to-SQL interfaces. Our evaluation is based on a real-world deployment of FootballDB, a system that was deployed over a 9 month period in the context of the FIFA World Cup 2022, during which about 6K natural language questions were asked and executed. All of our data is based on real user questions that were asked live to the system. We manually labeled and translated a subset of these questions for three different data models. For each data model, we explore the performance of representative Text-to-SQL systems and language models. We further quantify the impact of training data size, pre-, and post-processing steps as well as language model inference time. Our comprehensive evaluation sheds light on the design choices of real-world Text-to-SQL systems and their impact on moving from research prototypes to real deployments. Last, we provide a new benchmark dataset to the community, which is the first to enable the evaluation of different data models for the same dataset and is substantially more challenging than most previous datasets in terms of query complexity.
♻ ☆ Transformers Can Represent $n$-gram Language Models
Existing work has analyzed the representational capacity of the transformer architecture by means of formal models of computation. However, the focus so far has been on analyzing the architecture in terms of language \emph{acceptance}. We contend that this is an ill-suited problem in the study of \emph{language models} (LMs), which are definitionally \emph{probability distributions} over strings. In this paper, we focus on the relationship between transformer LMs and $n$-gram LMs, a simple and historically relevant class of language models. We show that transformer LMs using the hard or sparse attention mechanisms can exactly represent any $n$-gram LM, giving us a concrete lower bound on their probabilistic representational capacity. This provides a first step towards understanding the mechanisms that transformer LMs can use to represent probability distributions over strings.
♻ ☆ Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game
The New York Times Connections game has emerged as a popular and challenging pursuit for word puzzle enthusiasts. We collect 200 Connections games to evaluate the performance of state-of-the-art large language models (LLMs) against expert and novice human players. Our results show that even the best-performing LLM, GPT-4o, which has otherwise shown impressive reasoning abilities on a wide variety of benchmarks, can only fully solve 8% of the games. Compared to GPT-4o, novice and expert players perform better, with expert human players significantly outperforming GPT-4o. To deepen our understanding we create a taxonomy of the knowledge types required to successfully categorize words in the Connections game, revealing that LLMs struggle with associative, encyclopedic, and linguistic knowledge. Our findings establish the New York Times Connections game as a challenging benchmark for evaluating abstract reasoning capabilities in humans and AI systems.
♻ ☆ WRDScore: New Metric for Evaluation of Natural Language Generation Models
The problem of natural language generation, and, more specifically, method name prediction, faces significant difficulties when proposed models need to be evaluated on test data. Such a metric would need to consider the versatility with which a single method can be named, with respect to both semantics and syntax. Measuring the direct overlap between the predicted and reference (true) sequences will not be able to capture these subtleties. Other existing embedding based metrics either do not measure precision and recall or impose strict unrealistic assumptions on both sequences. To address these issues, we propose a new metric that, on the one hand, is very simple and lightweight, and, on the other hand, is able to calculate precision and recall without resorting to any assumptions while obtaining good performance with respect to the human judgement.
♻ ☆ An Embedded Diachronic Sense Change Model with a Case Study from Ancient Greek
Word meanings change over time, and word senses evolve, emerge or die out in the process. For ancient languages, where the corpora are often small and sparse, modelling such changes accurately proves challenging, and quantifying uncertainty in sense-change estimates consequently becomes important. GASC (Genre-Aware Semantic Change) and DiSC (Diachronic Sense Change) are existing generative models that have been used to analyse sense change for target words from an ancient Greek text corpus, using unsupervised learning without the help of any pre-training. These models represent the senses of a given target word such as ``kosmos'' (meaning decoration, order or world) as distributions over context words, and sense prevalence as a distribution over senses. The models are fitted using Markov Chain Monte Carlo (MCMC) methods to measure temporal changes in these representations. This paper introduces EDiSC, an Embedded DiSC model, which combines word embeddings with DiSC to provide superior model performance. It is shown empirically that EDiSC offers improved predictive accuracy, ground-truth recovery and uncertainty quantification, as well as better sampling efficiency and scalability properties with MCMC methods. The challenges of fitting these models are also discussed.
♻ ☆ $S^3$ -- Semantic Signal Separation
Topic models are useful tools for discovering latent semantic structures in large textual corpora. Topic modeling historically relied on bag-of-words representations of language. This approach makes models sensitive to the presence of stop words and noise, and does not utilize potentially useful contextual information. Recent efforts have been oriented at incorporating contextual neural representations in topic modeling and have been shown to outperform classical topic models. These approaches are, however, typically slow, volatile and still require preprocessing for optimal results. We present Semantic Signal Separation ($S^3$), a theory-driven topic modeling approach in neural embedding spaces. $S^3$ conceptualizes topics as independent axes of semantic space, and uncovers these with blind-source separation. Our approach provides the most diverse, highly coherent topics, requires no preprocessing, and is demonstrated to be the fastest contextually sensitive topic model to date. We offer an implementation of $S^3$, among other approaches, in the Turftopic Python package.
comment: 26 pages, 9 figures (main manuscript has 9 pages and 4 figures)
♻ ☆ From Complex to Simple: Enhancing Multi-Constraint Complex Instruction Following Ability of Large Language Models
It is imperative for Large language models (LLMs) to follow instructions with elaborate requirements (i.e. Complex Instructions Following). Yet, it remains under-explored how to enhance the ability of LLMs to follow complex instructions with multiple constraints. To bridge the gap, we initially study what training data is effective in enhancing complex constraints following abilities. We found that training LLMs with instructions containing multiple constraints enhances their understanding of complex instructions, especially those with lower complexity levels. The improvement can even generalize to compositions of out-of-domain constraints. Additionally, we further propose methods addressing how to obtain and utilize the effective training data. Finally, we conduct extensive experiments to prove the effectiveness of our methods in terms of overall performance and training efficiency. We also demonstrate that our methods improve models' ability to follow instructions generally and generalize effectively across out-of-domain, in-domain, and adversarial settings, while maintaining general capabilities.
♻ ☆ Theoretical Understanding of In-Context Learning in Shallow Transformers with Unstructured Data
Large language models (LLMs) are powerful models that can learn concepts at the inference stage via in-context learning (ICL). While theoretical studies, e.g., \cite{zhang2023trained}, attempt to explain the mechanism of ICL, they assume the input $x_i$ and the output $y_i$ of each demonstration example are in the same token (i.e., structured data). However, in real practice, the examples are usually text input, and all words, regardless of their logic relationship, are stored in different tokens (i.e., unstructured data \cite{wibisono2023role}). To understand how LLMs learn from the unstructured data in ICL, this paper studies the role of each component in the transformer architecture and provides a theoretical understanding to explain the success of the architecture. In particular, we consider a simple transformer with one/two attention layers and linear regression tasks for the ICL prediction. We observe that (1) a transformer with two layers of (self-)attentions with a look-ahead attention mask can learn from the prompt in the unstructured data, and (2) positional encoding can match the $x_i$ and $y_i$ tokens to achieve a better ICL performance.
♻ ☆ Assistive Large Language Model Agents for Socially-Aware Negotiation Dialogues EMNLP 2024
We develop assistive agents based on Large Language Models (LLMs) that aid interlocutors in business negotiations. Specifically, we simulate business negotiations by letting two LLM-based agents engage in role play. A third LLM acts as a remediator agent to rewrite utterances violating norms for improving negotiation outcomes. We introduce a simple tuning-free and label-free In-Context Learning (ICL) method to identify high-quality ICL exemplars for the remediator, where we propose a novel select criteria, called value impact, to measure the quality of the negotiation outcomes. We provide rich empirical evidence to demonstrate its effectiveness in negotiations across three different negotiation topics. The source code and the generated dataset will be publicly available upon acceptance.
comment: 25 pages, 3 figures, 13 tables; Under review in EMNLP 2024
♻ ☆ RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models
The advent of Large Language Models (LLMs) has paved the way for complex tasks such as role-playing, which enhances user interactions by enabling models to imitate various characters. However, the closed-source nature of state-of-the-art LLMs and their general-purpose training limit role-playing optimization. In this paper, we introduce RoleLLM, a framework to benchmark, elicit, and enhance role-playing abilities in LLMs. RoleLLM comprises four stages: (1) Role Profile Construction for 100 roles; (2) Context-Based Instruction Generation (Context-Instruct) for role-specific knowledge extraction; (3) Role Prompting using GPT (RoleGPT) for speaking style imitation; and (4) Role-Conditioned Instruction Tuning (RoCIT) for fine-tuning open-source models along with role customization. By Context-Instruct and RoleGPT, we create RoleBench, the first systematic and fine-grained character-level benchmark dataset for role-playing with 168,093 samples. Moreover, RoCIT on RoleBench yields RoleLLaMA (English) and RoleGLM (Chinese), significantly enhancing role-playing abilities and even achieving comparable results with RoleGPT (using GPT-4).
comment: 30 pages, repo at https://github.com/InteractiveNLP-Team/RoleLLM-public
♻ ☆ Authorship Obfuscation in Multilingual Machine-Generated Text Detection
High-quality text generation capability of recent Large Language Models (LLMs) causes concerns about their misuse (e.g., in massive generation/spread of disinformation). Machine-generated text (MGT) detection is important to cope with such threats. However, it is susceptible to authorship obfuscation (AO) methods, such as paraphrasing, which can cause MGTs to evade detection. So far, this was evaluated only in monolingual settings. Thus, the susceptibility of recently proposed multilingual detectors is still unknown. We fill this gap by comprehensively benchmarking the performance of 10 well-known AO methods, attacking 37 MGT detection methods against MGTs in 11 languages (i.e., 10 $\times$ 37 $\times$ 11 = 4,070 combinations). We also evaluate the effect of data augmentation on adversarial robustness using obfuscated texts. The results indicate that all tested AO methods can cause evasion of automated detection in all tested languages, where homoglyph attacks are especially successful. However, some of the AO methods severely damaged the text, making it no longer readable or easily recognizable by humans (e.g., changed language, weird characters).
♻ ☆ ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code
Despite Large Language Models (LLMs) like GPT-4 achieving impressive results in function-level code generation, they struggle with repository-scale code understanding (e.g., coming up with the right arguments for calling routines), requiring a deeper comprehension of complex file interactions. Also, recently, people have developed LLM agents that attempt to interact with repository code (e.g., compiling and evaluating its execution), prompting the need to evaluate their performance. These gaps have motivated our development of ML-Bench, a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks. Addressing the need for LLMs to interpret long code contexts and translate instructions into precise, executable scripts, ML-Bench encompasses annotated 9,641 examples across 18 GitHub repositories, challenging LLMs to accommodate user-specified arguments and documentation intricacies effectively. To evaluate both LLMs and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment. Our findings indicate that while GPT-4o leads with a Pass@5 rate surpassing 50%, there remains significant scope for improvement, highlighted by issues such as hallucinated outputs and difficulties with bash script generation. Notably, in the more demanding ML-Agent-Bench, GPT-4o achieves a 76.47% success rate, reflecting the efficacy of iterative action and feedback in complex task resolution. Our code, dataset, and models are available at https://github.com/gersteinlab/ML-bench.
♻ ☆ How Well Do Multi-modal LLMs Interpret CT Scans? An Auto-Evaluation Framework for Analyses
Automatically interpreting CT scans can ease the workload of radiologists. However, this is challenging mainly due to the scarcity of adequate datasets and reference standards for evaluation. This study aims to bridge this gap by introducing a novel evaluation framework, named ``GPTRadScore''. This framework assesses the capabilities of multi-modal LLMs, such as GPT-4 with Vision (GPT-4V), Gemini Pro Vision, LLaVA-Med, and RadFM, in generating descriptions for prospectively-identified findings. By employing a decomposition technique based on GPT-4, GPTRadScore compares these generated descriptions with gold-standard report sentences, analyzing their accuracy in terms of body part, location, and type of finding. Evaluations demonstrated a high correlation with clinician assessments and highlighted its potential over traditional metrics, such as BLEU, METEOR, and ROUGE. Furthermore, to contribute to future studies, we plan to release a benchmark dataset annotated by clinicians. Using GPTRadScore, we found that while GPT-4V and Gemini Pro Vision fare better, their performance revealed significant areas for improvement, primarily due to limitations in the dataset used for training these models. To demonstrate this potential, RadFM was fine-tuned and it resulted in significant accuracy improvements: location accuracy rose from 3.41\% to 12.8\%, body part accuracy from 29.12\% to 53\%, and type accuracy from 9.24\% to 30\%, thereby validating our hypothesis.
♻ ☆ Accelerating Speculative Decoding using Dynamic Speculation Length
Speculative decoding is commonly used for reducing the inference latency of large language models. Its effectiveness depends highly on the speculation lookahead (SL)-the number of tokens generated by the draft model at each iteration. In this work we show that the common practice of using the same SL for all iterations static SL is suboptimal. We introduce DISCO (DynamIc SpeCulation lookahead Optimization), a novel method for dynamically selecting the SL. Our experiments with four datasets show that DISCO reaches an average speedup of 10% compared to the best static SL baseline, while generating the exact same text.
♻ ☆ How structured are the representations in transformer-based vision encoders? An analysis of multi-object representations in vision-language models
Forming and using symbol-like structured representations for reasoning has been considered essential for generalising over novel inputs. The primary tool that allows generalisation outside training data distribution is the ability to abstract away irrelevant information into a compact form relevant to the task. An extreme form of such abstract representations is symbols. Humans make use of symbols to bind information while abstracting away irrelevant parts to utilise the information consistently and meaningfully. This work estimates the state of such structured representations in vision encoders. Specifically, we evaluate image encoders in large vision-language pre-trained models to address the question of which desirable properties their representations lack by applying the criteria of symbolic structured reasoning described for LLMs to the image models. We test the representation space of image encoders like VIT, BLIP, CLIP, and FLAVA to characterise the distribution of the object representations in these models. In particular, we create decoding tasks using multi-object scenes from the COCO dataset, relating the token space to its input content for various objects in the scene. We use these tasks to characterise the network's token and layer-wise information modelling. Our analysis highlights that the CLS token, used for the downstream task, only focuses on a few objects necessary for the trained downstream task. Still, other individual objects are well-modelled separately by the tokens in the network originating from those objects. We further observed a widespread distribution of scene information. This demonstrates that information is far more entangled in tokens than optimal for representing objects similar to symbols. Given these symbolic properties, we show the network dynamics that cause failure modes of these models on basic downstream tasks in a multi-object scene.
♻ ☆ HARE: HumAn pRiors, a key to small language model Efficiency
Human priors play a crucial role in efficiently utilizing data in deep learning. However, with the development of large language models (LLMs), there is an increasing emphasis on scaling both model size and data volume, which often diminishes the importance of human priors in data construction. Influenced by these trends, existing Small Language Models (SLMs) mainly rely on web-scraped large-scale training data, neglecting the proper incorporation of human priors. This oversight limits the training efficiency of language models in resource-constrained settings. In this paper, we propose a principle to leverage human priors for data construction. This principle emphasizes achieving high-performance SLMs by training on a concise dataset that accommodates both semantic diversity and data quality consistency, while avoiding benchmark data leakage. Following this principle, we train an SLM named HARE-1.1B. Extensive experiments on large-scale benchmark datasets demonstrate that HARE-1.1B performs favorably against state-of-the-art SLMs, validating the effectiveness of the proposed principle. Additionally, this provides new insights into efficient language model training in resource-constrained environments from the view of human priors.
♻ ☆ What If We Recaption Billions of Web Images with LLaMA-3?
Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations in this area remain predominantly closed-source. Our paper aims to bridge this community effort, leveraging the powerful and \textit{open-sourced} LLaMA-3, a GPT-4 level LLM. Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B powered LLaVA-1.5 and then employ it to recaption 1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe enhanced zero-shot performance in cross-modal retrieval tasks. For generative models like text-to-image Diffusion Transformers, the generated images exhibit a significant improvement in alignment with users' text instructions, especially in following complex queries. Our project page is https://www.haqtu.me/Recap-Datacomp-1B/
comment: First five authors contributed equally
♻ ☆ Retrieve to Explain: Evidence-driven Predictions with Language Models ICML
Language models hold incredible promise for enabling scientific discovery by synthesizing massive research corpora. Many complex scientific research questions have multiple plausible answers, each supported by evidence of varying strength. However, existing language models lack the capability to quantitatively and faithfully compare answer plausibility in terms of supporting evidence. To address this issue, we introduce Retrieve to Explain (R2E), a retrieval-based language model. R2E scores and ranks all possible answers to a research question based on evidence retrieved from a document corpus. The architecture represents each answer only in terms of its supporting evidence, with the answer itself masked. This allows us to extend feature attribution methods, such as Shapley values, to transparently attribute each answer's score back to its supporting evidence at inference time. The architecture also allows R2E to incorporate new evidence without retraining, including non-textual data modalities templated into natural language. We assess on the challenging task of drug target identification from scientific literature, a human-in-the-loop process where failures are extremely costly and explainability is paramount. When predicting whether drug targets will subsequently be confirmed as efficacious in clinical trials, R2E not only matches non-explainable literature-based models but also surpasses a genetics-based target identification approach used throughout the pharmaceutical industry.
comment: ICML AI for Science 2024
♻ ☆ Connected Speech-Based Cognitive Assessment in Chinese and English
We present a novel benchmark dataset and prediction tasks for investigating approaches to assess cognitive function through analysis of connected speech. The dataset consists of speech samples and clinical information for speakers of Mandarin Chinese and English with different levels of cognitive impairment as well as individuals with normal cognition. These data have been carefully matched by age and sex by propensity score analysis to ensure balance and representativity in model training. The prediction tasks encompass mild cognitive impairment diagnosis and cognitive test score prediction. This framework was designed to encourage the development of approaches to speech-based cognitive assessment which generalise across languages. We illustrate it by presenting baseline prediction models that employ language-agnostic and comparable features for diagnosis and cognitive test score prediction. The models achieved unweighted average recall was 59.2% in diagnosis, and root mean squared error of 2.89 in score prediction.
comment: To appear in Proceedings of Interspeech 2024
♻ ☆ Instruction Fine-Tuning: Does Prompt Loss Matter?
We present a novel study analyzing the effects of various prompt loss token weights (PLW) for supervised instruction fine-tuning (SIFT). While prompt-masking (PLW = 0) is common for SIFT, some fine-tuning APIs support fractional PLWs and suggest that using a small non-zero PLW can help stabilize learning when fine-tuning on short-completion data. However, there has never been a study confirming this claim, and OpenAI, a major cloud-based SIFT provider, recently removed this parameter from their fine-tuning API. We found that performance of models fine-tuned on short-completion data had a statistically-significant negative quadratic relationship with PLW. Using small values (0.01 - 0.5) of PLW produced better results on multiple-choice and short-generation benchmarks (outperforming models fine-tuned on long-completion data) while large values (~ 1.0) of PLW produced better results on long-generation benchmarks. We explained this effect and verified its importance through additional experiments. This research serves as a warning to API providers about the importance of providing a PLW parameter for SIFT.
comment: 8 pages of content. 13 pages of appendices. 45 figures
♻ ☆ GADePo: Graph-Assisted Declarative Pooling Transformers for Document-Level Relation Extraction ACL 2024
Document-level relation extraction typically relies on text-based encoders and hand-coded pooling heuristics to aggregate information learned by the encoder. In this paper, we leverage the intrinsic graph processing capabilities of the Transformer model and propose replacing hand-coded pooling methods with new tokens in the input, which are designed to aggregate information via explicit graph relations in the computation of attention weights. We introduce a joint text-graph Transformer model and a graph-assisted declarative pooling (GADePo) specification of the input, which provides explicit and high-level instructions for information aggregation. GADePo allows the pooling process to be guided by domain-specific knowledge or desired outcomes but still learned by the Transformer, leading to more flexible and customisable pooling strategies. We evaluate our method across diverse datasets and models and show that our approach yields promising results that are consistently better than those achieved by the hand-coded pooling functions.
comment: Accepted to KnowledgeNLP workshop at ACL 2024
♻ ☆ Transformers are Multi-State RNNs
Transformers are considered conceptually different from the previous generation of state-of-the-art NLP models - recurrent neural networks (RNNs). In this work, we demonstrate that decoder-only transformers can in fact be conceptualized as unbounded multi-state RNNs - an RNN variant with unlimited hidden state size. We further show that transformers can be converted into $\textit{bounded}$ multi-state RNNs by fixing the size of their hidden state, effectively compressing their key-value cache. We introduce a novel, training-free compression policy - $\textbf{T}$oken $\textbf{O}$mission $\textbf{V}$ia $\textbf{A}$ttention (TOVA). Our experiments with four long range tasks and several LLMs show that TOVA outperforms several baseline compression policies. Particularly, our results are nearly on par with the full model, using in some cases only $\frac{1}{8}$ of the original cache size, which translates to 4.8X higher throughput. Our results shed light on the connection between transformers and RNNs, and help mitigate one of LLMs' most painful computational bottlenecks - the size of their key-value cache. We publicly release our code at https://github.com/schwartz-lab-NLP/TOVA
comment: preprint
♻ ☆ Identifying and Mitigating Privacy Risks Stemming from Language Models: A Survey
Large Language Models (LLMs) have shown greatly enhanced performance in recent years, attributed to increased size and extensive training data. This advancement has led to widespread interest and adoption across industries and the public. However, training data memorization in Machine Learning models scales with model size, particularly concerning for LLMs. Memorized text sequences have the potential to be directly leaked from LLMs, posing a serious threat to data privacy. Various techniques have been developed to attack LLMs and extract their training data. As these models continue to grow, this issue becomes increasingly critical. To help researchers and policymakers understand the state of knowledge around privacy attacks and mitigations, including where more work is needed, we present the first SoK on data privacy for LLMs. We (i) identify a taxonomy of salient dimensions where attacks differ on LLMs, (ii) systematize existing attacks, using our taxonomy of dimensions to highlight key trends, (iii) survey existing mitigation strategies, highlighting their strengths and limitations, and (iv) identify key gaps, demonstrating open problems and areas for concern.
comment: 15 pages
♻ ☆ Curating Grounded Synthetic Data with Global Perspectives for Equitable AI
The development of robust AI models relies heavily on the quality and variety of training data available. In fields where data scarcity is prevalent, synthetic data generation offers a vital solution. In this paper, we introduce a novel approach to creating synthetic datasets, grounded in real-world diversity and enriched through strategic diversification. We synthesize data using a comprehensive collection of news articles spanning 12 languages and originating from 125 countries, to ensure a breadth of linguistic and cultural representations. Through enforced topic diversification, translation, and summarization, the resulting dataset accurately mirrors real-world complexities and addresses the issue of underrepresentation in traditional datasets. This methodology, applied initially to Named Entity Recognition (NER), serves as a model for numerous AI disciplines where data diversification is critical for generalizability. Preliminary results demonstrate substantial improvements in performance on traditional NER benchmarks, by up to 7.3%, highlighting the effectiveness of our synthetic data in mimicking the rich, varied nuances of global data sources. This paper outlines the strategies employed for synthesizing diverse datasets and provides such a curated dataset for NER.
♻ ☆ Timeline-based Sentence Decomposition with In-Context Learning for Temporal Fact Extraction ACL2024
Facts extraction is pivotal for constructing knowledge graphs. Recently, the increasing demand for temporal facts in downstream tasks has led to the emergence of the task of temporal fact extraction. In this paper, we specifically address the extraction of temporal facts from natural language text. Previous studies fail to handle the challenge of establishing time-to-fact correspondences in complex sentences. To overcome this hurdle, we propose a timeline-based sentence decomposition strategy using large language models (LLMs) with in-context learning, ensuring a fine-grained understanding of the timeline associated with various facts. In addition, we evaluate the performance of LLMs for direct temporal fact extraction and get unsatisfactory results. To this end, we introduce TSDRE, a method that incorporates the decomposition capabilities of LLMs into the traditional fine-tuning of smaller pre-trained language models (PLMs). To support the evaluation, we construct ComplexTRED, a complex temporal fact extraction dataset. Our experiments show that TSDRE achieves state-of-the-art results on both HyperRED-Temporal and ComplexTRED datasets.
comment: Accepted to ACL2024 main conference
♻ ☆ Knowledge-Driven Cross-Document Relation Extraction ACL 2024
Relation extraction (RE) is a well-known NLP application often treated as a sentence- or document-level task. However, a handful of recent efforts explore it across documents or in the cross-document setting (CrossDocRE). This is distinct from the single document case because different documents often focus on disparate themes, while text within a document tends to have a single goal. Linking findings from disparate documents to identify new relationships is at the core of the popular literature-based knowledge discovery paradigm in biomedicine and other domains. Current CrossDocRE efforts do not consider domain knowledge, which are often assumed to be known to the reader when documents are authored. Here, we propose a novel approach, KXDocRE, that embed domain knowledge of entities with input text for cross-document RE. Our proposed framework has three main benefits over baselines: 1) it incorporates domain knowledge of entities along with documents' text; 2) it offers interpretability by producing explanatory text for predicted relations between entities 3) it improves performance over the prior methods.
comment: Accepted in ACL 2024 Findings
♻ ☆ A Survey on Human Preference Learning for Large Language Models
The recent surge of versatile large language models (LLMs) largely depends on aligning increasingly capable foundation models with human intentions by preference learning, enhancing LLMs with excellent applicability and effectiveness in a wide range of contexts. Despite the numerous related studies conducted, a perspective on how human preferences are introduced into LLMs remains limited, which may prevent a deeper comprehension of the relationships between human preferences and LLMs as well as the realization of their limitations. In this survey, we review the progress in exploring human preference learning for LLMs from a preference-centered perspective, covering the sources and formats of preference feedback, the modeling and usage of preference signals, as well as the evaluation of the aligned LLMs. We first categorize the human feedback according to data sources and formats. We then summarize techniques for human preferences modeling and compare the advantages and disadvantages of different schools of models. Moreover, we present various preference usage methods sorted by the objectives to utilize human preference signals. Finally, we summarize some prevailing approaches to evaluate LLMs in terms of alignment with human intentions and discuss our outlooks on the human intention alignment for LLMs.
comment: IEEE copyright statement added (also applied to the former version)
♻ ☆ FedMKT: Federated Mutual Knowledge Transfer for Large and Small Language Models
Recent research in federated large language models (LLMs) has primarily focused on enabling clients to fine-tune their locally deployed homogeneous LLMs collaboratively or on transferring knowledge from server-based LLMs to small language models (SLMs) at downstream clients. However, a significant gap remains in the simultaneous mutual enhancement of both the server's LLM and clients' SLMs. To bridge this gap, we propose FedMKT, a parameter-efficient federated mutual knowledge transfer framework for large and small language models. This framework is designed to adaptively transfer knowledge from the server's LLM to clients' SLMs while concurrently enriching the LLM with clients' unique domain insights. We facilitate token alignment using minimum edit distance (MinED) and then selective mutual knowledge transfer between client-side SLMs and a server-side LLM, aiming to collectively enhance their performance. Through extensive experiments across three distinct scenarios, we evaluate the effectiveness of FedMKT using various public LLMs and SLMs on a range of NLP text generation tasks. Empirical results demonstrate that FedMKT simultaneously boosts the performance of both LLMs and SLMs.
♻ ☆ Spatial-Temporal Large Language Model for Traffic Prediction MDM 2024
Traffic prediction, an essential component for intelligent transportation systems, endeavours to use historical data to foresee future traffic features at specific locations. Although existing traffic prediction models often emphasize developing complex neural network structures, their accuracy has not improved. Recently, large language models have shown outstanding capabilities in time series analysis. Differing from existing models, LLMs progress mainly through parameter expansion and extensive pretraining while maintaining their fundamental structures. Motivated by these developments, we propose a Spatial-Temporal Large Language Model (ST-LLM) for traffic prediction. In the ST-LLM, we define timesteps at each location as tokens and design a spatial-temporal embedding to learn the spatial location and global temporal patterns of these tokens. Additionally, we integrate these embeddings by a fusion convolution to each token for a unified spatial-temporal representation. Furthermore, we innovate a partially frozen attention strategy to adapt the LLM to capture global spatial-temporal dependencies for traffic prediction. Comprehensive experiments on real traffic datasets offer evidence that ST-LLM is a powerful spatial-temporal learner that outperforms state-of-the-art models. Notably, the ST-LLM also exhibits robust performance in both few-shot and zero-shot prediction scenarios. The code is publicly available at https://github.com/ChenxiLiu-HNU/ST-LLM.
comment: Accepted by MDM 2024 (Research Track)
♻ ☆ Imagination Augmented Generation: Learning to Imagine Richer Context for Question Answering over Large Language Models
Retrieval-Augmented-Generation and Gener-ation-Augmented-Generation have been proposed to enhance the knowledge required for question answering over Large Language Models (LLMs). However, the former relies on external resources, and both require incorporating explicit documents into the context, which increases execution costs and susceptibility to noise data. Recent works indicate that LLMs have modeled rich knowledge, albeit not effectively triggered or awakened. Inspired by this, we propose a novel knowledge-augmented framework, Imagination-Augmented-Generation (IAG), which simulates the human capacity to compensate for knowledge deficits while answering questions solely through imagination, thereby awakening relevant knowledge in LLMs without relying on external resources. Guided by IAG, we propose an imagine richer context method for question answering (IMcQA). IMcQA consists of two modules: explicit imagination, which generates a short dummy document by learning from long context compression, and implicit imagination, which creates flexible adapters by distilling from a teacher model with a long context. Experimental results on three datasets demonstrate that IMcQA exhibits significant advantages in both open-domain and closed-book settings, as well as in out-of-distribution generalization. Our code will be available at https://github.com/Xnhyacinth/IAG.
♻ ☆ MEIT: Multi-Modal Electrocardiogram Instruction Tuning on Large Language Models for Report Generation
Electrocardiogram (ECG) is the primary non-invasive diagnostic tool for monitoring cardiac conditions and is crucial in assisting clinicians. Recent studies have concentrated on classifying cardiac conditions using ECG data but have overlooked ECG report generation, which is time-consuming and requires clinical expertise. To automate ECG report generation and ensure its versatility, we propose the Multimodal ECG Instruction Tuning (MEIT) framework, the first attempt to tackle ECG report generation with LLMs and multimodal instructions. To facilitate future research, we establish a benchmark to evaluate MEIT with various LLMs backbones across two large-scale ECG datasets. Our approach uniquely aligns the representations of the ECG signal and the report, and we conduct extensive experiments to benchmark MEIT with nine open-source LLMs using more than 800,000 ECG reports. MEIT's results underscore the superior performance of instruction-tuned LLMs, showcasing their proficiency in quality report generation, zero-shot capabilities, and resilience to signal perturbation. These findings emphasize the efficacy of our MEIT framework and its potential for real-world clinical application.
comment: Under review
♻ ☆ Evaluating Text-to-Visual Generation with Image-to-Text Generation
Despite significant progress in generative AI, comprehensive evaluation remains challenging because of the lack of effective metrics and standardized benchmarks. For instance, the widely-used CLIPScore measures the alignment between a (generated) image and text prompt, but it fails to produce reliable scores for complex prompts involving compositions of objects, attributes, and relations. One reason is that text encoders of CLIP can notoriously act as a "bag of words", conflating prompts such as "the horse is eating the grass" with "the grass is eating the horse". To address this, we introduce the VQAScore, which uses a visual-question-answering (VQA) model to produce an alignment score by computing the probability of a "Yes" answer to a simple "Does this figure show '{text}'?" question. Though simpler than prior art, VQAScore computed with off-the-shelf models produces state-of-the-art results across many (8) image-text alignment benchmarks. We also compute VQAScore with an in-house model that follows best practices in the literature. For example, we use a bidirectional image-question encoder that allows image embeddings to depend on the question being asked (and vice versa). Our in-house model, CLIP-FlanT5, outperforms even the strongest baselines that make use of the proprietary GPT-4V. Interestingly, although we train with only images, VQAScore can also align text with video and 3D models. VQAScore allows researchers to benchmark text-to-visual generation using complex texts that capture the compositional structure of real-world prompts. We introduce GenAI-Bench, a more challenging benchmark with 1,600 compositional text prompts that require parsing scenes, objects, attributes, relationships, and high-order reasoning like comparison and logic. GenAI-Bench also offers over 15,000 human ratings for leading image and video generation models such as Stable Diffusion, DALL-E 3, and Gen2.
comment: We open-source our data, model, and code at: https://github.com/linzhiqiu/t2v_metrics ; Project page: https://linzhiqiu.github.io/papers/vqascore
♻ ☆ Can LLMs Learn New Concepts Incrementally without Forgetting?
Large Language Models (LLMs) have achieved remarkable success across various tasks, yet their ability to learn incrementally without forgetting remains underexplored. Incremental learning (IL) is crucial as it enables models to acquire new knowledge while retaining previously learned information, akin to human learning. Existing benchmarks for IL are insufficient due to data leakage issues and the overqualification of LLMs. To address these challenges, we introduce Concept-1K, a novel dataset comprising 1,023 recently emerged concepts across diverse domains. The concepts in Concept-1K are discrete, interpretable units of knowledge that allow for fine-grained analysis of learning and forgetting processes. Using Concept-1K as a testbed, we aim to answer the question: ``Can LLMs learn new concepts incrementally without forgetting like humans?'' Our investigation reveals that LLMs still suffer from catastrophic forgetting and that LoRA, despite fine-tuning fewer parameters, may lead to more forgetting on training data. Additionally, we explore the roles of in-context learning, model scale, buffer size, and pretraining in IL performance. These findings highlight the strengths and limitations of LLMs in IL scenarios and provide a robust benchmark for future research.
comment: 28 pages
♻ ☆ SelfCP: Compressing Over-Limit Prompt via the Frozen Large Language Model Itself
Long prompt leads to huge hardware costs when using transformer-based Large Language Models (LLMs). Unfortunately, many tasks, such as summarization, inevitably introduce long documents, and the wide application of in-context learning easily makes the prompt length explode. This paper proposes a Self-Compressor (SelfCP), which employs the target LLM itself to compress over-limit prompts into dense vectors while keeping the allowed prompts unmodified. Dense vectors are then projected into dense tokens via a learnable connector to make the same LLM unburden to understand. The connector is supervised-tuned under the language modeling objective of the LLM on relatively long texts selected from publicly accessed datasets, involving an instruction dataset to make SelfCP respond to various prompts, while the target LLM keeps frozen during training. We build the lightweight SelfCP upon 2 different backbones with merely 17M learnable parameters originating from the connector and a learnable embedding. Evaluation on both English and Chinese benchmarks demonstrate that SelfCP effectively substitutes 12$\times$ over-limit prompts with dense tokens to reduce memory costs and booster inference throughputs, yet improving response quality. The outstanding performance brings an efficient solution for LLMs to tackle long prompts without training LLMs from scratch.
♻ ☆ Investigating Annotator Bias in Large Language Models for Hate Speech Detection
Data annotation, the practice of assigning descriptive labels to raw data, is pivotal in optimizing the performance of machine learning models. However, it is a resource-intensive process susceptible to biases introduced by annotators. The emergence of sophisticated Large Language Models (LLMs), like ChatGPT presents a unique opportunity to modernize and streamline this complex procedure. While existing research extensively evaluates the efficacy of LLMs, as annotators, this paper delves into the biases present in LLMs, specifically GPT 3.5 and GPT 4o when annotating hate speech data. Our research contributes to understanding biases in four key categories: gender, race, religion, and disability. Specifically targeting highly vulnerable groups within these categories, we analyze annotator biases. Furthermore, we conduct a comprehensive examination of potential factors contributing to these biases by scrutinizing the annotated data. We introduce our custom hate speech detection dataset, HateSpeechCorpus, to conduct this research. Additionally, we perform the same experiments on the ETHOS (Mollas et al., 2022) dataset also for comparative analysis. This paper serves as a crucial resource, guiding researchers and practitioners in harnessing the potential of LLMs for dataannotation, thereby fostering advancements in this critical field. The HateSpeechCorpus dataset is available here: https://github.com/AmitDasRup123/HateSpeechCorpus
♻ ☆ HistGen: Histopathology Report Generation via Local-Global Feature Encoding and Cross-modal Context Interaction MICCAI2024
Histopathology serves as the gold standard in cancer diagnosis, with clinical reports being vital in interpreting and understanding this process, guiding cancer treatment and patient care. The automation of histopathology report generation with deep learning stands to significantly enhance clinical efficiency and lessen the labor-intensive, time-consuming burden on pathologists in report writing. In pursuit of this advancement, we introduce HistGen, a multiple instance learning-empowered framework for histopathology report generation together with the first benchmark dataset for evaluation. Inspired by diagnostic and report-writing workflows, HistGen features two delicately designed modules, aiming to boost report generation by aligning whole slide images (WSIs) and diagnostic reports from local and global granularity. To achieve this, a local-global hierarchical encoder is developed for efficient visual feature aggregation from a region-to-slide perspective. Meanwhile, a cross-modal context module is proposed to explicitly facilitate alignment and interaction between distinct modalities, effectively bridging the gap between the extensive visual sequences of WSIs and corresponding highly summarized reports. Experimental results on WSI report generation show the proposed model outperforms state-of-the-art (SOTA) models by a large margin. Moreover, the results of fine-tuning our model on cancer subtyping and survival analysis tasks further demonstrate superior performance compared to SOTA methods, showcasing strong transfer learning capability. Dataset, model weights, and source code are available in https://github.com/dddavid4real/HistGen.
comment: Accepted by MICCAI2024
♻ ☆ Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models
The rise of Multimodal Large Language Models (MLLMs), renowned for their advanced instruction-following and reasoning capabilities, has significantly propelled the field of visual reasoning. However, due to limitations in their image tokenization processes, most MLLMs struggle to capture fine details of text and objects in images, especially in high-resolution samples. To overcome this limitation, we introduce P2G, a novel framework for plug-and-play grounding in MLLMs. P2G utilizes the tool-usage potential of MLLMs to employ expert agents for on-the-fly grounding of reasoning into critical visual and textual elements in images, thereby enabling deliberate reasoning through multimodal prompting. Additionally, we develop P2GB, a benchmark designed to evaluate MLLMs' proficiency in understanding inter-object relationships and textual content in challenging high-resolution images. Extensive experiments on visual reasoning tasks demonstrate the superiority of P2G, achieving performance comparable to GPT-4V on P2GB with a 7B backbone. Our work underscores the potential of grounding reasoning with external agents in MLLMs, presenting a promising alternative to mere model scaling.
comment: 15 pages, 8 figures
♻ ☆ Causal Graph Discovery with Retrieval-Augmented Generation based Large Language Models
Causal graph recovery is traditionally done using statistical estimation-based methods or based on individual's knowledge about variables of interests. They often suffer from data collection biases and limitations of individuals' knowledge. The advance of large language models (LLMs) provides opportunities to address these problems. We propose a novel method that leverages LLMs to deduce causal relationships in general causal graph recovery tasks. This method leverages knowledge compressed in LLMs and knowledge LLMs extracted from scientific publication database as well as experiment data about factors of interest to achieve this goal. Our method gives a prompting strategy to extract associational relationships among those factors and a mechanism to perform causality verification for these associations. Comparing to other LLM-based methods that directly instruct LLMs to do the highly complex causal reasoning, our method shows clear advantage on causal graph quality on benchmark datasets. More importantly, as causality among some factors may change as new research results emerge, our method show sensitivity to new evidence in the literature and can provide useful information for updating causal graphs accordingly.
♻ ☆ SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis
Recent breakthroughs in Large Language Models (LLMs) have revolutionized natural language understanding and generation, sparking significant interest in applying them to scientific literature analysis. However, existing benchmarks fail to adequately evaluate the proficiency of LLMs in this domain, particularly in scenarios requiring higher-level abilities beyond mere memorization and the handling of multimodal data. In response to this gap, we introduce SciAssess, a benchmark specifically designed for the comprehensive evaluation of LLMs in scientific literature analysis. SciAssess aims to thoroughly assess the efficacy of LLMs by focusing on their capabilities in Memorization (L1), Comprehension (L2), and Analysis \& Reasoning (L3). It encompasses a variety of tasks drawn from diverse scientific fields, including fundamental science, alloy materials, biomedicine, drug discovery, and organic materials. To ensure the reliability of SciAssess, rigorous quality control measures have been implemented, ensuring accuracy, anonymization, and compliance with copyright standards. SciAssess evaluates 11 LLMs, including GPT, Claude, and Gemini, highlighting their strengths and areas for improvement. This evaluation supports the ongoing development of LLM applications in the analysis of scientific literature. SciAssess and its resources are available at \url{https://sci-assess.github.io/}.
♻ ☆ Formal-LLM: Integrating Formal Language and Natural Language for Controllable LLM-based Agents
Recent advancements on Large Language Models (LLMs) enable AI Agents to automatically generate and execute multi-step plans to solve complex tasks. However, since LLM's content generation process is hardly controllable, current LLM-based agents frequently generate invalid or non-executable plans, which jeopardizes the performance of the generated plans and corrupts users' trust in LLM-based agents. In response, this paper proposes a novel ``Formal-LLM'' framework for LLM-based agents by integrating the expressiveness of natural language and the precision of formal language. Specifically, the framework allows human users to express their requirements or constraints for the planning process as an automaton. A stack-based LLM plan generation process is then conducted under the supervision of the automaton to ensure that the generated plan satisfies the constraints, making the planning process controllable. We conduct experiments on both benchmark tasks and practical real-life tasks, and our framework achieves over 50% overall performance increase, which validates the feasibility and effectiveness of employing Formal-LLM to guide the plan generation of agents, preventing the agents from generating invalid and unsuccessful plans. Further, more controllable LLM-based agents can facilitate the broader utilization of LLM in application scenarios where high validity of planning is essential. The work is open-sourced at https://github.com/agiresearch/Formal-LLM.
comment: 21 pages, 6 figures; comments and suggestions are welcome
♻ ☆ Do LLMs Play Dice? Exploring Probability Distribution Sampling in Large Language Models for Behavioral Simulation
With the rapid advancement of large language models (LLMs) for handling complex language tasks, an increasing number of studies are employing LLMs as agents to emulate the sequential decision-making processes of humans often represented as Markov decision-making processes (MDPs). The actions in MDPs adhere to specific probability distributions and require iterative sampling. This arouses curiosity regarding the capacity of LLM agents to comprehend probability distributions, thereby guiding the agent's behavioral decision-making through probabilistic sampling and generating behavioral sequences. To answer the above question, we divide the problem into two main aspects: sequence simulation with known probability distribution and sequence simulation with unknown probability distribution. Our analysis indicates that LLM agents can understand probabilities, but they struggle with probability sampling. Their ability to perform probabilistic sampling can be improved to some extent by integrating coding tools, but this level of sampling precision still makes it difficult to simulate human behavior as agents.
♻ ☆ CrossVoice: Crosslingual Prosody Preserving Cascade-S2ST using Transfer Learning ICLR 2024
This paper presents CrossVoice, a novel cascade-based Speech-to-Speech Translation (S2ST) system employing advanced ASR, MT, and TTS technologies with cross-lingual prosody preservation through transfer learning. We conducted comprehensive experiments comparing CrossVoice with direct-S2ST systems, showing improved BLEU scores on tasks such as Fisher Es-En, VoxPopuli Fr-En and prosody preservation on benchmark datasets CVSS-T and IndicTTS. With an average mean opinion score of 3.75 out of 4, speech synthesized by CrossVoice closely rivals human speech on the benchmark, highlighting the efficacy of cascade-based systems and transfer learning in multilingual S2ST with prosody transfer.
comment: 8 pages, Accepted at ICLR 2024 - Tiny Track
♻ ☆ Multilingual Prosody Transfer: Comparing Supervised & Transfer Learning ICLR 2024
The field of prosody transfer in speech synthesis systems is rapidly advancing. This research is focused on evaluating learning methods for adapting pre-trained monolingual text-to-speech (TTS) models to multilingual conditions, i.e., Supervised Fine-Tuning (SFT) and Transfer Learning (TL). This comparison utilizes three distinct metrics: Mean Opinion Score (MOS), Recognition Accuracy (RA), and Mel Cepstral Distortion (MCD). Results demonstrate that, in comparison to SFT, TL leads to significantly enhanced performance, with an average MOS higher by 1.53 points, a 37.5% increase in RA, and approximately a 7.8-point improvement in MCD. These findings are instrumental in helping build TTS models for low-resource languages.
comment: 7 pages, Accepted to ICLR 2024 - Tiny Track
♻ ☆ How Susceptible are Large Language Models to Ideological Manipulation?
Large Language Models (LLMs) possess the potential to exert substantial influence on public perceptions and interactions with information. This raises concerns about the societal impact that could arise if the ideologies within these models can be easily manipulated. In this work, we investigate how effectively LLMs can learn and generalize ideological biases from their instruction-tuning data. Our findings reveal a concerning vulnerability: exposure to only a small amount of ideologically driven samples significantly alters the ideology of LLMs. Notably, LLMs demonstrate a startling ability to absorb ideology from one topic and generalize it to even unrelated ones. The ease with which LLMs' ideologies can be skewed underscores the risks associated with intentionally poisoned training data by malicious actors or inadvertently introduced biases by data annotators. It also emphasizes the imperative for robust safeguards to mitigate the influence of ideological manipulations on LLMs.
Machine Learning 150
☆ Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts
Reinforcement learning from human feedback (RLHF) has emerged as the primary method for aligning large language models (LLMs) with human preferences. The RLHF process typically starts by training a reward model (RM) using human preference data. Conventional RMs are trained on pairwise responses to the same user request, with relative ratings indicating which response humans prefer. The trained RM serves as a proxy for human preferences. However, due to the black-box nature of RMs, their outputs lack interpretability, as humans cannot intuitively understand why an RM thinks a response is good or not. As RMs act as human preference proxies, we believe they should be human-interpretable to ensure that their internal decision processes are consistent with human preferences and to prevent reward hacking in LLM alignment. To build RMs with interpretable preferences, we propose a two-stage approach: i) train an Absolute-Rating Multi-Objective Reward Model (ArmoRM) with multi-dimensional absolute-rating data, each dimension corresponding to a human-interpretable objective (e.g., honesty, verbosity, safety); ii) employ a Mixture-of-Experts (MoE) strategy with a gating network that automatically selects the most suitable reward objectives based on the context. We efficiently trained an ArmoRM with Llama-3 8B and a gating network consisting of a shallow MLP on top of the ArmoRM. Our trained model, ArmoRM-Llama3-8B, obtains state-of-the-art performance on RewardBench, a benchmark evaluating RMs for language modeling. Notably, the performance of our model surpasses the LLM-as-a-judge method with GPT-4 judges by a margin, and approaches the performance of the much larger Nemotron-4 340B reward model.
comment: Technical report v1. Code and model are released at https://github.com/RLHFlow/RLHF-Reward-Modeling/
☆ Synergizing Foundation Models and Federated Learning: A Survey
The recent development of Foundation Models (FMs), represented by large language models, vision transformers, and multimodal models, has been making a significant impact on both academia and industry. Compared with small-scale models, FMs have a much stronger demand for high-volume data during the pre-training phase. Although general FMs can be pre-trained on data collected from open sources such as the Internet, domain-specific FMs need proprietary data, posing a practical challenge regarding the amount of data available due to privacy concerns. Federated Learning (FL) is a collaborative learning paradigm that breaks the barrier of data availability from different participants. Therefore, it provides a promising solution to customize and adapt FMs to a wide range of domain-specific tasks using distributed datasets whilst preserving privacy. This survey paper discusses the potentials and challenges of synergizing FL and FMs and summarizes core techniques, future directions, and applications. A periodically updated paper collection on FM-FL is available at https://github.com/lishenghui/awesome-fm-fl.
☆ Can Go AIs be adversarially robust?
Prior work found that superhuman Go AIs like KataGo can be defeated by simple adversarial strategies. In this paper, we study if simple defenses can improve KataGo's worst-case performance. We test three natural defenses: adversarial training on hand-constructed positions, iterated adversarial training, and changing the network architecture. We find that some of these defenses are able to protect against previously discovered attacks. Unfortunately, we also find that none of these defenses are able to withstand adaptive attacks. In particular, we are able to train new adversaries that reliably defeat our defended agents by causing them to blunder in ways humans would not. Our results suggest that building robust AI systems is challenging even in narrow domains such as Go. For interactive examples of attacks and a link to our codebase, see https://goattack.far.ai.
comment: 67 pages
☆ Demystifying Higher-Order Graph Neural Networks
Higher-order graph neural networks (HOGNNs) are an important class of GNN models that harness polyadic relations between vertices beyond plain edges. They have been used to eliminate issues such as over-smoothing or over-squashing, to significantly enhance the accuracy of GNN predictions, to improve the expressiveness of GNN architectures, and for numerous other goals. A plethora of HOGNN models have been introduced, and they come with diverse neural architectures, and even with different notions of what the "higher-order" means. This richness makes it very challenging to appropriately analyze and compare HOGNN models, and to decide in what scenario to use specific ones. To alleviate this, we first design an in-depth taxonomy and a blueprint for HOGNNs. This facilitates designing models that maximize performance. Then, we use our taxonomy to analyze and compare the available HOGNN models. The outcomes of our analysis are synthesized in a set of insights that help to select the most beneficial GNN model in a given scenario, and a comprehensive list of challenges and opportunities for further research into more powerful HOGNNs.
☆ Evaluating the design space of diffusion-based generative models
Most existing theoretical investigations of the accuracy of diffusion models, albeit significant, assume the score function has been approximated to a certain accuracy, and then use this a priori bound to control the error of generation. This article instead provides a first quantitative understanding of the whole generation process, i.e., both training and sampling. More precisely, it conducts a non-asymptotic convergence analysis of denoising score matching under gradient descent. In addition, a refined sampling error analysis for variance exploding models is also provided. The combination of these two results yields a full error analysis, which elucidates (again, but this time theoretically) how to design the training and sampling processes for effective generation. For instance, our theory implies a preference toward noise distribution and loss weighting that qualitatively agree with the ones used in [Karras et al. 2022]. It also provides some perspectives on why the time and variance schedule used in [Karras et al. 2022] could be better tuned than the pioneering version in [Song et al. 2020].
comment: Comments are welcome
☆ LayerMerge: Neural Network Depth Compression through Layer Pruning and Merging ICML 2024
Recent works show that reducing the number of layers in a convolutional neural network can enhance efficiency while maintaining the performance of the network. Existing depth compression methods remove redundant non-linear activation functions and merge the consecutive convolution layers into a single layer. However, these methods suffer from a critical drawback; the kernel size of the merged layers becomes larger, significantly undermining the latency reduction gained from reducing the depth of the network. We show that this problem can be addressed by jointly pruning convolution layers and activation functions. To this end, we propose LayerMerge, a novel depth compression method that selects which activation layers and convolution layers to remove, to achieve a desired inference speed-up while minimizing performance loss. Since the corresponding selection problem involves an exponential search space, we formulate a novel surrogate optimization problem and efficiently solve it via dynamic programming. Empirical results demonstrate that our method consistently outperforms existing depth compression and layer pruning methods on various network architectures, both on image classification and generation tasks. We release the code at https://github.com/snu-mllab/LayerMerge.
comment: ICML 2024
☆ Influence Maximization via Graph Neural Bandits KDD
We consider a ubiquitous scenario in the study of Influence Maximization (IM), in which there is limited knowledge about the topology of the diffusion network. We set the IM problem in a multi-round diffusion campaign, aiming to maximize the number of distinct users that are influenced. Leveraging the capability of bandit algorithms to effectively balance the objectives of exploration and exploitation, as well as the expressivity of neural networks, our study explores the application of neural bandit algorithms to the IM problem. We propose the framework IM-GNB (Influence Maximization with Graph Neural Bandits), where we provide an estimate of the users' probabilities of being influenced by influencers (also known as diffusion seeds). This initial estimate forms the basis for constructing both an exploitation graph and an exploration one. Subsequently, IM-GNB handles the exploration-exploitation tradeoff, by selecting seed nodes in real-time using Graph Convolutional Networks (GCN), in which the pre-estimated graphs are employed to refine the influencers' estimated rewards in each contextual setting. Through extensive experiments on two large real-world datasets, we demonstrate the effectiveness of IM-GNB compared with other baseline methods, significantly improving the spread outcome of such diffusion campaigns, when the underlying network is unknown.
comment: To appear at the 2024 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD)
☆ LaMDA: Large Model Fine-Tuning via Spectrally Decomposed Low-Dimensional Adaptation
Low-rank adaptation (LoRA) has become the default approach to fine-tune large language models (LLMs) due to its significant reduction in trainable parameters. However, trainable parameter demand for LoRA increases with increasing model embedding dimensions, leading to high compute costs. Additionally, its backward updates require storing high-dimensional intermediate activations and optimizer states, demanding high peak GPU memory. In this paper, we introduce large model fine-tuning via spectrally decomposed low-dimensional adaptation (LaMDA), a novel approach to fine-tuning large language models, which leverages low-dimensional adaptation to achieve significant reductions in trainable parameters and peak GPU memory footprint. LaMDA freezes a first projection matrix (PMA) in the adaptation path while introducing a low-dimensional trainable square matrix, resulting in substantial reductions in trainable parameters and peak GPU memory usage. LaMDA gradually freezes a second projection matrix (PMB) during the early fine-tuning stages, reducing the compute cost associated with weight updates to enhance parameter efficiency further. We also present an enhancement, LaMDA++, incorporating a ``lite-weight" adaptive rank allocation for the LoRA path via normalized spectrum analysis of pre-trained model weights. We evaluate LaMDA/LaMDA++ across various tasks, including natural language understanding with the GLUE benchmark, text summarization, natural language generation, and complex reasoning on different LLMs. Results show that LaMDA matches or surpasses the performance of existing alternatives while requiring up to 17.7x fewer parameter updates and up to 1.32x lower peak GPU memory usage during fine-tuning. Code will be publicly available.
☆ Neural Approximate Mirror Maps for Constrained Diffusion Models
Diffusion models excel at creating visually-convincing images, but they often struggle to meet subtle constraints inherent in the training data. Such constraints could be physics-based (e.g., satisfying a PDE), geometric (e.g., respecting symmetry), or semantic (e.g., including a particular number of objects). When the training data all satisfy a certain constraint, enforcing this constraint on a diffusion model not only improves its distribution-matching accuracy but also makes it more reliable for generating valid synthetic data and solving constrained inverse problems. However, existing methods for constrained diffusion models are inflexible with different types of constraints. Recent work proposed to learn mirror diffusion models (MDMs) in an unconstrained space defined by a mirror map and to impose the constraint with an inverse mirror map, but analytical mirror maps are challenging to derive for complex constraints. We propose neural approximate mirror maps (NAMMs) for general constraints. Our approach only requires a differentiable distance function from the constraint set. We learn an approximate mirror map that pushes data into an unconstrained space and a corresponding approximate inverse that maps data back to the constraint set. A generative model, such as an MDM, can then be trained in the learned mirror space and its samples restored to the constraint set by the inverse map. We validate our approach on a variety of constraints, showing that compared to an unconstrained diffusion model, a NAMM-based MDM substantially improves constraint satisfaction. We also demonstrate how existing diffusion-based inverse-problem solvers can be easily applied in the learned mirror space to solve constrained inverse problems.
☆ Privacy Preserving Federated Learning in Medical Imaging with Uncertainty Estimation
Machine learning (ML) and Artificial Intelligence (AI) have fueled remarkable advancements, particularly in healthcare. Within medical imaging, ML models hold the promise of improving disease diagnoses, treatment planning, and post-treatment monitoring. Various computer vision tasks like image classification, object detection, and image segmentation are poised to become routine in clinical analysis. However, privacy concerns surrounding patient data hinder the assembly of large training datasets needed for developing and training accurate, robust, and generalizable models. Federated Learning (FL) emerges as a compelling solution, enabling organizations to collaborate on ML model training by sharing model training information (gradients) rather than data (e.g., medical images). FL's distributed learning framework facilitates inter-institutional collaboration while preserving patient privacy. However, FL, while robust in privacy preservation, faces several challenges. Sensitive information can still be gleaned from shared gradients that are passed on between organizations during model training. Additionally, in medical imaging, quantifying model confidence\uncertainty accurately is crucial due to the noise and artifacts present in the data. Uncertainty estimation in FL encounters unique hurdles due to data heterogeneity across organizations. This paper offers a comprehensive review of FL, privacy preservation, and uncertainty estimation, with a focus on medical imaging. Alongside a survey of current research, we identify gaps in the field and suggest future directions for FL research to enhance privacy and address noisy medical imaging data challenges.
comment: 31 pages, 5 figures, 3 tables, Journal preprint
☆ Adversarial Attacks on Multimodal Agents
Vision-enabled language models (VLMs) are now used to build autonomous multimodal agents capable of taking actions in real environments. In this paper, we show that multimodal agents raise new safety risks, even though attacking agents is more challenging than prior attacks due to limited access to and knowledge about the environment. Our attacks use adversarial text strings to guide gradient-based perturbation over one trigger image in the environment: (1) our captioner attack attacks white-box captioners if they are used to process images into captions as additional inputs to the VLM; (2) our CLIP attack attacks a set of CLIP models jointly, which can transfer to proprietary VLMs. To evaluate the attacks, we curated VisualWebArena-Adv, a set of adversarial tasks based on VisualWebArena, an environment for web-based multimodal agent tasks. Within an L-infinity norm of $16/256$ on a single image, the captioner attack can make a captioner-augmented GPT-4V agent execute the adversarial goals with a 75% success rate. When we remove the captioner or use GPT-4V to generate its own captions, the CLIP attack can achieve success rates of 21% and 43%, respectively. Experiments on agents based on other VLMs, such as Gemini-1.5, Claude-3, and GPT-4o, show interesting differences in their robustness. Further analysis reveals several key factors contributing to the attack's success, and we also discuss the implications for defenses as well. Project page: https://chenwu.io/attack-agent Code and data: https://github.com/ChenWu98/agent-attack
comment: 19 pages
☆ Graph Neural Networks in Histopathology: Emerging Trends and Future Directions
Histopathological analysis of Whole Slide Images (WSIs) has seen a surge in the utilization of deep learning methods, particularly Convolutional Neural Networks (CNNs). However, CNNs often fall short in capturing the intricate spatial dependencies inherent in WSIs. Graph Neural Networks (GNNs) present a promising alternative, adept at directly modeling pairwise interactions and effectively discerning the topological tissue and cellular structures within WSIs. Recognizing the pressing need for deep learning techniques that harness the topological structure of WSIs, the application of GNNs in histopathology has experienced rapid growth. In this comprehensive review, we survey GNNs in histopathology, discuss their applications, and exploring emerging trends that pave the way for future advancements in the field. We begin by elucidating the fundamentals of GNNs and their potential applications in histopathology. Leveraging quantitative literature analysis, we identify four emerging trends: Hierarchical GNNs, Adaptive Graph Structure Learning, Multimodal GNNs, and Higher-order GNNs. Through an in-depth exploration of these trends, we offer insights into the evolving landscape of GNNs in histopathological analysis. Based on our findings, we propose future directions to propel the field forward. Our analysis serves to guide researchers and practitioners towards innovative approaches and methodologies, fostering advancements in histopathological analysis through the lens of graph neural networks.
☆ Probabilistic Temporal Prediction of Continuous Disease Trajectories and Treatment Effects Using Neural SDEs
Personalized medicine based on medical images, including predicting future individualized clinical disease progression and treatment response, would have an enormous impact on healthcare and drug development, particularly for diseases (e.g. multiple sclerosis (MS)) with long term, complex, heterogeneous evolutions and no cure. In this work, we present the first stochastic causal temporal framework to model the continuous temporal evolution of disease progression via Neural Stochastic Differential Equations (NSDE). The proposed causal inference model takes as input the patient's high dimensional images (MRI) and tabular data, and predicts both factual and counterfactual progression trajectories on different treatments in latent space. The NSDE permits the estimation of high-confidence personalized trajectories and treatment effects. Extensive experiments were performed on a large, multi-centre, proprietary dataset of patient 3D MRI and clinical data acquired during several randomized clinical trials for MS treatments. Our results present the first successful uncertainty-based causal Deep Learning (DL) model to: (a) accurately predict future patient MS disability evolution (e.g. EDSS) and treatment effects leveraging baseline MRI, and (b) permit the discovery of subgroups of patients for which the model has high confidence in their response to treatment even in clinical trials which did not reach their clinical endpoints.
☆ Scalable Rule Lists Learning with Sampling KDD 2024
Learning interpretable models has become a major focus of machine learning research, given the increasing prominence of machine learning in socially important decision-making. Among interpretable models, rule lists are among the best-known and easily interpretable ones. However, finding optimal rule lists is computationally challenging, and current approaches are impractical for large datasets. We present a novel and scalable approach to learn nearly optimal rule lists from large datasets. Our algorithm uses sampling to efficiently obtain an approximation of the optimal rule list with rigorous guarantees on the quality of the approximation. In particular, our algorithm guarantees to find a rule list with accuracy very close to the optimal rule list when a rule list with high accuracy exists. Our algorithm builds on the VC-dimension of rule lists, for which we prove novel upper and lower bounds. Our experimental evaluation on large datasets shows that our algorithm identifies nearly optimal rule lists with a speed-up up to two orders of magnitude over state-of-the-art exact approaches. Moreover, our algorithm is as fast as, and sometimes faster than, recent heuristic approaches, while reporting higher quality rule lists. In addition, the rules reported by our algorithm are more similar to the rules in the optimal rule list than the rules from heuristic approaches.
comment: Accepted to KDD 2024
☆ The Limits of Pure Exploration in POMDPs: When the Observation Entropy is Enough
The problem of pure exploration in Markov decision processes has been cast as maximizing the entropy over the state distribution induced by the agent's policy, an objective that has been extensively studied. However, little attention has been dedicated to state entropy maximization under partial observability, despite the latter being ubiquitous in applications, e.g., finance and robotics, in which the agent only receives noisy observations of the true state governing the system's dynamics. How can we address state entropy maximization in those domains? In this paper, we study the simple approach of maximizing the entropy over observations in place of true latent states. First, we provide lower and upper bounds to the approximation of the true state entropy that only depends on some properties of the observation function. Then, we show how knowledge of the latter can be exploited to compute a principled regularization of the observation entropy to improve performance. With this work, we provide both a flexible approach to bring advances in state entropy maximization to the POMDP setting and a theoretical characterization of its intrinsic limits.
☆ In-Context Learning of Energy Functions
In-context learning is a powerful capability of certain machine learning models that arguably underpins the success of today's frontier AI models. However, in-context learning is critically limited to settings where the in-context distribution of interest $p_{\theta}^{ICL}( x|\mathcal{D})$ can be straightforwardly expressed and/or parameterized by the model; for instance, language modeling relies on expressing the next-token distribution as a categorical distribution parameterized by the network's output logits. In this work, we present a more general form of in-context learning without such a limitation that we call \textit{in-context learning of energy functions}. The idea is to instead learn the unconstrained and arbitrary in-context energy function $E_{\theta}^{ICL}(x|\mathcal{D})$ corresponding to the in-context distribution $p_{\theta}^{ICL}(x|\mathcal{D})$. To do this, we use classic ideas from energy-based modeling. We provide preliminary evidence that our method empirically works on synthetic data. Interestingly, our work contributes (to the best of our knowledge) the first example of in-context learning where the input space and output space differ from one another, suggesting that in-context learning is a more-general capability than previously realized.
comment: Proceedings of the 1st Workshop on In-Context Learning at the 41st International Conference on Machine Learning, Vienna, Austria. 2024. arXiv admin note: text overlap with arXiv:2402.10202
☆ Towards Exact Gradient-based Training on Analog In-memory Computing
Given the high economic and environmental costs of using large vision or language models, analog in-memory accelerators present a promising solution for energy-efficient AI. While inference on analog accelerators has been studied recently, the training perspective is underexplored. Recent studies have shown that the "workhorse" of digital AI training - stochastic gradient descent (SGD) algorithm converges inexactly when applied to model training on non-ideal devices. This paper puts forth a theoretical foundation for gradient-based training on analog devices. We begin by characterizing the non-convergent issue of SGD, which is caused by the asymmetric updates on the analog devices. We then provide a lower bound of the asymptotic error to show that there is a fundamental performance limit of SGD-based analog training rather than an artifact of our analysis. To address this issue, we study a heuristic analog algorithm called Tiki-Taka that has recently exhibited superior empirical performance compared to SGD and rigorously show its ability to exactly converge to a critical point and hence eliminates the asymptotic error. The simulations verify the correctness of the analyses.
comment: 10 pages, 5 figures,2 tables
☆ First-Order Methods for Linearly Constrained Bilevel Optimization
Algorithms for bilevel optimization often encounter Hessian computations, which are prohibitive in high dimensions. While recent works offer first-order methods for unconstrained bilevel problems, the constrained setting remains relatively underexplored. We present first-order linearly constrained optimization methods with finite-time hypergradient stationarity guarantees. For linear equality constraints, we attain $\epsilon$-stationarity in $\widetilde{O}(\epsilon^{-2})$ gradient oracle calls, which is nearly-optimal. For linear inequality constraints, we attain $(\delta,\epsilon)$-Goldstein stationarity in $\widetilde{O}(d{\delta^{-1} \epsilon^{-3}})$ gradient oracle calls, where $d$ is the upper-level dimension. Finally, we obtain for the linear inequality setting dimension-free rates of $\widetilde{O}({\delta^{-1} \epsilon^{-4}})$ oracle complexity under the additional assumption of oracle access to the optimal dual variable. Along the way, we develop new nonsmooth nonconvex optimization methods with inexact oracles. We verify these guarantees with preliminary numerical experiments.
☆ Quasi-Bayes meets Vines
Recently proposed quasi-Bayesian (QB) methods initiated a new era in Bayesian computation by directly constructing the Bayesian predictive distribution through recursion, removing the need for expensive computations involved in sampling the Bayesian posterior distribution. This has proved to be data-efficient for univariate predictions, but extensions to multiple dimensions rely on a conditional decomposition resulting from predefined assumptions on the kernel of the Dirichlet Process Mixture Model, which is the implicit nonparametric model used. Here, we propose a different way to extend Quasi-Bayesian prediction to high dimensions through the use of Sklar's theorem by decomposing the predictive distribution into one-dimensional predictive marginals and a high-dimensional copula. Thus, we use the efficient recursive QB construction for the one-dimensional marginals and model the dependence using highly expressive vine copulas. Further, we tune hyperparameters using robust divergences (eg. energy score) and show that our proposed Quasi-Bayesian Vine (QB-Vine) is a fully non-parametric density estimator with \emph{an analytical form} and convergence rate independent of the dimension of data in some situations. Our experiments illustrate that the QB-Vine is appropriate for high dimensional distributions ($\sim$64), needs very few samples to train ($\sim$200) and outperforms state-of-the-art methods with analytical forms for density estimation and supervised tasks by a considerable margin.
comment: 36 pages, 2 figures
☆ Implicit Bias of Mirror Flow on Separable Data
We examine the continuous-time counterpart of mirror descent, namely mirror flow, on classification problems which are linearly separable. Such problems are minimised `at infinity' and have many possible solutions; we study which solution is preferred by the algorithm depending on the mirror potential. For exponential tailed losses and under mild assumptions on the potential, we show that the iterates converge in direction towards a $\phi_\infty$-maximum margin classifier. The function $\phi_\infty$ is the $\textit{horizon function}$ of the mirror potential and characterises its shape `at infinity'. When the potential is separable, a simple formula allows to compute this function. We analyse several examples of potentials and provide numerical experiments highlighting our results.
☆ Unsupervised explainable activity prediction in competitive Nordic Walking from experimental data
Artificial Intelligence (AI) has found application in Human Activity Recognition (HAR) in competitive sports. To date, most Machine Learning (ML) approaches for HAR have relied on offline (batch) training, imposing higher computational and tagging burdens compared to online processing unsupervised approaches. Additionally, the decisions behind traditional ML predictors are opaque and require human interpretation. In this work, we apply an online processing unsupervised clustering approach based on low-cost wearable Inertial Measurement Units (IMUs). The outcomes generated by the system allow for the automatic expansion of limited tagging available (e.g., by referees) within those clusters, producing pertinent information for the explainable classification stage. Specifically, our work focuses on achieving automatic explainability for predictions related to athletes' activities, distinguishing between correct, incorrect, and cheating practices in Nordic Walking. The proposed solution achieved performance metrics of close to 100 % on average.
☆ GFM4MPM: Towards Geospatial Foundation Models for Mineral Prospectivity Mapping
Machine Learning (ML) for Mineral Prospectivity Mapping (MPM) remains a challenging problem as it requires the analysis of associations between large-scale multi-modal geospatial data and few historical mineral commodity observations (positive labels). Recent MPM works have explored Deep Learning (DL) as a modeling tool with more representation capacity. However, these overparameterized methods may be more prone to overfitting due to their reliance on scarce labeled data. While a large quantity of unlabeled geospatial data exists, no prior MPM works have considered using such information in a self-supervised manner. Our MPM approach uses a masked image modeling framework to pretrain a backbone neural network in a self-supervised manner using unlabeled geospatial data alone. After pretraining, the backbone network provides feature extraction for downstream MPM tasks. We evaluated our approach alongside existing methods to assess mineral prospectivity of Mississippi Valley Type (MVT) and Clastic-Dominated (CD) Lead-Zinc deposits in North America and Australia. Our results demonstrate that self-supervision promotes robustness in learned features, improving prospectivity predictions. Additionally, we leverage explainable artificial intelligence techniques to demonstrate that individual predictions can be interpreted from a geological perspective.
comment: 12 pages, 16 figures, 7 tables
☆ Extracting Training Data from Unconditional Diffusion Models
As diffusion probabilistic models (DPMs) are being employed as mainstream models for generative artificial intelligence (AI), the study of their memorization of the raw training data has attracted growing attention. Existing works in this direction aim to establish an understanding of whether or to what extent DPMs learn by memorization. Such an understanding is crucial for identifying potential risks of data leakage and copyright infringement in diffusion models and, more importantly, for more controllable generation and trustworthy application of Artificial Intelligence Generated Content (AIGC). While previous works have made important observations of when DPMs are prone to memorization, these findings are mostly empirical, and the developed data extraction methods only work for conditional diffusion models. In this work, we aim to establish a theoretical understanding of memorization in DPMs with 1) a memorization metric for theoretical analysis, 2) an analysis of conditional memorization with informative and random labels, and 3) two better evaluation metrics for measuring memorization. Based on the theoretical analysis, we further propose a novel data extraction method called \textbf{Surrogate condItional Data Extraction (SIDE)} that leverages a classifier trained on generated data as a surrogate condition to extract training data directly from unconditional diffusion models. Our empirical results demonstrate that SIDE can extract training data from diffusion models where previous methods fail, and it is on average over 50\% more effective across different scales of the CelebA dataset.
☆ TSI-Bench: Benchmarking Time Series Imputation
Effective imputation is a crucial preprocessing step for time series analysis. Despite the development of numerous deep learning algorithms for time series imputation, the community lacks standardized and comprehensive benchmark platforms to effectively evaluate imputation performance across different settings. Moreover, although many deep learning forecasting algorithms have demonstrated excellent performance, whether their modeling achievements can be transferred to time series imputation tasks remains unexplored. To bridge these gaps, we develop TSI-Bench, the first (to our knowledge) comprehensive benchmark suite for time series imputation utilizing deep learning techniques. The TSI-Bench pipeline standardizes experimental settings to enable fair evaluation of imputation algorithms and identification of meaningful insights into the influence of domain-appropriate missingness ratios and patterns on model performance. Furthermore, TSI-Bench innovatively provides a systematic paradigm to tailor time series forecasting algorithms for imputation purposes. Our extensive study across 34,804 experiments, 28 algorithms, and 8 datasets with diverse missingness scenarios demonstrates TSI-Bench's effectiveness in diverse downstream tasks and potential to unlock future directions in time series imputation research and analysis. The source code and experiment logs are available at https://github.com/WenjieDu/AwesomeImputation.
☆ Automatic generation of insights from workers' actions in industrial workflows with explainable Machine Learning
New technologies such as Machine Learning (ML) gave great potential for evaluating industry workflows and automatically generating key performance indicators (KPIs). However, despite established standards for measuring the efficiency of industrial machinery, there is no precise equivalent for workers' productivity, which would be highly desirable given the lack of a skilled workforce for the next generation of industry workflows. Therefore, an ML solution combining data from manufacturing processes and workers' performance for that goal is required. Additionally, in recent times intense effort has been devoted to explainable ML approaches that can automatically explain their decisions to a human operator, thus increasing their trustworthiness. We propose to apply explainable ML solutions to differentiate between expert and inexpert workers in industrial workflows, which we validate at a quality assessment industrial workstation. Regarding the methodology used, input data are captured by a manufacturing machine and stored in a NoSQL database. Data are processed to engineer features used in automatic classification and to compute workers' KPIs to predict their level of expertise (with all classification metrics exceeding 90 %). These KPIs, and the relevant features in the decisions are textually explained by natural language expansion on an explainability dashboard. These automatic explanations made it possible to infer knowledge from expert workers for inexpert workers. The latter illustrates the interest of research in self-explainable ML for automatically generating insights to improve productivity in industrial workflows.
comment: IEEE Industrial Electronics Magazine (2023)
☆ Predicting the energetic proton flux with a machine learning regression algorithm
The need of real-time of monitoring and alerting systems for Space Weather hazards has grown significantly in the last two decades. One of the most important challenge for space mission operations and planning is the prediction of solar proton events (SPEs). In this context, artificial intelligence and machine learning techniques have opened a new frontier, providing a new paradigm for statistical forecasting algorithms. The great majority of these models aim to predict the occurrence of a SPE, i.e., they are based on the classification approach. In this work we present a simple and efficient machine learning regression algorithm which is able to forecast the energetic proton flux up to 1 hour ahead by exploiting features derived from the electron flux only. This approach could be helpful to improve monitoring systems of the radiation risk in both deep space and near-Earth environments. The model is very relevant for mission operations and planning, especially when flare characteristics and source location are not available in real time, as at Mars distance.
☆ BIOSCAN-5M: A Multimodal Dataset for Insect Biodiversity
As part of an ongoing worldwide effort to comprehend and monitor insect biodiversity, this paper presents the BIOSCAN-5M Insect dataset to the machine learning community and establish several benchmark tasks. BIOSCAN-5M is a comprehensive dataset containing multi-modal information for over 5 million insect specimens, and it significantly expands existing image-based biological datasets by including taxonomic labels, raw nucleotide barcode sequences, assigned barcode index numbers, and geographical information. We propose three benchmark experiments to demonstrate the impact of the multi-modal data types on the classification and clustering accuracy. First, we pretrain a masked language model on the DNA barcode sequences of the \mbox{BIOSCAN-5M} dataset, and demonstrate the impact of using this large reference library on species- and genus-level classification performance. Second, we propose a zero-shot transfer learning task applied to images and DNA barcodes to cluster feature embeddings obtained from self-supervised learning, to investigate whether meaningful clusters can be derived from these representation embeddings. Third, we benchmark multi-modality by performing contrastive learning on DNA barcodes, image data, and taxonomic information. This yields a general shared embedding space enabling taxonomic classification using multiple types of information and modalities. The code repository of the BIOSCAN-5M Insect dataset is available at {\url{https://github.com/zahrag/BIOSCAN-5M}}
☆ Enhancing Spatio-temporal Quantile Forecasting with Curriculum Learning: Lessons Learned
Training models on spatio-temporal (ST) data poses an open problem due to the complicated and diverse nature of the data itself, and it is challenging to ensure the model's performance directly trained on the original ST data. While limiting the variety of training data can make training easier, it can also lead to a lack of knowledge and information for the model, resulting in a decrease in performance. To address this challenge, we presented an innovative paradigm that incorporates three separate forms of curriculum learning specifically targeting from spatial, temporal, and quantile perspectives. Furthermore, our framework incorporates a stacking fusion module to combine diverse information from three types of curriculum learning, resulting in a strong and thorough learning process. We demonstrated the effectiveness of this framework with extensive empirical evaluations, highlighting its better performance in addressing complex ST challenges. We provided thorough ablation studies to investigate the effectiveness of our curriculum and to explain how it contributes to the improvement of learning efficiency on ST data.
☆ SUPER: Selfie Undistortion and Head Pose Editing with Identity Preservation
Self-portraits captured from a short distance might look unnatural or even unattractive due to heavy distortions making facial features malformed, and ill-placed head poses. In this paper, we propose SUPER, a novel method of eliminating distortions and adjusting head pose in a close-up face crop. We perform 3D GAN inversion for a facial image by optimizing camera parameters and face latent code, which gives a generated image. Besides, we estimate depth from the obtained latent code, create a depth-induced 3D mesh, and render it with updated camera parameters to obtain a warped portrait. Finally, we apply the visibility-based blending so that visible regions are reprojected, and occluded parts are restored with a generative model. Experiments on face undistortion benchmarks and on our self-collected Head Rotation dataset (HeRo), show that SUPER outperforms previous approaches both qualitatively and quantitatively, opening new possibilities for photorealistic selfie editing.
☆ XXLTraffic: Expanding and Extremely Long Traffic Dataset for Ultra-Dynamic Forecasting Challenges
Traffic forecasting is crucial for smart cities and intelligent transportation initiatives, where deep learning has made significant progress in modeling complex spatio-temporal patterns in recent years. However, current public datasets have limitations in reflecting the ultra-dynamic nature of real-world scenarios, characterized by continuously evolving infrastructures, varying temporal distributions, and temporal gaps due to sensor downtimes or changes in traffic patterns. These limitations inevitably restrict the practical applicability of existing traffic forecasting datasets. To bridge this gap, we present XXLTraffic, the largest available public traffic dataset with the longest timespan and increasing number of sensor nodes over the multiple years observed in the data, curated to support research in ultra-dynamic forecasting. Our benchmark includes both typical time-series forecasting settings with hourly and daily aggregated data and novel configurations that introduce gaps and down-sample the training size to better simulate practical constraints. We anticipate the new XXLTraffic will provide a fresh perspective for the time-series and traffic forecasting communities. It would also offer a robust platform for developing and evaluating models designed to tackle ultra-dynamic and extremely long forecasting problems. Our dataset supplements existing spatio-temporal data resources and leads to new research directions in this domain.
☆ Spatial Sequence Attention Network for Schizophrenia Classification from Structural Brain MR Images
Schizophrenia is a debilitating, chronic mental disorder that significantly impacts an individual's cognitive abilities, behavior, and social interactions. It is characterized by subtle morphological changes in the brain, particularly in the gray matter. These changes are often imperceptible through manual observation, demanding an automated approach to diagnosis. This study introduces a deep learning methodology for the classification of individuals with Schizophrenia. We achieve this by implementing a diversified attention mechanism known as Spatial Sequence Attention (SSA) which is designed to extract and emphasize significant feature representations from structural MRI (sMRI). Initially, we employ the transfer learning paradigm by leveraging pre-trained DenseNet to extract initial feature maps from the final convolutional block which contains morphological alterations associated with Schizophrenia. These features are further processed by the proposed SSA to capture and emphasize intricate spatial interactions and relationships across volumes within the brain. Our experimental studies conducted on a clinical dataset have revealed that the proposed attention mechanism outperforms the existing Squeeze & Excitation Network for Schizophrenia classification.
comment: This paper has been accepted for the 21st IEEE International Symposium on Biomedical Imaging (ISBI 2024)
☆ Contraction rates for conjugate gradient and Lanczos approximate posteriors in Gaussian process regression
Due to their flexibility and theoretical tractability Gaussian process (GP) regression models have become a central topic in modern statistics and machine learning. While the true posterior in these models is given explicitly, numerical evaluations depend on the inversion of the augmented kernel matrix $ K + \sigma^2 I $, which requires up to $ O(n^3) $ operations. For large sample sizes n, which are typically given in modern applications, this is computationally infeasible and necessitates the use of an approximate version of the posterior. Although such methods are widely used in practice, they typically have very limtied theoretical underpinning. In this context, we analyze a class of recently proposed approximation algorithms from the field of Probabilistic numerics. They can be interpreted in terms of Lanczos approximate eigenvectors of the kernel matrix or a conjugate gradient approximation of the posterior mean, which are particularly advantageous in truly large scale applications, as they are fundamentally only based on matrix vector multiplications amenable to the GPU acceleration of modern software frameworks. We combine result from the numerical analysis literature with state of the art concentration results for spectra of kernel matrices to obtain minimax contraction rates. Our theoretical findings are illustrated by numerical experiments.
☆ Stealth edits for provably fixing or attacking large language models
We reveal new methods and the theoretical foundations of techniques for editing large language models. We also show how the new theory can be used to assess the editability of models and to expose their susceptibility to previously unknown malicious attacks. Our theoretical approach shows that a single metric (a specific measure of the intrinsic dimensionality of the model's features) is fundamental to predicting the success of popular editing approaches, and reveals new bridges between disparate families of editing methods. We collectively refer to these approaches as stealth editing methods, because they aim to directly and inexpensively update a model's weights to correct the model's responses to known hallucinating prompts without otherwise affecting the model's behaviour, without requiring retraining. By carefully applying the insight gleaned from our theoretical investigation, we are able to introduce a new network block -- named a jet-pack block -- which is optimised for highly selective model editing, uses only standard network operations, and can be inserted into existing networks. The intrinsic dimensionality metric also determines the vulnerability of a language model to a stealth attack: a small change to a model's weights which changes its response to a single attacker-chosen prompt. Stealth attacks do not require access to or knowledge of the model's training data, therefore representing a potent yet previously unrecognised threat to redistributed foundation models. They are computationally simple enough to be implemented in malware in many cases. Extensive experimental results illustrate and support the method and its theoretical underpinnings. Demos and source code for editing language models are available at https://github.com/qinghua-zhou/stealth-edits.
comment: 24 pages, 9 figures. Open source implementation: https://github.com/qinghua-zhou/stealth-edits
☆ A Systematization of the Wagner Framework: Graph Theory Conjectures and Reinforcement Learning
In 2021, Adam Zsolt Wagner proposed an approach to disprove conjectures in graph theory using Reinforcement Learning (RL). Wagner's idea can be framed as follows: consider a conjecture, such as a certain quantity f(G) < 0 for every graph G; one can then play a single-player graph-building game, where at each turn the player decides whether to add an edge or not. The game ends when all edges have been considered, resulting in a certain graph G_T, and f(G_T) is the final score of the game; RL is then used to maximize this score. This brilliant idea is as simple as innovative, and it lends itself to systematic generalization. Several different single-player graph-building games can be employed, along with various RL algorithms. Moreover, RL maximizes the cumulative reward, allowing for step-by-step rewards instead of a single final score, provided the final cumulative reward represents the quantity of interest f(G_T). In this paper, we discuss these and various other choices that can be significant in Wagner's framework. As a contribution to this systematization, we present four distinct single-player graph-building games. Each game employs both a step-by-step reward system and a single final score. We also propose a principled approach to select the most suitable neural network architecture for any given conjecture, and introduce a new dataset of graphs labeled with their Laplacian spectra. Furthermore, we provide a counterexample for a conjecture regarding the sum of the matching number and the spectral radius, which is simpler than the example provided in Wagner's original paper. The games have been implemented as environments in the Gymnasium framework, and along with the dataset, are available as open-source supplementary materials.
☆ SCORE: A 1D Reparameterization Technique to Break Bayesian Optimization's Curse of Dimensionality
Bayesian optimization (BO) has emerged as a powerful tool for navigating complex search spaces, showcasing practical applications in the fields of science and engineering.However, since it typically relies on a surrogate model to approximate the objective function, BO grapples with heightened computational costs that tend to escalate as the number of parameters and experiments grows. Several methods such as parallelization, surrogate model approximations, and memory pruning have been proposed to cut down computing time, but they all fall short of resolving the core issue behind BO's curse of dimensionality. In this paper, a 1D reparametrization trick is proposed to break this curse and sustain linear time complexity for BO in high-dimensional landscapes. This fast and scalable approach named SCORE can successfully find the global minimum of needle-in-a-haystack optimization functions and fit real-world data without the high-performance computing resources typically required by state-of-the-art techniques.
☆ A variational Bayes approach to debiased inference for low-dimensional parameters in high-dimensional linear regression
We propose a scalable variational Bayes method for statistical inference for a single or low-dimensional subset of the coordinates of a high-dimensional parameter in sparse linear regression. Our approach relies on assigning a mean-field approximation to the nuisance coordinates and carefully modelling the conditional distribution of the target given the nuisance. This requires only a preprocessing step and preserves the computational advantages of mean-field variational Bayes, while ensuring accurate and reliable inference for the target parameter, including for uncertainty quantification. We investigate the numerical performance of our algorithm, showing that it performs competitively with existing methods. We further establish accompanying theoretical guarantees for estimation and uncertainty quantification in the form of a Bernstein--von Mises theorem.
comment: 46 pages, 5 figures
☆ Federated Learning with a Single Shared Image CVPR
Federated Learning (FL) enables multiple machines to collaboratively train a machine learning model without sharing of private training data. Yet, especially for heterogeneous models, a key bottleneck remains the transfer of knowledge gained from each client model with the server. One popular method, FedDF, uses distillation to tackle this task with the use of a common, shared dataset on which predictions are exchanged. However, in many contexts such a dataset might be difficult to acquire due to privacy and the clients might not allow for storage of a large shared dataset. To this end, in this paper, we introduce a new method that improves this knowledge distillation method to only rely on a single shared image between clients and server. In particular, we propose a novel adaptive dataset pruning algorithm that selects the most informative crops generated from only a single image. With this, we show that federated learning with distillation under a limited shared dataset budget works better by using a single image compared to multiple individual ones. Finally, we extend our approach to allow for training heterogeneous client architectures by incorporating a non-uniform distillation schedule and client-model mirroring on the server side.
comment: 8 Pages, 3 Figures, Appendix 4 Pages, CVPRW 2024
☆ Probabilistic Conceptual Explainers: Trustworthy Conceptual Explanations for Vision Foundation Models
Vision transformers (ViTs) have emerged as a significant area of focus, particularly for their capacity to be jointly trained with large language models and to serve as robust vision foundation models. Yet, the development of trustworthy explanation methods for ViTs has lagged, particularly in the context of post-hoc interpretations of ViT predictions. Existing sub-image selection approaches, such as feature-attribution and conceptual models, fall short in this regard. This paper proposes five desiderata for explaining ViTs -- faithfulness, stability, sparsity, multi-level structure, and parsimony -- and demonstrates the inadequacy of current methods in meeting these criteria comprehensively. We introduce a variational Bayesian explanation framework, dubbed ProbAbilistic Concept Explainers (PACE), which models the distributions of patch embeddings to provide trustworthy post-hoc conceptual explanations. Our qualitative analysis reveals the distributions of patch-level concepts, elucidating the effectiveness of ViTs by modeling the joint distribution of patch embeddings and ViT's predictions. Moreover, these patch-level explanations bridge the gap between image-level and dataset-level explanations, thus completing the multi-level structure of PACE. Through extensive experiments on both synthetic and real-world datasets, we demonstrate that PACE surpasses state-of-the-art methods in terms of the defined desiderata.
☆ Research and Implementation of Data Enhancement Techniques for Graph Neural Networks
Data, algorithms, and arithmetic power are the three foundational conditions for deep learning to be effective in the application domain. Data is the focus for developing deep learning algorithms. In practical engineering applications, some data are affected by the conditions under which more data cannot be obtained or the cost of obtaining data is too high, resulting in smaller data sets (generally several hundred to several thousand) and data sizes that are far smaller than the size of large data sets (tens of thousands). The above two methods are based on the original dataset to generate, in the case of insufficient data volume of the original data may not reflect all the real environment, such as the real environment of the light, silhouette and other information, if the amount of data is not enough, it is difficult to use a simple transformation or neural network generative model to generate the required data. The research in this paper firstly analyses the key points of the data enhancement technology of graph neural network, and at the same time introduces the composition foundation of graph neural network in depth, on the basis of which the data enhancement technology of graph neural network is optimized and analysed.
comment: 7 pages, 7 figures, to be published in IEEE International Conference on Artificial Intelligence and Electromechanical Automation
☆ Efficient and Long-Tailed Generalization for Pre-trained Vision-Language Model KDD 2024
Pre-trained vision-language models like CLIP have shown powerful zero-shot inference ability via image-text matching and prove to be strong few-shot learners in various downstream tasks. However, in real-world scenarios, adapting CLIP to downstream tasks may encounter the following challenges: 1) data may exhibit long-tailed data distributions and might not have abundant samples for all the classes; 2) There might be emerging tasks with new classes that contain no samples at all. To overcome them, we propose a novel framework to achieve efficient and long-tailed generalization, which can be termed as Candle. During the training process, we propose compensating logit-adjusted loss to encourage large margins of prototypes and alleviate imbalance both within the base classes and between the base and new classes. For efficient adaptation, we treat the CLIP model as a black box and leverage the extracted features to obtain visual and textual prototypes for prediction. To make full use of multi-modal information, we also propose cross-modal attention to enrich the features from both modalities. For effective generalization, we introduce virtual prototypes for new classes to make up for their lack of training images. Candle achieves state-of-the-art performance over extensive experiments on 11 diverse datasets while substantially reducing the training time, demonstrating the superiority of our approach. The source code is available at https://github.com/shijxcs/Candle.
comment: Accepted by KDD 2024
☆ Learning Diffusion at Lightspeed
Diffusion regulates a phenomenal number of natural processes and the dynamics of many successful generative models. Existing models to learn the diffusion terms from observational data rely on complex bilevel optimization problems and properly model only the drift of the system. We propose a new simple model, JKOnet*, which bypasses altogether the complexity of existing architectures while presenting significantly enhanced representational capacity: JKOnet* recovers the potential, interaction, and internal energy components of the underlying diffusion process. JKOnet* minimizes a simple quadratic loss, runs at lightspeed, and drastically outperforms other baselines in practice. Additionally, JKOnet* provides a closed-form optimal solution for linearly parametrized functionals. Our methodology is based on the interpretation of diffusion processes as energy-minimizing trajectories in the probability space via the so-called JKO scheme, which we study via its first-order optimality conditions, in light of few-weeks-old advancements in optimization in the probability space.
☆ When Are Bias-Free ReLU Networks Like Linear Networks? ICML 2024
We investigate the expressivity and learning dynamics of bias-free ReLU networks. We firstly show that two-layer bias-free ReLU networks have limited expressivity: the only odd function two-layer bias-free ReLU networks can express is a linear one. We then show that, under symmetry conditions on the data, these networks have the same learning dynamics as linear networks. This allows us to give closed-form time-course solutions to certain two-layer bias-free ReLU networks, which has not been done for nonlinear networks outside the lazy learning regime. While deep bias-free ReLU networks are more expressive than their two-layer counterparts, they still share a number of similarities with deep linear networks. These similarities enable us to leverage insights from linear networks, leading to a novel understanding of bias-free ReLU networks. Overall, our results show that some properties established for bias-free ReLU networks arise due to equivalence to linear networks, and suggest that including bias or considering asymmetric data are avenues to engage with nonlinear behaviors.
comment: HiLD Workshop at ICML 2024
☆ EUvsDisinfo: a Dataset for Multilingual Detection of Pro-Kremlin Disinformation in News Articles
This work introduces EUvsDisinfo, a multilingual dataset of trustworthy and disinformation articles related to pro-Kremlin themes. It is sourced directly from the debunk articles written by experts leading the EUvsDisinfo project. Our dataset is the largest to-date resource in terms of the overall number of articles and distinct languages. It also provides the largest topical and temporal coverage. Using this dataset, we investigate the dissemination of pro-Kremlin disinformation across different languages, uncovering language-specific patterns targeting specific disinformation topics. We further analyse the evolution of topic distribution over an eight-year period, noting a significant surge in disinformation content before the full-scale invasion of Ukraine in 2022. Lastly, we demonstrate the dataset's applicability in training models to effectively distinguish between disinformation and trustworthy content in multilingual settings.
comment: 4 pages, 3 figures, 2 tables
☆ Attack and Defense of Deep Learning Models in the Field of Web Attack Detection
The challenge of WAD (web attack detection) is growing as hackers continuously refine their methods to evade traditional detection. Deep learning models excel in handling complex unknown attacks due to their strong generalization and adaptability. However, they are vulnerable to backdoor attacks, where contextually irrelevant fragments are inserted into requests, compromising model stability. While backdoor attacks are well studied in image recognition, they are largely unexplored in WAD. This paper introduces backdoor attacks in WAD, proposing five methods and corresponding defenses. Testing on textCNN, biLSTM, and tinybert models shows an attack success rate over 87%, reducible through fine-tuning. Future research should focus on backdoor defenses in WAD. All the code and data of this paper can be obtained at https://anonymous.4open.science/r/attackDefenceinDL-7E05
comment: 26 pages, 4 figures
☆ Reinforcement-Learning based routing for packet-optical networks with hybrid telemetry
This article provides a methodology and open-source implementation of Reinforcement Learning algorithms for finding optimal routes in a packet-optical network scenario. The algorithm uses measurements provided by the physical layer (pre-FEC bit error rate and propagation delay) and the link layer (link load) to configure a set of latency-based rewards and penalties based on such measurements. Then, the algorithm executes Q-learning based on this set of rewards for finding the optimal routing strategies. It is further shown that the algorithm dynamically adapts to changing network conditions by re-calculating optimal policies upon either link load changes or link degradation as measured by pre-FEC BER.
☆ Generalization bounds for mixing processes via delayed online-to-PAC conversions
We study the generalization error of statistical learning algorithms in a non-i.i.d. setting, where the training data is sampled from a stationary mixing process. We develop an analytic framework for this scenario based on a reduction to online learning with delayed feedback. In particular, we show that the existence of an online learning algorithm with bounded regret (against a fixed statistical learning algorithm in a specially constructed game of online learning with delayed feedback) implies low generalization error of said statistical learning method even if the data sequence is sampled from a mixing time series. The rates demonstrate a trade-off between the amount of delay in the online learning game and the degree of dependence between consecutive data points, with near-optimal rates recovered in a number of well-studied settings when the delay is tuned appropriately as a function of the mixing time of the process.
PromptDSI: Prompt-based Rehearsal-free Instance-wise Incremental Learning for Document Retrieval
Differentiable Search Index (DSI) utilizes Pre-trained Language Models (PLMs) for efficient document retrieval without relying on external indexes. However, DSIs need full re-training to handle updates in dynamic corpora, causing significant computational inefficiencies. We introduce PromptDSI, a rehearsal-free, prompt-based approach for instance-wise incremental learning in document retrieval. PromptDSI attaches prompts to the frozen PLM's encoder of DSI, leveraging its powerful representation to efficiently index new corpora while maintaining a balance between stability and plasticity. We eliminate the initial forward pass of prompt-based continual learning methods that doubles training and inference time. Moreover, we propose a topic-aware prompt pool that employs neural topic embeddings as fixed keys. This strategy ensures diverse and effective prompt usage, addressing the challenge of parameter underutilization caused by the collapse of the query-key matching mechanism. Our empirical evaluations demonstrate that PromptDSI matches IncDSI in managing forgetting while significantly enhancing recall by over 4% on new corpora.
comment: 21 pages
☆ Discovering Minimal Reinforcement Learning Environments
Reinforcement learning (RL) agents are commonly trained and evaluated in the same environment. In contrast, humans often train in a specialized environment before being evaluated, such as studying a book before taking an exam. The potential of such specialized training environments is still vastly underexplored, despite their capacity to dramatically speed up training. The framework of synthetic environments takes a first step in this direction by meta-learning neural network-based Markov decision processes (MDPs). The initial approach was limited to toy problems and produced environments that did not transfer to unseen RL algorithms. We extend this approach in three ways: Firstly, we modify the meta-learning algorithm to discover environments invariant towards hyperparameter configurations and learning algorithms. Secondly, by leveraging hardware parallelism and introducing a curriculum on an agent's evaluation episode horizon, we can achieve competitive results on several challenging continuous control problems. Thirdly, we surprisingly find that contextual bandits enable training RL agents that transfer well to their evaluation environment, even if it is a complex MDP. Hence, we set up our experiments to train synthetic contextual bandits, which perform on par with synthetic MDPs, yield additional insights into the evaluation environment, and can speed up downstream applications.
comment: 10 pages, 7 figures
☆ UIFV: Data Reconstruction Attack in Vertical Federated Learning
Vertical Federated Learning (VFL) facilitates collaborative machine learning without the need for participants to share raw private data. However, recent studies have revealed privacy risks where adversaries might reconstruct sensitive features through data leakage during the learning process. Although data reconstruction methods based on gradient or model information are somewhat effective, they reveal limitations in VFL application scenarios. This is because these traditional methods heavily rely on specific model structures and/or have strict limitations on application scenarios. To address this, our study introduces the Unified InverNet Framework into VFL, which yields a novel and flexible approach (dubbed UIFV) that leverages intermediate feature data to reconstruct original data, instead of relying on gradients or model details. The intermediate feature data is the feature exchanged by different participants during the inference phase of VFL. Experiments on four datasets demonstrate that our methods significantly outperform state-of-the-art techniques in attack precision. Our work exposes severe privacy vulnerabilities within VFL systems that pose real threats to practical VFL applications and thus confirms the necessity of further enhancing privacy protection in the VFL architecture.
☆ Training Diffusion Models with Federated Learning
The training of diffusion-based models for image generation is predominantly controlled by a select few Big Tech companies, raising concerns about privacy, copyright, and data authority due to their lack of transparency regarding training data. To ad-dress this issue, we propose a federated diffusion model scheme that enables the independent and collaborative training of diffusion models without exposing local data. Our approach adapts the Federated Averaging (FedAvg) algorithm to train a Denoising Diffusion Model (DDPM). Through a novel utilization of the underlying UNet backbone, we achieve a significant reduction of up to 74% in the number of parameters exchanged during training,compared to the naive FedAvg approach, whilst simultaneously maintaining image quality comparable to the centralized setting, as evaluated by the FID score.
comment: Replacement of: http://resolver.tudelft.nl/uuid:49e11cf3-5a0a-40bc-9a62-1d7fe05fbe4d. Name of the algorithm has been changed slightly due to a name collision with another paper
☆ Mathador-LM: A Dynamic Benchmark for Mathematical Reasoning on Large Language Models
We introduce Mathador-LM, a new benchmark for evaluating the mathematical reasoning on large language models (LLMs), combining ruleset interpretation, planning, and problem-solving. This benchmark is inspired by the Mathador game, where the objective is to reach a target number using basic arithmetic operations on a given set of base numbers, following a simple set of rules. We show that, across leading LLMs, we obtain stable average performance while generating benchmark instances dynamically, following a target difficulty level. Thus, our benchmark alleviates concerns about test-set leakage into training data, an issue that often undermines popular benchmarks. Additionally, we conduct a comprehensive evaluation of both open and closed-source state-of-the-art LLMs on Mathador-LM. Our findings reveal that contemporary models struggle with Mathador-LM, scoring significantly lower than average 5th graders. This stands in stark contrast to their strong performance on popular mathematical reasoning benchmarks.
☆ MOYU: A Theoretical Study on Massive Over-activation Yielded Uplifts in LLMs
Massive Over-activation Yielded Uplifts(MOYU) is an inherent property of large language models, and dynamic activation(DA) based on the MOYU property is a clever yet under-explored strategy designed to accelerate inference in these models. Existing methods that utilize MOYU often face a significant 'Impossible Trinity': struggling to simultaneously maintain model performance, enhance inference speed, and extend applicability across various architectures. Due to the theoretical ambiguities surrounding MOYU, this paper elucidates the root cause of the MOYU property and outlines the mechanisms behind two primary limitations encountered by current DA methods: 1) history-related activation uncertainty, and 2) semantic-irrelevant activation inertia. Our analysis not only underscores the limitations of current dynamic activation strategies within large-scale LLaMA models but also proposes opportunities for refining the design of future sparsity schemes.
☆ Low-Resource Machine Translation through the Lens of Personalized Federated Learning
We present a new approach based on the Personalized Federated Learning algorithm MeritFed that can be applied to Natural Language Tasks with heterogeneous data. We evaluate it on the Low-Resource Machine Translation task, using the dataset from the Large-Scale Multilingual Machine Translation Shared Task (Small Track #2) and the subset of Sami languages from the multilingual benchmark for Finno-Ugric languages. In addition to its effectiveness, MeritFed is also highly interpretable, as it can be applied to track the impact of each language used for training. Our analysis reveals that target dataset size affects weight distribution across auxiliary languages, that unrelated languages do not interfere with the training, and auxiliary optimizer parameters have minimal impact. Our approach is easy to apply with a few lines of code, and we provide scripts for reproducing the experiments at https://github.com/VityaVitalich/MeritFed
comment: 18 pages, 7 figures
☆ A Super-human Vision-based Reinforcement Learning Agent for Autonomous Racing in Gran Turismo
Racing autonomous cars faster than the best human drivers has been a longstanding grand challenge for the fields of Artificial Intelligence and robotics. Recently, an end-to-end deep reinforcement learning agent met this challenge in a high-fidelity racing simulator, Gran Turismo. However, this agent relied on global features that require instrumentation external to the car. This paper introduces, to the best of our knowledge, the first super-human car racing agent whose sensor input is purely local to the car, namely pixels from an ego-centric camera view and quantities that can be sensed from on-board the car, such as the car's velocity. By leveraging global features only at training time, the learned agent is able to outperform the best human drivers in time trial (one car on the track at a time) races using only local input features. The resulting agent is evaluated in Gran Turismo 7 on multiple tracks and cars. Detailed ablation experiments demonstrate the agent's strong reliance on visual inputs, making it the first vision-based super-human car racing agent.
comment: Accepted at the Reinforcement Learning Conference (RLC) 2024
☆ Bayesian Data Selection ICML 2024
A wide range of machine learning algorithms iteratively add data to the training sample. Examples include semi-supervised learning, active learning, multi-armed bandits, and Bayesian optimization. We embed this kind of data addition into decision theory by framing data selection as a decision problem. This paves the way for finding Bayes-optimal selections of data. For the illustrative case of self-training in semi-supervised learning, we derive the respective Bayes criterion. We further show that deploying this criterion mitigates the issue of confirmation bias by empirically assessing our method for generalized linear models, semi-parametric generalized additive models, and Bayesian neural networks on simulated and real-world data.
comment: 5th Workshop on Data-Centric Machine Learning Research (DMLR) at ICML 2024
☆ Offline Imitation Learning with Model-based Reverse Augmentation KDD2024
In offline Imitation Learning (IL), one of the main challenges is the \textit{covariate shift} between the expert observations and the actual distribution encountered by the agent, because it is difficult to determine what action an agent should take when outside the state distribution of the expert demonstrations. Recently, the model-free solutions introduce the supplementary data and identify the latent expert-similar samples to augment the reliable samples during learning. Model-based solutions build forward dynamic models with conservatism quantification and then generate additional trajectories in the neighborhood of expert demonstrations. However, without reward supervision, these methods are often over-conservative in the out-of-expert-support regions, because only in states close to expert-observed states can there be a preferred action enabling policy optimization. To encourage more exploration on expert-unobserved states, we propose a novel model-based framework, called offline Imitation Learning with Self-paced Reverse Augmentation (SRA). Specifically, we build a reverse dynamic model from the offline demonstrations, which can efficiently generate trajectories leading to the expert-observed states in a self-paced style. Then, we use the subsequent reinforcement learning method to learn from the augmented trajectories and transit from expert-unobserved states to expert-observed states. This framework not only explores the expert-unobserved states but also guides maximizing long-term returns on these states, ultimately enabling generalization beyond the expert data. Empirical results show that our proposal could effectively mitigate the covariate shift and achieve the state-of-the-art performance on the offline imitation learning benchmarks. Project website: \url{https://www.lamda.nju.edu.cn/shaojj/KDD24_SRA/}.
comment: Accepted by KDD2024
☆ The Heterophilic Snowflake Hypothesis: Training and Empowering GNNs for Heterophilic Graphs KDD 2024
Graph Neural Networks (GNNs) have become pivotal tools for a range of graph-based learning tasks. Notably, most current GNN architectures operate under the assumption of homophily, whether explicitly or implicitly. While this underlying assumption is frequently adopted, it is not universally applicable, which can result in potential shortcomings in learning effectiveness. In this paper, \textbf{for the first time}, we transfer the prevailing concept of ``one node one receptive field" to the heterophilic graph. By constructing a proxy label predictor, we enable each node to possess a latent prediction distribution, which assists connected nodes in determining whether they should aggregate their associated neighbors. Ultimately, every node can have its own unique aggregation hop and pattern, much like each snowflake is unique and possesses its own characteristics. Based on observations, we innovatively introduce the Heterophily Snowflake Hypothesis and provide an effective solution to guide and facilitate research on heterophilic graphs and beyond. We conduct comprehensive experiments including (1) main results on 10 graphs with varying heterophily ratios across 10 backbones; (2) scalability on various deep GNN backbones (SGC, JKNet, etc.) across various large number of layers (2,4,6,8,16,32 layers); (3) comparison with conventional snowflake hypothesis; (4) efficiency comparison with existing graph pruning algorithms. Our observations show that our framework acts as a versatile operator for diverse tasks. It can be integrated into various GNN frameworks, boosting performance in-depth and offering an explainable approach to choosing the optimal network depth. The source code is available at \url{https://github.com/bingreeky/HeteroSnoH}.
comment: KDD 2024
☆ Variational Distillation of Diffusion Policies into Mixture of Experts
This work introduces Variational Diffusion Distillation (VDD), a novel method that distills denoising diffusion policies into Mixtures of Experts (MoE) through variational inference. Diffusion Models are the current state-of-the-art in generative modeling due to their exceptional ability to accurately learn and represent complex, multi-modal distributions. This ability allows Diffusion Models to replicate the inherent diversity in human behavior, making them the preferred models in behavior learning such as Learning from Human Demonstrations (LfD). However, diffusion models come with some drawbacks, including the intractability of likelihoods and long inference times due to their iterative sampling process. The inference times, in particular, pose a significant challenge to real-time applications such as robot control. In contrast, MoEs effectively address the aforementioned issues while retaining the ability to represent complex distributions but are notoriously difficult to train. VDD is the first method that distills pre-trained diffusion models into MoE models, and hence, combines the expressiveness of Diffusion Models with the benefits of Mixture Models. Specifically, VDD leverages a decompositional upper bound of the variational objective that allows the training of each expert separately, resulting in a robust optimization scheme for MoEs. VDD demonstrates across nine complex behavior learning tasks, that it is able to: i) accurately distill complex distributions learned by the diffusion model, ii) outperform existing state-of-the-art distillation methods, and iii) surpass conventional methods for training MoE.
☆ TREE: Tree Regularization for Efficient Execution
The rise of machine learning methods on heavily resource constrained devices requires not only the choice of a suitable model architecture for the target platform, but also the optimization of the chosen model with regard to execution time consumption for inference in order to optimally utilize the available resources. Random forests and decision trees are shown to be a suitable model for such a scenario, since they are not only heavily tunable towards the total model size, but also offer a high potential for optimizing their executions according to the underlying memory architecture. In addition to the straightforward strategy of enforcing shorter paths through decision trees and hence reducing the execution time for inference, hardware-aware implementations can optimize the execution time in an orthogonal manner. One particular hardware-aware optimization is to layout the memory of decision trees in such a way, that higher probably paths are less likely to be evicted from system caches. This works particularly well when splits within tree nodes are uneven and have a high probability to visit one of the child nodes. In this paper, we present a method to reduce path lengths by rewarding uneven probability distributions during the training of decision trees at the cost of a minimal accuracy degradation. Specifically, we regularize the impurity computation of the CART algorithm in order to favor not only low impurity, but also highly asymmetric distributions for the evaluation of split criteria and hence offer a high optimization potential for a memory architecture-aware implementation. We show that especially for binary classification data sets and data sets with many samples, this form of regularization can lead to an reduction of up to approximately four times in the execution time with a minimal accuracy degradation.
☆ Update Selective Parameters: Federated Machine Unlearning Based on Model Explanation
Federated learning is a promising privacy-preserving paradigm for distributed machine learning. In this context, there is sometimes a need for a specialized process called machine unlearning, which is required when the effect of some specific training samples needs to be removed from a learning model due to privacy, security, usability, and/or legislative factors. However, problems arise when current centralized unlearning methods are applied to existing federated learning, in which the server aims to remove all information about a class from the global model. Centralized unlearning usually focuses on simple models or is premised on the ability to access all training data at a central node. However, training data cannot be accessed on the server under the federated learning paradigm, conflicting with the requirements of the centralized unlearning process. Additionally, there are high computation and communication costs associated with accessing clients' data, especially in scenarios involving numerous clients or complex global models. To address these concerns, we propose a more effective and efficient federated unlearning scheme based on the concept of model explanation. Model explanation involves understanding deep networks and individual channel importance, so that this understanding can be used to determine which model channels are critical for classes that need to be unlearned. We select the most influential channels within an already-trained model for the data that need to be unlearned and fine-tune only influential channels to remove the contribution made by those data. In this way, we can simultaneously avoid huge consumption costs and ensure that the unlearned model maintains good performance. Experiments with different training models on various datasets demonstrate the effectiveness of the proposed approach.
comment: Accepted by IEEE Transactions on Big Data
☆ Improving the Evaluation and Actionability of Explanation Methods for Multivariate Time Series Classification
Explanation for Multivariate Time Series Classification (MTSC) is an important topic that is under explored. There are very few quantitative evaluation methodologies and even fewer examples of actionable explanation, where the explanation methods are shown to objectively improve specific computational tasks on time series data. In this paper we focus on analyzing InterpretTime, a recent evaluation methodology for attribution methods applied to MTSC. We reproduce the original paper results, showcase some significant weaknesses of the methodology and propose ideas to improve both its accuracy and efficiency. Unlike related work, we go beyond evaluation and also showcase the actionability of the produced explainer ranking, by using the best attribution methods for the task of channel selection in MTSC. We find that perturbation-based methods such as SHAP and Feature Ablation work well across a set of datasets, classifiers and tasks and outperform gradient-based methods. We apply the best ranked explainers to channel selection for MTSC and show significant data size reduction and improved classifier accuracy.
☆ Autonomous navigation of catheters and guidewires in mechanical thrombectomy using inverse reinforcement learning
Purpose: Autonomous navigation of catheters and guidewires can enhance endovascular surgery safety and efficacy, reducing procedure times and operator radiation exposure. Integrating tele-operated robotics could widen access to time-sensitive emergency procedures like mechanical thrombectomy (MT). Reinforcement learning (RL) shows potential in endovascular navigation, yet its application encounters challenges without a reward signal. This study explores the viability of autonomous navigation in MT vasculature using inverse RL (IRL) to leverage expert demonstrations. Methods: This study established a simulation-based training and evaluation environment for MT navigation. We used IRL to infer reward functions from expert behaviour when navigating a guidewire and catheter. We utilized soft actor-critic to train models with various reward functions and compared their performance in silico. Results: We demonstrated feasibility of navigation using IRL. When evaluating single versus dual device (i.e. guidewire versus catheter and guidewire) tracking, both methods achieved high success rates of 95% and 96%, respectively. Dual-tracking, however, utilized both devices mimicking an expert. A success rate of 100% and procedure time of 22.6 s were obtained when training with a reward function obtained through reward shaping. This outperformed a dense reward function (96%, 24.9 s) and an IRL-derived reward function (48%, 59.2 s). Conclusions: We have contributed to the advancement of autonomous endovascular intervention navigation, particularly MT, by employing IRL. The results underscore the potential of using reward shaping to train models, offering a promising avenue for enhancing the accessibility and precision of MT. We envisage that future research can extend our methodology to diverse anatomical structures to enhance generalizability.
comment: Abstract shortened for arXiv character limit
☆ The Power of LLM-Generated Synthetic Data for Stance Detection in Online Political Discussions
Stance detection holds great potential for enhancing the quality of online political discussions, as it has shown to be useful for summarizing discussions, detecting misinformation, and evaluating opinion distributions. Usually, transformer-based models are used directly for stance detection, which require large amounts of data. However, the broad range of debate questions in online political discussion creates a variety of possible scenarios that the model is faced with and thus makes data acquisition for model training difficult. In this work, we show how to leverage LLM-generated synthetic data to train and improve stance detection agents for online political discussions:(i) We generate synthetic data for specific debate questions by prompting a Mistral-7B model and show that fine-tuning with the generated synthetic data can substantially improve the performance of stance detection. (ii) We examine the impact of combining synthetic data with the most informative samples from an unlabelled dataset. First, we use the synthetic data to select the most informative samples, second, we combine both these samples and the synthetic data for fine-tuning. This approach reduces labelling effort and consistently surpasses the performance of the baseline model that is trained with fully labeled data. Overall, we show in comprehensive experiments that LLM-generated data greatly improves stance detection performance for online political discussions.
☆ Accelerating Depthwise Separable Convolutions on Ultra-Low-Power Devices
Depthwise separable convolutions are a fundamental component in efficient Deep Neural Networks, as they reduce the number of parameters and operations compared to traditional convolutions while maintaining comparable accuracy. However, their low data reuse opportunities make deploying them notoriously difficult. In this work, we perform an extensive exploration of alternatives to fuse the depthwise and pointwise kernels that constitute the separable convolutional block. Our approach aims to minimize time-consuming memory transfers by combining different data layouts. When targeting a commercial ultra-low-power device with a three-level memory hierarchy, the GreenWaves GAP8 SoC, we reduce the latency of end-to-end network execution by up to 11.40%. Furthermore, our kernels reduce activation data movements between L2 and L1 memories by up to 52.97%.
comment: Accepted at the XXIV International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS2024), June 29 - July 4, 2024
☆ Adversarial Multi-dueling Bandits
We introduce the problem of regret minimization in adversarial multi-dueling bandits. While adversarial preferences have been studied in dueling bandits, they have not been explored in multi-dueling bandits. In this setting, the learner is required to select $m \geq 2$ arms at each round and observes as feedback the identity of the most preferred arm which is based on an arbitrary preference matrix chosen obliviously. We introduce a novel algorithm, MiDEX (Multi Dueling EXP3), to learn from such preference feedback that is assumed to be generated from a pairwise-subset choice model. We prove that the expected cumulative $T$-round regret of MiDEX compared to a Borda-winner from a set of $K$ arms is upper bounded by $O((K \log K)^{1/3} T^{2/3})$. Moreover, we prove a lower bound of $\Omega(K^{1/3} T^{2/3})$ for the expected regret in this setting which demonstrates that our proposed algorithm is near-optimal.
☆ Insect Identification in the Wild: The AMI Dataset
Insects represent half of all global biodiversity, yet many of the world's insects are disappearing, with severe implications for ecosystems and agriculture. Despite this crisis, data on insect diversity and abundance remain woefully inadequate, due to the scarcity of human experts and the lack of scalable tools for monitoring. Ecologists have started to adopt camera traps to record and study insects, and have proposed computer vision algorithms as an answer for scalable data processing. However, insect monitoring in the wild poses unique challenges that have not yet been addressed within computer vision, including the combination of long-tailed data, extremely similar classes, and significant distribution shifts. We provide the first large-scale machine learning benchmarks for fine-grained insect recognition, designed to match real-world tasks faced by ecologists. Our contributions include a curated dataset of images from citizen science platforms and museums, and an expert-annotated dataset drawn from automated camera traps across multiple continents, designed to test out-of-distribution generalization under field conditions. We train and evaluate a variety of baseline algorithms and introduce a combination of data augmentation techniques that enhance generalization across geographies and hardware setups. Code and datasets are made publicly available.
☆ A data-centric approach for assessing progress of Graph Neural Networks
Graph Neural Networks (GNNs) have achieved state-of-the-art results in node classification tasks. However, most improvements are in multi-class classification, with less focus on the cases where each node could have multiple labels. The first challenge in studying multi-label node classification is the scarcity of publicly available datasets. To address this, we collected and released three real-world biological datasets and developed a multi-label graph generator with tunable properties. We also argue that traditional notions of homophily and heterophily do not apply well to multi-label scenarios. Therefore, we define homophily and Cross-Class Neighborhood Similarity for multi-label classification and investigate $9$ collected multi-label datasets. Lastly, we conducted a large-scale comparative study with $8$ methods across nine datasets to evaluate current progress in multi-label node classification. We release our code at \url{https://github.com/Tianqi-py/MLGNC}.
☆ Federated Learning with Limited Node Labels
Subgraph federated learning (SFL) is a research methodology that has gained significant attention for its potential to handle distributed graph-structured data. In SFL, the local model comprises graph neural networks (GNNs) with a partial graph structure. However, some SFL models have overlooked the significance of missing cross-subgraph edges, which can lead to local GNNs being unable to message-pass global representations to other parties' GNNs. Moreover, existing SFL models require substantial labeled data, which limits their practical applications. To overcome these limitations, we present a novel SFL framework called FedMpa that aims to learn cross-subgraph node representations. FedMpa first trains a multilayer perceptron (MLP) model using a small amount of data and then propagates the federated feature to the local structures. To further improve the embedding representation of nodes with local subgraphs, we introduce the FedMpae method, which reconstructs the local graph structure with an innovation view that applies pooling operation to form super-nodes. Our extensive experiments on six graph datasets demonstrate that FedMpa is highly effective in node classification. Furthermore, our ablation experiments verify the effectiveness of FedMpa.
☆ Towards Audio Codec-based Speech Separation
Recent improvements in neural audio codec (NAC) models have generated interest in adopting pre-trained codecs for a variety of speech processing applications to take advantage of the efficiencies gained from high compression, but these have yet been applied to the speech separation (SS) task. SS can benefit from high compression because the compute required for traditional SS models makes them impractical for many edge computing use cases. However, SS is a waveform-masking task where compression tends to introduce distortions that severely impact performance. Here we propose a novel task of Audio Codec-based SS, where SS is performed within the embedding space of a NAC, and propose a new model, Codecformer, to address this task. At inference, Codecformer achieves a 52x reduction in MAC while producing separation performance comparable to a cloud deployment of Sepformer. This method charts a new direction for performing efficient SS in practical scenarios.
comment: This paper was accepted by Interspeech 2024, Blue Sky Track
☆ Exploring Sensing Devices for Heart and Lung Sound Monitoring
This paper presents a comprehensive review of cardiorespiratory auscultation sensing devices which is useful for understanding the theoretical aspects of sensing devices, as well as practical notes to design novel sensing devices. One of the methods to design a stethoscope is using electret condenser microphones (ECM). In this paper, we first introduce the acoustic properties of the heart and lungs, as well as a brief history of stethoscope evolution. Then, we discuss the basic concept of ECM sensors and a recent stethoscope based on this technology. In response to the limitations of ECM-based systems, we explore the potential of microelectromechanical systems (MEMS), particularly focusing on piezoelectric transducer (PZT) sensors. This paper comprehensively reviews sensing technologies, emphasizing innovative MEMS-based designs for wearable cardiopulmonary auscultation in the past decade. To our knowledge, this is the first paper to summarize ECM and MEMS applications for heart and lung sound analysis. Keywords: Micro-electro-mechanical Systems (MEMS); Electret Condenser Microphone (ECM); Wearable Sensing Devices; Cardiorespiratory Auscultation; Phonocardiography (PCG); Heart Sound; Lung Sound
☆ PlanRAG: A Plan-then-Retrieval Augmented Generation for Generative Large Language Models as Decision Makers NAACL 2024
In this paper, we conduct a study to utilize LLMs as a solution for decision making that requires complex data analysis. We define Decision QA as the task of answering the best decision, $d_{best}$, for a decision-making question $Q$, business rules $R$ and a database $D$. Since there is no benchmark that can examine Decision QA, we propose Decision QA benchmark, DQA. It has two scenarios, Locating and Building, constructed from two video games (Europa Universalis IV and Victoria 3) that have almost the same goal as Decision QA. To address Decision QA effectively, we also propose a new RAG technique called the iterative plan-then-retrieval augmented generation (PlanRAG). Our PlanRAG-based LM generates the plan for decision making as the first step, and the retriever generates the queries for data analysis as the second step. The proposed method outperforms the state-of-the-art iterative RAG method by 15.8% in the Locating scenario and by 7.4% in the Building scenario, respectively. We release our code and benchmark at https://github.com/myeon9h/PlanRAG.
comment: NAACL 2024
☆ PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems
Multimodal language models that process both text and speech have a potential for applications in spoken dialogue systems. However, current models face two major challenges in response generation latency: (1) generating a spoken response requires the prior generation of a written response, and (2) speech sequences are significantly longer than text sequences. This study addresses these issues by extending the input and output sequences of the language model to support the parallel generation of text and speech. Our experiments on spoken question answering tasks demonstrate that our approach improves latency while maintaining the quality of response content. Additionally, we show that latency can be further reduced by generating speech in multiple sequences. Demo samples are available at https://rinnakk.github.io/research/publications/PSLM.
comment: 8 pages, 4 figures, 4 tables, demo samples: https://rinnakk.github.io/research/publications/PSLM
☆ Deep Temporal Deaggregation: Large-Scale Spatio-Temporal Generative Models
Many of today's data is time-series data originating from various sources, such as sensors, transaction systems, or production systems. Major challenges with such data include privacy and business sensitivity. Generative time-series models have the potential to overcome these problems, allowing representative synthetic data, such as people's movement in cities, to be shared openly and be used to the benefit of society at large. However, contemporary approaches are limited to prohibitively short sequences and small scales. Aside from major memory limitations, the models generate less accurate and less representative samples the longer the sequences are. This issue is further exacerbated by the lack of a comprehensive and accessible benchmark. Furthermore, a common need in practical applications is what-if analysis and dynamic adaptation to data distribution changes, for usage in decision making and to manage a changing world: What if this road is temporarily blocked or another road is added? The focus of this paper is on mobility data, such as people's movement in cities, requiring all these issues to be addressed. To this end, we propose a transformer-based diffusion model, TDDPM, for time-series which outperforms and scales substantially better than state-of-the-art. This is evaluated in a new comprehensive benchmark across several sequence lengths, standard datasets, and evaluation measures. We also demonstrate how the model can be conditioned on a prior over spatial occupancy frequency information, allowing the model to generate mobility data for previously unseen environments and for hypothetical scenarios where the underlying road network and its usage changes. This is evaluated by training on mobility data from part of a city. Then, using only aggregate spatial information as prior, we demonstrate out-of-distribution generalization to the unobserved remainder of the city.
☆ MMUTF: Multimodal Multimedia Event Argument Extraction with Unified Template Filling
With the advancement of multimedia technologies, news documents and user-generated content are often represented as multiple modalities, making Multimedia Event Extraction (MEE) an increasingly important challenge. However, recent MEE methods employ weak alignment strategies and data augmentation with simple classification models, which ignore the capabilities of natural language-formulated event templates for the challenging Event Argument Extraction (EAE) task. In this work, we focus on EAE and address this issue by introducing a unified template filling model that connects the textual and visual modalities via textual prompts. This approach enables the exploitation of cross-ontology transfer and the incorporation of event-specific semantics. Experiments on the M2E2 benchmark demonstrate the effectiveness of our approach. Our system surpasses the current SOTA on textual EAE by +7% F1, and performs generally better than the second-best systems for multimedia EAE.
☆ TADM: Temporally-Aware Diffusion Model for Neurodegenerative Progression on Brain MRI
Generating realistic images to accurately predict changes in the structure of brain MRI is a crucial tool for clinicians. Such applications help assess patients' outcomes and analyze how diseases progress at the individual level. However, existing methods for this task present some limitations. Some approaches attempt to model the distribution of MRI scans directly by conditioning the model on patients' ages, but they fail to explicitly capture the relationship between structural changes in the brain and time intervals, especially on age-unbalanced datasets. Other approaches simply rely on interpolation between scans, which limits their clinical application as they do not predict future MRIs. To address these challenges, we propose a Temporally-Aware Diffusion Model (TADM), which introduces a novel approach to accurately infer progression in brain MRIs. TADM learns the distribution of structural changes in terms of intensity differences between scans and combines the prediction of these changes with the initial baseline scans to generate future MRIs. Furthermore, during training, we propose to leverage a pre-trained Brain-Age Estimator (BAE) to refine the model's training process, enhancing its ability to produce accurate MRIs that match the expected age gap between baseline and generated scans. Our assessment, conducted on the OASIS-3 dataset, uses similarity metrics and region sizes computed by comparing predicted and real follow-up scans on 3 relevant brain regions. TADM achieves large improvements over existing approaches, with an average decrease of 24% in region size error and an improvement of 4% in similarity metrics. These evaluations demonstrate the improvement of our model in mimicking temporal brain neurodegenerative progression compared to existing methods. Our approach will benefit applications, such as predicting patient outcomes or improving treatments for patients.
☆ Translation Equivariant Transformer Neural Processes
The effectiveness of neural processes (NPs) in modelling posterior prediction maps -- the mapping from data to posterior predictive distributions -- has significantly improved since their inception. This improvement can be attributed to two principal factors: (1) advancements in the architecture of permutation invariant set functions, which are intrinsic to all NPs; and (2) leveraging symmetries present in the true posterior predictive map, which are problem dependent. Transformers are a notable development in permutation invariant set functions, and their utility within NPs has been demonstrated through the family of models we refer to as TNPs. Despite significant interest in TNPs, little attention has been given to incorporating symmetries. Notably, the posterior prediction maps for data that are stationary -- a common assumption in spatio-temporal modelling -- exhibit translation equivariance. In this paper, we introduce of a new family of translation equivariant TNPs that incorporate translation equivariance. Through an extensive range of experiments on synthetic and real-world spatio-temporal data, we demonstrate the effectiveness of TE-TNPs relative to their non-translation-equivariant counterparts and other NP baselines.
☆ Fast Rates for Bandit PAC Multiclass Classification
We study multiclass PAC learning with bandit feedback, where inputs are classified into one of $K$ possible labels and feedback is limited to whether or not the predicted labels are correct. Our main contribution is in designing a novel learning algorithm for the agnostic $(\varepsilon,\delta)$-PAC version of the problem, with sample complexity of $O\big( (\operatorname{poly}(K) + 1 / \varepsilon^2) \log (|H| / \delta) \big)$ for any finite hypothesis class $H$. In terms of the leading dependence on $\varepsilon$, this improves upon existing bounds for the problem, that are of the form $O(K/\varepsilon^2)$. We also provide an extension of this result to general classes and establish similar sample complexity bounds in which $\log |H|$ is replaced by the Natarajan dimension. This matches the optimal rate in the full-information version of the problem and resolves an open question studied by Daniely, Sabato, Ben-David, and Shalev-Shwartz (2011) who demonstrated that the multiplicative price of bandit feedback in realizable PAC learning is $\Theta(K)$. We complement this by revealing a stark contrast with the agnostic case, where the price of bandit feedback is only $O(1)$ as $\varepsilon \to 0$. Our algorithm utilizes a stochastic optimization technique to minimize a log-barrier potential based on Frank-Wolfe updates for computing a low-variance exploration distribution over the hypotheses, and is made computationally efficient provided access to an ERM oracle over $H$.
☆ QOG:Question and Options Generation based on Language Model
Question-Options Generation (QOG) is a task that involves generating a set of question-options pairs given context. This task has various applications, including fine-tuning large models, information retrieval, and automated multiple-choice question generation for education. In this paper, we develop QOG models using three different methods based on fine-tuning sequence-to-sequence language models (LMs). Experiments demonstrate that the end-to-end QOG model is computationally efficient and stable during both training and inference, outperforming other methods. Furthermore, our analysis indicates that our QOG models are competitive on the QOG task compared to the large language model Llama 3-8B.
comment: 8 pages, 3 figures, 4 tables
☆ Efficient mapping of phase diagrams with conditional normalizing flows
The accurate prediction of phase diagrams is of central importance for both the fundamental understanding of materials as well as for technological applications in material sciences. However, the computational prediction of the relative stability between phases based on their free energy is a daunting task, as traditional free energy estimators require a large amount of simulation data to obtain uncorrelated equilibrium samples over a grid of thermodynamic states. In this work, we develop deep generative machine learning models for entire phase diagrams, employing normalizing flows conditioned on the thermodynamic states, e.g., temperature and pressure, that they map to. By training a single normalizing flow to transform the equilibrium distribution sampled at only one reference thermodynamic state to a wide range of target temperatures and pressures, we can efficiently generate equilibrium samples across the entire phase diagram. Using a permutation-equivariant architecture allows us, thereby, to treat solid and liquid phases on the same footing. We demonstrate our approach by predicting the solid-liquid coexistence line for a Lennard-Jones system in excellent agreement with state-of-the-art free energy methods while significantly reducing the number of energy evaluations needed.
☆ GW-MoE: Resolving Uncertainty in MoE Router with Global Workspace Theory
Mixture-of-Experts (MoE) has been demonstrated as an efficient method to scale up models. By dynamically and sparsely selecting activated experts, MoE can effectively reduce computational costs. Despite the success, we observe that many tokens in the MoE models have uncertain routing results. These tokens have nearly equal scores for choosing each expert, and we demonstrate that this uncertainty can lead to incorrect selections. Inspired by the Global Workspace Theory (GWT), we propose a new fine-tuning method, GW-MoE, to address this issue. The core idea is to broadcast the uncertain tokens across experts during fine-tuning. Therefore, these tokens can acquire the necessary knowledge from any expert during inference and become less sensitive to the choice. GW-MoE does not introduce additional inference overhead. We validate that GW can mitigate the uncertain problem and consistently improve in different tasks (text classification, question answering, summarization, code generation, and mathematical problem solving) and model sizes (650M and 8B parameters).
☆ WebCanvas: Benchmarking Web Agents in Online Environments
For web agents to be practically useful, they must adapt to the continuously evolving web environment characterized by frequent updates to user interfaces and content. However, most existing benchmarks only capture the static aspects of the web. To bridge this gap, we introduce WebCanvas, an innovative online evaluation framework for web agents that effectively addresses the dynamic nature of web interactions. WebCanvas contains three main components to facilitate realistic assessments: (1) A novel evaluation metric which reliably capture critical intermediate actions or states necessary for task completions while disregarding noise caused by insignificant events or changed web-elements. (2) A benchmark dataset called Mind2Web-Live, a refined version of original Mind2Web static dataset containing 542 tasks with 2439 intermediate evaluation states; (3) Lightweight and generalizable annotation tools and testing pipelines that enables the community to collect and maintain the high-quality, up-to-date dataset. Building on WebCanvas, we open-source an agent framework with extensible modules for reasoning, providing a foundation for the community to conduct online inference and evaluations. Our best-performing agent achieves a task success rate of 23.1% and a task completion rate of 48.8% on the Mind2Web-Live test set. Additionally, we analyze the performance discrepancies across various websites, domains, and experimental environments. We encourage the community to contribute further insights on online agent evaluation, thereby advancing this field of research.
comment: Our platform, tool and dataset are publically available at https://www.imean.ai/web-canvas/ and https://huggingface.co/datasets/iMeanAI/Mind2Web-Live/
☆ Competitive Learning for Achieving Content-specific Filters in Video Coding for Machines ICIP 2024
This paper investigates the efficacy of jointly optimizing content-specific post-processing filters to adapt a human oriented video/image codec into a codec suitable for machine vision tasks. By observing that artifacts produced by video/image codecs are content-dependent, we propose a novel training strategy based on competitive learning principles. This strategy assigns training samples to filters dynamically, in a fuzzy manner, which further optimizes the winning filter on the given sample. Inspired by simulated annealing optimization techniques, we employ a softmax function with a temperature variable as the weight allocation function to mitigate the effects of random initialization. Our evaluation, conducted on a system utilizing multiple post-processing filters within a Versatile Video Coding (VVC) codec framework, demonstrates the superiority of content-specific filters trained with our proposed strategies, specifically, when images are processed in blocks. Using VVC reference software VTM 12.0 as the anchor, experiments on the OpenImages dataset show an improvement in the BD-rate reduction from -41.3% and -44.6% to -42.3% and -44.7% for object detection and instance segmentation tasks, respectively, compared to independently trained filters. The statistics of the filter usage align with our hypothesis and underscore the importance of jointly optimizing filters for both content and reconstruction quality. Our findings pave the way for further improving the performance of video/image codecs.
comment: Accepted to be preseneted in ICIP 2024
☆ Structured Prediction in Online Learning
We study a theoretical and algorithmic framework for structured prediction in the online learning setting. The problem of structured prediction, i.e. estimating function where the output space lacks a vectorial structure, is well studied in the literature of supervised statistical learning. We show that our algorithm is a generalisation of optimal algorithms from the supervised learning setting, and achieves the same excess risk upper bound also when data are not i.i.d. Moreover, we consider a second algorithm designed especially for non-stationary data distributions, including adversarial data. We bound its stochastic regret in function of the variation of the data distributions.
comment: 29 pages
☆ UrbanLLM: Autonomous Urban Activity Planning and Management with Large Language Models EMNLP 2024
Location-based services play an critical role in improving the quality of our daily lives. Despite the proliferation of numerous specialized AI models within spatio-temporal context of location-based services, these models struggle to autonomously tackle problems regarding complex urban planing and management. To bridge this gap, we introduce UrbanLLM, a fine-tuned large language model (LLM) designed to tackle diverse problems in urban scenarios. UrbanLLM functions as a problem-solver by decomposing urban-related queries into manageable sub-tasks, identifying suitable spatio-temporal AI models for each sub-task, and generating comprehensive responses to the given queries. Our experimental results indicate that UrbanLLM significantly outperforms other established LLMs, such as Llama and the GPT series, in handling problems concerning complex urban activity planning and management. UrbanLLM exhibits considerable potential in enhancing the effectiveness of solving problems in urban scenarios, reducing the workload and reliance for human experts.
comment: Submitted for EMNLP 2024
☆ Memory Sequence Length of Data Sampling Impacts the Adaptation of Meta-Reinforcement Learning Agents
Fast adaptation to new tasks is extremely important for embodied agents in the real world. Meta-reinforcement learning (meta-RL) has emerged as an effective method to enable fast adaptation in unknown environments. Compared to on-policy meta-RL algorithms, off-policy algorithms rely heavily on efficient data sampling strategies to extract and represent the historical trajectories. However, little is known about how different data sampling methods impact the ability of meta-RL agents to represent unknown environments. Here, we investigate the impact of data sampling strategies on the exploration and adaptability of meta-RL agents. Specifically, we conducted experiments with two types of off-policy meta-RL algorithms based on Thompson sampling and Bayes-optimality theories in continuous control tasks within the MuJoCo environment and sparse reward navigation tasks. Our analysis revealed the long-memory and short-memory sequence sampling strategies affect the representation and adaptive capabilities of meta-RL agents. We found that the algorithm based on Bayes-optimality theory exhibited more robust and better adaptability than the algorithm based on Thompson sampling, highlighting the importance of appropriate data sampling strategies for the agent's representation of an unknown environment, especially in the case of sparse rewards.
♻ ☆ BTS: Building Timeseries Dataset: Empowering Large-Scale Building Analytics
Buildings play a crucial role in human well-being, influencing occupant comfort, health, and safety. Additionally, they contribute significantly to global energy consumption, accounting for one-third of total energy usage, and carbon emissions. Optimizing building performance presents a vital opportunity to combat climate change and promote human flourishing. However, research in building analytics has been hampered by the lack of accessible, available, and comprehensive real-world datasets on multiple building operations. In this paper, we introduce the Building TimeSeries (BTS) dataset. Our dataset covers three buildings over a three-year period, comprising more than ten thousand timeseries data points with hundreds of unique ontologies. Moreover, the metadata is standardized using the Brick schema. To demonstrate the utility of this dataset, we performed benchmarks on two tasks: timeseries ontology classification and zero-shot forecasting. These tasks represent an essential initial step in addressing challenges related to interoperability in building analytics. Access to the dataset and the code used for benchmarking are available here: https://github.com/cruiseresearchgroup/DIEF_BTS .
comment: 21 pages, 2 figures, 9 tables, under review
♻ ☆ OccamLLM: Fast and Exact Language Model Arithmetic in a Single Step
Despite significant advancements in text generation and reasoning, Large Language Models (LLMs) still face challenges in accurately performing complex arithmetic operations. To achieve accurate calculations, language model systems often enable LLMs to generate code for arithmetic operations. However, this approach compromises speed and security and, if finetuning is involved, risks the language model losing prior capabilities. We propose a framework that enables exact arithmetic in \textit{a single autoregressive step}, providing faster, more secure, and more interpretable LLM systems with arithmetic capabilities. We use the hidden states of an LLM to control a symbolic architecture which performs arithmetic. Our implementation using Llama 3 8B Instruct with OccamNet as a symbolic model (OccamLlama) achieves 100\% accuracy on single arithmetic operations ($+,-,\times,\div,\sin{},\cos{},\log{},\exp{},\sqrt{}$), outperforming GPT 4o and on par with GPT 4o using a code interpreter. OccamLlama also outperforms GPT 4o both with and without a code interpreter on mathematical problem solving benchmarks involving challenging arithmetic, thus enabling small LLMs to match the arithmetic performance of even much larger models. We will make our code public shortly.
♻ ☆ DeformTime: Capturing Variable Dependencies with Deformable Attention for Time Series Forecasting
In multivariate time series (MTS) forecasting, existing state-of-the-art deep learning approaches tend to focus on autoregressive formulations and overlook the information within exogenous indicators. To address this limitation, we present DeformTime, a neural network architecture that attempts to capture correlated temporal patterns from the input space, and hence, improve forecasting accuracy. It deploys two core operations performed by deformable attention blocks (DABs): learning dependencies across variables from different time steps (variable DAB), and preserving temporal dependencies in data from previous time steps (temporal DAB). Input data transformation is explicitly designed to enhance learning from the deformed series of information while passing through a DAB. We conduct extensive experiments on 6 MTS data sets, using previously established benchmarks as well as challenging infectious disease modelling tasks with more exogenous variables. The results demonstrate that DeformTime improves accuracy against previous competitive methods across the vast majority of MTS forecasting tasks, reducing the mean absolute error by 10% on average. Notably, performance gains remain consistent across longer forecasting horizons.
comment: The code is available at https://github.com/ClaudiaShu/DeformTime
♻ ☆ Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
We show that even the most recent safety-aligned LLMs are not robust to simple adaptive jailbreaking attacks. First, we demonstrate how to successfully leverage access to logprobs for jailbreaking: we initially design an adversarial prompt template (sometimes adapted to the target LLM), and then we apply random search on a suffix to maximize a target logprob (e.g., of the token ``Sure''), potentially with multiple restarts. In this way, we achieve nearly 100% attack success rate -- according to GPT-4 as a judge -- on Vicuna-13B, Mistral-7B, Phi-3-Mini, Nemotron-4-340B, Llama-2-Chat-7B/13B/70B, Llama-3-Instruct-8B, Gemma-7B, GPT-3.5, GPT-4, and R2D2 from HarmBench that was adversarially trained against the GCG attack. We also show how to jailbreak all Claude models -- that do not expose logprobs -- via either a transfer or prefilling attack with a 100% success rate. In addition, we show how to use random search on a restricted set of tokens for finding trojan strings in poisoned models -- a task that shares many similarities with jailbreaking -- which is the algorithm that brought us the first place in the SaTML'24 Trojan Detection Competition. The common theme behind these attacks is that adaptivity is crucial: different models are vulnerable to different prompting templates (e.g., R2D2 is very sensitive to in-context learning prompts), some models have unique vulnerabilities based on their APIs (e.g., prefilling for Claude), and in some settings, it is crucial to restrict the token search space based on prior knowledge (e.g., for trojan detection). For reproducibility purposes, we provide the code, logs, and jailbreak artifacts in the JailbreakBench format at https://github.com/tml-epfl/llm-adaptive-attacks.
comment: Updates in the v2: more models (Llama3, Phi-3, Nemotron-4-340B), jailbreak artifacts for all attacks are available, evaluation of generalization to a different judge (Llama-3-70B and Llama Guard 2), more experiments (convergence plots over iterations, ablation on the suffix length for random search), improved exposition of the paper, examples of jailbroken generation
♻ ☆ Gap-Free Clustering: Sensitivity and Robustness of SDP
We study graph clustering in the Stochastic Block Model (SBM) in the presence of both large clusters and small, unrecoverable clusters. Previous convex relaxation approaches achieving exact recovery do not allow any small clusters of size $o(\sqrt{n})$, or require a size gap between the smallest recovered cluster and the largest non-recovered cluster. We provide an algorithm based on semidefinite programming (SDP) which removes these requirements and provably recovers large clusters regardless of the remaining cluster sizes. Mid-sized clusters pose unique challenges to the analysis, since their proximity to the recovery threshold makes them highly sensitive to small noise perturbations and precludes a closed-form candidate solution. We develop novel techniques, including a leave-one-out-style argument which controls the correlation between SDP solutions and noise vectors even when the removal of one row of noise can drastically change the SDP solution. We also develop improved eigenvalue perturbation bounds of potential independent interest. Our results are robust to certain semirandom settings that are challenging for alternative algorithms. Using our gap-free clustering procedure, we obtain efficient algorithms for the problem of clustering with a faulty oracle with superior query complexities, notably achieving $o(n^2)$ sample complexity even in the presence of a large number of small clusters. Our gap-free clustering procedure also leads to improved algorithms for recursive clustering.
♻ ☆ ROOT-SGD: Sharp Nonasymptotics and Near-Optimal Asymptotics in a Single Algorithm
We study the problem of solving strongly convex and smooth unconstrained optimization problems using stochastic first-order algorithms. We devise a novel algorithm, referred to as \emph{Recursive One-Over-T SGD} (\textsf{ROOT-SGD}), based on an easily implementable, recursive averaging of past stochastic gradients. We prove that it simultaneously achieves state-of-the-art performance in both a finite-sample, nonasymptotic sense and an asymptotic sense. On the nonasymptotic side, we prove risk bounds on the last iterate of \textsf{ROOT-SGD} with leading-order terms that match the optimal statistical risk with a unity pre-factor, along with a higher-order term that scales at the sharp rate of $O(n^{-3/2})$ under the Lipschitz condition on the Hessian matrix. On the asymptotic side, we show that when a mild, one-point Hessian continuity condition is imposed, the rescaled last iterate of (multi-epoch) \textsf{ROOT-SGD} converges asymptotically to a Gaussian limit with the Cram\'{e}r-Rao optimal asymptotic covariance, for a broad range of step-size choices.
♻ ☆ Knowledge Graphs in Practice: Characterizing their Users, Challenges, and Visualization Opportunities
This study presents insights from interviews with nineteen Knowledge Graph (KG) practitioners who work in both enterprise and academic settings on a wide variety of use cases. Through this study, we identify critical challenges experienced by KG practitioners when creating, exploring, and analyzing KGs that could be alleviated through visualization design. Our findings reveal three major personas among KG practitioners - KG Builders, Analysts, and Consumers - each of whom have their own distinct expertise and needs. We discover that KG Builders would benefit from schema enforcers, while KG Analysts need customizable query builders that provide interim query results. For KG Consumers, we identify a lack of efficacy for node-link diagrams, and the need for tailored domain-specific visualizations to promote KG adoption and comprehension. Lastly, we find that implementing KGs effectively in practice requires both technical and social solutions that are not addressed with current tools, technologies, and collaborative workflows. From the analysis of our interviews, we distill several visualization research directions to improve KG usability, including knowledge cards that balance digestibility and discoverability, timeline views to track temporal changes, interfaces that support organic discovery, and semantic explanations for AI and machine learning predictions.
♻ ☆ Plasma Surrogate Modelling using Fourier Neural Operators
Predicting plasma evolution within a Tokamak reactor is crucial to realizing the goal of sustainable fusion. Capabilities in forecasting the spatio-temporal evolution of plasma rapidly and accurately allow us to quickly iterate over design and control strategies on current Tokamak devices and future reactors. Modelling plasma evolution using numerical solvers is often expensive, consuming many hours on supercomputers, and hence, we need alternative inexpensive surrogate models. We demonstrate accurate predictions of plasma evolution both in simulation and experimental domains using deep learning-based surrogate modelling tools, viz., Fourier Neural Operators (FNO). We show that FNO has a speedup of six orders of magnitude over traditional solvers in predicting the plasma dynamics simulated from magnetohydrodynamic models, while maintaining a high accuracy (MSE in the normalised domain $\approx$ $10^{-5}$). Our modified version of the FNO is capable of solving multi-variable Partial Differential Equations (PDE), and can capture the dependence among the different variables in a single model. FNOs can also predict plasma evolution on real-world experimental data observed by the cameras positioned within the MAST Tokamak, i.e., cameras looking across the central solenoid and the divertor in the Tokamak. We show that FNOs are able to accurately forecast the evolution of plasma and have the potential to be deployed for real-time monitoring. We also illustrate their capability in forecasting the plasma shape, the locations of interactions of the plasma with the central solenoid and the divertor for the full (available) duration of the plasma shot within MAST. The FNO offers a viable alternative for surrogate modelling as it is quick to train and infer, and requires fewer data points, while being able to do zero-shot super-resolution and getting high-fidelity solutions.
♻ ☆ OPFData: Large-scale datasets for AC optimal power flow with topological perturbations
Solving the AC optimal power flow problem (AC-OPF) is critical to the efficient and safe planning and operation of power grids. Small efficiency improvements in this domain have the potential to lead to billions of dollars of cost savings, and significant reductions in emissions from fossil fuel generators. Recent work on data-driven solution methods for AC-OPF shows the potential for large speed improvements compared to traditional solvers; however, no large-scale open datasets for this problem exist. We present the largest readily-available collection of solved AC-OPF problems to date. This collection is orders of magnitude larger than existing readily-available datasets, allowing training of high-capacity data-driven models. Uniquely, it includes topological perturbations - a critical requirement for usage in realistic power grid operations. We hope this resource will spur the community to scale research to larger grid sizes with variable topology.
♻ ☆ Inference via Interpolation: Contrastive Representations Provably Enable Planning and Inference
Given time series data, how can we answer questions like "what will happen in the future?" and "how did we get here?" These sorts of probabilistic inference questions are challenging when observations are high-dimensional. In this paper, we show how these questions can have compact, closed form solutions in terms of learned representations. The key idea is to apply a variant of contrastive learning to time series data. Prior work already shows that the representations learned by contrastive learning encode a probability ratio. By extending prior work to show that the marginal distribution over representations is Gaussian, we can then prove that joint distribution of representations is also Gaussian. Taken together, these results show that representations learned via temporal contrastive learning follow a Gauss-Markov chain, a graphical model where inference (e.g., prediction, planning) over representations corresponds to inverting a low-dimensional matrix. In one special case, inferring intermediate representations will be equivalent to interpolating between the learned representations. We validate our theory using numerical simulations on tasks up to 46-dimensions.
comment: Code: https://github.com/vivekmyers/contrastive_planning
♻ ☆ DataComp-LM: In search of the next generation of training sets for language models
We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set. The resulting dataset, DCLM-Baseline enables training a 7B parameter language model from scratch to 64% 5-shot accuracy on MMLU with 2.6T training tokens. Compared to MAP-Neo, the previous state-of-the-art in open-data language models, DCLM-Baseline represents a 6.6 percentage point improvement on MMLU while being trained with 40% less compute. Our baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% & 66%), and performs similarly on an average of 53 natural language understanding tasks while being trained with 6.6x less compute than Llama 3 8B. Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation.
comment: Project page: https://www.datacomp.ai/dclm/
♻ ☆ High-Performance Hybrid Algorithm for Minimum Sum-of-Squares Clustering of Infinitely Tall Data
This paper introduces a novel formulation of the clustering problem, namely the Minimum Sum-of-Squares Clustering of Infinitely Tall Data (MSSC-ITD), and presents HPClust, an innovative set of hybrid parallel approaches for its effective solution. By utilizing modern high-performance computing techniques, HPClust enhances key clustering metrics: effectiveness, computational efficiency, and scalability. In contrast to vanilla data parallelism, which only accelerates processing time through the MapReduce framework, our approach unlocks superior performance by leveraging the multi-strategy competitive-cooperative parallelism and intricate properties of the objective function landscape. Unlike other available algorithms that struggle to scale, our algorithm is inherently parallel in nature, improving solution quality through increased scalability and parallelism, and outperforming even advanced algorithms designed for small and medium-sized datasets. Our evaluation of HPClust, featuring four parallel strategies, demonstrates its superiority over traditional and cutting-edge methods by offering better performance in the key metrics. These results also show that parallel processing not only enhances the clustering efficiency, but the accuracy as well. Additionally, we explore the balance between computational efficiency and clustering quality, providing insights into optimal parallel strategies based on dataset specifics and resource availability. This research advances our understanding of parallelism in clustering algorithms, demonstrating that a judicious hybridization of advanced parallel approaches yields optimal results for MSSC-ITD. Experiments on synthetic data further confirm HPClust's exceptional scalability and robustness to noise.
comment: The MDPI "Mathematics" journal
♻ ☆ Zero-Shot Neural Architecture Search: Challenges, Solutions, and Opportunities
Recently, zero-shot (or training-free) Neural Architecture Search (NAS) approaches have been proposed to liberate NAS from the expensive training process. The key idea behind zero-shot NAS approaches is to design proxies that can predict the accuracy of some given networks without training the network parameters. The proxies proposed so far are usually inspired by recent progress in theoretical understanding of deep learning and have shown great potential on several datasets and NAS benchmarks. This paper aims to comprehensively review and compare the state-of-the-art (SOTA) zero-shot NAS approaches, with an emphasis on their hardware awareness. To this end, we first review the mainstream zero-shot proxies and discuss their theoretical underpinnings. We then compare these zero-shot proxies through large-scale experiments and demonstrate their effectiveness in both hardware-aware and hardware-oblivious NAS scenarios. Finally, we point out several promising ideas to design better proxies. Our source code and the list of related papers are available on https://github.com/SLDGroup/survey-zero-shot-nas.
comment: IEEE T-PAMI
♻ ☆ To smooth a cloud or to pin it down: Guarantees and Insights on Score Matching in Denoising Diffusion Models
Denoising diffusion models are a class of generative models which have recently achieved state-of-the-art results across many domains. Gradual noise is added to the data using a diffusion process, which transforms the data distribution into a Gaussian. Samples from the generative model are then obtained by simulating an approximation of the time reversal of this diffusion initialized by Gaussian samples. Recent research has explored adapting diffusion models for sampling and inference tasks. In this paper, we leverage known connections to stochastic control akin to the F\"ollmer drift to extend established neural network approximation results for the F\"ollmer drift to denoising diffusion models and samplers.
comment: arXiv admin note: text overlap with arXiv:1903.01608 by other authors
♻ ☆ Leveraging Generative Models for Covert Messaging: Challenges and Tradeoffs for "Dead-Drop" Deployments
State of the art generative models of human-produced content are the focus of many recent papers that explore their use for steganographic communication. In particular, generative models of natural language text. Loosely, these works (invertibly) encode message-carrying bits into a sequence of samples from the model, ultimately yielding a plausible natural language covertext. By focusing on this narrow steganographic piece, prior work has largely ignored the significant algorithmic challenges, and performance-security tradeoffs, that arise when one actually tries to build a messaging pipeline around it. We make these challenges concrete, by considering the natural application of such a pipeline: namely, "dead-drop" covert messaging over large, public internet platforms (e.g. social media sites). We explicate the challenges and describe approaches to overcome them, surfacing in the process important performance and security tradeoffs that must be carefully tuned. We implement a system around this model-based format-transforming encryption pipeline, and give an empirical analysis of its performance and (heuristic) security.
♻ ☆ Estimating class separability of text embeddings with persistent homology
This paper introduces an unsupervised method to estimate the class separability of text datasets from a topological point of view. Using persistent homology, we demonstrate how tracking the evolution of embedding manifolds during training can inform about class separability. More specifically, we show how this technique can be applied to detect when the training process stops improving the separability of the embeddings. Our results, validated across binary and multi-class text classification tasks, show that the proposed method's estimates of class separability align with those obtained from supervised methods. This approach offers a novel perspective on monitoring and improving the fine-tuning of sentence transformers for classification tasks, particularly in scenarios where labeled data is scarce. We also discuss how tracking these quantities can provide additional insights into the properties of the trained classifier.
comment: Updated version of the article; pre-print of the version published at Transactions of Machine Learning Research, https://openreview.net/forum?id=8DWrIMuLya
♻ ☆ On Differentially Private Subspace Estimation in a Distribution-Free Setting
Private data analysis faces a significant challenge known as the curse of dimensionality, leading to increased costs. However, many datasets possess an inherent low-dimensional structure. For instance, during optimization via gradient descent, the gradients frequently reside near a low-dimensional subspace. If the low-dimensional structure could be privately identified using a small amount of points, we could avoid paying for the high ambient dimension. On the negative side, Dwork, Talwar, Thakurta, and Zhang (STOC 2014) proved that privately estimating subspaces, in general, requires an amount of points that has a polynomial dependency on the dimension. However, their bound do not rule out the possibility to reduce the number of points for "easy'' instances. Yet, providing a measure that captures how much a given dataset is "easy'' for this task turns out to be challenging, and was not properly addressed in prior works. Inspired by the work of Singhal and Steinke (NeurIPS 2021), we provide the first measures that quantify easiness as a function of multiplicative singular-value gaps in the input dataset, and support them with new upper and lower bounds. In particular, our results determine the first type of gap that is sufficient and necessary for estimating a subspace with an amount of points that is independent of the dimension. Furthermore, we realize our upper bounds using a practical algorithm and demonstrate its advantage in high-dimensional regimes compared to prior approaches.
♻ ☆ Learning Useful Representations of Recurrent Neural Network Weight Matrices
Recurrent Neural Networks (RNNs) are general-purpose parallel-sequential computers. The program of an RNN is its weight matrix. How to learn useful representations of RNN weights that facilitate RNN analysis as well as downstream tasks? While the mechanistic approach directly looks at some RNN's weights to predict its behavior, the functionalist approach analyzes its overall functionality-specifically, its input-output mapping. We consider several mechanistic approaches for RNN weights and adapt the permutation equivariant Deep Weight Space layer for RNNs. Our two novel functionalist approaches extract information from RNN weights by 'interrogating' the RNN through probing inputs. We develop a theoretical framework that demonstrates conditions under which the functionalist approach can generate rich representations that help determine RNN behavior. We release the first two 'model zoo' datasets for RNN weight representation learning. One consists of generative models of a class of formal languages, and the other one of classifiers of sequentially processed MNIST digits.With the help of an emulation-based self-supervised learning technique we compare and evaluate the different RNN weight encoding techniques on multiple downstream applications. On the most challenging one, namely predicting which exact task the RNN was trained on, functionalist approaches show clear superiority.
♻ ☆ Hypergraph: A Unified and Uniform Definition with Application to Chemical Hypergraph
The conventional definition of hypergraph has two major issues: (1) there is not a standard definition of directed hypergraph and (2) there is not a formal definition of nested hypergraph. To resolve these issues, we propose a new definition of hypergraph that unifies the concepts of undirected, directed and nested hypergraphs, and that is uniform in using hyperedge as a single construct for representing high-order correlations among things, i.e., nodes and hyperedges. Specifically, we define a hyperedge to be a simple hyperedge, a nesting hyperedge, or a directed hyperedge. With this new definition, a hypergraph is nested if it has nesting hyperedge(s), and is directed if it has directed hyperedge(s). Otherwise, a hypergraph is a simple hypergraph. The uniformity and power of this new definition, with visualization, should facilitate the use of hypergraph for representing (hierarchical) high-order correlations in general and chemical systems in particular. Graph has been widely used as a mathematical structure for machine learning on molecular structures and 3D molecular geometries. However, graph has a major limitation: it can represent only pairwise correlations between nodes. Hypergraph extends graph with high-order correlations among nodes. This extension is significant or essential for machine learning on chemical systems. For molecules, this is significant as it allows the direct, explicit representation of multicenter bonds and molecular substructures. For chemical reactions, this is essential since most chemical reactions involve multiple participants. We propose the use of chemical hypergraph, a multilevel hypergraph with simple, nesting and directed hyperedges, as a single mathematical structure for representing chemical systems. We apply the new definition of hypergraph to chemical hypergraph and, as simplified versions, molecular hypergraph and chemical reaction hypergraph.
comment: arXiv admin note: text overlap with arXiv:2310.03623 by other authors
♻ ☆ BLoB: Bayesian Low-Rank Adaptation by Backpropagation for Large Language Models
Large Language Models (LLMs) often suffer from overconfidence during inference, particularly when adapted to downstream domain-specific tasks with limited data. Previous work addresses this issue by employing approximate Bayesian estimation after the LLMs are trained, enabling them to quantify uncertainty. However, such post-training approaches' performance is severely limited by the parameters learned during training. In this paper, we go beyond post-training Bayesianization and propose Bayesian Low-Rank Adaptation by Backpropagation (BLoB), an algorithm that continuously and jointly adjusts both the mean and covariance of LLM parameters throughout the whole fine-tuning process. Our empirical results verify the effectiveness of BLoB in terms of generalization and uncertainty estimation, when evaluated on both in-distribution and out-of-distribution data.
comment: 27 pages, 3 figures, 9 tables; preprint, work in progress
♻ ☆ On Efficiently Representing Regular Languages as RNNs
Recent work by Hewitt et al. (2020) provides an interpretation of the empirical success of recurrent neural networks (RNNs) as language models (LMs). It shows that RNNs can efficiently represent bounded hierarchical structures that are prevalent in human language. This suggests that RNNs' success might be linked to their ability to model hierarchy. However, a closer inspection of Hewitt et al.'s (2020) construction shows that it is not inherently limited to hierarchical structures. This poses a natural question: What other classes of LMs can RNNs efficiently represent? To this end, we generalize Hewitt et al.'s (2020) construction and show that RNNs can efficiently represent a larger class of LMs than previously claimed -- specifically, those that can be represented by a pushdown automaton with a bounded stack and a specific stack update function. Altogether, the efficiency of representing this diverse class of LMs with RNN LMs suggests novel interpretations of their inductive bias.
♻ ☆ LoRA-drop: Efficient LoRA Parameter Pruning based on Output Evaluation
Low-Rank Adaptation (LoRA) is currently the most commonly used Parameter-efficient fine-tuning (PEFT) method, it introduces auxiliary parameters for each layer to fine-tune the pre-trained model under limited computing resources. However, it still faces resource consumption challenges during training when scaling up to larger models. Most previous studies have tackled this issue by using pruning techniques, which involve removing LoRA parameters deemed unimportant. Nonetheless, these efforts only analyze LoRA parameter features to evaluate their importance, such as parameter count, size, and gradient. In fact, the output of LoRA (product of LoRA parameter and hidden state), directly impacts the final results. Preliminary experiments indicate that a fraction of LoRA elements possesses significantly high output values, substantially influencing the layer output. Motivated by the observation, we propose LoRA-drop. Concretely, LoRA-drop evaluates the importance of LoRA based on the LoRA output. Then we retain LoRA for important layers and the other layers share the same LoRA. We conduct abundant experiments with models of different scales on NLU and NLG tasks. Results demonstrate that LoRA-drop can achieve performance comparable to full fine-tuning and LoRA, while retaining 50\% of the LoRA parameters on average.
comment: 15 pages, 12 figures
♻ ☆ Transformers Can Represent $n$-gram Language Models
Existing work has analyzed the representational capacity of the transformer architecture by means of formal models of computation. However, the focus so far has been on analyzing the architecture in terms of language \emph{acceptance}. We contend that this is an ill-suited problem in the study of \emph{language models} (LMs), which are definitionally \emph{probability distributions} over strings. In this paper, we focus on the relationship between transformer LMs and $n$-gram LMs, a simple and historically relevant class of language models. We show that transformer LMs using the hard or sparse attention mechanisms can exactly represent any $n$-gram LM, giving us a concrete lower bound on their probabilistic representational capacity. This provides a first step towards understanding the mechanisms that transformer LMs can use to represent probability distributions over strings.
♻ ☆ The Lie Derivative for Measuring Learned Equivariance ICLR 2023
Equivariance guarantees that a model's predictions capture key symmetries in data. When an image is translated or rotated, an equivariant model's representation of that image will translate or rotate accordingly. The success of convolutional neural networks has historically been tied to translation equivariance directly encoded in their architecture. The rising success of vision transformers, which have no explicit architectural bias towards equivariance, challenges this narrative and suggests that augmentations and training data might also play a significant role in their performance. In order to better understand the role of equivariance in recent vision models, we introduce the Lie derivative, a method for measuring equivariance with strong mathematical foundations and minimal hyperparameters. Using the Lie derivative, we study the equivariance properties of hundreds of pretrained models, spanning CNNs, transformers, and Mixer architectures. The scale of our analysis allows us to separate the impact of architecture from other factors like model size or training method. Surprisingly, we find that many violations of equivariance can be linked to spatial aliasing in ubiquitous network layers, such as pointwise non-linearities, and that as models get larger and more accurate they tend to display more equivariance, regardless of architecture. For example, transformers can be more equivariant than convolutional neural networks after training.
comment: ICLR 2023. Code available at: https://github.com/ngruver/lie-deriv
♻ ☆ Memorization in Self-Supervised Learning Improves Downstream Generalization ICLR 2024
Self-supervised learning (SSL) has recently received significant attention due to its ability to train high-performance encoders purely on unlabeled data-often scraped from the internet. This data can still be sensitive and empirical evidence suggests that SSL encoders memorize private information of their training data and can disclose them at inference time. Since existing theoretical definitions of memorization from supervised learning rely on labels, they do not transfer to SSL. To address this gap, we propose SSLMem, a framework for defining memorization within SSL. Our definition compares the difference in alignment of representations for data points and their augmented views returned by both encoders that were trained on these data points and encoders that were not. Through comprehensive empirical analysis on diverse encoder architectures and datasets we highlight that even though SSL relies on large datasets and strong augmentations-both known in supervised learning as regularization techniques that reduce overfitting-still significant fractions of training data points experience high memorization. Through our empirical results, we show that this memorization is essential for encoders to achieve higher generalization performance on different downstream tasks.
comment: Accepted at ICLR 2024
♻ ☆ Large Language Models Are Zero-Shot Time Series Forecasters NeurIPS 2023
By encoding time series as a string of numerical digits, we can frame time series forecasting as next-token prediction in text. Developing this approach, we find that large language models (LLMs) such as GPT-3 and LLaMA-2 can surprisingly zero-shot extrapolate time series at a level comparable to or exceeding the performance of purpose-built time series models trained on the downstream tasks. To facilitate this performance, we propose procedures for effectively tokenizing time series data and converting discrete distributions over tokens into highly flexible densities over continuous values. We argue the success of LLMs for time series stems from their ability to naturally represent multimodal distributions, in conjunction with biases for simplicity, and repetition, which align with the salient features in many time series, such as repeated seasonal trends. We also show how LLMs can naturally handle missing data without imputation through non-numerical text, accommodate textual side information, and answer questions to help explain predictions. While we find that increasing model size generally improves performance on time series, we show GPT-4 can perform worse than GPT-3 because of how it tokenizes numbers, and poor uncertainty calibration, which is likely the result of alignment interventions such as RLHF.
comment: NeurIPS 2023. Code available at: https://github.com/ngruver/llmtime
♻ ☆ $S^3$ -- Semantic Signal Separation
Topic models are useful tools for discovering latent semantic structures in large textual corpora. Topic modeling historically relied on bag-of-words representations of language. This approach makes models sensitive to the presence of stop words and noise, and does not utilize potentially useful contextual information. Recent efforts have been oriented at incorporating contextual neural representations in topic modeling and have been shown to outperform classical topic models. These approaches are, however, typically slow, volatile and still require preprocessing for optimal results. We present Semantic Signal Separation ($S^3$), a theory-driven topic modeling approach in neural embedding spaces. $S^3$ conceptualizes topics as independent axes of semantic space, and uncovers these with blind-source separation. Our approach provides the most diverse, highly coherent topics, requires no preprocessing, and is demonstrated to be the fastest contextually sensitive topic model to date. We offer an implementation of $S^3$, among other approaches, in the Turftopic Python package.
comment: 26 pages, 9 figures (main manuscript has 9 pages and 4 figures)
♻ ☆ STG4Traffic: A Survey and Benchmark of Spatial-Temporal Graph Neural Networks for Traffic Prediction
Traffic prediction has been an active research topic in the domain of spatial-temporal data mining. Accurate real-time traffic prediction is essential to improve the safety, stability, and versatility of smart city systems, i.e., traffic control and optimal routing. The complex and highly dynamic spatial-temporal dependencies make effective predictions still face many challenges. Recent studies have shown that spatial-temporal graph neural networks exhibit great potential applied to traffic prediction, which combines sequential models with graph convolutional networks to jointly model temporal and spatial correlations. However, a survey study of graph learning, spatial-temporal graph models for traffic, as well as a fair comparison of baseline models are pending and unavoidable issues. In this paper, we first provide a systematic review of graph learning strategies and commonly used graph convolution algorithms. Then we conduct a comprehensive analysis of the strengths and weaknesses of recently proposed spatial-temporal graph network models. Furthermore, we build a study called STG4Traffic using the deep learning framework PyTorch to establish a standardized and scalable benchmark on two types of traffic datasets. We can evaluate their performance by personalizing the model settings with uniform metrics. Finally, we point out some problems in the current study and discuss future directions. Source codes are available at https://github.com/trainingl/STG4Traffic.
♻ ☆ Mixing Artificial and Natural Intelligence: From Statistical Mechanics to AI and Back to Turbulence
The paper reflects on the future role of AI in scientific research, with a special focus on turbulence studies, and examines the evolution of AI, particularly through Diffusion Models rooted in non-equilibrium statistical mechanics. It underscores the significant impact of AI on advancing reduced, Lagrangian models of turbulence through innovative use of deep neural networks. Additionally, the paper reviews various other AI applications in turbulence research and outlines potential challenges and opportunities in the concurrent advancement of AI and statistical hydrodynamics. This discussion sets the stage for a future where AI and turbulence research are intricately intertwined, leading to more profound insights and advancements in both fields.
comment: 38 pages, 9 figures
♻ ☆ A Single Graph Convolution Is All You Need: Efficient Grayscale Image Classification ICIP 2024
Image classifiers often rely on convolutional neural networks (CNN) for their tasks, which are inherently more heavyweight than multilayer perceptrons (MLPs), which can be problematic in real-time applications. Additionally, many image classification models work on both RGB and grayscale datasets. Classifiers that operate solely on grayscale images are much less common. Grayscale image classification has diverse applications, including but not limited to medical image classification and synthetic aperture radar (SAR) automatic target recognition (ATR). Thus, we present a novel grayscale (single channel) image classification approach using a vectorized view of images. We exploit the lightweightness of MLPs by viewing images as a vector and reducing our problem setting to the grayscale image classification setting. We find that using a single graph convolutional layer batch-wise increases accuracy and reduces variance in the performance of our model. Moreover, we develop a customized accelerator on FPGA for the proposed model with several optimizations to improve its performance. Our experimental results on benchmark grayscale image datasets demonstrate the effectiveness of the proposed model, achieving vastly lower latency (up to 16$\times$ less) and competitive or leading performance compared to other state-of-the-art image classification models on various domain-specific grayscale image classification datasets.
comment: Accepted to IEEE ICIP 2024
♻ ☆ Erase to Enhance: Data-Efficient Machine Unlearning in MRI Reconstruction
Machine unlearning is a promising paradigm for removing unwanted data samples from a trained model, towards ensuring compliance with privacy regulations and limiting harmful biases. Although unlearning has been shown in, e.g., classification and recommendation systems, its potential in medical image-to-image translation, specifically in image recon-struction, has not been thoroughly investigated. This paper shows that machine unlearning is possible in MRI tasks and has the potential to benefit for bias removal. We set up a protocol to study how much shared knowledge exists between datasets of different organs, allowing us to effectively quantify the effect of unlearning. Our study reveals that combining training data can lead to hallucinations and reduced image quality in the reconstructed data. We use unlearning to remove hallucinations as a proxy exemplar of undesired data removal. Indeed, we show that machine unlearning is possible without full retraining. Furthermore, our observations indicate that maintaining high performance is feasible even when using only a subset of retain data. We have made our code publicly accessible.
comment: The paper is accpeted by MIDL 2024
♻ ☆ The Future of Consumer Edge-AI Computing
In the last decade, Deep Learning has rapidly infiltrated the consumer end, mainly thanks to hardware acceleration across devices. However, as we look towards the future, it is evident that isolated hardware will be insufficient. Increasingly complex AI tasks demand shared resources, cross-device collaboration, and multiple data types, all without compromising user privacy or quality of experience. To address this, we introduce a novel paradigm centered around EdgeAI-Hub devices, designed to reorganise and optimise compute resources and data access at the consumer edge. To this end, we lay a holistic foundation for the transition from on-device to Edge-AI serving systems in consumer environments, detailing their components, structure, challenges and opportunities.
comment: Extended version of accepted paper at IEEE Pervasive Computing
♻ ☆ Theoretical Understanding of In-Context Learning in Shallow Transformers with Unstructured Data
Large language models (LLMs) are powerful models that can learn concepts at the inference stage via in-context learning (ICL). While theoretical studies, e.g., \cite{zhang2023trained}, attempt to explain the mechanism of ICL, they assume the input $x_i$ and the output $y_i$ of each demonstration example are in the same token (i.e., structured data). However, in real practice, the examples are usually text input, and all words, regardless of their logic relationship, are stored in different tokens (i.e., unstructured data \cite{wibisono2023role}). To understand how LLMs learn from the unstructured data in ICL, this paper studies the role of each component in the transformer architecture and provides a theoretical understanding to explain the success of the architecture. In particular, we consider a simple transformer with one/two attention layers and linear regression tasks for the ICL prediction. We observe that (1) a transformer with two layers of (self-)attentions with a look-ahead attention mask can learn from the prompt in the unstructured data, and (2) positional encoding can match the $x_i$ and $y_i$ tokens to achieve a better ICL performance.
♻ ☆ Benchmarking Label Noise in Instance Segmentation: Spatial Noise Matters
Obtaining accurate labels for instance segmentation is particularly challenging due to the complex nature of the task. Each image necessitates multiple annotations, encompassing not only the object's class but also its precise spatial boundaries. These requirements elevate the likelihood of errors and inconsistencies in both manual and automated annotation processes. By simulating different noise conditions, we provide a realistic scenario for assessing the robustness and generalization capabilities of instance segmentation models in different segmentation tasks, introducing COCO-N and Cityscapes-N. We also propose a benchmark for weakly annotation noise, dubbed COCO-WAN, which utilizes foundation models and weak annotations to simulate semi-automated annotation tools and their noisy labels. This study sheds light on the quality of segmentation masks produced by various models and challenges the efficacy of popular methods designed to address learning with label noise.
♻ ☆ Scalable and Flexible Causal Discovery with an Efficient Test for Adjacency ICML 2024
To make accurate predictions, understand mechanisms, and design interventions in systems of many variables, we wish to learn causal graphs from large scale data. Unfortunately the space of all possible causal graphs is enormous so scalably and accurately searching for the best fit to the data is a challenge. In principle we could substantially decrease the search space, or learn the graph entirely, by testing the conditional independence of variables. However, deciding if two variables are adjacent in a causal graph may require an exponential number of tests. Here we build a scalable and flexible method to evaluate if two variables are adjacent in a causal graph, the Differentiable Adjacency Test (DAT). DAT replaces an exponential number of tests with a provably equivalent relaxed problem. It then solves this problem by training two neural networks. We build a graph learning method based on DAT, DAT-Graph, that can also learn from data with interventions. DAT-Graph can learn graphs of 1000 variables with state of the art accuracy. Using the graph learned by DAT-Graph, we also build models that make much more accurate predictions of the effects of interventions on large scale RNA sequencing data.
comment: ICML 2024; Code at https://github.com/AlanNawzadAmin/DAT-graph
♻ ☆ Improving global awareness of linkset predictions using Cross-Attentive Modulation tokens
Most of multiple link prediction or graph generation techniques rely on the attention mechanism or on Graph Neural Networks (GNNs), which consist in leveraging node-level information exchanges in order to form proper link predictions. Such node-level interactions do not process nodes as an ordered sequence, which would imply some kind of natural ordering of the nodes: they are said to be permutation invariant mechanisms. They are well suited for graph problems, but struggle at providing a global orchestration of the predicted links, which can result in a loss of performance. Some typical issues can be the difficulty to ensure high-level properties such as global connectedness, fixed diameter or to avoid information bottleneck effects such as oversmoothing and oversquashing, which respectively consist in abundant smoothing in dense areas leading to a loss of information and a tendency to exclude isolated nodes from the message passing scheme, and often result in irrelevant, unbalanced link predictions. To tackle this problem, we hereby present Cross-Attentive Modulation (CAM) tokens, which introduce cross-attentive units used to condition node and edge-level modulations in order to enable context-aware computations that improve the global consistency of the prediction links. We will implement it on a few permutation invariant architectures, and showcase benchmarks that prove the merits of our work.
comment: 17 pages, 2 figures, not published nor submitted yet
♻ ☆ A Rate-Distortion View of Uncertainty Quantification
In supervised learning, understanding an input's proximity to the training data can help a model decide whether it has sufficient evidence for reaching a reliable prediction. While powerful probabilistic models such as Gaussian Processes naturally have this property, deep neural networks often lack it. In this paper, we introduce Distance Aware Bottleneck (DAB), i.e., a new method for enriching deep neural networks with this property. Building on prior information bottleneck approaches, our method learns a codebook that stores a compressed representation of all inputs seen during training. The distance of a new example from this codebook can serve as an uncertainty estimate for the example. The resulting model is simple to train and provides deterministic uncertainty estimates by a single forward pass. Finally, our method achieves better out-of-distribution (OOD) detection and misclassification prediction than prior methods, including expensive ensemble methods, deep kernel Gaussian Processes, and approaches based on the standard information bottleneck.
♻ ☆ Navigating Complexity: Toward Lossless Graph Condensation via Expanding Window Matching
Graph condensation aims to reduce the size of a large-scale graph dataset by synthesizing a compact counterpart without sacrificing the performance of Graph Neural Networks (GNNs) trained on it, which has shed light on reducing the computational cost for training GNNs. Nevertheless, existing methods often fall short of accurately replicating the original graph for certain datasets, thereby failing to achieve the objective of lossless condensation. To understand this phenomenon, we investigate the potential reasons and reveal that the previous state-of-the-art trajectory matching method provides biased and restricted supervision signals from the original graph when optimizing the condensed one. This significantly limits both the scale and efficacy of the condensed graph. In this paper, we make the first attempt toward \textit{lossless graph condensation} by bridging the previously neglected supervision signals. Specifically, we employ a curriculum learning strategy to train expert trajectories with more diverse supervision signals from the original graph, and then effectively transfer the information into the condensed graph with expanding window matching. Moreover, we design a loss function to further extract knowledge from the expert trajectories. Theoretical analysis justifies the design of our method and extensive experiments verify its superiority across different datasets. Code is released at https://github.com/NUS-HPC-AI-Lab/GEOM.
comment: Lossless graph condensation method
♻ ☆ Provable Guarantees for Model Performance via Mechanistic Interpretability
In this work, we propose using mechanistic interpretability -- techniques for reverse engineering model weights into human-interpretable algorithms -- to derive and compactly prove formal guarantees on model performance. We prototype this approach by formally proving lower bounds on the accuracy of 151 small transformers trained on a Max-of-$K$ task. We create 102 different computer-assisted proof strategies and assess their length and tightness of bound on each of our models. Using quantitative metrics, we find that shorter proofs seem to require and provide more mechanistic understanding. Moreover, we find that more faithful mechanistic understanding leads to tighter performance bounds. We confirm these connections by qualitatively examining a subset of our proofs. Finally, we identify compounding structureless noise as a key challenge for using mechanistic interpretability to generate compact proofs on model performance.
♻ ☆ A generalizable framework for low-rank tensor completion with numerical priors
Low-Rank Tensor Completion, a method which exploits the inherent structure of tensors, has been studied extensively as an effective approach to tensor completion. Whilst such methods attained great success, none have systematically considered exploiting the numerical priors of tensor elements. Ignoring numerical priors causes loss of important information regarding the data, and therefore prevents the algorithms from reaching optimal accuracy. Despite the existence of some individual works which consider ad hoc numerical priors for specific tasks, no generalizable frameworks for incorporating numerical priors have appeared. We present the Generalized CP Decomposition Tensor Completion (GCDTC) framework, the first generalizable framework for low-rank tensor completion that takes numerical priors of the data into account. We test GCDTC by further proposing the Smooth Poisson Tensor Completion (SPTC) algorithm, an instantiation of the GCDTC framework, whose performance exceeds current state-of-the-arts by considerable margins in the task of non-negative tensor completion, exemplifying GCDTC's effectiveness. Our code is open-source.
comment: Accepted to Pattern Recognition
♻ ☆ Local Recovery of Two-layer Neural Networks at Overparameterization
Under mild assumptions, we investigate the structure of loss landscape of two-layer neural networks near global minima, determine the set of parameters which recovers the target function, and characterize the gradient flows around it. With novel techniques, our work uncovers some simple aspects of the complicated loss landscape and reveals how model, target function, samples and initialization affect the training dynamics differently. These results concludes that two-layer neural networks can be recovered locally at overparameterization.
♻ ☆ Enhanced Gradient Boosting for Zero-Inflated Insurance Claims and Comparative Analysis of CatBoost, XGBoost, and LightGBM
The property and casualty (P&C) insurance industry faces challenges in developing claim predictive models due to the highly right-skewed distribution of positive claims with excess zeros. To address this, actuarial science researchers have employed "zero-inflated" models that combine a traditional count model and a binary model. This paper investigates the use of boosting algorithms to process insurance claim data, including zero-inflated telematics data, to construct claim frequency models. Three popular gradient boosting libraries - XGBoost, LightGBM, and CatBoost - are evaluated and compared to determine the most suitable library for training insurance claim data and fitting actuarial frequency models. Through a comprehensive analysis of two distinct datasets, it is determined that CatBoost is the best for developing auto claim frequency models based on predictive performance. Furthermore, we propose a new zero-inflated Poisson boosted tree model, with variation in the assumption about the relationship between inflation probability $p$ and distribution mean $\mu$, and find that it outperforms others depending on data characteristics. This model enables us to take advantage of particular CatBoost tools, which makes it easier and more convenient to investigate the effects and interactions of various risk features on the frequency model when using telematics data.
comment: 26pages, 6tables, 7figures
♻ ☆ Equivariant Frames and the Impossibility of Continuous Canonicalization
Canonicalization provides an architecture-agnostic method for enforcing equivariance, with generalizations such as frame-averaging recently gaining prominence as a lightweight and flexible alternative to equivariant architectures. Recent works have found an empirical benefit to using probabilistic frames instead, which learn weighted distributions over group elements. In this work, we provide strong theoretical justification for this phenomenon: for commonly-used groups, there is no efficiently computable choice of frame that preserves continuity of the function being averaged. In other words, unweighted frame-averaging can turn a smooth, non-symmetric function into a discontinuous, symmetric function. To address this fundamental robustness problem, we formally define and construct \emph{weighted} frames, which provably preserve continuity, and demonstrate their utility by constructing efficient and continuous weighted frames for the actions of $SO(2)$, $SO(3)$, and $S_n$ on point clouds.
♻ ☆ ROCOv2: Radiology Objects in COntext Version 2, an Updated Multimodal Image Dataset
Automated medical image analysis systems often require large amounts of training data with high quality labels, which are difficult and time consuming to generate. This paper introduces Radiology Object in COntext version 2 (ROCOv2), a multimodal dataset consisting of radiological images and associated medical concepts and captions extracted from the PMC Open Access subset. It is an updated version of the ROCO dataset published in 2018, and adds 35,705 new images added to PMC since 2018. It further provides manually curated concepts for imaging modalities with additional anatomical and directional concepts for X-rays. The dataset consists of 79,789 images and has been used, with minor modifications, in the concept detection and caption prediction tasks of ImageCLEFmedical Caption 2023. The dataset is suitable for training image annotation models based on image-caption pairs, or for multi-label image classification using Unified Medical Language System (UMLS) concepts provided with each image. In addition, it can serve for pre-training of medical domain models, and evaluation of deep learning models for multi-task learning.
comment: Accepted for Scientific Data
♻ ☆ Variational optimization of the amplitude of neural-network quantum many-body ground states
Neural-network quantum states (NQSs), variationally optimized by combining traditional methods and deep learning techniques, is a new way to find quantum many-body ground states and gradually becomes a competitor of traditional variational methods. However, there are still some difficulties in the optimization of NQSs, such as local minima, slow convergence, and sign structure optimization. Here, we split a quantum many-body variational wave function into a multiplication of a real-valued amplitude neural network and a sign structure, and focus on the optimization of the amplitude network while keeping the sign structure fixed. The amplitude network is a convolutional neural network (CNN) with residual blocks, namely a ResNet. Our method is tested on three typical quantum many-body systems. The obtained ground state energies are lower than or comparable to those from traditional variational Monte Carlo (VMC) methods and density matrix renormalization group (DMRG). Surprisingly, for the frustrated Heisenberg $J_1$-$J_2$ model, our results are better than those of the complex-valued CNN in the literature, implying that the sign structure of the complex-valued NQS is difficult to be optimized. We will study the optimization of the sign structure of NQSs in the future.
comment: 9 pages, 3 figures, 4 tables, published version
♻ ☆ Segmentation and Characterization of Macerated Fibers and Vessels Using Deep Learning
Wood comprises different cell types, such as fibers, tracheids and vessels, defining its properties. Studying cells' shape, size, and arrangement in microscopy images is crucial for understanding wood characteristics. Typically, this involves macerating (soaking) samples in a solution to separate cells, then spreading them on slides for imaging with a microscope that covers a wide area, capturing thousands of cells. However, these cells often cluster and overlap in images, making the segmentation difficult and time-consuming using standard image-processing methods. In this work, we developed an automatic deep learning segmentation approach that utilizes the one-stage YOLOv8 model for fast and accurate segmentation and characterization of macerated fiber and vessel form aspen trees in microscopy images. The model can analyze 32,640 x 25,920 pixels images and demonstrate effective cell detection and segmentation, achieving a mAP_{0.5-0.95} of 78 %. To assess the model's robustness, we examined fibers from a genetically modified tree line known for longer fibers. The outcomes were comparable to previous manual measurements. Additionally, we created a user-friendly web application for image analysis and provided the code for use on Google Colab. By leveraging YOLOv8's advances, this work provides a deep learning solution to enable efficient quantification and analysis of wood cells suitable for practical applications.
comment: 7 figures
♻ ☆ Retrieve to Explain: Evidence-driven Predictions with Language Models ICML
Language models hold incredible promise for enabling scientific discovery by synthesizing massive research corpora. Many complex scientific research questions have multiple plausible answers, each supported by evidence of varying strength. However, existing language models lack the capability to quantitatively and faithfully compare answer plausibility in terms of supporting evidence. To address this issue, we introduce Retrieve to Explain (R2E), a retrieval-based language model. R2E scores and ranks all possible answers to a research question based on evidence retrieved from a document corpus. The architecture represents each answer only in terms of its supporting evidence, with the answer itself masked. This allows us to extend feature attribution methods, such as Shapley values, to transparently attribute each answer's score back to its supporting evidence at inference time. The architecture also allows R2E to incorporate new evidence without retraining, including non-textual data modalities templated into natural language. We assess on the challenging task of drug target identification from scientific literature, a human-in-the-loop process where failures are extremely costly and explainability is paramount. When predicting whether drug targets will subsequently be confirmed as efficacious in clinical trials, R2E not only matches non-explainable literature-based models but also surpasses a genetics-based target identification approach used throughout the pharmaceutical industry.
comment: ICML AI for Science 2024
♻ ☆ Connected Speech-Based Cognitive Assessment in Chinese and English
We present a novel benchmark dataset and prediction tasks for investigating approaches to assess cognitive function through analysis of connected speech. The dataset consists of speech samples and clinical information for speakers of Mandarin Chinese and English with different levels of cognitive impairment as well as individuals with normal cognition. These data have been carefully matched by age and sex by propensity score analysis to ensure balance and representativity in model training. The prediction tasks encompass mild cognitive impairment diagnosis and cognitive test score prediction. This framework was designed to encourage the development of approaches to speech-based cognitive assessment which generalise across languages. We illustrate it by presenting baseline prediction models that employ language-agnostic and comparable features for diagnosis and cognitive test score prediction. The models achieved unweighted average recall was 59.2% in diagnosis, and root mean squared error of 2.89 in score prediction.
comment: To appear in Proceedings of Interspeech 2024
♻ ☆ Instruction Fine-Tuning: Does Prompt Loss Matter?
We present a novel study analyzing the effects of various prompt loss token weights (PLW) for supervised instruction fine-tuning (SIFT). While prompt-masking (PLW = 0) is common for SIFT, some fine-tuning APIs support fractional PLWs and suggest that using a small non-zero PLW can help stabilize learning when fine-tuning on short-completion data. However, there has never been a study confirming this claim, and OpenAI, a major cloud-based SIFT provider, recently removed this parameter from their fine-tuning API. We found that performance of models fine-tuned on short-completion data had a statistically-significant negative quadratic relationship with PLW. Using small values (0.01 - 0.5) of PLW produced better results on multiple-choice and short-generation benchmarks (outperforming models fine-tuned on long-completion data) while large values (~ 1.0) of PLW produced better results on long-generation benchmarks. We explained this effect and verified its importance through additional experiments. This research serves as a warning to API providers about the importance of providing a PLW parameter for SIFT.
comment: 8 pages of content. 13 pages of appendices. 45 figures
♻ ☆ Fixed Design Analysis of Regularization-Based Continual Learning
We consider a continual learning (CL) problem with two linear regression tasks in the fixed design setting, where the feature vectors are assumed fixed and the labels are assumed to be random variables. We consider an $\ell_2$-regularized CL algorithm, which computes an Ordinary Least Squares parameter to fit the first dataset, then computes another parameter that fits the second dataset under an $\ell_2$-regularization penalizing its deviation from the first parameter, and outputs the second parameter. For this algorithm, we provide tight bounds on the average risk over the two tasks. Our risk bounds reveal a provable trade-off between forgetting and intransigence of the $\ell_2$-regularized CL algorithm: with a large regularization parameter, the algorithm output forgets less information about the first task but is intransigent to extract new information from the second task; and vice versa. Our results suggest that catastrophic forgetting could happen for CL with dissimilar tasks (under a precise similarity measurement) and that a well-tuned $\ell_2$-regularization can partially mitigate this issue by introducing intransigence.
comment: CoLLAs 2023 camera-ready version
♻ ☆ Gaussian Process on the Product of Directional Manifolds
We present a principled study on defining Gaussian processes (GPs) with inputs on the product of directional manifolds. A circular kernel is first presented according to the von Mises distribution. Based thereon, the hypertoroidal von Mises (HvM) kernel is proposed to establish GPs on hypertori with consideration of correlated circular components. The proposed HvM kernel is demonstrated with multi-output GP regression for learning vector-valued functions on hypertori using the intrinsic coregionalization model. Analytic derivatives for hyperparameter optimization are provided for runtime-critical applications. For evaluation, we synthesize a ranging-based sensor network and employ the HvM-based GPs for data-driven recursive localization. Numerical results show that the HvM-based GP achieves superior tracking accuracy compared to parametric model and GPs of conventional kernel designs.
comment: 6 pages
♻ ☆ Fuzzy Convolution Neural Networks for Tabular Data Classification
Recently, convolution neural networks (CNNs) have attracted a great deal of attention due to their remarkable performance in various domains, particularly in image and text classification tasks. However, their application to tabular data classification remains underexplored. There are many fields such as bioinformatics, finance, medicine where nonimage data are prevalent. Adaption of CNNs to classify nonimage data remains highly challenging. This paper investigates the efficacy of CNNs for tabular data classification, aiming to bridge the gap between traditional machine learning approaches and deep learning techniques. We propose a novel framework fuzzy convolution neural network (FCNN) tailored specifically for tabular data to capture local patterns within feature vectors. In our approach, we map feature values to fuzzy memberships. The fuzzy membership vectors are converted into images that are used to train the CNN model. The trained CNN model is used to classify unknown feature vectors. To validate our approach, we generated six complex noisy data sets. We used randomly selected seventy percent samples from each data set for training and thirty percent for testing. The data sets were also classified using the state-of-the-art machine learning algorithms such as the decision tree (DT), support vector machine (SVM), fuzzy neural network (FNN), Bayes classifier, and Random Forest (RF). Experimental results demonstrate that our proposed model can effectively learn meaningful representations from tabular data, achieving competitive or superior performance compared to existing methods. Overall, our finding suggests that the proposed FCNN model holds promise as a viable alternative for tabular data classification tasks, offering a fresh prospective and potentially unlocking new opportunities for leveraging deep learning in structured data analysis.
comment: 10 pages, 16 figures, Submitted to IEEE Access
♻ ☆ MLEM: Generative and Contrastive Learning as Distinct Modalities for Event Sequences
This study explores the application of self-supervised learning techniques for event sequences. It is a key modality in various applications such as banking, e-commerce, and healthcare. However, there is limited research on self-supervised learning for event sequences, and methods from other domains like images, texts, and speech may not easily transfer. To determine the most suitable approach, we conduct a detailed comparative analysis of previously identified best-performing methods. We find that neither the contrastive nor generative method is superior. Our assessment includes classifying event sequences, predicting the next event, and evaluating embedding quality. These results further highlight the potential benefits of combining both methods. Given the lack of research on hybrid models in this domain, we initially adapt the baseline model from another domain. However, upon observing its underperformance, we develop a novel method called the Multimodal-Learning Event Model (MLEM). MLEM treats contrastive learning and generative modeling as distinct yet complementary modalities, aligning their embeddings. The results of our study demonstrate that combining contrastive and generative approaches into one procedure with MLEM achieves superior performance across multiple metrics.
comment: 11 pages, 9 figures
♻ ☆ STEMO: Early Spatio-temporal Forecasting with Multi-Objective Reinforcement Learning KDD 2024
Accuracy and timeliness are indeed often conflicting goals in prediction tasks. Premature predictions may yield a higher rate of false alarms, whereas delaying predictions to gather more information can render them too late to be useful. In applications such as wildfires, crimes, and traffic jams, timely forecasting are vital for safeguarding human life and property. Consequently, finding a balance between accuracy and timeliness is crucial. In this paper, we propose an early spatio-temporal forecasting model based on Multi-Objective reinforcement learning that can either implement an optimal policy given a preference or infer the preference based on a small number of samples. The model addresses two primary challenges: 1) enhancing the accuracy of early forecasting and 2) providing the optimal policy for determining the most suitable prediction time for each area. Our method demonstrates superior performance on three large-scale real-world datasets, surpassing existing methods in early spatio-temporal forecasting tasks.
comment: Accepted paper in KDD 2024
♻ ☆ Preventing Model Collapse in Gaussian Process Latent Variable Models ICML
Gaussian process latent variable models (GPLVMs) are a versatile family of unsupervised learning models commonly used for dimensionality reduction. However, common challenges in modeling data with GPLVMs include inadequate kernel flexibility and improper selection of the projection noise, leading to a type of model collapse characterized by vague latent representations that do not reflect the underlying data structure. This paper addresses these issues by, first, theoretically examining the impact of projection variance on model collapse through the lens of a linear GPLVM. Second, we tackle model collapse due to inadequate kernel flexibility by integrating the spectral mixture (SM) kernel and a differentiable random Fourier feature (RFF) kernel approximation, which ensures computational scalability and efficiency through off-the-shelf automatic differentiation tools for learning the kernel hyperparameters, projection variance, and latent representations within the variational inference framework. The proposed GPLVM, named advisedRFLVM, is evaluated across diverse datasets and consistently outperforms various salient competing models, including state-of-the-art variational autoencoders (VAEs) and other GPLVM variants, in terms of informative latent representations and missing data imputation.
comment: International Conference on Machine Learning (ICML), 2024
♻ ☆ Deep Proxy Causal Learning and its Application to Confounded Bandit Policy Evaluation
Proxy causal learning (PCL) is a method for estimating the causal effect of treatments on outcomes in the presence of unobserved confounding, using proxies (structured side information) for the confounder. This is achieved via two-stage regression: in the first stage, we model relations among the treatment and proxies; in the second stage, we use this model to learn the effect of treatment on the outcome, given the context provided by the proxies. PCL guarantees recovery of the true causal effect, subject to identifiability conditions. We propose a novel method for PCL, the deep feature proxy variable method (DFPV), to address the case where the proxies, treatments, and outcomes are high-dimensional and have nonlinear complex relationships, as represented by deep neural network features. We show that DFPV outperforms recent state-of-the-art PCL methods on challenging synthetic benchmarks, including settings involving high dimensional image data. Furthermore, we show that PCL can be applied to off-policy evaluation for the confounded bandit problem, in which DFPV also exhibits competitive performance.
comment: arXiv admin note: text overlap with arXiv:2010.07154
♻ ☆ Data Poisoning to Fake a Nash Equilibrium in Markov Games
We characterize offline data poisoning attacks on Multi-Agent Reinforcement Learning (MARL), where an attacker may change a data set in an attempt to install a (potentially fictitious) unique Markov-perfect Nash equilibrium for a two-player zero-sum Markov game. We propose the unique Nash set, namely the set of games, specified by their Q functions, with a specific joint policy being the unique Nash equilibrium. The unique Nash set is central to poisoning attacks because the attack is successful if and only if data poisoning pushes all plausible games inside the set. The unique Nash set generalizes the reward polytope commonly used in inverse reinforcement learning to MARL. For zero-sum Markov games, both the inverse Nash set and the set of plausible games induced by data are polytopes in the Q function space. We exhibit a linear program to efficiently compute the optimal poisoning attack. Our work sheds light on the structure of data poisoning attacks on offline MARL, a necessary step before one can design more robust MARL algorithms.
♻ ☆ SNN4Agents: A Framework for Developing Energy-Efficient Embodied Spiking Neural Networks for Autonomous Agents
Recent trends have shown that autonomous agents, such as Autonomous Ground Vehicles (AGVs), Unmanned Aerial Vehicles (UAVs), and mobile robots, effectively improve human productivity in solving diverse tasks. However, since these agents are typically powered by portable batteries, they require extremely low power/energy consumption to operate in a long lifespan. To solve this challenge, neuromorphic computing has emerged as a promising solution, where bio-inspired Spiking Neural Networks (SNNs) use spikes from event-based cameras or data conversion pre-processing to perform sparse computations efficiently. However, the studies of SNN deployments for autonomous agents are still at an early stage. Hence, the optimization stages for enabling efficient embodied SNN deployments for autonomous agents have not been defined systematically. Toward this, we propose a novel framework called SNN4Agents that consists of a set of optimization techniques for designing energy-efficient embodied SNNs targeting autonomous agent applications. Our SNN4Agents employs weight quantization, timestep reduction, and attention window reduction to jointly improve the energy efficiency, reduce the memory footprint, optimize the processing latency, while maintaining high accuracy. In the evaluation, we investigate use cases of event-based car recognition, and explore the trade-offs among accuracy, latency, memory, and energy consumption. The experimental results show that our proposed framework can maintain high accuracy (i.e., 84.12% accuracy) with 68.75% memory saving, 3.58x speed-up, and 4.03x energy efficiency improvement as compared to the state-of-the-art work for NCARS dataset. In this manner, our SNN4Agents framework paves the way toward enabling energy-efficient embodied SNN deployments for autonomous agents.
comment: Accepted for publication at Frontiers in Robotics and AI (FROBT) - Section Robot Vision and Artificial Perception
♻ ☆ Embeddings between Barron spaces with higher order activation functions
The approximation properties of infinitely wide shallow neural networks heavily depend on the choice of the activation function. To understand this influence, we study embeddings between Barron spaces with different activation functions. These embeddings are proven by providing push-forward maps on the measures $\mu$ used to represent functions $f$. An activation function of particular interest is the rectified power unit ($\operatorname{RePU}$) given by $\operatorname{RePU}_s(x)=\max(0,x)^s$. For many commonly used activation functions, the well-known Taylor remainder theorem can be used to construct a push-forward map, which allows us to prove the embedding of the associated Barron space into a Barron space with a $\operatorname{RePU}$ as activation function. Moreover, the Barron spaces associated with the $\operatorname{RePU}_s$ have a hierarchical structure similar to the Sobolev spaces $H^m$.
comment: 21 pages, 1 figure; revision adds extension to fractional RePU and fractional Taylor expansion
♻ ☆ Vanishing Variance Problem in Fully Decentralized Neural-Network Systems
Federated learning and gossip learning are emerging methodologies designed to mitigate data privacy concerns by retaining training data on client devices and exclusively sharing locally-trained machine learning (ML) models with others. The primary distinction between the two lies in their approach to model aggregation: federated learning employs a centralized parameter server, whereas gossip learning adopts a fully decentralized mechanism, enabling direct model exchanges among nodes. This decentralized nature often positions gossip learning as less efficient compared to federated learning. Both methodologies involve a critical step: computing a representation of received ML models and integrating this representation into the existing model. Conventionally, this representation is derived by averaging the received models, exemplified by the FedAVG algorithm. Our findings suggest that this averaging approach inherently introduces a potential delay in model convergence. We identify the underlying cause and refer to it as the "vanishing variance" problem, where averaging across uncorrelated ML models undermines the optimal variance established by the Xavier weight initialization. Unlike federated learning where the central server ensures model correlation, and unlike traditional gossip learning which circumvents this problem through model partitioning and sampling, our research introduces a variance-corrected model averaging algorithm. This novel algorithm preserves the optimal variance needed during model averaging, irrespective of network topology or non-IID data distributions. Our extensive simulation results demonstrate that our approach enables gossip learning to achieve convergence efficiency comparable to that of federated learning.
comment: 7 pages
♻ ☆ CM2-Net: Continual Cross-Modal Mapping Network for Driver Action Recognition
Driver action recognition has significantly advanced in enhancing driver-vehicle interactions and ensuring driving safety by integrating multiple modalities, such as infrared and depth. Nevertheless, compared to RGB modality only, it is always laborious and costly to collect extensive data for all types of non-RGB modalities in car cabin environments. Therefore, previous works have suggested independently learning each non-RGB modality by fine-tuning a model pre-trained on RGB videos, but these methods are less effective in extracting informative features when faced with newly-incoming modalities due to large domain gaps. In contrast, we propose a Continual Cross-Modal Mapping Network (CM2-Net) to continually learn each newly-incoming modality with instructive prompts from the previously-learned modalities. Specifically, we have developed Accumulative Cross-modal Mapping Prompting (ACMP), to map the discriminative and informative features learned from previous modalities into the feature space of newly-incoming modalities. Then, when faced with newly-incoming modalities, these mapped features are able to provide effective prompts for which features should be extracted and prioritized. These prompts are accumulating throughout the continual learning process, thereby boosting further recognition performances. Extensive experiments conducted on the Drive&Act dataset demonstrate the performance superiority of CM2-Net on both uni- and multi-modal driver action recognition.
♻ ☆ Foundation Models for Time Series Analysis: A Tutorial and Survey KDD
Time series analysis stands as a focal point within the data mining community, serving as a cornerstone for extracting valuable insights crucial to a myriad of real-world applications. Recent advances in Foundation Models (FMs) have fundamentally reshaped the paradigm of model design for time series analysis, boosting various downstream tasks in practice. These innovative approaches often leverage pre-trained or fine-tuned FMs to harness generalized knowledge tailored for time series analysis. This survey aims to furnish a comprehensive and up-to-date overview of FMs for time series analysis. While prior surveys have predominantly focused on either application or pipeline aspects of FMs in time series analysis, they have often lacked an in-depth understanding of the underlying mechanisms that elucidate why and how FMs benefit time series analysis. To address this gap, our survey adopts a methodology-centric classification, delineating various pivotal elements of time-series FMs, including model architectures, pre-training techniques, adaptation methods, and data modalities. Overall, this survey serves to consolidate the latest advancements in FMs pertinent to time series analysis, accentuating their theoretical underpinnings, recent strides in development, and avenues for future exploration.
comment: In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD'24)
♻ ☆ A Framework of SO(3)-equivariant Non-linear Representation Learning and its Application to Electronic-Structure Hamiltonian Prediction
We present both a theoretical and a methodological framework that addresses a critical challenge in applying deep learning to physical systems: the reconciliation of non-linear expressiveness with SO(3)-equivariance in predictions of SO(3)-equivariant quantities. Inspired by covariant theory in physics, we address this problem by exploring the mathematical relationships between SO(3)-invariant and SO(3)-equivariant quantities and their representations. We first construct theoretical SO(3)-invariant quantities derived from the SO(3)-equivariant regression targets, and use these invariant quantities as supervisory labels to guide the learning of high-quality SO(3)-invariant features. Given that SO(3)-invariance is preserved under non-linear operations, the encoding process for invariant features can extensively utilize non-linear mappings, thereby fully capturing the non-linear patterns inherent in physical systems. Building on this foundation, we propose a gradient-based mechanism to induce SO(3)-equivariant encodings of various degrees from the learned SO(3)-invariant features. This mechanism can incorporate non-linear expressive capabilities into SO(3)-equivariant representations, while theoretically preserving their equivariant properties as we prove. We apply our theory and method to the electronic-structure Hamiltonian prediction tasks, experimental results on eight benchmark databases covering multiple types of elements and challenging scenarios show dramatic breakthroughs on the state-of-the-art prediction accuracy, with improvements of up to 40% in predicting Hamiltonians and up to 76% in predicting downstream physical quantities such as occupied orbital energy. Our approach goes beyond handling physical systems and offers a promising general solution to the critical dilemma between equivariance and non-linear expressiveness for the deep learning paradigm.
♻ ☆ Spatial-Temporal Large Language Model for Traffic Prediction MDM 2024
Traffic prediction, an essential component for intelligent transportation systems, endeavours to use historical data to foresee future traffic features at specific locations. Although existing traffic prediction models often emphasize developing complex neural network structures, their accuracy has not improved. Recently, large language models have shown outstanding capabilities in time series analysis. Differing from existing models, LLMs progress mainly through parameter expansion and extensive pretraining while maintaining their fundamental structures. Motivated by these developments, we propose a Spatial-Temporal Large Language Model (ST-LLM) for traffic prediction. In the ST-LLM, we define timesteps at each location as tokens and design a spatial-temporal embedding to learn the spatial location and global temporal patterns of these tokens. Additionally, we integrate these embeddings by a fusion convolution to each token for a unified spatial-temporal representation. Furthermore, we innovate a partially frozen attention strategy to adapt the LLM to capture global spatial-temporal dependencies for traffic prediction. Comprehensive experiments on real traffic datasets offer evidence that ST-LLM is a powerful spatial-temporal learner that outperforms state-of-the-art models. Notably, the ST-LLM also exhibits robust performance in both few-shot and zero-shot prediction scenarios. The code is publicly available at https://github.com/ChenxiLiu-HNU/ST-LLM.
comment: Accepted by MDM 2024 (Research Track)
♻ ☆ CHG Shapley: Efficient Data Valuation and Selection towards Trustworthy Machine Learning
Understanding the decision-making process of machine learning models is crucial for ensuring trustworthy machine learning. Data Shapley, a landmark study on data valuation, advances this understanding by assessing the contribution of each datum to model accuracy. However, the resource-intensive and time-consuming nature of multiple model retraining poses challenges for applying Data Shapley to large datasets. To address this, we propose the CHG (Conduct of Hardness and Gradient) score, which approximates the utility of each data subset on model accuracy during a single model training. By deriving the closed-form expression of the Shapley value for each data point under the CHG score utility function, we reduce the computational complexity to the equivalent of a single model retraining, an exponential improvement over existing methods. Additionally, we employ CHG Shapley for real-time data selection, demonstrating its effectiveness in identifying high-value and noisy data. CHG Shapley facilitates trustworthy model training through efficient data valuation, introducing a novel data-centric perspective on trustworthy machine learning.
Systems and Control 21
☆ Zeroing neural dynamics solving time-variant complex conjugate matrix equation
Complex conjugate matrix equations (CCME) have aroused the interest of many researchers because of computations and antilinear systems. Existing research is dominated by its time-invariant solving methods, but lacks proposed theories for solving its time-variant version. Moreover, artificial neural networks are rarely studied for solving CCME. In this paper, starting with the earliest CCME, zeroing neural dynamics (ZND) is applied to solve its time-variant version. Firstly, the vectorization and Kronecker product in the complex field are defined uniformly. Secondly, Con-CZND1 model and Con-CZND2 model are proposed and theoretically prove convergence and effectiveness. Thirdly, three numerical experiments are designed to illustrate the effectiveness of the two models, compare their differences, highlight the significance of neural dynamics in the complex field, and refine the theory related to ZND.
☆ Ensuring Both Positivity and Stability Using Sector-Bounded Nonlinearity for Systems with Neural Network Controllers
This paper introduces a novel method for the stability analysis of positive feedback systems with a class of fully connected feedforward neural networks (FFNN) controllers. By establishing sector bounds for fully connected FFNNs without biases, we present a stability theorem that demonstrates the global exponential stability of linear systems under fully connected FFNN control. Utilizing principles from positive Lur'e systems and the positive Aizerman conjecture, our approach effectively addresses the challenge of ensuring stability in highly nonlinear systems. The crux of our method lies in maintaining sector bounds that preserve the positivity and Hurwitz property of the overall Lur'e system. We showcase the practical applicability of our methodology through its implementation in a linear system managed by a FFNN trained on output feedback controller data, highlighting its potential for enhancing stability in dynamic systems.
comment: 6 pages, 7 figures, to be published in IEEE Control Systems Letters (L-CSS)
☆ Mitigating Information Asymmetry in Two-Stage Contracts with Non-Myopic Agents
We consider a Stackelberg game in which a principal (she) establishes a two-stage contract with a non-myopic agent (he) whose type is unknown. The contract takes the form of an incentive function mapping the agent's first-stage action to his second-stage incentive. While the first-stage action reveals the agent's type under truthful play, a non-myopic agent could benefit from portraying a false type in the first stage to obtain a larger incentive in the second stage. The challenge is thus for the principal to design the incentive function so as to induce truthful play. We show that this is only possible with a constant, non-reactive incentive functions when the type space is continuous, whereas it can be achieved with reactive functions for discrete types. Additionally, we show that introducing an adjustment mechanism that penalizes inconsistent behavior across both stages allows the principal to design more flexible incentive functions.
☆ Large Language Models based Multi-Agent Framework for Objective Oriented Control Design in Power Electronics
Power electronics, a critical component in modern power systems, face several challenges in control design, including model uncertainties, and lengthy and costly design cycles. This paper is aiming to propose a Large Language Models (LLMs) based multi-agent framework for objective-oriented control design in power electronics. The framework leverages the reasoning capabilities of LLMs and a multi-agent workflow to develop an efficient and autonomous controller design process. The LLM agent is able to understand and respond to high-level instructions in natural language, adapting its behavior based on the task's specific requirements and constraints from a practical implementation point of view. This novel and efficient approach promises a more flexible and adaptable controller design process in power electronics that will largely facilitate the practitioners.
comment: 6 pages, 6 figures
☆ European Satellite Benchmark for Control Education and Industrial Training
To overcome the innovation gap of the Guidance, Navigation and Control (GNC) design process between research and industrial practice a benchmark of industrial relevance has been developed and is presented. This initiative is driven as well by the necessity to train future GNC engineers and the GNC space community on a set of identified complex problems. It allows to demonstrate the relevance of state-of-the-art modeling, control and analysis algorithms for future industrial adoption. The modeling philosophy for robust control synthesis, analysis including the control architecture that enables the simulation of the mission, i.e. the acquisition of a high pointing space mission, are provided.
☆ Computationally Efficient System Level Tube-MPC for Uncertain Systems
Tube-based model predictive control (MPC) is the principal robust control technique for constrained linear systems affected by additive disturbances. While tube-based methods that compute the tubes online have been successfully applied to systems with additive disturbances, their application to systems affected by additional model uncertainties is challenging. This paper introduces a new tube-based MPC method - named filter-based system level tube-MPC (SLTMPC) - which overapproximates both uncertainties with an online optimized disturbance set, while simultaneously computing the tube controller online. Extending prior work, we generalize the method to polytopic disturbance sets and for the first time provide rigorous closed-loop guarantees for the receding horizon controller. These guarantees are obtained by virtue of a new terminal controller design and an online optimized terminal set. To reduce the computational complexity of the proposed method, we additionally introduce an asynchronous computation scheme that separates the optimization of the tube controller and the nominal trajectory. Finally, we provide a comprehensive numerical evaluation of the proposed methods to demonstrate their effectiveness.
☆ Discrete-Time Conewise Linear Systems with Finitely Many Switches
We investigate discrete-time conewise linear systems (CLS) for which all the solutions exhibit a finite number of switches. By switches, we mean transitions of a solution from one cone to another. Our interest in this class of CLS comes from the optimization-based control of an insulin infusion model for which the fact that solutions switch finitely many times appears to be key to establish the global exponential stability of the origin. The stability analysis of this class of CLS greatly simplifies compared to general CLS as all solutions eventually exhibit linear dynamics. The main challenge is to characterize CLS satisfying this finite number of switches property. We first present general conditions in terms of set intersections for this purpose. To ease the testing of these conditions, we translate them as a non-negativity test of linear forms using Farkas lemma. As a result, the problem reduces to verify the non-negativity of a single solution to an auxiliary linear discrete-time system. Interestingly, this property differs from the classical non-negativity problem, where any solution to a system must remain non-negative (component-wise) for any non-negative initial condition, and thus requires novel tools to test it. We finally illustrate the relevance of the presented results on the optimal insulin infusion problem.
☆ Frequency-Domain Data-Driven Predictive Control
In this paper, we propose a data-driven predictive control scheme based on measured frequency-domain data of the plant. This novel scheme complements the well-known data-driven predictive control (DeePC) approach based on time series data. To develop this new frequency-domain data-driven predictive control (FreePC) scheme, we introduce a novel version of Willems' fundamental lemma based on frequency-domain data. By exploiting frequency-domain data, we allow recent direct data-driven (predictive) control methodologies to benefit from the available expertise and techniques for non-parametric frequency-domain identification in academia and industry. We prove that, under appropriate conditions, the new FreePC scheme is equivalent to the corresponding DeePC scheme. The strengths of FreePC are demonstrated in a numerical case study.
☆ Tracking Real-time Anomalies in Cyber-Physical Systems Through Dynamic Behavioral Analysis
Increased connectivity and remote reprogrammability/reconfigurability features of embedded devices in current-day power systems (including interconnections between information technology -- IT -- and operational technology -- OT -- networks) enable greater agility, reduced operator workload, and enhanced power system performance and capabilities. However, these features also expose a wider cyber-attack surface, underscoring need for robust real-time monitoring and anomaly detection in power systems, and more generally in Cyber-Physical Systems (CPS). The increasingly complex, diverse, and potentially untrustworthy software and hardware supply chains also make need for robust security tools more stringent. We propose a novel framework for real-time monitoring and anomaly detection in CPS, specifically smart grid substations and SCADA systems. The proposed method enables real-time signal temporal logic condition-based anomaly monitoring by processing raw captured packets from the communication network through a hierarchical semantic extraction and tag processing pipeline into time series of semantic events and observations, that are then evaluated against expected temporal properties to detect and localize anomalies. We demonstrate efficacy of our methodology on a hardware in the loop testbed, including multiple physical power equipment (real-time automation controllers and relays) and simulated devices (Phasor Measurement Units -- PMUs, relays, Phasor Data Concentrators -- PDCs), interfaced to a dynamic power system simulator. The performance and accuracy of the proposed system is evaluated on multiple attack scenarios on our testbed.
comment: 10 pages, 11 figures, 4 tables
☆ Adaptive Safe Reinforcement Learning-Enabled Optimization of Battery Fast-Charging Protocols
Optimizing charging protocols is critical for reducing battery charging time and decelerating battery degradation in applications such as electric vehicles. Recently, reinforcement learning (RL) methods have been adopted for such purposes. However, RL-based methods may not ensure system (safety) constraints, which can cause irreversible damages to batteries and reduce their lifetime. To this end, this work proposes an adaptive and safe RL framework to optimize fast charging strategies while respecting safety constraints with a high probability. In our method, any unsafe action that the RL agent decides will be projected into a safety region by solving a constrained optimization problem. The safety region is constructed using adaptive Gaussian process (GP) models, consisting of static and dynamic GPs, that learn from online experience to adaptively account for any changes in battery dynamics. Simulation results show that our method can charge the batteries rapidly with constraint satisfaction under varying operating conditions.
☆ Exploiting and Securing ML Solutions in Near-RT RIC: A Perspective of an xApp
Open Radio Access Networks (O-RAN) are emerging as a disruptive technology, revolutionising traditional mobile network architecture and deployments in the current 5G and the upcoming 6G era. Disaggregation of network architecture, inherent support for AI/ML workflows, cloud-native principles, scalability, and interoperability make O-RAN attractive to network providers for beyond-5G and 6G deployments. Notably, the ability to deploy custom applications, including Machine Learning (ML) solutions as xApps or rApps on the RAN Intelligent Controllers (RICs), has immense potential for network function and resource optimisation. However, the openness, nascent standards, and distributed architecture of O-RAN and RICs introduce numerous vulnerabilities exploitable through multiple attack vectors, which have not yet been fully explored. To address this gap and ensure robust systems before large-scale deployments, this work analyses the security of ML-based applications deployed on the RIC platform. We focus on potential attacks, defence mechanisms, and pave the way for future research towards a more robust RIC platform.
☆ ABNet: Attention BarrierNet for Safe and Scalable Robot Learning
Safe learning is central to AI-enabled robots where a single failure may lead to catastrophic results. Barrier-based method is one of the dominant approaches for safe robot learning. However, this method is not scalable, hard to train, and tends to generate unstable signals under noisy inputs that are challenging to be deployed for robots. To address these challenges, we propose a novel Attention BarrierNet (ABNet) that is scalable to build larger foundational safe models in an incremental manner. Each head of BarrierNet in the ABNet could learn safe robot control policies from different features and focus on specific part of the observation. In this way, we do not need to one-shotly construct a large model for complex tasks, which significantly facilitates the training of the model while ensuring its stable output. Most importantly, we can still formally prove the safety guarantees of the ABNet. We demonstrate the strength of ABNet in 2D robot obstacle avoidance, safe robot manipulation, and vision-based end-to-end autonomous driving, with results showing much better robustness and guarantees over existing models.
comment: 18 pages
☆ A Flexible Cryptographic Infrastructure for High-security SDR-based Systems
Military software defined radio (SDR) systems are a major factor in future network-centric operations due to their flexibility and support for more capable radio communications systems. The inherent nature of software-based systems requires a more complex auxiliary infrastructure and multiple independent levels of security compared with typical systems: Secure booting of the SDR device, cryptographically signed software, real time operating platform software as well as radio applications. This technology raises new challenges with respect to the management. The largest impact on SDR deployments is due to the auxiliary cryptographic infrastructure for the security of the software life cycle and the cyclic update of the keys. Compared to conventional radio devices, the SDR system with the cryptographic infrastructure described in this paper reaches a higher security level and is more flexible. The advantage is the possibility to deploy trunked radio system and further waveforms, such as coalition wideband, which will be standardized in the future. Also it is possible to update cryptographic mechanisms. In this work, we analyze the requirements for a high secure SDR deployment and model the life cycle of the components of a deployed SDR node based on the Joint Program Executive Office (JPEO) Software Communication Architecture (SCA).
♻ ☆ Deep Reinforcement Learning-based Quadcopter Controller: A Practical Approach and Experiments
Quadcopters have been studied for decades thanks to their maneuverability and capability of operating in a variety of circumstances. However, quadcopters suffer from dynamical nonlinearity, actuator saturation, as well as sensor noise that make it challenging and time consuming to obtain accurate dynamic models and achieve satisfactory control performance. Fortunately, deep reinforcement learning came and has shown significant potential in system modelling and control of autonomous multirotor aerial vehicles, with recent advancements in deployment, performance enhancement, and generalization. In this paper, an end-to-end deep reinforcement learning-based controller for quadcopters is proposed that is secure for real-world implementation, data-efficient, and free of human gain adjustments. First, a novel actor-critic-based architecture is designed to map the robot states directly to the motor outputs. Then, a quadcopter dynamics-based simulator was devised to facilitate the training of the controller policy. Finally, the trained policy is deployed on a real Crazyflie nano quadrotor platform, without any additional fine-tuning process. Experimental results show that the quadcopter exhibits satisfactory performance as it tracks a given complicated trajectory, which demonstrates the effectiveness and feasibility of the proposed method and signifies its capability in filling the simulation-to-reality gap.
comment: 6 pages, 5 figures, 3 tables
♻ ☆ A Comparison Benchmark for Distributed Hybrid MPC Control Methods: Distributed Vehicle Platooning
Distributed model predictive control (MPC) is currently being investigated as a solution to the important control challenge presented by networks of hybrid dynamical systems. However, a benchmark problem for distributed hybrid MPC is absent from the literature. We propose distributed control of a platoon of autonomous vehicles as a comparison benchmark problem. The problem provides a complex and adaptable case study, upon which existing and future approaches to distributed MPC for hybrid systems can be evaluated. Two hybrid modeling frameworks are presented for the vehicle dynamics. Five hybrid MPC controllers are then evaluated and extensively assessed on the fleet of vehicles. Finally, we comment on the need for new efficient and high performing distributed MPC schemes for hybrid systems.
comment: 14 pages, 11 figures, submitted to IEEE Transactions on Control Systems Technology, code can be found at https://github.com/SamuelMallick/hybrid-vehicle-platoon/tree/paper-2023
♻ ☆ A Unified Non-Strict Finsler Lemma
In this paper, we present a unified general non-strict Finsler lemma. This result is general in the sense that it does not impose any restrictions on the involved matrices and, thereby, it encompasses all existing non-strict versions of Finsler's lemma that do impose such restrictions. To further illustrate its usefulness, we showcase applications of the non-strict Finsler's lemma in deriving a structured solution to a special case of the non-strict projection lemma, and we use the unified non-strict Finsler's lemma to prove a more general version of the matrix Finsler's lemma.
♻ ☆ Asymptotic Analysis of Near-Field Coupling in Massive MISO and Massive SIMO Systems
This paper studies the receiver to transmitter antenna coupling in near-field communications with massive arrays. Although most works in the literature consider that it is negligible and approximate it by zero, there is no rigorous analysis on its relevance for practical systems. In this work, we leverage multiport communication theory to obtain conditions for the aforementioned approximation to be valid in MISO and SIMO systems. These conditions are then particularized for arrays with fixed inter-element spacing and arrays with fixed size.
comment: Accepted version of the article published in IEEE Communications Letters, 2024. DOI: 10.1109/LCOMM.2024.3416044
♻ ☆ Online Parameter Estimation for Continuously Monitored Quantum Systems
In this work, we consider the problem of online (real-time, single-shot) estimation of static or slow-varying parameters along quantum trajectories in quantum dynamical systems. Based on the measurement signal of a continuously-monitored quantum system, we propose a recursive algorithm for computing the maximum likelihood estimate of unknown parameters using an approach based on stochastic gradient ascent on the log-likelihood function. We formulate the algorithm in both discrete-time and continuous-time and illustrate the performance of the algorithm through simulations of a simple two-level system undergoing homodyne measurement from which we are able to track multiple parameters simultaneously.
♻ ☆ Online Feedback Droop Scheduling in Distribution Grids for Frequency and Local Voltage Control
This paper presents a novel framework for collective control of Distributed Energy Resources (DERs) in active Distribution Networks (DNs). The proposed approach unifies the commonly employed local (i.e., decentralized) voltage and frequency droop control schemes into a transfer matrix relating frequency and voltage magnitude measurements to active and reactive power injection adjustments. Furthermore, the transfer matrices of individual DER units are adaptively tuned in real-time via slow communication links using a novel online gain scheduling approach to enable primary frequency support provision to the transmission system and ensure that the DN voltages are kept within the allowable limits. A global asymptomatic stability condition of the analyzed droop-controlled DN is analytically established. The considered gain scheduling problem is solved by leveraging an online primal-dual gradient-based method and a suitable linearized power flow model. Additional ancillary service providers can be trivially incorporated into the proposed framework in a plug-and-play fashion. Numerical simulations of the 37-bus IEEE test system and a realistic Swedish 533-bus DN confirm the validity and the scalability of the approach and demonstrate numerous advantages of the proposed scheme over the state-of-the-art.
♻ ☆ A nonparametric learning framework for nonlinear robust output regulation
A nonparametric learning solution framework is proposed for the global nonlinear robust output regulation problem. We first extend the assumption that the steady-state generator is linear in the exogenous signal to the more relaxed assumption that it is polynomial in the exogenous signal. Additionally, a nonparametric learning framework is proposed to eliminate the construction of an explicit regressor, as required in the adaptive method, which can potentially simplify the implementation and reduce the computational complexity of existing methods. With the help of the proposed framework, the robust nonlinear output regulation problem can be converted into a robust non-adaptive stabilization problem for the augmented system with integral input-to-state stable (iISS) inverse dynamics. Moreover, a dynamic gain approach can adaptively raise the gain to a sufficiently large constant to achieve stabilization without requiring any a priori knowledge of the uncertainties appearing in the dynamics of the exosystem and the system. Furthermore, we apply the nonparametric learning framework to globally reconstruct and estimate multiple sinusoidal signals with unknown frequencies without the need for adaptive parametric techniques. An explicit nonlinear mapping can directly provide the estimated parameters, which will exponentially converge to the unknown frequencies. Finally, a feedforward control design is proposed to solve the linear output regulation problem using the nonparametric learning framework. Two simulation examples are provided to illustrate the effectiveness of the theoretical results.
comment: 17 pages; Nonlinear control; iISS stability; output regulation; parameter estimation; Non-adaptive control
♻ ☆ GridSweep Simulation: Measuring Subsynchronous Impedance Spectra of Distribution Feeder
Peaks and troughs in the subsynchronous impedance spectrum of a distribution feeder may be a useful indication of oscillation risk, or more importantly lack of oscillation risk, if inverter-based resource (IBR) deployments are increased on that feeder. GridSweep is a new instrument for measuring the subsynchronous impedance spectra of distribution feeders. It combines an active probing device that modulates a 120-volt 1-kW load sinusoidally at a user-selected GPS-phase locked frequency from 1.0 to 40.0 Hz, and with a recorder that takes ultra-high-precision continuous point-on-wave (CPOW) 120-volt synchrowaveforms at 4 kHz. This paper presents a computer simulation of GridSweep's probing and measurement capability. We construct an electromagnetic transient (EMT) simulation of a single-phase distribution feeder equipped with multiple inverter-based resources (IBRs). We include a model of the GridSweep probing device, then demonstrate the model's capability to measure the subsynchronous apparent impedance spectrum of the feeder. Peaks in that spectrum align with the system's dominant oscillation modes caused by IBRs.
comment: 10 pages, 18 figures
Multiagent Systems 8
☆ On the Convergence of Tâtonnement for Linear Fisher Markets
T\^atonnement is a simple, intuitive market process where prices are iteratively adjusted based on the difference between demand and supply. Many variants under different market assumptions have been studied and shown to converge to a market equilibrium, in some cases at a fast rate. However, the classical case of linear Fisher markets have long eluded the analyses, and it remains unclear whether t\^atonnement converges in this case. We show that, for a sufficiently small step size, the prices given by the t\^atonnement process are guaranteed to converge to equilibrium prices, up to a small approximation radius that depends on the stepsize. To achieve this, we consider the dual Eisenberg-Gale convex program in the price space, view t\^atonnement as subgradient descent on this convex program, and utilize novel last-iterate convergence results for subgradient descent under error bound conditions. In doing so, we show that the convex program satisfies a particular error bound condition, the quadratic growth condition, and that the price sequence generated by t\^atonnement is bounded above and away from zero. We also show that a similar convergence result holds for t\^atonnement in quasi-linear Fisher markets. Numerical experiments are conducted to demonstrate that the theoretical linear convergence aligns with empirical observations.
comment: 31 pages, 16 figures
☆ A Step Towards a Universal Method for Modeling and Implementing Cross-Organizational Business Processes
The widely adopted Business Process Model and Notation (BPMN) is a cornerstone of industry standards for business process modeling. However, its ambiguous execution semantics often result in inconsistent interpretations, depending on the software used for implementation. In response, the Process Specification Language (PASS) provides formally defined semantics to overcome these interpretational challenges. Despite its clear advantages, PASS has not reached the same level of industry penetration as BPMN. This feasibility study proposes using PASS as an intermediary framework to translate and execute BPMN models. It describes the development of a prototype translator that converts specific BPMN elements into a format compatible with PASS. These models are then transformed into source code and executed in a bespoke workflow environment, marking a departure from traditional BPMN implementations. Our findings suggest that integrating PASS enhances compatibility across different modeling and execution tools and offers a more robust methodology for implementing business processes across organizations. This study lays the groundwork for more accurate and unified business process model executions, potentially transforming industry standards for process modeling and execution.
☆ On the Principles behind Opinion Dynamics in Multi-Agent Systems of Large Language Models
We study the evolution of opinions inside a population of interacting large language models (LLMs). Every LLM needs to decide how much funding to allocate to an item with three initial possibilities: full, partial, or no funding. We identify biases that drive the exchange of opinions based on the LLM's tendency to (i) find consensus with the other LLM's opinion, (ii) display caution when specifying funding, and (iii) consider ethical concerns in its opinion. We find these biases are affected by the perceived absence of compelling reasons for opinion change, the perceived willingness to engage in discussion, and the distribution of allocation values. Moreover, tensions among biases can lead to the survival of funding for items with negative connotations. We also find that the final distribution of full, partial, and no funding opinions is more diverse when an LLM freely forms its opinion after an interaction than when its opinion is a multiple-choice selection among the three allocation options. In the latter case, consensus or polarization is generally attained. When agents are aware of past opinions, they seek to maintain consistency with them, and more diverse updating rules emerge. Our study is performed using a Llama 3 LLM.
♻ ☆ $α$-Rank-Collections: Analyzing Expected Strategic Behavior with Uncertain Utilities
Game theory relies heavily on the availability of cardinal utility functions, but in fields such as matching markets, only ordinal preferences are typically elicited. The literature focuses on mechanisms with simple dominant strategies, but many real-world applications lack dominant strategies, making the intensity of preferences between outcomes important for determining strategies. Even though precise information about cardinal utilities is not available, some data about the likelihood of utility functions is often accessible. We propose to use Bayesian games to formalize uncertainty about the decision-makers' utilities by viewing them as a collection of normal-form games. Instead of searching for the Bayes-Nash equilibrium, we study how uncertainty in utilities is reflected in uncertainty of strategic play. To do this, we introduce a novel solution concept called $\alpha$-Rank-collections, which extends $\alpha$-Rank to Bayesian games. This allows us to analyze strategic play in, for example, non-strategyproof matching markets, for which appropriate solution concepts are currently lacking. $\alpha$-Rank-collections characterize the expected probability of encountering a certain strategy profile under replicator dynamics in the long run, rather than predicting a specific equilibrium strategy profile. We experimentally evaluate $\alpha$-Rank-collections using instances of the Boston mechanism, finding that our solution concept provides more nuanced predictions compared to Bayes-Nash equilibria. Additionally, we prove that $\alpha$-Rank-collections are invariant to positive affine transformations, a standard property for a solution concept, and are efficient to approximate.
comment: Accepted at the 25th ACM Conference on Economics and Computation (EC), 2024
♻ ☆ Data Poisoning to Fake a Nash Equilibrium in Markov Games
We characterize offline data poisoning attacks on Multi-Agent Reinforcement Learning (MARL), where an attacker may change a data set in an attempt to install a (potentially fictitious) unique Markov-perfect Nash equilibrium for a two-player zero-sum Markov game. We propose the unique Nash set, namely the set of games, specified by their Q functions, with a specific joint policy being the unique Nash equilibrium. The unique Nash set is central to poisoning attacks because the attack is successful if and only if data poisoning pushes all plausible games inside the set. The unique Nash set generalizes the reward polytope commonly used in inverse reinforcement learning to MARL. For zero-sum Markov games, both the inverse Nash set and the set of plausible games induced by data are polytopes in the Q function space. We exhibit a linear program to efficiently compute the optimal poisoning attack. Our work sheds light on the structure of data poisoning attacks on offline MARL, a necessary step before one can design more robust MARL algorithms.
♻ ☆ CoSLight: Co-optimizing Collaborator Selection and Decision-making to Enhance Traffic Signal Control KDD 2024
Effective multi-intersection collaboration is pivotal for reinforcement-learning-based traffic signal control to alleviate congestion. Existing work mainly chooses neighboring intersections as collaborators. However, quite an amount of congestion, even some wide-range congestion, is caused by non-neighbors failing to collaborate. To address these issues, we propose to separate the collaborator selection as a second policy to be learned, concurrently being updated with the original signal-controlling policy. Specifically, the selection policy in real-time adaptively selects the best teammates according to phase- and intersection-level features. Empirical results on both synthetic and real-world datasets provide robust validation for the superiority of our approach, offering significant improvements over existing state-of-the-art methods. The code is available at https://github.com/bonaldli/CoSLight.
comment: Accepted by KDD 2024
♻ ☆ Multi-Modal and Multi-Agent Systems Meet Rationality: A Survey
Rationality is the quality of being guided by reason, characterized by logical thinking and decision-making that align with evidence and logical rules. This quality is essential for effective problem-solving, as it ensures that solutions are well-founded and systematically derived. Despite the advancements of large language models (LLMs) in generating human-like text with remarkable accuracy, they present biases inherited from the training data, inconsistency across different contexts, and difficulty understanding complex scenarios involving multiple layers of context. Therefore, recent research attempts to leverage the strength of multiple agents working collaboratively with various types of data and tools for enhanced consistency and reliability. To that end, this paper aims to understand whether multi-modal and multi-agent systems are advancing toward rationality by surveying the state-of-the-art works, identifying advancements over single-agent and single-modal systems in terms of rationality, and discussing open problems and future directions. We maintain an open repository at https://github.com/bowen-upenn/MMMA_Rationality.
♻ ☆ City-LEO: Toward Transparent City Management Using LLM with End-to-End Optimization
Existing operations research (OR) models and tools play indispensable roles in smart-city operations, yet their practical implementation is limited by the complexity of modeling and deficiencies in optimization proficiency. To generate more relevant and accurate solutions to users' requirements, we propose a large language model (LLM)-based agent ("City-LEO") that enhances the efficiency and transparency of city management through conversational interactions. Specifically, to accommodate diverse users' requirements and enhance computational tractability, City-LEO leverages LLM's logical reasoning capabilities on prior knowledge to scope down large-scale optimization problems efficiently. In the human-like decision process, City-LEO also incorporates End-to-end (E2E) model to synergize the prediction and optimization. The E2E framework be conducive to coping with environmental uncertainties and involving more query-relevant features, and then facilitates transparent and interpretable decision-making process. In case study, we employ City-LEO in the operations management of e-bike sharing (EBS) system. The numerical results demonstrate that City-LEO has superior performance when benchmarks against the full-scale optimization problem. With less computational time, City-LEO generates more satisfactory and relevant solutions to the users' requirements, and achieves lower global suboptimality without significantly compromising accuracy. In a broader sense, our proposed agent offers promise to develop LLM-embedded OR tools for smart-city operations management.
comment: 26 pages, 8 figures, 5 tables
Artificial Intelligence 150
☆ Synergizing Foundation Models and Federated Learning: A Survey
The recent development of Foundation Models (FMs), represented by large language models, vision transformers, and multimodal models, has been making a significant impact on both academia and industry. Compared with small-scale models, FMs have a much stronger demand for high-volume data during the pre-training phase. Although general FMs can be pre-trained on data collected from open sources such as the Internet, domain-specific FMs need proprietary data, posing a practical challenge regarding the amount of data available due to privacy concerns. Federated Learning (FL) is a collaborative learning paradigm that breaks the barrier of data availability from different participants. Therefore, it provides a promising solution to customize and adapt FMs to a wide range of domain-specific tasks using distributed datasets whilst preserving privacy. This survey paper discusses the potentials and challenges of synergizing FL and FMs and summarizes core techniques, future directions, and applications. A periodically updated paper collection on FM-FL is available at https://github.com/lishenghui/awesome-fm-fl.
☆ Can Go AIs be adversarially robust?
Prior work found that superhuman Go AIs like KataGo can be defeated by simple adversarial strategies. In this paper, we study if simple defenses can improve KataGo's worst-case performance. We test three natural defenses: adversarial training on hand-constructed positions, iterated adversarial training, and changing the network architecture. We find that some of these defenses are able to protect against previously discovered attacks. Unfortunately, we also find that none of these defenses are able to withstand adaptive attacks. In particular, we are able to train new adversaries that reliably defeat our defended agents by causing them to blunder in ways humans would not. Our results suggest that building robust AI systems is challenging even in narrow domains such as Go. For interactive examples of attacks and a link to our codebase, see https://goattack.far.ai.
comment: 67 pages
☆ Demystifying Higher-Order Graph Neural Networks
Higher-order graph neural networks (HOGNNs) are an important class of GNN models that harness polyadic relations between vertices beyond plain edges. They have been used to eliminate issues such as over-smoothing or over-squashing, to significantly enhance the accuracy of GNN predictions, to improve the expressiveness of GNN architectures, and for numerous other goals. A plethora of HOGNN models have been introduced, and they come with diverse neural architectures, and even with different notions of what the "higher-order" means. This richness makes it very challenging to appropriately analyze and compare HOGNN models, and to decide in what scenario to use specific ones. To alleviate this, we first design an in-depth taxonomy and a blueprint for HOGNNs. This facilitates designing models that maximize performance. Then, we use our taxonomy to analyze and compare the available HOGNN models. The outcomes of our analysis are synthesized in a set of insights that help to select the most beneficial GNN model in a given scenario, and a comprehensive list of challenges and opportunities for further research into more powerful HOGNNs.
☆ Influence Maximization via Graph Neural Bandits KDD
We consider a ubiquitous scenario in the study of Influence Maximization (IM), in which there is limited knowledge about the topology of the diffusion network. We set the IM problem in a multi-round diffusion campaign, aiming to maximize the number of distinct users that are influenced. Leveraging the capability of bandit algorithms to effectively balance the objectives of exploration and exploitation, as well as the expressivity of neural networks, our study explores the application of neural bandit algorithms to the IM problem. We propose the framework IM-GNB (Influence Maximization with Graph Neural Bandits), where we provide an estimate of the users' probabilities of being influenced by influencers (also known as diffusion seeds). This initial estimate forms the basis for constructing both an exploitation graph and an exploration one. Subsequently, IM-GNB handles the exploration-exploitation tradeoff, by selecting seed nodes in real-time using Graph Convolutional Networks (GCN), in which the pre-estimated graphs are employed to refine the influencers' estimated rewards in each contextual setting. Through extensive experiments on two large real-world datasets, we demonstrate the effectiveness of IM-GNB compared with other baseline methods, significantly improving the spread outcome of such diffusion campaigns, when the underlying network is unknown.
comment: To appear at the 2024 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD)
☆ LaMDA: Large Model Fine-Tuning via Spectrally Decomposed Low-Dimensional Adaptation
Low-rank adaptation (LoRA) has become the default approach to fine-tune large language models (LLMs) due to its significant reduction in trainable parameters. However, trainable parameter demand for LoRA increases with increasing model embedding dimensions, leading to high compute costs. Additionally, its backward updates require storing high-dimensional intermediate activations and optimizer states, demanding high peak GPU memory. In this paper, we introduce large model fine-tuning via spectrally decomposed low-dimensional adaptation (LaMDA), a novel approach to fine-tuning large language models, which leverages low-dimensional adaptation to achieve significant reductions in trainable parameters and peak GPU memory footprint. LaMDA freezes a first projection matrix (PMA) in the adaptation path while introducing a low-dimensional trainable square matrix, resulting in substantial reductions in trainable parameters and peak GPU memory usage. LaMDA gradually freezes a second projection matrix (PMB) during the early fine-tuning stages, reducing the compute cost associated with weight updates to enhance parameter efficiency further. We also present an enhancement, LaMDA++, incorporating a ``lite-weight" adaptive rank allocation for the LoRA path via normalized spectrum analysis of pre-trained model weights. We evaluate LaMDA/LaMDA++ across various tasks, including natural language understanding with the GLUE benchmark, text summarization, natural language generation, and complex reasoning on different LLMs. Results show that LaMDA matches or surpasses the performance of existing alternatives while requiring up to 17.7x fewer parameter updates and up to 1.32x lower peak GPU memory usage during fine-tuning. Code will be publicly available.
☆ VIA: A Spatiotemporal Video Adaptation Framework for Global and Local Video Editing
Video editing stands as a cornerstone of digital media, from entertainment and education to professional communication. However, previous methods often overlook the necessity of comprehensively understanding both global and local contexts, leading to inaccurate and inconsistency edits in the spatiotemporal dimension, especially for long videos. In this paper, we introduce VIA, a unified spatiotemporal VIdeo Adaptation framework for global and local video editing, pushing the limits of consistently editing minute-long videos. First, to ensure local consistency within individual frames, the foundation of VIA is a novel test-time editing adaptation method, which adapts a pre-trained image editing model for improving consistency between potential editing directions and the text instruction, and adapts masked latent variables for precise local control. Furthermore, to maintain global consistency over the video sequence, we introduce spatiotemporal adaptation that adapts consistent attention variables in key frames and strategically applies them across the whole sequence to realize the editing effects. Extensive experiments demonstrate that, compared to baseline methods, our VIA approach produces edits that are more faithful to the source videos, more coherent in the spatiotemporal context, and more precise in local control. More importantly, we show that VIA can achieve consistent long video editing in minutes, unlocking the potentials for advanced video editing tasks over long video sequences.
comment: 13 pages, 11 figures
☆ From RAGs to rich parameters: Probing how language models utilize external knowledge over parametric information for factual queries
Retrieval Augmented Generation (RAG) enriches the ability of language models to reason using external context to augment responses for a given user prompt. This approach has risen in popularity due to practical applications in various applications of language models in search, question/answering, and chat-bots. However, the exact nature of how this approach works isn't clearly understood. In this paper, we mechanistically examine the RAG pipeline to highlight that language models take shortcut and have a strong bias towards utilizing only the context information to answer the question, while relying minimally on their parametric memory. We probe this mechanistic behavior in language models with: (i) Causal Mediation Analysis to show that the parametric memory is minimally utilized when answering a question and (ii) Attention Contributions and Knockouts to show that the last token residual stream do not get enriched from the subject token in the question, but gets enriched from other informative tokens in the context. We find this pronounced shortcut behaviour true across both LLaMa and Phi family of models.
☆ Is It Good Data for Multilingual Instruction Tuning or Just Bad Multilingual Evaluation for Large Language Models?
Large language models, particularly multilingual ones, are designed, claimed, and expected to cater to native speakers of varied languages. We hypothesise that the current practices of fine-tuning and evaluating these models may mismatch this intention owing to a heavy reliance on translation, which can introduce translation artefacts and defects. It remains unknown whether the nature of the instruction data has an impact on the model output; on the other hand, it remains questionable whether translated test sets can capture such nuances. Due to the often coupled practices of using translated data in both stages, such imperfections could have been overlooked. This work investigates these issues by using controlled native or translated data during instruction tuning and evaluation stages and observing model results. Experiments on eight base models and eight different benchmarks reveal that native or generation benchmarks display a notable difference between native and translated instruction data especially when model performance is high, whereas other types of test sets cannot. Finally, we demonstrate that regularization is beneficial to bridging this gap on structured but not generative tasks.
☆ Privacy Preserving Federated Learning in Medical Imaging with Uncertainty Estimation
Machine learning (ML) and Artificial Intelligence (AI) have fueled remarkable advancements, particularly in healthcare. Within medical imaging, ML models hold the promise of improving disease diagnoses, treatment planning, and post-treatment monitoring. Various computer vision tasks like image classification, object detection, and image segmentation are poised to become routine in clinical analysis. However, privacy concerns surrounding patient data hinder the assembly of large training datasets needed for developing and training accurate, robust, and generalizable models. Federated Learning (FL) emerges as a compelling solution, enabling organizations to collaborate on ML model training by sharing model training information (gradients) rather than data (e.g., medical images). FL's distributed learning framework facilitates inter-institutional collaboration while preserving patient privacy. However, FL, while robust in privacy preservation, faces several challenges. Sensitive information can still be gleaned from shared gradients that are passed on between organizations during model training. Additionally, in medical imaging, quantifying model confidence\uncertainty accurately is crucial due to the noise and artifacts present in the data. Uncertainty estimation in FL encounters unique hurdles due to data heterogeneity across organizations. This paper offers a comprehensive review of FL, privacy preservation, and uncertainty estimation, with a focus on medical imaging. Alongside a survey of current research, we identify gaps in the field and suggest future directions for FL research to enhance privacy and address noisy medical imaging data challenges.
comment: 31 pages, 5 figures, 3 tables, Journal preprint
☆ Graph Neural Networks in Histopathology: Emerging Trends and Future Directions
Histopathological analysis of Whole Slide Images (WSIs) has seen a surge in the utilization of deep learning methods, particularly Convolutional Neural Networks (CNNs). However, CNNs often fall short in capturing the intricate spatial dependencies inherent in WSIs. Graph Neural Networks (GNNs) present a promising alternative, adept at directly modeling pairwise interactions and effectively discerning the topological tissue and cellular structures within WSIs. Recognizing the pressing need for deep learning techniques that harness the topological structure of WSIs, the application of GNNs in histopathology has experienced rapid growth. In this comprehensive review, we survey GNNs in histopathology, discuss their applications, and exploring emerging trends that pave the way for future advancements in the field. We begin by elucidating the fundamentals of GNNs and their potential applications in histopathology. Leveraging quantitative literature analysis, we identify four emerging trends: Hierarchical GNNs, Adaptive Graph Structure Learning, Multimodal GNNs, and Higher-order GNNs. Through an in-depth exploration of these trends, we offer insights into the evolving landscape of GNNs in histopathological analysis. Based on our findings, we propose future directions to propel the field forward. Our analysis serves to guide researchers and practitioners towards innovative approaches and methodologies, fostering advancements in histopathological analysis through the lens of graph neural networks.
☆ Probabilistic Temporal Prediction of Continuous Disease Trajectories and Treatment Effects Using Neural SDEs
Personalized medicine based on medical images, including predicting future individualized clinical disease progression and treatment response, would have an enormous impact on healthcare and drug development, particularly for diseases (e.g. multiple sclerosis (MS)) with long term, complex, heterogeneous evolutions and no cure. In this work, we present the first stochastic causal temporal framework to model the continuous temporal evolution of disease progression via Neural Stochastic Differential Equations (NSDE). The proposed causal inference model takes as input the patient's high dimensional images (MRI) and tabular data, and predicts both factual and counterfactual progression trajectories on different treatments in latent space. The NSDE permits the estimation of high-confidence personalized trajectories and treatment effects. Extensive experiments were performed on a large, multi-centre, proprietary dataset of patient 3D MRI and clinical data acquired during several randomized clinical trials for MS treatments. Our results present the first successful uncertainty-based causal Deep Learning (DL) model to: (a) accurately predict future patient MS disability evolution (e.g. EDSS) and treatment effects leveraging baseline MRI, and (b) permit the discovery of subgroups of patients for which the model has high confidence in their response to treatment even in clinical trials which did not reach their clinical endpoints.
☆ Identifying Performance-Sensitive Configurations in Software Systems through Code Analysis with LLM Agents
Configuration settings are essential for tailoring software behavior to meet specific performance requirements. However, incorrect configurations are widespread, and identifying those that impact system performance is challenging due to the vast number and complexity of possible settings. In this work, we present PerfSense, a lightweight framework that leverages Large Language Models (LLMs) to efficiently identify performance-sensitive configurations with minimal overhead. PerfSense employs LLM agents to simulate interactions between developers and performance engineers using advanced prompting techniques such as prompt chaining and retrieval-augmented generation (RAG). Our evaluation of seven open-source Java systems demonstrates that PerfSense achieves an average accuracy of 64.77% in classifying performance-sensitive configurations, outperforming both our LLM baseline (50.36%) and the previous state-of-the-art method (61.75%). Notably, our prompt chaining technique improves recall by 10% to 30% while maintaining similar precision levels. Additionally, a manual analysis of 362 misclassifications reveals common issues, including LLMs' misunderstandings of requirements (26.8%). In summary, PerfSense significantly reduces manual effort in classifying performance-sensitive configurations and offers valuable insights for future LLM-based code analysis research.
☆ Formatics & dairy industry coalition: AI trends and present challenges
Artificial Intelligence (AI) can potentially transform the industry, enhancing the production process and minimizing manual, repetitive tasks. Accordingly, the synergy between high-performance computing and powerful mathematical models enables the application of sophisticated data analysis procedures like Machine Learning. However, challenges exist regarding effective, efficient, and flexible processing to generate valuable knowledge. Consequently, this work comprehensively describes industrial challenges where AI can be exploited, focusing on the dairy industry. The conclusions presented can help researchers apply novel approaches for cattle monitoring and farmers by proposing advanced technological solutions to their needs.
☆ Latent Intuitive Physics: Learning to Transfer Hidden Physics from A 3D Video ICLR 2024
We introduce latent intuitive physics, a transfer learning framework for physics simulation that can infer hidden properties of fluids from a single 3D video and simulate the observed fluid in novel scenes. Our key insight is to use latent features drawn from a learnable prior distribution conditioned on the underlying particle states to capture the invisible and complex physical properties. To achieve this, we train a parametrized prior learner given visual observations to approximate the visual posterior of inverse graphics, and both the particle states and the visual posterior are obtained from a learned neural renderer. The converged prior learner is embedded in our probabilistic physics engine, allowing us to perform novel simulations on unseen geometries, boundaries, and dynamics without knowledge of the true physical parameters. We validate our model in three ways: (i) novel scene simulation with the learned visual-world physics, (ii) future prediction of the observed fluid dynamics, and (iii) supervised particle simulation. Our model demonstrates strong performance in all three tasks.
comment: Published as a conference paper at ICLR 2024
☆ Unsupervised explainable activity prediction in competitive Nordic Walking from experimental data
Artificial Intelligence (AI) has found application in Human Activity Recognition (HAR) in competitive sports. To date, most Machine Learning (ML) approaches for HAR have relied on offline (batch) training, imposing higher computational and tagging burdens compared to online processing unsupervised approaches. Additionally, the decisions behind traditional ML predictors are opaque and require human interpretation. In this work, we apply an online processing unsupervised clustering approach based on low-cost wearable Inertial Measurement Units (IMUs). The outcomes generated by the system allow for the automatic expansion of limited tagging available (e.g., by referees) within those clusters, producing pertinent information for the explainable classification stage. Specifically, our work focuses on achieving automatic explainability for predictions related to athletes' activities, distinguishing between correct, incorrect, and cheating practices in Nordic Walking. The proposed solution achieved performance metrics of close to 100 % on average.
☆ Chumor 1.0: A Truly Funny and Challenging Chinese Humor Understanding Dataset from Ruo Zhi Ba
Existing humor datasets and evaluations predominantly focus on English, lacking resources for culturally nuanced humor in non-English languages like Chinese. To address this gap, we construct Chumor, a dataset sourced from Ruo Zhi Ba (RZB), a Chinese Reddit-like platform dedicated to sharing intellectually challenging and culturally specific jokes. We annotate explanations for each joke and evaluate human explanations against two state-of-the-art LLMs, GPT-4o and ERNIE Bot, through A/B testing by native Chinese speakers. Our evaluation shows that Chumor is challenging even for SOTA LLMs, and the human explanations for Chumor jokes are significantly better than explanations generated by the LLMs.
☆ OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI
The evolution of Artificial Intelligence (AI) has been significantly accelerated by advancements in Large Language Models (LLMs) and Large Multimodal Models (LMMs), gradually showcasing potential cognitive reasoning abilities in problem-solving and scientific discovery (i.e., AI4Science) once exclusive to human intellect. To comprehensively evaluate current models' performance in cognitive reasoning abilities, we introduce OlympicArena, which includes 11,163 bilingual problems across both text-only and interleaved text-image modalities. These challenges encompass a wide range of disciplines spanning seven fields and 62 international Olympic competitions, rigorously examined for data leakage. We argue that the challenges in Olympic competition problems are ideal for evaluating AI's cognitive reasoning due to their complexity and interdisciplinary nature, which are essential for tackling complex scientific challenges and facilitating discoveries. Beyond evaluating performance across various disciplines using answer-only criteria, we conduct detailed experiments and analyses from multiple perspectives. We delve into the models' cognitive reasoning abilities, their performance across different modalities, and their outcomes in process-level evaluations, which are vital for tasks requiring complex reasoning with lengthy solutions. Our extensive evaluations reveal that even advanced models like GPT-4o only achieve a 39.97% overall accuracy, illustrating current AI limitations in complex reasoning and multimodal integration. Through the OlympicArena, we aim to advance AI towards superintelligence, equipping it to address more complex challenges in science and beyond. We also provide a comprehensive set of resources to support AI research, including a benchmark dataset, an open-source annotation platform, a detailed evaluation tool, and a leaderboard with automatic submission features.
comment: 44 pages
☆ TSI-Bench: Benchmarking Time Series Imputation
Effective imputation is a crucial preprocessing step for time series analysis. Despite the development of numerous deep learning algorithms for time series imputation, the community lacks standardized and comprehensive benchmark platforms to effectively evaluate imputation performance across different settings. Moreover, although many deep learning forecasting algorithms have demonstrated excellent performance, whether their modeling achievements can be transferred to time series imputation tasks remains unexplored. To bridge these gaps, we develop TSI-Bench, the first (to our knowledge) comprehensive benchmark suite for time series imputation utilizing deep learning techniques. The TSI-Bench pipeline standardizes experimental settings to enable fair evaluation of imputation algorithms and identification of meaningful insights into the influence of domain-appropriate missingness ratios and patterns on model performance. Furthermore, TSI-Bench innovatively provides a systematic paradigm to tailor time series forecasting algorithms for imputation purposes. Our extensive study across 34,804 experiments, 28 algorithms, and 8 datasets with diverse missingness scenarios demonstrates TSI-Bench's effectiveness in diverse downstream tasks and potential to unlock future directions in time series imputation research and analysis. The source code and experiment logs are available at https://github.com/WenjieDu/AwesomeImputation.
☆ Ensuring Both Positivity and Stability Using Sector-Bounded Nonlinearity for Systems with Neural Network Controllers
This paper introduces a novel method for the stability analysis of positive feedback systems with a class of fully connected feedforward neural networks (FFNN) controllers. By establishing sector bounds for fully connected FFNNs without biases, we present a stability theorem that demonstrates the global exponential stability of linear systems under fully connected FFNN control. Utilizing principles from positive Lur'e systems and the positive Aizerman conjecture, our approach effectively addresses the challenge of ensuring stability in highly nonlinear systems. The crux of our method lies in maintaining sector bounds that preserve the positivity and Hurwitz property of the overall Lur'e system. We showcase the practical applicability of our methodology through its implementation in a linear system managed by a FFNN trained on output feedback controller data, highlighting its potential for enhancing stability in dynamic systems.
comment: 6 pages, 7 figures, to be published in IEEE Control Systems Letters (L-CSS)
☆ Benchmarking Multi-Image Understanding in Vision and Language Models: Perception, Knowledge, Reasoning, and Multi-Hop Reasoning
The advancement of large language models (LLMs) has significantly broadened the scope of applications in natural language processing, with multi-modal LLMs extending these capabilities to integrate and interpret visual data. However, existing benchmarks for visual language models (VLMs) predominantly focus on single-image inputs, neglecting the crucial aspect of multi-image understanding. In this paper, we introduce a Multi-Image Relational Benchmark MIRB, designed to evaluate VLMs' ability to compare, analyze, and reason across multiple images. Our benchmark encompasses four categories: perception, visual world knowledge, reasoning, and multi-hop reasoning. Through a comprehensive evaluation of a wide range of open-source and closed-source models, we demonstrate that while open-source VLMs were shown to approach the performance of GPT-4V in single-image tasks, a significant performance gap remains in multi-image reasoning tasks. Our findings also reveal that even the state-of-the-art GPT-4V model struggles with our benchmark, underscoring the need for further research and development in this area. We believe our contribution of MIRB could serve as a testbed for developing the next-generation multi-modal models.
comment: First three authors contributed equally. Dataset: https://huggingface.co/datasets/VLLMs/MIRB
☆ Large Language Model as a Universal Clinical Multi-task Decoder
The development of effective machine learning methodologies for enhancing the efficiency and accuracy of clinical systems is crucial. Despite significant research efforts, managing a plethora of diversified clinical tasks and adapting to emerging new tasks remain significant challenges. This paper presents a novel paradigm that employs a pre-trained large language model as a universal clinical multi-task decoder. This approach leverages the flexibility and diversity of language expressions to handle task topic variations and associated arguments. The introduction of a new task simply requires the addition of a new instruction template. We validate this framework across hundreds of tasks, demonstrating its robustness in facilitating multi-task predictions, performing on par with traditional multi-task learning and single-task learning approaches. Moreover, it shows exceptional adaptability to new tasks, with impressive zero-shot performance in some instances and superior data efficiency in few-shot scenarios. This novel approach offers a unified solution to manage a wide array of new and emerging tasks in clinical applications.
comment: Work in progress
☆ Beyond Visual Appearances: Privacy-sensitive Objects Identification via Hybrid Graph Reasoning
The Privacy-sensitive Object Identification (POI) task allocates bounding boxes for privacy-sensitive objects in a scene. The key to POI is settling an object's privacy class (privacy-sensitive or non-privacy-sensitive). In contrast to conventional object classes which are determined by the visual appearance of an object, one object's privacy class is derived from the scene contexts and is subject to various implicit factors beyond its visual appearance. That is, visually similar objects may be totally opposite in their privacy classes. To explicitly derive the objects' privacy class from the scene contexts, in this paper, we interpret the POI task as a visual reasoning task aimed at the privacy of each object in the scene. Following this interpretation, we propose the PrivacyGuard framework for POI. PrivacyGuard contains three stages. i) Structuring: an unstructured image is first converted into a structured, heterogeneous scene graph that embeds rich scene contexts. ii) Data Augmentation: a contextual perturbation oversampling strategy is proposed to create slightly perturbed privacy-sensitive objects in a scene graph, thereby balancing the skewed distribution of privacy classes. iii) Hybrid Graph Generation & Reasoning: the balanced, heterogeneous scene graph is then transformed into a hybrid graph by endowing it with extra "node-node" and "edge-edge" homogeneous paths. These homogeneous paths allow direct message passing between nodes or edges, thereby accelerating reasoning and facilitating the capturing of subtle context changes. Based on this hybrid graph... **For the full abstract, see the original paper.**
comment: 15 pages
☆ Automatic generation of insights from workers' actions in industrial workflows with explainable Machine Learning
New technologies such as Machine Learning (ML) gave great potential for evaluating industry workflows and automatically generating key performance indicators (KPIs). However, despite established standards for measuring the efficiency of industrial machinery, there is no precise equivalent for workers' productivity, which would be highly desirable given the lack of a skilled workforce for the next generation of industry workflows. Therefore, an ML solution combining data from manufacturing processes and workers' performance for that goal is required. Additionally, in recent times intense effort has been devoted to explainable ML approaches that can automatically explain their decisions to a human operator, thus increasing their trustworthiness. We propose to apply explainable ML solutions to differentiate between expert and inexpert workers in industrial workflows, which we validate at a quality assessment industrial workstation. Regarding the methodology used, input data are captured by a manufacturing machine and stored in a NoSQL database. Data are processed to engineer features used in automatic classification and to compute workers' KPIs to predict their level of expertise (with all classification metrics exceeding 90 %). These KPIs, and the relevant features in the decisions are textually explained by natural language expansion on an explainability dashboard. These automatic explanations made it possible to infer knowledge from expert workers for inexpert workers. The latter illustrates the interest of research in self-explainable ML for automatically generating insights to improve productivity in industrial workflows.
comment: IEEE Industrial Electronics Magazine (2023)
☆ Can Large Language Models Code Like a Linguist?: A Case Study in Low Resource Sound Law Induction
Historical linguists have long written a kind of incompletely formalized ''program'' that converts reconstructed words in an ancestor language into words in one of its attested descendants that consist of a series of ordered string rewrite functions (called sound laws). They do this by observing pairs of words in the reconstructed language (protoforms) and the descendent language (reflexes) and constructing a program that transforms protoforms into reflexes. However, writing these programs is error-prone and time-consuming. Prior work has successfully scaffolded this process computationally, but fewer researchers have tackled Sound Law Induction (SLI), which we approach in this paper by casting it as Programming by Examples. We propose a language-agnostic solution that utilizes the programming ability of Large Language Models (LLMs) by generating Python sound law programs from sound change examples. We evaluate the effectiveness of our approach for various LLMs, propose effective methods to generate additional language-agnostic synthetic data to fine-tune LLMs for SLI, and compare our method with existing automated SLI methods showing that while LLMs lag behind them they can complement some of their weaknesses.
☆ On the Robustness of Language Models for Tabular Question Answering
Large Language Models (LLMs), originally shown to ace various text comprehension tasks have also remarkably been shown to tackle table comprehension tasks without specific training. While previous research has explored LLM capabilities with tabular dataset tasks, our study assesses the influence of $\textit{in-context learning}$,$ \textit{model scale}$, $\textit{instruction tuning}$, and $\textit{domain biases}$ on Tabular Question Answering (TQA). We evaluate the robustness of LLMs on Wikipedia-based $\textbf{WTQ}$ and financial report-based $\textbf{TAT-QA}$ TQA datasets, focusing on their ability to robustly interpret tabular data under various augmentations and perturbations. Our findings indicate that instructions significantly enhance performance, with recent models like Llama3 exhibiting greater robustness over earlier versions. However, data contamination and practical reliability issues persist, especially with WTQ. We highlight the need for improved methodologies, including structure-aware self-attention mechanisms and better handling of domain-specific tabular data, to develop more reliable LLMs for table comprehension.
☆ AGLA: Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention
Despite their great success across various multimodal tasks, Large Vision-Language Models (LVLMs) are facing a prevalent problem with object hallucinations, where the generated textual responses are inconsistent with ground-truth objects in the given image. This paper investigates various LVLMs and pinpoints attention deficiency toward discriminative local image features as one root cause of object hallucinations. Specifically, LVLMs predominantly attend to prompt-independent global image features, while failing to capture prompt-relevant local features, consequently undermining the visual grounding capacity of LVLMs and leading to hallucinations. To this end, we propose Assembly of Global and Local Attention (AGLA), a training-free and plug-and-play approach that mitigates object hallucinations by exploring an ensemble of global features for response generation and local features for visual discrimination simultaneously. Our approach exhibits an image-prompt matching scheme that captures prompt-relevant local features from images, leading to an augmented view of the input image where prompt-relevant content is reserved while irrelevant distractions are masked. With the augmented view, a calibrated decoding distribution can be derived by integrating generative global features from the original image and discriminative local features from the augmented image. Extensive experiments show that AGLA consistently mitigates object hallucinations and enhances general perception capability for LVLMs across various discriminative and generative benchmarks. Our code will be released at https://github.com/Lackel/AGLA.
☆ Enhancing Spatio-temporal Quantile Forecasting with Curriculum Learning: Lessons Learned
Training models on spatio-temporal (ST) data poses an open problem due to the complicated and diverse nature of the data itself, and it is challenging to ensure the model's performance directly trained on the original ST data. While limiting the variety of training data can make training easier, it can also lead to a lack of knowledge and information for the model, resulting in a decrease in performance. To address this challenge, we presented an innovative paradigm that incorporates three separate forms of curriculum learning specifically targeting from spatial, temporal, and quantile perspectives. Furthermore, our framework incorporates a stacking fusion module to combine diverse information from three types of curriculum learning, resulting in a strong and thorough learning process. We demonstrated the effectiveness of this framework with extensive empirical evaluations, highlighting its better performance in addressing complex ST challenges. We provided thorough ablation studies to investigate the effectiveness of our curriculum and to explain how it contributes to the improvement of learning efficiency on ST data.
☆ Talk With Human-like Agents: Empathetic Dialogue Through Perceptible Acoustic Reception and Reaction ACL24
Large Language Model (LLM)-enhanced agents become increasingly prevalent in Human-AI communication, offering vast potential from entertainment to professional domains. However, current multi-modal dialogue systems overlook the acoustic information present in speech, which is crucial for understanding human communication nuances. This oversight can lead to misinterpretations of speakers' intentions, resulting in inconsistent or even contradictory responses within dialogues. To bridge this gap, in this paper, we propose PerceptiveAgent, an empathetic multi-modal dialogue system designed to discern deeper or more subtle meanings beyond the literal interpretations of words through the integration of speech modality perception. Employing LLMs as a cognitive core, PerceptiveAgent perceives acoustic information from input speech and generates empathetic responses based on speaking styles described in natural language. Experimental results indicate that PerceptiveAgent excels in contextual understanding by accurately discerning the speakers' true intentions in scenarios where the linguistic meaning is either contrary to or inconsistent with the speaker's true feelings, producing more nuanced and expressive spoken dialogues. Code is publicly available at: \url{https://github.com/Haoqiu-Yan/PerceptiveAgent}.
comment: 9 pages, 3 figures, ACL24 accepted
☆ Online-Adaptive Anomaly Detection for Defect Identification in Aircraft Assembly
Anomaly detection deals with detecting deviations from established patterns within data. It has various applications like autonomous driving, predictive maintenance, and medical diagnosis. To improve anomaly detection accuracy, transfer learning can be applied to large, pre-trained models and adapt them to the specific application context. In this paper, we propose a novel framework for online-adaptive anomaly detection using transfer learning. The approach adapts to different environments by selecting visually similar training images and online fitting a normality model to EfficientNet features extracted from the training subset. Anomaly detection is then performed by computing the Mahalanobis distance between the normality model and the test image features. Different similarity measures (SIFT/FLANN, Cosine) and normality models (MVG, OCSVM) are employed and compared with each other. We evaluate the approach on different anomaly detection benchmarks and data collected in controlled laboratory settings. Experimental results showcase a detection accuracy exceeding 0.975, outperforming the state-of-the-art ET-NET approach.
comment: This is preprint for the accepted paper
☆ XXLTraffic: Expanding and Extremely Long Traffic Dataset for Ultra-Dynamic Forecasting Challenges
Traffic forecasting is crucial for smart cities and intelligent transportation initiatives, where deep learning has made significant progress in modeling complex spatio-temporal patterns in recent years. However, current public datasets have limitations in reflecting the ultra-dynamic nature of real-world scenarios, characterized by continuously evolving infrastructures, varying temporal distributions, and temporal gaps due to sensor downtimes or changes in traffic patterns. These limitations inevitably restrict the practical applicability of existing traffic forecasting datasets. To bridge this gap, we present XXLTraffic, the largest available public traffic dataset with the longest timespan and increasing number of sensor nodes over the multiple years observed in the data, curated to support research in ultra-dynamic forecasting. Our benchmark includes both typical time-series forecasting settings with hourly and daily aggregated data and novel configurations that introduce gaps and down-sample the training size to better simulate practical constraints. We anticipate the new XXLTraffic will provide a fresh perspective for the time-series and traffic forecasting communities. It would also offer a robust platform for developing and evaluating models designed to tackle ultra-dynamic and extremely long forecasting problems. Our dataset supplements existing spatio-temporal data resources and leads to new research directions in this domain.
☆ MAGIC: Generating Self-Correction Guideline for In-Context Text-to-SQL
Self-correction in text-to-SQL is the process of prompting large language model (LLM) to revise its previously incorrectly generated SQL, and commonly relies on manually crafted self-correction guidelines by human experts that are not only labor-intensive to produce but also limited by the human ability in identifying all potential error patterns in LLM responses. We introduce MAGIC, a novel multi-agent method that automates the creation of the self-correction guideline. MAGIC uses three specialized agents: a manager, a correction, and a feedback agent. These agents collaborate on the failures of an LLM-based method on the training set to iteratively generate and refine a self-correction guideline tailored to LLM mistakes, mirroring human processes but without human involvement. Our extensive experiments show that MAGIC's guideline outperforms expert human's created ones. We empirically find out that the guideline produced by MAGIC enhance the interpretability of the corrections made, providing insights in analyzing the reason behind the failures and successes of LLMs in self-correction. We make all agent interactions publicly available to the research community, to foster further research in this area, offering a synthetic dataset for future explorations into automatic self-correction guideline generation.
comment: 20 pages, 17 figures
☆ Sparsifying dimensionality reduction of PDE solution data with Bregman learning
Classical model reduction techniques project the governing equations onto a linear subspace of the original state space. More recent data-driven techniques use neural networks to enable nonlinear projections. Whilst those often enable stronger compression, they may have redundant parameters and lead to suboptimal latent dimensionality. To overcome these, we propose a multistep algorithm that induces sparsity in the encoder-decoder networks for effective reduction in the number of parameters and additional compression of the latent space. This algorithm starts with sparsely initialized a network and training it using linearized Bregman iterations. These iterations have been very successful in computer vision and compressed sensing tasks, but have not yet been used for reduced-order modelling. After the training, we further compress the latent space dimensionality by using a form of proper orthogonal decomposition. Last, we use a bias propagation technique to change the induced sparsity into an effective reduction of parameters. We apply this algorithm to three representative PDE models: 1D diffusion, 1D advection, and 2D reaction-diffusion. Compared to conventional training methods like Adam, the proposed method achieves similar accuracy with 30% less parameters and a significantly smaller latent space.
comment: 27 pages, 20 figures, 4 tables
☆ Stealth edits for provably fixing or attacking large language models
We reveal new methods and the theoretical foundations of techniques for editing large language models. We also show how the new theory can be used to assess the editability of models and to expose their susceptibility to previously unknown malicious attacks. Our theoretical approach shows that a single metric (a specific measure of the intrinsic dimensionality of the model's features) is fundamental to predicting the success of popular editing approaches, and reveals new bridges between disparate families of editing methods. We collectively refer to these approaches as stealth editing methods, because they aim to directly and inexpensively update a model's weights to correct the model's responses to known hallucinating prompts without otherwise affecting the model's behaviour, without requiring retraining. By carefully applying the insight gleaned from our theoretical investigation, we are able to introduce a new network block -- named a jet-pack block -- which is optimised for highly selective model editing, uses only standard network operations, and can be inserted into existing networks. The intrinsic dimensionality metric also determines the vulnerability of a language model to a stealth attack: a small change to a model's weights which changes its response to a single attacker-chosen prompt. Stealth attacks do not require access to or knowledge of the model's training data, therefore representing a potent yet previously unrecognised threat to redistributed foundation models. They are computationally simple enough to be implemented in malware in many cases. Extensive experimental results illustrate and support the method and its theoretical underpinnings. Demos and source code for editing language models are available at https://github.com/qinghua-zhou/stealth-edits.
comment: 24 pages, 9 figures. Open source implementation: https://github.com/qinghua-zhou/stealth-edits
☆ CollabStory: Multi-LLM Collaborative Story Generation and Authorship Analysis
The rise of unifying frameworks that enable seamless interoperability of Large Language Models (LLMs) has made LLM-LLM collaboration for open-ended tasks a possibility. Despite this, there have not been efforts to explore such collaborative writing. We take the next step beyond human-LLM collaboration to explore this multi-LLM scenario by generating the first exclusively LLM-generated collaborative stories dataset called CollabStory. We focus on single-author ($N=1$) to multi-author (up to $N=5$) scenarios, where multiple LLMs co-author stories. We generate over 32k stories using open-source instruction-tuned LLMs. Further, we take inspiration from the PAN tasks that have set the standard for human-human multi-author writing tasks and analysis. We extend their authorship-related tasks for multi-LLM settings and present baselines for LLM-LLM collaboration. We find that current baselines are not able to handle this emerging scenario. Thus, CollabStory is a resource that could help propel an understanding as well as the development of techniques to discern the use of multiple LLMs. This is crucial to study in the context of writing tasks since LLM-LLM collaboration could potentially overwhelm ongoing challenges related to plagiarism detection, credit assignment, maintaining academic integrity in educational settings, and addressing copyright infringement concerns. We make our dataset and code available at \texttt{\url{https://github.com/saranya-venkatraman/multi_llm_story_writing}}.
☆ Do More Details Always Introduce More Hallucinations in LVLM-based Image Captioning?
Large Vision-Language Models (LVLMs) excel in integrating visual and linguistic contexts to produce detailed content, facilitating applications such as image captioning. However, using LVLMs to generate descriptions often faces the challenge of object hallucination (OH), where the output text misrepresents actual objects in the input image. While previous studies attribute the occurrence of OH to the inclusion of more details, our study finds technical flaws in existing metrics, leading to unreliable evaluations of models and conclusions about OH. This has sparked a debate on the question: Do more details always introduce more hallucinations in LVLM-based image captioning? In this paper, we address this debate by proposing a novel decoding strategy, Differentiated Beam Decoding (DBD), along with a reliable new set of evaluation metrics: CLIP-Precision, CLIP-Recall, and CLIP-F1. DBD decodes the wealth of information hidden in visual input into distinct language representations called unit facts in parallel. This decoding is achieved via a well-designed differential score that guides the parallel search and candidate screening. The selected unit facts are then aggregated to generate the final caption. Our proposed metrics evaluate the comprehensiveness and accuracy of image captions by comparing the embedding groups of ground-truth image regions and generated text partitions. Extensive experiments on the Visual Genome dataset validate the effectiveness of our approach, demonstrating that it produces detailed descriptions while maintaining low hallucination levels.
☆ Investigating the Role of Explainability and AI Literacy in User Compliance
AI is becoming increasingly common across different domains. However, as sophisticated AI-based systems are often black-boxed, rendering the decision-making logic opaque, users find it challenging to comply with their recommendations. Although researchers are investigating Explainable AI (XAI) to increase the transparency of the underlying machine learning models, it is unclear what types of explanations are effective and what other factors increase compliance. To better understand the interplay of these factors, we conducted an experiment with 562 participants who were presented with the recommendations of an AI and two different types of XAI. We find that users' compliance increases with the introduction of XAI but is also affected by AI literacy. We also find that the relationships between AI literacy XAI and users' compliance are mediated by the users' mental model of AI. Our study has several implications for successfully designing AI-based systems utilizing XAI.
☆ Benchmarks and Metrics for Evaluations of Code Generation: A Critical Review
With the rapid development of Large Language Models (LLMs), a large number of machine learning models have been developed to assist programming tasks including the generation of program code from natural language input. However, how to evaluate such LLMs for this task is still an open problem despite of the great amount of research efforts that have been made and reported to evaluate and compare them. This paper provides a critical review of the existing work on the testing and evaluation of these tools with a focus on two key aspects: the benchmarks and the metrics used in the evaluations. Based on the review, further research directions are discussed.
comment: Accepted by the First IEEE International Workshop on Testing and Evaluation of Large Language Models (TELLMe 2024) and will be published in the proceedings of the IEEE AITest 2024 conference
☆ Transforming Surgical Interventions with Embodied Intelligence for Ultrasound Robotics MICCAI 2024
Ultrasonography has revolutionized non-invasive diagnostic methodologies, significantly enhancing patient outcomes across various medical domains. Despite its advancements, integrating ultrasound technology with robotic systems for automated scans presents challenges, including limited command understanding and dynamic execution capabilities. To address these challenges, this paper introduces a novel Ultrasound Embodied Intelligence system that synergistically combines ultrasound robots with large language models (LLMs) and domain-specific knowledge augmentation, enhancing ultrasound robots' intelligence and operational efficiency. Our approach employs a dual strategy: firstly, integrating LLMs with ultrasound robots to interpret doctors' verbal instructions into precise motion planning through a comprehensive understanding of ultrasound domain knowledge, including APIs and operational manuals; secondly, incorporating a dynamic execution mechanism, allowing for real-time adjustments to scanning plans based on patient movements or procedural errors. We demonstrate the effectiveness of our system through extensive experiments, including ablation studies and comparisons across various models, showcasing significant improvements in executing medical procedures from verbal commands. Our findings suggest that the proposed system improves the efficiency and quality of ultrasound scans and paves the way for further advancements in autonomous medical scanning technologies, with the potential to transform non-invasive diagnostics and streamline medical workflows.
comment: This work has been accepted by MICCAI 2024
☆ Probabilistic Conceptual Explainers: Trustworthy Conceptual Explanations for Vision Foundation Models
Vision transformers (ViTs) have emerged as a significant area of focus, particularly for their capacity to be jointly trained with large language models and to serve as robust vision foundation models. Yet, the development of trustworthy explanation methods for ViTs has lagged, particularly in the context of post-hoc interpretations of ViT predictions. Existing sub-image selection approaches, such as feature-attribution and conceptual models, fall short in this regard. This paper proposes five desiderata for explaining ViTs -- faithfulness, stability, sparsity, multi-level structure, and parsimony -- and demonstrates the inadequacy of current methods in meeting these criteria comprehensively. We introduce a variational Bayesian explanation framework, dubbed ProbAbilistic Concept Explainers (PACE), which models the distributions of patch embeddings to provide trustworthy post-hoc conceptual explanations. Our qualitative analysis reveals the distributions of patch-level concepts, elucidating the effectiveness of ViTs by modeling the joint distribution of patch embeddings and ViT's predictions. Moreover, these patch-level explanations bridge the gap between image-level and dataset-level explanations, thus completing the multi-level structure of PACE. Through extensive experiments on both synthetic and real-world datasets, we demonstrate that PACE surpasses state-of-the-art methods in terms of the defined desiderata.
☆ An Empirical Study on the Fairness of Foundation Models for Multi-Organ Image Segmentation MICCAI-2024
The segmentation foundation model, e.g., Segment Anything Model (SAM), has attracted increasing interest in the medical image community. Early pioneering studies primarily concentrated on assessing and improving SAM's performance from the perspectives of overall accuracy and efficiency, yet little attention was given to the fairness considerations. This oversight raises questions about the potential for performance biases that could mirror those found in task-specific deep learning models like nnU-Net. In this paper, we explored the fairness dilemma concerning large segmentation foundation models. We prospectively curate a benchmark dataset of 3D MRI and CT scans of the organs including liver, kidney, spleen, lung and aorta from a total of 1056 healthy subjects with expert segmentations. Crucially, we document demographic details such as gender, age, and body mass index (BMI) for each subject to facilitate a nuanced fairness analysis. We test state-of-the-art foundation models for medical image segmentation, including the original SAM, medical SAM and SAT models, to evaluate segmentation efficacy across different demographic groups and identify disparities. Our comprehensive analysis, which accounts for various confounding factors, reveals significant fairness concerns within these foundational models. Moreover, our findings highlight not only disparities in overall segmentation metrics, such as the Dice Similarity Coefficient but also significant variations in the spatial distribution of segmentation errors, offering empirical evidence of the nuanced challenges in ensuring fairness in medical image segmentation.
comment: Accepted to MICCAI-2024
☆ Evaluating Transparency of Machine Generated Fact Checking Explanations
An important factor when it comes to generating fact-checking explanations is the selection of evidence: intuitively, high-quality explanations can only be generated given the right evidence. In this work, we investigate the impact of human-curated vs. machine-selected evidence for explanation generation using large language models. To assess the quality of explanations, we focus on transparency (whether an explanation cites sources properly) and utility (whether an explanation is helpful in clarifying a claim). Surprisingly, we found that large language models generate similar or higher quality explanations using machine-selected evidence, suggesting carefully curated evidence (by humans) may not be necessary. That said, even with the best model, the generated explanations are not always faithful to the sources, suggesting further room for improvement in explanation generation for fact-checking.
☆ Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models
Assessing the effectiveness of large language models (LLMs) in addressing diverse tasks is essential for comprehending their strengths and weaknesses. Conventional evaluation techniques typically apply a single prompting strategy uniformly across datasets, not considering the varying degrees of task complexity. We introduce the Hierarchical Prompting Taxonomy (HPT), a taxonomy that employs a Hierarchical Prompt Framework (HPF) composed of five unique prompting strategies, arranged from the simplest to the most complex, to assess LLMs more precisely and to offer a clearer perspective. This taxonomy assigns a score, called the Hierarchical Prompting Score (HP-Score), to datasets as well as LLMs based on the rules of the taxonomy, providing a nuanced understanding of their ability to solve diverse tasks and offering a universal measure of task complexity. Additionally, we introduce the Adaptive Hierarchical Prompt framework, which automates the selection of appropriate prompting strategies for each task. This study compares manual and adaptive hierarchical prompt frameworks using four instruction-tuned LLMs, namely Llama 3 8B, Phi 3 3.8B, Mistral 7B, and Gemma 7B, across four datasets: BoolQ, CommonSenseQA (CSQA), IWSLT-2017 en-fr (IWSLT), and SamSum. Experiments demonstrate the effectiveness of HPT, providing a reliable way to compare different tasks and LLM capabilities. This paper leads to the development of a universal evaluation metric that can be used to evaluate both the complexity of the datasets and the capabilities of LLMs. The implementation of both manual HPF and adaptive HPF is publicly available.
☆ Ask-before-Plan: Proactive Language Agents for Real-World Planning
The evolution of large language models (LLMs) has enhanced the planning capabilities of language agents in diverse real-world scenarios. Despite these advancements, the potential of LLM-powered agents to comprehend ambiguous user instructions for reasoning and decision-making is still under exploration. In this work, we introduce a new task, Proactive Agent Planning, which requires language agents to predict clarification needs based on user-agent conversation and agent-environment interaction, invoke external tools to collect valid information, and generate a plan to fulfill the user's demands. To study this practical problem, we establish a new benchmark dataset, Ask-before-Plan. To tackle the deficiency of LLMs in proactive planning, we propose a novel multi-agent framework, Clarification-Execution-Planning (\texttt{CEP}), which consists of three agents specialized in clarification, execution, and planning. We introduce the trajectory tuning scheme for the clarification agent and static execution agent, as well as the memory recollection mechanism for the dynamic execution agent. Extensive evaluations and comprehensive analyses conducted on the Ask-before-Plan dataset validate the effectiveness of our proposed framework.
☆ ScenEval: A Benchmark for Scenario-Based Evaluation of Code Generation
In the scenario-based evaluation of machine learning models, a key problem is how to construct test datasets that represent various scenarios. The methodology proposed in this paper is to construct a benchmark and attach metadata to each test case. Then a test system can be constructed with test morphisms that filter the test cases based on metadata to form a dataset. The paper demonstrates this methodology with large language models for code generation. A benchmark called ScenEval is constructed from problems in textbooks, an online tutorial website and Stack Overflow. Filtering by scenario is demonstrated and the test sets are used to evaluate ChatGPT for Java code generation. Our experiments found that the performance of ChatGPT decreases with the complexity of the coding task. It is weakest for advanced topics like multi-threading, data structure algorithms and recursive methods. The Java code generated by ChatGPT tends to be much shorter than reference solution in terms of number of lines, while it is more likely to be more complex in both cyclomatic and cognitive complexity metrics, if the generated code is correct. However, the generated code is more likely to be less complex than the reference solution if the code is incorrect.
comment: Accepted for publication in the conference proceedings of IEEE AITest 2024
☆ News Without Borders: Domain Adaptation of Multilingual Sentence Embeddings for Cross-lingual News Recommendation
Rapidly growing numbers of multilingual news consumers pose an increasing challenge to news recommender systems in terms of providing customized recommendations. First, existing neural news recommenders, even when powered by multilingual language models (LMs), suffer substantial performance losses in zero-shot cross-lingual transfer (ZS-XLT). Second, the current paradigm of fine-tuning the backbone LM of a neural recommender on task-specific data is computationally expensive and infeasible in few-shot recommendation and cold-start setups, where data is scarce or completely unavailable. In this work, we propose a news-adapted sentence encoder (NaSE), domain-specialized from a pretrained massively multilingual sentence encoder (SE). To this end, we construct and leverage PolyNews and PolyNewsParallel, two multilingual news-specific corpora. With the news-adapted multilingual SE in place, we test the effectiveness of (i.e., question the need for) supervised fine-tuning for news recommendation, and propose a simple and strong baseline based on (i) frozen NaSE embeddings and (ii) late click-behavior fusion. We show that NaSE achieves state-of-the-art performance in ZS-XLT in true cold-start and few-shot news recommendation.
☆ SeTAR: Out-of-Distribution Detection with Selective Low-Rank Approximation
Out-of-distribution (OOD) detection is crucial for the safe deployment of neural networks. Existing CLIP-based approaches perform OOD detection by devising novel scoring functions or sophisticated fine-tuning methods. In this work, we propose SeTAR, a novel, training-free OOD detection method that leverages selective low-rank approximation of weight matrices in vision-language and vision-only models. SeTAR enhances OOD detection via post-hoc modification of the model's weight matrices using a simple greedy search algorithm. Based on SeTAR, we further propose SeTAR+FT, a fine-tuning extension optimizing model performance for OOD detection tasks. Extensive evaluations on ImageNet1K and Pascal-VOC benchmarks show SeTAR's superior performance, reducing the false positive rate by up to 18.95% and 36.80% compared to zero-shot and fine-tuning baselines. Ablation studies further validate our approach's effectiveness, robustness, and generalizability across different model backbones. Our work offers a scalable, efficient solution for OOD detection, setting a new state-of-the-art in this area.
comment: Code are available at \url{https://github.com/X1AOX1A/SeTAR}
☆ Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Offering a promising solution to the scalability challenges associated with human evaluation, the LLM-as-a-judge paradigm is rapidly gaining traction as an approach to evaluating large language models (LLMs). However, there are still many open questions about the strengths and weaknesses of this paradigm, and what potential biases it may hold. In this paper, we present a comprehensive study of the performance of various LLMs acting as judges. We leverage TriviaQA as a benchmark for assessing objective knowledge reasoning of LLMs and evaluate them alongside human annotations which we found to have a high inter-annotator agreement. Our study includes 9 judge models and 9 exam taker models -- both base and instruction-tuned. We assess the judge model's alignment across different model sizes, families, and judge prompts. Among other results, our research rediscovers the importance of using Cohen's kappa as a metric of alignment as opposed to simple percent agreement, showing that judges with high percent agreement can still assign vastly different scores. We find that both Llama-3 70B and GPT-4 Turbo have an excellent alignment with humans, but in terms of ranking exam taker models, they are outperformed by both JudgeLM-7B and the lexical judge Contains, which have up to 34 points lower human alignment. Through error analysis and various other studies, including the effects of instruction length and leniency bias, we hope to provide valuable lessons for using LLMs as judges in the future.
☆ Bridging Local Details and Global Context in Text-Attributed Graphs
Representation learning on text-attributed graphs (TAGs) is vital for real-world applications, as they combine semantic textual and contextual structural information. Research in this field generally consist of two main perspectives: local-level encoding and global-level aggregating, respectively refer to textual node information unification (e.g., using Language Models) and structure-augmented modeling (e.g., using Graph Neural Networks). Most existing works focus on combining different information levels but overlook the interconnections, i.e., the contextual textual information among nodes, which provides semantic insights to bridge local and global levels. In this paper, we propose GraphBridge, a multi-granularity integration framework that bridges local and global perspectives by leveraging contextual textual information, enhancing fine-grained understanding of TAGs. Besides, to tackle scalability and efficiency challenges, we introduce a graphaware token reduction module. Extensive experiments across various models and datasets show that our method achieves state-of-theart performance, while our graph-aware token reduction module significantly enhances efficiency and solves scalability issues.
PromptDSI: Prompt-based Rehearsal-free Instance-wise Incremental Learning for Document Retrieval
Differentiable Search Index (DSI) utilizes Pre-trained Language Models (PLMs) for efficient document retrieval without relying on external indexes. However, DSIs need full re-training to handle updates in dynamic corpora, causing significant computational inefficiencies. We introduce PromptDSI, a rehearsal-free, prompt-based approach for instance-wise incremental learning in document retrieval. PromptDSI attaches prompts to the frozen PLM's encoder of DSI, leveraging its powerful representation to efficiently index new corpora while maintaining a balance between stability and plasticity. We eliminate the initial forward pass of prompt-based continual learning methods that doubles training and inference time. Moreover, we propose a topic-aware prompt pool that employs neural topic embeddings as fixed keys. This strategy ensures diverse and effective prompt usage, addressing the challenge of parameter underutilization caused by the collapse of the query-key matching mechanism. Our empirical evaluations demonstrate that PromptDSI matches IncDSI in managing forgetting while significantly enhancing recall by over 4% on new corpora.
comment: 21 pages
☆ UIFV: Data Reconstruction Attack in Vertical Federated Learning
Vertical Federated Learning (VFL) facilitates collaborative machine learning without the need for participants to share raw private data. However, recent studies have revealed privacy risks where adversaries might reconstruct sensitive features through data leakage during the learning process. Although data reconstruction methods based on gradient or model information are somewhat effective, they reveal limitations in VFL application scenarios. This is because these traditional methods heavily rely on specific model structures and/or have strict limitations on application scenarios. To address this, our study introduces the Unified InverNet Framework into VFL, which yields a novel and flexible approach (dubbed UIFV) that leverages intermediate feature data to reconstruct original data, instead of relying on gradients or model details. The intermediate feature data is the feature exchanged by different participants during the inference phase of VFL. Experiments on four datasets demonstrate that our methods significantly outperform state-of-the-art techniques in attack precision. Our work exposes severe privacy vulnerabilities within VFL systems that pose real threats to practical VFL applications and thus confirms the necessity of further enhancing privacy protection in the VFL architecture.
☆ Breaking the Ceiling of the LLM Community by Treating Token Generation as a Classification for Ensembling
Ensembling multiple models has always been an effective approach to push the limits of existing performance and is widely used in classification tasks by simply averaging the classification probability vectors from multiple classifiers to achieve better accuracy. However, in the thriving open-source Large Language Model (LLM) community, ensembling methods are rare and typically limited to ensembling the full-text outputs of LLMs, such as selecting the best output using a ranker, which leads to underutilization of token-level probability information. In this paper, we treat the Generation of each token by LLMs as a Classification (GaC) for ensembling. This approach fully exploits the probability information at each generation step and better prevents LLMs from producing early incorrect tokens that lead to snowballing errors. In experiments, we ensemble state-of-the-art LLMs on several benchmarks, including exams, mathematics and reasoning, and observe that our method breaks the existing community performance ceiling. Furthermore, we observed that most of the tokens in the answer are simple and do not affect the correctness of the final answer. Therefore, we also experimented with ensembling only key tokens, and the results showed better performance with lower latency across benchmarks.
☆ Mathador-LM: A Dynamic Benchmark for Mathematical Reasoning on Large Language Models
We introduce Mathador-LM, a new benchmark for evaluating the mathematical reasoning on large language models (LLMs), combining ruleset interpretation, planning, and problem-solving. This benchmark is inspired by the Mathador game, where the objective is to reach a target number using basic arithmetic operations on a given set of base numbers, following a simple set of rules. We show that, across leading LLMs, we obtain stable average performance while generating benchmark instances dynamically, following a target difficulty level. Thus, our benchmark alleviates concerns about test-set leakage into training data, an issue that often undermines popular benchmarks. Additionally, we conduct a comprehensive evaluation of both open and closed-source state-of-the-art LLMs on Mathador-LM. Our findings reveal that contemporary models struggle with Mathador-LM, scoring significantly lower than average 5th graders. This stands in stark contrast to their strong performance on popular mathematical reasoning benchmarks.
☆ Bayesian Data Selection ICML 2024
A wide range of machine learning algorithms iteratively add data to the training sample. Examples include semi-supervised learning, active learning, multi-armed bandits, and Bayesian optimization. We embed this kind of data addition into decision theory by framing data selection as a decision problem. This paves the way for finding Bayes-optimal selections of data. For the illustrative case of self-training in semi-supervised learning, we derive the respective Bayes criterion. We further show that deploying this criterion mitigates the issue of confirmation bias by empirically assessing our method for generalized linear models, semi-parametric generalized additive models, and Bayesian neural networks on simulated and real-world data.
comment: 5th Workshop on Data-Centric Machine Learning Research (DMLR) at ICML 2024
☆ Offline Imitation Learning with Model-based Reverse Augmentation KDD2024
In offline Imitation Learning (IL), one of the main challenges is the \textit{covariate shift} between the expert observations and the actual distribution encountered by the agent, because it is difficult to determine what action an agent should take when outside the state distribution of the expert demonstrations. Recently, the model-free solutions introduce the supplementary data and identify the latent expert-similar samples to augment the reliable samples during learning. Model-based solutions build forward dynamic models with conservatism quantification and then generate additional trajectories in the neighborhood of expert demonstrations. However, without reward supervision, these methods are often over-conservative in the out-of-expert-support regions, because only in states close to expert-observed states can there be a preferred action enabling policy optimization. To encourage more exploration on expert-unobserved states, we propose a novel model-based framework, called offline Imitation Learning with Self-paced Reverse Augmentation (SRA). Specifically, we build a reverse dynamic model from the offline demonstrations, which can efficiently generate trajectories leading to the expert-observed states in a self-paced style. Then, we use the subsequent reinforcement learning method to learn from the augmented trajectories and transit from expert-unobserved states to expert-observed states. This framework not only explores the expert-unobserved states but also guides maximizing long-term returns on these states, ultimately enabling generalization beyond the expert data. Empirical results show that our proposal could effectively mitigate the covariate shift and achieve the state-of-the-art performance on the offline imitation learning benchmarks. Project website: \url{https://www.lamda.nju.edu.cn/shaojj/KDD24_SRA/}.
comment: Accepted by KDD2024
☆ MultiSocial: Multilingual Benchmark of Machine-Generated Text Detection of Social-Media Texts
Recent LLMs are able to generate high-quality multilingual texts, indistinguishable for humans from authentic human-written ones. Research in machine-generated text detection is however mostly focused on the English language and longer texts, such as news articles, scientific papers or student essays. Social-media texts are usually much shorter and often feature informal language, grammatical errors, or distinct linguistic items (e.g., emoticons, hashtags). There is a gap in studying the ability of existing methods in detection of such texts, reflected also in the lack of existing multilingual benchmark datasets. To fill this gap we propose the first multilingual (22 languages) and multi-platform (5 social media platforms) dataset for benchmarking machine-generated text detection in the social-media domain, called MultiSocial. It contains 472,097 texts, of which about 58k are human-written and approximately the same amount is generated by each of 7 multilingual LLMs. We use this benchmark to compare existing detection methods in zero-shot as well as fine-tuned form. Our results indicate that the fine-tuned detectors have no problem to be trained on social-media texts and that the platform selection for training matters.
☆ The Heterophilic Snowflake Hypothesis: Training and Empowering GNNs for Heterophilic Graphs KDD 2024
Graph Neural Networks (GNNs) have become pivotal tools for a range of graph-based learning tasks. Notably, most current GNN architectures operate under the assumption of homophily, whether explicitly or implicitly. While this underlying assumption is frequently adopted, it is not universally applicable, which can result in potential shortcomings in learning effectiveness. In this paper, \textbf{for the first time}, we transfer the prevailing concept of ``one node one receptive field" to the heterophilic graph. By constructing a proxy label predictor, we enable each node to possess a latent prediction distribution, which assists connected nodes in determining whether they should aggregate their associated neighbors. Ultimately, every node can have its own unique aggregation hop and pattern, much like each snowflake is unique and possesses its own characteristics. Based on observations, we innovatively introduce the Heterophily Snowflake Hypothesis and provide an effective solution to guide and facilitate research on heterophilic graphs and beyond. We conduct comprehensive experiments including (1) main results on 10 graphs with varying heterophily ratios across 10 backbones; (2) scalability on various deep GNN backbones (SGC, JKNet, etc.) across various large number of layers (2,4,6,8,16,32 layers); (3) comparison with conventional snowflake hypothesis; (4) efficiency comparison with existing graph pruning algorithms. Our observations show that our framework acts as a versatile operator for diverse tasks. It can be integrated into various GNN frameworks, boosting performance in-depth and offering an explainable approach to choosing the optimal network depth. The source code is available at \url{https://github.com/bingreeky/HeteroSnoH}.
comment: KDD 2024
☆ Variational Distillation of Diffusion Policies into Mixture of Experts
This work introduces Variational Diffusion Distillation (VDD), a novel method that distills denoising diffusion policies into Mixtures of Experts (MoE) through variational inference. Diffusion Models are the current state-of-the-art in generative modeling due to their exceptional ability to accurately learn and represent complex, multi-modal distributions. This ability allows Diffusion Models to replicate the inherent diversity in human behavior, making them the preferred models in behavior learning such as Learning from Human Demonstrations (LfD). However, diffusion models come with some drawbacks, including the intractability of likelihoods and long inference times due to their iterative sampling process. The inference times, in particular, pose a significant challenge to real-time applications such as robot control. In contrast, MoEs effectively address the aforementioned issues while retaining the ability to represent complex distributions but are notoriously difficult to train. VDD is the first method that distills pre-trained diffusion models into MoE models, and hence, combines the expressiveness of Diffusion Models with the benefits of Mixture Models. Specifically, VDD leverages a decompositional upper bound of the variational objective that allows the training of each expert separately, resulting in a robust optimization scheme for MoEs. VDD demonstrates across nine complex behavior learning tasks, that it is able to: i) accurately distill complex distributions learned by the diffusion model, ii) outperform existing state-of-the-art distillation methods, and iii) surpass conventional methods for training MoE.
☆ LLM4MSR: An LLM-Enhanced Paradigm for Multi-Scenario Recommendation
As the demand for more personalized recommendation grows and a dramatic boom in commercial scenarios arises, the study on multi-scenario recommendation (MSR) has attracted much attention, which uses the data from all scenarios to simultaneously improve their recommendation performance. However, existing methods tend to integrate insufficient scenario knowledge and neglect learning personalized cross-scenario preferences, thus leading to suboptimal performance and inadequate interpretability. Meanwhile, though large language model (LLM) has shown great capability of reasoning and capturing semantic information, the high inference latency and high computation cost of tuning hinder its implementation in industrial recommender systems. To fill these gaps, we propose an effective efficient interpretable LLM-enhanced paradigm LLM4MSR in this work. Specifically, we first leverage LLM to uncover multi-level knowledge including scenario correlations and users' cross-scenario interests from the designed scenario- and user-level prompt without fine-tuning the LLM, then adopt hierarchical meta networks to generate multi-level meta layers to explicitly improves the scenario-aware and personalized recommendation capability. Our experiments on KuaiSAR-small, KuaiSAR, and Amazon datasets validate two significant advantages of LLM4MSR: (i) the effectiveness and compatibility with different multi-scenario backbone models (achieving 1.5%, 1%, and 40% AUC improvement on three datasets), (ii) high efficiency and deployability on industrial recommender systems, and (iii) improved interpretability. The implemented code and data is available to ease reproduction.
☆ Autonomous navigation of catheters and guidewires in mechanical thrombectomy using inverse reinforcement learning
Purpose: Autonomous navigation of catheters and guidewires can enhance endovascular surgery safety and efficacy, reducing procedure times and operator radiation exposure. Integrating tele-operated robotics could widen access to time-sensitive emergency procedures like mechanical thrombectomy (MT). Reinforcement learning (RL) shows potential in endovascular navigation, yet its application encounters challenges without a reward signal. This study explores the viability of autonomous navigation in MT vasculature using inverse RL (IRL) to leverage expert demonstrations. Methods: This study established a simulation-based training and evaluation environment for MT navigation. We used IRL to infer reward functions from expert behaviour when navigating a guidewire and catheter. We utilized soft actor-critic to train models with various reward functions and compared their performance in silico. Results: We demonstrated feasibility of navigation using IRL. When evaluating single versus dual device (i.e. guidewire versus catheter and guidewire) tracking, both methods achieved high success rates of 95% and 96%, respectively. Dual-tracking, however, utilized both devices mimicking an expert. A success rate of 100% and procedure time of 22.6 s were obtained when training with a reward function obtained through reward shaping. This outperformed a dense reward function (96%, 24.9 s) and an IRL-derived reward function (48%, 59.2 s). Conclusions: We have contributed to the advancement of autonomous endovascular intervention navigation, particularly MT, by employing IRL. The results underscore the potential of using reward shaping to train models, offering a promising avenue for enhancing the accessibility and precision of MT. We envisage that future research can extend our methodology to diverse anatomical structures to enhance generalizability.
comment: Abstract shortened for arXiv character limit
☆ RS-GPT4V: A Unified Multimodal Instruction-Following Dataset for Remote Sensing Image Understanding
The remote sensing image intelligence understanding model is undergoing a new profound paradigm shift which has been promoted by multi-modal large language model (MLLM), i.e. from the paradigm learning a domain model (LaDM) shifts to paradigm learning a pre-trained general foundation model followed by an adaptive domain model (LaGD). Under the new LaGD paradigm, the old datasets, which have led to advances in RSI intelligence understanding in the last decade, are no longer suitable for fire-new tasks. We argued that a new dataset must be designed to lighten tasks with the following features: 1) Generalization: training model to learn shared knowledge among tasks and to adapt to different tasks; 2) Understanding complex scenes: training model to understand the fine-grained attribute of the objects of interest, and to be able to describe the scene with natural language; 3) Reasoning: training model to be able to realize high-level visual reasoning. In this paper, we designed a high-quality, diversified, and unified multimodal instruction-following dataset for RSI understanding produced by GPT-4V and existing datasets, which we called RS-GPT4V. To achieve generalization, we used a (Question, Answer) which was deduced from GPT-4V via instruction-following to unify the tasks such as captioning and localization; To achieve complex scene, we proposed a hierarchical instruction description with local strategy in which the fine-grained attributes of the objects and their spatial relationships are described and global strategy in which all the local information are integrated to yield detailed instruction descript; To achieve reasoning, we designed multiple-turn QA pair to provide the reasoning ability for a model. The empirical results show that the fine-tuned MLLMs by RS-GPT4V can describe fine-grained information. The dataset is available at: https://github.com/GeoX-Lab/RS-GPT4V.
comment: 14 pages, 6 figures, 4 tables
☆ Adaptive Token Biaser: Knowledge Editing via Biasing Key Entities
The parametric knowledge memorized by large language models (LLMs) becomes outdated quickly. In-context editing (ICE) is currently the most effective method for updating the knowledge of LLMs. Recent advancements involve enhancing ICE by modifying the decoding strategy, obviating the need for altering internal model structures or adjusting external prompts. However, this enhancement operates across the entire sequence generation, encompassing a plethora of non-critical tokens. In this work, we introduce $\textbf{A}$daptive $\textbf{T}$oken $\textbf{Bias}$er ($\textbf{ATBias}$), a new decoding technique designed to enhance ICE. It focuses on the tokens that are mostly related to knowledge during decoding, biasing their logits by matching key entities related to new and parametric knowledge. Experimental results show that ATBias significantly enhances ICE performance, achieving up to a 32.3% improvement over state-of-the-art ICE methods while incurring only half the latency. ATBias not only improves the knowledge editing capabilities of ICE but can also be widely applied to LLMs with negligible cost.
☆ RIGL: A Unified Reciprocal Approach for Tracing the Independent and Group Learning Processes KDD 2024
In the realm of education, both independent learning and group learning are esteemed as the most classic paradigms. The former allows learners to self-direct their studies, while the latter is typically characterized by teacher-directed scenarios. Recent studies in the field of intelligent education have leveraged deep temporal models to trace the learning process, capturing the dynamics of students' knowledge states, and have achieved remarkable performance. However, existing approaches have primarily focused on modeling the independent learning process, with the group learning paradigm receiving less attention. Moreover, the reciprocal effect between the two learning processes, especially their combined potential to foster holistic student development, remains inadequately explored. To this end, in this paper, we propose RIGL, a unified Reciprocal model to trace knowledge states at both the individual and group levels, drawing from the Independent and Group Learning processes. Specifically, we first introduce a time frame-aware reciprocal embedding module to concurrently model both student and group response interactions across various time frames. Subsequently, we employ reciprocal enhanced learning modeling to fully exploit the comprehensive and complementary information between the two behaviors. Furthermore, we design a relation-guided temporal attentive network, comprised of dynamic graph modeling coupled with a temporal self-attention mechanism. It is used to delve into the dynamic influence of individual and group interactions throughout the learning processes. Conclusively, we introduce a bias-aware contrastive learning module to bolster the stability of the model's training. Extensive experiments on four real-world educational datasets clearly demonstrate the effectiveness of the proposed RIGL model.
comment: Accepted by KDD 2024. 12 pages
☆ A Neural Column Generation Approach to the Vehicle Routing Problem with Two-Dimensional Loading and Last-In-First-Out Constraints IJCAI 2024
The vehicle routing problem with two-dimensional loading constraints (2L-CVRP) and the last-in-first-out (LIFO) rule presents significant practical and algorithmic challenges. While numerous heuristic approaches have been proposed to address its complexity, stemming from two NP-hard problems: the vehicle routing problem (VRP) and the two-dimensional bin packing problem (2D-BPP), less attention has been paid to developing exact algorithms. Bridging this gap, this article presents an exact algorithm that integrates advanced machine learning techniques, specifically a novel combination of attention and recurrence mechanisms. This integration accelerates the state-of-the-art exact algorithm by a median of 29.79% across various problem instances. Moreover, the proposed algorithm successfully resolves an open instance in the standard test-bed, demonstrating significant improvements brought about by the incorporation of machine learning models. Code is available at https://github.com/xyfffff/NCG-for-2L-CVRP.
comment: Accepted by International Joint Conference on Artificial Intelligence (IJCAI 2024)
☆ Insect Identification in the Wild: The AMI Dataset
Insects represent half of all global biodiversity, yet many of the world's insects are disappearing, with severe implications for ecosystems and agriculture. Despite this crisis, data on insect diversity and abundance remain woefully inadequate, due to the scarcity of human experts and the lack of scalable tools for monitoring. Ecologists have started to adopt camera traps to record and study insects, and have proposed computer vision algorithms as an answer for scalable data processing. However, insect monitoring in the wild poses unique challenges that have not yet been addressed within computer vision, including the combination of long-tailed data, extremely similar classes, and significant distribution shifts. We provide the first large-scale machine learning benchmarks for fine-grained insect recognition, designed to match real-world tasks faced by ecologists. Our contributions include a curated dataset of images from citizen science platforms and museums, and an expert-annotated dataset drawn from automated camera traps across multiple continents, designed to test out-of-distribution generalization under field conditions. We train and evaluate a variety of baseline algorithms and introduce a combination of data augmentation techniques that enhance generalization across geographies and hardware setups. Code and datasets are made publicly available.
☆ Retrieval-Augmented Generation for Generative Artificial Intelligence in Medicine
Generative artificial intelligence (AI) has brought revolutionary innovations in various fields, including medicine. However, it also exhibits limitations. In response, retrieval-augmented generation (RAG) provides a potential solution, enabling models to generate more accurate contents by leveraging the retrieval of external knowledge. With the rapid advancement of generative AI, RAG can pave the way for connecting this transformative technology with medical applications and is expected to bring innovations in equity, reliability, and personalization to health care.
☆ Abstraction-of-Thought Makes Language Models Better Reasoners
Abstract reasoning, the ability to reason from the abstract essence of a problem, serves as a key to generalization in human reasoning. However, eliciting language models to perform reasoning with abstraction remains unexplored. This paper seeks to bridge this gap by introducing a novel structured reasoning format called Abstraction-of-Thought (AoT). The uniqueness of AoT lies in its explicit requirement for varying levels of abstraction within the reasoning process. This approach could elicit language models to first contemplate on the abstract level before incorporating concrete details, which is overlooked by the prevailing step-by-step Chain-of-Thought (CoT) method. To align models with the AoT format, we present AoT Collection, a generic finetuning dataset consisting of 348k high-quality samples with AoT reasoning processes, collected via an automated and scalable pipeline. We finetune a wide range of language models with AoT Collection and conduct extensive evaluations on 23 unseen tasks from the challenging benchmark Big-Bench Hard. Experimental results indicate that models aligned to AoT reasoning format substantially outperform those aligned to CoT in many reasoning tasks.
comment: Work in Process
☆ Cycle-Correspondence Loss: Learning Dense View-Invariant Visual Features from Unlabeled and Unordered RGB Images
Robot manipulation relying on learned object-centric descriptors became popular in recent years. Visual descriptors can easily describe manipulation task objectives, they can be learned efficiently using self-supervision, and they can encode actuated and even non-rigid objects. However, learning robust, view-invariant keypoints in a self-supervised approach requires a meticulous data collection approach involving precise calibration and expert supervision. In this paper we introduce Cycle-Correspondence Loss (CCL) for view-invariant dense descriptor learning, which adopts the concept of cycle-consistency, enabling a simple data collection pipeline and training on unpaired RGB camera views. The key idea is to autonomously detect valid pixel correspondences by attempting to use a prediction over a new image to predict the original pixel in the original image, while scaling error terms based on the estimated confidence. Our evaluation shows that we outperform other self-supervised RGB-only methods, and approach performance of supervised methods, both with respect to keypoint tracking as well as for a robot grasping downstream task.
comment: 8 pages, 4 figures, 2 tables
☆ Federated Learning with Limited Node Labels
Subgraph federated learning (SFL) is a research methodology that has gained significant attention for its potential to handle distributed graph-structured data. In SFL, the local model comprises graph neural networks (GNNs) with a partial graph structure. However, some SFL models have overlooked the significance of missing cross-subgraph edges, which can lead to local GNNs being unable to message-pass global representations to other parties' GNNs. Moreover, existing SFL models require substantial labeled data, which limits their practical applications. To overcome these limitations, we present a novel SFL framework called FedMpa that aims to learn cross-subgraph node representations. FedMpa first trains a multilayer perceptron (MLP) model using a small amount of data and then propagates the federated feature to the local structures. To further improve the embedding representation of nodes with local subgraphs, we introduce the FedMpae method, which reconstructs the local graph structure with an innovation view that applies pooling operation to form super-nodes. Our extensive experiments on six graph datasets demonstrate that FedMpa is highly effective in node classification. Furthermore, our ablation experiments verify the effectiveness of FedMpa.
☆ Exploring Sensing Devices for Heart and Lung Sound Monitoring
This paper presents a comprehensive review of cardiorespiratory auscultation sensing devices which is useful for understanding the theoretical aspects of sensing devices, as well as practical notes to design novel sensing devices. One of the methods to design a stethoscope is using electret condenser microphones (ECM). In this paper, we first introduce the acoustic properties of the heart and lungs, as well as a brief history of stethoscope evolution. Then, we discuss the basic concept of ECM sensors and a recent stethoscope based on this technology. In response to the limitations of ECM-based systems, we explore the potential of microelectromechanical systems (MEMS), particularly focusing on piezoelectric transducer (PZT) sensors. This paper comprehensively reviews sensing technologies, emphasizing innovative MEMS-based designs for wearable cardiopulmonary auscultation in the past decade. To our knowledge, this is the first paper to summarize ECM and MEMS applications for heart and lung sound analysis. Keywords: Micro-electro-mechanical Systems (MEMS); Electret Condenser Microphone (ECM); Wearable Sensing Devices; Cardiorespiratory Auscultation; Phonocardiography (PCG); Heart Sound; Lung Sound
☆ PlanRAG: A Plan-then-Retrieval Augmented Generation for Generative Large Language Models as Decision Makers NAACL 2024
In this paper, we conduct a study to utilize LLMs as a solution for decision making that requires complex data analysis. We define Decision QA as the task of answering the best decision, $d_{best}$, for a decision-making question $Q$, business rules $R$ and a database $D$. Since there is no benchmark that can examine Decision QA, we propose Decision QA benchmark, DQA. It has two scenarios, Locating and Building, constructed from two video games (Europa Universalis IV and Victoria 3) that have almost the same goal as Decision QA. To address Decision QA effectively, we also propose a new RAG technique called the iterative plan-then-retrieval augmented generation (PlanRAG). Our PlanRAG-based LM generates the plan for decision making as the first step, and the retriever generates the queries for data analysis as the second step. The proposed method outperforms the state-of-the-art iterative RAG method by 15.8% in the Locating scenario and by 7.4% in the Building scenario, respectively. We release our code and benchmark at https://github.com/myeon9h/PlanRAG.
comment: NAACL 2024
☆ Adaptive Selection for Homogeneous Tools: An Instantiation in the RAG Scenario
Current research on tool learning primarily focuses on selecting the most effective tool from a wide array of options, often overlooking cost-effectiveness, a crucial factor in human problem-solving. In this paper, we address the selection of homogeneous tools by predicting both their performance and the associated cost required to accomplish a given task. We then assign queries to the optimal tools in a cost-effective manner. Our experimental results demonstrate that our method achieves higher performance at a lower cost compared to strong baseline approaches.
☆ PSLM: Parallel Generation of Text and Speech with LLMs for Low-Latency Spoken Dialogue Systems
Multimodal language models that process both text and speech have a potential for applications in spoken dialogue systems. However, current models face two major challenges in response generation latency: (1) generating a spoken response requires the prior generation of a written response, and (2) speech sequences are significantly longer than text sequences. This study addresses these issues by extending the input and output sequences of the language model to support the parallel generation of text and speech. Our experiments on spoken question answering tasks demonstrate that our approach improves latency while maintaining the quality of response content. Additionally, we show that latency can be further reduced by generating speech in multiple sequences. Demo samples are available at https://rinnakk.github.io/research/publications/PSLM.
comment: 8 pages, 4 figures, 4 tables, demo samples: https://rinnakk.github.io/research/publications/PSLM
☆ Beyond Under-Alignment: Atomic Preference Enhanced Factuality Tuning for Large Language Models
Large language models (LLMs) have achieved remarkable success but still tend to generate factually erroneous responses, a phenomenon known as hallucination. A recent trend is to use preference learning to fine-tune models to align with factuality. However, existing work primarily evaluates fine-tuned models on in-domain (ID) datasets and the factuality on out-of-domain (OOD) datasets remains underexplored. In this paper, we conduct a comprehensive evaluation of the factuality of different models tuned by various preference learning algorithms and demonstrate that their performance on OOD datasets either increases minimally or decreases. Subsequently, we reveal that the main cause of model's failure to uphold factuality under a distribution shift is \textbf{under-alignment}, rather than \textbf{over-alignment}, by analyzing the token distribution shift of the models before and after tuning. Finally, we propose \textbf{APEFT} (\textbf{A}tomic \textbf{P}reference \textbf{E}nhanced \textbf{F}actuality \textbf{T}uning), a framework that enhances model's awareness of factuality at the granularity of individual facts. Extensive experiments demonstrate that APEFT improves model performance by an average of $\boldsymbol{3.45\%}$ on both ID and OOD datasets, which is highly effective.
☆ Pushing the Frontier on Approximate EFX Allocations
We study the problem of allocating a set of indivisible goods to a set of agents with additive valuation functions, aiming to achieve approximate envy-freeness up to any good ($\alpha$-EFX). The state-of-the-art results on the problem include that (exact) EFX allocations exist when (a) there are at most three agents, or (b) the agents' valuation functions can take at most two values, or (c) the agents' valuation functions can be represented via a graph. For $\alpha$-EFX, it is known that a $0.618$-EFX allocation exists for any number of agents with additive valuation functions. In this paper, we show that $2/3$-EFX allocations exist when (a) there are at most \emph{seven agents}, (b) the agents' valuation functions can take at most \emph{three values}, or (c) the agents' valuation functions can be represented via a \emph{multigraph}. Our results can be interpreted in two ways. First, by relaxing the notion of EFX to $2/3$-EFX, we obtain existence results for strict generalizations of the settings for which exact EFX allocations are known to exist. Secondly, by imposing restrictions on the setting, we manage to beat the barrier of $0.618$ and achieve an approximation guarantee of $2/3$. Therefore, our results push the \emph{frontier} of existence and computation of approximate EFX allocations, and provide insights into the challenges of settling the existence of exact EFX allocations.
comment: The conference version of this work has been accepted to the Twenty-Fifth ACM Conference on Economics and Computation (EC 2024)
☆ A Novel Algorithm for Community Detection in Networks using Rough Sets and Consensus Clustering
Complex networks, such as those in social, biological, and technological systems, often present challenges to the task of community detection. Our research introduces a novel rough clustering based consensus community framework (RC-CCD) for effective structure identification of network communities. The RC-CCD method employs rough set theory to handle uncertainties within data and utilizes a consensus clustering approach to aggregate multiple clustering results, enhancing the reliability and accuracy of community detection. This integration allows the RC-CCD to effectively manage overlapping communities, which are often present in complex networks. This approach excels at detecting overlapping communities, offering a detailed and accurate representation of network structures. Comprehensive testing on benchmark networks generated by the Lancichinetti-Fortunato-Radicchi method showcased the strength and adaptability of the new proposal to varying node degrees and community sizes. Cross-comparisons of RC-CCD versus other well known detection algorithms outcomes highlighted its stability and adaptability.
☆ Fast Rates for Bandit PAC Multiclass Classification
We study multiclass PAC learning with bandit feedback, where inputs are classified into one of $K$ possible labels and feedback is limited to whether or not the predicted labels are correct. Our main contribution is in designing a novel learning algorithm for the agnostic $(\varepsilon,\delta)$-PAC version of the problem, with sample complexity of $O\big( (\operatorname{poly}(K) + 1 / \varepsilon^2) \log (|H| / \delta) \big)$ for any finite hypothesis class $H$. In terms of the leading dependence on $\varepsilon$, this improves upon existing bounds for the problem, that are of the form $O(K/\varepsilon^2)$. We also provide an extension of this result to general classes and establish similar sample complexity bounds in which $\log |H|$ is replaced by the Natarajan dimension. This matches the optimal rate in the full-information version of the problem and resolves an open question studied by Daniely, Sabato, Ben-David, and Shalev-Shwartz (2011) who demonstrated that the multiplicative price of bandit feedback in realizable PAC learning is $\Theta(K)$. We complement this by revealing a stark contrast with the agnostic case, where the price of bandit feedback is only $O(1)$ as $\varepsilon \to 0$. Our algorithm utilizes a stochastic optimization technique to minimize a log-barrier potential based on Frank-Wolfe updates for computing a low-variance exploration distribution over the hypotheses, and is made computationally efficient provided access to an ERM oracle over $H$.
☆ PDSS: A Privacy-Preserving Framework for Step-by-Step Distillation of Large Language Models
In the context of real-world applications, leveraging large language models (LLMs) for domain-specific tasks often faces two major challenges: domain-specific knowledge privacy and constrained resources. To address these issues, we propose PDSS, a privacy-preserving framework for step-by-step distillation of LLMs. PDSS works on a server-client architecture, wherein client transmits perturbed prompts to the server's LLM for rationale generation. The generated rationales are then decoded by the client and used to enrich the training of task-specific small language model(SLM) within a multi-task learning paradigm. PDSS introduces two privacy protection strategies: the Exponential Mechanism Strategy and the Encoder-Decoder Strategy, balancing prompt privacy and rationale usability. Experiments demonstrate the effectiveness of PDSS in various text generation tasks, enabling the training of task-specific SLM with enhanced performance while prioritizing data privacy protection.
☆ A Cutting-Edge Deep Learning Method For Enhancing IoT Security
There have been significant issues given the IoT, with heterogeneity of billions of devices and with a large amount of data. This paper proposed an innovative design of the Internet of Things (IoT) Environment Intrusion Detection System (or IDS) using Deep Learning-integrated Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks. Our model, based on the CICIDS2017 dataset, achieved an accuracy of 99.52% in classifying network traffic as either benign or malicious. The real-time processing capability, scalability, and low false alarm rate in our model surpass some traditional IDS approaches and, therefore, prove successful for application in today's IoT networks. The development and the performance of the model, with possible applications that may extend to other related fields of adaptive learning techniques and cross-domain applicability, are discussed. The research involving deep learning for IoT cybersecurity offers a potent solution for significantly improving network security.
☆ QueerBench: Quantifying Discrimination in Language Models Toward Queer Identities
With the increasing role of Natural Language Processing (NLP) in various applications, challenges concerning bias and stereotype perpetuation are accentuated, which often leads to hate speech and harm. Despite existing studies on sexism and misogyny, issues like homophobia and transphobia remain underexplored and often adopt binary perspectives, putting the safety of LGBTQIA+ individuals at high risk in online spaces. In this paper, we assess the potential harm caused by sentence completions generated by English large language models (LLMs) concerning LGBTQIA+ individuals. This is achieved using QueerBench, our new assessment framework, which employs a template-based approach and a Masked Language Modeling (MLM) task. The analysis indicates that large language models tend to exhibit discriminatory behaviour more frequently towards individuals within the LGBTQIA+ community, reaching a difference gap of 7.2% in the QueerBench score of harmfulness.
☆ GW-MoE: Resolving Uncertainty in MoE Router with Global Workspace Theory
Mixture-of-Experts (MoE) has been demonstrated as an efficient method to scale up models. By dynamically and sparsely selecting activated experts, MoE can effectively reduce computational costs. Despite the success, we observe that many tokens in the MoE models have uncertain routing results. These tokens have nearly equal scores for choosing each expert, and we demonstrate that this uncertainty can lead to incorrect selections. Inspired by the Global Workspace Theory (GWT), we propose a new fine-tuning method, GW-MoE, to address this issue. The core idea is to broadcast the uncertain tokens across experts during fine-tuning. Therefore, these tokens can acquire the necessary knowledge from any expert during inference and become less sensitive to the choice. GW-MoE does not introduce additional inference overhead. We validate that GW can mitigate the uncertain problem and consistently improve in different tasks (text classification, question answering, summarization, code generation, and mathematical problem solving) and model sizes (650M and 8B parameters).
☆ Problem-Solving in Language Model Networks
To improve the reasoning and question-answering capabilities of Large Language Models (LLMs), several multi-agent approaches have been introduced. While these methods enhance performance, the application of collective intelligence-based approaches to complex network structures and the dynamics of agent interactions remain underexplored. This work extends the concept of multi-agent debate to more general network topologies, measuring the question-answering accuracy, influence, consensus, and the effects of bias on the collective. The results show that random networks perform similarly to fully connected networks despite using significantly fewer tokens. Furthermore, a strong consensus among agents in correlates with correct answers, whereas divided responses typically indicate incorrect answers. Analysing the influence of the agents reveals a balance between self-reflection and interconnectedness; self-reflection aids when local interactions are incorrect, and local interactions aid when the agent itself is incorrect. Additionally, bias plays a strong role in system performance with correctly biased hub nodes boosting performance. These insights suggest that using random networks or scale-free networks with knowledgeable agents placed in central positions can enhance the overall performance of multi-agent systems.
comment: 8 pages, 2024 Conference on Artificial Life
☆ WebCanvas: Benchmarking Web Agents in Online Environments
For web agents to be practically useful, they must adapt to the continuously evolving web environment characterized by frequent updates to user interfaces and content. However, most existing benchmarks only capture the static aspects of the web. To bridge this gap, we introduce WebCanvas, an innovative online evaluation framework for web agents that effectively addresses the dynamic nature of web interactions. WebCanvas contains three main components to facilitate realistic assessments: (1) A novel evaluation metric which reliably capture critical intermediate actions or states necessary for task completions while disregarding noise caused by insignificant events or changed web-elements. (2) A benchmark dataset called Mind2Web-Live, a refined version of original Mind2Web static dataset containing 542 tasks with 2439 intermediate evaluation states; (3) Lightweight and generalizable annotation tools and testing pipelines that enables the community to collect and maintain the high-quality, up-to-date dataset. Building on WebCanvas, we open-source an agent framework with extensible modules for reasoning, providing a foundation for the community to conduct online inference and evaluations. Our best-performing agent achieves a task success rate of 23.1% and a task completion rate of 48.8% on the Mind2Web-Live test set. Additionally, we analyze the performance discrepancies across various websites, domains, and experimental environments. We encourage the community to contribute further insights on online agent evaluation, thereby advancing this field of research.
comment: Our platform, tool and dataset are publically available at https://www.imean.ai/web-canvas/ and https://huggingface.co/datasets/iMeanAI/Mind2Web-Live/
☆ UAV-based Intelligent Information Systems on Winter Road Safety for Autonomous Vehicles
As autonomous vehicles continue to revolutionize transportation, addressing challenges posed by adverse weather conditions, particularly during winter, becomes paramount for ensuring safe and efficient operations. One of the most important aspects of a road safety inspection during adverse weather is when a limited lane width can reduce the capacity of the road and raise the risk of serious accidents involving autonomous vehicles. In this research, a method for improving driving challenges on roads in winter conditions, with a model that segments and estimates the width of the road from the perspectives of Uncrewed aerial vehicles and autonomous vehicles. The proposed approach in this article is needed to empower self-driving cars with up-to-date and accurate insights, enhancing their adaptability and decision-making capabilities in winter landscapes.
☆ Certified ML Object Detection for Surveillance Missions
In this paper, we present a development process of a drone detection system involving a machine learning object detection component. The purpose is to reach acceptable performance objectives and provide sufficient evidences, required by the recommendations (soon to be published) of the ED 324 / ARP 6983 standard, to gain confidence in the dependability of the designed system.
☆ Memory Sequence Length of Data Sampling Impacts the Adaptation of Meta-Reinforcement Learning Agents
Fast adaptation to new tasks is extremely important for embodied agents in the real world. Meta-reinforcement learning (meta-RL) has emerged as an effective method to enable fast adaptation in unknown environments. Compared to on-policy meta-RL algorithms, off-policy algorithms rely heavily on efficient data sampling strategies to extract and represent the historical trajectories. However, little is known about how different data sampling methods impact the ability of meta-RL agents to represent unknown environments. Here, we investigate the impact of data sampling strategies on the exploration and adaptability of meta-RL agents. Specifically, we conducted experiments with two types of off-policy meta-RL algorithms based on Thompson sampling and Bayes-optimality theories in continuous control tasks within the MuJoCo environment and sparse reward navigation tasks. Our analysis revealed the long-memory and short-memory sequence sampling strategies affect the representation and adaptive capabilities of meta-RL agents. We found that the algorithm based on Bayes-optimality theory exhibited more robust and better adaptability than the algorithm based on Thompson sampling, highlighting the importance of appropriate data sampling strategies for the agent's representation of an unknown environment, especially in the case of sparse rewards.
☆ Navigating Knowledge Management Implementation Success in Government Organizations: A type-2 fuzzy approach
Optimal information and knowledge management is crucial for organizations to achieve their objectives efficiently. As a rare and valuable resource, effective knowledge management provides a strategic advantage and has become a key determinant of organizational success. The study aims to identify critical success and failure factors for implementing knowledge management systems in government organizations. This research employs a descriptive survey methodology, collecting data through random interviews and questionnaires. The study highlights the critical success factors for knowledge management systems in government organizations, including cooperation, an open atmosphere, staff training, creativity and innovation, removal of organizational constraints, reward policies, role modeling, and focus. Conversely, failure to consider formality, staff participation, collaboration technologies, network and hardware infrastructure, complexity, IT staff, and trust can pose significant obstacles to successful implementation.
☆ Retrieval Meets Reasoning: Dynamic In-Context Editing for Long-Text Understanding
Current Large Language Models (LLMs) face inherent limitations due to their pre-defined context lengths, which impede their capacity for multi-hop reasoning within extensive textual contexts. While existing techniques like Retrieval-Augmented Generation (RAG) have attempted to bridge this gap by sourcing external information, they fall short when direct answers are not readily available. We introduce a novel approach that re-imagines information retrieval through dynamic in-context editing, inspired by recent breakthroughs in knowledge editing. By treating lengthy contexts as malleable external knowledge, our method interactively gathers and integrates relevant information, thereby enabling LLMs to perform sophisticated reasoning steps. Experimental results demonstrate that our method effectively empowers context-limited LLMs, such as Llama2, to engage in multi-hop reasoning with improved performance, which outperforms state-of-the-art context window extrapolation methods and even compares favorably to more advanced commercial long-context models. Our interactive method not only enhances reasoning capabilities but also mitigates the associated training and computational costs, making it a pragmatic solution for enhancing LLMs' reasoning within expansive contexts.
☆ Toward Exploring the Code Understanding Capabilities of Pre-trained Code Generation Models
Recently, large code generation models trained in a self-supervised manner on extensive unlabeled programming language data have achieved remarkable success. While these models acquire vast amounts of code knowledge, they perform poorly on code understanding tasks, such as code search and clone detection, as they are specifically trained for generation. Pre-training a larger encoder-only architecture model from scratch on massive code data can improve understanding performance. However, this approach is costly and time-consuming, making it suboptimal. In this paper, we pioneer the transfer of knowledge from pre-trained code generation models to code understanding tasks, significantly reducing training costs. We examine effective strategies for enabling decoder-only models to acquire robust code representations. Furthermore, we introduce CL4D, a contrastive learning method designed to enhance the representation capabilities of decoder-only models. Comprehensive experiments demonstrate that our approach achieves state-of-the-art performance in understanding tasks such as code search and clone detection. Our analysis shows that our method effectively reduces the distance between semantically identical samples in the representation space. These findings suggest the potential for unifying code understanding and generation tasks using a decoder-only structured model.
comment: 8 pages, 4 figures
☆ Automatic benchmarking of large multimodal models via iterative experiment programming
Assessing the capabilities of large multimodal models (LMMs) often requires the creation of ad-hoc evaluations. Currently, building new benchmarks requires tremendous amounts of manual work for each specific analysis. This makes the evaluation process tedious and costly. In this paper, we present APEx, Automatic Programming of Experiments, the first framework for automatic benchmarking of LMMs. Given a research question expressed in natural language, APEx leverages a large language model (LLM) and a library of pre-specified tools to generate a set of experiments for the model at hand, and progressively compile a scientific report. The report drives the testing procedure: based on the current status of the investigation, APEx chooses which experiments to perform and whether the results are sufficient to draw conclusions. Finally, the LLM refines the report, presenting the results to the user in natural language. Thanks to its modularity, our framework is flexible and extensible as new tools become available. Empirically, APEx reproduces the findings of existing studies while allowing for arbitrary analyses and hypothesis testing.
comment: 31 pages, 6 figures, code is available at https://github.com/altndrr/apex
☆ Enhancing Visible-Infrared Person Re-identification with Modality- and Instance-aware Visual Prompt Learning ICMR'24
The Visible-Infrared Person Re-identification (VI ReID) aims to match visible and infrared images of the same pedestrians across non-overlapped camera views. These two input modalities contain both invariant information, such as shape, and modality-specific details, such as color. An ideal model should utilize valuable information from both modalities during training for enhanced representational capability. However, the gap caused by modality-specific information poses substantial challenges for the VI ReID model to handle distinct modality inputs simultaneously. To address this, we introduce the Modality-aware and Instance-aware Visual Prompts (MIP) network in our work, designed to effectively utilize both invariant and specific information for identification. Specifically, our MIP model is built on the transformer architecture. In this model, we have designed a series of modality-specific prompts, which could enable our model to adapt to and make use of the specific information inherent in different modality inputs, thereby reducing the interference caused by the modality gap and achieving better identification. Besides, we also employ each pedestrian feature to construct a group of instance-specific prompts. These customized prompts are responsible for guiding our model to adapt to each pedestrian instance dynamically, thereby capturing identity-level discriminative clues for identification. Through extensive experiments on SYSU-MM01 and RegDB datasets, the effectiveness of both our designed modules is evaluated. Additionally, our proposed MIP performs better than most state-of-the-art methods.
comment: Accepyed by ACM International Conference on Multimedia Retrieval (ICMR'24)
☆ PruningBench: A Comprehensive Benchmark of Structural Pruning NeurIPS 2024
Structural pruning has emerged as a promising approach for producing more efficient models. Nevertheless, the community suffers from a lack of standardized benchmarks and metrics, leaving the progress in this area not fully comprehended. To fill this gap, we present the first comprehensive benchmark, termed \textit{PruningBench}, for structural pruning. PruningBench showcases the following three characteristics: 1) PruningBench employs a unified and consistent framework for evaluating the effectiveness of diverse structural pruning techniques; 2) PruningBench systematically evaluates 16 existing pruning methods, encompassing a wide array of models (e.g., CNNs and ViTs) and tasks (e.g., classification and detection); 3) PruningBench provides easily implementable interfaces to facilitate the implementation of future pruning methods, and enables the subsequent researchers to incorporate their work into our leaderboards. We provide an online pruning platform http://pruning.vipazoo.cn for customizing pruning tasks and reproducing all results in this paper. Codes will be made publicly available.
comment: Submitted to NeurIPS 2024 Datasets and Benchmarks Track
☆ Research on Dangerous Flight Weather Prediction based on Machine Learning
With the continuous expansion of the scale of air transport, the demand for aviation meteorological support also continues to grow. The impact of hazardous weather on flight safety is critical. How to effectively use meteorological data to improve the early warning capability of flight dangerous weather and ensure the safe flight of aircraft is the primary task of aviation meteorological services. In this work, support vector machine (SVM) models are used to predict hazardous flight weather, especially for meteorological conditions with high uncertainty such as storms and turbulence. SVM is a supervised learning method that distinguishes between different classes of data by finding optimal decision boundaries in a high-dimensional space. In order to meet the needs of this study, we chose the radial basis function (RBF) as the kernel function, which helps to deal with nonlinear problems and enables the model to better capture complex meteorological data structures. During the model training phase, we used historical meteorological observations from multiple weather stations, including temperature, humidity, wind speed, wind direction, and other meteorological indicators closely related to flight safety. From this data, the SVM model learns how to distinguish between normal and dangerous flight weather conditions.
☆ Faithful Density-Peaks Clustering via Matrix Computations on MPI Parallelization System
Density peaks clustering (DP) has the ability of detecting clusters of arbitrary shape and clustering non-Euclidean space data, but its quadratic complexity in both computing and storage makes it difficult to scale for big data. Various approaches have been proposed in this regard, including MapReduce based distribution computing, multi-core parallelism, presentation transformation (e.g., kd-tree, Z-value), granular computing, and so forth. However, most of these existing methods face two limitations. One is their target datasets are mostly constrained to be in Euclidian space, the other is they emphasize only on local neighbors while ignoring global data distribution due to restriction to cut-off kernel when computing density. To address the two issues, we present a faithful and parallel DP method that makes use of two types of vector-like distance matrices and an inverse leading-node-finding policy. The method is implemented on a message passing interface (MPI) system. Extensive experiments showed that our method is capable of clustering non-Euclidean data such as in community detection, while outperforming the state-of-the-art counterpart methods in accuracy when clustering large Euclidean data. Our code is publicly available at https://github.com/alanxuji/FaithPDP.
comment: This paper presents a novel approach FaithPDP that takes advantages of both hardware (multi-core architecture of CPU) and modern programming language (Python or Matlab for efficient vector and matrix computation) to achieve clustering result identical to vanilla DP algorithm, while the computing complexity is reduced to pseudo-linear
☆ Generative Artificial Intelligence-Guided User Studies: An Application for Air Taxi Services
User studies are crucial for meeting user needs. In user studies, real experimental scenarios and participants are constructed and recruited. However, emerging and unfamiliar studies face limitations, including safety concerns and iterative efficiency. To address these challenges, this study utilizes a large language model (LLM) to create generative AI virtual scenarios for user experience. By recruiting real users to evaluate this experience, we can collect feedback that enables rapid iteration in the early design phase. The air taxi is particularly representative of these challenges and has been chosen as the case study for this research. The key contribution was designing a virtual ATJ using OpenAI's GPT-4 model and AI image and video generators. Based on the LLM-generated scripts, key visuals were created for the air taxi, and the ATJ was evaluated by 72 participants. Furthermore, the LLM demonstrated the ability to identify and suggest environments that significantly improve participants' attitudes toward air taxis. Education level and gender significantly influenced participants' attitudes and their satisfaction with the ATJ. Our study confirms the capability of generative AI to support user studies, providing a feasible approach and valuable insights for designing air taxi user experiences in the early design phase.
comment: 39 pages, 6 main figures, 10 appendix figures
☆ JEN-1 DreamStyler: Customized Musical Concept Learning via Pivotal Parameters Tuning
Large models for text-to-music generation have achieved significant progress, facilitating the creation of high-quality and varied musical compositions from provided text prompts. However, input text prompts may not precisely capture user requirements, particularly when the objective is to generate music that embodies a specific concept derived from a designated reference collection. In this paper, we propose a novel method for customized text-to-music generation, which can capture the concept from a two-minute reference music and generate a new piece of music conforming to the concept. We achieve this by fine-tuning a pretrained text-to-music model using the reference music. However, directly fine-tuning all parameters leads to overfitting issues. To address this problem, we propose a Pivotal Parameters Tuning method that enables the model to assimilate the new concept while preserving its original generative capabilities. Additionally, we identify a potential concept conflict when introducing multiple concepts into the pretrained model. We present a concept enhancement strategy to distinguish multiple concepts, enabling the fine-tuned model to generate music incorporating either individual or multiple concepts simultaneously. Since we are the first to work on the customized music generation task, we also introduce a new dataset and evaluation protocol for the new task. Our proposed Jen1-DreamStyler outperforms several baselines in both qualitative and quantitative evaluations. Demos will be available at https://www.jenmusic.ai/research#DreamStyler.
☆ An Investigation of Neuron Activation as a Unified Lens to Explain Chain-of-Thought Eliciting Arithmetic Reasoning of LLMs ACL 2024
Large language models (LLMs) have shown strong arithmetic reasoning capabilities when prompted with Chain-of-Thought (CoT) prompts. However, we have only a limited understanding of how they are processed by LLMs. To demystify it, prior work has primarily focused on ablating different components in the CoT prompt and empirically observing their resulting LLM performance change. Yet, the reason why these components are important to LLM reasoning is not explored. To fill this gap, in this work, we investigate ``neuron activation'' as a lens to provide a unified explanation to observations made by prior work. Specifically, we look into neurons within the feed-forward layers of LLMs that may have activated their arithmetic reasoning capabilities, using Llama2 as an example. To facilitate this investigation, we also propose an approach based on GPT-4 to automatically identify neurons that imply arithmetic reasoning. Our analyses revealed that the activation of reasoning neurons in the feed-forward layers of an LLM can explain the importance of various components in a CoT prompt, and future research can extend it for a more complete understanding.
comment: 9 pages, 1 figure, to be published in ACL 2024
♻ ☆ OccamLLM: Fast and Exact Language Model Arithmetic in a Single Step
Despite significant advancements in text generation and reasoning, Large Language Models (LLMs) still face challenges in accurately performing complex arithmetic operations. To achieve accurate calculations, language model systems often enable LLMs to generate code for arithmetic operations. However, this approach compromises speed and security and, if finetuning is involved, risks the language model losing prior capabilities. We propose a framework that enables exact arithmetic in \textit{a single autoregressive step}, providing faster, more secure, and more interpretable LLM systems with arithmetic capabilities. We use the hidden states of an LLM to control a symbolic architecture which performs arithmetic. Our implementation using Llama 3 8B Instruct with OccamNet as a symbolic model (OccamLlama) achieves 100\% accuracy on single arithmetic operations ($+,-,\times,\div,\sin{},\cos{},\log{},\exp{},\sqrt{}$), outperforming GPT 4o and on par with GPT 4o using a code interpreter. OccamLlama also outperforms GPT 4o both with and without a code interpreter on mathematical problem solving benchmarks involving challenging arithmetic, thus enabling small LLMs to match the arithmetic performance of even much larger models. We will make our code public shortly.
♻ ☆ Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks
We show that even the most recent safety-aligned LLMs are not robust to simple adaptive jailbreaking attacks. First, we demonstrate how to successfully leverage access to logprobs for jailbreaking: we initially design an adversarial prompt template (sometimes adapted to the target LLM), and then we apply random search on a suffix to maximize a target logprob (e.g., of the token ``Sure''), potentially with multiple restarts. In this way, we achieve nearly 100% attack success rate -- according to GPT-4 as a judge -- on Vicuna-13B, Mistral-7B, Phi-3-Mini, Nemotron-4-340B, Llama-2-Chat-7B/13B/70B, Llama-3-Instruct-8B, Gemma-7B, GPT-3.5, GPT-4, and R2D2 from HarmBench that was adversarially trained against the GCG attack. We also show how to jailbreak all Claude models -- that do not expose logprobs -- via either a transfer or prefilling attack with a 100% success rate. In addition, we show how to use random search on a restricted set of tokens for finding trojan strings in poisoned models -- a task that shares many similarities with jailbreaking -- which is the algorithm that brought us the first place in the SaTML'24 Trojan Detection Competition. The common theme behind these attacks is that adaptivity is crucial: different models are vulnerable to different prompting templates (e.g., R2D2 is very sensitive to in-context learning prompts), some models have unique vulnerabilities based on their APIs (e.g., prefilling for Claude), and in some settings, it is crucial to restrict the token search space based on prior knowledge (e.g., for trojan detection). For reproducibility purposes, we provide the code, logs, and jailbreak artifacts in the JailbreakBench format at https://github.com/tml-epfl/llm-adaptive-attacks.
comment: Updates in the v2: more models (Llama3, Phi-3, Nemotron-4-340B), jailbreak artifacts for all attacks are available, evaluation of generalization to a different judge (Llama-3-70B and Llama Guard 2), more experiments (convergence plots over iterations, ablation on the suffix length for random search), improved exposition of the paper, examples of jailbroken generation
♻ ☆ Unleashing the potential of prompt engineering in Large Language Models: a comprehensive review
This paper delves into the pivotal role of prompt engineering in unleashing the capabilities of Large Language Models (LLMs). Prompt engineering is the process of structuring input text for LLMs and is a technique integral to optimizing the efficacy of LLMs. This survey elucidates foundational principles of prompt engineering, such as role-prompting, one-shot, and few-shot prompting, as well as more advanced methodologies such as the chain-of-thought and tree-of-thoughts prompting. The paper sheds light on how external assistance in the form of plugins can assist in this task, and reduce machine hallucination by retrieving external knowledge. We subsequently delineate prospective directions in prompt engineering research, emphasizing the need for a deeper understanding of structures and the role of agents in Artificial Intelligence-Generated Content (AIGC) tools. We discuss how to assess the efficacy of prompt methods from different perspectives and using different methods. Finally, we gather information about the application of prompt engineering in such fields as education and programming, showing its transformative potential. This comprehensive survey aims to serve as a friendly guide for anyone venturing through the big world of LLMs and prompt engineering.
♻ ☆ Reinforcement Learning from Multi-role Debates as Feedback for Bias Mitigation in LLMs
Bias in LLMs can harm user experience and societal outcomes. However, current bias mitigation methods often require intensive human feedback, lack transferability to other topics or yield overconfident and random outputs. We find that involving LLMs in role-playing scenario boosts their ability to recognize and mitigate biases. Based on this, we propose Reinforcement Learning from Multi-role Debates as Feedback (RLDF), a novel approach for bias mitigation replacing human feedback in traditional RLHF. We utilize LLMs in multi-role debates to create a dataset that includes both high-bias and low-bias instances for training the reward model in reinforcement learning. Our approach comprises two modes: (1) self-reflection, where the same LLM participates in multi-role debates, and (2) teacher-student, where a more advanced LLM like GPT-3.5-turbo guides the LLM to perform this task. Experimental results across different LLMs demonstrate the effectiveness of our approach in bias mitigation.
comment: The first three authors contributed equally to this work
♻ ☆ Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller
We propose Self-Control, a novel method utilizing suffix gradients to control the behavior of large language models (LLMs) without explicit human annotations. Given a guideline expressed in suffix string and the model's self-assessment of adherence, Self-Control computes the gradient of this self-judgment concerning the model's hidden states, directly influencing the auto-regressive generation process towards desired behaviors. To enhance efficiency, we introduce Self-Control_{prefix}, a compact module that encapsulates the learned representations from suffix gradients into a Prefix Controller, facilitating inference-time control for various LLM behaviors. Our experiments demonstrate Self-Control's efficacy across multiple domains, including emotional modulation, ensuring harmlessness, and enhancing complex reasoning. Especially, Self-Control_{prefix} enables a plug-and-play control and jointly controls multiple attributes, improving model outputs without altering model parameters or increasing inference-time costs.
comment: 41 pages, 12 figures, 41 tables; Website: https://llm-self-control.github.io/
♻ ☆ Capturing Knowledge Graphs and Rules with Octagon Embeddings IJCAI 2024
Region based knowledge graph embeddings represent relations as geometric regions. This has the advantage that the rules which are captured by the model are made explicit, making it straightforward to incorporate prior knowledge and to inspect learned models. Unfortunately, existing approaches are severely restricted in their ability to model relational composition, and hence also their ability to model rules, thus failing to deliver on the main promise of region based models. With the aim of addressing these limitations, we investigate regions which are composed of axis-aligned octagons. Such octagons are particularly easy to work with, as intersections and compositions can be straightforwardly computed, while they are still sufficiently expressive to model arbitrary knowledge graphs. Among others, we also show that our octagon embeddings can properly capture a non-trivial class of rule bases. Finally, we show that our model achieves competitive experimental results.
comment: Accepted for IJCAI 2024
♻ ☆ BLoB: Bayesian Low-Rank Adaptation by Backpropagation for Large Language Models
Large Language Models (LLMs) often suffer from overconfidence during inference, particularly when adapted to downstream domain-specific tasks with limited data. Previous work addresses this issue by employing approximate Bayesian estimation after the LLMs are trained, enabling them to quantify uncertainty. However, such post-training approaches' performance is severely limited by the parameters learned during training. In this paper, we go beyond post-training Bayesianization and propose Bayesian Low-Rank Adaptation by Backpropagation (BLoB), an algorithm that continuously and jointly adjusts both the mean and covariance of LLM parameters throughout the whole fine-tuning process. Our empirical results verify the effectiveness of BLoB in terms of generalization and uncertainty estimation, when evaluated on both in-distribution and out-of-distribution data.
comment: 27 pages, 3 figures, 9 tables; preprint, work in progress
♻ ☆ Evaluating the Data Model Robustness of Text-to-SQL Systems Based on Real User Queries
Text-to-SQL systems (also known as NL-to-SQL systems) have become an increasingly popular solution for bridging the gap between user capabilities and SQL-based data access. These systems translate user requests in natural language to valid SQL statements for a specific database. Recent Text-to-SQL systems have benefited from the rapid improvement of transformer-based language models. However, while Text-to-SQL systems that incorporate such models continuously reach new high scores on -- often synthetic -- benchmark datasets, a systematic exploration of their robustness towards different data models in a real-world, realistic scenario is notably missing. This paper provides the first in-depth evaluation of the data model robustness of Text-to-SQL systems in practice based on a multi-year international project focused on Text-to-SQL interfaces. Our evaluation is based on a real-world deployment of FootballDB, a system that was deployed over a 9 month period in the context of the FIFA World Cup 2022, during which about 6K natural language questions were asked and executed. All of our data is based on real user questions that were asked live to the system. We manually labeled and translated a subset of these questions for three different data models. For each data model, we explore the performance of representative Text-to-SQL systems and language models. We further quantify the impact of training data size, pre-, and post-processing steps as well as language model inference time. Our comprehensive evaluation sheds light on the design choices of real-world Text-to-SQL systems and their impact on moving from research prototypes to real deployments. Last, we provide a new benchmark dataset to the community, which is the first to enable the evaluation of different data models for the same dataset and is substantially more challenging than most previous datasets in terms of query complexity.
♻ ☆ Transformers Can Represent $n$-gram Language Models
Existing work has analyzed the representational capacity of the transformer architecture by means of formal models of computation. However, the focus so far has been on analyzing the architecture in terms of language \emph{acceptance}. We contend that this is an ill-suited problem in the study of \emph{language models} (LMs), which are definitionally \emph{probability distributions} over strings. In this paper, we focus on the relationship between transformer LMs and $n$-gram LMs, a simple and historically relevant class of language models. We show that transformer LMs using the hard or sparse attention mechanisms can exactly represent any $n$-gram LM, giving us a concrete lower bound on their probabilistic representational capacity. This provides a first step towards understanding the mechanisms that transformer LMs can use to represent probability distributions over strings.
♻ ☆ Connecting the Dots: Evaluating Abstract Reasoning Capabilities of LLMs Using the New York Times Connections Word Game
The New York Times Connections game has emerged as a popular and challenging pursuit for word puzzle enthusiasts. We collect 200 Connections games to evaluate the performance of state-of-the-art large language models (LLMs) against expert and novice human players. Our results show that even the best-performing LLM, GPT-4o, which has otherwise shown impressive reasoning abilities on a wide variety of benchmarks, can only fully solve 8% of the games. Compared to GPT-4o, novice and expert players perform better, with expert human players significantly outperforming GPT-4o. To deepen our understanding we create a taxonomy of the knowledge types required to successfully categorize words in the Connections game, revealing that LLMs struggle with associative, encyclopedic, and linguistic knowledge. Our findings establish the New York Times Connections game as a challenging benchmark for evaluating abstract reasoning capabilities in humans and AI systems.
♻ ☆ The Lie Derivative for Measuring Learned Equivariance ICLR 2023
Equivariance guarantees that a model's predictions capture key symmetries in data. When an image is translated or rotated, an equivariant model's representation of that image will translate or rotate accordingly. The success of convolutional neural networks has historically been tied to translation equivariance directly encoded in their architecture. The rising success of vision transformers, which have no explicit architectural bias towards equivariance, challenges this narrative and suggests that augmentations and training data might also play a significant role in their performance. In order to better understand the role of equivariance in recent vision models, we introduce the Lie derivative, a method for measuring equivariance with strong mathematical foundations and minimal hyperparameters. Using the Lie derivative, we study the equivariance properties of hundreds of pretrained models, spanning CNNs, transformers, and Mixer architectures. The scale of our analysis allows us to separate the impact of architecture from other factors like model size or training method. Surprisingly, we find that many violations of equivariance can be linked to spatial aliasing in ubiquitous network layers, such as pointwise non-linearities, and that as models get larger and more accurate they tend to display more equivariance, regardless of architecture. For example, transformers can be more equivariant than convolutional neural networks after training.
comment: ICLR 2023. Code available at: https://github.com/ngruver/lie-deriv
♻ ☆ WRDScore: New Metric for Evaluation of Natural Language Generation Models
The problem of natural language generation, and, more specifically, method name prediction, faces significant difficulties when proposed models need to be evaluated on test data. Such a metric would need to consider the versatility with which a single method can be named, with respect to both semantics and syntax. Measuring the direct overlap between the predicted and reference (true) sequences will not be able to capture these subtleties. Other existing embedding based metrics either do not measure precision and recall or impose strict unrealistic assumptions on both sequences. To address these issues, we propose a new metric that, on the one hand, is very simple and lightweight, and, on the other hand, is able to calculate precision and recall without resorting to any assumptions while obtaining good performance with respect to the human judgement.
♻ ☆ STG4Traffic: A Survey and Benchmark of Spatial-Temporal Graph Neural Networks for Traffic Prediction
Traffic prediction has been an active research topic in the domain of spatial-temporal data mining. Accurate real-time traffic prediction is essential to improve the safety, stability, and versatility of smart city systems, i.e., traffic control and optimal routing. The complex and highly dynamic spatial-temporal dependencies make effective predictions still face many challenges. Recent studies have shown that spatial-temporal graph neural networks exhibit great potential applied to traffic prediction, which combines sequential models with graph convolutional networks to jointly model temporal and spatial correlations. However, a survey study of graph learning, spatial-temporal graph models for traffic, as well as a fair comparison of baseline models are pending and unavoidable issues. In this paper, we first provide a systematic review of graph learning strategies and commonly used graph convolution algorithms. Then we conduct a comprehensive analysis of the strengths and weaknesses of recently proposed spatial-temporal graph network models. Furthermore, we build a study called STG4Traffic using the deep learning framework PyTorch to establish a standardized and scalable benchmark on two types of traffic datasets. We can evaluate their performance by personalizing the model settings with uniform metrics. Finally, we point out some problems in the current study and discuss future directions. Source codes are available at https://github.com/trainingl/STG4Traffic.
♻ ☆ User Centric Evaluation of Code Generation Tools
With the rapid advance of machine learning (ML) technology, large language models (LLMs) are increasingly explored as an intelligent tool to generate program code from natural language specifications. However, existing evaluations of LLMs have focused on their capabilities in comparison with humans. It is desirable to evaluate their usability when deciding on whether to use a LLM in software production. This paper proposes a user centric method for this purpose. It includes metadata in the test cases of a benchmark to describe their usages, conducts testing in a multi-attempt process that mimics the uses of LLMs, measures LLM generated solutions on a set of quality attributes that reflect usability, and evaluates the performance based on user experiences in the uses of LLMs as a tool. The paper also reports a case study with the method in the evaluation of ChatGPT's usability as a code generation tool for the R programming language. Our experiments demonstrated that ChatGPT is highly useful for generating R program code although it may fail on hard programming tasks. The user experiences are good with overall average number of attempts being 1.61 and the average time of completion being 47.02 seconds. Our experiments also found that the weakest aspect of usability is conciseness, which has a score of 3.80 out of 5.
comment: The paper is accepted by IEEE AITest 2024 at IEEE CISOSE 2024 Congress as an invited paper, and will appear in the AITest 2024 Conference Proceedings
♻ ☆ Mixing Artificial and Natural Intelligence: From Statistical Mechanics to AI and Back to Turbulence
The paper reflects on the future role of AI in scientific research, with a special focus on turbulence studies, and examines the evolution of AI, particularly through Diffusion Models rooted in non-equilibrium statistical mechanics. It underscores the significant impact of AI on advancing reduced, Lagrangian models of turbulence through innovative use of deep neural networks. Additionally, the paper reviews various other AI applications in turbulence research and outlines potential challenges and opportunities in the concurrent advancement of AI and statistical hydrodynamics. This discussion sets the stage for a future where AI and turbulence research are intricately intertwined, leading to more profound insights and advancements in both fields.
comment: 38 pages, 9 figures
♻ ☆ Assistive Large Language Model Agents for Socially-Aware Negotiation Dialogues EMNLP 2024
We develop assistive agents based on Large Language Models (LLMs) that aid interlocutors in business negotiations. Specifically, we simulate business negotiations by letting two LLM-based agents engage in role play. A third LLM acts as a remediator agent to rewrite utterances violating norms for improving negotiation outcomes. We introduce a simple tuning-free and label-free In-Context Learning (ICL) method to identify high-quality ICL exemplars for the remediator, where we propose a novel select criteria, called value impact, to measure the quality of the negotiation outcomes. We provide rich empirical evidence to demonstrate its effectiveness in negotiations across three different negotiation topics. The source code and the generated dataset will be publicly available upon acceptance.
comment: 25 pages, 3 figures, 13 tables; Under review in EMNLP 2024
♻ ☆ RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models
The advent of Large Language Models (LLMs) has paved the way for complex tasks such as role-playing, which enhances user interactions by enabling models to imitate various characters. However, the closed-source nature of state-of-the-art LLMs and their general-purpose training limit role-playing optimization. In this paper, we introduce RoleLLM, a framework to benchmark, elicit, and enhance role-playing abilities in LLMs. RoleLLM comprises four stages: (1) Role Profile Construction for 100 roles; (2) Context-Based Instruction Generation (Context-Instruct) for role-specific knowledge extraction; (3) Role Prompting using GPT (RoleGPT) for speaking style imitation; and (4) Role-Conditioned Instruction Tuning (RoCIT) for fine-tuning open-source models along with role customization. By Context-Instruct and RoleGPT, we create RoleBench, the first systematic and fine-grained character-level benchmark dataset for role-playing with 168,093 samples. Moreover, RoCIT on RoleBench yields RoleLLaMA (English) and RoleGLM (Chinese), significantly enhancing role-playing abilities and even achieving comparable results with RoleGPT (using GPT-4).
comment: 30 pages, repo at https://github.com/InteractiveNLP-Team/RoleLLM-public
♻ ☆ ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code
Despite Large Language Models (LLMs) like GPT-4 achieving impressive results in function-level code generation, they struggle with repository-scale code understanding (e.g., coming up with the right arguments for calling routines), requiring a deeper comprehension of complex file interactions. Also, recently, people have developed LLM agents that attempt to interact with repository code (e.g., compiling and evaluating its execution), prompting the need to evaluate their performance. These gaps have motivated our development of ML-Bench, a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks. Addressing the need for LLMs to interpret long code contexts and translate instructions into precise, executable scripts, ML-Bench encompasses annotated 9,641 examples across 18 GitHub repositories, challenging LLMs to accommodate user-specified arguments and documentation intricacies effectively. To evaluate both LLMs and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment. Our findings indicate that while GPT-4o leads with a Pass@5 rate surpassing 50%, there remains significant scope for improvement, highlighted by issues such as hallucinated outputs and difficulties with bash script generation. Notably, in the more demanding ML-Agent-Bench, GPT-4o achieves a 76.47% success rate, reflecting the efficacy of iterative action and feedback in complex task resolution. Our code, dataset, and models are available at https://github.com/gersteinlab/ML-bench.
♻ ☆ How Well Do Multi-modal LLMs Interpret CT Scans? An Auto-Evaluation Framework for Analyses
Automatically interpreting CT scans can ease the workload of radiologists. However, this is challenging mainly due to the scarcity of adequate datasets and reference standards for evaluation. This study aims to bridge this gap by introducing a novel evaluation framework, named ``GPTRadScore''. This framework assesses the capabilities of multi-modal LLMs, such as GPT-4 with Vision (GPT-4V), Gemini Pro Vision, LLaVA-Med, and RadFM, in generating descriptions for prospectively-identified findings. By employing a decomposition technique based on GPT-4, GPTRadScore compares these generated descriptions with gold-standard report sentences, analyzing their accuracy in terms of body part, location, and type of finding. Evaluations demonstrated a high correlation with clinician assessments and highlighted its potential over traditional metrics, such as BLEU, METEOR, and ROUGE. Furthermore, to contribute to future studies, we plan to release a benchmark dataset annotated by clinicians. Using GPTRadScore, we found that while GPT-4V and Gemini Pro Vision fare better, their performance revealed significant areas for improvement, primarily due to limitations in the dataset used for training these models. To demonstrate this potential, RadFM was fine-tuned and it resulted in significant accuracy improvements: location accuracy rose from 3.41\% to 12.8\%, body part accuracy from 29.12\% to 53\%, and type accuracy from 9.24\% to 30\%, thereby validating our hypothesis.
♻ ☆ A Rate-Distortion View of Uncertainty Quantification
In supervised learning, understanding an input's proximity to the training data can help a model decide whether it has sufficient evidence for reaching a reliable prediction. While powerful probabilistic models such as Gaussian Processes naturally have this property, deep neural networks often lack it. In this paper, we introduce Distance Aware Bottleneck (DAB), i.e., a new method for enriching deep neural networks with this property. Building on prior information bottleneck approaches, our method learns a codebook that stores a compressed representation of all inputs seen during training. The distance of a new example from this codebook can serve as an uncertainty estimate for the example. The resulting model is simple to train and provides deterministic uncertainty estimates by a single forward pass. Finally, our method achieves better out-of-distribution (OOD) detection and misclassification prediction than prior methods, including expensive ensemble methods, deep kernel Gaussian Processes, and approaches based on the standard information bottleneck.
♻ ☆ A generalizable framework for low-rank tensor completion with numerical priors
Low-Rank Tensor Completion, a method which exploits the inherent structure of tensors, has been studied extensively as an effective approach to tensor completion. Whilst such methods attained great success, none have systematically considered exploiting the numerical priors of tensor elements. Ignoring numerical priors causes loss of important information regarding the data, and therefore prevents the algorithms from reaching optimal accuracy. Despite the existence of some individual works which consider ad hoc numerical priors for specific tasks, no generalizable frameworks for incorporating numerical priors have appeared. We present the Generalized CP Decomposition Tensor Completion (GCDTC) framework, the first generalizable framework for low-rank tensor completion that takes numerical priors of the data into account. We test GCDTC by further proposing the Smooth Poisson Tensor Completion (SPTC) algorithm, an instantiation of the GCDTC framework, whose performance exceeds current state-of-the-arts by considerable margins in the task of non-negative tensor completion, exemplifying GCDTC's effectiveness. Our code is open-source.
comment: Accepted to Pattern Recognition
♻ ☆ Dynamic Normativity: Necessary and Sufficient Conditions for Value Alignment
The critical inquiry pervading the realm of Philosophy, and perhaps extending its influence across all Humanities disciplines, revolves around the intricacies of morality and normativity. Surprisingly, in recent years, this thematic thread has woven its way into an unexpected domain, one not conventionally associated with pondering "what ought to be": the field of artificial intelligence (AI) research. Central to morality and AI, we find "alignment", a problem related to the challenges of expressing human goals and values in a manner that artificial systems can follow without leading to unwanted adversarial effects. More explicitly and with our current paradigm of AI development in mind, we can think of alignment as teaching human values to non-anthropomorphic entities trained through opaque, gradient-based learning techniques. This work addresses alignment as a technical-philosophical problem that requires solid philosophical foundations and practical implementations that bring normative theory to AI system development. To accomplish this, we propose two sets of necessary and sufficient conditions that, we argue, should be considered in any alignment process. While necessary conditions serve as metaphysical and metaethical roots that pertain to the permissibility of alignment, sufficient conditions establish a blueprint for aligning AI systems under a learning-based paradigm. After laying such foundations, we present implementations of this approach by using state-of-the-art techniques and methods for aligning general-purpose language systems. We call this framework Dynamic Normativity. Its central thesis is that any alignment process under a learning paradigm that cannot fulfill its necessary and sufficient conditions will fail in producing aligned systems.
♻ ☆ HARE: HumAn pRiors, a key to small language model Efficiency
Human priors play a crucial role in efficiently utilizing data in deep learning. However, with the development of large language models (LLMs), there is an increasing emphasis on scaling both model size and data volume, which often diminishes the importance of human priors in data construction. Influenced by these trends, existing Small Language Models (SLMs) mainly rely on web-scraped large-scale training data, neglecting the proper incorporation of human priors. This oversight limits the training efficiency of language models in resource-constrained settings. In this paper, we propose a principle to leverage human priors for data construction. This principle emphasizes achieving high-performance SLMs by training on a concise dataset that accommodates both semantic diversity and data quality consistency, while avoiding benchmark data leakage. Following this principle, we train an SLM named HARE-1.1B. Extensive experiments on large-scale benchmark datasets demonstrate that HARE-1.1B performs favorably against state-of-the-art SLMs, validating the effectiveness of the proposed principle. Additionally, this provides new insights into efficient language model training in resource-constrained environments from the view of human priors.
♻ ☆ Make Your Home Safe: Time-aware Unsupervised User Behavior Anomaly Detection in Smart Homes via Loss-guided Mask KDD 2024
Smart homes, powered by the Internet of Things, offer great convenience but also pose security concerns due to abnormal behaviors, such as improper operations of users and potential attacks from malicious attackers. Several behavior modeling methods have been proposed to identify abnormal behaviors and mitigate potential risks. However, their performance often falls short because they do not effectively learn less frequent behaviors, consider temporal context, or account for the impact of noise in human behaviors. In this paper, we propose SmartGuard, an autoencoder-based unsupervised user behavior anomaly detection framework. First, we design a Loss-guided Dynamic Mask Strategy (LDMS) to encourage the model to learn less frequent behaviors, which are often overlooked during learning. Second, we propose a Three-level Time-aware Position Embedding (TTPE) to incorporate temporal information into positional embedding to detect temporal context anomaly. Third, we propose a Noise-aware Weighted Reconstruction Loss (NWRL) that assigns different weights for routine behaviors and noise behaviors to mitigate the interference of noise behaviors during inference. Comprehensive experiments on three datasets with ten types of anomaly behaviors demonstrates that SmartGuard consistently outperforms state-of-the-art baselines and also offers highly interpretable results.
comment: KDD 2024
♻ ☆ A Survey of Fragile Model Watermarking
Model fragile watermarking, inspired by both the field of adversarial attacks on neural networks and traditional multimedia fragile watermarking, has gradually emerged as a potent tool for detecting tampering, and has witnessed rapid development in recent years. Unlike robust watermarks, which are widely used for identifying model copyrights, fragile watermarks for models are designed to identify whether models have been subjected to unexpected alterations such as backdoors, poisoning, compression, among others. These alterations can pose unknown risks to model users, such as misidentifying stop signs as speed limit signs in classic autonomous driving scenarios. This paper provides an overview of the relevant work in the field of model fragile watermarking since its inception, categorizing them and revealing the developmental trajectory of the field, thus offering a comprehensive survey for future endeavors in model fragile watermarking.
♻ ☆ Advances in 3D Neural Stylization: A Survey
Modern artificial intelligence offers a novel and transformative approach to creating digital art across diverse styles and modalities like images, videos and 3D data, unleashing the power of creativity and revolutionizing the way that we perceive and interact with visual content. This paper reports on recent advances in stylized 3D asset creation and manipulation with the expressive power of neural networks. We establish a taxonomy for neural stylization, considering crucial design choices such as scene representation, guidance data, optimization strategies, and output styles. Building on such taxonomy, our survey first revisits the background of neural stylization on 2D images, and then presents in-depth discussions on recent neural stylization methods for 3D data, accompanied by a mini-benchmark evaluating selected neural field stylization methods. Based on the insights gained from the survey, we highlight the practical significance, open challenges, future research, and potential impacts of neural stylization, which facilitates researchers and practitioners to navigate the rapidly evolving landscape of 3D content creation using modern artificial intelligence.
♻ ☆ Instruction Fine-Tuning: Does Prompt Loss Matter?
We present a novel study analyzing the effects of various prompt loss token weights (PLW) for supervised instruction fine-tuning (SIFT). While prompt-masking (PLW = 0) is common for SIFT, some fine-tuning APIs support fractional PLWs and suggest that using a small non-zero PLW can help stabilize learning when fine-tuning on short-completion data. However, there has never been a study confirming this claim, and OpenAI, a major cloud-based SIFT provider, recently removed this parameter from their fine-tuning API. We found that performance of models fine-tuned on short-completion data had a statistically-significant negative quadratic relationship with PLW. Using small values (0.01 - 0.5) of PLW produced better results on multiple-choice and short-generation benchmarks (outperforming models fine-tuned on long-completion data) while large values (~ 1.0) of PLW produced better results on long-generation benchmarks. We explained this effect and verified its importance through additional experiments. This research serves as a warning to API providers about the importance of providing a PLW parameter for SIFT.
comment: 8 pages of content. 13 pages of appendices. 45 figures
♻ ☆ Semi-Supervised Coupled Thin-Plate Spline Model for Rotation Correction and Beyond
Thin-plate spline (TPS) is a principal warp that allows for representing elastic, nonlinear transformation with control point motions. With the increase of control points, the warp becomes increasingly flexible but usually encounters a bottleneck caused by undesired issues, e.g., content distortion. In this paper, we explore generic applications of TPS in single-image-based warping tasks, such as rotation correction, rectangling, and portrait correction. To break this bottleneck, we propose the coupled thin-plate spline model (CoupledTPS), which iteratively couples multiple TPS with limited control points into a more flexible and powerful transformation. Concretely, we first design an iterative search to predict new control points according to the current latent condition. Then, we present the warping flow as a bridge for the coupling of different TPS transformations, effectively eliminating interpolation errors caused by multiple warps. Besides, in light of the laborious annotation cost, we develop a semi-supervised learning scheme to improve warping quality by exploiting unlabeled data. It is formulated through dual transformation between the searched control points of unlabeled data and its graphic augmentation, yielding an implicit correction consistency constraint. Finally, we collect massive unlabeled data to exhibit the benefit of our semi-supervised scheme in rotation correction. Extensive experiments demonstrate the superiority and universality of CoupledTPS over the existing state-of-the-art (SoTA) solutions for rotation correction and beyond. The code and data are available at https://github.com/nie-lang/CoupledTPS.
comment: Accepted to TPAMI2024
♻ ☆ Data Set Terminology of Deep Learning in Medicine: A Historical Review and Recommendation
Medicine and deep learning-based artificial intelligence (AI) engineering represent two distinct fields each with decades of published history. With such history comes a set of terminology that has a specific way in which it is applied. However, when two distinct fields with overlapping terminology start to collaborate, miscommunication and misunderstandings can occur. This narrative review aims to give historical context for these terms, accentuate the importance of clarity when these terms are used in medical AI contexts, and offer solutions to mitigate misunderstandings by readers from either field. Through an examination of historical documents, including articles, writing guidelines, and textbooks, this review traces the divergent evolution of terms for data sets and their impact. Initially, the discordant interpretations of the word 'validation' in medical and AI contexts are explored. Then the data sets used for AI evaluation are classified, namely random splitting, cross-validation, temporal, geographic, internal, and external sets. The accurate and standardized description of these data sets is crucial for demonstrating the robustness and generalizability of AI applications in medicine. This review clarifies existing literature to provide a comprehensive understanding of these classifications and their implications in AI evaluation. This review then identifies often misunderstood terms and proposes pragmatic solutions to mitigate terminological confusion. Among these solutions are the use of standardized terminology such as 'training set,' 'validation (or tuning) set,' and 'test set,' and explicit definition of data set splitting terminologies in each medical AI research publication. This review aspires to enhance the precision of communication in medical AI, thereby fostering more effective and transparent research methodologies in this interdisciplinary field.
comment: 20 pages, 3 figures, 3 tables
♻ ☆ ERASER: Machine Unlearning in MLaaS via an Inference Serving-Aware Approach CCS'24
Over the past years, Machine Learning-as-a-Service (MLaaS) has received a surging demand for supporting Machine Learning-driven services to offer revolutionized user experience across diverse application areas. MLaaS provides inference service with low inference latency based on an ML model trained using a dataset collected from numerous individual data owners. Recently, for the sake of data owners' privacy and to comply with the "right to be forgotten (RTBF)" as enacted by data protection legislation, many machine unlearning methods have been proposed to remove data owners' data from trained models upon their unlearning requests. However, despite their promising efficiency, almost all existing machine unlearning methods handle unlearning requests independently from inference requests, which unfortunately introduces a new security issue of inference service obsolescence and a privacy vulnerability of undesirable exposure for machine unlearning in MLaaS. In this paper, we propose the ERASER framework for machinE unleaRning in MLaAS via an inferencE seRving-aware approach. ERASER strategically choose appropriate unlearning execution timing to address the inference service obsolescence issue. A novel inference consistency certification mechanism is proposed to avoid the violation of RTBF principle caused by postponed unlearning executions, thereby mitigating the undesirable exposure vulnerability. ERASER offers three groups of design choices to allow for tailor-made variants that best suit the specific environments and preferences of various MLaaS systems. Extensive empirical evaluations across various settings confirm ERASER's effectiveness, e.g., it can effectively save up to 99% of inference latency and 31% of computation overhead over the inference-oblivion baseline.
comment: Accepted by CCS'24
♻ ☆ Fuzzy Convolution Neural Networks for Tabular Data Classification
Recently, convolution neural networks (CNNs) have attracted a great deal of attention due to their remarkable performance in various domains, particularly in image and text classification tasks. However, their application to tabular data classification remains underexplored. There are many fields such as bioinformatics, finance, medicine where nonimage data are prevalent. Adaption of CNNs to classify nonimage data remains highly challenging. This paper investigates the efficacy of CNNs for tabular data classification, aiming to bridge the gap between traditional machine learning approaches and deep learning techniques. We propose a novel framework fuzzy convolution neural network (FCNN) tailored specifically for tabular data to capture local patterns within feature vectors. In our approach, we map feature values to fuzzy memberships. The fuzzy membership vectors are converted into images that are used to train the CNN model. The trained CNN model is used to classify unknown feature vectors. To validate our approach, we generated six complex noisy data sets. We used randomly selected seventy percent samples from each data set for training and thirty percent for testing. The data sets were also classified using the state-of-the-art machine learning algorithms such as the decision tree (DT), support vector machine (SVM), fuzzy neural network (FNN), Bayes classifier, and Random Forest (RF). Experimental results demonstrate that our proposed model can effectively learn meaningful representations from tabular data, achieving competitive or superior performance compared to existing methods. Overall, our finding suggests that the proposed FCNN model holds promise as a viable alternative for tabular data classification tasks, offering a fresh prospective and potentially unlocking new opportunities for leveraging deep learning in structured data analysis.
comment: 10 pages, 16 figures, Submitted to IEEE Access
♻ ☆ The Influencer Next Door: How Misinformation Creators Use GenAI
Advances in generative AI (GenAI) have raised concerns about detecting and discerning AI-generated content from human-generated content. Most existing literature assumes a paradigm where 'expert' organized disinformation creators and flawed AI models deceive 'ordinary' users. Based on longitudinal ethnographic research with misinformation creators and consumers between 2022-2023, we instead find that GenAI supports bricolage work, where non-experts increasingly use GenAI to remix, repackage, and (re)produce content to meet their personal needs and desires. This research yielded four key findings: First, participants primarily used GenAI for creation, rather than truth-seeking. Second, a spreading 'influencer millionaire' narrative drove participants to become content creators, using GenAI as a productivity tool to generate a volume of (often misinformative) content. Third, GenAI lowered the barrier to entry for content creation across modalities, enticing consumers to become creators and significantly increasing existing creators' output. Finally, participants used Gen AI to learn and deploy marketing tactics to expand engagement and monetize their content. We argue for shifting analysis from the public as consumers of AI content to bricoleurs who use GenAI creatively, often without a detailed understanding of its underlying technology. We analyze how these understudied emergent uses of GenAI produce new or accelerated misinformation harms, and their implications for AI products, platforms and policies.
comment: 44 pages, 25 figures
♻ ☆ MLEM: Generative and Contrastive Learning as Distinct Modalities for Event Sequences
This study explores the application of self-supervised learning techniques for event sequences. It is a key modality in various applications such as banking, e-commerce, and healthcare. However, there is limited research on self-supervised learning for event sequences, and methods from other domains like images, texts, and speech may not easily transfer. To determine the most suitable approach, we conduct a detailed comparative analysis of previously identified best-performing methods. We find that neither the contrastive nor generative method is superior. Our assessment includes classifying event sequences, predicting the next event, and evaluating embedding quality. These results further highlight the potential benefits of combining both methods. Given the lack of research on hybrid models in this domain, we initially adapt the baseline model from another domain. However, upon observing its underperformance, we develop a novel method called the Multimodal-Learning Event Model (MLEM). MLEM treats contrastive learning and generative modeling as distinct yet complementary modalities, aligning their embeddings. The results of our study demonstrate that combining contrastive and generative approaches into one procedure with MLEM achieves superior performance across multiple metrics.
comment: 11 pages, 9 figures
♻ ☆ STEMO: Early Spatio-temporal Forecasting with Multi-Objective Reinforcement Learning KDD 2024
Accuracy and timeliness are indeed often conflicting goals in prediction tasks. Premature predictions may yield a higher rate of false alarms, whereas delaying predictions to gather more information can render them too late to be useful. In applications such as wildfires, crimes, and traffic jams, timely forecasting are vital for safeguarding human life and property. Consequently, finding a balance between accuracy and timeliness is crucial. In this paper, we propose an early spatio-temporal forecasting model based on Multi-Objective reinforcement learning that can either implement an optimal policy given a preference or infer the preference based on a small number of samples. The model addresses two primary challenges: 1) enhancing the accuracy of early forecasting and 2) providing the optimal policy for determining the most suitable prediction time for each area. Our method demonstrates superior performance on three large-scale real-world datasets, surpassing existing methods in early spatio-temporal forecasting tasks.
comment: Accepted paper in KDD 2024
♻ ☆ Identifying and Mitigating Privacy Risks Stemming from Language Models: A Survey
Large Language Models (LLMs) have shown greatly enhanced performance in recent years, attributed to increased size and extensive training data. This advancement has led to widespread interest and adoption across industries and the public. However, training data memorization in Machine Learning models scales with model size, particularly concerning for LLMs. Memorized text sequences have the potential to be directly leaked from LLMs, posing a serious threat to data privacy. Various techniques have been developed to attack LLMs and extract their training data. As these models continue to grow, this issue becomes increasingly critical. To help researchers and policymakers understand the state of knowledge around privacy attacks and mitigations, including where more work is needed, we present the first SoK on data privacy for LLMs. We (i) identify a taxonomy of salient dimensions where attacks differ on LLMs, (ii) systematize existing attacks, using our taxonomy of dimensions to highlight key trends, (iii) survey existing mitigation strategies, highlighting their strengths and limitations, and (iv) identify key gaps, demonstrating open problems and areas for concern.
comment: 15 pages
♻ ☆ Prediction of the Realisation of an Information Need: An EEG Study
One of the foundational goals of Information Retrieval (IR) is to satisfy searchers' Information Needs (IN). Understanding how INs physically manifest has long been a complex and elusive process. However, recent studies utilising Electroencephalography (EEG) data have provided real-time insights into the neural processes associated with INs. Unfortunately, they have yet to demonstrate how this insight can practically benefit the search experience. As such, within this study, we explore the ability to predict the realisation of IN within EEG data across 14 subjects whilst partaking in a Question-Answering (Q/A) task. Furthermore, we investigate the combinations of EEG features that yield optimal predictive performance, as well as identify regions within the Q/A queries where a subject's realisation of IN is more pronounced. The findings from this work demonstrate that EEG data is sufficient for the real-time prediction of the realisation of an IN across all subjects with an accuracy of 73.5% (SD 2.6%) and on a per-subject basis with an accuracy of 90.1% (SD 22.1%). This work helps to close the gap by bridging theoretical neuroscientific advancements with tangible improvements in information retrieval practices, paving the way for real-time prediction of the realisation of IN.
♻ ☆ Cross-Problem Learning for Solving Vehicle Routing Problems IJCAI'24
Existing neural heuristics often train a deep architecture from scratch for each specific vehicle routing problem (VRP), ignoring the transferable knowledge across different VRP variants. This paper proposes the cross-problem learning to assist heuristics training for different downstream VRP variants. Particularly, we modularize neural architectures for complex VRPs into 1) the backbone Transformer for tackling the travelling salesman problem (TSP), and 2) the additional lightweight modules for processing problem-specific features in complex VRPs. Accordingly, we propose to pre-train the backbone Transformer for TSP, and then apply it in the process of fine-tuning the Transformer models for each target VRP variant. On the one hand, we fully fine-tune the trained backbone Transformer and problem-specific modules simultaneously. On the other hand, we only fine-tune small adapter networks along with the modules, keeping the backbone Transformer still. Extensive experiments on typical VRPs substantiate that 1) the full fine-tuning achieves significantly better performance than the one trained from scratch, and 2) the adapter-based fine-tuning also delivers comparable performance while being notably parameter-efficient. Furthermore, we empirically demonstrate the favorable effect of our method in terms of cross-distribution application and versatility.
comment: Accepted by IJCAI'24
♻ ☆ Data Poisoning to Fake a Nash Equilibrium in Markov Games
We characterize offline data poisoning attacks on Multi-Agent Reinforcement Learning (MARL), where an attacker may change a data set in an attempt to install a (potentially fictitious) unique Markov-perfect Nash equilibrium for a two-player zero-sum Markov game. We propose the unique Nash set, namely the set of games, specified by their Q functions, with a specific joint policy being the unique Nash equilibrium. The unique Nash set is central to poisoning attacks because the attack is successful if and only if data poisoning pushes all plausible games inside the set. The unique Nash set generalizes the reward polytope commonly used in inverse reinforcement learning to MARL. For zero-sum Markov games, both the inverse Nash set and the set of plausible games induced by data are polytopes in the Q function space. We exhibit a linear program to efficiently compute the optimal poisoning attack. Our work sheds light on the structure of data poisoning attacks on offline MARL, a necessary step before one can design more robust MARL algorithms.
♻ ☆ SNN4Agents: A Framework for Developing Energy-Efficient Embodied Spiking Neural Networks for Autonomous Agents
Recent trends have shown that autonomous agents, such as Autonomous Ground Vehicles (AGVs), Unmanned Aerial Vehicles (UAVs), and mobile robots, effectively improve human productivity in solving diverse tasks. However, since these agents are typically powered by portable batteries, they require extremely low power/energy consumption to operate in a long lifespan. To solve this challenge, neuromorphic computing has emerged as a promising solution, where bio-inspired Spiking Neural Networks (SNNs) use spikes from event-based cameras or data conversion pre-processing to perform sparse computations efficiently. However, the studies of SNN deployments for autonomous agents are still at an early stage. Hence, the optimization stages for enabling efficient embodied SNN deployments for autonomous agents have not been defined systematically. Toward this, we propose a novel framework called SNN4Agents that consists of a set of optimization techniques for designing energy-efficient embodied SNNs targeting autonomous agent applications. Our SNN4Agents employs weight quantization, timestep reduction, and attention window reduction to jointly improve the energy efficiency, reduce the memory footprint, optimize the processing latency, while maintaining high accuracy. In the evaluation, we investigate use cases of event-based car recognition, and explore the trade-offs among accuracy, latency, memory, and energy consumption. The experimental results show that our proposed framework can maintain high accuracy (i.e., 84.12% accuracy) with 68.75% memory saving, 3.58x speed-up, and 4.03x energy efficiency improvement as compared to the state-of-the-art work for NCARS dataset. In this manner, our SNN4Agents framework paves the way toward enabling energy-efficient embodied SNN deployments for autonomous agents.
comment: Accepted for publication at Frontiers in Robotics and AI (FROBT) - Section Robot Vision and Artificial Perception
♻ ☆ Timeline-based Sentence Decomposition with In-Context Learning for Temporal Fact Extraction ACL2024
Facts extraction is pivotal for constructing knowledge graphs. Recently, the increasing demand for temporal facts in downstream tasks has led to the emergence of the task of temporal fact extraction. In this paper, we specifically address the extraction of temporal facts from natural language text. Previous studies fail to handle the challenge of establishing time-to-fact correspondences in complex sentences. To overcome this hurdle, we propose a timeline-based sentence decomposition strategy using large language models (LLMs) with in-context learning, ensuring a fine-grained understanding of the timeline associated with various facts. In addition, we evaluate the performance of LLMs for direct temporal fact extraction and get unsatisfactory results. To this end, we introduce TSDRE, a method that incorporates the decomposition capabilities of LLMs into the traditional fine-tuning of smaller pre-trained language models (PLMs). To support the evaluation, we construct ComplexTRED, a complex temporal fact extraction dataset. Our experiments show that TSDRE achieves state-of-the-art results on both HyperRED-Temporal and ComplexTRED datasets.
comment: Accepted to ACL2024 main conference
♻ ☆ FedMKT: Federated Mutual Knowledge Transfer for Large and Small Language Models
Recent research in federated large language models (LLMs) has primarily focused on enabling clients to fine-tune their locally deployed homogeneous LLMs collaboratively or on transferring knowledge from server-based LLMs to small language models (SLMs) at downstream clients. However, a significant gap remains in the simultaneous mutual enhancement of both the server's LLM and clients' SLMs. To bridge this gap, we propose FedMKT, a parameter-efficient federated mutual knowledge transfer framework for large and small language models. This framework is designed to adaptively transfer knowledge from the server's LLM to clients' SLMs while concurrently enriching the LLM with clients' unique domain insights. We facilitate token alignment using minimum edit distance (MinED) and then selective mutual knowledge transfer between client-side SLMs and a server-side LLM, aiming to collectively enhance their performance. Through extensive experiments across three distinct scenarios, we evaluate the effectiveness of FedMKT using various public LLMs and SLMs on a range of NLP text generation tasks. Empirical results demonstrate that FedMKT simultaneously boosts the performance of both LLMs and SLMs.
♻ ☆ A Survey on Large Language Models for Recommendation
Large Language Models (LLMs) have emerged as powerful tools in the field of Natural Language Processing (NLP) and have recently gained significant attention in the domain of Recommendation Systems (RS). These models, trained on massive amounts of data using self-supervised learning, have demonstrated remarkable success in learning universal representations and have the potential to enhance various aspects of recommendation systems by some effective transfer techniques such as fine-tuning and prompt tuning, and so on. The crucial aspect of harnessing the power of language models in enhancing recommendation quality is the utilization of their high-quality representations of textual features and their extensive coverage of external knowledge to establish correlations between items and users. To provide a comprehensive understanding of the existing LLM-based recommendation systems, this survey presents a taxonomy that categorizes these models into two major paradigms, respectively Discriminative LLM for Recommendation (DLLM4Rec) and Generative LLM for Recommendation (GLLM4Rec), with the latter being systematically sorted out for the first time. Furthermore, we systematically review and analyze existing LLM-based recommendation systems within each paradigm, providing insights into their methodologies, techniques, and performance. Additionally, we identify key challenges and several valuable findings to provide researchers and practitioners with inspiration. We have also created a GitHub repository to index relevant papers on LLMs for recommendation, https://github.com/WLiK/LLM4Rec.
comment: 34 pages, 7 figures, 2 tables
♻ ☆ First Multi-Dimensional Evaluation of Flowchart Comprehension for Multimodal Large Language Models
With the development of Multimodal Large Language Models (MLLMs) technology, its general capabilities are increasingly powerful. To evaluate the various abilities of MLLMs, numerous evaluation systems have emerged. But now there is still a lack of a comprehensive method to evaluate MLLMs in the tasks related to flowcharts, which are very important in daily life and work. We propose the first comprehensive method, FlowCE, to assess MLLMs across various dimensions for tasks related to flowcharts. It encompasses evaluating MLLMs' abilities in Reasoning, Localization Recognition, Information Extraction, Logical Verification, and Summarization on flowcharts. However, we find that even the GPT4o model achieves only a score of 56.63. Among open-source models, Phi-3-Vision obtained the highest score of 49.97. We hope that FlowCE can contribute to future research on MLLMs for tasks based on flowcharts. \url{https://github.com/360AILAB-NLP/FlowCE} \end{abstract}
♻ ☆ Uncovering communities of pipelines in the task-fMRI analytical space
Analytical workflows in functional magnetic resonance imaging are highly flexible with limited best practices as to how to choose a pipeline. While it has been shown that the use of different pipelines might lead to different results, there is still a lack of understanding of the factors that drive these differences and of the stability of these differences across contexts. We use community detection algorithms to explore the pipeline space and assess the stability of pipeline relationships across different contexts. We show that there are subsets of pipelines that give similar results, especially those sharing specific parameters (e.g. number of motion regressors, software packages, etc.). Those pipeline-to-pipeline patterns are stable across groups of participants but not across different tasks. By visualizing the differences between communities, we show that the pipeline space is mainly driven by the size of the activation area in the brain and the scale of statistic values in statistic maps.
comment: Accepted at the 2024 IEEE International Conference on Image Processing
♻ ☆ Non-autoregressive Generative Models for Reranking Recommendation KDD 2024
Contemporary recommendation systems are designed to meet users' needs by delivering tailored lists of items that align with their specific demands or interests. In a multi-stage recommendation system, reranking plays a crucial role by modeling the intra-list correlations among items. The key challenge of reranking lies in the exploration of optimal sequences within the combinatorial space of permutations. Recent research proposes a generator-evaluator learning paradigm, where the generator generates multiple feasible sequences and the evaluator picks out the best sequence based on the estimated listwise score. The generator is of vital importance, and generative models are well-suited for the generator function. Current generative models employ an autoregressive strategy for sequence generation. However, deploying autoregressive models in real-time industrial systems is challenging. To address these issues, we propose a Non-AutoRegressive generative model for reranking Recommendation (NAR4Rec) designed to enhance efficiency and effectiveness. To tackle challenges such as sparse training samples and dynamic candidates, we introduce a matching model. Considering the diverse nature of user feedback, we employ a sequence-level unlikelihood training objective to differentiate feasible sequences from unfeasible ones. Additionally, to overcome the lack of dependency modeling in non-autoregressive models regarding target items, we introduce contrastive decoding to capture correlations among these items. Extensive offline experiments validate the superior performance of NAR4Rec over state-of-the-art reranking methods. Online A/B tests reveal that NAR4Rec significantly enhances the user experience. Furthermore, NAR4Rec has been fully deployed in a popular video app Kuaishou with over 300 million daily active users.
comment: Accepted by KDD 2024
♻ ☆ Language-Driven Active Learning for Diverse Open-Set 3D Object Detection
Object detection is crucial for ensuring safe autonomous driving. However, data-driven approaches face challenges when encountering minority or novel objects in the 3D driving scene. In this paper, we propose VisLED, a language-driven active learning framework for diverse open-set 3D Object Detection. Our method leverages active learning techniques to query diverse and informative data samples from an unlabeled pool, enhancing the model's ability to detect underrepresented or novel objects. Specifically, we introduce the Vision-Language Embedding Diversity Querying (VisLED-Querying) algorithm, which operates in both open-world exploring and closed-world mining settings. In open-world exploring, VisLED-Querying selects data points most novel relative to existing data, while in closed-world mining, it mines novel instances of known classes. We evaluate our approach on the nuScenes dataset and demonstrate its efficiency compared to random sampling and entropy-querying methods. Our results show that VisLED-Querying consistently outperforms random sampling and offers competitive performance compared to entropy-querying despite the latter's model-optimality, highlighting the potential of VisLED for improving object detection in autonomous driving scenarios. We make our code publicly available at https://github.com/Bjork-crypto/VisLED-Querying
♻ ☆ Benchmarking Zero-Shot Recognition with Vision-Language Models: Challenges on Granularity and Specificity CVPR2024
This paper presents novel benchmarks for evaluating vision-language models (VLMs) in zero-shot recognition, focusing on granularity and specificity. Although VLMs excel in tasks like image captioning, they face challenges in open-world settings. Our benchmarks test VLMs' consistency in understanding concepts across semantic granularity levels and their response to varying text specificity. Findings show that VLMs favor moderately fine-grained concepts and struggle with specificity, often misjudging texts that differ from their training data. Extensive evaluations reveal limitations in current VLMs, particularly in distinguishing between correct and subtly incorrect descriptions. While fine-tuning offers some improvements, it doesn't fully address these issues, highlighting the need for VLMs with enhanced generalization capabilities for real-world applications. This study provides insights into VLM limitations and suggests directions for developing more robust models.
comment: CVPR2024 MMFM workshop
♻ ☆ Evaluating Text-to-Visual Generation with Image-to-Text Generation
Despite significant progress in generative AI, comprehensive evaluation remains challenging because of the lack of effective metrics and standardized benchmarks. For instance, the widely-used CLIPScore measures the alignment between a (generated) image and text prompt, but it fails to produce reliable scores for complex prompts involving compositions of objects, attributes, and relations. One reason is that text encoders of CLIP can notoriously act as a "bag of words", conflating prompts such as "the horse is eating the grass" with "the grass is eating the horse". To address this, we introduce the VQAScore, which uses a visual-question-answering (VQA) model to produce an alignment score by computing the probability of a "Yes" answer to a simple "Does this figure show '{text}'?" question. Though simpler than prior art, VQAScore computed with off-the-shelf models produces state-of-the-art results across many (8) image-text alignment benchmarks. We also compute VQAScore with an in-house model that follows best practices in the literature. For example, we use a bidirectional image-question encoder that allows image embeddings to depend on the question being asked (and vice versa). Our in-house model, CLIP-FlanT5, outperforms even the strongest baselines that make use of the proprietary GPT-4V. Interestingly, although we train with only images, VQAScore can also align text with video and 3D models. VQAScore allows researchers to benchmark text-to-visual generation using complex texts that capture the compositional structure of real-world prompts. We introduce GenAI-Bench, a more challenging benchmark with 1,600 compositional text prompts that require parsing scenes, objects, attributes, relationships, and high-order reasoning like comparison and logic. GenAI-Bench also offers over 15,000 human ratings for leading image and video generation models such as Stable Diffusion, DALL-E 3, and Gen2.
comment: We open-source our data, model, and code at: https://github.com/linzhiqiu/t2v_metrics ; Project page: https://linzhiqiu.github.io/papers/vqascore
♻ ☆ AIM: Attributing, Interpreting, Mitigating Data Unfairness KDD 2024
Data collected in the real world often encapsulates historical discrimination against disadvantaged groups and individuals. Existing fair machine learning (FairML) research has predominantly focused on mitigating discriminative bias in the model prediction, with far less effort dedicated towards exploring how to trace biases present in the data, despite its importance for the transparency and interpretability of FairML. To fill this gap, we investigate a novel research problem: discovering samples that reflect biases/prejudices from the training data. Grounding on the existing fairness notions, we lay out a sample bias criterion and propose practical algorithms for measuring and countering sample bias. The derived bias score provides intuitive sample-level attribution and explanation of historical bias in data. On this basis, we further design two FairML strategies via sample-bias-informed minimal data editing. They can mitigate both group and individual unfairness at the cost of minimal or zero predictive utility loss. Extensive experiments and analyses on multiple real-world datasets demonstrate the effectiveness of our methods in explaining and mitigating unfairness. Code is available at https://github.com/ZhiningLiu1998/AIM.
comment: 12 pages, 6 figures, accepted by ACM SIGKDD 2024. Webpage: https://github.com/ZhiningLiu1998/AIM
♻ ☆ Investigating Annotator Bias in Large Language Models for Hate Speech Detection
Data annotation, the practice of assigning descriptive labels to raw data, is pivotal in optimizing the performance of machine learning models. However, it is a resource-intensive process susceptible to biases introduced by annotators. The emergence of sophisticated Large Language Models (LLMs), like ChatGPT presents a unique opportunity to modernize and streamline this complex procedure. While existing research extensively evaluates the efficacy of LLMs, as annotators, this paper delves into the biases present in LLMs, specifically GPT 3.5 and GPT 4o when annotating hate speech data. Our research contributes to understanding biases in four key categories: gender, race, religion, and disability. Specifically targeting highly vulnerable groups within these categories, we analyze annotator biases. Furthermore, we conduct a comprehensive examination of potential factors contributing to these biases by scrutinizing the annotated data. We introduce our custom hate speech detection dataset, HateSpeechCorpus, to conduct this research. Additionally, we perform the same experiments on the ETHOS (Mollas et al., 2022) dataset also for comparative analysis. This paper serves as a crucial resource, guiding researchers and practitioners in harnessing the potential of LLMs for dataannotation, thereby fostering advancements in this critical field. The HateSpeechCorpus dataset is available here: https://github.com/AmitDasRup123/HateSpeechCorpus
♻ ☆ Integrated Planning in Hospitals: A Review
Efficient planning of scarce resources in hospitals is a challenging task for which a large variety of Operations Research and Management Science approaches have been developed since the 1950s. While efficient planning of single resources such as operating rooms, beds, or specific types of staff can already lead to enormous efficiency gains, integrated planning of several resources has been shown to hold even greater potential, and a large number of integrated planning approaches have been presented in the literature over the past decades. This paper provides the first literature review that focuses specifically on the Operations Research and Management Science literature related to integrated planning of different resources in hospitals. We collect the relevant literature and analyze it regarding different aspects such as uncertainty modeling and the use of real-life data. Several cross comparisons reveal interesting insights concerning, e.g., relations between the modeling and solution methods used and the practical implementation of the approaches developed. Moreover, we provide a high-level taxonomy for classifying different resource-focused integration approaches and point out gaps in the literature as well as promising directions for future research.
♻ ☆ HistGen: Histopathology Report Generation via Local-Global Feature Encoding and Cross-modal Context Interaction MICCAI2024
Histopathology serves as the gold standard in cancer diagnosis, with clinical reports being vital in interpreting and understanding this process, guiding cancer treatment and patient care. The automation of histopathology report generation with deep learning stands to significantly enhance clinical efficiency and lessen the labor-intensive, time-consuming burden on pathologists in report writing. In pursuit of this advancement, we introduce HistGen, a multiple instance learning-empowered framework for histopathology report generation together with the first benchmark dataset for evaluation. Inspired by diagnostic and report-writing workflows, HistGen features two delicately designed modules, aiming to boost report generation by aligning whole slide images (WSIs) and diagnostic reports from local and global granularity. To achieve this, a local-global hierarchical encoder is developed for efficient visual feature aggregation from a region-to-slide perspective. Meanwhile, a cross-modal context module is proposed to explicitly facilitate alignment and interaction between distinct modalities, effectively bridging the gap between the extensive visual sequences of WSIs and corresponding highly summarized reports. Experimental results on WSI report generation show the proposed model outperforms state-of-the-art (SOTA) models by a large margin. Moreover, the results of fine-tuning our model on cancer subtyping and survival analysis tasks further demonstrate superior performance compared to SOTA methods, showcasing strong transfer learning capability. Dataset, model weights, and source code are available in https://github.com/dddavid4real/HistGen.
comment: Accepted by MICCAI2024
♻ ☆ SwinGNN: Rethinking Permutation Invariance in Diffusion Models for Graph Generation
Diffusion models based on permutation-equivariant networks can learn permutation-invariant distributions for graph data. However, in comparison to their non-invariant counterparts, we have found that these invariant models encounter greater learning challenges since 1) their effective target distributions exhibit more modes; 2) their optimal one-step denoising scores are the score functions of Gaussian mixtures with more components. Motivated by this analysis, we propose a non-invariant diffusion model, called $\textit{SwinGNN}$, which employs an efficient edge-to-edge 2-WL message passing network and utilizes shifted window based self-attention inspired by SwinTransformers. Further, through systematic ablations, we identify several critical training and sampling techniques that significantly improve the sample quality of graph generation. At last, we introduce a simple post-processing trick, $\textit{i.e.}$, randomly permuting the generated graphs, which provably converts any graph generative model to a permutation-invariant one. Extensive experiments on synthetic and real-world protein and molecule datasets show that our SwinGNN achieves state-of-the-art performances. Our code is released at https://github.com/qiyan98/SwinGNN.
comment: TMLR 2024
♻ ☆ Teleporter Theory: A General and Simple Approach for Modeling Cross-World Counterfactual Causality
Leveraging the development of structural causal model (SCM), researchers can establish graphical models for exploring the causal mechanisms behind machine learning techniques. As the complexity of machine learning applications rises, single-world interventionism causal analysis encounters theoretical adaptation limitations. Accordingly, cross-world counterfactual approach extends our understanding of causality beyond observed data, enabling hypothetical reasoning about alternative scenarios. However, the joint involvement of cross-world variables, encompassing counterfactual variables and real-world variables, challenges the construction of the graphical model. Twin network is a subtle attempt, establishing a symbiotic relationship, to bridge the gap between graphical modeling and the introduction of counterfactuals albeit with room for improvement in generalization. In this regard, we demonstrate the theoretical breakdowns of twin networks in certain cross-world counterfactual scenarios. To this end, we propose a novel teleporter theory to establish a general and simple graphical representation of counterfactuals, which provides criteria for determining teleporter variables to connect multiple worlds. In theoretical application, we determine that introducing the proposed teleporter theory can directly obtain the conditional independence between counterfactual variables and real-world variables from the cross-world SCM without requiring complex algebraic derivations. Accordingly, we can further identify counterfactual causal effects through cross-world symbolic derivation. We demonstrate the generality of the teleporter theory to the practical application. Adhering to the proposed theory, we build a plug-and-play module, and the effectiveness of which are substantiated by experiments on benchmarks.
Information Retrieval 29
☆ Influence Maximization via Graph Neural Bandits KDD
We consider a ubiquitous scenario in the study of Influence Maximization (IM), in which there is limited knowledge about the topology of the diffusion network. We set the IM problem in a multi-round diffusion campaign, aiming to maximize the number of distinct users that are influenced. Leveraging the capability of bandit algorithms to effectively balance the objectives of exploration and exploitation, as well as the expressivity of neural networks, our study explores the application of neural bandit algorithms to the IM problem. We propose the framework IM-GNB (Influence Maximization with Graph Neural Bandits), where we provide an estimate of the users' probabilities of being influenced by influencers (also known as diffusion seeds). This initial estimate forms the basis for constructing both an exploitation graph and an exploration one. Subsequently, IM-GNB handles the exploration-exploitation tradeoff, by selecting seed nodes in real-time using Graph Convolutional Networks (GCN), in which the pre-estimated graphs are employed to refine the influencers' estimated rewards in each contextual setting. Through extensive experiments on two large real-world datasets, we demonstrate the effectiveness of IM-GNB compared with other baseline methods, significantly improving the spread outcome of such diffusion campaigns, when the underlying network is unknown.
comment: To appear at the 2024 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD)
☆ News Without Borders: Domain Adaptation of Multilingual Sentence Embeddings for Cross-lingual News Recommendation
Rapidly growing numbers of multilingual news consumers pose an increasing challenge to news recommender systems in terms of providing customized recommendations. First, existing neural news recommenders, even when powered by multilingual language models (LMs), suffer substantial performance losses in zero-shot cross-lingual transfer (ZS-XLT). Second, the current paradigm of fine-tuning the backbone LM of a neural recommender on task-specific data is computationally expensive and infeasible in few-shot recommendation and cold-start setups, where data is scarce or completely unavailable. In this work, we propose a news-adapted sentence encoder (NaSE), domain-specialized from a pretrained massively multilingual sentence encoder (SE). To this end, we construct and leverage PolyNews and PolyNewsParallel, two multilingual news-specific corpora. With the news-adapted multilingual SE in place, we test the effectiveness of (i.e., question the need for) supervised fine-tuning for news recommendation, and propose a simple and strong baseline based on (i) frozen NaSE embeddings and (ii) late click-behavior fusion. We show that NaSE achieves state-of-the-art performance in ZS-XLT in true cold-start and few-shot news recommendation.
PromptDSI: Prompt-based Rehearsal-free Instance-wise Incremental Learning for Document Retrieval
Differentiable Search Index (DSI) utilizes Pre-trained Language Models (PLMs) for efficient document retrieval without relying on external indexes. However, DSIs need full re-training to handle updates in dynamic corpora, causing significant computational inefficiencies. We introduce PromptDSI, a rehearsal-free, prompt-based approach for instance-wise incremental learning in document retrieval. PromptDSI attaches prompts to the frozen PLM's encoder of DSI, leveraging its powerful representation to efficiently index new corpora while maintaining a balance between stability and plasticity. We eliminate the initial forward pass of prompt-based continual learning methods that doubles training and inference time. Moreover, we propose a topic-aware prompt pool that employs neural topic embeddings as fixed keys. This strategy ensures diverse and effective prompt usage, addressing the challenge of parameter underutilization caused by the collapse of the query-key matching mechanism. Our empirical evaluations demonstrate that PromptDSI matches IncDSI in managing forgetting while significantly enhancing recall by over 4% on new corpora.
comment: 21 pages
☆ Behavior-Dependent Linear Recurrent Units for Efficient Sequential Recommendation
Sequential recommender systems aims to predict the users' next interaction through user behavior modeling with various operators like RNNs and attentions. However, existing models generally fail to achieve the three golden principles for sequential recommendation simultaneously, i.e., training efficiency, low-cost inference, and strong performance. To this end, we propose RecBLR, an Efficient Sequential Recommendation Model based on Behavior-Dependent Linear Recurrent Units to accomplish the impossible triangle of the three principles. By incorporating gating mechanisms and behavior-dependent designs into linear recurrent units, our model significantly enhances user behavior modeling and recommendation performance. Furthermore, we unlock the parallelizable training as well as inference efficiency for our model by designing a hardware-aware scanning acceleration algorithm with a customized CUDA kernel. Extensive experiments on real-world datasets with varying lengths of user behavior sequences demonstrate RecBLR's remarkable effectiveness in simultaneously achieving all three golden principles - strong recommendation performance, training efficiency, and low-cost inference, while exhibiting excellent scalability to datasets with long user interaction histories.
☆ Predicting Award Winning Research Papers at Publication Time
In recent years, many studies have been focusing on predicting the scientific impact of research papers. Most of these predictions are based on citations count or rely on features obtainable only from already published papers. In this study, we predict the likelihood for a research paper of winning an award only relying on information available at publication time. For each paper, we build the citation subgraph induced from its bibliography. We initially consider some features of this subgraph, such as the density and the global clustering coefficient, to make our prediction. Then, we mix this information with textual features, extracted from the abstract and the title, to obtain a more accurate final prediction. We made our experiments considering the ArnetMiner citation graph, while the ground truth on award-winning papers has been obtained from a collection of best paper awards from 32 computer science conferences. In our experiment, we obtained an encouraging F1 score of 0.694. Remarkably, The high recall and the low false negatives rate, show how the model performs very well at identifying papers that will not win an award. This behavior can help researchers in getting a first evaluation of their work at publication time. Lastly, we made some first experiments on interpretability. Our results highlight some interesting patterns both in topological and textual features.
☆ LLM4MSR: An LLM-Enhanced Paradigm for Multi-Scenario Recommendation
As the demand for more personalized recommendation grows and a dramatic boom in commercial scenarios arises, the study on multi-scenario recommendation (MSR) has attracted much attention, which uses the data from all scenarios to simultaneously improve their recommendation performance. However, existing methods tend to integrate insufficient scenario knowledge and neglect learning personalized cross-scenario preferences, thus leading to suboptimal performance and inadequate interpretability. Meanwhile, though large language model (LLM) has shown great capability of reasoning and capturing semantic information, the high inference latency and high computation cost of tuning hinder its implementation in industrial recommender systems. To fill these gaps, we propose an effective efficient interpretable LLM-enhanced paradigm LLM4MSR in this work. Specifically, we first leverage LLM to uncover multi-level knowledge including scenario correlations and users' cross-scenario interests from the designed scenario- and user-level prompt without fine-tuning the LLM, then adopt hierarchical meta networks to generate multi-level meta layers to explicitly improves the scenario-aware and personalized recommendation capability. Our experiments on KuaiSAR-small, KuaiSAR, and Amazon datasets validate two significant advantages of LLM4MSR: (i) the effectiveness and compatibility with different multi-scenario backbone models (achieving 1.5%, 1%, and 40% AUC improvement on three datasets), (ii) high efficiency and deployability on industrial recommender systems, and (iii) improved interpretability. The implemented code and data is available to ease reproduction.
☆ Improving Multi-modal Recommender Systems by Denoising and Aligning Multi-modal Content and User Feedback
Multi-modal recommender systems (MRSs) are pivotal in diverse online web platforms and have garnered considerable attention in recent years. However, previous studies overlook the challenges of (1) noisy multi-modal content, (2) noisy user feedback, and (3) aligning multi-modal content with user feedback. In order to tackle these challenges, we propose Denoising and Aligning Multi-modal Recommender System (DA-MRS). To mitigate multi-modal noise, DA-MRS first constructs item-item graphs determined by consistent content similarity across modalities. To denoise user feedback, DA-MRS associates the probability of observed feedback with multi-modal content and devises a denoised BPR loss. Furthermore, DA-MRS implements Alignment guided by User preference to enhance task-specific item representation and Alignment guided by graded Item relations to provide finer-grained alignment. Extensive experiments verify that DA-MRS is a plug-and-play framework and achieves significant and consistent improvements across various datasets, backbone models, and noisy scenarios.
☆ RIGL: A Unified Reciprocal Approach for Tracing the Independent and Group Learning Processes KDD 2024
In the realm of education, both independent learning and group learning are esteemed as the most classic paradigms. The former allows learners to self-direct their studies, while the latter is typically characterized by teacher-directed scenarios. Recent studies in the field of intelligent education have leveraged deep temporal models to trace the learning process, capturing the dynamics of students' knowledge states, and have achieved remarkable performance. However, existing approaches have primarily focused on modeling the independent learning process, with the group learning paradigm receiving less attention. Moreover, the reciprocal effect between the two learning processes, especially their combined potential to foster holistic student development, remains inadequately explored. To this end, in this paper, we propose RIGL, a unified Reciprocal model to trace knowledge states at both the individual and group levels, drawing from the Independent and Group Learning processes. Specifically, we first introduce a time frame-aware reciprocal embedding module to concurrently model both student and group response interactions across various time frames. Subsequently, we employ reciprocal enhanced learning modeling to fully exploit the comprehensive and complementary information between the two behaviors. Furthermore, we design a relation-guided temporal attentive network, comprised of dynamic graph modeling coupled with a temporal self-attention mechanism. It is used to delve into the dynamic influence of individual and group interactions throughout the learning processes. Conclusively, we introduce a bias-aware contrastive learning module to bolster the stability of the model's training. Extensive experiments on four real-world educational datasets clearly demonstrate the effectiveness of the proposed RIGL model.
comment: Accepted by KDD 2024. 12 pages
☆ LLM-enhanced Reranking in Recommender Systems
Reranking is a critical component in recommender systems, playing an essential role in refining the output of recommendation algorithms. Traditional reranking models have focused predominantly on accuracy, but modern applications demand consideration of additional criteria such as diversity and fairness. Existing reranking approaches often fail to harmonize these diverse criteria effectively at the model level. Moreover, these models frequently encounter challenges with scalability and personalization due to their complexity and the varying significance of different reranking criteria in diverse scenarios. In response, we introduce a comprehensive reranking framework enhanced by LLM, designed to seamlessly integrate various reranking criteria while maintaining scalability and facilitating personalized recommendations. This framework employs a fully connected graph structure, allowing the LLM to simultaneously consider multiple aspects such as accuracy, diversity, and fairness through a coherent Chain-of-Thought (CoT) process. A customizable input mechanism is also integrated, enabling the tuning of the language model's focus to meet specific reranking needs. We validate our approach using three popular public datasets, where our framework demonstrates superior performance over existing state-of-the-art reranking models in balancing multiple criteria. The code for this implementation is publicly available.
☆ A Gradient Accumulation Method for Dense Retriever under Memory Constraint
InfoNCE loss is commonly used to train dense retriever in information retrieval tasks. It is well known that a large batch is essential to stable and effective training with InfoNCE loss, which requires significant hardware resources. Due to the dependency of large batch, dense retriever has bottleneck of application and research. Recently, memory reduction methods have been broadly adopted to resolve the hardware bottleneck by decomposing forward and backward or using a memory bank. However, current methods still suffer from slow and unstable training. To address these issues, we propose \longmodelname\ (\modelname), a stable and efficient memory reduction method for dense retriever trains that uses a dual memory bank structure to leverage previously generated query and passage representations. Experiments on widely used five information retrieval datasets indicate that \modelname\ can surpass not only existing memory reduction methods but also high-resource scenario. Moreover, theoretical analysis and experimental results confirm that \modelname\ provides more stable dual-encoder training than current memory bank utilization methods.
☆ CherryRec: Enhancing News Recommendation Quality via LLM-driven Framework
Large Language Models (LLMs) have achieved remarkable progress in language understanding and generation. Custom LLMs leveraging textual features have been applied to recommendation systems, demonstrating improvements across various recommendation scenarios. However, most existing methods perform untrained recommendation based on pre-trained knowledge (e.g., movie recommendation), and the auto-regressive generation of LLMs leads to slow inference speeds, making them less effective in real-time recommendations.To address this, we propose a framework for news recommendation using LLMs, named \textit{CherryRec}, which ensures the quality of recommendations while accelerating the recommendation process. Specifically, we employ a Knowledge-aware News Rapid Selector to retrieve candidate options based on the user's interaction history. The history and retrieved items are then input as text into a fine-tuned LLM, the Content-aware News Llm Evaluator, designed to enhance news recommendation capabilities. Finally, the Value-aware News Scorer integrates the scores to compute the CherryRec Score, which serves as the basis for the final recommendation.We validate the effectiveness of the proposed framework by comparing it with state-of-the-art baseline methods on benchmark datasets. Our experimental results consistently show that CherryRec outperforms the baselines in both recommendation performance and efficiency.The project resource can be accessed at: \url{https://github.com/xxxxxx}
☆ Intermediate Distillation: Data-Efficient Distillation from Black-Box LLMs for Information Retrieval
Recent research has explored distilling knowledge from large language models (LLMs) to optimize retriever models, especially within the retrieval-augmented generation (RAG) framework. However, most existing training methods rely on extracting supervision signals from LLMs' weights or their output probabilities, which is not only resource-intensive but also incompatible with black-box LLMs. In this paper, we introduce \textit{Intermediate Distillation}, a data-efficient knowledge distillation training scheme that treats LLMs as black boxes and distills their knowledge via an innovative LLM-ranker-retriever pipeline, solely using LLMs' ranking generation as the supervision signal. Extensive experiments demonstrate that our proposed method can significantly improve the performance of retriever models with only 1,000 training instances. Moreover, our distilled retriever model significantly boosts performance in question-answering tasks within the RAG framework, demonstrating the potential of LLMs to economically and effectively train smaller models.
comment: 13 pages, 7 figures, 3 tables
☆ Accelerating Complex Disease Treatment through Network Medicine and GenAI: A Case Study on Drug Repurposing for Breast Cancer
The objective of this research is to introduce a network specialized in predicting drugs that can be repurposed by investigating real-world evidence sources, such as clinical trials and biomedical literature. Specifically, it aims to generate drug combination therapies for complex diseases (e.g., cancer, Alzheimer's). We present a multilayered network medicine approach, empowered by a highly configured ChatGPT prompt engineering system, which is constructed on the fly to extract drug mentions in clinical trials. Additionally, we introduce a novel algorithm that connects real-world evidence with disease-specific signaling pathways (e.g., KEGG database). This sheds light on the repurposability of drugs if they are found to bind with one or more protein constituents of a signaling pathway. To demonstrate, we instantiated the framework for breast cancer and found that, out of 46 breast cancer signaling pathways, the framework identified 38 pathways that were covered by at least two drugs. This evidence signals the potential for combining those drugs. Specifically, the most covered signaling pathway, ID hsa:2064, was covered by 108 drugs, some of which can be combined. Conversely, the signaling pathway ID hsa:1499 was covered by only two drugs, indicating a significant gap for further research. Our network medicine framework, empowered by GenAI, shows promise in identifying drug combinations with a high degree of specificity, knowing the exact signaling pathways and proteins that serve as targets. It is noteworthy that ChatGPT successfully accelerated the process of identifying drug mentions in clinical trials, though further investigations are required to determine the relationships among the drug mentions.
comment: 9 pages double columns, 5 figures, 3 algorithms, 3 tables, and 1 listing, Submitted to IEEE MedAI'24 Conference, to be held November 15-17, Chongqing, China
♻ ☆ Future Impact Decomposition in Request-level Recommendations
In recommender systems, reinforcement learning solutions have shown promising results in optimizing the interaction sequence between users and the system over the long-term performance. For practical reasons, the policy's actions are typically designed as recommending a list of items to handle users' frequent and continuous browsing requests more efficiently. In this list-wise recommendation scenario, the user state is updated upon every request in the corresponding MDP formulation. However, this request-level formulation is essentially inconsistent with the user's item-level behavior. In this study, we demonstrate that an item-level optimization approach can better utilize item characteristics and optimize the policy's performance even under the request-level MDP. We support this claim by comparing the performance of standard request-level methods with the proposed item-level actor-critic framework in both simulation and online experiments. Furthermore, we show that a reward-based future decomposition strategy can better express the item-wise future impact and improve the recommendation accuracy in the long term. To achieve a more thorough understanding of the decomposition strategy, we propose a model-based re-weighting framework with adversarial learning that further boost the performance and investigate its correlation with the reward-based strategy.
comment: 12 pages, 8 figures
♻ ☆ Prediction of the Realisation of an Information Need: An EEG Study
One of the foundational goals of Information Retrieval (IR) is to satisfy searchers' Information Needs (IN). Understanding how INs physically manifest has long been a complex and elusive process. However, recent studies utilising Electroencephalography (EEG) data have provided real-time insights into the neural processes associated with INs. Unfortunately, they have yet to demonstrate how this insight can practically benefit the search experience. As such, within this study, we explore the ability to predict the realisation of IN within EEG data across 14 subjects whilst partaking in a Question-Answering (Q/A) task. Furthermore, we investigate the combinations of EEG features that yield optimal predictive performance, as well as identify regions within the Q/A queries where a subject's realisation of IN is more pronounced. The findings from this work demonstrate that EEG data is sufficient for the real-time prediction of the realisation of an IN across all subjects with an accuracy of 73.5% (SD 2.6%) and on a per-subject basis with an accuracy of 90.1% (SD 22.1%). This work helps to close the gap by bridging theoretical neuroscientific advancements with tangible improvements in information retrieval practices, paving the way for real-time prediction of the realisation of IN.
♻ ☆ Knowledge-Driven Cross-Document Relation Extraction ACL 2024
Relation extraction (RE) is a well-known NLP application often treated as a sentence- or document-level task. However, a handful of recent efforts explore it across documents or in the cross-document setting (CrossDocRE). This is distinct from the single document case because different documents often focus on disparate themes, while text within a document tends to have a single goal. Linking findings from disparate documents to identify new relationships is at the core of the popular literature-based knowledge discovery paradigm in biomedicine and other domains. Current CrossDocRE efforts do not consider domain knowledge, which are often assumed to be known to the reader when documents are authored. Here, we propose a novel approach, KXDocRE, that embed domain knowledge of entities with input text for cross-document RE. Our proposed framework has three main benefits over baselines: 1) it incorporates domain knowledge of entities along with documents' text; 2) it offers interpretability by producing explanatory text for predicted relations between entities 3) it improves performance over the prior methods.
comment: Accepted in ACL 2024 Findings
♻ ☆ A Survey on Large Language Models for Recommendation
Large Language Models (LLMs) have emerged as powerful tools in the field of Natural Language Processing (NLP) and have recently gained significant attention in the domain of Recommendation Systems (RS). These models, trained on massive amounts of data using self-supervised learning, have demonstrated remarkable success in learning universal representations and have the potential to enhance various aspects of recommendation systems by some effective transfer techniques such as fine-tuning and prompt tuning, and so on. The crucial aspect of harnessing the power of language models in enhancing recommendation quality is the utilization of their high-quality representations of textual features and their extensive coverage of external knowledge to establish correlations between items and users. To provide a comprehensive understanding of the existing LLM-based recommendation systems, this survey presents a taxonomy that categorizes these models into two major paradigms, respectively Discriminative LLM for Recommendation (DLLM4Rec) and Generative LLM for Recommendation (GLLM4Rec), with the latter being systematically sorted out for the first time. Furthermore, we systematically review and analyze existing LLM-based recommendation systems within each paradigm, providing insights into their methodologies, techniques, and performance. Additionally, we identify key challenges and several valuable findings to provide researchers and practitioners with inspiration. We have also created a GitHub repository to index relevant papers on LLMs for recommendation, https://github.com/WLiK/LLM4Rec.
comment: 34 pages, 7 figures, 2 tables
♻ ☆ Non-autoregressive Generative Models for Reranking Recommendation KDD 2024
Contemporary recommendation systems are designed to meet users' needs by delivering tailored lists of items that align with their specific demands or interests. In a multi-stage recommendation system, reranking plays a crucial role by modeling the intra-list correlations among items. The key challenge of reranking lies in the exploration of optimal sequences within the combinatorial space of permutations. Recent research proposes a generator-evaluator learning paradigm, where the generator generates multiple feasible sequences and the evaluator picks out the best sequence based on the estimated listwise score. The generator is of vital importance, and generative models are well-suited for the generator function. Current generative models employ an autoregressive strategy for sequence generation. However, deploying autoregressive models in real-time industrial systems is challenging. To address these issues, we propose a Non-AutoRegressive generative model for reranking Recommendation (NAR4Rec) designed to enhance efficiency and effectiveness. To tackle challenges such as sparse training samples and dynamic candidates, we introduce a matching model. Considering the diverse nature of user feedback, we employ a sequence-level unlikelihood training objective to differentiate feasible sequences from unfeasible ones. Additionally, to overcome the lack of dependency modeling in non-autoregressive models regarding target items, we introduce contrastive decoding to capture correlations among these items. Extensive offline experiments validate the superior performance of NAR4Rec over state-of-the-art reranking methods. Online A/B tests reveal that NAR4Rec significantly enhances the user experience. Furthermore, NAR4Rec has been fully deployed in a popular video app Kuaishou with over 300 million daily active users.
comment: Accepted by KDD 2024
♻ ☆ Incentive-Aware Recommender Systems in Two-Sided Markets
Online platforms in the Internet Economy commonly incorporate recommender systems that recommend products (or "arms") to users (or "agents"). A key challenge in this domain arises from myopic agents who are naturally incentivized to exploit by choosing the optimal arm based on current information, rather than exploring various alternatives to gather information that benefits the collective. We propose a novel recommender system that aligns with agents' incentives while achieving asymptotically optimal performance, as measured by regret in repeated interactions. Our framework models this incentive-aware system as a multi-agent bandit problem in two-sided markets, where the interactions of agents and arms are facilitated by recommender systems on online platforms. This model incorporates incentive constraints induced by agents' opportunity costs. In scenarios where opportunity costs are known to the platform, we show the existence of an incentive-compatible recommendation algorithm. This algorithm pools recommendations between a genuinely good arm and an unknown arm using a randomized and adaptive strategy. Moreover, when these opportunity costs are unknown, we introduce an algorithm that randomly pools recommendations across all arms, utilizing the cumulative loss from each arm as feedback for strategic exploration. We demonstrate that both algorithms satisfy an ex-post fairness criterion, which protects agents from over-exploitation. All code for using the proposed algorithms and reproducing results is made available on GitHub.
♻ ☆ A Comprehensive Survey on AI-based Methods for Patents
Recent advancements in Artificial Intelligence (AI) and machine learning have demonstrated transformative capabilities across diverse domains. This progress extends to the field of patent analysis and innovation, where AI-based tools present opportunities to streamline and enhance important tasks in the patent cycle such as classification, retrieval, and valuation prediction. This not only accelerates the efficiency of patent researchers and applicants but also opens new avenues for technological innovation and discovery. Our survey provides a comprehensive summary of recent AI tools in patent analysis from more than 40 papers from 26 venues between 2017 and 2023. Unlike existing surveys, we include methods that work for patent image and text data. Furthermore, we introduce a novel taxonomy for the categorization based on the tasks in the patent life cycle as well as the specifics of the AI methods. This interdisciplinary survey aims to serve as a resource for researchers and practitioners who are working at the intersection of AI and patent analysis as well as the patent offices that are aiming to build efficient patent systems.
♻ ☆ GPFedRec: Graph-guided Personalization for Federated Recommendation KDD'24
The federated recommendation system is an emerging AI service architecture that provides recommendation services in a privacy-preserving manner. Using user-relation graphs to enhance federated recommendations is a promising topic. However, it is still an open challenge to construct the user-relation graph while preserving data locality-based privacy protection in federated settings. Inspired by a simple motivation, similar users share a similar vision (embeddings) to the same item set, this paper proposes a novel Graph-guided Personalization for Federated Recommendation (GPFedRec). The proposed method constructs a user-relation graph from user-specific personalized item embeddings at the server without accessing the users' interaction records. The personalized item embedding is locally fine-tuned on each device, and then a user-relation graph will be constructed by measuring the similarity among client-specific item embeddings. Without accessing users' historical interactions, we embody the data locality-based privacy protection of vanilla federated learning. Furthermore, a graph-guided aggregation mechanism is designed to leverage the user-relation graph and federated optimization framework simultaneously. Extensive experiments on five benchmark datasets demonstrate GPFedRec's superior performance. The in-depth study validates that GPFedRec can generally improve existing federated recommendation methods as a plugin while keeping user privacy safe. Code is available to ease reproducibility
comment: Accepted as a regular paper of KDD'24
♻ ☆ DELRec: Distilling Sequential Pattern to Enhance LLM-based Recommendation
Sequential recommendation (SR) tasks enhance recommendation accuracy by capturing the connection between users' past interactions and their changing preferences. Conventional models often focus solely on capturing sequential patterns within the training data, neglecting the broader context and semantic information embedded in item titles from external sources. This limits their predictive power and adaptability. Recently, large language models (LLMs) have shown promise in SR tasks due to their advanced understanding capabilities and strong generalization abilities. Researchers have attempted to enhance LLMs' recommendation performance by incorporating information from SR models. However, previous approaches have encountered problems such as 1) only influencing LLMs at the result level; 2) increased complexity of LLMs recommendation methods leading to reduced interpretability; 3) incomplete understanding and utilization of SR models information by LLMs. To address these problems, we proposes a novel framework, DELRec, which aims to extract knowledge from SR models and enable LLMs to easily comprehend and utilize this supplementary information for more effective sequential recommendations. DELRec consists of two main stages: 1) SR Models Pattern Distilling, focusing on extracting behavioral patterns exhibited by SR models using soft prompts through two well-designed strategies; 2) LLMs-based Sequential Recommendation, aiming to fine-tune LLMs to effectively use the distilled auxiliary information to perform SR tasks. Extensive experimental results conducted on three real datasets validate the effectiveness of the DELRec framework.
♻ ☆ CELA: Cost-Efficient Language Model Alignment for CTR Prediction
Click-Through Rate (CTR) prediction holds a paramount position in recommender systems. The prevailing ID-based paradigm underperforms in cold-start scenarios due to the skewed distribution of feature frequency. Additionally, the utilization of a single modality fails to exploit the knowledge contained within textual features. Recent efforts have sought to mitigate these challenges by integrating Pre-trained Language Models (PLMs). They design hard prompts to structure raw features into text for each interaction and then apply PLMs for text processing. With external knowledge and reasoning capabilities, PLMs extract valuable information even in cases of sparse interactions. Nevertheless, compared to ID-based models, pure text modeling degrades the efficacy of collaborative filtering, as well as feature scalability and efficiency during both training and inference. To address these issues, we propose \textbf{C}ost-\textbf{E}fficient \textbf{L}anguage Model \textbf{A}lignment (\textbf{CELA}) for CTR prediction. CELA incorporates textual features and language models while preserving the collaborative filtering capabilities of ID-based models. This model-agnostic framework can be equipped with plug-and-play textual features, with item-level alignment enhancing the utilization of external information while maintaining training and inference efficiency. Through extensive offline experiments, CELA demonstrates superior performance compared to state-of-the-art methods. Furthermore, an online A/B test conducted on an industrial App recommender system showcases its practical effectiveness, solidifying the potential for real-world applications of CELA.
comment: 10 pages, 5 figures
♻ ☆ Self-supervised Graph Neural Network for Mechanical CAD Retrieval
CAD (Computer-Aided Design) plays a crucial role in mechanical industry, where large numbers of similar-shaped CAD parts are often created. Efficiently reusing these parts is key to reducing design and production costs for enterprises. Retrieval systems are vital for achieving CAD reuse, but the complex shapes of CAD models are difficult to accurately describe using text or keywords, making traditional retrieval methods ineffective. While existing representation learning approaches have been developed for CAD, manually labeling similar samples in these methods is expensive. Additionally, CAD models' unique parameterized data structure presents challenges for applying existing 3D shape representation learning techniques directly. In this work, we propose GC-CAD, a self-supervised contrastive graph neural network-based method for mechanical CAD retrieval that directly models parameterized CAD raw files. GC-CAD consists of two key modules: structure-aware representation learning and contrastive graph learning framework. The method leverages graph neural networks to extract both geometric and topological information from CAD models, generating feature representations. We then introduce a simple yet effective contrastive graph learning framework approach, enabling the model to train without manual labels and generate retrieval-ready representations. Experimental results on four datasets including human evaluation demonstrate that the proposed method achieves significant accuracy improvements and up to 100 times efficiency improvement over the baseline methods.
♻ ☆ AutoSurvey: Large Language Models Can Automatically Write Surveys
This paper introduces AutoSurvey, a speedy and well-organized methodology for automating the creation of comprehensive literature surveys in rapidly evolving fields like artificial intelligence. Traditional survey paper creation faces challenges due to the vast volume and complexity of information, prompting the need for efficient survey methods. While large language models (LLMs) offer promise in automating this process, challenges such as context window limitations, parametric knowledge constraints, and the lack of evaluation benchmarks remain. AutoSurvey addresses these challenges through a systematic approach that involves initial retrieval and outline generation, subsection drafting by specialized LLMs, integration and refinement, and rigorous evaluation and iteration. Our contributions include a comprehensive solution to the survey problem, a reliable evaluation method, and experimental validation demonstrating AutoSurvey's effectiveness.We open our resources at \url{https://github.com/AutoSurveys/AutoSurvey}.
♻ ☆ Improving Medical Reasoning through Retrieval and Self-Reflection with Retrieval-Augmented Large Language Models
Recent proprietary large language models (LLMs), such as GPT-4, have achieved a milestone in tackling diverse challenges in the biomedical domain, ranging from multiple-choice questions to long-form generations. To address challenges that still cannot be handled with the encoded knowledge of LLMs, various retrieval-augmented generation (RAG) methods have been developed by searching documents from the knowledge corpus and appending them unconditionally or selectively to the input of LLMs for generation. However, when applying existing methods to different domain-specific problems, poor generalization becomes apparent, leading to fetching incorrect documents or making inaccurate judgments. In this paper, we introduce Self-BioRAG, a framework reliable for biomedical text that specializes in generating explanations, retrieving domain-specific documents, and self-reflecting generated responses. We utilize 84k filtered biomedical instruction sets to train Self-BioRAG that can assess its generated explanations with customized reflective tokens. Our work proves that domain-specific components, such as a retriever, domain-related document corpus, and instruction sets are necessary for adhering to domain-related instructions. Using three major medical question-answering benchmark datasets, experimental results of Self-BioRAG demonstrate significant performance gains by achieving a 7.2% absolute improvement on average over the state-of-the-art open-foundation model with a parameter size of 7B or less. Overall, we analyze that Self-BioRAG finds the clues in the question, retrieves relevant documents if needed, and understands how to answer with information from retrieved documents and encoded knowledge as a medical expert does. We release our data and code for training our framework components and model weights (7B and 13B) to enhance capabilities in biomedical and clinical domains.
comment: ISMB 2024
♻ ☆ General Distribution Learning: A theoretical framework for Deep Learning
There remain numerous unanswered research questions on deep learning (DL) within the classical learning theory framework. These include the remarkable generalization capabilities of overparametrized neural networks (NNs), the efficient optimization performance despite non-convexity of objectives, the mechanism of flat minima for generalization, and the exceptional performance of deep architectures in solving physical problems. This paper introduces General Distribution Learning (GD Learning), a novel theoretical learning framework designed to address a comprehensive range of machine learning and statistical tasks, including classification, regression and parameter estimation. Departing from traditional statistical machine learning, GD Learning focuses on the true underlying distribution. In GD Learning, learning error, corresponding to the expected error in classical statistical learning framework, is divided into fitting errors due to models and algorithms, as well as sampling errors introduced by limited sampling data. The framework significantly incorporates prior knowledge, especially in scenarios characterized by data scarcity, thereby enhancing performance. Within the GD Learning framework, we demonstrate that the global optimal solutions in non-convex optimization can be approached by minimizing the gradient norm and the non-uniformity of the eigenvalues of the model's Jacobian matrix. This insight leads to the development of the gradient structure control algorithm. GD Learning also offers fresh insights into the questions on deep learning, including overparameterization and non-convex optimization, bias-variance trade-off, and the mechanism of flat minima.
comment: arXiv admin note: text overlap with arXiv:2105.04026 by other authors. arXiv admin note: text overlap with arXiv:2105.04026 by other authors
♻ ☆ "It answers questions that I didn't know I had": Ph.D. Students' Evaluation of an Information Sharing Knowledge Graph
Interdisciplinary PhD programs can be challenging as the vital information needed by students may not be readily available, it is scattered across university's websites, while tacit knowledge can be obtained only by interacting with people. Hence, there is a need to develop a knowledge management model to create, query, and maintain a knowledge repository for interdisciplinary students. We propose a knowledge graph containing information on critical categories and their relationships, extracted from multiple sources, essential for interdisciplinary PhD students. This study evaluates the usability of a participatory designed knowledge graph intended to facilitate information exchange and decision-making. The usability findings demonstrate that interaction with this knowledge graph benefits PhD students by notably reducing uncertainty and academic stress, particularly among newcomers. Knowledge graph supported them in decision making, especially when choosing collaborators in an interdisciplinary setting. Key helpful features are related to exploring student faculty networks, milestones tracking, rapid access to aggregated data, and insights into crowdsourced fellow students' activities. The knowledge graph provides a solution to meet the personalized needs of doctoral researchers and has the potential to improve the information discovery and decision-making process substantially. It also includes the tacit knowledge exchange support missing from most current approaches, which is critical for this population and establishing interdisciplinary collaborations. This approach can be applied to other interdisciplinary programs and domains globally.
♻ ☆ Ranking with Ties based on Noisy Performance Data
We consider the problem of ranking a set of objects based on their performance when the measurement of said performance is subject to noise. In this scenario, the performance is measured repeatedly, resulting in a range of measurements for each object. If the ranges of two objects do not overlap, then we consider one object as 'better' than the other, and we expect it to receive a higher rank; if, however, the ranges overlap, then the objects are incomparable, and we wish them to be assigned the same rank. Unfortunately, the incomparability relation of ranges is in general not transitive; as a consequence, in general the two requirements cannot be satisfied simultaneously, i.e., it is not possible to guarantee both distinct ranks for objects with separated ranges, and same rank for objects with overlapping ranges. This conflict leads to more than one reasonable way to rank a set of objects. In this paper, we explore the ambiguities that arise when ranking with ties, and define a set of reasonable rankings, which we call partial rankings. We develop and analyse three different methodologies to compute a partial ranking. Finally, we show how performance differences among objects can be investigated with the help of partial ranking.
comment: 22 pages, 21 figures
Multimedia 7
☆ VIA: A Spatiotemporal Video Adaptation Framework for Global and Local Video Editing
Video editing stands as a cornerstone of digital media, from entertainment and education to professional communication. However, previous methods often overlook the necessity of comprehensively understanding both global and local contexts, leading to inaccurate and inconsistency edits in the spatiotemporal dimension, especially for long videos. In this paper, we introduce VIA, a unified spatiotemporal VIdeo Adaptation framework for global and local video editing, pushing the limits of consistently editing minute-long videos. First, to ensure local consistency within individual frames, the foundation of VIA is a novel test-time editing adaptation method, which adapts a pre-trained image editing model for improving consistency between potential editing directions and the text instruction, and adapts masked latent variables for precise local control. Furthermore, to maintain global consistency over the video sequence, we introduce spatiotemporal adaptation that adapts consistent attention variables in key frames and strategically applies them across the whole sequence to realize the editing effects. Extensive experiments demonstrate that, compared to baseline methods, our VIA approach produces edits that are more faithful to the source videos, more coherent in the spatiotemporal context, and more precise in local control. More importantly, we show that VIA can achieve consistent long video editing in minutes, unlocking the potentials for advanced video editing tasks over long video sequences.
comment: 13 pages, 11 figures
☆ Competitive Learning for Achieving Content-specific Filters in Video Coding for Machines ICIP 2024
This paper investigates the efficacy of jointly optimizing content-specific post-processing filters to adapt a human oriented video/image codec into a codec suitable for machine vision tasks. By observing that artifacts produced by video/image codecs are content-dependent, we propose a novel training strategy based on competitive learning principles. This strategy assigns training samples to filters dynamically, in a fuzzy manner, which further optimizes the winning filter on the given sample. Inspired by simulated annealing optimization techniques, we employ a softmax function with a temperature variable as the weight allocation function to mitigate the effects of random initialization. Our evaluation, conducted on a system utilizing multiple post-processing filters within a Versatile Video Coding (VVC) codec framework, demonstrates the superiority of content-specific filters trained with our proposed strategies, specifically, when images are processed in blocks. Using VVC reference software VTM 12.0 as the anchor, experiments on the OpenImages dataset show an improvement in the BD-rate reduction from -41.3% and -44.6% to -42.3% and -44.7% for object detection and instance segmentation tasks, respectively, compared to independently trained filters. The statistics of the filter usage align with our hypothesis and underscore the importance of jointly optimizing filters for both content and reconstruction quality. Our findings pave the way for further improving the performance of video/image codecs.
comment: Accepted to be preseneted in ICIP 2024
☆ Enhancing Visible-Infrared Person Re-identification with Modality- and Instance-aware Visual Prompt Learning ICMR'24
The Visible-Infrared Person Re-identification (VI ReID) aims to match visible and infrared images of the same pedestrians across non-overlapped camera views. These two input modalities contain both invariant information, such as shape, and modality-specific details, such as color. An ideal model should utilize valuable information from both modalities during training for enhanced representational capability. However, the gap caused by modality-specific information poses substantial challenges for the VI ReID model to handle distinct modality inputs simultaneously. To address this, we introduce the Modality-aware and Instance-aware Visual Prompts (MIP) network in our work, designed to effectively utilize both invariant and specific information for identification. Specifically, our MIP model is built on the transformer architecture. In this model, we have designed a series of modality-specific prompts, which could enable our model to adapt to and make use of the specific information inherent in different modality inputs, thereby reducing the interference caused by the modality gap and achieving better identification. Besides, we also employ each pedestrian feature to construct a group of instance-specific prompts. These customized prompts are responsible for guiding our model to adapt to each pedestrian instance dynamically, thereby capturing identity-level discriminative clues for identification. Through extensive experiments on SYSU-MM01 and RegDB datasets, the effectiveness of both our designed modules is evaluated. Additionally, our proposed MIP performs better than most state-of-the-art methods.
comment: Accepyed by ACM International Conference on Multimedia Retrieval (ICMR'24)
☆ SFedCA: Credit Assignment-Based Active Client Selection Strategy for Spiking Federated Learning
Spiking federated learning is an emerging distributed learning paradigm that allows resource-constrained devices to train collaboratively at low power consumption without exchanging local data. It takes advantage of both the privacy computation property in federated learning (FL) and the energy efficiency in spiking neural networks (SNN). Thus, it is highly promising to revolutionize the efficient processing of multimedia data. However, existing spiking federated learning methods employ a random selection approach for client aggregation, assuming unbiased client participation. This neglect of statistical heterogeneity affects the convergence and accuracy of the global model significantly. In our work, we propose a credit assignment-based active client selection strategy, the SFedCA, to judiciously aggregate clients that contribute to the global sample distribution balance. Specifically, the client credits are assigned by the firing intensity state before and after local model training, which reflects the local data distribution difference from the global model. Comprehensive experiments are conducted on various non-identical and independent distribution (non-IID) scenarios. The experimental results demonstrate that the SFedCA outperforms the existing state-of-the-art spiking federated learning methods, and requires fewer communication rounds.
comment: 9 pages
♻ ☆ Evaluating Text-to-Visual Generation with Image-to-Text Generation
Despite significant progress in generative AI, comprehensive evaluation remains challenging because of the lack of effective metrics and standardized benchmarks. For instance, the widely-used CLIPScore measures the alignment between a (generated) image and text prompt, but it fails to produce reliable scores for complex prompts involving compositions of objects, attributes, and relations. One reason is that text encoders of CLIP can notoriously act as a "bag of words", conflating prompts such as "the horse is eating the grass" with "the grass is eating the horse". To address this, we introduce the VQAScore, which uses a visual-question-answering (VQA) model to produce an alignment score by computing the probability of a "Yes" answer to a simple "Does this figure show '{text}'?" question. Though simpler than prior art, VQAScore computed with off-the-shelf models produces state-of-the-art results across many (8) image-text alignment benchmarks. We also compute VQAScore with an in-house model that follows best practices in the literature. For example, we use a bidirectional image-question encoder that allows image embeddings to depend on the question being asked (and vice versa). Our in-house model, CLIP-FlanT5, outperforms even the strongest baselines that make use of the proprietary GPT-4V. Interestingly, although we train with only images, VQAScore can also align text with video and 3D models. VQAScore allows researchers to benchmark text-to-visual generation using complex texts that capture the compositional structure of real-world prompts. We introduce GenAI-Bench, a more challenging benchmark with 1,600 compositional text prompts that require parsing scenes, objects, attributes, relationships, and high-order reasoning like comparison and logic. GenAI-Bench also offers over 15,000 human ratings for leading image and video generation models such as Stable Diffusion, DALL-E 3, and Gen2.
comment: We open-source our data, model, and code at: https://github.com/linzhiqiu/t2v_metrics ; Project page: https://linzhiqiu.github.io/papers/vqascore
♻ ☆ CtrSVDD: A Benchmark Dataset and Baseline Analysis for Controlled Singing Voice Deepfake Detection
Recent singing voice synthesis and conversion advancements necessitate robust singing voice deepfake detection (SVDD) models. Current SVDD datasets face challenges due to limited controllability, diversity in deepfake methods, and licensing restrictions. Addressing these gaps, we introduce CtrSVDD, a large-scale, diverse collection of bonafide and deepfake singing vocals. These vocals are synthesized using state-of-the-art methods from publicly accessible singing voice datasets. CtrSVDD includes 47.64 hours of bonafide and 260.34 hours of deepfake singing vocals, spanning 14 deepfake methods and involving 164 singer identities. We also present a baseline system with flexible front-end features, evaluated against a structured train/dev/eval split. The experiments show the importance of feature selection and highlight a need for generalization towards deepfake methods that deviate further from training distribution. The CtrSVDD dataset and baselines are publicly accessible.
comment: Accepted by Interspeech 2024
♻ ☆ Attack on Scene Flow using Point Clouds
Deep neural networks have made significant advancements in accurately estimating scene flow using point clouds, which is vital for many applications like video analysis, action recognition, and navigation. The robustness of these techniques, however, remains a concern, particularly in the face of adversarial attacks that have been proven to deceive state-of-the-art deep neural networks in many domains. Surprisingly, the robustness of scene flow networks against such attacks has not been thoroughly investigated. To address this problem, the proposed approach aims to bridge this gap by introducing adversarial white-box attacks specifically tailored for scene flow networks. Experimental results show that the generated adversarial examples obtain up to 33.7 relative degradation in average end-point error on the KITTI and FlyingThings3D datasets. The study also reveals the significant impact that attacks targeting point clouds in only one dimension or color channel have on average end-point error. Analyzing the success and failure of these attacks on the scene flow networks and their 2D optical flow network variants shows a higher vulnerability for the optical flow networks.
Computer Vision and Pattern Recognition 160
☆ LLaNA: Large Language and NeRF Assistant
Multimodal Large Language Models (MLLMs) have demonstrated an excellent understanding of images and 3D data. However, both modalities have shortcomings in holistically capturing the appearance and geometry of objects. Meanwhile, Neural Radiance Fields (NeRFs), which encode information within the weights of a simple Multi-Layer Perceptron (MLP), have emerged as an increasingly widespread modality that simultaneously encodes the geometry and photorealistic appearance of objects. This paper investigates the feasibility and effectiveness of ingesting NeRF into MLLM. We create LLaNA, the first general-purpose NeRF-language assistant capable of performing new tasks such as NeRF captioning and Q\&A. Notably, our method directly processes the weights of the NeRF's MLP to extract information about the represented objects without the need to render images or materialize 3D data structures. Moreover, we build a dataset of NeRFs with text annotations for various NeRF-language tasks with no human intervention. Based on this dataset, we develop a benchmark to evaluate the NeRF understanding capability of our method. Results show that processing NeRF weights performs favourably against extracting 2D or 3D representations from NeRFs.
comment: Under review. Project page: https://andreamaduzzi.github.io/llana/
☆ Autoregressive Image Generation without Vector Quantization
Conventional wisdom holds that autoregressive models for image generation are typically accompanied by vector-quantized tokens. We observe that while a discrete-valued space can facilitate representing a categorical distribution, it is not a necessity for autoregressive modeling. In this work, we propose to model the per-token probability distribution using a diffusion procedure, which allows us to apply autoregressive models in a continuous-valued space. Rather than using categorical cross-entropy loss, we define a Diffusion Loss function to model the per-token probability. This approach eliminates the need for discrete-valued tokenizers. We evaluate its effectiveness across a wide range of cases, including standard autoregressive models and generalized masked autoregressive (MAR) variants. By removing vector quantization, our image generator achieves strong results while enjoying the speed advantage of sequence modeling. We hope this work will motivate the use of autoregressive generation in other continuous-valued domains and applications.
comment: Tech report
☆ mDPO: Conditional Preference Optimization for Multimodal Large Language Models
Direct preference optimization (DPO) has shown to be an effective method for large language model (LLM) alignment. Recent works have attempted to apply DPO to multimodal scenarios but have found it challenging to achieve consistent improvement. Through a comparative experiment, we identify the unconditional preference problem in multimodal preference optimization, where the model overlooks the image condition. To address this problem, we propose mDPO, a multimodal DPO objective that prevents the over-prioritization of language-only preferences by also optimizing image preference. Moreover, we introduce a reward anchor that forces the reward to be positive for chosen responses, thereby avoiding the decrease in their likelihood -- an intrinsic problem of relative preference optimization. Experiments on two multimodal LLMs of different sizes and three widely used benchmarks demonstrate that mDPO effectively addresses the unconditional preference problem in multimodal preference optimization and significantly improves model performance, particularly in reducing hallucination.
☆ Scaling the Codebook Size of VQGAN to 100,000 with a Utilization Rate of 99%
In the realm of image quantization exemplified by VQGAN, the process encodes images into discrete tokens drawn from a codebook with a predefined size. Recent advancements, particularly with LLAMA 3, reveal that enlarging the codebook significantly enhances model performance. However, VQGAN and its derivatives, such as VQGAN-FC (Factorized Codes) and VQGAN-EMA, continue to grapple with challenges related to expanding the codebook size and enhancing codebook utilization. For instance, VQGAN-FC is restricted to learning a codebook with a maximum size of 16,384, maintaining a typically low utilization rate of less than 12% on ImageNet. In this work, we propose a novel image quantization model named VQGAN-LC (Large Codebook), which extends the codebook size to 100,000, achieving an utilization rate exceeding 99%. Unlike previous methods that optimize each codebook entry, our approach begins with a codebook initialized with 100,000 features extracted by a pre-trained vision encoder. Optimization then focuses on training a projector that aligns the entire codebook with the feature distributions of the encoder in VQGAN-LC. We demonstrate the superior performance of our model over its counterparts across a variety of tasks, including image reconstruction, image classification, auto-regressive image generation using GPT, and image creation with diffusion- and flow-based generative models. Code and models are available at https://github.com/zh460045050/VQGAN-LC.
☆ OoDIS: Anomaly Instance Segmentation Benchmark CVPR 2024
Autonomous vehicles require a precise understanding of their environment to navigate safely. Reliable identification of unknown objects, especially those that are absent during training, such as wild animals, is critical due to their potential to cause serious accidents. Significant progress in semantic segmentation of anomalies has been driven by the availability of out-of-distribution (OOD) benchmarks. However, a comprehensive understanding of scene dynamics requires the segmentation of individual objects, and thus the segmentation of instances is essential. Development in this area has been lagging, largely due to the lack of dedicated benchmarks. To address this gap, we have extended the most commonly used anomaly segmentation benchmarks to include the instance segmentation task. Our evaluation of anomaly instance segmentation methods shows that this challenge remains an unsolved problem. The benchmark website and the competition page can be found at: https://vision.rwth-aachen.de/oodis .
comment: Accepted at the VAND 2.0 Workshop at CVPR 2024. Project page: https://vision.rwth-aachen.de/oodis
☆ RetinaGS: Scalable Training for Dense Scene Rendering with Billion-Scale 3D Gaussians
In this work, we explore the possibility of training high-parameter 3D Gaussian splatting (3DGS) models on large-scale, high-resolution datasets. We design a general model parallel training method for 3DGS, named RetinaGS, which uses a proper rendering equation and can be applied to any scene and arbitrary distribution of Gaussian primitives. It enables us to explore the scaling behavior of 3DGS in terms of primitive numbers and training resolutions that were difficult to explore before and surpass previous state-of-the-art reconstruction quality. We observe a clear positive trend of increasing visual quality when increasing primitive numbers with our method. We also demonstrate the first attempt at training a 3DGS model with more than one billion primitives on the full MatrixCity dataset that attains a promising visual quality.
☆ MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs
Generating natural and meaningful responses to communicate with multi-modal human inputs is a fundamental capability of Large Vision-Language Models(LVLMs). While current open-source LVLMs demonstrate promising performance in simplified scenarios such as single-turn single-image input, they fall short in real-world conversation scenarios such as following instructions in a long context history with multi-turn and multi-images. Existing LVLM benchmarks primarily focus on single-choice questions or short-form responses, which do not adequately assess the capabilities of LVLMs in real-world human-AI interaction applications. Therefore, we introduce MMDU, a comprehensive benchmark, and MMDU-45k, a large-scale instruction tuning dataset, designed to evaluate and improve LVLMs' abilities in multi-turn and multi-image conversations. We employ the clustering algorithm to ffnd the relevant images and textual descriptions from the open-source Wikipedia and construct the question-answer pairs by human annotators with the assistance of the GPT-4o model. MMDU has a maximum of 18k image+text tokens, 20 images, and 27 turns, which is at least 5x longer than previous benchmarks and poses challenges to current LVLMs. Our in-depth analysis of 15 representative LVLMs using MMDU reveals that open-source LVLMs lag behind closed-source counterparts due to limited conversational instruction tuning data. We demonstrate that ffne-tuning open-source LVLMs on MMDU-45k signiffcantly address this gap, generating longer and more accurate conversations, and improving scores on MMDU and existing benchmarks (MMStar: +1.1%, MathVista: +1.5%, ChartQA:+1.2%). Our contributions pave the way for bridging the gap between current LVLM models and real-world application demands. This project is available at https://github.com/Liuziyu77/MMDU.
comment: This project is available at https://github.com/Liuziyu77/MMDU
☆ Unveiling Encoder-Free Vision-Language Models
Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features followed by large language models (LLMs) for visual-language tasks. However, the vision encoders set a strong inductive bias in abstracting visual representation, e.g., resolution, aspect ratio, and semantic priors, which could impede the flexibility and efficiency of the VLMs. Training pure VLMs that accept the seamless vision and language inputs, i.e., without vision encoders, remains challenging and rarely explored. Empirical observations reveal that direct training without encoders results in slow convergence and large performance gaps. In this work, we bridge the gap between encoder-based and encoder-free models, and present a simple yet effective training recipe towards pure VLMs. Specifically, we unveil the key aspects of training encoder-free VLMs efficiently via thorough experiments: (1) Bridging vision-language representation inside one unified decoder; (2) Enhancing visual recognition capability via extra supervision. With these strategies, we launch EVE, an encoder-free vision-language model that can be trained and forwarded efficiently. Notably, solely utilizing 35M publicly accessible data, EVE can impressively rival the encoder-based VLMs of similar capacities across multiple vision-language benchmarks. It significantly outperforms the counterpart Fuyu-8B with mysterious training procedures and undisclosed training data. We believe that EVE provides a transparent and efficient route for developing a pure decoder-only architecture across modalities. Our code and models are publicly available at: https://github.com/baaivision/EVE.
comment: 16 pages, 7 figures
☆ Exploring the Role of Large Language Models in Prompt Encoding for Diffusion Models
Large language models (LLMs) based on decoder-only transformers have demonstrated superior text understanding capabilities compared to CLIP and T5-series models. However, the paradigm for utilizing current advanced LLMs in text-to-image diffusion models remains to be explored. We observed an unusual phenomenon: directly using a large language model as the prompt encoder significantly degrades the prompt-following ability in image generation. We identified two main obstacles behind this issue. One is the misalignment between the next token prediction training in LLM and the requirement for discriminative prompt features in diffusion models. The other is the intrinsic positional bias introduced by the decoder-only architecture. To deal with this issue, we propose a novel framework to fully harness the capabilities of LLMs. Through the carefully designed usage guidance, we effectively enhance the text representation capability for prompt encoding and eliminate its inherent positional bias. This allows us to integrate state-of-the-art LLMs into the text-to-image generation model flexibly. Furthermore, we also provide an effective manner to fuse multiple LLMs into our framework. Considering the excellent performance and scaling capabilities demonstrated by the transformer architecture, we further design an LLM-Infused Diffusion Transformer (LI-DiT) based on the framework. We conduct extensive experiments to validate LI-DiT across model size and data size. Benefiting from the inherent ability of the LLMs and our innovative designs, the prompt understanding performance of LI-DiT easily surpasses state-of-the-art open-source models as well as mainstream closed-source commercial models including Stable Diffusion 3, DALL-E 3, and Midjourney V6. The powerful LI-DiT-10B will be available after further optimization and security checks.
☆ Infinigen Indoors: Photorealistic Indoor Scenes using Procedural Generation CVPR 2024
We introduce Infinigen Indoors, a Blender-based procedural generator of photorealistic indoor scenes. It builds upon the existing Infinigen system, which focuses on natural scenes, but expands its coverage to indoor scenes by introducing a diverse library of procedural indoor assets, including furniture, architecture elements, appliances, and other day-to-day objects. It also introduces a constraint-based arrangement system, which consists of a domain-specific language for expressing diverse constraints on scene composition, and a solver that generates scene compositions that maximally satisfy the constraints. We provide an export tool that allows the generated 3D objects and scenes to be directly used for training embodied agents in real-time simulators such as Omniverse and Unreal. Infinigen Indoors is open-sourced under the BSD license. Please visit https://infinigen.org for code and videos.
comment: Accepted to CVPR 2024
☆ On Efficient Language and Vision Assistants for Visually-Situated Natural Language Understanding: What Matters in Reading and Reasoning
Recent advancements in language and vision assistants have showcased impressive capabilities but suffer from a lack of transparency, limiting broader research and reproducibility. While open-source models handle general image tasks effectively, they face challenges with the high computational demands of complex visually-situated text understanding. Such tasks often require increased token inputs and large vision modules to harness high-resolution information. Striking a balance between model size and data importance remains an open question. This study aims to redefine the design of vision-language models by identifying key components and creating efficient models with constrained inference costs. By strategically formulating datasets, optimizing vision modules, and enhancing supervision techniques, we achieve significant improvements in inference throughput while maintaining high performance. Extensive experiments across models ranging from 160M to 13B parameters offer insights into model optimization. We will fully open-source our codebase, models, and datasets at https://github.com/naver-ai/elva .
comment: 17 pages, 8 figures
☆ Composing Object Relations and Attributes for Image-Text Matching CVPR'24
We study the visual semantic embedding problem for image-text matching. Most existing work utilizes a tailored cross-attention mechanism to perform local alignment across the two image and text modalities. This is computationally expensive, even though it is more powerful than the unimodal dual-encoder approach. This work introduces a dual-encoder image-text matching model, leveraging a scene graph to represent captions with nodes for objects and attributes interconnected by relational edges. Utilizing a graph attention network, our model efficiently encodes object-attribute and object-object semantic relations, resulting in a robust and fast-performing system. Representing caption as a scene graph offers the ability to utilize the strong relational inductive bias of graph neural networks to learn object-attribute and object-object relations effectively. To train the model, we propose losses that align the image and caption both at the holistic level (image-caption) and the local level (image-object entity), which we show is key to the success of the model. Our model is termed Composition model for Object Relations and Attributes, CORA. Experimental results on two prominent image-text retrieval benchmarks, Flickr30K and MSCOCO, demonstrate that CORA outperforms existing state-of-the-art computationally expensive cross-attention methods regarding recall score while achieving fast computation speed of the dual encoder.
comment: Accepted to CVPR'24
☆ MegaScenes: Scene-Level View Synthesis at Scale
Scene-level novel view synthesis (NVS) is fundamental to many vision and graphics applications. Recently, pose-conditioned diffusion models have led to significant progress by extracting 3D information from 2D foundation models, but these methods are limited by the lack of scene-level training data. Common dataset choices either consist of isolated objects (Objaverse), or of object-centric scenes with limited pose distributions (DTU, CO3D). In this paper, we create a large-scale scene-level dataset from Internet photo collections, called MegaScenes, which contains over 100K structure from motion (SfM) reconstructions from around the world. Internet photos represent a scalable data source but come with challenges such as lighting and transient objects. We address these issues to further create a subset suitable for the task of NVS. Additionally, we analyze failure cases of state-of-the-art NVS methods and significantly improve generation consistency. Through extensive experiments, we validate the effectiveness of both our dataset and method on generating in-the-wild scenes. For details on the dataset and code, see our project page at https://megascenes.github.io .
comment: Our project page is at https://megascenes.github.io
☆ VideoLLM-online: Online Video Large Language Model for Streaming Video CVPR 2024
Recent Large Language Models have been enhanced with vision capabilities, enabling them to comprehend images, videos, and interleaved vision-language content. However, the learning methods of these large multimodal models typically treat videos as predetermined clips, making them less effective and efficient at handling streaming video inputs. In this paper, we propose a novel Learning-In-Video-Stream (LIVE) framework, which enables temporally aligned, long-context, and real-time conversation within a continuous video stream. Our LIVE framework comprises comprehensive approaches to achieve video streaming dialogue, encompassing: (1) a training objective designed to perform language modeling for continuous streaming inputs, (2) a data generation scheme that converts offline temporal annotations into a streaming dialogue format, and (3) an optimized inference pipeline to speed up the model responses in real-world video streams. With our LIVE framework, we built VideoLLM-online model upon Llama-2/Llama-3 and demonstrate its significant advantages in processing streaming videos. For instance, on average, our model can support streaming dialogue in a 5-minute video clip at over 10 FPS on an A100 GPU. Moreover, it also showcases state-of-the-art performance on public offline video benchmarks, such as recognition, captioning, and forecasting. The code, model, data, and demo have been made available at https://showlab.github.io/videollm-online.
comment: CVPR 2024. This arxiv version is upgraded with Llama-3
☆ LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
In recent years, instruction-tuned Large Multimodal Models (LMMs) have been successful at several tasks, including image captioning and visual question answering; yet leveraging these models remains an open question for robotics. Prior LMMs for robotics applications have been extensively trained on language and action data, but their ability to generalize in different settings has often been less than desired. To address this, we introduce LLARVA, a model trained with a novel instruction tuning method that leverages structured prompts to unify a range of robotic learning tasks, scenarios, and environments. Additionally, we show that predicting intermediate 2-D representations, which we refer to as "visual traces", can help further align vision and action spaces for robot learning. We generate 8.5M image-visual trace pairs from the Open X-Embodiment dataset in order to pre-train our model, and we evaluate on 12 different tasks in the RLBench simulator as well as a physical Franka Emika Panda 7-DoF robot. Our experiments yield strong performance, demonstrating that LLARVA - using 2-D and language representations - performs well compared to several contemporary baselines, and can generalize across various robot environments and configurations.
☆ Faces of Experimental Pain: Transferability of Deep Learned Heat Pain Features to Electrical Pain
The limited size of pain datasets are a challenge in developing robust deep learning models for pain recognition. Transfer learning approaches are often employed in these scenarios. In this study, we investigate whether deep learned feature representation for one type of experimentally induced pain can be transferred to another. Participating in the AI4Pain challenge, our goal is to classify three levels of pain (No-Pain, Low-Pain, High-Pain). The challenge dataset contains data collected from 65 participants undergoing varying intensities of electrical pain. We utilize the video recording from the dataset to investigate the transferability of deep learned heat pain model to electrical pain. In our proposed approach, we leverage an existing heat pain convolutional neural network (CNN) - trained on BioVid dataset - as a feature extractor. The images from the challenge dataset are inputted to the pre-trained heat pain CNN to obtain feature vectors. These feature vectors are used to train two machine learning models: a simple feed-forward neural network and a long short-term memory (LSTM) network. Our approach was tested using the dataset's predefined training, validation, and testing splits. Our models outperformed the baseline of the challenge on both the validation and tests sets, highlighting the potential of models trained on other pain datasets for reliable feature extraction.
☆ PhyBench: A Physical Commonsense Benchmark for Evaluating Text-to-Image Models
Text-to-image (T2I) models have made substantial progress in generating images from textual prompts. However, they frequently fail to produce images consistent with physical commonsense, a vital capability for applications in world simulation and everyday tasks. Current T2I evaluation benchmarks focus on metrics such as accuracy, bias, and safety, neglecting the evaluation of models' internal knowledge, particularly physical commonsense. To address this issue, we introduce PhyBench, a comprehensive T2I evaluation dataset comprising 700 prompts across 4 primary categories: mechanics, optics, thermodynamics, and material properties, encompassing 31 distinct physical scenarios. We assess 6 prominent T2I models, including proprietary models DALLE3 and Gemini, and demonstrate that incorporating physical principles into prompts enhances the models' ability to generate physically accurate images. Our findings reveal that: (1) even advanced models frequently err in various physical scenarios, except for optics; (2) GPT-4o, with item-specific scoring instructions, effectively evaluates the models' understanding of physical commonsense, closely aligning with human assessments; and (3) current T2I models are primarily focused on text-to-image translation, lacking profound reasoning regarding physical commonsense. We advocate for increased attention to the inherent knowledge within T2I models, beyond their utility as mere image generation tools. The code and data are available at https://github.com/OpenGVLab/PhyBench.
☆ Mix-Domain Contrastive Learning for Unpaired H&E-to-IHC Stain Translation
H&E-to-IHC stain translation techniques offer a promising solution for precise cancer diagnosis, especially in low-resource regions where there is a shortage of health professionals and limited access to expensive equipment. Considering the pixel-level misalignment of H&E-IHC image pairs, current research explores the pathological consistency between patches from the same positions of the image pair. However, most of them overemphasize the correspondence between domains or patches, overlooking the side information provided by the non-corresponding objects. In this paper, we propose a Mix-Domain Contrastive Learning (MDCL) method to leverage the supervision information in unpaired H&E-to-IHC stain translation. Specifically, the proposed MDCL method aggregates the inter-domain and intra-domain pathology information by estimating the correlation between the anchor patch and all the patches from the matching images, encouraging the network to learn additional contrastive knowledge from mixed domains. With the mix-domain pathology information aggregation, MDCL enhances the pathological consistency between the corresponding patches and the component discrepancy of the patches from the different positions of the generated IHC image. Extensive experiments on two H&E-to-IHC stain translation datasets, namely MIST and BCI, demonstrate that the proposed method achieves state-of-the-art performance across multiple metrics.
☆ A Brief Survey on Leveraging Large Scale Vision Models for Enhanced Robot Grasping
Robotic grasping presents a difficult motor task in real-world scenarios, constituting a major hurdle to the deployment of capable robots across various industries. Notably, the scarcity of data makes grasping particularly challenging for learned models. Recent advancements in computer vision have witnessed a growth of successful unsupervised training mechanisms predicated on massive amounts of data sourced from the Internet, and now nearly all prominent models leverage pretrained backbone networks. Against this backdrop, we begin to investigate the potential benefits of large-scale visual pretraining in enhancing robot grasping performance. This preliminary literature review sheds light on critical challenges and delineates prospective directions for future research in visual pretraining for robotic manipulation.
comment: This report was written in February 2023, thus does not account for any works since then
☆ Task Me Anything
Benchmarks for large multimodal language models (MLMs) now serve to simultaneously assess the general capabilities of models instead of evaluating for a specific capability. As a result, when a developer wants to identify which models to use for their application, they are overwhelmed by the number of benchmarks and remain uncertain about which benchmark's results are most reflective of their specific use case. This paper introduces Task-Me-Anything, a benchmark generation engine which produces a benchmark tailored to a user's needs. Task-Me-Anything maintains an extendable taxonomy of visual assets and can programmatically generate a vast number of task instances. Additionally, it algorithmically addresses user queries regarding MLM performance efficiently within a computational budget. It contains 113K images, 10K videos, 2K 3D object assets, over 365 object categories, 655 attributes, and 335 relationships. It can generate 750M image/video question-answering pairs, which focus on evaluating MLM perceptual capabilities. Task-Me-Anything reveals critical insights: open-source MLMs excel in object and attribute recognition but lack spatial and temporal understanding; each model exhibits unique strengths and weaknesses; larger models generally perform better, though exceptions exist; and GPT4o demonstrates challenges in recognizing rotating/moving objects and distinguishing colors.
comment: website: https://www.task-me-anything.org
☆ Deep Learning methodology for the identification of wood species using high-resolution macroscopic images
Significant advancements in the field of wood species identification are needed worldwide to support sustainable timber trade. In this work we contribute to automate the identification of wood species via high-resolution macroscopic images of timber. The main challenge of this problem is that fine-grained patterns in timber are crucial in order to accurately identify wood species, and these patterns are not properly learned by traditional convolutional neural networks (CNNs) trained on low/medium resolution images. We propose a Timber Deep Learning Identification with Patch-based Inference Voting methodology, abbreviated TDLI-PIV methodology. Our proposal exploits the concept of patching and the availability of high-resolution macroscopic images of timber in order to overcome the inherent challenges that CNNs face in timber identification. The TDLI-PIV methodology is able to capture fine-grained patterns in timber and, moreover, boosts robustness and prediction accuracy via a collaborative voting inference process. In this work we also introduce a new data set of marcroscopic images of timber, called GOIMAI-Phase-I, which has been obtained using optical magnification in order to capture fine-grained details, which contrasts to the other datasets that are publicly available. More concretely, images in GOIMAI-Phase-I are taken with a smartphone with a 24x magnifying lens attached to the camera. Our data set contains 2120 images of timber and covers 37 legally protected wood species. Our experiments have assessed the performance of the TDLI-PIV methodology, involving the comparison with other methodologies available in the literature, exploration of data augmentation methods and the effect that the dataset size has on the accuracy of TDLI-PIV.
comment: 17 pages and 6 figures
☆ Solving Vision Tasks with Simple Photoreceptors Instead of Cameras
A de facto standard in solving computer vision problems is to use a common high-resolution camera and choose its placement on an agent (i.e., position and orientation) based on human intuition. On the other hand, extremely simple and well-designed visual sensors found throughout nature allow many organisms to perform diverse, complex behaviors. In this work, motivated by these examples, we raise the following questions: 1. How effective simple visual sensors are in solving vision tasks? 2. What role does their design play in their effectiveness? We explore simple sensors with resolutions as low as one-by-one pixel, representing a single photoreceptor First, we demonstrate that just a few photoreceptors can be enough to solve many tasks, such as visual navigation and continuous control, reasonably well, with performance comparable to that of a high-resolution camera. Second, we show that the design of these simple visual sensors plays a crucial role in their ability to provide useful information and successfully solve these tasks. To find a well-performing design, we present a computational design optimization algorithm and evaluate its effectiveness across different tasks and domains, showing promising results. Finally, we perform a human survey to evaluate the effectiveness of intuitive designs devised manually by humans, showing that the computationally found design is among the best designs in most cases.
☆ Matching Query Image Against Selected NeRF Feature for Efficient and Scalable Localization
Neural implicit representations such as NeRF have revolutionized 3D scene representation with photo-realistic quality. However, existing methods for visual localization within NeRF representations suffer from inefficiency and scalability issues, particularly in large-scale environments. This work proposes MatLoc-NeRF, a novel matching-based localization framework using selected NeRF features. It addresses efficiency by employing a learnable feature selection mechanism that identifies informative NeRF features for matching with query images. This eliminates the need for all NeRF features or additional descriptors, leading to faster and more accurate pose estimation. To tackle large-scale scenes, MatLoc-NeRF utilizes a pose-aware scene partitioning strategy. It ensures that only the most relevant NeRF sub-block generates key features for a specific pose. Additionally, scene segmentation and a place predictor provide fast coarse initial pose estimation. Evaluations on public large-scale datasets demonstrate that MatLoc-NeRF achieves superior efficiency and accuracy compared to existing NeRF-based localization methods.
comment: 12 pages, 2 figures
☆ Domain Generalization for In-Orbit 6D Pose Estimation
We address the problem of estimating the relative 6D pose, i.e., position and orientation, of a target spacecraft, from a monocular image, a key capability for future autonomous Rendezvous and Proximity Operations. Due to the difficulty of acquiring large sets of real images, spacecraft pose estimation networks are exclusively trained on synthetic ones. However, because those images do not capture the illumination conditions encountered in orbit, pose estimation networks face a domain gap problem, i.e., they do not generalize to real images. Our work introduces a method that bridges this domain gap. It relies on a novel, end-to-end, neural-based architecture as well as a novel learning strategy. This strategy improves the domain generalization abilities of the network through multi-task learning and aggressive data augmentation policies, thereby enforcing the network to learn domain-invariant features. We demonstrate that our method effectively closes the domain gap, achieving state-of-the-art accuracy on the widespread SPEED+ dataset. Finally, ablation studies assess the impact of key components of our method on its generalization abilities.
☆ V3Det Challenge 2024 on Vast Vocabulary and Open Vocabulary Object Detection: Methods and Results
Detecting objects in real-world scenes is a complex task due to various challenges, including the vast range of object categories, and potential encounters with previously unknown or unseen objects. The challenges necessitate the development of public benchmarks and challenges to advance the field of object detection. Inspired by the success of previous COCO and LVIS Challenges, we organize the V3Det Challenge 2024 in conjunction with the 4th Open World Vision Workshop: Visual Perception via Learning in an Open World (VPLOW) at CVPR 2024, Seattle, US. This challenge aims to push the boundaries of object detection research and encourage innovation in this field. The V3Det Challenge 2024 consists of two tracks: 1) Vast Vocabulary Object Detection: This track focuses on detecting objects from a large set of 13204 categories, testing the detection algorithm's ability to recognize and locate diverse objects. 2) Open Vocabulary Object Detection: This track goes a step further, requiring algorithms to detect objects from an open set of categories, including unknown objects. In the following sections, we will provide a comprehensive summary and analysis of the solutions submitted by participants. By analyzing the methods and solutions presented, we aim to inspire future research directions in vast vocabulary and open-vocabulary object detection, driving progress in this field. Challenge homepage: https://v3det.openxlab.org.cn/challenge
☆ InterNeRF: Scaling Radiance Fields via Parameter Interpolation CVPR 2024
Neural Radiance Fields (NeRFs) have unmatched fidelity on large, real-world scenes. A common approach for scaling NeRFs is to partition the scene into regions, each of which is assigned its own parameters. When implemented naively, such an approach is limited by poor test-time scaling and inconsistent appearance and geometry. We instead propose InterNeRF, a novel architecture for rendering a target view using a subset of the model's parameters. Our approach enables out-of-core training and rendering, increasing total model capacity with only a modest increase to training time. We demonstrate significant improvements in multi-room scenes while remaining competitive on standard benchmarks.
comment: Presented at CVPR 2024 Neural Rendering Intelligence Workshop
☆ Correspondence Free Multivector Cloud Registration using Conformal Geometric Algebra
We present, for the first time, a novel theoretical approach to address the problem of correspondence free multivector cloud registration in conformal geometric algebra. Such formalism achieves several favorable properties. Primarily, it forms an orthogonal automorphism that extends beyond the typical vector space to the entire conformal geometric algebra while respecting the multivector grading. Concretely, the registration can be viewed as an orthogonal transformation (\it i.e., scale, translation, rotation) belonging to $SO(4,1)$ - group of special orthogonal transformations in conformal geometric algebra. We will show that such formalism is able to: $(i)$ perform the registration without directly accessing the input multivectors. Instead, we use primitives or geometric objects provided by the conformal model - the multivectors, $(ii)$ the geometric objects are obtained by solving a multilinear eigenvalue problem to find sets of eigenmultivectors. In this way, we can explicitly avoid solving the correspondences in the registration process. Most importantly, this offers rotation and translation equivariant properties between the input multivectors and the eigenmultivectors. Experimental evaluation is conducted in datasets commonly used in point cloud registration, to testify the usefulness of the approach with emphasis to ambiguities arising from high levels of noise. The code is available at https://github.com/Numerical-Geometric-Algebra/RegistrationGA . This work was submitted to the International Journal of Computer Vision and is currently under review.
☆ Latent Denoising Diffusion GAN: Faster sampling, Higher image quality
Diffusion models are emerging as powerful solutions for generating high-fidelity and diverse images, often surpassing GANs under many circumstances. However, their slow inference speed hinders their potential for real-time applications. To address this, DiffusionGAN leveraged a conditional GAN to drastically reduce the denoising steps and speed up inference. Its advancement, Wavelet Diffusion, further accelerated the process by converting data into wavelet space, thus enhancing efficiency. Nonetheless, these models still fall short of GANs in terms of speed and image quality. To bridge these gaps, this paper introduces the Latent Denoising Diffusion GAN, which employs pre-trained autoencoders to compress images into a compact latent space, significantly improving inference speed and image quality. Furthermore, we propose a Weighted Learning strategy to enhance diversity and image quality. Experimental results on the CIFAR-10, CelebA-HQ, and LSUN-Church datasets prove that our model achieves state-of-the-art running speed among diffusion models. Compared to its predecessors, DiffusionGAN and Wavelet Diffusion, our model shows remarkable improvements in all evaluation metrics. Code and pre-trained checkpoints: \url{https://github.com/thanhluantrinh/LDDGAN.git}
comment: Submited to IEEE Access
☆ OGNI-DC: Robust Depth Completion with Optimization-Guided Neural Iterations
Depth completion is the task of generating a dense depth map given an image and a sparse depth map as inputs. It has important applications in various downstream tasks. In this paper, we present OGNI-DC, a novel framework for depth completion. The key to our method is "Optimization-Guided Neural Iterations" (OGNI). It consists of a recurrent unit that refines a depth gradient field and a differentiable depth integrator that integrates the depth gradients into a depth map. OGNI-DC exhibits strong generalization, outperforming baselines by a large margin on unseen datasets and across various sparsity levels. Moreover, OGNI-DC has high accuracy, achieving state-of-the-art performance on the NYUv2 and the KITTI benchmarks. Code is available at https://github.com/princeton-vl/OGNI-DC.
☆ A First Physical-World Trajectory Prediction Attack via LiDAR-induced Deceptions in Autonomous Driving USENIX Security
Trajectory prediction forecasts nearby agents' moves based on their historical trajectories. Accurate trajectory prediction is crucial for autonomous vehicles. Existing attacks compromise the prediction model of a victim AV by directly manipulating the historical trajectory of an attacker AV, which has limited real-world applicability. This paper, for the first time, explores an indirect attack approach that induces prediction errors via attacks against the perception module of a victim AV. Although it has been shown that physically realizable attacks against LiDAR-based perception are possible by placing a few objects at strategic locations, it is still an open challenge to find an object location from the vast search space in order to launch effective attacks against prediction under varying victim AV velocities. Through analysis, we observe that a prediction model is prone to an attack focusing on a single point in the scene. Consequently, we propose a novel two-stage attack framework to realize the single-point attack. The first stage of prediction-side attack efficiently identifies, guided by the distribution of detection results under object-based attacks against perception, the state perturbations for the prediction model that are effective and velocity-insensitive. In the second stage of location matching, we match the feasible object locations with the found state perturbations. Our evaluation using a public autonomous driving dataset shows that our attack causes a collision rate of up to 63% and various hazardous responses of the victim AV. The effectiveness of our attack is also demonstrated on a real testbed car. To the best of our knowledge, this study is the first security analysis spanning from LiDAR-based perception to prediction in autonomous driving, leading to a realistic attack on prediction. To counteract the proposed attack, potential defenses are discussed.
comment: In Proceedings of the 33rd USENIX Security Symposium 2024
☆ Lightweight Model Pre-training via Language Guided Knowledge Distillation
This paper studies the problem of pre-training for small models, which is essential for many mobile devices. Current state-of-the-art methods on this problem transfer the representational knowledge of a large network (as a Teacher) into a smaller model (as a Student) using self-supervised distillation, improving the performance of the small model on downstream tasks. However, existing approaches are insufficient in extracting the crucial knowledge that is useful for discerning categories in downstream tasks during the distillation process. In this paper, for the first time, we introduce language guidance to the distillation process and propose a new method named Language-Guided Distillation (LGD) system, which uses category names of the target downstream task to help refine the knowledge transferred between the teacher and student. To this end, we utilize a pre-trained text encoder to extract semantic embeddings from language and construct a textual semantic space called Textual Semantics Bank (TSB). Furthermore, we design a Language-Guided Knowledge Aggregation (LGKA) module to construct the visual semantic space, also named Visual Semantics Bank (VSB). The task-related knowledge is transferred by driving a student encoder to mimic the similarity score distribution inferred by a teacher over TSB and VSB. Compared with other small models obtained by either ImageNet pre-training or self-supervised distillation, experiment results show that the distilled lightweight model using the proposed LGD method presents state-of-the-art performance and is validated on various downstream tasks, including classification, detection, and segmentation. We have made the code available at https://github.com/mZhenz/LGD.
☆ Effective Rank Analysis and Regularization for Enhanced 3D Gaussian Splatting
3D reconstruction from multi-view images is one of the fundamental challenges in computer vision and graphics. Recently, 3D Gaussian Splatting (3DGS) has emerged as a promising technique capable of real-time rendering with high-quality 3D reconstruction. This method utilizes 3D Gaussian representation and tile-based splatting techniques, bypassing the expensive neural field querying. Despite its potential, 3DGS encounters challenges, including needle-like artifacts, suboptimal geometries, and inaccurate normals, due to the Gaussians converging into anisotropic Gaussians with one dominant variance. We propose using effective rank analysis to examine the shape statistics of 3D Gaussian primitives, and identify the Gaussians indeed converge into needle-like shapes with the effective rank 1. To address this, we introduce effective rank as a regularization, which constrains the structure of the Gaussians. Our new regularization method enhances normal and geometry reconstruction while reducing needle-like artifacts. The approach can be integrated as an add-on module to other 3DGS variants, improving their quality without compromising visual fidelity.
☆ See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding
Vision-language models (VLMs) can respond to queries about images in many languages. However, beyond language, culture affects how we see things. For example, individuals from Western cultures focus more on the central figure in an image while individuals from Eastern cultures attend more to scene context. In this work, we present a novel investigation that demonstrates and localizes VLMs' Western bias in image understanding. We evaluate large VLMs across subjective and objective visual tasks with culturally diverse images and annotations. We find that VLMs perform better on the Western subset than the Eastern subset of each task. Controlled experimentation tracing the source of this bias highlights the importance of a diverse language mix in text-only pre-training for building equitable VLMs, even when inference is performed in English. Moreover, while prompting in the language of a target culture can lead to reductions in bias, it is not a substitute for building AI more representative of the world's languages.
comment: 17 pages, 7 figures. Code/models: https://github.com/amith-ananthram/see-it-from-my-perspective
☆ Discriminative Hamiltonian Variational Autoencoder for Accurate Tumor Segmentation in Data-Scarce Regimes
Deep learning has gained significant attention in medical image segmentation. However, the limited availability of annotated training data presents a challenge to achieving accurate results. In efforts to overcome this challenge, data augmentation techniques have been proposed. However, the majority of these approaches primarily focus on image generation. For segmentation tasks, providing both images and their corresponding target masks is crucial, and the generation of diverse and realistic samples remains a complex task, especially when working with limited training datasets. To this end, we propose a new end-to-end hybrid architecture based on Hamiltonian Variational Autoencoders (HVAE) and a discriminative regularization to improve the quality of generated images. Our method provides an accuracte estimation of the joint distribution of the images and masks, resulting in the generation of realistic medical images with reduced artifacts and off-distribution instances. As generating 3D volumes requires substantial time and memory, our architecture operates on a slice-by-slice basis to segment 3D volumes, capitilizing on the richly augmented dataset. Experiments conducted on two public datasets, BRATS (MRI modality) and HECKTOR (PET modality), demonstrate the efficacy of our proposed method on different medical imaging modalities with limited data.
☆ Multimodal Learning To Improve Segmentation With Intraoperative CBCT & Preoperative CT
Intraoperative medical imaging, particularly Cone-beam computed tomography (CBCT), is an important tool facilitating computer aided interventions, despite a lower visual quality. While this degraded image quality can affect downstream segmentation, the availability of high quality preoperative scans represents potential for improvements. Here we consider a setting where preoperative CT and intraoperative CBCT scans are available, however, the alignment (registration) between the scans is imperfect. We propose a multimodal learning method that fuses roughly aligned CBCT and CT scans and investigate the effect of CBCT quality and misalignment (affine and elastic transformations facilitating misalignment) on the final segmentation performance. As an application scenario, we focus on the segmentation of liver and liver tumor semantic segmentation and evaluate the effect of intraoperative image quality and misalignment on segmentation performance. To accomplish this, high quality, labelled CTs are defined as preoperative and used as a basis to simulate intraoperative CBCT. We show that the fusion of preoperative CT and simulated, intraoperative CBCT mostly improves segmentation performance and that even clearly misaligned preoperative data has the potential to improve segmentation performance.
comment: Submitted to SASHIMI2024
☆ SeamPose: Repurposing Seams as Capacitive Sensors in a Shirt for Upper-Body Pose Tracking
Seams are areas of overlapping fabric formed by stitching two or more pieces of fabric together in the cut-and-sew apparel manufacturing process. In SeamPose, we repurposed seams as capacitive sensors in a shirt for continuous upper-body pose estimation. Compared to previous all-textile motion-capturing garments that place the electrodes on the surface of clothing, our solution leverages existing seams inside of a shirt by machine-sewing insulated conductive threads over the seams. The unique invisibilities and placements of the seams afford the sensing shirt to look and wear the same as a conventional shirt while providing exciting pose-tracking capabilities. To validate this approach, we implemented a proof-of-concept untethered shirt. With eight capacitive sensing seams, our customized deep-learning pipeline accurately estimates the upper-body 3D joint positions relative to the pelvis. With a 12-participant user study, we demonstrated promising cross-user and cross-session tracking performance. SeamPose represents a step towards unobtrusive integration of smart clothing for everyday pose estimation.
☆ AnyMaker: Zero-shot General Object Customization via Decoupled Dual-Level ID Injection
Text-to-image based object customization, aiming to generate images with the same identity (ID) as objects of interest in accordance with text prompts and reference images, has made significant progress. However, recent customizing research is dominated by specialized tasks, such as human customization or virtual try-on, leaving a gap in general object customization. To this end, we introduce AnyMaker, an innovative zero-shot object customization framework capable of generating general objects with high ID fidelity and flexible text editability. The efficacy of AnyMaker stems from its novel general ID extraction, dual-level ID injection, and ID-aware decoupling. Specifically, the general ID extraction module extracts sufficient ID information with an ensemble of self-supervised models to tackle the diverse customization tasks for general objects. Then, to provide the diffusion UNet with the extracted ID as much while not damaging the text editability in the generation process, we design a global-local dual-level ID injection module, in which the global-level semantic ID is injected into text descriptions while the local-level ID details are injected directly into the model through newly added cross-attention modules. In addition, we propose an ID-aware decoupling module to disentangle ID-related information from non-ID elements in the extracted representations for high-fidelity generation of both identity and text descriptions. To validate our approach and boost the research of general object customization, we create the first large-scale general ID dataset, Multi-Category ID-Consistent (MC-IDC) dataset, with 315k text-image samples and 10k categories. Experiments show that AnyMaker presents remarkable performance in general object customization and outperforms specialized methods in corresponding tasks. Code and dataset will be released soon.
☆ YOLO-FEDER FusionNet: A Novel Deep Learning Architecture for Drone Detection ICIP
Predominant methods for image-based drone detection frequently rely on employing generic object detection algorithms like YOLOv5. While proficient in identifying drones against homogeneous backgrounds, these algorithms often struggle in complex, highly textured environments. In such scenarios, drones seamlessly integrate into the background, creating camouflage effects that adversely affect the detection quality. To address this issue, we introduce a novel deep learning architecture called YOLO-FEDER FusionNet. Unlike conventional approaches, YOLO-FEDER FusionNet combines generic object detection methods with the specialized strength of camouflage object detection techniques to enhance drone detection capabilities. Comprehensive evaluations of YOLO-FEDER FusionNet show the efficiency of the proposed model and demonstrate substantial improvements in both reducing missed detections and false alarms.
comment: 7 pages, 4 figures, 6 tables, to be published in the conference proceedings of the 2024 IEEE International Conference on Image Processing (ICIP)
☆ Feasibility of Federated Learning from Client Databases with Different Brain Diseases and MRI Modalities
Segmentation models for brain lesions in MRI are commonly developed for a specific disease and trained on data with a predefined set of MRI modalities. Each such model cannot segment the disease using data with a different set of MRI modalities, nor can it segment any other type of disease. Moreover, this training paradigm does not allow a model to benefit from learning from heterogeneous databases that may contain scans and segmentation labels for different types of brain pathologies and diverse sets of MRI modalities. Is it feasible to use Federated Learning (FL) for training a single model on client databases that contain scans and labels of different brain pathologies and diverse sets of MRI modalities? We demonstrate promising results by combining appropriate, simple, and practical modifications to the model and training strategy: Designing a model with input channels that cover the whole set of modalities available across clients, training with random modality drop, and exploring the effects of feature normalization methods. Evaluation on 7 brain MRI databases with 5 different diseases shows that such FL framework can train a single model that is shown to be very promising in segmenting all disease types seen during training. Importantly, it is able to segment these diseases in new databases that contain sets of modalities different from those in training clients. These results demonstrate, for the first time, feasibility and effectiveness of using FL to train a single segmentation model on decentralised data with diverse brain diseases and MRI modalities, a necessary step towards leveraging heterogeneous real-world databases. Code will be made available at: https://github.com/FelixWag/FL-MultiDisease-MRI
☆ DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models
Scientific documents record research findings and valuable human knowledge, comprising a vast corpus of high-quality data. Leveraging multi-modality data extracted from these documents and assessing large models' abilities to handle scientific document-oriented tasks is therefore meaningful. Despite promising advancements, large models still perform poorly on multi-page scientific document extraction and understanding tasks, and their capacity to process within-document data formats such as charts and equations remains under-explored. To address these issues, we present DocGenome, a structured document benchmark constructed by annotating 500K scientific documents from 153 disciplines in the arXiv open-access community, using our custom auto-labeling pipeline. DocGenome features four key characteristics: 1) Completeness: It is the first dataset to structure data from all modalities including 13 layout attributes along with their LaTeX source codes. 2) Logicality: It provides 6 logical relationships between different entities within each scientific document. 3) Diversity: It covers various document-oriented tasks, including document classification, visual grounding, document layout detection, document transformation, open-ended single-page QA and multi-page QA. 4) Correctness: It undergoes rigorous quality control checks conducted by a specialized team. We conduct extensive experiments to demonstrate the advantages of DocGenome and objectively evaluate the performance of large models on our benchmark.
comment: Homepage of DocGenome: https://unimodal4reasoning.github.io/DocGenome_page 22 pages, 11 figures
☆ Words in Motion: Representation Engineering for Motion Forecasting
Motion forecasting transforms sequences of past movements and environment context into future motion. Recent methods rely on learned representations, resulting in hidden states that are difficult to interpret. In this work, we use natural language to quantize motion features in a human-interpretable way, and measure the degree to which they are embedded in hidden states. Our experiments reveal that hidden states of motion sequences are arranged with respect to our discrete sets of motion features. Following these insights, we fit control vectors to motion features, which allow for controlling motion forecasts at inference. Consequently, our method enables controlling transformer-based motion forecasting models with textual inputs, providing a unique interface to interact with and understand these models. Our implementation is available at https://github.com/kit-mrt/future-motion
☆ Learning Hierarchical Semantic Classification by Grounding on Consistent Image Segmentations
Hierarchical semantic classification requires the prediction of a taxonomy tree instead of a single flat level of the tree, where both accuracies at individual levels and consistency across levels matter. We can train classifiers for individual levels, which has accuracy but not consistency, or we can train only the finest level classification and infer higher levels, which has consistency but not accuracy. Our key insight is that hierarchical recognition should not be treated as multi-task classification, as each level is essentially a different task and they would have to compromise with each other, but be grounded on image segmentations that are consistent across semantic granularities. Consistency can in fact improve accuracy. We build upon recent work on learning hierarchical segmentation for flat-level recognition, and extend it to hierarchical recognition. It naturally captures the intuition that fine-grained recognition requires fine image segmentation whereas coarse-grained recognition requires coarse segmentation; they can all be integrated into one recognition model that drives fine-to-coarse internal visual parsing.Additionally, we introduce a Tree-path KL Divergence loss to enforce consistent accurate predictions across levels. Our extensive experimentation and analysis demonstrate our significant gains on predicting an accurate and consistent taxonomy tree.
comment: 34 pages
☆ ChildDiffusion: Unlocking the Potential of Generative AI and Controllable Augmentations for Child Facial Data using Stable Diffusion and Large Language Models
In this research work we have proposed high-level ChildDiffusion framework capable of generating photorealistic child facial samples and further embedding several intelligent augmentations on child facial data using short text prompts, detailed textual guidance from LLMs, and further image to image transformation using text guidance control conditioning thus providing an opportunity to curate fully synthetic large scale child datasets. The framework is validated by rendering high-quality child faces representing ethnicity data, micro expressions, face pose variations, eye blinking effects, facial accessories, different hair colours and styles, aging, multiple and different child gender subjects in a single frame. Addressing privacy concerns regarding child data acquisition requires a comprehensive approach that involves legal, ethical, and technological considerations. Keeping this in view this framework can be adapted to synthesise child facial data which can be effectively used for numerous downstream machine learning tasks. The proposed method circumvents common issues encountered in generative AI tools, such as temporal inconsistency and limited control over the rendered outputs. As an exemplary use case we have open-sourced child ethnicity data consisting of 2.5k child facial samples of five different classes which includes African, Asian, White, South Asian/ Indian, and Hispanic races by deploying the model in production inference phase. The rendered data undergoes rigorous qualitative as well as quantitative tests to cross validate its efficacy and further fine-tuning Yolo architecture for detecting and classifying child ethnicity as an exemplary downstream machine learning task.
comment: This work has been submitted to the IEEE Transactions Journal for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
☆ Duoduo CLIP: Efficient 3D Understanding with Multi-View Images
We introduce Duoduo CLIP, a model for 3D representation learning that learns shape encodings from multi-view images instead of point-clouds. The choice of multi-view images allows us to leverage 2D priors from off-the-shelf CLIP models to facilitate fine-tuning with 3D data. Our approach not only shows better generalization compared to existing point cloud methods, but also reduces GPU requirements and training time. In addition, we modify the model with cross-view attention to leverage information across multiple frames of the object which further boosts performance. Compared to the current SOTA point cloud method that requires 480 A100 hours to train 1 billion model parameters we only require 57 A5000 hours and 87 million parameters. Multi-view images also provide more flexibility in use cases compared to point clouds. This includes being able to encode objects with a variable number of images, with better performance when more views are used. This is in contrast to point cloud based methods, where an entire scan or model of an object is required. We showcase this flexibility with object retrieval from images of real-world objects. Our model also achieves better performance in more fine-grained text to shape retrieval, demonstrating better text-and-shape alignment than point cloud based models.
☆ Harmonizing Feature Maps: A Graph Convolutional Approach for Enhancing Adversarial Robustness
The vulnerability of Deep Neural Networks to adversarial perturbations presents significant security concerns, as the imperceptible perturbations can contaminate the feature space and lead to incorrect predictions. Recent studies have attempted to calibrate contaminated features by either suppressing or over-activating particular channels. Despite these efforts, we claim that adversarial attacks exhibit varying disruption levels across individual channels. Furthermore, we argue that harmonizing feature maps via graph and employing graph convolution can calibrate contaminated features. To this end, we introduce an innovative plug-and-play module called Feature Map-based Reconstructed Graph Convolution (FMR-GC). FMR-GC harmonizes feature maps in the channel dimension to reconstruct the graph, then employs graph convolution to capture neighborhood information, effectively calibrating contaminated features. Extensive experiments have demonstrated the superior performance and scalability of FMR-GC. Moreover, our model can be combined with advanced adversarial training methods to considerably enhance robustness without compromising the model's clean accuracy.
☆ Quaternion Generative Adversarial Neural Networks and Applications to Color Image Inpainting
Color image inpainting is a challenging task in imaging science. The existing method is based on real operation, and the red, green and blue channels of the color image are processed separately, ignoring the correlation between each channel. In order to make full use of the correlation between each channel, this paper proposes a Quaternion Generative Adversarial Neural Network (QGAN) model and related theory, and applies it to solve the problem of color image inpainting with large area missing. Firstly, the definition of quaternion deconvolution is given and the quaternion batch normalization is proposed. Secondly, the above two innovative modules are applied to generate adversarial networks to improve stability. Finally, QGAN is applied to color image inpainting and compared with other state-of-the-art algorithms. The experimental results show that QGAN has superiority in color image inpainting with large area missing.
comment: 19 pages, 6 figures
☆ Simple Yet Efficient: Towards Self-Supervised FG-SBIR with Unified Sample Feature Alignment
Fine-Grained Sketch-Based Image Retrieval (FG-SBIR) aims to minimize the distance between sketches and corresponding images in the embedding space. However, scalability is hindered by the growing complexity of solutions, mainly due to the abstract nature of fine-grained sketches. In this paper, we propose a simple yet efficient approach to narrow the gap between the two modes. It mainly facilitates unified mutual information sharing both intra- and inter-samples, rather than treating them as a single feature alignment problem between modalities. Specifically, our approach includes: (i) Employing dual weight-sharing networks to optimize alignment within sketch and image domain, which also effectively mitigates model learning saturation issues. (ii) Introducing an objective optimization function based on contrastive loss to enhance the model's ability to align features intra- and inter-samples. (iii) Presenting a learnable TRSM combined of self-attention and cross-attention to promote feature representations among tokens, further enhancing sample alignment in the embedding space. Our framework achieves excellent results on CNN- and ViT-based backbones. Extensive experiments demonstrate its superiority over existing methods. We also introduce Cloths-V1, the first professional fashion sketches and images dataset, utilized to validate our method and will be beneficial for other applications.
comment: 10 pages,8 figures, 4 tables
☆ AIC MLLM: Autonomous Interactive Correction MLLM for Robust Robotic Manipulation
The ability to reflect on and correct failures is crucial for robotic systems to interact stably with real-life objects.Observing the generalization and reasoning capabilities of Multimodal Large Language Models (MLLMs), previous approaches have aimed to utilize these models to enhance robotic systems accordingly.However, these methods typically focus on high-level planning corrections using an additional MLLM, with limited utilization of failed samples to correct low-level contact poses. To address this gap, we propose an Autonomous Interactive Correction (AIC) MLLM, which makes use of previous low-level interaction experiences to correct SE(3) pose predictions. Specifically, AIC MLLM is initially fine-tuned to acquire both pose prediction and feedback prompt comprehension abilities.We carefully design two types of prompt instructions through interactions with objects: 1) visual masks to highlight unmovable parts for position correction, and 2)textual descriptions to indicate potential directions for rotation correction.During inference, a Feedback Information Extraction module is introduced to recognize the failure cause, allowing AIC MLLM to adaptively correct the pose prediction using the corresponding prompts.To further enhance manipulation stability, we devise a Test Time Adaptation strategy that enables AIC MLLM to better adapt to the current scene configuration.Finally, extensive experiments are conducted in both simulated and real-world environments to evaluate the proposed method. The results demonstrate that our AIC MLLM can efficiently correct failure samples by leveraging interaction experience prompts.Real-world demonstration can be found at https://sites.google.com/view/aic-mllm
☆ Improving Quality Control of Whole Slide Images by Explicit Artifact Augmentation
The problem of artifacts in whole slide image acquisition, prevalent in both clinical workflows and research-oriented settings, necessitates human intervention and re-scanning. Overcoming this challenge requires developing quality control algorithms, that are hindered by the limited availability of relevant annotated data in histopathology. The manual annotation of ground-truth for artifact detection methods is expensive and time-consuming. This work addresses the issue by proposing a method dedicated to augmenting whole slide images with artifacts. The tool seamlessly generates and blends artifacts from an external library to a given histopathology dataset. The augmented datasets are then utilized to train artifact classification methods. The evaluation shows their usefulness in classification of the artifacts, where they show an improvement from 0.10 to 0.01 AUROC depending on the artifact type. The framework, model, weights, and ground-truth annotations are freely released to facilitate open science and reproducible research.
☆ RO-SVD: A Reconfigurable Hardware Copyright Protection Framework for AIGC Applications
The dramatic surge in the utilisation of generative artificial intelligence (GenAI) underscores the need for a secure and efficient mechanism to responsibly manage, use and disseminate multi-dimensional data generated by artificial intelligence (AI). In this paper, we propose a blockchain-based copyright traceability framework called ring oscillator-singular value decomposition (RO-SVD), which introduces decomposition computing to approximate low-rank matrices generated from hardware entropy sources and establishes an AI-generated content (AIGC) copyright traceability mechanism at the device level. By leveraging the parallelism and reconfigurability of field-programmable gate arrays (FPGAs), our framework can be easily constructed on existing AI-accelerated devices and provide a low-cost solution to emerging copyright issues of AIGC. We developed a hardware-software (HW/SW) co-design prototype based on comprehensive analysis and on-board experiments with multiple AI-applicable FPGAs. Using AI-generated images as a case study, our framework demonstrated effectiveness and emphasised customisation, unpredictability, efficiency, management and reconfigurability. To the best of our knowledge, this is the first practical hardware study discussing and implementing copyright traceability specifically for AI-generated content.
comment: Accepted on 20 May 2024 as a full paper at ASAP 2024
☆ Inpainting the Gaps: A Novel Framework for Evaluating Explanation Methods in Vision Transformers
The perturbation test remains the go-to evaluation approach for explanation methods in computer vision. This evaluation method has a major drawback of test-time distribution shift due to pixel-masking that is not present in the training set. To overcome this drawback, we propose a novel evaluation framework called \textbf{Inpainting the Gaps (InG)}. Specifically, we propose inpainting parts that constitute partial or complete objects in an image. In this way, one can perform meaningful image perturbations with lower test-time distribution shifts, thereby improving the efficacy of the perturbation test. InG is applied to the PartImageNet dataset to evaluate the performance of popular explanation methods for three training strategies of the Vision Transformer (ViT). Based on this evaluation, we found Beyond Intuition and Generic Attribution to be the two most consistent explanation models. Further, and interestingly, the proposed framework results in higher and more consistent evaluation scores across all the ViT models considered in this work. To the best of our knowledge, InG is the first semi-synthetic framework for the evaluation of ViT explanation methods.
☆ HyperSIGMA: Hyperspectral Intelligence Comprehension Foundation Model
Foundation models (FMs) are revolutionizing the analysis and understanding of remote sensing (RS) scenes, including aerial RGB, multispectral, and SAR images. However, hyperspectral images (HSIs), which are rich in spectral information, have not seen much application of FMs, with existing methods often restricted to specific tasks and lacking generality. To fill this gap, we introduce HyperSIGMA, a vision transformer-based foundation model for HSI interpretation, scalable to over a billion parameters. To tackle the spectral and spatial redundancy challenges in HSIs, we introduce a novel sparse sampling attention (SSA) mechanism, which effectively promotes the learning of diverse contextual features and serves as the basic block of HyperSIGMA. HyperSIGMA integrates spatial and spectral features using a specially designed spectral enhancement module. In addition, we construct a large-scale hyperspectral dataset, HyperGlobal-450K, for pre-training, which contains about 450K hyperspectral images, significantly surpassing existing datasets in scale. Extensive experiments on various high-level and low-level HSI tasks demonstrate HyperSIGMA's versatility and superior representational capability compared to current state-of-the-art methods. Moreover, HyperSIGMA shows significant advantages in scalability, robustness, cross-modal transferring capability, and real-world applicability.
comment: The code and models will be released at https://github.com/WHU-Sigma/HyperSIGMA
☆ Prior Normality Prompt Transformer for Multi-class Industrial Image Anomaly Detection
Image anomaly detection plays a pivotal role in industrial inspection. Traditional approaches often demand distinct models for specific categories, resulting in substantial deployment costs. This raises concerns about multi-class anomaly detection, where a unified model is developed for multiple classes. However, applying conventional methods, particularly reconstruction-based models, directly to multi-class scenarios encounters challenges such as identical shortcut learning, hindering effective discrimination between normal and abnormal instances. To tackle this issue, our study introduces the Prior Normality Prompt Transformer (PNPT) method for multi-class image anomaly detection. PNPT strategically incorporates normal semantics prompting to mitigate the "identical mapping" problem. This entails integrating a prior normality prompt into the reconstruction process, yielding a dual-stream model. This innovative architecture combines normal prior semantics with abnormal samples, enabling dual-stream reconstruction grounded in both prior knowledge and intrinsic sample characteristics. PNPT comprises four essential modules: Class-Specific Normality Prompting Pool (CS-NPP), Hierarchical Patch Embedding (HPE), Semantic Alignment Coupling Encoding (SACE), and Contextual Semantic Conditional Decoding (CSCD). Experimental validation on diverse benchmark datasets and real-world industrial applications highlights PNPT's superior performance in multi-class industrial anomaly detection.
comment: Accepted by IEEE Transactions on Industrial Informatics
☆ GeoGPT4V: Towards Geometric Multi-modal Large Language Models with Geometric Image Generation
Large language models have seen widespread adoption in math problem-solving. However, in geometry problems that usually require visual aids for better understanding, even the most advanced multi-modal models currently still face challenges in effectively using image information. High-quality data is crucial for enhancing the geometric capabilities of multi-modal models, yet existing open-source datasets and related efforts are either too challenging for direct model learning or suffer from misalignment between text and images. To overcome this issue, we introduce a novel pipeline that leverages GPT-4 and GPT-4V to generate relatively basic geometry problems with aligned text and images, facilitating model learning. We have produced a dataset of 4.9K geometry problems and combined it with 19K open-source data to form our GeoGPT4V dataset. Experimental results demonstrate that the GeoGPT4V dataset significantly improves the geometry performance of various models on the MathVista and MathVision benchmarks. The code is available at https://github.com/Lanyu0303/GeoGPT4V_Project
☆ Learning from Exemplars for Interactive Image Segmentation
Interactive image segmentation enables users to interact minimally with a machine, facilitating the gradual refinement of the segmentation mask for a target of interest. Previous studies have demonstrated impressive performance in extracting a single target mask through interactive segmentation. However, the information cues of previously interacted objects have been overlooked in the existing methods, which can be further explored to speed up interactive segmentation for multiple targets in the same category. To this end, we introduce novel interactive segmentation frameworks for both a single object and multiple objects in the same category. Specifically, our model leverages transformer backbones to extract interaction-focused visual features from the image and the interactions to obtain a satisfactory mask of a target as an exemplar. For multiple objects, we propose an exemplar-informed module to enhance the learning of similarities among the objects of the target category. To combine attended features from different modules, we incorporate cross-attention blocks followed by a feature fusion module. Experiments conducted on mainstream benchmarks demonstrate that our models achieve superior performance compared to previous methods. Particularly, our model reduces users' labor by around 15\%, requiring two fewer clicks to achieve target IoUs 85\% and 90\%. The results highlight our models' potential as a flexible and practical annotation tool. The source code will be released after publication.
comment: Under review
☆ Calibrating Where It Matters: Constrained Temperature Scaling NeurIPS 2023
We consider calibration of convolutional classifiers for diagnostic decision making. Clinical decision makers can use calibrated classifiers to minimise expected costs given their own cost function. Such functions are usually unknown at training time. If minimising expected costs is the primary aim, algorithms should focus on tuning calibration in regions of probability simplex likely to effect decisions. We give an example, modifying temperature scaling calibration, and demonstrate improved calibration where it matters using convnets trained to classify dermoscopy images.
comment: Presented at Medical Imaging Meets NeurIPS 2023
☆ MedThink: Inducing Medical Large-scale Visual Language Models to Hallucinate Less by Thinking More
When Large Vision Language Models (LVLMs) are applied to multimodal medical generative tasks, they suffer from significant model hallucination issues. This severely impairs the model's generative accuracy, making it challenging for LVLMs to be implemented in real-world medical scenarios to assist doctors in diagnosis. Enhancing the training data for downstream medical generative tasks is an effective way to address model hallucination. Moreover, the limited availability of training data in the medical field and privacy concerns greatly hinder the model's accuracy and generalization capabilities. In this paper, we introduce a method that mimics human cognitive processes to construct fine-grained instruction pairs and apply the concept of chain-of-thought (CoT) from inference scenarios to training scenarios, thereby proposing a method called MedThink. Our experiments on various LVLMs demonstrate that our novel data construction method tailored for the medical domain significantly improves the model's performance in medical image report generation tasks and substantially mitigates the hallucinations. All resources of this work will be released soon.
☆ Solving the Inverse Problem of Electrocardiography for Cardiac Digital Twins: A Survey
Cardiac digital twins are personalized virtual representations used to understand complex heart mechanisms. Solving the ECG inverse problem is crucial for accurate virtual heart modelling, enabling the derivation of internal electrical activity information from recorded surface potentials. Despite challenges from cardiac complexity, noisy ECG data, and computational efficiency, recent advancements hold significant promise for enhancing virtual heart modelling, ultimately advancing precision medicine in cardiology. This paper aims to provide a comprehensive review of the methods of solving ECG inverse problem, the validation strategies, the clinical applications, and future perspectives. For the computing methodologies, we broadly classify state-of-the-art approaches into two categories: deterministic and probabilistic methods, including conventional and deep learning-based techniques. Integrating physics laws with deep learning models holds promise, but challenges such as capturing dynamic electrophysiology accurately, accessing accurate domain knowledge, and quantifying prediction uncertainty persist. Integrating models into clinical workflows while ensuring interpretability and usability for healthcare professionals is essential. Overcoming these challenges will drive further research in cardiac digital twins.
☆ PrAViC: Probabilistic Adaptation Framework for Real-Time Video Classification
Video processing is generally divided into two main categories: processing of the entire video, which typically yields optimal classification outcomes, and real-time processing, where the objective is to make a decision as promptly as possible. The latter is often driven by the need to identify rapidly potential critical or dangerous situations. These could include machine failure, traffic accidents, heart problems, or dangerous behavior. Although the models dedicated to the processing of entire videos are typically well-defined and clearly presented in the literature, this is not the case for online processing, where a plethora of hand-devised methods exist. To address this, we present \our{}, a novel, unified, and theoretically-based adaptation framework for dealing with the online classification problem for video data. The initial phase of our study is to establish a robust mathematical foundation for the theory of classification of sequential data, with the potential to make a decision at an early stage. This allows us to construct a natural function that encourages the model to return an outcome much faster. The subsequent phase is to demonstrate a straightforward and readily implementable method for adapting offline models to online and recurrent operations. Finally, by comparing the proposed approach to the non-online state-of-the-art baseline, it is demonstrated that the use of \our{} encourages the network to make earlier classification decisions without compromising accuracy.
☆ SWCF-Net: Similarity-weighted Convolution and Local-global Fusion for Efficient Large-scale Point Cloud Semantic Segmentation
Large-scale point cloud consists of a multitude of individual objects, thereby encompassing rich structural and underlying semantic contextual information, resulting in a challenging problem in efficiently segmenting a point cloud. Most existing researches mainly focus on capturing intricate local features without giving due consideration to global ones, thus failing to leverage semantic context. In this paper, we propose a Similarity-Weighted Convolution and local-global Fusion Network, named SWCF-Net, which takes into account both local and global features. We propose a Similarity-Weighted Convolution (SWConv) to effectively extract local features, where similarity weights are incorporated into the convolution operation to enhance the generalization capabilities. Then, we employ a downsampling operation on the K and V channels within the attention module, thereby reducing the quadratic complexity to linear, enabling the Transformer to deal with large-scale point clouds. At last, orthogonal components are extracted in the global features and then aggregated with local features, thereby eliminating redundant information between local and global features and consequently promoting efficiency. We evaluate SWCF-Net on large-scale outdoor datasets SemanticKITTI and Toronto3D. Our experimental results demonstrate the effectiveness of the proposed network. Our method achieves a competitive result with less computational cost, and is able to handle large-scale point clouds efficiently.
☆ AnyTrans: Translate AnyText in the Image with Large Scale Models
This paper introduces AnyTrans, an all-encompassing framework for the task-Translate AnyText in the Image (TATI), which includes multilingual text translation and text fusion within images. Our framework leverages the strengths of large-scale models, such as Large Language Models (LLMs) and text-guided diffusion models, to incorporate contextual cues from both textual and visual elements during translation. The few-shot learning capability of LLMs allows for the translation of fragmented texts by considering the overall context. Meanwhile, the advanced inpainting and editing abilities of diffusion models make it possible to fuse translated text seamlessly into the original image while preserving its style and realism. Additionally, our framework can be constructed entirely using open-source models and requires no training, making it highly accessible and easily expandable. To encourage advancement in the TATI task, we have meticulously compiled a test dataset called MTIT6, which consists of multilingual text image translation data from six language pairs.
☆ Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report
Multimodal Foundation Models (MMFMs) have shown remarkable performance on various computer vision and natural language processing tasks. However, their performance on particular tasks such as document understanding is still limited. They also require more compute, time, and engineering resources to finetune and deploy compared to traditional, unimodal models. In this report, we present Multimodal Structured Generation, a general framework which constrains the output logits of frozen MMFMs to force them to reason before responding with structured outputs that downstream APIs can parse and use. We provide a detailed account of our approach, including the technical details, theoretical discussions, and final evaluation results in the 2nd Multimodal Foundation Models Challenge hosted by the Computer Vision and Pattern Recognition (CVPR) conference. Our approach achieved the second highest score in the hidden test set for Phase 2 and third highest overall. This shows the method's ability to generalize to unseen tasks. And that simple engineering can beat expensive & complicated modelling steps as we first discussed in our paper, Retrieval Augmented Structured Generation: Business Document Information Extraction as Tool Use. All of our scripts, deployment steps, and evaluation results can be accessed in https://github.com/leloykun/MMFM-Challenge
comment: Conference on Computer Vision and Pattern Recognition's 2nd Multimodal Foundation Models Challenge
☆ Multi-Camera Hand-Eye Calibration for Human-Robot Collaboration in Industrial Robotic Workcells
In industrial scenarios, effective human-robot collaboration relies on multi-camera systems to robustly monitor human operators despite the occlusions that typically show up in a robotic workcell. In this scenario, precise localization of the person in the robot coordinate system is essential, making the hand-eye calibration of the camera network critical. This process presents significant challenges when high calibration accuracy should be achieved in short time to minimize production downtime, and when dealing with extensive camera networks used for monitoring wide areas, such as industrial robotic workcells. Our paper introduces an innovative and robust multi-camera hand-eye calibration method, designed to optimize each camera's pose relative to both the robot's base and to each other camera. This optimization integrates two types of key constraints: i) a single board-to-end-effector transformation, and ii) the relative camera-to-camera transformations. We demonstrate the superior performance of our method through comprehensive experiments employing the METRIC dataset and real-world data collected on industrial scenarios, showing notable advancements over state-of-the-art techniques even using less than 10 images. Additionally, we release an open-source version of our multi-camera hand-eye calibration algorithm at https://github.com/davidea97/Multi-Camera-Hand-Eye-Calibration.git.
☆ Understanding Multi-Granularity for Open-Vocabulary Part Segmentation
Open-vocabulary part segmentation (OVPS) is an emerging research area focused on segmenting fine-grained entities based on diverse and previously unseen vocabularies. Our study highlights the inherent complexities of part segmentation due to intricate boundaries and diverse granularity, reflecting the knowledge-based nature of part identification. To address these challenges, we propose PartCLIPSeg, a novel framework utilizing generalized parts and object-level contexts to mitigate the lack of generalization in fine-grained parts. PartCLIPSeg integrates competitive part relationships and attention control techniques, alleviating ambiguous boundaries and underrepresented parts. Experimental results demonstrate that PartCLIPSeg outperforms existing state-of-the-art OVPS methods, offering refined segmentation and an advanced understanding of part relationships in images. Through extensive experiments, our model demonstrated an improvement over the state-of-the-art models on the Pascal-Part-116, ADE20K-Part-234, and PartImageNet datasets.
☆ Video Frame Interpolation for Polarization via Swin-Transformer
Video Frame Interpolation (VFI) has been extensively explored and demonstrated, yet its application to polarization remains largely unexplored. Due to the selective transmission of light by polarized filters, longer exposure times are typically required to ensure sufficient light intensity, which consequently lower the temporal sample rates. Furthermore, because polarization reflected by objects varies with shooting perspective, focusing solely on estimating pixel displacement is insufficient to accurately reconstruct the intermediate polarization. To tackle these challenges, this study proposes a multi-stage and multi-scale network called Swin-VFI based on the Swin-Transformer and introduces a tailored loss function to facilitate the network's understanding of polarization changes. To ensure the practicality of our proposed method, this study evaluates its interpolated frames in Shape from Polarization (SfP) and Human Shape Reconstruction tasks, comparing them with other state-of-the-art methods such as CAIN, FLAVR, and VFIT. Experimental results demonstrate our approach's superior reconstruction accuracy across all tasks.
comment: 18 pages, 10 figures, 7 tables, 73 citations
☆ Preserving Knowledge in Large Language Model: A Model-Agnostic Self-Decompression Approach
Humans can retain old knowledge while learning new information, but Large Language Models (LLMs) often suffer from catastrophic forgetting when post-pretrained or supervised fine-tuned (SFT) on domain-specific data. Moreover, for Multimodal Large Language Models (MLLMs) which are composed of the LLM base and visual projector (e.g. LLaVA), a significant decline in performance on language benchmarks was observed compared to their single-modality counterparts. To address these challenges, we introduce a novel model-agnostic self-decompression method, Tree Generation (TG), that decompresses knowledge within LLMs into the training corpus. This paper focuses on TG-SFT, which can synthetically generate SFT data for the instruction tuning steps. By incorporating the dumped corpus during SFT for MLLMs, we significantly reduce the forgetting problem.
☆ CM2-Net: Continual Cross-Modal Mapping Network for Driver Action Recognition
Driver action recognition has significantly advanced in enhancing driver-vehicle interactions and ensuring driving safety by integrating multiple modalities, such as infrared and depth. Nevertheless, compared to RGB modality only, it is always laborious and costly to collect extensive data for all types of non-RGB modalities in car cabin environments. Therefore, previous works have suggested independently learning each non-RGB modality by fine-tuning a model pre-trained on RGB videos, but these methods are less effective in extracting informative features when faced with newly-incoming modalities due to large domain gaps. In contrast, we propose a Continual Cross-Modal Mapping Network (CM2-Net) to continually learn each newly-incoming modality with instructive prompts from the previously-learned modalities. Specifically, we have developed Accumulative Cross-modal Mapping Prompting (ACMP), to map the discriminative and informative features learned from previous modalities into the feature space of newly-incoming modalities. Then, when faced with newly-incoming modalities, these mapped features are able to provide effective prompts for which features should be extracted and prioritized. These prompts are accumulating throughout the continual learning process, thereby boosting further recognition performances. Extensive experiments conducted on the Drive&Act dataset demonstrate the performance superiority of CM2-Net on both uni- and multi-modal driver action recognition.
☆ Hallucination Mitigation Prompts Long-term Video Understanding
Recently, multimodal large language models have made significant advancements in video understanding tasks. However, their ability to understand unprocessed long videos is very limited, primarily due to the difficulty in supporting the enormous memory overhead. Although existing methods achieve a balance between memory and information by aggregating frames, they inevitably introduce the severe hallucination issue. To address this issue, this paper constructs a comprehensive hallucination mitigation pipeline based on existing MLLMs. Specifically, we use the CLIP Score to guide the frame sampling process with questions, selecting key frames relevant to the question. Then, We inject question information into the queries of the image Q-former to obtain more important visual features. Finally, during the answer generation stage, we utilize chain-of-thought and in-context learning techniques to explicitly control the generation of answers. It is worth mentioning that for the breakpoint mode, we found that image understanding models achieved better results than video understanding models. Therefore, we aggregated the answers from both types of models using a comparison mechanism. Ultimately, We achieved 84.2\% and 62.9\% for the global and breakpoint modes respectively on the MovieChat dataset, surpassing the official baseline model by 29.1\% and 24.1\%. Moreover the proposed method won the third place in the CVPR LOVEU 2024 Long-Term Video Question Answering Challenge. The code is avaiable at https://github.com/lntzm/CVPR24Track-LongVideo
☆ They're All Doctors: Synthesizing Diverse Counterfactuals to Mitigate Associative Bias
Vision Language Models (VLMs) such as CLIP are powerful models; however they can exhibit unwanted biases, making them less safe when deployed directly in applications such as text-to-image, text-to-video retrievals, reverse search, or classification tasks. In this work, we propose a novel framework to generate synthetic counterfactual images to create a diverse and balanced dataset that can be used to fine-tune CLIP. Given a set of diverse synthetic base images from text-to-image models, we leverage off-the-shelf segmentation and inpainting models to place humans with diverse visual appearances in context. We show that CLIP trained on such datasets learns to disentangle the human appearance from the context of an image, i.e., what makes a doctor is not correlated to the person's visual appearance, like skin color or body type, but to the context, such as background, the attire they are wearing, or the objects they are holding. We demonstrate that our fine-tuned CLIP model, $CF_\alpha$, improves key fairness metrics such as MaxSkew, MinSkew, and NDKL by 40-66\% for image retrieval tasks, while still achieving similar levels of performance in downstream tasks. We show that, by design, our model retains maximal compatibility with the original CLIP models, and can be easily controlled to support different accuracy versus fairness trade-offs in a plug-n-play fashion.
☆ A Dictionary Based Approach for Removing Out-of-Focus Blur ICIP
The field of image deblurring has seen tremendous progress with the rise of deep learning models. These models, albeit efficient, are computationally expensive and energy consuming. Dictionary based learning approaches have shown promising results in image denoising and Single Image Super-Resolution. We propose an extension of the Rapid and Accurate Image Super-Resolution (RAISR) algorithm introduced by Isidoro, Romano and Milanfar for the task of out-of-focus blur removal. We define a sharpness quality measure which aligns well with the perceptual quality of an image. A metric based blending strategy based on asset allocation management is also proposed. Our method demonstrates an average increase of approximately 13% (PSNR) and 10% (SSIM) compared to popular deblurring methods. Furthermore, our blending scheme curtails ringing artefacts post restoration.
comment: 6 pages, IEEE ICIP
☆ ClawMachine: Fetching Visual Tokens as An Entity for Referring and Grounding
An essential topic for multimodal large language models (MLLMs) is aligning vision and language concepts at a finer level. In particular, we devote efforts to encoding visual referential information for tasks such as referring and grounding. Existing methods, including proxy encoding and geometry encoding, incorporate additional syntax to encode the object's location, bringing extra burdens in training MLLMs to communicate between language and vision. This study presents ClawMachine, offering a new methodology that notates an entity directly using the visual tokens. It allows us to unify the prompt and answer of visual referential tasks without additional syntax. Upon a joint vision-language vocabulary, ClawMachine unifies visual referring and grounding into an auto-regressive format and learns with a decoder-only architecture. Experiments validate that our model achieves competitive performance across visual referring and grounding tasks with a reduced demand for training data. Additionally, ClawMachine demonstrates a native ability to integrate multi-source information for complex visual reasoning, which prior MLLMs can hardly perform without specific adaptions.
comment: Project page: https://github.com/martian422/ClawMachine
☆ Low-power Ship Detection in Satellite Images Using Neuromorphic Hardware
Transmitting Earth observation image data from satellites to ground stations incurs significant costs in terms of power and bandwidth. For maritime ship detection, on-board data processing can identify ships and reduce the amount of data sent to the ground. However, most images captured on board contain only bodies of water or land, with the Airbus Ship Detection dataset showing only 22.1\% of images containing ships. We designed a low-power, two-stage system to optimize performance instead of relying on a single complex model. The first stage is a lightweight binary classifier that acts as a gating mechanism to detect the presence of ships. This stage runs on Brainchip's Akida 1.0, which leverages activation sparsity to minimize dynamic power consumption. The second stage employs a YOLOv5 object detection model to identify the location and size of ships. This approach achieves a mean Average Precision (mAP) of 76.9\%, which increases to 79.3\% when evaluated solely on images containing ships, by reducing false positives. Additionally, we calculated that evaluating the full validation set on a NVIDIA Jetson Nano device requires 111.4 kJ of energy. Our two-stage system reduces this energy consumption to 27.3 kJ, which is less than a fourth, demonstrating the efficiency of a heterogeneous computing system.
☆ GUICourse: From General Vision Language Models to Versatile GUI Agents
Utilizing Graphic User Interface (GUI) for human-computer interaction is essential for accessing a wide range of digital tools. Recent advancements in Vision Language Models (VLMs) highlight the compelling potential to develop versatile agents to help humans finish GUI navigation tasks. However, current VLMs are challenged in terms of fundamental abilities (OCR and grounding) and GUI knowledge (the functions and control methods of GUI elements), preventing them from becoming practical GUI agents. To solve these challenges, we contribute GUICourse, a suite of datasets to train visual-based GUI agents from general VLMs. First, we introduce the GUIEnv dataset to strengthen the OCR and grounding capabilities of VLMs. Then, we introduce the GUIAct and GUIChat datasets to enrich their knowledge of GUI components and interactions. Experiments demonstrate that our GUI agents have better performance on common GUI tasks than their baseline VLMs. Even the small-size GUI agent (with 3.1B parameters) can still work well on single-step and multi-step GUI tasks. Finally, we analyze the different varieties in the training stage of this agent by ablation study. Our source codes and datasets are released at https://github.com/yiye3/GUICourse.
☆ Temporal Lidar Depth Completion
Given the lidar measurements from an autonomous vehicle, we can project the points and generate a sparse depth image. Depth completion aims at increasing the resolution of such a depth image by infilling and interpolating the sparse depth values. Like most existing approaches, we make use of camera images as guidance in very sparse or occluded regions. In addition, we propose a temporal algorithm that utilizes information from previous timesteps using recurrence. In this work, we show how a state-of-the-art method PENet can be modified to benefit from recurrency. Our algorithm achieves state-of-the-art results on the KITTI depth completion dataset while adding only less than one percent of additional overhead in terms of both neural network parameters and floating point operations. The accuracy is especially improved for faraway objects and regions containing a low amount of lidar depth samples. Even in regions without any ground truth (like sky and rooftops) we observe large improvements which are not captured by the existing evaluation metrics.
☆ Semi-Supervised Domain Adaptation Using Target-Oriented Domain Augmentation for 3D Object Detection
3D object detection is crucial for applications like autonomous driving and robotics. However, in real-world environments, variations in sensor data distribution due to sensor upgrades, weather changes, and geographic differences can adversely affect detection performance. Semi-Supervised Domain Adaptation (SSDA) aims to mitigate these challenges by transferring knowledge from a source domain, abundant in labeled data, to a target domain where labels are scarce. This paper presents a new SSDA method referred to as Target-Oriented Domain Augmentation (TODA) specifically tailored for LiDAR-based 3D object detection. TODA efficiently utilizes all available data, including labeled data in the source domain, and both labeled data and unlabeled data in the target domain to enhance domain adaptation performance. TODA consists of two stages: TargetMix and AdvMix. TargetMix employs mixing augmentation accounting for LiDAR sensor characteristics to facilitate feature alignment between the source-domain and target-domain. AdvMix applies point-wise adversarial augmentation with mixing augmentation, which perturbs the unlabeled data to align the features within both labeled and unlabeled data in the target domain. Our experiments conducted on the challenging domain adaptation tasks demonstrate that TODA outperforms existing domain adaptation techniques designed for 3D object detection by significant margins. The code is available at: https://github.com/rasd3/TODA.
comment: Accepted to IEEE Transactions on Intelligent Vehicles (T-IV). The code is available at: https://github.com/rasd3/TODA
☆ Syn-to-Real Unsupervised Domain Adaptation for Indoor 3D Object Detection
The use of synthetic data in indoor 3D object detection offers the potential of greatly reducing the manual labor involved in 3D annotations and training effective zero-shot detectors. However, the complicated domain shifts across syn-to-real indoor datasets remains underexplored. In this paper, we propose a novel Object-wise Hierarchical Domain Alignment (OHDA) framework for syn-to-real unsupervised domain adaptation in indoor 3D object detection. Our approach includes an object-aware augmentation strategy to effectively diversify the source domain data, and we introduce a two-branch adaptation framework consisting of an adversarial training branch and a pseudo labeling branch, in order to simultaneously reach holistic-level and class-level domain alignment. The pseudo labeling is further refined through two proposed schemes specifically designed for indoor UDA. Our adaptation results from synthetic dataset 3D-FRONT to real-world datasets ScanNetV2 and SUN RGB-D demonstrate remarkable mAP25 improvements of 9.7% and 9.1% over Source-Only baselines, respectively, and consistently outperform the methods adapted from 2D and 3D outdoor scenarios. The code will be publicly available upon paper acceptance.
☆ Federated Active Learning Framework for Efficient Annotation Strategy in Skin-lesion Classification
Federated Learning (FL) enables multiple institutes to train models collaboratively without sharing private data. Current FL research focuses on communication efficiency, privacy protection, and personalization and assumes that the data of FL have already been ideally collected. In medical scenarios, however, data annotation demands both expertise and intensive labor, which is a critical problem in FL. Active learning (AL), has shown promising performance in reducing the number of data annotations in medical image analysis. We propose a federated AL (FedAL) framework in which AL is executed periodically and interactively under FL. We exploit a local model in each hospital and a global model acquired from FL to construct an ensemble. We use ensemble-entropy-based AL as an efficient data-annotation strategy in FL. Therefore, our FedAL framework can decrease the amount of annotated data and preserve patient privacy while maintaining the performance of FL. To our knowledge, this is the first FedAL framework applied to medical images. We validated our framework on real-world dermoscopic datasets. Using only 50% of samples, our framework was able to achieve state-of-the-art performance on a skin-lesion classification task. Our framework performed better than several state-of-the-art AL methods under FL and achieved comparable performance to full-data FL.
comment: 14 pages, 3 figures
☆ BaFTA: Backprop-Free Test-Time Adaptation For Zero-Shot Vision-Language Models
Large-scale pretrained vision-language models like CLIP have demonstrated remarkable zero-shot image classification capabilities across diverse domains. To enhance CLIP's performance while preserving the zero-shot paradigm, various test-time prompt tuning methods have been introduced to refine class embeddings through unsupervised learning objectives during inference. However, these methods often encounter challenges in selecting appropriate learning rates to prevent collapsed training in the absence of validation data during test-time adaptation. In this study, we propose a novel backpropagation-free algorithm BaFTA for test-time adaptation of vision-language models. Instead of fine-tuning text prompts to refine class embeddings, our approach directly estimates class centroids using online clustering within a projected embedding space that aligns text and visual embeddings. We dynamically aggregate predictions from both estimated and original class embeddings, as well as from distinct augmented views, by assessing the reliability of each prediction using R\'enyi Entropy. Through extensive experiments, we demonstrate that BaFTA consistently outperforms state-of-the-art test-time adaptation methods in both effectiveness and efficiency.
☆ VideoVista: A Versatile Benchmark for Video Understanding and Reasoning
Despite significant breakthroughs in video analysis driven by the rapid development of large multimodal models (LMMs), there remains a lack of a versatile evaluation benchmark to comprehensively assess these models' performance in video understanding and reasoning. To address this, we present VideoVista, a video QA benchmark that integrates challenges across diverse content categories, durations, and abilities. Specifically, VideoVista comprises 25,000 questions derived from 3,400 videos spanning 14 categories (e.g., Howto, Film, and Entertainment) with durations ranging from a few seconds to over 10 minutes. Besides, it encompasses 19 types of understanding tasks (e.g., anomaly detection, interaction understanding) and 8 reasoning tasks (e.g., logical reasoning, causal reasoning). To achieve this, we present an automatic data construction framework, leveraging powerful GPT-4o alongside advanced analysis tools (e.g., video splitting, object segmenting, and tracking). We also utilize this framework to construct training data to enhance the capabilities of video-related LMMs (Video-LMMs). Through a comprehensive and quantitative evaluation of cutting-edge models, we reveal that: 1) Video-LMMs face difficulties in fine-grained video tasks involving temporal location, object tracking, and anomaly detection; 2) Video-LMMs present inferior logical and relation reasoning abilities; 3) Open-source Video-LMMs' performance is significantly lower than GPT-4o and Gemini-1.5, lagging by 20 points. This highlights the crucial role VideoVista will play in advancing LMMs that can accurately understand videos and perform precise reasoning.
comment: 38 pages, 44 figures
☆ MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models
Large vision-language models (LVLMs) have significantly improved multimodal reasoning tasks, such as visual question answering and image captioning. These models embed multimodal facts within their parameters, rather than relying on external knowledge bases to store factual information explicitly. However, the content discerned by LVLMs may deviate from actual facts due to inherent bias or incorrect inference. To address this issue, we introduce MFC-Bench, a rigorous and comprehensive benchmark designed to evaluate the factual accuracy of LVLMs across three tasks: Manipulation, Out-of-Context, and Veracity Classification. Through our evaluation on MFC-Bench, we benchmarked 12 diverse and representative LVLMs, uncovering that current models still fall short in multimodal fact-checking and demonstrate insensitivity to various forms of manipulated content. We hope that MFC-Bench could raise attention to the trustworthy artificial intelligence potentially assisted by LVLMs in the future. The MFC-Bench and accompanying resources are publicly accessible at https://github.com/wskbest/MFC-Bench, contributing to ongoing research in the multimodal fact-checking field.
comment: 22 pages, 8 figures
☆ Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding
The field of self-supervised 3D representation learning has emerged as a promising solution to alleviate the challenge presented by the scarcity of extensive, well-annotated datasets. However, it continues to be hindered by the lack of diverse, large-scale, real-world 3D scene datasets for source data. To address this shortfall, we propose Generalizable Representation Learning (GRL), where we devise a generative Bayesian network to produce diverse synthetic scenes with real-world patterns, and conduct pre-training with a joint objective. By jointly learning a coarse-to-fine contrastive learning task and an occlusion-aware reconstruction task, the model is primed with transferable, geometry-informed representations. Post pre-training on synthetic data, the acquired knowledge of the model can be seamlessly transferred to two principal downstream tasks associated with 3D scene understanding, namely 3D object detection and 3D semantic segmentation, using real-world benchmark datasets. A thorough series of experiments robustly display our method's consistent superiority over existing state-of-the-art pre-training approaches.
☆ From Pixels to Progress: Generating Road Network from Satellite Imagery for Socioeconomic Insights in Impoverished Areas IJCAI2024
The Sustainable Development Goals (SDGs) aim to resolve societal challenges, such as eradicating poverty and improving the lives of vulnerable populations in impoverished areas. Those areas rely on road infrastructure construction to promote accessibility and economic development. Although publicly available data like OpenStreetMap is available to monitor road status, data completeness in impoverished areas is limited. Meanwhile, the development of deep learning techniques and satellite imagery shows excellent potential for earth monitoring. To tackle the challenge of road network assessment in impoverished areas, we develop a systematic road extraction framework combining an encoder-decoder architecture and morphological operations on satellite imagery, offering an integrated workflow for interdisciplinary researchers. Extensive experiments of road network extraction on real-world data in impoverished regions achieve a 42.7% enhancement in the F1-score over the baseline methods and reconstruct about 80% of the actual roads. We also propose a comprehensive road network dataset covering approximately 794,178 km2 area and 17.048 million people in 382 impoverished counties in China. The generated dataset is further utilized to conduct socioeconomic analysis in impoverished counties, showing that road network construction positively impacts regional economic development. The technical appendix, code, and generated dataset can be found at https://github.com/tsinghua-fib-lab/Road_network_extraction_impoverished_counties.
comment: 12 pages, 13 figures, IJCAI2024 (AI and Social Good)
☆ i-SRT: Aligning Large Multimodal Models for Videos by Iterative Self-Retrospective Judgment
Aligning Video Large Multimodal Models (VLMMs) face challenges such as modality misalignment and verbose responses. Although iterative approaches such as self-rewarding or iterative direct preference optimization (DPO) recently showed a significant improvement in language model alignment, particularly on reasoning tasks, self-aligned models applied to large video-language models often result in lengthy and irrelevant responses. To address these challenges, we propose a novel method that employs self-retrospection to enhance both response generation and preference modeling, and call iterative self-retrospective judgment (i-SRT). By revisiting and evaluating already generated content and preference in loop, i-SRT improves the alignment between textual and visual modalities, reduce verbosity, and enhances content relevance. Our empirical evaluations across diverse video question answering benchmarks demonstrate that i-SRT significantly outperforms prior arts. We are committed to opensourcing our code, models, and datasets to encourage further investigation.
comment: Technical report
☆ MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens
Multimodal interleaved datasets featuring free-form interleaved sequences of images and text are crucial for training frontier large multimodal models (LMMs). Despite the rapid progression of open-source LMMs, there remains a pronounced scarcity of large-scale, diverse open-source multimodal interleaved datasets. In response, we introduce MINT-1T, the most extensive and diverse open-source Multimodal INTerleaved dataset to date. MINT-1T comprises one trillion text tokens and three billion images, a 10x scale-up from existing open-source datasets. Additionally, we include previously untapped sources such as PDFs and ArXiv papers. As scaling multimodal interleaved datasets requires substantial engineering effort, sharing the data curation process and releasing the dataset greatly benefits the community. Our experiments show that LMMs trained on MINT-1T rival the performance of models trained on the previous leading dataset, OBELICS. Our data and code will be released at https://github.com/mlfoundations/MINT-1T.
☆ DRIP: Discriminative Rotation-Invariant Pole Landmark Descriptor for 3D LiDAR Localization
In 3D LiDAR-based robot self-localization, pole-like landmarks are gaining popularity as lightweight and discriminative landmarks. This work introduces a novel approach called "discriminative rotation-invariant poles," which enhances the discriminability of pole-like landmarks while maintaining their lightweight nature. Unlike conventional methods that model a pole landmark as a 3D line segment perpendicular to the ground, we propose a simple yet powerful approach that includes not only the line segment's main body but also its surrounding local region of interest (ROI) as part of the pole landmark. Specifically, we describe the appearance, geometry, and semantic features within this ROI to improve the discriminability of the pole landmark. Since such pole landmarks are no longer rotation-invariant, we introduce a novel rotation-invariant convolutional neural network that automatically and efficiently extracts rotation-invariant features from input point clouds for recognition. Furthermore, we train a pole dictionary through unsupervised learning and use it to compress poles into compact pole words, thereby significantly reducing real-time costs while maintaining optimal self-localization performance. Monte Carlo localization experiments using publicly available NCLT dataset demonstrate that the proposed method improves a state-of-the-art pole-based localization framework.
comment: 4 pages, 1 table
☆ Generative Visual Instruction Tuning
We propose to use machine-generated instruction-following data to improve the zero-shot capabilities of a large multimodal model with additional support for generative and image editing tasks. We achieve this by curating a new multimodal instruction-following set using GPT-4V and existing datasets for image generation and editing. Using this instruction set and the existing LLaVA-Finetune instruction set for visual understanding tasks, we produce GenLLaVA, a Generative Large Language, and Visual Assistant. GenLLaVA is built through a strategy that combines three types of large pre-trained models through instruction finetuning: LLaMA for language modeling, SigLIP for image-text matching, and StableDiffusion for text-to-image generation. Our model demonstrates visual understanding capabilities on par with LLaVA and additionally demonstrates competitive results with native multimodal models such as Unified-IO 2, paving the way for building advanced general-purpose visual assistants by effectively re-using existing multimodal models. We open-source our dataset, codebase, and model checkpoints to foster further research and application in this domain.
☆ NLDF: Neural Light Dynamic Fields for Efficient 3D Talking Head Generation
Talking head generation based on the neural radiation fields model has shown promising visual effects. However, the slow rendering speed of NeRF seriously limits its application, due to the burdensome calculation process over hundreds of sampled points to synthesize one pixel. In this work, a novel Neural Light Dynamic Fields model is proposed aiming to achieve generating high quality 3D talking face with significant speedup. The NLDF represents light fields based on light segments, and a deep network is used to learn the entire light beam's information at once. In learning the knowledge distillation is applied and the NeRF based synthesized result is used to guide the correct coloration of light segments in NLDF. Furthermore, a novel active pool training strategy is proposed to focus on high frequency movements, particularly on the speaker mouth and eyebrows. The propose method effectively represents the facial light dynamics in 3D talking video generation, and it achieves approximately 30 times faster speed compared to state of the art NeRF based method, with comparable generation visual quality.
☆ YOLO9tr: A Lightweight Model for Pavement Damage Detection Utilizing a Generalized Efficient Layer Aggregation Network and Attention Mechanism
Maintaining road pavement integrity is crucial for ensuring safe and efficient transportation. Conventional methods for assessing pavement condition are often laborious and susceptible to human error. This paper proposes YOLO9tr, a novel lightweight object detection model for pavement damage detection, leveraging the advancements of deep learning. YOLO9tr is based on the YOLOv9 architecture, incorporating a partial attention block that enhances feature extraction and attention mechanisms, leading to improved detection performance in complex scenarios. The model is trained on a comprehensive dataset comprising road damage images from multiple countries, including an expanded set of damage categories beyond the standard four. This broadened classification range allows for a more accurate and realistic assessment of pavement conditions. Comparative analysis demonstrates YOLO9tr's superior precision and inference speed compared to state-of-the-art models like YOLO8, YOLO9 and YOLO10, achieving a balance between computational efficiency and detection accuracy. The model achieves a high frame rate of up to 136 FPS, making it suitable for real-time applications such as video surveillance and automated inspection systems. The research presents an ablation study to analyze the impact of architectural modifications and hyperparameter variations on model performance, further validating the effectiveness of the partial attention block. The results highlight YOLO9tr's potential for practical deployment in real-time pavement condition monitoring, contributing to the development of robust and efficient solutions for maintaining safe and functional road infrastructure.
☆ Holistic-Motion2D: Scalable Whole-body Human Motion Generation in 2D Space
In this paper, we introduce a novel path to $\textit{general}$ human motion generation by focusing on 2D space. Traditional methods have primarily generated human motions in 3D, which, while detailed and realistic, are often limited by the scope of available 3D motion data in terms of both the size and the diversity. To address these limitations, we exploit extensive availability of 2D motion data. We present $\textbf{Holistic-Motion2D}$, the first comprehensive and large-scale benchmark for 2D whole-body motion generation, which includes over 1M in-the-wild motion sequences, each paired with high-quality whole-body/partial pose annotations and textual descriptions. Notably, Holistic-Motion2D is ten times larger than the previously largest 3D motion dataset. We also introduce a baseline method, featuring innovative $\textit{whole-body part-aware attention}$ and $\textit{confidence-aware modeling}$ techniques, tailored for 2D $\underline{\text T}$ext-driv$\underline{\text{EN}}$ whole-bo$\underline{\text D}$y motion gen$\underline{\text{ER}}$ation, namely $\textbf{Tender}$. Extensive experiments demonstrate the effectiveness of $\textbf{Holistic-Motion2D}$ and $\textbf{Tender}$ in generating expressive, diverse, and realistic human motions. We also highlight the utility of 2D motion for various downstream applications and its potential for lifting to 3D motion. The page link is: https://holistic-motion2d.github.io.
comment: 22 pages, 11figures, 17 tables
☆ Mining Open Semantics from CLIP: A Relation Transition Perspective for Few-Shot Learning
Contrastive Vision-Language Pre-training(CLIP) demonstrates impressive zero-shot capability. The key to improve the adaptation of CLIP to downstream task with few exemplars lies in how to effectively model and transfer the useful knowledge embedded in CLIP. Previous work mines the knowledge typically based on the limited visual samples and close-set semantics (i.e., within target category set of downstream task). However, the aligned CLIP image/text encoders contain abundant relationships between visual features and almost infinite open semantics, which may benefit the few-shot learning but remains unexplored. In this paper, we propose to mine open semantics as anchors to perform a relation transition from image-anchor relationship to image-target relationship to make predictions. Specifically, we adopt a transformer module which takes the visual feature as "Query", the text features of the anchors as "Key" and the similarity matrix between the text features of anchor and target classes as "Value". In this way, the output of such a transformer module represents the relationship between the image and target categories, i.e., the classification predictions. To avoid manually selecting the open semantics, we make the [CLASS] token of input text embedding learnable. We conduct extensive experiments on eleven representative classification datasets. The results show that our method performs favorably against previous state-of-the-arts considering few-shot classification settings.
☆ STEVE Series: Step-by-Step Construction of Agent Systems in Minecraft CVPR 2024
Building an embodied agent system with a large language model (LLM) as its core is a promising direction. Due to the significant costs and uncontrollable factors associated with deploying and training such agents in the real world, we have decided to begin our exploration within the Minecraft environment. Our STEVE Series agents can complete basic tasks in a virtual environment and more challenging tasks such as navigation and even creative tasks, with an efficiency far exceeding previous state-of-the-art methods by a factor of $2.5\times$ to $7.3\times$. We begin our exploration with a vanilla large language model, augmenting it with a vision encoder and an action codebase trained on our collected high-quality dataset STEVE-21K. Subsequently, we enhanced it with a Critic and memory to transform it into a complex system. Finally, we constructed a hierarchical multi-agent system. Our recent work explored how to prune the agent system through knowledge distillation. In the future, we will explore more potential applications of STEVE agents in the real world.
comment: CVPR 2024 Embodied AI Workshop
☆ Accurate and Fast Pixel Retrieval with Spatial and Uncertainty Aware Hypergraph Diffusion
This paper presents a novel method designed to enhance the efficiency and accuracy of both image retrieval and pixel retrieval. Traditional diffusion methods struggle to propagate spatial information effectively in conventional graphs due to their reliance on scalar edge weights. To overcome this limitation, we introduce a hypergraph-based framework, uniquely capable of efficiently propagating spatial information using local features during query time, thereby accurately retrieving and localizing objects within a database. Additionally, we innovatively utilize the structural information of the image graph through a technique we term "community selection". This approach allows for the assessment of the initial search result's uncertainty and facilitates an optimal balance between accuracy and speed. This is particularly crucial in real-world applications where such trade-offs are often necessary. Our experimental results, conducted on the (P)ROxford and (P)RParis datasets, demonstrate the significant superiority of our method over existing diffusion techniques. We achieve state-of-the-art (SOTA) accuracy in both image-level and pixel-level retrieval, while also maintaining impressive processing speed. This dual achievement underscores the effectiveness of our hypergraph-based framework and community selection technique, marking a notable advancement in the field of content-based image retrieval.
☆ Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models
Multimodal Large Language Models (MLLMs) have shown significant promise in various applications, leading to broad interest from researchers and practitioners alike. However, a comprehensive evaluation of their long-context capabilities remains underexplored. To address these gaps, we introduce the MultiModal Needle-in-a-haystack (MMNeedle) benchmark, specifically designed to assess the long-context capabilities of MLLMs. Besides multi-image input, we employ image stitching to further increase the input context length, and develop a protocol to automatically generate labels for sub-image level retrieval. Essentially, MMNeedle evaluates MLLMs by stress-testing their capability to locate a target sub-image (needle) within a set of images (haystack) based on textual instructions and descriptions of image contents. This setup necessitates an advanced understanding of extensive visual contexts and effective information retrieval within long-context image inputs. With this benchmark, we evaluate state-of-the-art MLLMs, encompassing both API-based and open-source models. The findings reveal that GPT-4o consistently surpasses other models in long-context scenarios, but suffers from hallucination problems in negative samples, i.e., when needles are not in the haystacks. Our comprehensive long-context evaluation of MLLMs also sheds lights on the considerable performance gap between API-based and open-source models. All the code, data, and instructions required to reproduce the main results are available at https://github.com/Wang-ML-Lab/multimodal-needle-in-a-haystack.
☆ WeatherQA: Can Multimodal Language Models Reason about Severe Weather?
Severe convective weather events, such as hail, tornadoes, and thunderstorms, often occur quickly yet cause significant damage, costing billions of dollars every year. This highlights the importance of forecasting severe weather threats hours in advance to better prepare meteorologists and residents in at-risk areas. Can modern large foundation models perform such forecasting? Existing weather benchmarks typically focus only on predicting time-series changes in certain weather parameters (e.g., temperature, moisture) with text-only features. In this work, we introduce WeatherQA, the first multimodal dataset designed for machines to reason about complex combinations of weather parameters (a.k.a., ingredients) and predict severe weather in real-world scenarios. The dataset includes over 8,000 (multi-images, text) pairs for diverse severe weather events. Each pair contains rich information crucial for forecasting -- the images describe the ingredients capturing environmental instability, surface observations, and radar reflectivity, and the text contains forecast analyses written by human experts. With WeatherQA, we evaluate state-of-the-art vision language models , including GPT4, Claude3, Gemini-1.5, and a fine-tuned Llama3-based VLM, by designing two challenging tasks: (1) multi-choice QA for predicting affected area and (2) classification of the development potential of severe convection. These tasks require deep understanding of domain knowledge (e.g., atmospheric dynamics) and complex reasoning over multimodal data (e.g., interactions between weather parameters). We show a substantial gap between the strongest VLM, GPT4o, and human reasoning. Our comprehensive case study with meteorologists further reveals the weaknesses of the models, suggesting that better training and data integration are necessary to bridge this gap. WeatherQA link: https://github.com/chengqianma/WeatherQA.
comment: 26 pages, 9 figures
☆ Zero-Shot Scene Change Detection
We present a novel, training-free approach to scene change detection. Our method leverages tracking models, which inherently perform change detection between consecutive frames of video by identifying common objects and detecting new or missing objects. Specifically, our method takes advantage of the change detection effect of the tracking model by inputting reference and query images instead of consecutive frames. Furthermore, we focus on the content gap and style gap between two input images in change detection, and address both issues by proposing adaptive content threshold and style bridging layers, respectively. Finally, we extend our approach to video to exploit rich temporal information, enhancing scene change detection performance. We compare our approach and baseline through various experiments. While existing train-based baseline tend to specialize only in the trained domain, our method shows consistent performance across various domains, proving the competitiveness of our approach.
comment: Preprint. Under review
☆ Consistency^2: Consistent and Fast 3D Painting with Latent Consistency Models
Generative 3D Painting is among the top productivity boosters in high-resolution 3D asset management and recycling. Ever since text-to-image models became accessible for inference on consumer hardware, the performance of 3D Painting methods has consistently improved and is currently close to plateauing. At the core of most such models lies denoising diffusion in the latent space, an inherently time-consuming iterative process. Multiple techniques have been developed recently to accelerate generation and reduce sampling iterations by orders of magnitude. Designed for 2D generative imaging, these techniques do not come with recipes for lifting them into 3D. In this paper, we address this shortcoming by proposing a Latent Consistency Model (LCM) adaptation for the task at hand. We analyze the strengths and weaknesses of the proposed model and evaluate it quantitatively and qualitatively. Based on the Objaverse dataset samples study, our 3D painting method attains strong preference in all evaluations. Source code is available at https://github.com/kongdai123/consistency2.
☆ Vid3D: Synthesis of Dynamic 3D Scenes using 2D Video Diffusion
A recent frontier in computer vision has been the task of 3D video generation, which consists of generating a time-varying 3D representation of a scene. To generate dynamic 3D scenes, current methods explicitly model 3D temporal dynamics by jointly optimizing for consistency across both time and views of the scene. In this paper, we instead investigate whether it is necessary to explicitly enforce multiview consistency over time, as current approaches do, or if it is sufficient for a model to generate 3D representations of each timestep independently. We hence propose a model, Vid3D, that leverages 2D video diffusion to generate 3D videos by first generating a 2D "seed" of the video's temporal dynamics and then independently generating a 3D representation for each timestep in the seed video. We evaluate Vid3D against two state-of-the-art 3D video generation methods and find that Vid3D is achieves comparable results despite not explicitly modeling 3D temporal dynamics. We further ablate how the quality of Vid3D depends on the number of views generated per frame. While we observe some degradation with fewer views, performance degradation remains minor. Our results thus suggest that 3D temporal knowledge may not be necessary to generate high-quality dynamic 3D scenes, potentially enabling simpler generative algorithms for this task.
comment: 14 pages, 10 figures, 3 tables
☆ Frozen CLIP: A Strong Backbone for Weakly Supervised Semantic Segmentation CVPR 2024
Weakly supervised semantic segmentation has witnessed great achievements with image-level labels. Several recent approaches use the CLIP model to generate pseudo labels for training an individual segmentation model, while there is no attempt to apply the CLIP model as the backbone to directly segment objects with image-level labels. In this paper, we propose WeCLIP, a CLIP-based single-stage pipeline, for weakly supervised semantic segmentation. Specifically, the frozen CLIP model is applied as the backbone for semantic feature extraction, and a new decoder is designed to interpret extracted semantic features for final prediction. Meanwhile, we utilize the above frozen backbone to generate pseudo labels for training the decoder. Such labels cannot be optimized during training. We then propose a refinement module (RFM) to rectify them dynamically. Our architecture enforces the proposed decoder and RFM to benefit from each other to boost the final performance. Extensive experiments show that our approach significantly outperforms other approaches with less training cost. Additionally, our WeCLIP also obtains promising results for fully supervised settings. The code is available at https://github.com/zbf1991/WeCLIP.
comment: CVPR 2024 Highlight
☆ SUGARCREPE++ Dataset: Vision-Language Model Sensitivity to Semantic and Lexical Alterations
Despite their remarkable successes, state-of-the-art large language models (LLMs), including vision-and-language models (VLMs) and unimodal language models (ULMs), fail to understand precise semantics. For example, semantically equivalent sentences expressed using different lexical compositions elicit diverging representations. The degree of this divergence and its impact on encoded semantics is not very well understood. In this paper, we introduce the SUGARCREPE++ dataset to analyze the sensitivity of VLMs and ULMs to lexical and semantic alterations. Each sample in SUGARCREPE++ dataset consists of an image and a corresponding triplet of captions: a pair of semantically equivalent but lexically different positive captions and one hard negative caption. This poses a 3-way semantic (in)equivalence problem to the language models. We comprehensively evaluate VLMs and ULMs that differ in architecture, pre-training objectives and datasets to benchmark the performance of SUGARCREPE++ dataset. Experimental results highlight the difficulties of VLMs in distinguishing between lexical and semantic variations, particularly in object attributes and spatial relations. Although VLMs with larger pre-training datasets, model sizes, and multiple pre-training objectives achieve better performance on SUGARCREPE++, there is a significant opportunity for improvement. We show that all the models which achieve better performance on compositionality datasets need not perform equally well on SUGARCREPE++, signifying that compositionality alone may not be sufficient for understanding semantic and lexical alterations. Given the importance of the property that the SUGARCREPE++ dataset targets, it serves as a new challenge to the vision-and-language community.
☆ Few-Shot Recognition via Stage-Wise Augmented Finetuning
Few-shot recognition aims to train a classification model with only a few labeled examples of pre-defined concepts, where annotation can be costly in a downstream task. In another related research area, zero-shot recognition, which assumes no access to any downstream-task data, has been greatly advanced by using pretrained Vision-Language Models (VLMs). In this area, retrieval-augmented learning (RAL) effectively boosts zero-shot accuracy by retrieving and learning from external data relevant to downstream concepts. Motivated by these advancements, our work explores RAL for few-shot recognition. While seemingly straightforward despite being under-explored in the literature (till now!), we present novel challenges and opportunities for applying RAL for few-shot recognition. First, perhaps surprisingly, simply finetuning the VLM on a large amount of retrieved data barely surpasses state-of-the-art zero-shot methods due to the imbalanced distribution of retrieved data and its domain gaps compared to few-shot annotated data. Second, finetuning a VLM on few-shot examples alone significantly outperforms prior methods, and finetuning on the mix of retrieved and few-shot data yields even better results. Third, to mitigate the imbalanced distribution and domain gap issue, we propose Stage-Wise Augmented fineTuning (SWAT) method, which involves end-to-end finetuning on mixed data for the first stage and retraining the classifier solely on the few-shot data in the second stage. Extensive experiments show that SWAT achieves the best performance on standard benchmark datasets, resoundingly outperforming prior works by ~10% in accuracy. Code is available at https://github.com/tian1327/SWAT.
☆ Federated Face Forgery Detection Learning with Personalized Representation
Deep generator technology can produce high-quality fake videos that are indistinguishable, posing a serious social threat. Traditional forgery detection methods directly centralized training on data and lacked consideration of information sharing in non-public video data scenarios and data privacy. Naturally, the federated learning strategy can be applied for privacy protection, which aggregates model parameters of clients but not original data. However, simple federated learning can't achieve satisfactory performance because of poor generalization capabilities for the real hybrid-domain forgery dataset. To solve the problem, the paper proposes a novel federated face forgery detection learning with personalized representation. The designed Personalized Forgery Representation Learning aims to learn the personalized representation of each client to improve the detection performance of individual client models. In addition, a personalized federated learning training strategy is utilized to update the parameters of the distributed detection model. Here collaborative training is conducted on multiple distributed client devices, and shared representations of these client models are uploaded to the server side for aggregation. Experiments on several public face forgery detection datasets demonstrate the superior performance of the proposed algorithm compared with state-of-the-art methods. The code is available at \emph{https://github.com/GANG370/PFR-Forgery.}
comment: The code is publicly available
☆ Graspness Discovery in Clutters for Fast and Accurate Grasp Detection ICCV 2021
Efficient and robust grasp pose detection is vital for robotic manipulation. For general 6 DoF grasping, conventional methods treat all points in a scene equally and usually adopt uniform sampling to select grasp candidates. However, we discover that ignoring where to grasp greatly harms the speed and accuracy of current grasp pose detection methods. In this paper, we propose "graspness", a quality based on geometry cues that distinguishes graspable areas in cluttered scenes. A look-ahead searching method is proposed for measuring the graspness and statistical results justify the rationality of our method. To quickly detect graspness in practice, we develop a neural network named cascaded graspness model to approximate the searching process. Extensive experiments verify the stability, generality and effectiveness of our graspness model, allowing it to be used as a plug-and-play module for different methods. A large improvement in accuracy is witnessed for various previous methods after equipping our graspness model. Moreover, we develop GSNet, an end-to-end network that incorporates our graspness model for early filtering of low-quality predictions. Experiments on a large-scale benchmark, GraspNet-1Billion, show that our method outperforms previous arts by a large margin (30+ AP) and achieves a high inference speed. The library of GSNet has been integrated into AnyGrasp, which is at https://github.com/graspnet/anygrasp_sdk.
comment: ICCV 2021
☆ Diffusion Models in Low-Level Vision: A Survey
Deep generative models have garnered significant attention in low-level vision tasks due to their generative capabilities. Among them, diffusion model-based solutions, characterized by a forward diffusion process and a reverse denoising process, have emerged as widely acclaimed for their ability to produce samples of superior quality and diversity. This ensures the generation of visually compelling results with intricate texture information. Despite their remarkable success, a noticeable gap exists in a comprehensive survey that amalgamates these pioneering diffusion model-based works and organizes the corresponding threads. This paper proposes the comprehensive review of diffusion model-based techniques. We present three generic diffusion modeling frameworks and explore their correlations with other deep generative models, establishing the theoretical foundation. Following this, we introduce a multi-perspective categorization of diffusion models, considering both the underlying framework and the target task. Additionally, we summarize extended diffusion models applied in other tasks, including medical, remote sensing, and video scenarios. Moreover, we provide an overview of commonly used benchmarks and evaluation metrics. We conduct a thorough evaluation, encompassing both performance and efficiency, of diffusion model-based techniques in three prominent tasks. Finally, we elucidate the limitations of current diffusion models and propose seven intriguing directions for future research. This comprehensive examination aims to facilitate a profound understanding of the landscape surrounding denoising diffusion models in the context of low-level vision tasks. A curated list of diffusion model-based techniques in over 20 low-level vision tasks can be found at https://github.com/ChunmingHe/awesome-diffusion-models-in-low-level-vision.
comment: 20 pages, 23 figures, 4 tables
☆ Neural Lineage
Given a well-behaved neural network, is possible to identify its parent, based on which it was tuned? In this paper, we introduce a novel task known as neural lineage detection, aiming at discovering lineage relationships between parent and child models. Specifically, from a set of parent models, neural lineage detection predicts which parent model a child model has been fine-tuned from. We propose two approaches to address this task. (1) For practical convenience, we introduce a learning-free approach, which integrates an approximation of the finetuning process into the neural network representation similarity metrics, leading to a similarity-based lineage detection scheme. (2) For the pursuit of accuracy, we introduce a learning-based lineage detector comprising encoders and a transformer detector. Through experimentation, we have validated that our proposed learning-free and learning-based methods outperform the baseline in various learning settings and are adaptable to a variety of visual models. Moreover, they also exhibit the ability to trace cross-generational lineage, identifying not only parent models but also their ancestors.
☆ Slicing Through Bias: Explaining Performance Gaps in Medical Image Analysis using Slice Discovery Methods
Machine learning models have achieved high overall accuracy in medical image analysis. However, performance disparities on specific patient groups pose challenges to their clinical utility, safety, and fairness. This can affect known patient groups - such as those based on sex, age, or disease subtype - as well as previously unknown and unlabeled groups. Furthermore, the root cause of such observed performance disparities is often challenging to uncover, hindering mitigation efforts. In this paper, to address these issues, we leverage Slice Discovery Methods (SDMs) to identify interpretable underperforming subsets of data and formulate hypotheses regarding the cause of observed performance disparities. We introduce a novel SDM and apply it in a case study on the classification of pneumothorax and atelectasis from chest x-rays. Our study demonstrates the effectiveness of SDMs in hypothesis formulation and yields an explanation of previously observed but unexplained performance disparities between male and female patients in widely used chest X-ray datasets and models. Our findings indicate shortcut learning in both classification tasks, through the presence of chest drains and ECG wires, respectively. Sex-based differences in the prevalence of these shortcut features appear to cause the observed classification performance gap, representing a previously underappreciated interaction between shortcut learning and model fairness analyses.
☆ COT Flow: Learning Optimal-Transport Image Sampling and Editing by Contrastive Pairs
Diffusion models have demonstrated strong performance in sampling and editing multi-modal data with high generation quality, yet they suffer from the iterative generation process which is computationally expensive and slow. In addition, most methods are constrained to generate data from Gaussian noise, which limits their sampling and editing flexibility. To overcome both disadvantages, we present Contrastive Optimal Transport Flow (COT Flow), a new method that achieves fast and high-quality generation with improved zero-shot editing flexibility compared to previous diffusion models. Benefiting from optimal transport (OT), our method has no limitation on the prior distribution, enabling unpaired image-to-image (I2I) translation and doubling the editable space (at both the start and end of the trajectory) compared to other zero-shot editing methods. In terms of quality, COT Flow can generate competitive results in merely one step compared to previous state-of-the-art unpaired image-to-image (I2I) translation methods. To highlight the advantages of COT Flow through the introduction of OT, we introduce the COT Editor to perform user-guided editing with excellent flexibility and quality. The code will be released at https://github.com/zuxinrui/cot_flow.
☆ TutteNet: Injective 3D Deformations by Composition of 2D Mesh Deformations
This work proposes a novel representation of injective deformations of 3D space, which overcomes existing limitations of injective methods: inaccuracy, lack of robustness, and incompatibility with general learning and optimization frameworks. The core idea is to reduce the problem to a deep composition of multiple 2D mesh-based piecewise-linear maps. Namely, we build differentiable layers that produce mesh deformations through Tutte's embedding (guaranteed to be injective in 2D), and compose these layers over different planes to create complex 3D injective deformations of the 3D volume. We show our method provides the ability to efficiently and accurately optimize and learn complex deformations, outperforming other injective approaches. As a main application, we produce complex and artifact-free NeRF and SDF deformations.
☆ DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features
We propose DistillNeRF, a self-supervised learning framework addressing the challenge of understanding 3D environments from limited 2D observations in autonomous driving. Our method is a generalizable feedforward model that predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs, and is trained self-supervised with differentiable rendering to reconstruct RGB, depth, or feature images. Our first insight is to exploit per-scene optimized Neural Radiance Fields (NeRFs) by generating dense depth and virtual camera targets for training, thereby helping our model to learn 3D geometry from sparse non-overlapping image inputs. Second, to learn a semantically rich 3D representation, we propose distilling features from pre-trained 2D foundation models, such as CLIP or DINOv2, thereby enabling various downstream tasks without the need for costly 3D human annotations. To leverage these two insights, we introduce a novel model architecture with a two-stage lift-splat-shoot encoder and a parameterized sparse hierarchical voxel representation. Experimental results on the NuScenes dataset demonstrate that DistillNeRF significantly outperforms existing comparable self-supervised methods for scene reconstruction, novel view synthesis, and depth estimation; and it allows for competitive zero-shot 3D semantic occupancy prediction, as well as open-world scene understanding through distilled foundation model features. Demos and code will be available at https://distillnerf.github.io/.
♻ ☆ Ovis: Structural Embedding Alignment for Multimodal Large Language Model
Current Multimodal Large Language Models (MLLMs) typically integrate a pre-trained LLM with another pre-trained vision transformer through a connector, such as an MLP, endowing the LLM with visual capabilities. However, the misalignment between two embedding strategies in MLLMs -- the structural textual embeddings based on an embedding look-up table and the continuous embeddings generated directly by the vision encoder -- makes challenges for a more seamless fusion of visual and textual information. We propose Ovis, a novel MLLM architecture designed to structurally align visual and textual embeddings. Ovis integrates an additional learnable visual embedding table into the visual encoder's process. To capture rich visual semantics, each image patch indexes the visual embedding table multiple times, resulting in a final visual embedding that is a probabilistic combination of the indexed embeddings. This structural approach mirrors the method used for generating textual embeddings. Empirical evaluations on various multimodal benchmarks show that Ovis outperforms open-source MLLMs of similar parameter scales and even surpasses the proprietary model Qwen-VL-Plus overall. These results highlight the potential of Ovis' structured visual representation for advancing MLLM architectural design and promoting more effective multimodal learning. Code, datasets, and models are available at https://github.com/AIDC-AI/Ovis.
♻ ☆ Too Many Frames, not all Useful:Efficient Strategies for Long-Form Video QA
Long-form videos that span across wide temporal intervals are highly information redundant and contain multiple distinct events or entities that are often loosely-related. Therefore, when performing long-form video question answering (LVQA),all information necessary to generate a correct response can often be contained within a small subset of frames. Recent literature explore the use of large language models (LLMs) in LVQA benchmarks, achieving exceptional performance, while relying on vision language models (VLMs) to convert all visual content within videos into natural language. Such VLMs often independently caption a large number of frames uniformly sampled from long videos, which is not efficient and can mostly be redundant. Questioning these decision choices, we explore optimal strategies for key-frame selection and sequence-aware captioning, that can significantly reduce these redundancies. We propose two novel approaches that improve each of aspects, namely Hierarchical Keyframe Selector and Sequential Visual LLM. Our resulting framework termed LVNet achieves state-of-the-art performance across three benchmark LVQA datasets. Our code will be released publicly.
♻ ☆ Leveraging VLM-Based Pipelines to Annotate 3D Objects
Pretrained vision language models (VLMs) present an opportunity to caption unlabeled 3D objects at scale. The leading approach to summarize VLM descriptions from different views of an object (Luo et al., 2023) relies on a language model (GPT4) to produce the final output. This text-based aggregation is susceptible to hallucinations as it merges potentially contradictory descriptions. We propose an alternative algorithm to marginalize over factors such as the viewpoint that affect the VLM's response. Instead of merging text-only responses, we utilize the VLM's joint image-text likelihoods. We show our probabilistic aggregation is not only more reliable and efficient, but sets the SoTA on inferring object types with respect to human-verified labels. The aggregated annotations are also useful for conditional inference; they improve downstream predictions (e.g., of object material) when the object's type is specified as an auxiliary text-based input. Such auxiliary inputs allow ablating the contribution of visual reasoning over visionless reasoning in an unsupervised setting. With these supervised and unsupervised evaluations, we show how a VLM-based pipeline can be leveraged to produce reliable annotations for 764K objects from the Objaverse dataset.
♻ ☆ Ultrasound Imaging based on the Variance of a Diffusion Restoration Model
Despite today's prevalence of ultrasound imaging in medicine, ultrasound signal-to-noise ratio is still affected by several sources of noise and artefacts. Moreover, enhancing ultrasound image quality involves balancing concurrent factors like contrast, resolution, and speckle preservation. Recently, there has been progress in both model-based and learning-based approaches addressing the problem of ultrasound image reconstruction. Bringing the best from both worlds, we propose a hybrid reconstruction method combining an ultrasound linear direct model with a learning-based prior coming from a generative Denoising Diffusion model. More specifically, we rely on the unsupervised fine-tuning of a pre-trained Denoising Diffusion Restoration Model (DDRM). Given the nature of multiplicative noise inherent to ultrasound, this paper proposes an empirical model to characterize the stochasticity of diffusion reconstruction of ultrasound images, and shows the interest of its variance as an echogenicity map estimator. We conduct experiments on synthetic, in-vitro, and in-vivo data, demonstrating the efficacy of our variance imaging approach in achieving high-quality image reconstructions from single plane-wave acquisitions and in comparison to state-of-the-art methods. The code is available at: https://github.com/Yuxin-Zhang-Jasmine/DRUSvar
comment: 5 pages; accepted by EUSIPCO 2024. arXiv admin note: text overlap with arXiv:2310.20618
♻ ☆ MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance
The deployment of multimodal large language models (MLLMs) has brought forth a unique vulnerability: susceptibility to malicious attacks through visual inputs. This paper investigates the novel challenge of defending MLLMs against such attacks. Compared to large language models (LLMs), MLLMs include an additional image modality. We discover that images act as a ``foreign language" that is not considered during safety alignment, making MLLMs more prone to producing harmful responses. Unfortunately, unlike the discrete tokens considered in text-based LLMs, the continuous nature of image signals presents significant alignment challenges, which poses difficulty to thoroughly cover all possible scenarios. This vulnerability is exacerbated by the fact that most state-of-the-art MLLMs are fine-tuned on limited image-text pairs that are much fewer than the extensive text-based pretraining corpus, which makes the MLLMs more prone to catastrophic forgetting of their original abilities during safety fine-tuning. To tackle these challenges, we introduce MLLM-Protector, a plug-and-play strategy that solves two subtasks: 1) identifying harmful responses via a lightweight harm detector, and 2) transforming harmful responses into harmless ones via a detoxifier. This approach effectively mitigates the risks posed by malicious visual inputs without compromising the original performance of MLLMs. Our results demonstrate that MLLM-Protector offers a robust solution to a previously unaddressed aspect of MLLM security.
♻ ☆ TetSphere Splatting: Representing High-Quality Geometry with Lagrangian Volumetric Meshes
We present TetSphere splatting, an explicit, Lagrangian representation for reconstructing 3D shapes with high-quality geometry. In contrast to conventional object reconstruction methods which predominantly use Eulerian representations, including both neural implicit (e.g., NeRF, NeuS) and explicit representations (e.g., DMTet), and often struggle with high computational demands and suboptimal mesh quality, TetSphere splatting utilizes an underused but highly effective geometric primitive -- tetrahedral meshes. This approach directly yields superior mesh quality without relying on neural networks or post-processing. It deforms multiple initial tetrahedral spheres to accurately reconstruct the 3D shape through a combination of differentiable rendering and geometric energy optimization, resulting in significant computational efficiency. Serving as a robust and versatile geometry representation, Tet-Sphere splatting seamlessly integrates into diverse applications, including single-view 3D reconstruction, image-/text-to-3D content generation. Experimental results demonstrate that TetSphere splatting outperforms existing representations, delivering faster optimization speed, enhanced mesh quality, and reliable preservation of thin structures.
♻ ☆ Novel Fundus Image Preprocessing for Retcam Images to Improve Deep Learning Classification of Retinopathy of Prematurity
Retinopathy of Prematurity (ROP) is a potentially blinding eye disorder because of damage to the eye's retina which can affect babies born prematurely. Screening of ROP is essential for early detection and treatment. This is a laborious and manual process which requires trained physician performing dilated ophthalmological examination which can be subjective resulting in lower diagnosis success for clinically significant disease. Automated diagnostic methods can assist ophthalmologists increase diagnosis accuracy using deep learning. Several research groups have highlighted various approaches. Captured ROP Retcam images suffer from poor quality. This paper proposes the use of improved novel fundus preprocessing methods using pretrained transfer learning frameworks to create hybrid models to give higher diagnosis accuracy. Once trained and validated, the evaluations showed that these novel methods in comparison to traditional imaging processing contribute to better and in many aspects higher accuracy in classifying Plus disease, Stages of ROP and Zones in comparison to peer papers.
comment: 10 pages, 4 figures, 7 tables. arXiv admin note: text overlap with arXiv:1904.08796 by other authors
♻ ☆ VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
In this paper, we present the VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video and audio-oriented tasks. Building upon its predecessor, VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC) connector, which effectively captures the intricate spatial and temporal dynamics of video data. Additionally, we integrate an Audio Branch into the model through joint training, thereby enriching the multimodal understanding capabilities of the model by seamlessly incorporating audio cues. Comprehensive evaluations on multiple-choice video question answering (MC-VQA), open-ended video question answering (OE-VQA), and video captioning (VC) tasks demonstrate that VideoLLaMA 2 consistently achieves competitive results among open-source models and even gets close to some proprietary models on several benchmarks. Furthermore, VideoLLaMA 2 exhibits reasonable improvements in audio-only and audio-video question-answering (AQA & OE-AVQA) benchmarks over existing models. These advancements underline VideoLLaMA 2's superior performance in multimodal comprehension, setting a new standard for intelligent video analysis systems. All models are public to facilitate further research.
comment: ZC, SL, HZ, YX, and XL contributed equally to this project
♻ ☆ Evaluating Task-based Effectiveness of MLLMs on Charts
In this paper, we explore a forward-thinking question: Is GPT-4V effective at low-level data analysis tasks on charts? To this end, we first curate a large-scale dataset, named ChartInsights, consisting of 89,388 quartets (chart, task, question, answer) and covering 10 widely-used low-level data analysis tasks on 7 chart types. Firstly, we conduct systematic evaluations to understand the capabilities and limitations of 18 advanced MLLMs, which include 12 open-source models and 6 closed-source models. Starting with a standard textual prompt approach, the average accuracy rate across the 18 MLLMs is 36.17%. Among all the models, GPT-4V achieves the highest accuracy, reaching 56.13%. To understand the limitations of multimodal large models in low-level data analysis tasks, we have designed various experiments to conduct an in-depth test of capabilities of GPT-4V. We further investigate how visual modifications to charts, such as altering visual elements (e.g. changing color schemes) and introducing perturbations (e.g. adding image noise), affect performance of GPT-4V. Secondly, we present 12 experimental findings. These findings suggest potential of GPT-4V to revolutionize interaction with charts and uncover the gap between human analytic needs and capabilities of GPT-4V. Thirdly, we propose a novel textual prompt strategy, named Chain-of-Charts, tailored for low-level analysis tasks, which boosts model performance by 24.36%, resulting in an accuracy of 80.49%. Furthermore, by incorporating a visual prompt strategy that directs attention of GPT-4V to question-relevant visual elements, we further improve accuracy to 83.83%. Our study not only sheds light on the capabilities and limitations of GPT-4V in low-level data analysis tasks but also offers valuable insights for future research.
comment: The experimental part needs to be revised. Withdraw this version
♻ ☆ MOWA: Multiple-in-One Image Warping Model
While recent image warping approaches achieved remarkable success on existing benchmarks, they still require training separate models for each specific task and cannot generalize well to different camera models or customized manipulations. To address diverse types of warping in practice, we propose a Multiple-in-One image WArping model (named MOWA) in this work. Specifically, we mitigate the difficulty of multi-task learning by disentangling the motion estimation at both the region level and pixel level. To further enable dynamic task-aware image warping, we introduce a lightweight point-based classifier that predicts the task type, serving as prompts to modulate the feature maps for more accurate estimation. To our knowledge, this is the first work that solves multiple practical warping tasks in one single model. Extensive experiments demonstrate that our MOWA, which is trained on six tasks for multiple-in-one image warping, outperforms state-of-the-art task-specific models across most tasks. Moreover, MOWA also exhibits promising potential to generalize into unseen scenes, as evidenced by cross-domain and zero-shot evaluations. The code and more visual results can be found on the project page: https://kangliao929.github.io/projects/mowa/.
comment: Project page: https://kangliao929.github.io/projects/mowa/
♻ ☆ A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions
Curation methods for massive vision-language datasets trade off between dataset size and quality. However, even the highest quality of available curated captions are far too short to capture the rich visual detail in an image. To show the value of dense and highly-aligned image-text pairs, we collect the Densely Captioned Images (DCI) dataset, containing 7805 natural images human-annotated with mask-aligned descriptions averaging above 1000 words each. With precise and reliable captions associated with specific parts of an image, we can evaluate vision-language models' (VLMs) understanding of image content with a novel task that matches each caption with its corresponding subcrop. As current models are often limited to 77 text tokens, we also introduce a summarized version (sDCI) in which each caption length is limited. We show that modern techniques that make progress on standard benchmarks do not correspond with significant improvement on our sDCI based benchmark. Lastly, we finetune CLIP using sDCI and show significant improvements over the baseline despite a small training set. By releasing the first human annotated dense image captioning dataset, we hope to enable the development of new benchmarks or fine-tuning recipes for the next generation of VLMs to come.
♻ ☆ Beyond Bare Queries: Open-Vocabulary Object Retrieval with 3D Scene Graph
Locating objects referred to in natural language poses a significant challenge for autonomous agents. Existing CLIP-based open-vocabulary methods successfully perform 3D object retrieval with simple (bare) queries but cannot cope with ambiguous descriptions that demand an understanding of object relations. To tackle this problem, we propose a modular approach called BBQ (Beyond Bare Queries), which constructs 3D scene spatial graph representation with metric edges and utilizes a large language model as a human-to-agent interface through our deductive scene reasoning algorithm. BBQ employs robust DINO-powered associations to form 3D objects, an advanced raycasting algorithm to project them to 2D, and a vision-language model to describe them as graph nodes. On Replica and ScanNet datasets, we show that the designed method accurately constructs 3D object-centric maps. We have demonstrated that their quality takes a leading place for open-vocabulary 3D semantic segmentation against other zero-shot methods. Also, we show that leveraging spatial relations is especially effective for scenes containing multiple entities of the same semantic class. On Sr3D and Nr3D benchmarks, our deductive approach demonstrates a significant improvement, enabling retrieving objects by complex queries compared to other state-of-the-art methods. Considering our design solutions, we achieved a processing speed approximately x3 times faster than the closest analog. This promising performance enables our approach for usage in applied intelligent robotics projects. We make the code publicly available at linukc.github.io/bbq/.
comment: 9 pages, 4 figures, 4 tables
♻ ☆ Fine-Grained Domain Generalization with Feature Structuralization
Fine-grained domain generalization (FGDG) is a more challenging task than traditional DG tasks due to its small inter-class variations and relatively large intra-class disparities. When domain distribution changes, the vulnerability of subtle features leads to a severe deterioration in model performance. Nevertheless, humans inherently demonstrate the capacity for generalizing to out-of-distribution data, leveraging structured multi-granularity knowledge that emerges from discerning the commonality and specificity within categories. Likewise, we propose a Feature Structuralized Domain Generalization (FSDG) model, wherein features experience structuralization into common, specific, and confounding segments, harmoniously aligned with their relevant semantic concepts, to elevate performance in FGDG. Specifically, feature structuralization (FS) is accomplished through joint optimization of five constraints: a decorrelation function applied to disentangled segments, three constraints ensuring common feature consistency and specific feature distinctiveness, and a prediction calibration term. By imposing these stipulations, FSDG is prompted to disentangle and align features based on multi-granularity knowledge, facilitating robust subtle distinctions among categories. Extensive experimentation on three benchmarks consistently validates the superiority of FSDG over state-of-the-art counterparts, with an average improvement of 6.2% in FGDG performance. Beyond that, the explainability analysis on explicit concept matching intensity between the shared concepts among categories and the model channels, along with experiments on various mainstream model architectures, substantiates the validity of FS.
♻ ☆ Dual Degradation Representation for Joint Deraining and Low-Light Enhancement in the Dark
Rain in the dark poses a significant challenge to deploying real-world applications such as autonomous driving, surveillance systems, and night photography. Existing low-light enhancement or deraining methods struggle to brighten low-light conditions and remove rain simultaneously. Additionally, cascade approaches like ``deraining followed by low-light enhancement'' or the reverse often result in problematic rain patterns or overly blurred and overexposed images. To address these challenges, we introduce an end-to-end model called L$^{2}$RIRNet, designed to manage both low-light enhancement and deraining in real-world settings. Our model features two main components: a Dual Degradation Representation Network (DDR-Net) and a Restoration Network. The DDR-Net independently learns degradation representations for luminance effects in dark areas and rain patterns in light areas, employing dual degradation loss to guide the training process. The Restoration Network restores the degraded image using a Fourier Detail Guidance (FDG) module, which leverages near-rainless detailed images, focusing on texture details in frequency and spatial domains to inform the restoration process. Furthermore, we contribute a dataset containing both synthetic and real-world low-light-rainy images. Extensive experiments demonstrate that our L$^{2}$RIRNet performs favorably against existing methods in both synthetic and complex real-world scenarios. All the code and dataset can be found in \url{https://github.com/linxin0/Low_light_rainy}.
♻ ☆ Towards Comprehensive Multimodal Perception: Introducing the Touch-Language-Vision Dataset
Tactility provides crucial support and enhancement for the perception and interaction capabilities of both humans and robots. Nevertheless, the multimodal research related to touch primarily focuses on visual and tactile modalities, with limited exploration in the domain of language. Beyond vocabulary, sentence-level descriptions contain richer semantics. Based on this, we construct a touch-language-vision dataset named TLV (Touch-Language-Vision) by human-machine cascade collaboration, featuring sentence-level descriptions for multimode alignment. The new dataset is used to fine-tune our proposed lightweight training framework, STLV-Align (Synergistic Touch-Language-Vision Alignment), achieving effective semantic alignment with minimal parameter adjustments (1%). Project Page: https://xiaoen0.github.io/touch.page/.
comment: Accepted by ICIC 2024
♻ ☆ Pretraining Vision-Language Model for Difference Visual Question Answering in Longitudinal Chest X-rays
Difference visual question answering (diff-VQA) is a challenging task that requires answering complex questions based on differences between a pair of images. This task is particularly important in reading chest X-ray images because radiologists often compare multiple images of the same patient taken at different times to track disease progression and changes in its severity in their clinical practice. However, previous works focused on designing specific network architectures for the diff-VQA task, missing opportunities to enhance the model's performance using a pretrained vision-language model (VLM). Here, we introduce a novel VLM called PLURAL, which is pretrained on natural and longitudinal chest X-ray data for the diff-VQA task. The model is developed using a step-by-step approach, starting with being pretrained on natural images and texts, followed by being trained using longitudinal chest X-ray data. The longitudinal data consist of pairs of X-ray images, along with question-answer sets and radiologist's reports that describe the changes in lung abnormalities and diseases over time. Our experimental results show that the PLURAL model outperforms state-of-the-art methods not only in diff-VQA for longitudinal X-rays but also in conventional VQA for a single X-ray image. Through extensive experiments, we demonstrate the effectiveness of the proposed VLM architecture and pretraining method in improving the model's performance.
♻ ☆ A-SDM: Accelerating Stable Diffusion through Model Assembly and Feature Inheritance Strategies
The Stable Diffusion Model (SDM) is a prevalent and effective model for text-to-image (T2I) and image-to-image (I2I) generation. Despite various attempts at sampler optimization, model distillation, and network quantification, these approaches typically maintain the original network architecture. The extensive parameter scale and substantial computational demands have limited research into adjusting the model architecture. This study focuses on reducing redundant computation in SDM and optimizes the model through both tuning and tuning-free methods. 1) For the tuning method, we design a model assembly strategy to reconstruct a lightweight model while preserving performance through distillation. Second, to mitigate performance loss due to pruning, we incorporate multi-expert conditional convolution (ME-CondConv) into compressed UNets to enhance network performance by increasing capacity without sacrificing speed. Third, we validate the effectiveness of the multi-UNet switching method for improving network speed. 2) For the tuning-free method, we propose a feature inheritance strategy to accelerate inference by skipping local computations at the block, layer, or unit level within the network structure. We also examine multiple sampling modes for feature inheritance at the time-step level. Experiments demonstrate that both the proposed tuning and the tuning-free methods can improve the speed and performance of the SDM. The lightweight model reconstructed by the model assembly strategy increases generation speed by $22.4%$, while the feature inheritance strategy enhances the SDM generation speed by $40.0%$.
comment: 19 pages, 16 figures, submitted to IEEE Transactions on Neural Networks and Learning Systems
♻ ☆ Human Mesh Recovery from Arbitrary Multi-view Images
Human mesh recovery from arbitrary multi-view images involves two characteristics: the arbitrary camera poses and arbitrary number of camera views. Because of the variability, designing a unified framework to tackle this task is challenging. The challenges can be summarized as the dilemma of being able to simultaneously estimate arbitrary camera poses and recover human mesh from arbitrary multi-view images while maintaining flexibility. To solve this dilemma, we propose a divide and conquer framework for Unified Human Mesh Recovery (U-HMR) from arbitrary multi-view images. In particular, U-HMR consists of a decoupled structure and two main components: camera and body decoupling (CBD), camera pose estimation (CPE), and arbitrary view fusion (AVF). As camera poses and human body mesh are independent of each other, CBD splits the estimation of them into two sub-tasks for two individual sub-networks (ie, CPE and AVF) to handle respectively, thus the two sub-tasks are disentangled. In CPE, since each camera pose is unrelated to the others, we adopt a shared MLP to process all views in a parallel way. In AVF, in order to fuse multi-view information and make the fusion operation independent of the number of views, we introduce a transformer decoder with a SMPL parameters query token to extract cross-view features for mesh recovery. To demonstrate the efficacy and flexibility of the proposed framework and effect of each component, we conduct extensive experiments on three public datasets: Human3.6M, MPI-INF-3DHP, and TotalCapture.
♻ ☆ SIMPLOT: Enhancing Chart Question Answering by Distilling Essentials
Recently, interpreting complex charts with logical reasoning has emerged as challenges due to the development of vision-language models. A prior state-of-the-art (SOTA) model has presented an end-to-end method that leverages the vision-language model to convert charts into table format utilizing Large Language Model (LLM) for reasoning. However, unlike natural images, charts contain a mix of essential and irrelevant information required for chart reasoning, and we discover that this characteristic can lower the performance of chart-to-table extraction. In this paper, we introduce SIMPLOT, a method designed to extract only the elements necessary for chart reasoning. The proposed method involves two steps: 1) training to mimic a simple plot that contains only the essential information from a complex chart for table extraction, followed by 2) performing reasoning based on the table. Our model enables accurate chart reasoning without the need for additional annotations or datasets, and its effectiveness is demonstrated through various experiments. Furthermore, we propose a novel prompt mimicking how human interpret charts for more accurate reasoning. Our source code is available at https://github.com/sangwu99/Simplot.
♻ ☆ InterControl: Zero-shot Human Interaction Generation by Controlling Every Joint
Text-conditioned motion synthesis has made remarkable progress with the emergence of diffusion models. However, the majority of these motion diffusion models are primarily designed for a single character and overlook multi-human interactions. In our approach, we strive to explore this problem by synthesizing human motion with interactions for a group of characters of any size in a zero-shot manner. The key aspect of our approach is the adaptation of human-wise interactions as pairs of human joints that can be either in contact or separated by a desired distance. In contrast to existing methods that necessitate training motion generation models on multi-human motion datasets with a fixed number of characters, our approach inherently possesses the flexibility to model human interactions involving an arbitrary number of individuals, thereby transcending the limitations imposed by the training data. We introduce a novel controllable motion generation method, InterControl, to encourage the synthesized motions maintaining the desired distance between joint pairs. It consists of a motion controller and an inverse kinematics guidance module that realistically and accurately aligns the joints of synthesized characters to the desired location. Furthermore, we demonstrate that the distance between joint pairs for human-wise interactions can be generated using an off-the-shelf Large Language Model (LLM). Experimental results highlight the capability of our framework to generate interactions with multiple human characters and its potential to work with off-the-shelf physics-based character simulators.
comment: Generate human interactions with only single-person data via joint contact pairs, code https://github.com/zhenzhiwang/intercontrol
♻ ☆ SeaFormer++: Squeeze-enhanced Axial Transformer for Mobile Visual Recognition ICLR 2023
Since the introduction of Vision Transformers, the landscape of many computer vision tasks (e.g., semantic segmentation), which has been overwhelmingly dominated by CNNs, recently has significantly revolutionized. However, the computational cost and memory requirement renders these methods unsuitable on the mobile device. In this paper, we introduce a new method squeeze-enhanced Axial Transformer (SeaFormer) for mobile visual recognition. Specifically, we design a generic attention block characterized by the formulation of squeeze Axial and detail enhancement. It can be further used to create a family of backbone architectures with superior cost-effectiveness. Coupled with a light segmentation head, we achieve the best trade-off between segmentation accuracy and latency on the ARM-based mobile devices on the ADE20K, Cityscapes, Pascal Context and COCO-Stuff datasets. Critically, we beat both the mobilefriendly rivals and Transformer-based counterparts with better performance and lower latency without bells and whistles. Furthermore, we incorporate a feature upsampling-based multi-resolution distillation technique, further reducing the inference latency of the proposed framework. Beyond semantic segmentation, we further apply the proposed SeaFormer architecture to image classification and object detection problems, demonstrating the potential of serving as a versatile mobile-friendly backbone. Our code and models are made publicly available at https://github.com/fudan-zvg/SeaFormer.
comment: V4 is the ICLR 2023 conference version, and V5 is the extended version
♻ ☆ ArtWhisperer: A Dataset for Characterizing Human-AI Interactions in Artistic Creations ICML 2024
As generative AI becomes more prevalent, it is important to study how human users interact with such models. In this work, we investigate how people use text-to-image models to generate desired target images. To study this interaction, we created ArtWhisperer, an online game where users are given a target image and are tasked with iteratively finding a prompt that creates a similar-looking image as the target. Through this game, we recorded over 50,000 human-AI interactions; each interaction corresponds to one text prompt created by a user and the corresponding generated image. The majority of these are repeated interactions where a user iterates to find the best prompt for their target image, making this a unique sequential dataset for studying human-AI collaborations. In an initial analysis of this dataset, we identify several characteristics of prompt interactions and user strategies. People submit diverse prompts and are able to discover a variety of text descriptions that generate similar images. Interestingly, prompt diversity does not decrease as users find better prompts. We further propose a new metric to quantify the steerability of AI using our dataset. We define steerability as the expected number of interactions required to adequately complete a task. We estimate this value by fitting a Markov chain for each target task and calculating the expected time to reach an adequate score in the Markov chain. We quantify and compare AI steerability across different types of target images and two different models, finding that images of cities and natural world images are more steerable than artistic and fantasy images. These findings provide insights into human-AI interaction behavior, present a concrete method of assessing AI steerability, and demonstrate the general utility of the ArtWhisperer dataset.
comment: 31 pages, 27 figures, ICML 2024
♻ ☆ Eyes Wide Unshut: Unsupervised Mistake Detection in Egocentric Video by Detecting Unpredictable Gaze
In this paper, we address the challenge of unsupervised mistake detection in egocentric video through the analysis of gaze signals, a critical component for advancing user assistance in smart glasses. Traditional supervised methods, reliant on manually labeled mistakes, suffer from domain-dependence and scalability issues. This research introduces an unsupervised method for detecting mistakes in videos of human activities, overcoming the challenges of domain-specific requirements and the necessity for annotated data. By analyzing unusual gaze patterns that signal user disorientation during tasks, we propose a gaze completion model that forecasts eye gaze trajectories from incomplete inputs. The difference between the anticipated and observed gaze paths acts as an indicator for identifying errors. Our method is validated on the EPIC-Tent dataset, showing its superiority compared to current one-class supervised and unsupervised techniques.
♻ ☆ Beyond Embeddings: The Promise of Visual Table in Visual Reasoning
Visual representation learning has been a cornerstone in computer vision, involving typical forms such as visual embeddings, structural symbols, and text-based representations. Despite the success of CLIP-type visual embeddings, they often lack access to world knowledge critical for visual reasoning. In this work, we propose Visual Table, a novel form of visual representation tailored for visual reasoning. Visual tables are constructed as hierarchical descriptions of visual scenes, featuring a scene description and multiple object-centric descriptions covering categories, attributes, and knowledge. Thanks to the structural and textual formats, visual tables offer unique advantages over mere visual embeddings, such as interpretability and controllable editing. Furthermore, they deliver instance-level world knowledge and detailed attributes that are essential for visual reasoning. To create visual tables, we develop a generator trained on the dataset with collected, small-scale annotations. Extensive results on 11 visual reasoning benchmarks demonstrate that the generated visual tables significantly outperform previous structural and text-based representations. Moreover, they consistently enhance state-of-the-art multimodal large language models across diverse benchmarks, showcasing their potential for advancing visual reasoning tasks. Our code is available at https://github.com/LaVi-Lab/Visual-Table.
comment: Project page: https://github.com/LaVi-Lab/Visual-Table
♻ ☆ RoMe: Towards Large Scale Road Surface Reconstruction via Mesh Representation
In autonomous driving applications, accurate and efficient road surface reconstruction is paramount. This paper introduces RoMe, a novel framework designed for the robust reconstruction of large-scale road surfaces. Leveraging a unique mesh representation, RoMe ensures that the reconstructed road surfaces are accurate and seamlessly aligned with semantics. To address challenges in computational efficiency, we propose a waypoint sampling strategy, enabling RoMe to reconstruct vast environments by focusing on sub-areas and subsequently merging them. Furthermore, we incorporate an extrinsic optimization module to enhance the robustness against inaccuracies in extrinsic calibration. Our extensive evaluations of both public datasets and wild data underscore RoMe's superiority in terms of speed, accuracy, and robustness. For instance, it costs only 2 GPU hours to recover a road surface of 600*600 square meters from thousands of images. Notably, RoMe's capability extends beyond mere reconstruction, offering significant value for autolabeling tasks in autonomous driving applications. All related data and code are available at https://github.com/DRosemei/RoMe.
♻ ☆ Exploring the Benefits of Vision Foundation Models for Unsupervised Domain Adaptation CVPR 2024
Achieving robust generalization across diverse data domains remains a significant challenge in computer vision. This challenge is important in safety-critical applications, where deep-neural-network-based systems must perform reliably under various environmental conditions not seen during training. Our study investigates whether the generalization capabilities of Vision Foundation Models (VFMs) and Unsupervised Domain Adaptation (UDA) methods for the semantic segmentation task are complementary. Results show that combining VFMs with UDA has two main benefits: (a) it allows for better UDA performance while maintaining the out-of-distribution performance of VFMs, and (b) it makes certain time-consuming UDA components redundant, thus enabling significant inference speedups. Specifically, with equivalent model sizes, the resulting VFM-UDA method achieves an 8.4$\times$ speed increase over the prior non-VFM state of the art, while also improving performance by +1.2 mIoU in the UDA setting and by +6.1 mIoU in terms of out-of-distribution generalization. Moreover, when we use a VFM with 3.6$\times$ more parameters, the VFM-UDA approach maintains a 3.3$\times$ speed up, while improving the UDA performance by +3.1 mIoU and the out-of-distribution performance by +10.3 mIoU. These results underscore the significant benefits of combining VFMs with UDA, setting new standards and baselines for Unsupervised Domain Adaptation in semantic segmentation.
comment: CVPR 2024 Workshop Proceedings for the Second Workshop on Foundation Models
♻ ☆ A Unified Framework and Dataset for Assessing Societal Bias in Vision-Language Models
Vision-language models (VLMs) have gained widespread adoption in both industry and academia. In this study, we propose a unified framework for systematically evaluating gender, race, and age biases in VLMs with respect to professions. Our evaluation encompasses all supported inference modes of the recent VLMs, including image-to-text, text-to-text, text-to-image, and image-to-image. Additionally, we propose an automated pipeline to generate high-quality synthetic datasets that intentionally conceal gender, race, and age information across different professional domains, both in generated text and images. The dataset includes action-based descriptions of each profession and serves as a benchmark for evaluating societal biases in vision-language models (VLMs). In our comparative analysis of widely used VLMs, we have identified that varying input-output modalities lead to discernible differences in bias magnitudes and directions. Additionally, we find that VLM models exhibit distinct biases across different bias attributes we investigated. We hope our work will help guide future progress in improving VLMs to learn socially unbiased representations. We will release our data and code.
♻ ☆ Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback ACL 2024
Recent advancements in large language models have influenced the development of video large multimodal models (VLMMs). The previous approaches for VLMMs involved Supervised Fine-Tuning (SFT) with instruction-tuned datasets, integrating LLM with visual encoders, and adding additional learnable modules. Video and text multimodal alignment remains challenging, primarily due to the deficient volume and quality of multimodal instruction-tune data compared to text-only data. We present a novel alignment strategy that employs multimodal AI system to oversee itself called Reinforcement Learning from AI Feedback (RLAIF), providing self-preference feedback to refine itself and facilitating the alignment of video and text modalities. In specific, we propose context-aware reward modeling by providing detailed video descriptions as context during the generation of preference feedback in order to enrich the understanding of video content. Demonstrating enhanced performance across diverse video benchmarks, our multimodal RLAIF approach, VLM-RLAIF, outperforms existing approaches, including the SFT model. We commit to open-sourcing our code, models, and datasets to foster further research in this area.
comment: ACL 2024
♻ ☆ SGCNeRF: Few-Shot Neural Rendering via Sparse Geometric Consistency Guidance
Neural Radiance Field (NeRF) technology has made significant strides in creating novel viewpoints. However, its effectiveness is hampered when working with sparsely available views, often leading to performance dips due to overfitting. FreeNeRF attempts to overcome this limitation by integrating implicit geometry regularization, which incrementally improves both geometry and textures. Nonetheless, an initial low positional encoding bandwidth results in the exclusion of high-frequency elements. The quest for a holistic approach that simultaneously addresses overfitting and the preservation of high-frequency details remains ongoing. This study introduces a novel feature matching based sparse geometry regularization module. This module excels in pinpointing high-frequency keypoints, thereby safeguarding the integrity of fine details. Through progressive refinement of geometry and textures across NeRF iterations, we unveil an effective few-shot neural rendering architecture, designated as SGCNeRF, for enhanced novel view synthesis. Our experiments demonstrate that SGCNeRF not only achieves superior geometry-consistent outcomes but also surpasses FreeNeRF, with improvements of 0.7 dB and 0.6 dB in PSNR on the LLFF and DTU datasets, respectively.
♻ ☆ An Open and Large-Scale Dataset for Multi-Modal Climate Change-aware Crop Yield Predictions
Precise crop yield predictions are of national importance for ensuring food security and sustainable agricultural practices. While AI-for-science approaches have exhibited promising achievements in solving many scientific problems such as drug discovery, precipitation nowcasting, etc., the development of deep learning models for predicting crop yields is constantly hindered by the lack of an open and large-scale deep learning-ready dataset with multiple modalities to accommodate sufficient information. To remedy this, we introduce the CropNet dataset, the first terabyte-sized, publicly available, and multi-modal dataset specifically targeting climate change-aware crop yield predictions for the contiguous United States (U.S.) continent at the county level. Our CropNet dataset is composed of three modalities of data, i.e., Sentinel-2 Imagery, WRF-HRRR Computed Dataset, and USDA Crop Dataset, for over 2200 U.S. counties spanning 6 years (2017-2022), expected to facilitate researchers in developing versatile deep learning models for timely and precisely predicting crop yields at the county-level, by accounting for the effects of both short-term growing season weather variations and long-term climate change on crop yields. Besides, we develop the CropNet package, offering three types of APIs, for facilitating researchers in downloading the CropNet data on the fly over the time and region of interest, and flexibly building their deep learning models for accurate crop yield predictions. Extensive experiments have been conducted on our CropNet dataset via employing various types of deep learning solutions, with the results validating the general applicability and the efficacy of the CropNet dataset in climate change-aware crop yield predictions.
comment: 13 pages
♻ ☆ Grounding Continuous Representations in Geometry: Equivariant Neural Fields
Recently, Neural Fields have emerged as a powerful modelling paradigm to represent continuous signals. In a conditional neural field, a field is represented by a latent variable that conditions the NeF, whose parametrisation is otherwise shared over an entire dataset. We propose Equivariant Neural Fields based on cross attention transformers, in which NeFs are conditioned on a geometric conditioning variable, a latent point cloud, that enables an equivariant decoding from latent to field. Our equivariant approach induces a steerability property by which both field and latent are grounded in geometry and amenable to transformation laws if the field transforms, the latent represents transforms accordingly and vice versa. Crucially, the equivariance relation ensures that the latent is capable of (1) representing geometric patterns faitfhully, allowing for geometric reasoning in latent space, (2) weightsharing over spatially similar patterns, allowing for efficient learning of datasets of fields. These main properties are validated using classification experiments and a verification of the capability of fitting entire datasets, in comparison to other non-equivariant NeF approaches. We further validate the potential of ENFs by demonstrate unique local field editing properties.
comment: Preprint for Neurips submission
♻ ☆ WildlifeReID-10k: Wildlife re-identification dataset with 10k individual animals
We introduce a new wildlife re-identification dataset WildlifeReID-10k with more than 214k images of 10k individual animals. It is a collection of 30 existing wildlife re-identification datasets with additional processing steps. WildlifeReID-10k contains animals as diverse as marine turtles, primates, birds, African herbivores, marine mammals and domestic animals. Due to the ubiquity of similar images in datasets, we argue that the standard (random) splits into training and testing sets are inadequate for wildlife re-identification and propose a new similarity-aware split based on the similarity of extracted features. To promote fair method comparison, we include similarity-aware splits both for closed-set and open-set settings, use MegaDescriptor - a foundational model for wildlife re-identification - for baseline performance and host a leaderboard with the best results. We publicly publish the dataset and the codes used to create it in the wildlife-datasets library, making WildlifeReID-10k both highly curated and easy to use.
♻ ☆ A Survey of Label-Efficient Deep Learning for 3D Point Clouds
In the past decade, deep neural networks have achieved significant progress in point cloud learning. However, collecting large-scale precisely-annotated training data is extremely laborious and expensive, which hinders the scalability of existing point cloud datasets and poses a bottleneck for efficient exploration of point cloud data in various tasks and applications. Label-efficient learning offers a promising solution by enabling effective deep network training with much-reduced annotation efforts. This paper presents the first comprehensive survey of label-efficient learning of point clouds. We address three critical questions in this emerging research field: i) the importance and urgency of label-efficient learning in point cloud processing, ii) the subfields it encompasses, and iii) the progress achieved in this area. To achieve this, we propose a taxonomy that organizes label-efficient learning methods based on the data prerequisites provided by different types of labels. We categorize four typical label-efficient learning approaches that significantly reduce point cloud annotation efforts: data augmentation, domain transfer learning, weakly-supervised learning, and pretrained foundation models. For each approach, we outline the problem setup and provide an extensive literature review that showcases relevant progress and challenges. Finally, we share insights into current research challenges and potential future directions. A project associated with this survey has been built at https://github.com/xiaoaoran/3D_label_efficient_learning.
comment: Accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)
♻ ☆ PiClick: Picking the desired mask from multiple candidates in click-based interactive segmentation
Click-based interactive segmentation aims to generate target masks via human clicking, which facilitates efficient pixel-level annotation and image editing. In such a task, target ambiguity remains a problem hindering the accuracy and efficiency of segmentation. That is, in scenes with rich context, one click may correspond to multiple potential targets, while most previous interactive segmentors only generate a single mask and fail to deal with target ambiguity. In this paper, we propose a novel interactive segmentation network named PiClick, to yield all potentially reasonable masks and suggest the most plausible one for the user. Specifically, PiClick utilizes a Transformer-based architecture to generate all potential target masks by mutually interactive mask queries. Moreover, a Target Reasoning module(TRM) is designed in PiClick to automatically suggest the user-desired mask from all candidates, relieving target ambiguity and extra-human efforts. Extensive experiments on 9 interactive segmentation datasets demonstrate PiClick performs favorably against previous state-of-the-arts considering the segmentation results. Moreover, we show that PiClick effectively reduces human efforts in annotating and picking the desired masks. To ease the usage and inspire future research, we release the source code of PiClick together with a plug-and-play annotation tool at https://github.com/cilinyan/PiClick.
♻ ☆ Enhancing Intrinsic Features for Debiasing via Investigating Class-Discerning Common Attributes in Bias-Contrastive Pair CVPR 2024
In the image classification task, deep neural networks frequently rely on bias attributes that are spuriously correlated with a target class in the presence of dataset bias, resulting in degraded performance when applied to data without bias attributes. The task of debiasing aims to compel classifiers to learn intrinsic attributes that inherently define a target class rather than focusing on bias attributes. While recent approaches mainly focus on emphasizing the learning of data samples without bias attributes (i.e., bias-conflicting samples) compared to samples with bias attributes (i.e., bias-aligned samples), they fall short of directly guiding models where to focus for learning intrinsic features. To address this limitation, this paper proposes a method that provides the model with explicit spatial guidance that indicates the region of intrinsic features. We first identify the intrinsic features by investigating the class-discerning common features between a bias-aligned (BA) sample and a bias-conflicting (BC) sample (i.e., bias-contrastive pair). Next, we enhance the intrinsic features in the BA sample that are relatively under-exploited for prediction compared to the BC sample. To construct the bias-contrastive pair without using bias information, we introduce a bias-negative score that distinguishes BC samples from BA samples employing a biased model. The experiments demonstrate that our method achieves state-of-the-art performance on synthetic and real-world datasets with various levels of bias severity.
comment: Accepted to CVPR 2024
♻ ☆ SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization ICML 2024
Transformers have become foundational architectures for both natural language and computer vision tasks. However, the high computational cost makes it quite challenging to deploy on resource-constraint devices. This paper investigates the computational bottleneck modules of efficient transformer, i.e., normalization layers and attention modules. LayerNorm is commonly used in transformer architectures but is not computational friendly due to statistic calculation during inference. However, replacing LayerNorm with more efficient BatchNorm in transformer often leads to inferior performance and collapse in training. To address this problem, we propose a novel method named PRepBN to progressively replace LayerNorm with re-parameterized BatchNorm in training. Moreover, we propose a simplified linear attention (SLA) module that is simple yet effective to achieve strong performance. Extensive experiments on image classification as well as object detection demonstrate the effectiveness of our proposed method. For example, our SLAB-Swin obtains $83.6\%$ top-1 accuracy on ImageNet-1K with $16.2$ms latency, which is $2.4$ms less than that of Flatten-Swin with $0.1\%$ higher accuracy. We also evaluated our method for language modeling task and obtain comparable performance and lower latency.Codes are publicly available at https://github.com/xinghaochen/SLAB and https://github.com/mindspore-lab/models/tree/master/research/huawei-noah/SLAB.
comment: ICML 2024
♻ ☆ HEDI: First-Time Clinical Application and Results of a Biomechanical Evaluation and Visualisation Tool for Incisional Hernia Repair
Abdominal wall defects often lead to pain, discomfort, and recurrence of incisional hernias, resulting in significant morbidity and repeated surgical repairs worldwide. Mesh repair for large hernias is usually based on the defect area with a fixed overlap, neglecting biomechanical factors such as muscle activation, intra-abdominal pressure, tissue elasticity, and abdominal wall distension. To address this issue, we present a biomechanical approach to incisional hernia repair that takes into account the unstable abdominal wall. Additionally, we introduce HEDI, a tool that uses computed tomography with Valsalva maneuver to automatically detect and assess hernia size, volume, and abdominal wall instability. Our first clinical application of HEDI in the preoperative evaluation of 31 patients shows significantly improved success rates compared to reported rates, with all patients remaining pain-free and experiencing no hernia recurrence after three years of follow-up.
♻ ☆ Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation CVPR 2024
We propose Hydra-MDP, a novel paradigm employing multiple teachers in a teacher-student model. This approach uses knowledge distillation from both human and rule-based teachers to train the student model, which features a multi-head decoder to learn diverse trajectory candidates tailored to various evaluation metrics. With the knowledge of rule-based teachers, Hydra-MDP learns how the environment influences the planning in an end-to-end manner instead of resorting to non-differentiable post-processing. This method achieves the $1^{st}$ place in the Navsim challenge, demonstrating significant improvements in generalization across diverse driving environments and conditions. Code will be available at https://github.com/woxihuanjiangguo/Hydra-MDP
comment: The 1st place solution of End-to-end Driving at Scale at the CVPR 2024 Autonomous Grand Challenge
♻ ☆ 3D Gaussian Splatting as Markov Chain Monte Carlo
While 3D Gaussian Splatting has recently become popular for neural rendering, current methods rely on carefully engineered cloning and splitting strategies for placing Gaussians, which can lead to poor-quality renderings, and reliance on a good initialization. In this work, we rethink the set of 3D Gaussians as a random sample drawn from an underlying probability distribution describing the physical representation of the scene-in other words, Markov Chain Monte Carlo (MCMC) samples. Under this view, we show that the 3D Gaussian updates can be converted as Stochastic Gradient Langevin Dynamics (SGLD) updates by simply introducing noise. We then rewrite the densification and pruning strategies in 3D Gaussian Splatting as simply a deterministic state transition of MCMC samples, removing these heuristics from the framework. To do so, we revise the 'cloning' of Gaussians into a relocalization scheme that approximately preserves sample probability. To encourage efficient use of Gaussians, we introduce a regularizer that promotes the removal of unused Gaussians. On various standard evaluation scenes, we show that our method provides improved rendering quality, easy control over the number of Gaussians, and robustness to initialization.
♻ ☆ Domain Camera Adaptation and Collaborative Multiple Feature Clustering for Unsupervised Person Re-ID
Recently unsupervised person re-identification (re-ID) has drawn much attention due to its open-world scenario settings where limited annotated data is available. Existing supervised methods often fail to generalize well on unseen domains, while the unsupervised methods, mostly lack multi-granularity information and are prone to suffer from confirmation bias. In this paper, we aim at finding better feature representations on the unseen target domain from two aspects, 1) performing unsupervised domain adaptation on the labeled source domain and 2) mining potential similarities on the unlabeled target domain. Besides, a collaborative pseudo re-labeling strategy is proposed to alleviate the influence of confirmation bias. Firstly, a generative adversarial network is utilized to transfer images from the source domain to the target domain. Moreover, person identity preserving and identity mapping losses are introduced to improve the quality of generated images. Secondly, we propose a novel collaborative multiple feature clustering framework (CMFC) to learn the internal data structure of target domain, including global feature and partial feature branches. The global feature branch (GB) employs unsupervised clustering on the global feature of person images while the Partial feature branch (PB) mines similarities within different body regions. Finally, extensive experiments on two benchmark datasets show the competitive performance of our method under unsupervised person re-ID settings.
comment: ACMMM 2022
♻ ☆ Progressive Dual Priori Network for Generalized Breast Tumor Segmentation
To promote the generalization ability of breast tumor segmentation models, as well as to improve the segmentation performance for breast tumors with smaller size, low-contrast and irregular shape, we propose a progressive dual priori network (PDPNet) to segment breast tumors from dynamic enhanced magnetic resonance images (DCE-MRI) acquired at different centers. The PDPNet first cropped tumor regions with a coarse-segmentation based localization module, then the breast tumor mask was progressively refined by using the weak semantic priori and cross-scale correlation prior knowledge. To validate the effectiveness of PDPNet, we compared it with several state-of-the-art methods on multi-center datasets. The results showed that, comparing against the suboptimal method, the DSC and HD95 of PDPNet were improved at least by 5.13% and 7.58% respectively on multi-center test sets. In addition, through ablations, we demonstrated that the proposed localization module can decrease the influence of normal tissues and therefore improve the generalization ability of the model. The weak semantic priors allow focusing on tumor regions to avoid missing small tumors and low-contrast tumors. The cross-scale correlation priors are beneficial for promoting the shape-aware ability for irregular tumors. Thus integrating them in a unified framework improved the multi-center breast tumor segmentation performance. The source code and open data can be accessed at https://github.com/wangli100209/PDPNet.
comment: 14 pages, 12 figures
♻ ☆ Make Your Actor Talk: Generalizable and High-Fidelity Lip Sync with Motion and Appearance Disentanglement
We aim to edit the lip movements in talking video according to the given speech while preserving the personal identity and visual details. The task can be decomposed into two sub-problems: (1) speech-driven lip motion generation and (2) visual appearance synthesis. Current solutions handle the two sub-problems within a single generative model, resulting in a challenging trade-off between lip-sync quality and visual details preservation. Instead, we propose to disentangle the motion and appearance, and then generate them one by one with a speech-to-motion diffusion model and a motion-conditioned appearance generation model. However, there still remain challenges in each stage, such as motion-aware identity preservation in (1) and visual details preservation in (2). Therefore, to preserve personal identity, we adopt landmarks to represent the motion, and further employ a landmark-based identity loss. To capture motion-agnostic visual details, we use separate encoders to encode the lip, non-lip appearance and motion, and then integrate them with a learned fusion module. We train MyTalk on a large-scale and diverse dataset. Experiments show that our method generalizes well to the unknown, even out-of-domain person, in terms of both lip sync and visual detail preservation. We encourage the readers to watch the videos on our project page (https://Ingrid789.github.io/MyTalk/).
comment: 14 pages of main text, 23 pages in total, 9 figures
♻ ☆ YOLC: You Only Look Clusters for Tiny Object Detection in Aerial Images
Detecting objects from aerial images poses significant challenges due to the following factors: 1) Aerial images typically have very large sizes, generally with millions or even hundreds of millions of pixels, while computational resources are limited. 2) Small object size leads to insufficient information for effective detection. 3) Non-uniform object distribution leads to computational resource wastage. To address these issues, we propose YOLC (You Only Look Clusters), an efficient and effective framework that builds on an anchor-free object detector, CenterNet. To overcome the challenges posed by large-scale images and non-uniform object distribution, we introduce a Local Scale Module (LSM) that adaptively searches cluster regions for zooming in for accurate detection. Additionally, we modify the regression loss using Gaussian Wasserstein distance (GWD) to obtain high-quality bounding boxes. Deformable convolution and refinement methods are employed in the detection head to enhance the detection of small objects. We perform extensive experiments on two aerial image datasets, including Visdrone2019 and UAVDT, to demonstrate the effectiveness and superiority of our proposed approach. Code is available at https://github.com/dawn-ech/YOLC.
comment: accepted to TITS
♻ ☆ Refining Coded Image in Human Vision Layer Using CNN-Based Post-Processing
Scalable image coding for both humans and machines is a technique that has gained a lot of attention recently. This technology enables the hierarchical decoding of images for human vision and image recognition models. It is a highly effective method when images need to serve both purposes. However, no research has yet incorporated the post-processing commonly used in popular image compression schemes into scalable image coding method for humans and machines. In this paper, we propose a method to enhance the quality of decoded images for humans by integrating post-processing into scalable coding scheme. Experimental results show that the post-processing improves compression performance. Furthermore, the effectiveness of the proposed method is validated through comparisons with traditional methods.
♻ ☆ Efficient Meta-Learning Enabled Lightweight Multiscale Few-Shot Object Detection in Remote Sensing Images
Presently, the task of few-shot object detection (FSOD) in remote sensing images (RSIs) has become a focal point of attention. Numerous few-shot detectors, particularly those based on two-stage detectors, face challenges when dealing with the multiscale complexities inherent in RSIs. Moreover, these detectors present impractical characteristics in real-world applications, mainly due to their unwieldy model parameters when handling large amount of data. In contrast, we recognize the advantages of one-stage detectors, including high detection speed and a global receptive field. Consequently, we choose the YOLOv7 one-stage detector as a baseline and subject it to a novel meta-learning training framework. This transformation allows the detector to adeptly address FSOD tasks while capitalizing on its inherent advantage of lightweight. Additionally, we thoroughly investigate the samples generated by the meta-learning strategy and introduce a novel meta-sampling approach to retain samples produced by our designed meta-detection head. Coupled with our devised meta-cross loss, we deliberately utilize "negative samples" that are often overlooked to extract valuable knowledge from them. This approach serves to enhance detection accuracy and efficiently refine the overall meta-learning strategy. To validate the effectiveness of our proposed detector, we conducted performance comparisons with current state-of-the-art detectors using the DIOR and NWPU VHR-10.v2 datasets, yielding satisfactory results.
♻ ☆ Scalable Image Coding for Humans and Machines Using Feature Fusion Network
As image recognition models become more prevalent, scalable coding methods for machines and humans gain more importance. Applications of image recognition models include traffic monitoring and farm management. In these use cases, the scalable coding method proves effective because the tasks require occasional image checking by humans. Existing image compression methods for humans and machines meet these requirements to some extent. However, these compression methods are effective solely for specific image recognition models. We propose a learning-based scalable image coding method for humans and machines that is compatible with numerous image recognition models. We combine an image compression model for machines with a compression model, providing additional information to facilitate image decoding for humans. The features in these compression models are fused using a feature fusion network to achieve efficient image compression. Our method's additional information compression model is adjusted to reduce the number of parameters by enabling combinations of features of different sizes in the feature fusion network. Our approach confirms that the feature fusion network efficiently combines image compression models while reducing the number of parameters. Furthermore, we demonstrate the effectiveness of the proposed scalable coding method by evaluating the image compression performance in terms of decoded image quality and bitrate.
♻ ☆ Generative Escher Meshes
This paper proposes a fully-automatic, text-guided generative method for producing perfectly-repeating, periodic, tile-able 2D imagery, such as the one seen on floors, mosaics, ceramics, and the work of M.C. Escher. In contrast to square texture images that are seamless when tiled, our method generates non-square tilings which comprise solely of repeating copies of the same object. It achieves this by optimizing both geometry and texture of a 2D mesh, yielding a non-square tile in the shape and appearance of the desired object, with close to no additional background details, that can tile the plane without gaps nor overlaps. We enable optimization of the tile's shape by an unconstrained, differentiable parameterization of the space of all valid tileable meshes for given boundary conditions stemming from a symmetry group. Namely, we construct a differentiable family of linear systems derived from a 2D mesh-mapping technique - Orbifold Tutte Embedding - by considering the mesh's Laplacian matrix as differentiable parameters. We prove that the solution space of these linear systems is exactly all possible valid tiling configurations, thereby providing an end-to-end differentiable representation for the entire space of valid tiles. We render the textured mesh via a differentiable renderer, and leverage a pre-trained image diffusion model to induce a loss on the resulting image, updating the mesh's parameters so as to make its appearance match the text prompt. We show our method is able to produce plausible, appealing results, with non-trivial tiles, for a variety of different periodic tiling patterns.
♻ ☆ VLind-Bench: Measuring Language Priors in Large Vision-Language Models
Large Vision-Language Models (LVLMs) have demonstrated outstanding performance across various multimodal tasks. However, they suffer from a problem known as language prior, where responses are generated based solely on textual patterns while disregarding image information. Addressing the issue of language prior is crucial, as it can lead to undesirable biases or hallucinations when dealing with images that are out of training distribution. Despite its importance, current methods for accurately measuring language priors in LVLMs are poorly studied. Although existing benchmarks based on counterfactual or out-of-distribution images can partially be used to measure language priors, they fail to disentangle language priors from other confounding factors. To this end, we propose a new benchmark called VLind-Bench, which is the first benchmark specifically designed to measure the language priors, or blindness, of LVLMs. It not only includes tests on counterfactual images to assess language priors but also involves a series of tests to evaluate more basic capabilities such as commonsense knowledge, visual perception, and commonsense biases. For each instance in our benchmark, we ensure that all these basic tests are passed before evaluating the language priors, thereby minimizing the influence of other factors on the assessment. The evaluation and analysis of recent LVLMs in our benchmark reveal that almost all models exhibit a significant reliance on language priors, presenting a strong challenge in the field.
♻ ☆ Multimodal Transformer Using Cross-Channel attention for Object Detection in Remote Sensing Images ICIP2024
Object detection in Remote Sensing Images (RSI) is a critical task for numerous applications in Earth Observation (EO). Differing from object detection in natural images, object detection in remote sensing images faces challenges of scarcity of annotated data and the presence of small objects represented by only a few pixels. Multi-modal fusion has been determined to enhance the accuracy by fusing data from multiple modalities such as RGB, infrared (IR), lidar, and synthetic aperture radar (SAR). To this end, the fusion of representations at the mid or late stage, produced by parallel subnetworks, is dominant, with the disadvantages of increasing computational complexity in the order of the number of modalities and the creation of additional engineering obstacles. Using the cross-attention mechanism, we propose a novel multi-modal fusion strategy for mapping relationships between different channels at the early stage, enabling the construction of a coherent input by aligning the different modalities. By addressing fusion in the early stage, as opposed to mid or late-stage methods, our method achieves competitive and even superior performance compared to existing techniques. Additionally, we enhance the SWIN transformer by integrating convolution layers into the feed-forward of non-shifting blocks. This augmentation strengthens the model's capacity to merge separated windows through local attention, thereby improving small object detection. Extensive experiments prove the effectiveness of the proposed multimodal fusion module and the architecture, demonstrating their applicability to object detection in multimodal aerial imagery.
comment: Accepted by ICIP2024
♻ ☆ An Autonomous Vision-Based Algorithm for Interplanetary Navigation
The surge of deep-space probes makes it unsustainable to navigate them with standard radiometric tracking. Self-driving interplanetary satellites represent a solution to this problem. In this work, a full vision-based navigation algorithm is built by combining an orbit determination method with an image processing pipeline suitable for interplanetary transfers of autonomous platforms. To increase the computational efficiency of the algorithm, a non-dimensional extended Kalman filter is selected as state estimator, fed by the positions of the planets extracted from deep-space images. An enhancement of the estimation accuracy is performed by applying an optimal strategy to select the best pair of planets to track. Moreover, a novel analytical measurement model for deep-space navigation is developed providing a first-order approximation of the light-aberration and light-time effects. Algorithm performance is tested on a high-fidelity, Earth--Mars interplanetary transfer, showing the algorithm applicability for deep-space navigation.
♻ ☆ Robust 6DoF Pose Estimation Against Depth Noise and a Comprehensive Evaluation on a Mobile Dataset
Robust 6DoF pose estimation with mobile devices is the foundation for applications in robotics, augmented reality, and digital twin localization. In this paper, we extensively investigate the robustness of existing RGBD-based 6DoF pose estimation methods against varying levels of depth sensor noise. We highlight that existing 6DoF pose estimation methods suffer significant performance discrepancies due to depth measurement inaccuracies. In response to the robustness issue, we present a simple and effective transformer-based 6DoF pose estimation approach called DTTDNet, featuring a novel geometric feature filtering module and a Chamfer distance loss for training. Moreover, we advance the field of robust 6DoF pose estimation and introduce a new dataset -- Digital Twin Tracking Dataset Mobile (DTTD-Mobile), tailored for digital twin object tracking with noisy depth data from the mobile RGBD sensor suite of the Apple iPhone 14 Pro. Extensive experiments demonstrate that DTTDNet significantly outperforms state-of-the-art methods at least 4.32, up to 60.74 points in ADD metrics on the DTTD-Mobile. More importantly, our approach exhibits superior robustness to varying levels of measurement noise, setting a new benchmark for the robustness to noise measurements. Code and dataset are made publicly available at: https://github.com/augcog/DTTD2
♻ ☆ Gemini: A Family of Highly Capable Multimodal Models
This report introduces a new family of multimodal models, Gemini, that exhibit remarkable capabilities across image, audio, video, and text understanding. The Gemini family consists of Ultra, Pro, and Nano sizes, suitable for applications ranging from complex reasoning tasks to on-device memory-constrained use-cases. Evaluation on a broad range of benchmarks shows that our most-capable Gemini Ultra model advances the state of the art in 30 of 32 of these benchmarks - notably being the first model to achieve human-expert performance on the well-studied exam benchmark MMLU, and improving the state of the art in every one of the 20 multimodal benchmarks we examined. We believe that the new capabilities of the Gemini family in cross-modal reasoning and language understanding will enable a wide variety of use cases. We discuss our approach toward post-training and deploying Gemini models responsibly to users through services including Gemini, Gemini Advanced, Google AI Studio, and Cloud Vertex AI.
Computation and Language 150
☆ mDPO: Conditional Preference Optimization for Multimodal Large Language Models
Direct preference optimization (DPO) has shown to be an effective method for large language model (LLM) alignment. Recent works have attempted to apply DPO to multimodal scenarios but have found it challenging to achieve consistent improvement. Through a comparative experiment, we identify the unconditional preference problem in multimodal preference optimization, where the model overlooks the image condition. To address this problem, we propose mDPO, a multimodal DPO objective that prevents the over-prioritization of language-only preferences by also optimizing image preference. Moreover, we introduce a reward anchor that forces the reward to be positive for chosen responses, thereby avoiding the decrease in their likelihood -- an intrinsic problem of relative preference optimization. Experiments on two multimodal LLMs of different sizes and three widely used benchmarks demonstrate that mDPO effectively addresses the unconditional preference problem in multimodal preference optimization and significantly improves model performance, particularly in reducing hallucination.
☆ Language Modeling with Editable External Knowledge
When the world changes, so does the text that humans write about it. How do we build language models that can be easily updated to reflect these changes? One popular approach is retrieval-augmented generation, in which new documents are inserted into a knowledge base and retrieved during prediction for downstream tasks. Most prior work on these systems have focused on improving behavior during prediction through better retrieval or reasoning. This paper introduces ERASE, which instead improves model behavior when new documents are acquired, by incrementally deleting or rewriting other entries in the knowledge base each time a document is added. In two new benchmark datasets evaluating models' ability to answer questions about a stream of news articles or conversations, ERASE improves accuracy relative to conventional retrieval-augmented generation by 7-13% (Mixtral-8x7B) and 6-10% (Llama-3-8B) absolute. Code and data are available at https://github.com/belindal/ERASE
☆ WPO: Enhancing RLHF with Weighted Preference Optimization
Reinforcement learning from human feedback (RLHF) is a promising solution to align large language models (LLMs) more closely with human values. Off-policy preference optimization, where the preference data is obtained from other models, is widely adopted due to its cost efficiency and scalability. However, off-policy preference optimization often suffers from a distributional gap between the policy used for data collection and the target policy, leading to suboptimal optimization. In this paper, we propose a novel strategy to mitigate this problem by simulating on-policy learning with off-policy preference data. Our Weighted Preference Optimization (WPO) method adapts off-policy data to resemble on-policy data more closely by reweighting preference pairs according to their probability under the current policy. This method not only addresses the distributional gap problem but also enhances the optimization process without incurring additional costs. We validate our method on instruction following benchmarks including Alpaca Eval 2 and MT-bench. WPO not only outperforms Direct Preference Optimization (DPO) by up to 5.6% on Alpaca Eval 2 but also establishes a remarkable length-controlled winning rate against GPT-4-turbo of 48.6% based on Llama-3-8B-Instruct, making it the strongest 8B model on the leaderboard. We will release the code and models at https://github.com/wzhouad/WPO.
☆ On Efficient Language and Vision Assistants for Visually-Situated Natural Language Understanding: What Matters in Reading and Reasoning
Recent advancements in language and vision assistants have showcased impressive capabilities but suffer from a lack of transparency, limiting broader research and reproducibility. While open-source models handle general image tasks effectively, they face challenges with the high computational demands of complex visually-situated text understanding. Such tasks often require increased token inputs and large vision modules to harness high-resolution information. Striking a balance between model size and data importance remains an open question. This study aims to redefine the design of vision-language models by identifying key components and creating efficient models with constrained inference costs. By strategically formulating datasets, optimizing vision modules, and enhancing supervision techniques, we achieve significant improvements in inference throughput while maintaining high performance. Extensive experiments across models ranging from 160M to 13B parameters offer insights into model optimization. We will fully open-source our codebase, models, and datasets at https://github.com/naver-ai/elva .
comment: 17 pages, 8 figures
☆ Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level
Direct Preference Optimization (DPO), a standard method for aligning language models with human preferences, is traditionally applied to offline preferences. Recent studies show that DPO benefits from iterative training with online preferences labeled by a trained reward model. In this work, we identify a pitfall of vanilla iterative DPO - improved response quality can lead to increased verbosity. To address this, we introduce iterative length-regularized DPO (iLR-DPO) to penalize response length. Our empirical results show that iLR-DPO can enhance a 7B model to perform on par with GPT-4 without increasing verbosity. Specifically, our 7B model achieves a $50.5\%$ length-controlled win rate against $\texttt{GPT-4 Preview}$ on AlpacaEval 2.0, and excels across standard benchmarks including MT-Bench, Arena-Hard and OpenLLM Leaderboard. These results demonstrate the effectiveness of iterative DPO in aligning language models with human feedback.
☆ How Do Large Language Models Acquire Factual Knowledge During Pretraining?
Despite the recent observation that large language models (LLMs) can store substantial factual knowledge, there is a limited understanding of the mechanisms of how they acquire factual knowledge through pretraining. This work addresses this gap by studying how LLMs acquire factual knowledge during pretraining. The findings reveal several important insights into the dynamics of factual knowledge acquisition during pretraining. First, counterintuitively, we observe that pretraining on more data shows no significant improvement in the model's capability to acquire and maintain factual knowledge. Next, there is a power-law relationship between training steps and forgetting of memorization and generalization of factual knowledge, and LLMs trained with duplicated training data exhibit faster forgetting. Third, training LLMs with larger batch sizes can enhance the models' robustness to forgetting. Overall, our observations suggest that factual knowledge acquisition in LLM pretraining occurs by progressively increasing the probability of factual knowledge presented in the pretraining data at each step. However, this increase is diluted by subsequent forgetting. Based on this interpretation, we demonstrate that we can provide plausible explanations for recently observed behaviors of LLMs, such as the poor performance of LLMs on long-tail knowledge and the benefits of deduplicating the pretraining corpus.
☆ RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content
Large Language Models (LLMs) are trained on vast amounts of data, most of which is automatically scraped from the internet. This data includes encyclopedic documents that harbor a vast amount of general knowledge (e.g., Wikipedia) but also potentially overlap with benchmark datasets used for evaluating LLMs. Consequently, evaluating models on test splits that might have leaked into the training set is prone to misleading conclusions. To foster sound evaluation of language models, we introduce a new test dataset named RepLiQA, suited for question-answering and topic retrieval tasks. RepLiQA is a collection of five splits of test sets, four of which have not been released to the internet or exposed to LLM APIs prior to this publication. Each sample in RepLiQA comprises (1) a reference document crafted by a human annotator and depicting an imaginary scenario (e.g., a news article) absent from the internet; (2) a question about the document's topic; (3) a ground-truth answer derived directly from the information in the document; and (4) the paragraph extracted from the reference document containing the answer. As such, accurate answers can only be generated if a model can find relevant content within the provided document. We run a large-scale benchmark comprising several state-of-the-art LLMs to uncover differences in performance across models of various types and sizes in a context-conditional language modeling setting. Released splits of RepLiQA can be found here: https://huggingface.co/datasets/ServiceNow/repliqa.
☆ Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations
Ensuring the safe alignment of large language models (LLMs) with human values is critical as they become integral to applications like translation and question answering. Current alignment methods struggle with dynamic user intentions and complex objectives, making models vulnerable to generating harmful content. We propose Safety Arithmetic, a training-free framework enhancing LLM safety across different scenarios: Base models, Supervised fine-tuned models (SFT), and Edited models. Safety Arithmetic involves Harm Direction Removal to avoid harmful content and Safety Alignment to promote safe responses. Additionally, we present NoIntentEdit, a dataset highlighting edit instances that could compromise model safety if used unintentionally. Our experiments show that Safety Arithmetic significantly improves safety measures, reduces over-safety, and maintains model utility, outperforming existing methods in ensuring safe content generation.
comment: Under Review. Codes are available at: https://github.com/declare-lab/safety-arithmetic
☆ DataComp-LM: In search of the next generation of training sets for language models
We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set. The resulting dataset, DCLM-Baseline enables training a 7B parameter language model from scratch to 64% 5-shot accuracy on MMLU with 2.6T training tokens. Compared to MAP-Neo, the previous state-of-the-art in open-data language models, DCLM-Baseline represents a 6.6 percentage point improvement on MMLU while being trained with 40% less compute. Our baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% & 66%), and performs similarly on an average of 53 natural language understanding tasks while being trained with 6.6x less compute than Llama 3 8B. Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation.
comment: Project page: https://www.datacomp.ai/dclm/
☆ CELL your Model: Contrastive Explanation Methods for Large Language Models
The advent of black-box deep neural network classification models has sparked the need to explain their decisions. However, in the case of generative AI such as large language models (LLMs), there is no class prediction to explain. Rather, one can ask why an LLM output a particular response to a given prompt. In this paper, we answer this question by proposing, to the best of our knowledge, the first contrastive explanation methods requiring simply black-box/query access. Our explanations suggest that an LLM outputs a reply to a given prompt because if the prompt was slightly modified, the LLM would have given a different response that is either less preferable or contradicts the original response. The key insight is that contrastive explanations simply require a distance function that has meaning to the user and not necessarily a real valued representation of a specific response (viz. class label). We offer two algorithms for finding contrastive explanations: i) A myopic algorithm, which although effective in creating contrasts, requires many model calls and ii) A budgeted algorithm, our main algorithmic contribution, which intelligently creates contrasts adhering to a query budget, necessary for longer contexts. We show the efficacy of these methods on diverse natural language tasks such as open-text generation, automated red teaming, and explaining conversational degradation.
☆ MDCR: A Dataset for Multi-Document Conditional Reasoning
The same real-life questions posed to different individuals may lead to different answers based on their unique situations. For instance, whether a student is eligible for a scholarship depends on eligibility conditions, such as major or degree required. ConditionalQA was proposed to evaluate models' capability of reading a document and answering eligibility questions, considering unmentioned conditions. However, it is limited to questions on single documents, neglecting harder cases that may require cross-document reasoning and optimization, for example, "What is the maximum number of scholarships attainable?" Such questions over multiple documents are not only more challenging due to more context having to understand, but also because the model has to (1) explore all possible combinations of unmentioned conditions and (2) understand the relationship between conditions across documents, to reason about the optimal outcome. To evaluate models' capability of answering such questions, we propose a new dataset MDCR, which can reflect real-world challenges and serve as a new test bed for complex conditional reasoning that requires optimization. We evaluate this dataset using the most recent LLMs and demonstrate their limitations in solving this task. We believe this dataset will facilitate future research in answering optimization questions with unknown conditions.
☆ Split, Unlearn, Merge: Leveraging Data Attributes for More Effective Unlearning in LLMs
Large language models (LLMs) have shown to pose social and ethical risks such as generating toxic language or facilitating malicious use of hazardous knowledge. Machine unlearning is a promising approach to improve LLM safety by directly removing harmful behaviors and knowledge. In this paper, we propose "SPlit, UNlearn, MerGE" (SPUNGE), a framework that can be used with any unlearning method to amplify its effectiveness. SPUNGE leverages data attributes during unlearning by splitting unlearning data into subsets based on specific attribute values, unlearning each subset separately, and merging the unlearned models. We empirically demonstrate that SPUNGE significantly improves the performance of two recent unlearning methods on state-of-the-art LLMs while maintaining their general capabilities on standard academic benchmarks.
☆ Improving Multi-Agent Debate with Sparse Communication Topology
Multi-agent debate has proven effective in improving large language models quality for reasoning and factuality tasks. While various role-playing strategies in multi-agent debates have been explored, in terms of the communication among agents, existing approaches adopt a brute force algorithm -- each agent can communicate with all other agents. In this paper, we systematically investigate the effect of communication connectivity in multi-agent systems. Our experiments on GPT and Mistral models reveal that multi-agent debates leveraging sparse communication topology can achieve comparable or superior performance while significantly reducing computational costs. Furthermore, we extend the multi-agent debate framework to multimodal reasoning and alignment labeling tasks, showcasing its broad applicability and effectiveness. Our findings underscore the importance of communication connectivity on enhancing the efficiency and effectiveness of the "society of minds" approach.
comment: 13 pages, 9 figures
☆ GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
Perceiving and understanding non-speech sounds and non-verbal speech is essential to making decisions that help us interact with our surroundings. In this paper, we propose GAMA, a novel General-purpose Large Audio-Language Model (LALM) with Advanced Audio Understanding and Complex Reasoning Abilities. We build GAMA by integrating an LLM with multiple types of audio representations, including features from a custom Audio Q-Former, a multi-layer aggregator that aggregates features from multiple layers of an audio encoder. We fine-tune GAMA on a large-scale audio-language dataset, which augments it with audio understanding capabilities. Next, we propose CompA-R (Instruction-Tuning for Complex Audio Reasoning), a synthetically generated instruction-tuning (IT) dataset with instructions that require the model to perform complex reasoning on the input audio. We instruction-tune GAMA with CompA-R to endow it with complex reasoning abilities, where we further add a soft prompt as input with high-level semantic evidence by leveraging event tags of the input audio. Finally, we also propose CompA-R-test, a human-labeled evaluation dataset for evaluating the capabilities of LALMs on open-ended audio question-answering that requires complex reasoning. Through automated and expert human evaluations, we show that GAMA outperforms all other LALMs in literature on diverse audio understanding tasks by margins of 1%-84%. Further, GAMA IT-ed on CompA-R proves to be superior in its complex reasoning and instruction following capabilities.
comment: Project Website: https://sreyan88.github.io/gamaaudio/
☆ STAR: SocioTechnical Approach to Red Teaming Language Models
This research introduces STAR, a sociotechnical framework that improves on current best practices for red teaming safety of large language models. STAR makes two key contributions: it enhances steerability by generating parameterised instructions for human red teamers, leading to improved coverage of the risk surface. Parameterised instructions also provide more detailed insights into model failures at no increased cost. Second, STAR improves signal quality by matching demographics to assess harms for specific groups, resulting in more sensitive annotations. STAR further employs a novel step of arbitration to leverage diverse viewpoints and improve label reliability, treating disagreement not as noise but as a valuable contribution to signal quality.
comment: 8 pages, 5 figures, 5 pages appendix. * denotes equal contribution
☆ A Semantic-based Layer Freezing Approach to Efficient Fine-Tuning of Language Models
Finetuning language models (LMs) is crucial for adapting the models to downstream data and tasks. However, full finetuning is usually costly. Existing work, such as parameter-efficient finetuning (PEFT), often focuses on \textit{how to finetune} but neglects the issue of \textit{where to finetune}. As a pioneering work on answering where to finetune (at the layer level), we conduct a semantic analysis of the LM inference process. We first propose a virtual transition of the latent representation and then trace its factual transition. Based on the deviation in transitions, we estimate the gain of finetuning each model layer, and further, narrow down the scope for finetuning. We perform extensive experiments across well-known LMs and datasets. The results show that our approach is effective and efficient, and outperforms the existing baselines. Our approach is orthogonal to existing efficient techniques, such as PEFT methods, offering practical values on LM finetuning.
comment: 13 pages, 5 figures, under peer-review
☆ Multi-Layer Ranking with Large Language Models for News Source Recommendation SIGIR 2024
To seek reliable information sources for news events, we introduce a novel task of expert recommendation, which aims to identify trustworthy sources based on their previously quoted statements. To achieve this, we built a novel dataset, called NewsQuote, consisting of 23,571 quote-speaker pairs sourced from a collection of news articles. We formulate the recommendation task as the retrieval of experts based on their likelihood of being associated with a given query. We also propose a multi-layer ranking framework employing Large Language Models to improve the recommendation performance. Our results show that employing an in-context learning based LLM ranker and a multi-layer ranking-based filter significantly improve both the predictive quality and behavioural quality of the recommender system.
comment: Accepted by the SIGIR 2024. arXiv admin note: text overlap with arXiv:2305.04825
☆ Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models
One of the primary driving forces contributing to the superior performance of Large Language Models (LLMs) is the extensive availability of human-annotated natural language data, which is used for alignment fine-tuning. This inspired researchers to investigate self-training methods to mitigate the extensive reliance on human annotations. However, the current success of self-training has been primarily observed in natural language scenarios, rather than in the increasingly important neural-symbolic scenarios. To this end, we propose an environment-guided neural-symbolic self-training framework named ENVISIONS. It aims to overcome two main challenges: (1) the scarcity of symbolic data, and (2) the limited proficiency of LLMs in processing symbolic language. Extensive evaluations conducted on three distinct domains demonstrate the effectiveness of our approach. Additionally, we have conducted a comprehensive analysis to uncover the factors contributing to ENVISIONS's success, thereby offering valuable insights for future research in this area. Code will be available at \url{https://github.com/xufangzhi/ENVISIONS}.
comment: 18 pages, 6 figures
☆ 1000 African Voices: Advancing inclusive multi-speaker multi-accent speech synthesis
Recent advances in speech synthesis have enabled many useful applications like audio directions in Google Maps, screen readers, and automated content generation on platforms like TikTok. However, these systems are mostly dominated by voices sourced from data-rich geographies with personas representative of their source data. Although 3000 of the world's languages are domiciled in Africa, African voices and personas are under-represented in these systems. As speech synthesis becomes increasingly democratized, it is desirable to increase the representation of African English accents. We present Afro-TTS, the first pan-African accented English speech synthesis system able to generate speech in 86 African accents, with 1000 personas representing the rich phonological diversity across the continent for downstream application in Education, Public Health, and Automated Content Creation. Speaker interpolation retains naturalness and accentedness, enabling the creation of new voices.
comment: Accepted at Interspeech 2024
☆ Zero-Shot Generalization during Instruction Tuning: Insights from Similarity and Granularity
Understanding alignment techniques begins with comprehending zero-shot generalization brought by instruction tuning, but little of the mechanism has been understood. Existing work has largely been confined to the task level, without considering that tasks are artificially defined and, to LLMs, merely consist of tokens and representations. This line of research has been limited to examining transfer between tasks from a task-pair perspective, with few studies focusing on understanding zero-shot generalization from the perspective of the data itself. To bridge this gap, we first demonstrate through multiple metrics that zero-shot generalization during instruction tuning happens very early. Next, we investigate the facilitation of zero-shot generalization from both data similarity and granularity perspectives, confirming that encountering highly similar and fine-grained training data earlier during instruction tuning, without the constraints of defined "tasks", enables better generalization. Finally, we propose a more grounded training data arrangement method, Test-centric Multi-turn Arrangement, and show its effectiveness in promoting continual learning and further loss reduction. For the first time, we show that zero-shot generalization during instruction tuning is a form of similarity-based generalization between training and test data at the instance level. We hope our analysis will advance the understanding of zero-shot generalization during instruction tuning and contribute to the development of more aligned LLMs. Our code is released at https://github.com/HBX-hbx/dynamics_of_zero-shot_generalization.
comment: 33 pages, 14 figures
☆ Refusal in Language Models Is Mediated by a Single Direction
Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones. While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood. In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size. Specifically, for each model, we find a single direction such that erasing this direction from the model's residual stream activations prevents it from refusing harmful instructions, while adding this direction elicits refusal on even harmless instructions. Leveraging this insight, we propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities. Finally, we mechanistically analyze how adversarial suffixes suppress propagation of the refusal-mediating direction. Our findings underscore the brittleness of current safety fine-tuning methods. More broadly, our work showcases how an understanding of model internals can be leveraged to develop practical methods for controlling model behavior.
☆ Measuring memorization in RLHF for code completion
Reinforcement learning with human feedback (RLHF) has become the dominant method to align large models to user preferences. Unlike fine-tuning, for which there are many studies regarding training data memorization, it is not clear how memorization is affected by or introduced in the RLHF alignment process. Understanding this relationship is important as real user data may be collected and used to align large models; if user data is memorized during RLHF and later regurgitated, this could raise privacy concerns. In this work, we analyze how training data memorization can surface and propagate through each phase of RLHF. We focus our study on code completion models, as code completion is one of the most popular use cases for large language models. We find that RLHF significantly decreases the chance that data used for reward modeling and reinforcement learning is memorized, in comparison to aligning via directly fine-tuning on this data, but that examples already memorized during the fine-tuning stage of RLHF, will, in the majority of cases, remain memorized after RLHF.
☆ Instruct, Not Assist: LLM-based Multi-Turn Planning and Hierarchical Questioning for Socratic Code Debugging
Socratic questioning is an effective teaching strategy, encouraging critical thinking and problem-solving. The conversational capabilities of large language models (LLMs) show great potential for providing scalable, real-time student guidance. However, current LLMs often give away solutions directly, making them ineffective instructors. We tackle this issue in the code debugging domain with TreeInstruct, an Instructor agent guided by a novel state space-based planning algorithm. TreeInstruct asks probing questions to help students independently identify and resolve errors. It estimates a student's conceptual and syntactical knowledge to dynamically construct a question tree based on their responses and current knowledge state, effectively addressing both independent and dependent mistakes concurrently in a multi-turn interaction setting. In addition to using an existing single-bug debugging benchmark, we construct a more challenging multi-bug dataset of 150 coding problems, incorrect solutions, and bug fixes -- all carefully constructed and annotated by experts. Extensive evaluation shows TreeInstruct's state-of-the-art performance on both datasets, proving it to be a more effective instructor than baselines. Furthermore, a real-world case study with five students of varying skill levels further demonstrates TreeInstruct's ability to guide students to debug their code efficiently with minimal turns and highly Socratic questioning.
Prompts as Auto-Optimized Training Hyperparameters: Training Best-in-Class IR Models from Scratch with 10 Gold Labels
We develop a method for training small-scale (under 100M parameter) neural information retrieval models with as few as 10 gold relevance labels. The method depends on generating synthetic queries for documents using a language model (LM), and the key step is that we automatically optimize the LM prompt that is used to generate these queries based on training quality. In experiments with the BIRCO benchmark, we find that models trained with our method outperform RankZephyr and are competitive with RankLLama, both of which are 7B parameter models trained on over 100K labels. These findings point to the power of automatic prompt optimization for synthetic dataset generation.
☆ Nemotron-4 340B Technical Report
We release the Nemotron-4 340B model family, including Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward. Our models are open access under the NVIDIA Open Model License Agreement, a permissive model license that allows distribution, modification, and use of the models and its outputs. These models perform competitively to open access models on a wide range of evaluation benchmarks, and were sized to fit on a single DGX H100 with 8 GPUs when deployed in FP8 precision. We believe that the community can benefit from these models in various research studies and commercial applications, especially for generating synthetic data to train smaller language models. Notably, over 98% of data used in our model alignment process is synthetically generated, showcasing the effectiveness of these models in generating synthetic data. To further support open research and facilitate model development, we are also open-sourcing the synthetic data generation pipeline used in our model alignment process.
☆ Meta Reasoning for Large Language Models
We introduce Meta-Reasoning Prompting (MRP), a novel and efficient system prompting method for large language models (LLMs) inspired by human meta-reasoning. Traditional in-context learning-based reasoning techniques, such as Tree-of-Thoughts, show promise but lack consistent state-of-the-art performance across diverse tasks due to their specialized nature. MRP addresses this limitation by guiding LLMs to dynamically select and apply different reasoning methods based on the specific requirements of each task, optimizing both performance and computational efficiency. With MRP, LLM reasoning operates in two phases. Initially, the LLM identifies the most appropriate reasoning method using task input cues and objective descriptions of available methods. Subsequently, it applies the chosen method to complete the task. This dynamic strategy mirrors human meta-reasoning, allowing the model to excel in a wide range of problem domains. We evaluate the effectiveness of MRP through comprehensive benchmarks. The results demonstrate that MRP achieves or approaches state-of-the-art performance across diverse tasks. MRP represents a significant advancement in enabling LLMs to identify cognitive challenges across problems and leverage benefits across different reasoning approaches, enhancing their ability to handle diverse and complex problem domains efficiently. Every LLM deserves a Meta-Reasoning Prompting to unlock its full potential and ensure adaptability in an ever-evolving landscape of challenges and applications.
☆ Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs
Language Model Programs, i.e. sophisticated pipelines of modular language model (LM) calls, are increasingly advancing NLP tasks, but they require crafting prompts that are jointly effective for all modules. We study prompt optimization for LM programs, i.e. how to update these prompts to maximize a downstream metric without access to module-level labels or gradients. To make this tractable, we factorize our problem into optimizing the free-form instructions and few-shot demonstrations of every module and introduce several strategies to craft task-grounded instructions and navigate credit assignment across modules. Our strategies include (i) program- and data-aware techniques for proposing effective instructions, (ii) a stochastic mini-batch evaluation function for learning a surrogate model of our objective, and (iii) a meta-optimization procedure in which we refine how LMs construct proposals over time. Using these insights we develop MIPRO, a novel optimizer that outperforms baselines on five of six diverse LM programs using a best-in-class open-source model (Llama-3-8B), by as high as 12.9% accuracy. We will release our new optimizers and benchmark in DSPy at https://github.com/stanfordnlp/dspy
comment: Krista and Michael contributed equally to this work
☆ Tokenization Falling Short: The Curse of Tokenization
Language models typically tokenize raw text into sequences of subword identifiers from a predefined vocabulary, a process inherently sensitive to typographical errors, length variations, and largely oblivious to the internal structure of tokens-issues we term the curse of tokenization. In this study, we delve into these drawbacks and demonstrate that large language models (LLMs) remain susceptible to these problems. This study systematically investigates these challenges and their impact on LLMs through three critical research questions: (1) complex problem solving, (2) token structure probing, and (3) resilience to typographical variation. Our findings reveal that scaling model parameters can mitigate the issue of tokenization; however, LLMs still suffer from biases induced by typos and other text format variations. Our experiments show that subword regularization such as BPE-dropout can mitigate this issue. We will release our code and data to facilitate further research.
☆ HoLLMwood: Unleashing the Creativity of Large Language Models in Screenwriting via Role Playing
Generative AI has demonstrated unprecedented creativity in the field of computer vision, yet such phenomena have not been observed in natural language processing. In particular, large language models (LLMs) can hardly produce written works at the level of human experts due to the extremely high complexity of literature writing. In this paper, we present HoLLMwood, an automated framework for unleashing the creativity of LLMs and exploring their potential in screenwriting, which is a highly demanding task. Mimicking the human creative process, we assign LLMs to different roles involved in the real-world scenario. In addition to the common practice of treating LLMs as ${Writer}$, we also apply LLMs as ${Editor}$, who is responsible for providing feedback and revision advice to ${Writer}$. Besides, to enrich the characters and deepen the plots, we introduce a role-playing mechanism and adopt LLMs as ${Actors}$ that can communicate and interact with each other. Evaluations on automatically generated screenplays show that HoLLMwood substantially outperforms strong baselines in terms of coherence, relevance, interestingness and overall quality.
☆ Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack
Large language models (LLMs) have been increasingly applied to various domains, which triggers increasing concerns about LLMs' safety on specialized domains, e.g. medicine. However, testing the domain-specific safety of LLMs is challenging due to the lack of domain knowledge-driven attacks in existing benchmarks. To bridge this gap, we propose a new task, knowledge-to-jailbreak, which aims to generate jailbreaks from domain knowledge to evaluate the safety of LLMs when applied to those domains. We collect a large-scale dataset with 12,974 knowledge-jailbreak pairs and fine-tune a large language model as jailbreak-generator, to produce domain knowledge-specific jailbreaks. Experiments on 13 domains and 8 target LLMs demonstrate the effectiveness of jailbreak-generator in generating jailbreaks that are both relevant to the given knowledge and harmful to the target LLMs. We also apply our method to an out-of-domain knowledge base, showing that jailbreak-generator can generate jailbreaks that are comparable in harmfulness to those crafted by human experts. Data and code: https://github.com/THU-KEG/Knowledge-to-Jailbreak/.
comment: 18 pages, 14 figures, 11 tables
☆ R-Eval: A Unified Toolkit for Evaluating Domain Knowledge of Retrieval Augmented Large Language Models KDD2024
Large language models have achieved remarkable success on general NLP tasks, but they may fall short for domain-specific problems. Recently, various Retrieval-Augmented Large Language Models (RALLMs) are proposed to address this shortcoming. However, existing evaluation tools only provide a few baselines and evaluate them on various domains without mining the depth of domain knowledge. In this paper, we address the challenges of evaluating RALLMs by introducing the R-Eval toolkit, a Python toolkit designed to streamline the evaluation of different RAG workflows in conjunction with LLMs. Our toolkit, which supports popular built-in RAG workflows and allows for the incorporation of customized testing data on the specific domain, is designed to be user-friendly, modular, and extensible. We conduct an evaluation of 21 RALLMs across three task levels and two representative domains, revealing significant variations in the effectiveness of RALLMs across different tasks and domains. Our analysis emphasizes the importance of considering both task and domain requirements when choosing a RAG workflow and LLM combination. We are committed to continuously maintaining our platform at https://github.com/THU-KEG/R-Eval to facilitate both the industry and the researchers.
comment: 12 pages, 9 figures, Accepted by KDD2024
☆ TourRank: Utilizing Large Language Models for Documents Ranking with a Tournament-Inspired Strategy
Large Language Models (LLMs) are increasingly employed in zero-shot documents ranking, yielding commendable results. However, several significant challenges still persist in LLMs for ranking: (1) LLMs are constrained by limited input length, precluding them from processing a large number of documents simultaneously; (2) The output document sequence is influenced by the input order of documents, resulting in inconsistent ranking outcomes; (3) Achieving a balance between cost and ranking performance is quite challenging. To tackle these issues, we introduce a novel documents ranking method called TourRank, which is inspired by the tournament mechanism. This approach alleviates the impact of LLM's limited input length through intelligent grouping, while the tournament-like points system ensures robust ranking, mitigating the influence of the document input sequence. We test TourRank with different LLMs on the TREC DL datasets and the BEIR benchmark. Experimental results show that TourRank achieves state-of-the-art performance at a reasonable cost.
☆ BLoB: Bayesian Low-Rank Adaptation by Backpropagation for Large Language Models
Large Language Models (LLMs) often suffer from overconfidence during inference, particularly when adapted to downstream domain-specific tasks with limited data. Previous work addresses this issue by employing approximate Bayesian estimation after the LLMs are trained, enabling them to quantify uncertainty. However, such post-training approaches' performance is severely limited by the parameters learned during training. In this paper, we go beyond post-training Bayesianization and propose Bayesian Low-Rank Adaptation by Backpropagation (BLoB), an algorithm that continuously and jointly adjusts both the mean and covariance of LLM parameters throughout the whole fine-tuning process. Our empirical results verify the effectiveness of BLoB in terms of generalization and uncertainty estimation, when evaluated on both in-distribution and out-of-distribution data.
comment: 27 pages, 3 figures, 9 tables; preprint, work in progress
☆ Endor: Hardware-Friendly Sparse Format for Offloaded LLM Inference
The increasing size of large language models (LLMs) challenges their usage on resource-constrained platforms. For example, memory on modern GPUs is insufficient to hold LLMs that are hundreds of Gigabytes in size. Offloading is a popular method to escape this constraint by storing weights of an LLM model to host CPU memory and SSD, then loading each weight to GPU before every use. In our case study of offloaded inference, we found that due to the low bandwidth between storage devices and GPU, the latency of transferring large model weights from its offloaded location to GPU memory becomes the critical bottleneck with actual compute taking nearly 0% of runtime. To effectively reduce the weight transfer latency, we propose a novel sparse format that compresses the unstructured sparse pattern of pruned LLM weights to non-zero values with high compression ratio and low decompression overhead. Endor achieves this by expressing the positions of non-zero elements with a bitmap. Compared to offloaded inference using the popular Huggingface Accelerate, applying Endor accelerates OPT-66B by 1.70x and Llama2-70B by 1.78x. When direct weight transfer from SSD to GPU is leveraged, Endor achieves 2.25x speedup on OPT-66B and 2.37x speedup on Llama2-70B.
comment: 14 pages, 16 figures
☆ Benchmarking of LLM Detection: Comparing Two Competing Approaches
This article gives an overview of the field of LLM text recognition. Different approaches and implemented detectors for the recognition of LLM-generated text are presented. In addition to discussing the implementations, the article focuses on benchmarking the detectors. Although there are numerous software products for the recognition of LLM-generated text, with a focus on ChatGPT-like LLMs, the quality of the recognition (recognition rate) is not clear. Furthermore, while it can be seen that scientific contributions presenting their novel approaches strive for some kind of comparison with other approaches, the construction and independence of the evaluation dataset is often not comprehensible. As a result, discrepancies in the performance evaluation of LLM detectors are often visible due to the different benchmarking datasets. This article describes the creation of an evaluation dataset and uses this dataset to investigate the different detectors. The selected detectors are benchmarked against each other.
☆ "Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak
"Jailbreak" is a major safety concern of Large Language Models (LLMs), which occurs when malicious prompts lead LLMs to produce harmful outputs, raising issues about the reliability and safety of LLMs. Therefore, an effective evaluation of jailbreaks is very crucial to develop its mitigation strategies. However, our research reveals that many jailbreaks identified by current evaluations may actually be hallucinations-erroneous outputs that are mistaken for genuine safety breaches. This finding suggests that some perceived vulnerabilities might not represent actual threats, indicating a need for more precise red teaming benchmarks. To address this problem, we propose the $\textbf{B}$enchmark for reli$\textbf{AB}$ilit$\textbf{Y}$ and jail$\textbf{B}$reak ha$\textbf{L}$l$\textbf{U}$cination $\textbf{E}$valuation (BabyBLUE). BabyBLUE introduces a specialized validation framework including various evaluators to enhance existing jailbreak benchmarks, ensuring outputs are useful malicious instructions. Additionally, BabyBLUE presents a new dataset as an augmentation to the existing red teaming benchmarks, specifically addressing hallucinations in jailbreaks, aiming to evaluate the true potential of jailbroken LLM outputs to cause harm to human society.
☆ See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding
Vision-language models (VLMs) can respond to queries about images in many languages. However, beyond language, culture affects how we see things. For example, individuals from Western cultures focus more on the central figure in an image while individuals from Eastern cultures attend more to scene context. In this work, we present a novel investigation that demonstrates and localizes VLMs' Western bias in image understanding. We evaluate large VLMs across subjective and objective visual tasks with culturally diverse images and annotations. We find that VLMs perform better on the Western subset than the Eastern subset of each task. Controlled experimentation tracing the source of this bias highlights the importance of a diverse language mix in text-only pre-training for building equitable VLMs, even when inference is performed in English. Moreover, while prompting in the language of a target culture can lead to reductions in bias, it is not a substitute for building AI more representative of the world's languages.
comment: 17 pages, 7 figures. Code/models: https://github.com/amith-ananthram/see-it-from-my-perspective
☆ Cultural Conditioning or Placebo? On the Effectiveness of Socio-Demographic Prompting
Socio-demographic prompting is a commonly employed approach to study cultural biases in LLMs as well as for aligning models to certain cultures. In this paper, we systematically probe four LLMs (Llama 3, Mistral v0.2, GPT-3.5 Turbo and GPT-4) with prompts that are conditioned on culturally sensitive and non-sensitive cues, on datasets that are supposed to be culturally sensitive (EtiCor and CALI) or neutral (MMLU and ETHICS). We observe that all models except GPT-4 show significant variations in their responses on both kinds of datasets for both kinds of prompts, casting doubt on the robustness of the culturally-conditioned prompting as a method for eliciting cultural bias in models or as an alignment strategy. The work also calls rethinking the control experiment design to tease apart the cultural conditioning of responses from "placebo effect", i.e., random perturbations of model responses due to arbitrary tokens in the prompt.
☆ Can LLM be a Personalized Judge?
Ensuring that large language models (LLMs) reflect diverse user values and preferences is crucial as their user bases expand globally. It is therefore encouraging to see the growing interest in LLM personalization within the research community. However, current works often rely on the LLM-as-a-Judge approach for evaluation without thoroughly examining its validity. In this paper, we investigate the reliability of LLM-as-a-Personalized-Judge, asking LLMs to judge user preferences based on personas. Our findings suggest that directly applying LLM-as-a-Personalized-Judge is less reliable than previously assumed, showing low and inconsistent agreement with human ground truth. The personas typically used are often overly simplistic, resulting in low predictive power. To address these issues, we introduce verbal uncertainty estimation into the LLM-as-a-Personalized-Judge pipeline, allowing the model to express low confidence on uncertain judgments. This adjustment leads to much higher agreement (above 80%) on high-certainty samples for binary tasks. Through human evaluation, we find that the LLM-as-a-Personalized-Judge achieves comparable performance to third-party humans evaluation and even surpasses human performance on high-certainty samples. Our work indicates that certainty-enhanced LLM-as-a-Personalized-Judge offers a promising direction for developing more reliable and scalable methods for evaluating LLM personalization.
comment: Our code is available at https://github.com/dong-river/Personalized-Judge
☆ Ruby Teaming: Improving Quality Diversity Search with Memory for Automated Red Teaming
We propose Ruby Teaming, a method that improves on Rainbow Teaming by including a memory cache as its third dimension. The memory dimension provides cues to the mutator to yield better-quality prompts, both in terms of attack success rate (ASR) and quality diversity. The prompt archive generated by Ruby Teaming has an ASR of 74%, which is 20% higher than the baseline. In terms of quality diversity, Ruby Teaming outperforms Rainbow Teaming by 6% and 3% on Shannon's Evenness Index (SEI) and Simpson's Diversity Index (SDI), respectively.
☆ A Two-dimensional Zero-shot Dialogue State Tracking Evaluation Method using GPT-4
Dialogue state tracking (DST) is evaluated by exact matching methods, which rely on large amounts of labeled data and ignore semantic consistency, leading to over-evaluation. Currently, leveraging large language models (LLM) in evaluating natural language processing tasks has achieved promising results. However, using LLM for DST evaluation is still under explored. In this paper, we propose a two-dimensional zero-shot evaluation method for DST using GPT-4, which divides the evaluation into two dimensions: accuracy and completeness. Furthermore, we also design two manual reasoning paths in prompting to further improve the accuracy of evaluation. Experimental results show that our method achieves better performance compared to the baselines, and is consistent with traditional exact matching based methods.
☆ The Base-Rate Effect on LLM Benchmark Performance: Disambiguating Test-Taking Strategies from Benchmark Performance
Cloze testing is a common method for measuring the behavior of large language models on a number of benchmark tasks. Using the MMLU dataset, we show that the base-rate probability (BRP) differences across answer tokens are significant and affect task performance ie. guess A if uncertain. We find that counterfactual prompting does sufficiently mitigate the BRP effect. The BRP effect is found to have a similar effect to test taking strategies employed by humans leading to the conflation of task performance and test-taking ability. We propose the Nvr-X-MMLU task, a variation of MMLU, which helps to disambiguate test-taking ability from task performance and reports the latter.
☆ Unveiling the Power of Source: Source-based Minimum Bayes Risk Decoding for Neural Machine Translation
Maximum a posteriori decoding, a commonly used method for neural machine translation (NMT), aims to maximize the estimated posterior probability. However, high estimated probability does not always lead to high translation quality. Minimum Bayes Risk (MBR) decoding offers an alternative by seeking hypotheses with the highest expected utility. In this work, we show that Quality Estimation (QE) reranking, which uses a QE model as a reranker, can be viewed as a variant of MBR. Inspired by this, we propose source-based MBR (sMBR) decoding, a novel approach that utilizes synthetic sources generated by backward translation as ``support hypotheses'' and a reference-free quality estimation metric as the utility function, marking the first work to solely use sources in MBR decoding. Experiments show that sMBR significantly outperforms QE reranking and is competitive with standard MBR decoding. Furthermore, sMBR calls the utility function fewer times compared to MBR. Our findings suggest that sMBR is a promising approach for high-quality NMT decoding.
☆ Can Many-Shot In-Context Learning Help Long-Context LLM Judges? See More, Judge Better!
Leveraging Large Language Models (LLMs) as judges for evaluating the performance of LLMs has recently garnered attention. Nonetheless, this type of approach concurrently introduces potential biases from LLMs, raising concerns about the reliability of the evaluation results. To mitigate this issue, we propose and study two versions of many-shot in-context prompts, Reinforced and Unsupervised ICL, for helping GPT-4o-as-a-Judge in single answer grading. Based on the designed prompts, we investigate the impact of scaling the number of in-context examples on the agreement and quality of the evaluation. Furthermore, we first reveal the symbol bias in GPT-4o-as-a-Judge for pairwise comparison and then propose a simple yet effective approach to mitigate it. Experimental results show that advanced long-context LLMs, such as GPT-4o, perform better in the many-shot regime than in the zero-shot regime. Meanwhile, the experimental results further verify the effectiveness of the symbol bias mitigation approach.
comment: work in progress
☆ Words in Motion: Representation Engineering for Motion Forecasting
Motion forecasting transforms sequences of past movements and environment context into future motion. Recent methods rely on learned representations, resulting in hidden states that are difficult to interpret. In this work, we use natural language to quantize motion features in a human-interpretable way, and measure the degree to which they are embedded in hidden states. Our experiments reveal that hidden states of motion sequences are arranged with respect to our discrete sets of motion features. Following these insights, we fit control vectors to motion features, which allow for controlling motion forecasts at inference. Consequently, our method enables controlling transformer-based motion forecasting models with textual inputs, providing a unique interface to interact with and understand these models. Our implementation is available at https://github.com/kit-mrt/future-motion
☆ Building Knowledge-Guided Lexica to Model Cultural Variation NAACL 2024
Cultural variation exists between nations (e.g., the United States vs. China), but also within regions (e.g., California vs. Texas, Los Angeles vs. San Francisco). Measuring this regional cultural variation can illuminate how and why people think and behave differently. Historically, it has been difficult to computationally model cultural variation due to a lack of training data and scalability constraints. In this work, we introduce a new research problem for the NLP community: How do we measure variation in cultural constructs across regions using language? We then provide a scalable solution: building knowledge-guided lexica to model cultural variation, encouraging future work at the intersection of NLP and cultural understanding. We also highlight modern LLMs' failure to measure cultural variation or generate culturally varied language.
comment: Accepted at NAACL 2024
☆ DELLA-Merging: Reducing Interference in Model Merging through Magnitude-Based Sampling
With the proliferation of domain-specific models, model merging has emerged as a set of techniques that combine the capabilities of multiple models into one that can multitask without the cost of additional training. In this paper, we propose a new model merging technique, Drop and rEscaLe via sampLing with mAgnitude (DELLA-Merging), that employs a novel pruning technique, MAGPRUNE, which shows significant advantages over DARE and TIES. MAGPRUNE first ranks the parameters in order of their magnitude and assigns higher dropout probabilities (p) to parameters with lower ranks corresponding to lower magnitudes. To approximate the original embeddings, MAGPRUNE employs a rescaling operation on the parameters that survive the random dropping by 1/(1 - p). On three different expert models considered for merging (LM, Math, Code) and corresponding benchmark datasets (AlpacaEval, GSM8K, MBPP), DELLA shows an average improvement of 2.4 points over baseline methods employing delta parameter pruning (an improvement of 3.6 points over TIES, 1.2 points over DARE), and 11.1 points over the no-pruning baseline (TA). We release the source code at: https://github.com/declare-lab/della.
☆ Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces
The task of "unlearning" certain concepts in large language models (LLMs) has attracted immense attention recently, due to its importance for mitigating undesirable model behaviours, such as the generation of harmful, private, or incorrect information. Current protocols to evaluate unlearning methods largely rely on behavioral tests, without monitoring the presence of unlearned knowledge within the model's parameters. This residual knowledge can be adversarially exploited to recover the erased information post-unlearning. We argue that unlearning should also be evaluated internally, by considering changes in the parametric knowledge traces of the unlearned concepts. To this end, we propose a general methodology for eliciting directions in the parameter space (termed "concept vectors") that encode concrete concepts, and construct ConceptVectors, a benchmark dataset containing hundreds of common concepts and their parametric knowledge traces within two open-source LLMs. Evaluation on ConceptVectors shows that existing unlearning methods minimally impact concept vectors, while directly ablating these vectors demonstrably removes the associated knowledge from the LLMs and significantly reduces their susceptibility to adversarial manipulation. Our results highlight limitations in behavioral-based unlearning evaluations and call for future work to include parametric-based evaluations. To support this, we release our code and benchmark at https://github.com/yihuaihong/ConceptVectors.
☆ Understanding "Democratization" in NLP and ML Research
Recent improvements in natural language processing (NLP) and machine learning (ML) and increased mainstream adoption have led to researchers frequently discussing the "democratization" of artificial intelligence. In this paper, we seek to clarify how democratization is understood in NLP and ML publications, through large-scale mixed-methods analyses of papers using the keyword "democra*" published in NLP and adjacent venues. We find that democratization is most frequently used to convey (ease of) access to or use of technologies, without meaningfully engaging with theories of democratization, while research using other invocations of "democra*" tends to be grounded in theories of deliberation and debate. Based on our findings, we call for researchers to enrich their use of the term democratization with appropriate theory, towards democratic technologies beyond superficial access.
☆ Style Transfer with Multi-iteration Preference Optimization
Numerous recent techniques for text style transfer characterize their approaches as variants of reinforcement learning and preference optimization. In this work, we consider the relationship between these approaches and a class of optimization approaches developed primarily for (non-neural) statistical machine translation, formerly known as 'tuning'. Inspired by these techniques from the past, we improve upon established preference optimization approaches, incorporating multiple iterations of exploration and optimization, and choosing contrastive examples by following a 'hope' vs 'fear' sampling strategy. Cognizant of the difference between machine translation and style transfer, however, we further tailor our framework with a new pseudo-parallel generation method and a dynamic weighted reward aggregation method to tackle the lack of parallel data and the need for a multi-objective reward. We evaluate our model on two commonly used text style transfer datasets. Through automatic and human evaluation results we show the effectiveness and the superiority of our model compared to state-of-the-art baselines.
☆ Error Span Annotation: A Balanced Approach for Human Evaluation of Machine Translation
High-quality Machine Translation (MT) evaluation relies heavily on human judgments. Comprehensive error classification methods, such as Multidimensional Quality Metrics (MQM), are expensive as they are time-consuming and can only be done by experts, whose availability may be limited especially for low-resource languages. On the other hand, just assigning overall scores, like Direct Assessment (DA), is simpler and faster and can be done by translators of any level, but are less reliable. In this paper, we introduce Error Span Annotation (ESA), a human evaluation protocol which combines the continuous rating of DA with the high-level error severity span marking of MQM. We validate ESA by comparing it to MQM and DA for 12 MT systems and one human reference translation (English to German) from WMT23. The results show that ESA offers faster and cheaper annotations than MQM at the same quality level, without the requirement of expensive MQM experts.
☆ Mathematical Entities: Corpora and Benchmarks
Mathematics is a highly specialized domain with its own unique set of challenges. Despite this, there has been relatively little research on natural language processing for mathematical texts, and there are few mathematical language resources aimed at NLP. In this paper, we aim to provide annotated corpora that can be used to study the language of mathematics in different contexts, ranging from fundamental concepts found in textbooks to advanced research mathematics. We preprocess the corpora with a neural parsing model and some manual intervention to provide part-of-speech tags, lemmas, and dependency trees. In total, we provide 182397 sentences across three corpora. We then aim to test and evaluate several noteworthy natural language processing models using these corpora, to show how well they can adapt to the domain of mathematics and provide useful tools for exploring mathematical language. We evaluate several neural and symbolic models against benchmarks that we extract from the corpus metadata to show that terminology extraction and definition extraction do not easily generalize to mathematics, and that additional work is needed to achieve good performance on these metrics. Finally, we provide a learning assistant that grants access to the content of these corpora in a context-sensitive manner, utilizing text search and entity linking. Though our corpora and benchmarks provide useful metrics for evaluating mathematical language processing, further work is necessary to adapt models to mathematics in order to provide more effective learning assistants and apply NLP methods to different mathematical domains.
comment: arXiv admin note: substantial text overlap with arXiv:2307.06699
☆ Towards an End-to-End Framework for Invasive Brain Signal Decoding with Large Language Models
In this paper, we introduce a groundbreaking end-to-end (E2E) framework for decoding invasive brain signals, marking a significant advancement in the field of speech neuroprosthesis. Our methodology leverages the comprehensive reasoning abilities of large language models (LLMs) to facilitate direct decoding. By fully integrating LLMs, we achieve results comparable to the state-of-the-art cascade models. Our findings underscore the immense potential of E2E frameworks in speech neuroprosthesis, particularly as the technology behind brain-computer interfaces (BCIs) and the availability of relevant datasets continue to evolve. This work not only showcases the efficacy of combining LLMs with E2E decoding for enhancing speech neuroprosthesis but also sets a new direction for future research in BCI applications, underscoring the impact of LLMs in decoding complex neural signals for communication restoration. Code will be made available at https://github.com/FsFrancis15/BrainLLM.
☆ MEMLA: Enhancing Multilingual Knowledge Editing with Neuron-Masked Low-Rank Adaptation
Knowledge editing aims to adjust the knowledge within large language models (LLMs) to prevent their responses from becoming obsolete or inaccurate. However, existing works on knowledge editing are primarily conducted in a single language, which is inadequate for multilingual language models. In this paper, we focus on multilingual knowledge editing (MKE), which requires propagating updates across multiple languages. This necessity poses a significant challenge for the task. Furthermore, the limited availability of a comprehensive dataset for MKE exacerbates this challenge, hindering progress in this area. Hence, we introduce the Multilingual Knowledge Editing Benchmark (MKEB), a novel dataset comprising 12 languages and providing a complete evaluation framework. Additionally, we propose a method that enhances Multilingual knowledge Editing with neuron-Masked Low-Rank Adaptation (MEMLA). Specifically, we identify two categories of knowledge neurons to improve editing precision. Moreover, we perform LoRA-based editing with neuron masks to efficiently modify parameters and facilitate the propagation of updates across multiple languages. Experiments demonstrate that our method outperforms existing baselines and significantly enhances the multi-hop reasoning capability of the edited model, with minimal impact on its downstream task performance. The dataset and code will be made publicly available.
☆ Extrinsic Evaluation of Cultural Competence in Large Language Models
Productive interactions between diverse users and language technologies require outputs from the latter to be culturally relevant and sensitive. Prior works have evaluated models' knowledge of cultural norms, values, and artifacts, without considering how this knowledge manifests in downstream applications. In this work, we focus on extrinsic evaluation of cultural competence in two text generation tasks, open-ended question answering and story generation. We quantitatively and qualitatively evaluate model outputs when an explicit cue of culture, specifically nationality, is perturbed in the prompts. Although we find that model outputs do vary when varying nationalities and feature culturally relevant words, we also find weak correlations between text similarity of outputs for different countries and the cultural values of these countries. Finally, we discuss important considerations in designing comprehensive evaluation of cultural competence in user-facing tasks.
comment: Under peer review
☆ Input Conditioned Graph Generation for Language Agents
Recent progress in Large Language Models (LLMs) and language agents has demonstrated significant promise for various future applications across multiple disciplines. While traditional approaches to language agents often rely on fixed, handcrafted designs, our research aims to develop both learnable and dynamic agents. Our method uses an existing framework that abstracts language agents as graphs. Within this graph framework, we aim to learn a model that can generate edges for every given input to the language agent. This allows us to generate edges that represent the flow of communication within the graph based on the given input, thereby adjusting the internal communication of a language agent. We learn to generate these edges using a pretrained LLM that is fine-tuned with reinforcement learning. This LLM can be fine-tuned on several datasets simultaneously, and we hypothesize that the model learns to adapt to these different domains during training, achieving good overall performance when encountering data from different domains during deployment. We demonstrate that our approach surpasses the previous static approach by nearly 6% accuracy on a combined dataset of MMLU and CMMLU, and by more than 10% when trained with a sparsity-inducing loss. It also performs superior in additional experiments conducted with the MMLU and Mini Crossword Puzzles datasets. The code is available at https://github.com/lukasVierling/DynamicGPTSwarm.
☆ GECOBench: A Gender-Controlled Text Dataset and Benchmark for Quantifying Biases in Explanations
Large pre-trained language models have become popular for many applications and form an important backbone of many downstream tasks in natural language processing (NLP). Applying 'explainable artificial intelligence' (XAI) techniques to enrich such models' outputs is considered crucial for assuring their quality and shedding light on their inner workings. However, large language models are trained on a plethora of data containing a variety of biases, such as gender biases, affecting model weights and, potentially, behavior. Currently, it is unclear to what extent such biases also impact model explanations in possibly unfavorable ways. We create a gender-controlled text dataset, GECO, in which otherwise identical sentences appear in male and female forms. This gives rise to ground-truth 'world explanations' for gender classification tasks, enabling the objective evaluation of the correctness of XAI methods. We also provide GECOBench, a rigorous quantitative evaluation framework benchmarking popular XAI methods, applying them to pre-trained language models fine-tuned to different degrees. This allows us to investigate how pre-training induces undesirable bias in model explanations and to what extent fine-tuning can mitigate such explanation bias. We show a clear dependency between explanation performance and the number of fine-tuned layers, where XAI methods are observed to particularly benefit from fine-tuning or complete retraining of embedding layers. Remarkably, this relationship holds for models achieving similar classification performance on the same task. With that, we highlight the utility of the proposed gender-controlled dataset and novel benchmarking approach for research and development of novel XAI methods. All code including dataset generation, model training, evaluation and visualization is available at: https://github.com/braindatalab/gecobench
comment: Under review
☆ GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement
The evolution of speech technology has been spurred by the rapid increase in dataset sizes. Traditional speech models generally depend on a large amount of labeled training data, which is scarce for low-resource languages. This paper presents GigaSpeech 2, a large-scale, multi-domain, multilingual speech recognition corpus. It is designed for low-resource languages and does not rely on paired speech and text data. GigaSpeech 2 comprises about 30,000 hours of automatically transcribed speech, including Thai, Indonesian, and Vietnamese, gathered from unlabeled YouTube videos. We also introduce an automated pipeline for data crawling, transcription, and label refinement. Specifically, this pipeline uses Whisper for initial transcription and TorchAudio for forced alignment, combined with multi-dimensional filtering for data quality assurance. A modified Noisy Student Training is developed to further refine flawed pseudo labels iteratively, thus enhancing model performance. Experimental results on our manually transcribed evaluation set and two public test sets from Common Voice and FLEURS confirm our corpus's high quality and broad applicability. Notably, ASR models trained on GigaSpeech 2 can reduce the word error rate for Thai, Indonesian, and Vietnamese on our challenging and realistic YouTube test set by 25% to 40% compared to the Whisper large-v3 model, with merely 10% model parameters. Furthermore, our ASR models trained on Gigaspeech 2 yield superior performance compared to commercial services. We believe that our newly introduced corpus and pipeline will open a new avenue for low-resource speech recognition and significantly facilitate research in this area.
comment: Under review
☆ Counterfactual Debating with Preset Stances for Hallucination Elimination of LLMs
Large Language Models (LLMs) excel in various natural language processing tasks but struggle with hallucination issues. Existing solutions have considered utilizing LLMs' inherent reasoning abilities to alleviate hallucination, such as self-correction and diverse sampling methods. However, these methods often overtrust LLMs' initial answers due to inherent biases. The key to alleviating this issue lies in overriding LLMs' inherent biases for answer inspection. To this end, we propose a CounterFactual Multi-Agent Debate (CFMAD) framework. CFMAD presets the stances of LLMs to override their inherent biases by compelling LLMs to generate justifications for a predetermined answer's correctness. The LLMs with different predetermined stances are engaged with a skeptical critic for counterfactual debate on the rationality of generated justifications. Finally, the debate process is evaluated by a third-party judge to determine the final answer. Extensive experiments on four datasets of three tasks demonstrate the superiority of CFMAD over existing methods.
☆ GeoGPT4V: Towards Geometric Multi-modal Large Language Models with Geometric Image Generation
Large language models have seen widespread adoption in math problem-solving. However, in geometry problems that usually require visual aids for better understanding, even the most advanced multi-modal models currently still face challenges in effectively using image information. High-quality data is crucial for enhancing the geometric capabilities of multi-modal models, yet existing open-source datasets and related efforts are either too challenging for direct model learning or suffer from misalignment between text and images. To overcome this issue, we introduce a novel pipeline that leverages GPT-4 and GPT-4V to generate relatively basic geometry problems with aligned text and images, facilitating model learning. We have produced a dataset of 4.9K geometry problems and combined it with 19K open-source data to form our GeoGPT4V dataset. Experimental results demonstrate that the GeoGPT4V dataset significantly improves the geometry performance of various models on the MathVista and MathVision benchmarks. The code is available at https://github.com/Lanyu0303/GeoGPT4V_Project
☆ CrAM: Credibility-Aware Attention Modification in LLMs for Combating Misinformation in RAG
Retrieval-Augmented Generation (RAG) can alleviate hallucinations of Large Language Models (LLMs) by referencing external documents. However, the misinformation in external documents may mislead LLMs' generation. To address this issue, we explore the task of "credibility-aware RAG", in which LLMs automatically adjust the influence of retrieved documents based on their credibility scores to counteract misinformation. To this end, we introduce a plug-and-play method named $\textbf{Cr}$edibility-aware $\textbf{A}$ttention $\textbf{M}$odification (CrAM). CrAM identifies influential attention heads in LLMs and adjusts their attention scores based on the credibility of the documents, thereby reducing the impact of low-credibility documents. Experiments on Natual Questions and TriviaQA using Llama2-13B, Llama3-8B, and Qwen-7B show that CrAM improves the RAG performance of LLMs against misinformation pollution by over 20%, even surpassing supervised fine-tuning methods.
comment: Under review
☆ Analysing zero-shot temporal relation extraction on clinical notes using temporal consistency
This paper presents the first study for temporal relation extraction in a zero-shot setting focusing on biomedical text. We employ two types of prompts and five LLMs (GPT-3.5, Mixtral, Llama 2, Gemma, and PMC-LLaMA) to obtain responses about the temporal relations between two events. Our experiments demonstrate that LLMs struggle in the zero-shot setting performing worse than fine-tuned specialized models in terms of F1 score, showing that this is a challenging task for LLMs. We further contribute a novel comprehensive temporal analysis by calculating consistency scores for each LLM. Our findings reveal that LLMs face challenges in providing responses consistent to the temporal properties of uniqueness and transitivity. Moreover, we study the relation between the temporal consistency of an LLM and its accuracy and whether the latter can be improved by solving temporal inconsistencies. Our analysis shows that even when temporal consistency is achieved, the predictions can remain inaccurate.
☆ Vocabulary Expansion for Low-resource Cross-lingual Transfer
Large language models (LLMs) have shown remarkable capabilities in many languages beyond English. Yet, LLMs require more inference steps when generating non-English text due to their reliance on English-centric tokenizers, vocabulary, and pre-training data, resulting in higher usage costs to non-English speakers. Vocabulary expansion with target language tokens is a widely used cross-lingual vocabulary adaptation approach to remedy this issue. Despite its effectiveness in inference speedup, the majority of previous work has focused on high-resource settings assuming access to a substantial amount of target language data to effectively initialize the embeddings of the new tokens and adapt the LLM to the target language. However, vocabulary expansion for LLMs in low-resource settings (i.e. languages and compute) has yet to be explored. In this paper, we investigate sample-efficient adaptation strategies from different angles, including target vocabulary size and initialization methods, and the amount of target data available for adaptation. Extensive experiments across typologically diverse languages, tasks and models show that simpler heuristic-based embedding initialization is more efficient and robust to changes in target vocabulary size and adaptation data in low-resource settings, outperforming a popular random initialization and a more sophisticated state-of-the-art approach that relies on external data and model.
☆ How Far Can In-Context Alignment Go? Exploring the State of In-Context Alignment
Recent studies have demonstrated that In-Context Learning (ICL), through the use of specific demonstrations, can align Large Language Models (LLMs) with human preferences known as In-Context Alignment (ICA), indicating that models can comprehend human instructions without requiring parameter adjustments. However, the exploration of the mechanism and applicability of ICA remains limited. In this paper, we begin by dividing the context text used in ICA into three categories: format, system prompt, and example. Through ablation experiments, we investigate the effectiveness of each part in enabling ICA to function effectively. We then examine how variants in these parts impact the model's alignment performance. Our findings indicate that the example part is crucial for enhancing the model's alignment capabilities, with changes in examples significantly affecting alignment performance. We also conduct a comprehensive evaluation of ICA's zero-shot capabilities in various alignment tasks. The results indicate that compared to parameter fine-tuning methods, ICA demonstrates superior performance in knowledge-based tasks and tool-use tasks. However, it still exhibits certain limitations in areas such as multi-turn dialogues and instruction following.
comment: 22 pages, 6 figures, work in progress
☆ Promises, Outlooks and Challenges of Diffusion Language Modeling
The modern autoregressive Large Language Models (LLMs) have achieved outstanding performance on NLP benchmarks, and they are deployed in the real world. However, they still suffer from limitations of the autoregressive training paradigm. For example, autoregressive token generation is notably slow and can be prone to \textit{exposure bias}. The diffusion-based language models were proposed as an alternative to autoregressive generation to address some of these limitations. We evaluate the recently proposed Score Entropy Discrete Diffusion (SEDD) approach and show it is a promising alternative to autoregressive generation but it has some short-comings too. We empirically demonstrate the advantages and challenges of SEDD, and observe that SEDD generally matches autoregressive models in perplexity and on benchmarks such as HellaSwag, Arc or WinoGrande. Additionally, we show that in terms of inference latency, SEDD can be up to 4.5$\times$ more efficient than GPT-2. While SEDD allows conditioning on tokens at abitrary positions, SEDD appears slightly weaker than GPT-2 for conditional generation given short prompts. Finally, we reproduced the main results from the original SEDD paper.
☆ Automating Easy Read Text Segmentation
Easy Read text is one of the main forms of access to information for people with reading difficulties. One of the key characteristics of this type of text is the requirement to split sentences into smaller grammatical segments, to facilitate reading. Automated segmentation methods could foster the creation of Easy Read content, but their viability has yet to be addressed. In this work, we study novel methods for the task, leveraging masked and generative language models, along with constituent parsing. We conduct comprehensive automatic and human evaluations in three languages, analysing the strengths and weaknesses of the proposed alternatives, under scarce resource limitations. Our results highlight the viability of automated ER segmentation and remaining deficiencies compared to expert-driven human segmentation.
☆ TRACE the Evidence: Constructing Knowledge-Grounded Reasoning Chains for Retrieval-Augmented Generation
Retrieval-augmented generation (RAG) offers an effective approach for addressing question answering (QA) tasks. However, the imperfections of the retrievers in RAG models often result in the retrieval of irrelevant information, which could introduce noises and degrade the performance, especially when handling multi-hop questions that require multiple steps of reasoning. To enhance the multi-hop reasoning ability of RAG models, we propose TRACE. TRACE constructs knowledge-grounded reasoning chains, which are a series of logically connected knowledge triples, to identify and integrate supporting evidence from the retrieved documents for answering questions. Specifically, TRACE employs a KG Generator to create a knowledge graph (KG) from the retrieved documents, and then uses an Autoregressive Reasoning Chain Constructor to build reasoning chains. Experimental results on three multi-hop QA datasets show that TRACE achieves an average performance improvement of up to 14.03% compared to using all the retrieved documents. Moreover, the results indicate that using reasoning chains as context, rather than the entire documents, is often sufficient to correctly answer questions.
☆ Adaptive Reinforcement Learning Planning: Harnessing Large Language Models for Complex Information Extraction
Existing research on large language models (LLMs) shows that they can solve information extraction tasks through multi-step planning. However, their extraction behavior on complex sentences and tasks is unstable, emerging issues such as false positives and missing elements. We observe that decomposing complex extraction tasks and extracting them step by step can effectively improve LLMs' performance, and the extraction orders of entities significantly affect the final results of LLMs. This paper proposes a two-stage multi-step method for LLM-based information extraction and adopts the RL framework to execute the multi-step planning. We regard sequential extraction as a Markov decision process, build an LLM-based extraction environment, design a decision module to adaptively provide the optimal order for sequential entity extraction on different sentences, and utilize the DDQN algorithm to train the decision model. We also design the rewards and evaluation metrics suitable for the extraction results of LLMs. We conduct extensive experiments on multiple public datasets to demonstrate the effectiveness of our method in improving the information extraction capabilities of LLMs.
☆ Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization
Superalignment, where humans are weak supervisors of superhuman models, has become an important and widely discussed issue in the current era of rapid development of Large Language Models (LLMs). The recent work preliminarily studies this problem by using weak models to supervise strong models. It discovers that weakly supervised strong students can consistently outperform weak teachers towards the alignment target, leading to a weak-to-strong generalization phenomenon. However, we are concerned that behind such a promising phenomenon, whether there exists an issue of weak-to-strong deception, where strong models may deceive weak models by exhibiting well-aligned in areas known to weak models but producing misaligned behaviors in cases weak models do not know. We then take an initial step towards exploring this security issue in a specific but realistic multi-objective alignment case, where there may be some alignment targets conflicting with each other (e.g., helpfulness v.s. harmlessness). Such a conflict is likely to cause strong models to deceive weak models in one alignment dimension to gain high reward in other alignment dimension. Our experiments on both the reward modeling task and the preference optimization scenario indicate: (1) the weak-to-strong deception exists; (2) the deception phenomenon may intensify as the capability gap between weak and strong models increases. We also discuss potential solutions and find bootstrapping with an intermediate model can mitigate the deception to some extent. Our work highlights the urgent need to pay more attention to the true reliability of superalignment.
comment: Code is available at https://github.com/keven980716/weak-to-strong-deception
☆ A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression
The deployment of large language models (LLMs) is often hindered by the extensive memory requirements of the Key-Value (KV) cache, especially as context lengths increase. Existing approaches to reduce the KV cache size involve either fine-tuning the model to learn a compression strategy or leveraging attention scores to reduce the sequence length. We analyse the attention distributions in decoder-only Transformers-based models and observe that attention allocation patterns stay consistent across most layers. Surprisingly, we find a clear correlation between the $L_2$ and the attention scores over cached KV pairs, where a low $L_2$ of a key embedding usually leads to a high attention score during decoding. This finding indicates that the influence of a KV pair is potentially determined by the key embedding itself before being queried. Based on this observation, we compress the KV cache based on the $L_2$ of key embeddings. Our experimental results show that this simple strategy can reduce the KV cache size by 50% on language modelling and needle-in-a-haystack tasks and 90% on passkey retrieval tasks without losing accuracy.
☆ Fusion Makes Perfection: An Efficient Multi-Grained Matching Approach for Zero-Shot Relation Extraction NAACL2024
Predicting unseen relations that cannot be observed during the training phase is a challenging task in relation extraction. Previous works have made progress by matching the semantics between input instances and label descriptions. However, fine-grained matching often requires laborious manual annotation, and rich interactions between instances and label descriptions come with significant computational overhead. In this work, we propose an efficient multi-grained matching approach that uses virtual entity matching to reduce manual annotation cost, and fuses coarse-grained recall and fine-grained classification for rich interactions with guaranteed inference speed. Experimental results show that our approach outperforms the previous State Of The Art (SOTA) methods, and achieves a balance between inference efficiency and prediction accuracy in zero-shot relation extraction tasks. Our code is available at https://github.com/longls777/EMMA.
comment: Accepted to the main conference of NAACL2024
☆ DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer
Large-scale diffusion models have shown outstanding generative abilities across multiple modalities including images, videos, and audio. However, text-to-speech (TTS) systems typically involve domain-specific modeling factors (e.g., phonemes and phoneme-level durations) to ensure precise temporal alignments between text and speech, which hinders the efficiency and scalability of diffusion models for TTS. In this work, we present an efficient and scalable Diffusion Transformer (DiT) that utilizes off-the-shelf pre-trained text and speech encoders. Our approach addresses the challenge of text-speech alignment via cross-attention mechanisms with the prediction of the total length of speech representations. To achieve this, we enhance the DiT architecture to suit TTS and improve the alignment by incorporating semantic guidance into the latent space of speech. We scale the training dataset and the model size to 82K hours and 790M parameters, respectively. Our extensive experiments demonstrate that the large-scale diffusion model for TTS without domain-specific modeling not only simplifies the training pipeline but also yields superior or comparable zero-shot performance to state-of-the-art TTS models in terms of naturalness, intelligibility, and speaker similarity. Our speech samples are available at https://ditto-tts.github.io.
☆ Evaluating the Efficacy of Open-Source LLMs in Enterprise-Specific RAG Systems: A Comparative Study of Performance and Scalability
This paper presents an analysis of open-source large language models (LLMs) and their application in Retrieval-Augmented Generation (RAG) tasks, specific for enterprise-specific data sets scraped from their websites. With the increasing reliance on LLMs in natural language processing, it is crucial to evaluate their performance, accessibility, and integration within specific organizational contexts. This study examines various open-source LLMs, explores their integration into RAG frameworks using enterprise-specific data, and assesses the performance of different open-source embeddings in enhancing the retrieval and generation process. Our findings indicate that open-source LLMs, combined with effective embedding techniques, can significantly improve the accuracy and efficiency of RAG systems, offering a viable alternative to proprietary solutions for enterprises.
☆ Dredge Word, Social Media, and Webgraph Networks for Unreliable Website Classification and Identification
In an attempt to mimic the complex paths through which unreliable content spreads between search engines and social media, we explore the impact of incorporating both webgraph and large-scale social media contexts into website credibility classification and discovery systems. We further explore the usage of what we define as \textit{dredge words} on social media -- terms or phrases for which unreliable domains rank highly. Through comprehensive graph neural network ablations, we demonstrate that curriculum-based heterogeneous graph models that leverage context from both webgraphs and social media data outperform homogeneous and single-mode approaches. We further demonstrate that the incorporation of dredge words into our model strongly associates unreliable websites with social media and online commerce platforms. Finally, we show our heterogeneous model greatly outperforms competing systems in the top-k identification of unlabeled unreliable websites. We demonstrate the strong unreliability signals present in the diverse paths that users follow to uncover unreliable content, and we release a novel dataset of dredge words.
☆ BAMBINO-LM: (Bilingual-)Human-Inspired Continual Pretraining of BabyLM
Children from bilingual backgrounds benefit from interactions with parents and teachers to re-acquire their heritage language. In this paper, we investigate how this insight from behavioral study can be incorporated into the learning of small-scale language models. We introduce BAMBINO-LM, a continual pretraining strategy for BabyLM that uses a novel combination of alternation and PPO-based perplexity reward induced from a parent Italian model. Upon evaluation on zero-shot classification tasks for English and Italian, BAMBINO-LM improves the Italian language capability of a BabyLM baseline. Our ablation analysis demonstrates that employing both the alternation strategy and PPO-based modeling is key to this effectiveness gain. We also show that, as a side effect, the proposed method leads to similar degradation in L1 effectiveness as human children would have had in an equivalent learning scenario.
comment: Short paper; Under review
☆ HARE: HumAn pRiors, a key to small language model Efficiency
Human priors play a crucial role in efficiently utilizing data in deep learning. However, with the development of large language models (LLMs), there is an increasing emphasis on scaling both model size and data volume, which often diminishes the importance of human priors in data construction. Influenced by these trends, existing Small Language Models (SLMs) mainly rely on web-scraped large-scale training data, neglecting the proper incorporation of human priors. This oversight limits the training efficiency of language models in resource-constrained settings. In this paper, we propose a principle to leverage human priors for data construction. This principle emphasizes achieving high-performance SLMs by training on a concise dataset that accommodates both semantic diversity and data quality consistency, while avoiding benchmark data leakage. Following this principle, we train an SLM named HARE-1.1B. Extensive experiments on large-scale benchmark datasets demonstrate that HARE-1.1B performs favorably against state-of-the-art SLMs, validating the effectiveness of the proposed principle. Additionally, this provides new insights into efficient language model training in resource-constrained environments from the view of human priors.
☆ CodeGemma: Open Code Models Based on Gemma
This paper introduces CodeGemma, a collection of specialized open code models built on top of Gemma, capable of a variety of code and natural language generation tasks. We release three model variants. CodeGemma 7B pretrained (PT) and instruction-tuned (IT) variants have remarkably resilient natural language understanding, excel in mathematical reasoning, and match code capabilities of other open models. CodeGemma 2B is a state-of-the-art code completion model designed for fast code infilling and open-ended generation in latency-sensitive settings.
☆ Multimodal Structured Generation: CVPR's 2nd MMFM Challenge Technical Report
Multimodal Foundation Models (MMFMs) have shown remarkable performance on various computer vision and natural language processing tasks. However, their performance on particular tasks such as document understanding is still limited. They also require more compute, time, and engineering resources to finetune and deploy compared to traditional, unimodal models. In this report, we present Multimodal Structured Generation, a general framework which constrains the output logits of frozen MMFMs to force them to reason before responding with structured outputs that downstream APIs can parse and use. We provide a detailed account of our approach, including the technical details, theoretical discussions, and final evaluation results in the 2nd Multimodal Foundation Models Challenge hosted by the Computer Vision and Pattern Recognition (CVPR) conference. Our approach achieved the second highest score in the hidden test set for Phase 2 and third highest overall. This shows the method's ability to generalize to unseen tasks. And that simple engineering can beat expensive & complicated modelling steps as we first discussed in our paper, Retrieval Augmented Structured Generation: Business Document Information Extraction as Tool Use. All of our scripts, deployment steps, and evaluation results can be accessed in https://github.com/leloykun/MMFM-Challenge
comment: Conference on Computer Vision and Pattern Recognition's 2nd Multimodal Foundation Models Challenge
☆ Evaluating Open Language Models Across Task Types, Application Domains, and Reasoning Types: An In-Depth Experimental Analysis
The rapid rise of Language Models (LMs) has expanded their use in several applications. Yet, due to constraints of model size, associated cost, or proprietary restrictions, utilizing state-of-the-art (SOTA) LLMs is not always feasible. With open, smaller LMs emerging, more applications can leverage their capabilities, but selecting the right LM can be challenging. This work conducts an in-depth experimental analysis of the semantic correctness of outputs of 10 smaller, open LMs across three aspects: task types, application domains and reasoning types, using diverse prompt styles. We demonstrate that most effective models and prompt styles vary depending on the specific requirements. Our analysis provides a comparative assessment of LMs and prompt styles using a proposed three-tier schema of aspects for their strategic selection based on use-case and other constraints. We also show that if utilized appropriately, these LMs can compete with, and sometimes outperform, SOTA LLMs like DeepSeek-v2, GPT-3.5-Turbo, and GPT-4o.
☆ Large Language Models and Knowledge Graphs for Astronomical Entity Disambiguation
This paper presents an experiment conducted during a hackathon, focusing on using large language models (LLMs) and knowledge graph clustering to extract entities and relationships from astronomical text. The study demonstrates an approach to disambiguate entities that can appear in various contexts within the astronomical domain. By collecting excerpts around specific entities and leveraging the GPT-4 language model, relevant entities and relationships are extracted. The extracted information is then used to construct a knowledge graph, which is clustered using the Leiden algorithm. The resulting Leiden communities are utilized to identify the percentage of association of unknown excerpts to each community, thereby enabling disambiguation. The experiment showcases the potential of combining LLMs and knowledge graph clustering techniques for information extraction in astronomical research. The results highlight the effectiveness of the approach in identifying and disambiguating entities, as well as grouping them into meaningful clusters based on their relationships.
☆ MetaGPT: Merging Large Language Models Using Model Exclusive Task Arithmetic
The advent of large language models (LLMs) like GPT-4 has catalyzed the exploration of multi-task learning (MTL), in which a single model demonstrates proficiency across diverse tasks. Task arithmetic has emerged as a cost-effective approach for MTL. It enables performance enhancement across multiple tasks by adding their corresponding task vectors to a pre-trained model. However, the current lack of a method that can simultaneously achieve optimal performance, computational efficiency, and data privacy limits their application to LLMs. In this paper, we propose \textbf{M}odel \textbf{E}xclusive \textbf{T}ask \textbf{A}rithmetic for merging \textbf{GPT}-scale models, which formalizes the objective of model merging into a multi-task learning framework, aiming to minimize the average loss difference between the merged model and each individual task model. Since data privacy limits the use of multi-task training data, we leverage LLMs' local linearity and task vectors' orthogonality to separate the data term and scaling coefficients term and derive a model-exclusive task arithmetic method. Our proposed MetaGPT is data-agnostic and bypasses the heavy search process, making it cost-effective and easy to implement for LLMs.Extensive experiments demonstrate that MetaGPT leads to improvements in task arithmetic and achieves state-of-the-art performance on multiple tasks.
comment: 17 pages
☆ A Realistic Evaluation of LLMs for Quotation Attribution in Literary Texts: A Case Study of LLaMa3
Large Language Models (LLMs) zero-shot and few-shot performance are subject to memorization and data contamination, complicating the assessment of their validity. In literary tasks, the performance of LLMs is often correlated to the degree of book memorization. In this work, we carry out a realistic evaluation of LLMs for quotation attribution in novels, taking the instruction fined-tuned version of Llama3 as an example. We design a task-specific memorization measure and use it to show that Llama3's ability to perform quotation attribution is positively correlated to the novel degree of memorization. However, Llama3 still performs impressively well on books it has not memorized nor seen. Data and code will be made publicly available.
comment: Paper under review
☆ Boosting Scientific Concepts Understanding: Can Analogy from Teacher Models Empower Student Models?
Analogical reasoning plays a critical role in human cognition, enabling us to understand new concepts by associating them with familiar ones. Previous research in the AI community has mainly focused on identifying and generating analogies and then examining their quality under human evaluation, which overlooks the practical application of these analogies in real-world settings. Inspired by the human education process, in this paper, we propose to investigate how analogies created by teacher language models (LMs) can assist student LMs in understanding scientific concepts, thereby aligning more closely with practical scenarios. Our results suggest that free-form analogies can indeed aid LMs in understanding concepts. Additionally, analogies generated by student LMs can improve their own performance on scientific question answering, demonstrating their capability to use analogies for self-learning new knowledge. Resources are available at https://github.com/siyuyuan/SCUA.
☆ Fairer Preferences Elicit Improved Human-Aligned Large Language Model Judgments
Large language models (LLMs) have shown promising abilities as cost-effective and reference-free evaluators for assessing language generation quality. In particular, pairwise LLM evaluators, which compare two generated texts and determine the preferred one, have been employed in a wide range of applications. However, LLMs exhibit preference biases and worrying sensitivity to prompt designs. In this work, we first reveal that the predictive preference of LLMs can be highly brittle and skewed, even with semantically equivalent instructions. We find that fairer predictive preferences from LLMs consistently lead to judgments that are better aligned with humans. Motivated by this phenomenon, we propose an automatic Zero-shot Evaluation-oriented Prompt Optimization framework, ZEPO, which aims to produce fairer preference decisions and improve the alignment of LLM evaluators with human judgments. To this end, we propose a zero-shot learning objective based on the preference decision fairness. ZEPO demonstrates substantial performance improvements over state-of-the-art LLM evaluators, without requiring labeled data, on representative meta-evaluation benchmarks. Our findings underscore the critical correlation between preference fairness and human alignment, positioning ZEPO as an efficient prompt optimizer for bridging the gap between LLM evaluators and human judgments.
comment: 5 pages, 3 figures, 1 table (12 pages, 4 figures, 6 tables including references and appendices)
☆ Improving Quotation Attribution with Fictional Character Embeddings
Humans naturally attribute utterances of direct speech to their speaker in literary works. When attributing quotes, we process contextual information but also access mental representations of characters that we build and revise throughout the narrative. Recent methods to automatically attribute such utterances have explored simulating human logic with deterministic rules or learning new implicit rules with neural networks when processing contextual information. However, these systems inherently lack \textit{character} representations, which often leads to errors on more challenging examples of attribution: anaphoric and implicit quotes. In this work, we propose to augment a popular quotation attribution system, BookNLP, with character embeddings that encode global information of characters. To build these embeddings, we create DramaCV, a corpus of English drama plays from the 15th to 20th century focused on Character Verification (CV), a task similar to Authorship Verification (AV), that aims at analyzing fictional characters. We train a model similar to the recently proposed AV model, Universal Authorship Representation (UAR), on this dataset, showing that it outperforms concurrent methods of characters embeddings on the CV task and generalizes better to literary novels. Then, through an extensive evaluation on 22 novels, we show that combining BookNLP's contextual information with our proposed global character embeddings improves the identification of speakers for anaphoric and implicit quotes, reaching state-of-the-art performance. Code and data will be made publicly available.
comment: Paper under review
☆ $\textit{Refiner}$: Restructure Retrieval Content Efficiently to Advance Question-Answering Capabilities
Large Language Models (LLMs) are limited by their parametric knowledge, leading to hallucinations in knowledge-extensive tasks. To address this, Retrieval-Augmented Generation (RAG) incorporates external document chunks to expand LLM knowledge. Furthermore, compressing information from document chunks through extraction or summarization can improve LLM performance. Nonetheless, LLMs still struggle to notice and utilize scattered key information, a problem known as the "lost-in-the-middle" syndrome. Therefore, we typically need to restructure the content for LLM to recognize the key information. We propose $\textit{Refiner}$, an end-to-end extract-and-restructure paradigm that operates in the post-retrieval process of RAG. $\textit{Refiner}$ leverages a single decoder-only LLM to adaptively extract query-relevant contents verbatim along with the necessary context, and section them based on their interconnectedness, thereby highlights information distinction, and aligns downstream LLMs with the original context effectively. Experiments show that a trained $\textit{Refiner}$ (with 7B parameters) exhibits significant gain to downstream LLM in improving answer accuracy, and outperforms other state-of-the-art advanced RAG and concurrent compressing approaches in various single-hop and multi-hop QA tasks. Notably, $\textit{Refiner}$ achieves a 80.5% tokens reduction and a 1.6-7.0% improvement margin in multi-hop tasks compared to the next best solution. $\textit{Refiner}$ is a plug-and-play solution that can be seamlessly integrated with RAG systems, facilitating its application across diverse open-source frameworks.
comment: 8 pages
☆ Preserving Knowledge in Large Language Model: A Model-Agnostic Self-Decompression Approach
Humans can retain old knowledge while learning new information, but Large Language Models (LLMs) often suffer from catastrophic forgetting when post-pretrained or supervised fine-tuned (SFT) on domain-specific data. Moreover, for Multimodal Large Language Models (MLLMs) which are composed of the LLM base and visual projector (e.g. LLaVA), a significant decline in performance on language benchmarks was observed compared to their single-modality counterparts. To address these challenges, we introduce a novel model-agnostic self-decompression method, Tree Generation (TG), that decompresses knowledge within LLMs into the training corpus. This paper focuses on TG-SFT, which can synthetically generate SFT data for the instruction tuning steps. By incorporating the dumped corpus during SFT for MLLMs, we significantly reduce the forgetting problem.
☆ $\texttt{MoE-RBench}$: Towards Building Reliable Language Models with Sparse Mixture-of-Experts ICML2024
Mixture-of-Experts (MoE) has gained increasing popularity as a promising framework for scaling up large language models (LLMs). However, the reliability assessment of MoE lags behind its surging applications. Moreover, when transferred to new domains such as in fine-tuning MoE models sometimes underperform their dense counterparts. Motivated by the research gap and counter-intuitive phenomenon, we propose $\texttt{MoE-RBench}$, the first comprehensive assessment of SMoE reliability from three aspects: $\textit{(i)}$ safety and hallucination, $\textit{(ii)}$ resilience to adversarial attacks, and $\textit{(iii)}$ out-of-distribution robustness. Extensive models and datasets are tested to compare the MoE to dense networks from these reliability dimensions. Our empirical observations suggest that with appropriate hyperparameters, training recipes, and inference techniques, we can build the MoE model more reliably than the dense LLM. In particular, we find that the robustness of SMoE is sensitive to the basic training settings. We hope that this study can provide deeper insights into how to adapt the pre-trained MoE model to other tasks with higher-generation security, quality, and stability. Codes are available at https://github.com/UNITES-Lab/MoE-RBench
comment: 9 pages, 8 figures, camera ready on ICML2024
☆ Full-ECE: A Metric For Token-level Calibration on Large Language Models
Deep Neural Networks (DNNs) excel in various domains but face challenges in providing accurate uncertainty estimates, which are crucial for high-stakes applications. Large Language Models (LLMs) have recently emerged as powerful tools, demonstrating exceptional performance in language tasks. However, traditional calibration metrics such as Expected Calibration Error (ECE) and classwise-ECE (cw-ECE) are inadequate for LLMs due to their vast vocabularies, data complexity, and distributional focus. To address this, we propose a novel calibration concept called full calibration and introduce its corresponding metric, Full-ECE. Full-ECE evaluates the entire predicted probability distribution, offering a more accurate and robust measure of calibration for LLMs.
☆ A Systematic Analysis of Large Language Models as Soft Reasoners: The Case of Syllogistic Inferences
The reasoning abilities of Large Language Models (LLMs) are becoming a central focus of study in NLP. In this paper, we consider the case of syllogistic reasoning, an area of deductive reasoning studied extensively in logic and cognitive psychology. Previous research has shown that pre-trained LLMs exhibit reasoning biases, such as $\textit{content effects}$, avoid answering that $\textit{no conclusion follows}$, display human-like difficulties, and struggle with multi-step reasoning. We contribute to this research line by systematically investigating the effects of chain-of-thought reasoning, in-context learning (ICL), and supervised fine-tuning (SFT) on syllogistic reasoning, considering syllogisms with conclusions that support or violate world knowledge, as well as ones with multiple premises. Crucially, we go beyond the standard focus on accuracy, with an in-depth analysis of the conclusions generated by the models. Our results suggest that the behavior of pre-trained LLMs can be explained by heuristics studied in cognitive science and that both ICL and SFT improve model performance on valid inferences, although only the latter mitigates most reasoning biases without harming model consistency.
☆ Fine-grained Controllable Text Generation through In-context Learning with Feedback
We present a method for rewriting an input sentence to match specific values of nontrivial linguistic features, such as dependency depth. In contrast to earlier work, our method uses in-context learning rather than finetuning, making it applicable in use cases where data is sparse. We show that our model performs accurate rewrites and matches the state of the art on rewriting sentences to a specified school grade level.
☆ Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams
Recent advancements in Large Language Models (LLMs) have demonstrated their potential in delivering accurate answers to questions about world knowledge. Despite this, existing benchmarks for evaluating LLMs in healthcare predominantly focus on medical doctors, leaving other critical healthcare professions underrepresented. To fill this research gap, we introduce the Examinations for Medical Personnel in Chinese (EMPEC), a pioneering large-scale healthcare knowledge benchmark in traditional Chinese. EMPEC consists of 157,803 exam questions across 124 subjects and 20 healthcare professions, including underrepresented occupations like Optometrists and Audiologists. Each question is tagged with its release time and source, ensuring relevance and authenticity. We conducted extensive experiments on 17 LLMs, including proprietary, open-source models, general domain models and medical specific models, evaluating their performance under various settings. Our findings reveal that while leading models like GPT-4 achieve over 75\% accuracy, they still struggle with specialized fields and alternative medicine. Surprisingly, general-purpose LLMs outperformed medical-specific models, and incorporating EMPEC's training data significantly enhanced performance. Additionally, the results on questions released after the models' training cutoff date were consistent with overall performance trends, suggesting that the models' performance on the test set can predict their effectiveness in addressing unseen healthcare-related queries. The transition from traditional to simplified Chinese characters had a negligible impact on model performance, indicating robust linguistic versatility. Our study underscores the importance of expanding benchmarks to cover a broader range of healthcare professions to better assess the applicability of LLMs in real-world healthcare scenarios.
comment: 15 pages, 4 figures
☆ GUICourse: From General Vision Language Models to Versatile GUI Agents
Utilizing Graphic User Interface (GUI) for human-computer interaction is essential for accessing a wide range of digital tools. Recent advancements in Vision Language Models (VLMs) highlight the compelling potential to develop versatile agents to help humans finish GUI navigation tasks. However, current VLMs are challenged in terms of fundamental abilities (OCR and grounding) and GUI knowledge (the functions and control methods of GUI elements), preventing them from becoming practical GUI agents. To solve these challenges, we contribute GUICourse, a suite of datasets to train visual-based GUI agents from general VLMs. First, we introduce the GUIEnv dataset to strengthen the OCR and grounding capabilities of VLMs. Then, we introduce the GUIAct and GUIChat datasets to enrich their knowledge of GUI components and interactions. Experiments demonstrate that our GUI agents have better performance on common GUI tasks than their baseline VLMs. Even the small-size GUI agent (with 3.1B parameters) can still work well on single-step and multi-step GUI tasks. Finally, we analyze the different varieties in the training stage of this agent by ablation study. Our source codes and datasets are released at https://github.com/yiye3/GUICourse.
☆ An Empirical Investigation of Matrix Factorization Methods for Pre-trained Transformers
The increasing size of transformer-based models in NLP makes the question of compressing them important. In this work, we present a comprehensive analysis of factorization based model compression techniques. Specifically, we focus on comparing straightforward low-rank factorization against the recently introduced Monarch factorization, which exhibits impressive performance preservation on the GLUE benchmark. To mitigate stability issues associated with low-rank factorization of the matrices in pre-trained transformers, we introduce a staged factorization approach wherein layers are factorized one by one instead of being factorized simultaneously. Through this strategy we significantly enhance the stability and reliability of the compression process. Further, we introduce a simple block-wise low-rank factorization method, which has a close relationship to Monarch factorization. Our experiments lead to the surprising conclusion that straightforward low-rank factorization consistently outperforms Monarch factorization across both different compression ratios and six different text classification tasks.
☆ VideoVista: A Versatile Benchmark for Video Understanding and Reasoning
Despite significant breakthroughs in video analysis driven by the rapid development of large multimodal models (LMMs), there remains a lack of a versatile evaluation benchmark to comprehensively assess these models' performance in video understanding and reasoning. To address this, we present VideoVista, a video QA benchmark that integrates challenges across diverse content categories, durations, and abilities. Specifically, VideoVista comprises 25,000 questions derived from 3,400 videos spanning 14 categories (e.g., Howto, Film, and Entertainment) with durations ranging from a few seconds to over 10 minutes. Besides, it encompasses 19 types of understanding tasks (e.g., anomaly detection, interaction understanding) and 8 reasoning tasks (e.g., logical reasoning, causal reasoning). To achieve this, we present an automatic data construction framework, leveraging powerful GPT-4o alongside advanced analysis tools (e.g., video splitting, object segmenting, and tracking). We also utilize this framework to construct training data to enhance the capabilities of video-related LMMs (Video-LMMs). Through a comprehensive and quantitative evaluation of cutting-edge models, we reveal that: 1) Video-LMMs face difficulties in fine-grained video tasks involving temporal location, object tracking, and anomaly detection; 2) Video-LMMs present inferior logical and relation reasoning abilities; 3) Open-source Video-LMMs' performance is significantly lower than GPT-4o and Gemini-1.5, lagging by 20 points. This highlights the crucial role VideoVista will play in advancing LMMs that can accurately understand videos and perform precise reasoning.
comment: 38 pages, 44 figures
☆ Optimizing and Testing Instruction-Following: Analyzing the Impact of Fine-Grained Instruction Variants on instruction-tuned LLMs
The effective alignment of Large Language Models (LLMs) with precise instructions is essential for their application in diverse real-world scenarios. Current methods focus on enhancing the diversity and complexity of training and evaluation samples, yet they fall short in accurately assessing LLMs' ability to follow similar instruction variants. We introduce an effective data augmentation technique that decomposes complex instructions into simpler sub-components, modifies these, and reconstructs them into new variants, thereby preserves the original instruction's context and complexity while introducing variability, which is critical for training and evaluating LLMs' instruction-following precision. We developed the DeMoRecon dataset using this method to both fine-tune and evaluate LLMs. Our findings show that LLMs fine-tuned with DeMoRecon will gain significant performance boost on both ours and commonly used instructions-following benchmarks.
☆ Iterative Utility Judgment Framework via LLMs Inspired by Relevance in Philosophy
Utility and topical relevance are critical measures in information retrieval (IR), reflecting system and user perspectives, respectively. While topical relevance has long been emphasized, utility is a higher standard of relevance and is more useful for facilitating downstream tasks, e.g., in Retrieval-Augmented Generation (RAG). When we incorporate utility judgments into RAG, we realize that the topical relevance, utility, and answering in RAG are closely related to the three types of relevance that Schutz discussed from a philosophical perspective. They are topical relevance, interpretational relevance, and motivational relevance, respectively. Inspired by the dynamic iterations of the three types of relevance, we propose an Iterative utiliTy judgmEnt fraMework (ITEM) to promote each step of the cycle of RAG. We conducted extensive experiments on multi-grade passage retrieval and factoid question-answering datasets (i.e., TREC DL, WebAP, and NQ). Experimental results demonstrate significant improvements in utility judgments, ranking of topical relevance, and answer generation upon representative baselines, including multiple single-shot utility judging approaches. Our code and benchmark can be found at https://anonymous.4open.science/r/ITEM-B486/.
comment: 22 pages
☆ A Systematic Survey of Text Summarization: From Statistical Methods to Large Language Models
Text summarization research has undergone several significant transformations with the advent of deep neural networks, pre-trained language models (PLMs), and recent large language models (LLMs). This survey thus provides a comprehensive review of the research progress and evolution in text summarization through the lens of these paradigm shifts. It is organized into two main parts: (1) a detailed overview of datasets, evaluation metrics, and summarization methods before the LLM era, encompassing traditional statistical methods, deep learning approaches, and PLM fine-tuning techniques, and (2) the first detailed examination of recent advancements in benchmarking, modeling, and evaluating summarization in the LLM era. By synthesizing existing literature and presenting a cohesive overview, this survey also discusses research trends, open challenges, and proposes promising research directions in summarization, aiming to guide researchers through the evolving landscape of summarization research.
comment: 30 pages, 8 figures, 6 tables
☆ MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models
Large vision-language models (LVLMs) have significantly improved multimodal reasoning tasks, such as visual question answering and image captioning. These models embed multimodal facts within their parameters, rather than relying on external knowledge bases to store factual information explicitly. However, the content discerned by LVLMs may deviate from actual facts due to inherent bias or incorrect inference. To address this issue, we introduce MFC-Bench, a rigorous and comprehensive benchmark designed to evaluate the factual accuracy of LVLMs across three tasks: Manipulation, Out-of-Context, and Veracity Classification. Through our evaluation on MFC-Bench, we benchmarked 12 diverse and representative LVLMs, uncovering that current models still fall short in multimodal fact-checking and demonstrate insensitivity to various forms of manipulated content. We hope that MFC-Bench could raise attention to the trustworthy artificial intelligence potentially assisted by LVLMs in the future. The MFC-Bench and accompanying resources are publicly accessible at https://github.com/wskbest/MFC-Bench, contributing to ongoing research in the multimodal fact-checking field.
comment: 22 pages, 8 figures
☆ Self and Cross-Model Distillation for LLMs: Effective Methods for Refusal Pattern Alignment
Large Language Models (LLMs) like OpenAI's GPT series, Anthropic's Claude, and Meta's LLaMa have shown remarkable capabilities in text generation. However, their susceptibility to toxic prompts presents significant security challenges. This paper investigates alignment techniques, including Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), to mitigate these risks. We conduct an empirical study on refusal patterns across nine LLMs, revealing that models with uniform refusal patterns, such as Claude3, exhibit higher security. Based on these findings, we propose self-distilling and cross-model distilling methods to enhance LLM security. Our results show that these methods significantly improve refusal rates and reduce unsafe content, with cross-model distilling achieving refusal rates close to Claude3's 94.51%. These findings underscore the potential of distillation-based alignment in securing LLMs against toxic prompts.
♻ ☆ RE-GAINS & EnCHANT: Intelligent Tool Manipulation Systems For Enhanced Query Responses
Despite the remarkable success of LLMs, they still suffer from tool invocation and tool chaining due to inadequate input queries and/or tool argument descriptions. We propose two novel frameworks, RE-GAINS and EnCHANT, enabling LLMs to tackle tool manipulation for solving complex user queries by making API calls. EnCHANT is an open-source solution that makes use of an LLM format enforcer, an LLM(OpenChat 3.5) and a retriever(ToolBench's API Retriever). RE-GAINS is based on OpenAI models and embeddings using a special prompt based on the RAP paper. Both solutions cost less than $0.01 per query with minimal latency, therefore showcasing the usefulness of the frameworks.
♻ ☆ ICLEF: In-Context Learning with Expert Feedback for Explainable Style Transfer ACL 2024
While state-of-the-art large language models (LLMs) can excel at adapting text from one style to another, current work does not address the explainability of style transfer models. Recent work has explored generating textual explanations from larger teacher models and distilling them into smaller student models. One challenge with such approach is that LLM outputs may contain errors that require expertise to correct, but gathering and incorporating expert feedback is difficult due to cost and availability. To address this challenge, we propose ICLEF, a novel human-AI collaboration approach to model distillation that incorporates scarce expert human feedback by combining in-context learning and model self-critique. We show that our method leads to generation of high-quality synthetic explainable style transfer datasets for formality (e-GYAFC) and subjective bias (e-WNC). Via automatic and human evaluation, we show that specialized student models fine-tuned on our datasets outperform generalist teacher models on the explainable style transfer task in one-shot settings, and perform competitively compared to few-shot teacher models, highlighting the quality of the data and the role of expert feedback. In an extrinsic task of authorship attribution, we show that explanations generated by smaller models fine-tuned on e-GYAFC are more predictive of authorship than explanations generated by few-shot teacher models.
comment: ACL 2024 Main, Camera Ready
♻ ☆ Ovis: Structural Embedding Alignment for Multimodal Large Language Model
Current Multimodal Large Language Models (MLLMs) typically integrate a pre-trained LLM with another pre-trained vision transformer through a connector, such as an MLP, endowing the LLM with visual capabilities. However, the misalignment between two embedding strategies in MLLMs -- the structural textual embeddings based on an embedding look-up table and the continuous embeddings generated directly by the vision encoder -- makes challenges for a more seamless fusion of visual and textual information. We propose Ovis, a novel MLLM architecture designed to structurally align visual and textual embeddings. Ovis integrates an additional learnable visual embedding table into the visual encoder's process. To capture rich visual semantics, each image patch indexes the visual embedding table multiple times, resulting in a final visual embedding that is a probabilistic combination of the indexed embeddings. This structural approach mirrors the method used for generating textual embeddings. Empirical evaluations on various multimodal benchmarks show that Ovis outperforms open-source MLLMs of similar parameter scales and even surpasses the proprietary model Qwen-VL-Plus overall. These results highlight the potential of Ovis' structured visual representation for advancing MLLM architectural design and promoting more effective multimodal learning. Code, datasets, and models are available at https://github.com/AIDC-AI/Ovis.
♻ ☆ InSaAF: Incorporating Safety through Accuracy and Fairness | Are LLMs ready for the Indian Legal Domain?
Recent advancements in language technology and Artificial Intelligence have resulted in numerous Language Models being proposed to perform various tasks in the legal domain ranging from predicting judgments to generating summaries. Despite their immense potential, these models have been proven to learn and exhibit societal biases and make unfair predictions. In this study, we explore the ability of Large Language Models (LLMs) to perform legal tasks in the Indian landscape when social factors are involved. We present a novel metric, $\beta$-weighted $\textit{Legal Safety Score ($LSS_{\beta}$)}$, which encapsulates both the fairness and accuracy aspects of the LLM. We assess LLMs' safety by considering its performance in the $\textit{Binary Statutory Reasoning}$ task and its fairness exhibition with respect to various axes of disparities in the Indian society. Task performance and fairness scores of LLaMA and LLaMA--2 models indicate that the proposed $LSS_{\beta}$ metric can effectively determine the readiness of a model for safe usage in the legal sector. We also propose finetuning pipelines, utilising specialised legal datasets, as a potential method to mitigate bias and improve model safety. The finetuning procedures on LLaMA and LLaMA--2 models increase the $LSS_{\beta}$, improving their usability in the Indian legal domain. Our code is publicly released.
♻ ☆ Towards Bidirectional Human-AI Alignment: A Systematic Review for Clarifications, Framework, and Future Directions
Recent advancements in general-purpose AI have highlighted the importance of guiding AI systems towards the intended goals, ethical principles, and values of individuals and groups, a concept broadly recognized as alignment. However, the lack of clarified definitions and scopes of human-AI alignment poses a significant obstacle, hampering collaborative efforts across research domains to achieve this alignment. In particular, ML- and philosophy-oriented alignment research often views AI alignment as a static, unidirectional process (i.e., aiming to ensure that AI systems' objectives match humans) rather than an ongoing, mutual alignment problem [429]. This perspective largely neglects the long-term interaction and dynamic changes of alignment. To understand these gaps, we introduce a systematic review of over 400 papers published between 2019 and January 2024, spanning multiple domains such as Human-Computer Interaction (HCI), Natural Language Processing (NLP), Machine Learning (ML), and others. We characterize, define and scope human-AI alignment. From this, we present a conceptual framework of "Bidirectional Human-AI Alignment" to organize the literature from a human-centered perspective. This framework encompasses both 1) conventional studies of aligning AI to humans that ensures AI produces the intended outcomes determined by humans, and 2) a proposed concept of aligning humans to AI, which aims to help individuals and society adjust to AI advancements both cognitively and behaviorally. Additionally, we articulate the key findings derived from literature analysis, including discussions about human values, interaction techniques, and evaluations. To pave the way for future studies, we envision three key challenges for future directions and propose examples of potential future solutions.
comment: 56 pages
♻ ☆ MLLM-Protector: Ensuring MLLM's Safety without Hurting Performance
The deployment of multimodal large language models (MLLMs) has brought forth a unique vulnerability: susceptibility to malicious attacks through visual inputs. This paper investigates the novel challenge of defending MLLMs against such attacks. Compared to large language models (LLMs), MLLMs include an additional image modality. We discover that images act as a ``foreign language" that is not considered during safety alignment, making MLLMs more prone to producing harmful responses. Unfortunately, unlike the discrete tokens considered in text-based LLMs, the continuous nature of image signals presents significant alignment challenges, which poses difficulty to thoroughly cover all possible scenarios. This vulnerability is exacerbated by the fact that most state-of-the-art MLLMs are fine-tuned on limited image-text pairs that are much fewer than the extensive text-based pretraining corpus, which makes the MLLMs more prone to catastrophic forgetting of their original abilities during safety fine-tuning. To tackle these challenges, we introduce MLLM-Protector, a plug-and-play strategy that solves two subtasks: 1) identifying harmful responses via a lightweight harm detector, and 2) transforming harmful responses into harmless ones via a detoxifier. This approach effectively mitigates the risks posed by malicious visual inputs without compromising the original performance of MLLMs. Our results demonstrate that MLLM-Protector offers a robust solution to a previously unaddressed aspect of MLLM security.
♻ ☆ TaxoLLaMA: WordNet-based Model for Solving Multiple Lexical Semantic Tasks ACL
In this paper, we explore the capabilities of LLMs in capturing lexical-semantic knowledge from WordNet on the example of the LLaMA-2-7b model and test it on multiple lexical semantic tasks. As the outcome of our experiments, we present TaxoLLaMA, the everything-in-one model, lightweight due to 4-bit quantization and LoRA. It achieves 11 SotA results, 4 top-2 results out of 16 tasks for the Taxonomy Enrichment, Hypernym Discovery, Taxonomy Construction, and Lexical Entailment tasks. Moreover, it demonstrates very strong zero-shot performance on Lexical Entailment and Taxonomy Construction with no fine-tuning. We also explore its hidden multilingual and domain adaptation capabilities with a little tuning or few-shot learning. All datasets, code, and model are available online at https://github.com/VityaVitalich/TaxoLLaMA
comment: ACL Main 2024, 18 pages, 8 figures
♻ ☆ VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
In this paper, we present the VideoLLaMA 2, a set of Video Large Language Models (Video-LLMs) designed to enhance spatial-temporal modeling and audio understanding in video and audio-oriented tasks. Building upon its predecessor, VideoLLaMA 2 incorporates a tailor-made Spatial-Temporal Convolution (STC) connector, which effectively captures the intricate spatial and temporal dynamics of video data. Additionally, we integrate an Audio Branch into the model through joint training, thereby enriching the multimodal understanding capabilities of the model by seamlessly incorporating audio cues. Comprehensive evaluations on multiple-choice video question answering (MC-VQA), open-ended video question answering (OE-VQA), and video captioning (VC) tasks demonstrate that VideoLLaMA 2 consistently achieves competitive results among open-source models and even gets close to some proprietary models on several benchmarks. Furthermore, VideoLLaMA 2 exhibits reasonable improvements in audio-only and audio-video question-answering (AQA & OE-AVQA) benchmarks over existing models. These advancements underline VideoLLaMA 2's superior performance in multimodal comprehension, setting a new standard for intelligent video analysis systems. All models are public to facilitate further research.
comment: ZC, SL, HZ, YX, and XL contributed equally to this project
♻ ☆ AbsInstruct: Eliciting Abstraction Ability from LLMs through Explanation Tuning with Plausibility Estimation ACL 2024
Abstraction ability is crucial in human intelligence, which can also benefit various tasks in NLP study. Existing work shows that LLMs are deficient in abstract ability, and how to improve it remains unexplored. In this work, we design the framework AbsInstruct to enhance LLMs' abstraction ability through instruction tuning. The framework builds instructions with in-depth explanations to assist LLMs in capturing the underlying rationale of abstraction. Meanwhile, we introduce a plausibility estimator to select instructions that are more consistent with the abstraction knowledge of LLMs to be aligned. Then, our framework combines abstraction instructions with general-purpose ones to build a hybrid dataset. Extensive experiments and analyses demonstrate that our framework can considerably enhance LLMs' abstraction ability with strong generalization performance while maintaining their general instruction-following abilities.
comment: Accepted by ACL 2024
♻ ☆ Generative Pretrained Structured Transformers: Unsupervised Syntactic Language Models at Scale ACL 2024
A syntactic language model (SLM) incrementally generates a sentence with its syntactic tree in a left-to-right manner. We present Generative Pretrained Structured Transformers (GPST), an unsupervised SLM at scale capable of being pre-trained from scratch on raw texts with high parallelism. GPST circumvents the limitations of previous SLMs such as relying on gold trees and sequential training. It consists of two components, a usual SLM supervised by a uni-directional language modeling loss, and an additional composition model, which induces syntactic parse trees and computes constituent representations, supervised by a bi-directional language modeling loss. We propose a representation surrogate to enable joint parallel training of the two models in a hard-EM fashion. We pre-train GPST on OpenWebText, a corpus with $9$ billion tokens, and demonstrate the superiority of GPST over GPT-2 with a comparable size in numerous tasks covering both language understanding and language generation. Meanwhile, GPST also significantly outperforms existing unsupervised SLMs on left-to-right grammar induction, while holding a substantial acceleration on training.
comment: accepted by ACL 2024
♻ ☆ Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophisticated and pernicious behaviors like reward-tampering, where a model directly modifies its own reward mechanism. However, these more pernicious behaviors may be too complex to be discovered via exploration. In this paper, we study whether Large Language Model (LLM) assistants which find easily discovered forms of specification gaming will generalize to perform rarer and more blatant forms, up to and including reward-tampering. We construct a curriculum of increasingly sophisticated gameable environments and find that training on early-curriculum environments leads to more specification gaming on remaining environments. Strikingly, a small but non-negligible proportion of the time, LLM assistants trained on the full curriculum generalize zero-shot to directly rewriting their own reward function. Retraining an LLM not to game early-curriculum environments mitigates, but does not eliminate, reward-tampering in later environments. Moreover, adding harmlessness training to our gameable environments does not prevent reward-tampering. These results demonstrate that LLMs can generalize from common forms of specification gaming to more pernicious reward tampering and that such behavior may be nontrivial to remove.
comment: Fix title typo, update main figure to render properly on non-chrome browsers
♻ ☆ On the Convergence of Zeroth-Order Federated Tuning for Large Language Models KDD'24
The confluence of Federated Learning (FL) and Large Language Models (LLMs) is ushering in a new era in privacy-preserving natural language processing. However, the intensive memory requirements for fine-tuning LLMs pose significant challenges, especially when deploying on clients with limited computational resources. To circumvent this, we explore the novel integration of Memory-efficient Zeroth-Order Optimization within a federated setting, a synergy we term as FedMeZO. Our study is the first to examine the theoretical underpinnings of FedMeZO in the context of LLMs, tackling key questions regarding the influence of large parameter spaces on optimization behavior, the establishment of convergence properties, and the identification of critical parameters for convergence to inform personalized federated strategies. Our extensive empirical evidence supports the theory, showing that FedMeZO not only converges faster than traditional first-order methods such as FedAvg but also significantly reduces GPU memory usage during training to levels comparable to those during inference. Moreover, the proposed personalized FL strategy that is built upon the theoretical insights to customize the client-wise learning rate can effectively accelerate loss reduction. We hope our work can help to bridge theoretical and practical aspects of federated fine-tuning for LLMs, thereby stimulating further advancements and research in this area.
comment: accepted by KDD'24 research track. 21 pages, 10 figures, 8 tables
♻ ☆ Evaluating Task-based Effectiveness of MLLMs on Charts
In this paper, we explore a forward-thinking question: Is GPT-4V effective at low-level data analysis tasks on charts? To this end, we first curate a large-scale dataset, named ChartInsights, consisting of 89,388 quartets (chart, task, question, answer) and covering 10 widely-used low-level data analysis tasks on 7 chart types. Firstly, we conduct systematic evaluations to understand the capabilities and limitations of 18 advanced MLLMs, which include 12 open-source models and 6 closed-source models. Starting with a standard textual prompt approach, the average accuracy rate across the 18 MLLMs is 36.17%. Among all the models, GPT-4V achieves the highest accuracy, reaching 56.13%. To understand the limitations of multimodal large models in low-level data analysis tasks, we have designed various experiments to conduct an in-depth test of capabilities of GPT-4V. We further investigate how visual modifications to charts, such as altering visual elements (e.g. changing color schemes) and introducing perturbations (e.g. adding image noise), affect performance of GPT-4V. Secondly, we present 12 experimental findings. These findings suggest potential of GPT-4V to revolutionize interaction with charts and uncover the gap between human analytic needs and capabilities of GPT-4V. Thirdly, we propose a novel textual prompt strategy, named Chain-of-Charts, tailored for low-level analysis tasks, which boosts model performance by 24.36%, resulting in an accuracy of 80.49%. Furthermore, by incorporating a visual prompt strategy that directs attention of GPT-4V to question-relevant visual elements, we further improve accuracy to 83.83%. Our study not only sheds light on the capabilities and limitations of GPT-4V in low-level data analysis tasks but also offers valuable insights for future research.
comment: The experimental part needs to be revised. Withdraw this version
♻ ☆ Reference-based Metrics Disprove Themselves in Question Generation
Reference-based metrics such as BLEU and BERTScore are widely used to evaluate question generation (QG). In this study, on QG benchmarks such as SQuAD and HotpotQA, we find that using human-written references cannot guarantee the effectiveness of the reference-based metrics. Most QG benchmarks have only one reference; we replicated the annotation process and collect another reference. A good metric was expected to grade a human-validated question no worse than generated questions. However, the results of reference-based metrics on our newly collected reference disproved the metrics themselves. We propose a reference-free metric consisted of multi-dimensional criteria such as naturalness, answerability, and complexity, utilizing large language models. These criteria are not constrained to the syntactic or semantic of a single reference question, and the metric does not require a diverse set of references. Experiments reveal that our metric accurately distinguishes between high-quality questions and flawed ones, and achieves state-of-the-art alignment with human judgment.
comment: Revised Jun 14 2024; Under Review
♻ ☆ Learning to Check: Unleashing Potentials for Self-Correction in Large Language Models
Self-correction has achieved impressive results in enhancing the style and security of the generated output from large language models (LLMs). However, recent studies suggest that self-correction might be limited or even counterproductive in reasoning tasks due to LLMs' difficulties in identifying logical mistakes. In this paper, we aim to enhance the self-checking capabilities of LLMs by constructing training data for checking tasks. Specifically, we apply the Chain of Thought (CoT) methodology to self-checking tasks, utilizing fine-grained step-level analyses and explanations to assess the correctness of reasoning paths. We propose a specialized checking format called "Step CoT Check". Following this format, we construct a checking-correction dataset that includes detailed step-by-step analysis and checking. Then we fine-tune LLMs to enhance their error detection and correction abilities. Our experiments demonstrate that fine-tuning with the "Step CoT Check" format significantly improves the self-checking and self-correction abilities of LLMs across multiple benchmarks. This approach outperforms other formats, especially in locating the incorrect position, with greater benefits observed in larger models. For reproducibility, all the datasets and code are provided in https://github.com/bammt/Learn-to-check.
♻ ☆ Multi-stage Large Language Model Correction for Speech Recognition
In this paper, we investigate the usage of large language models (LLMs) to improve the performance of competitive speech recognition systems. Different from previous LLM-based ASR error correction methods, we propose a novel multi-stage approach that utilizes uncertainty estimation of ASR outputs and reasoning capability of LLMs. Specifically, the proposed approach has two stages: the first stage is about ASR uncertainty estimation and exploits N-best list hypotheses to identify less reliable transcriptions; The second stage works on these identified transcriptions and performs LLM-based corrections. This correction task is formulated as a multi-step rule-based LLM reasoning process, which uses explicitly written rules in prompts to decompose the task into concrete reasoning steps. Our experimental results demonstrate the effectiveness of the proposed method by showing 10% ~ 20% relative improvement in WER over competitive ASR systems -- across multiple test domains and in zero-shot settings.
♻ ☆ DiaHalu: A Dialogue-level Hallucination Evaluation Benchmark for Large Language Models
Since large language models (LLMs) achieve significant success in recent years, the hallucination issue remains a challenge, numerous benchmarks are proposed to detect the hallucination. Nevertheless, some of these benchmarks are not naturally generated by LLMs but are intentionally induced. Also, many merely focus on the factuality hallucination while ignoring the faithfulness hallucination. Additionally, although dialogue pattern is more widely utilized in the era of LLMs, current benchmarks only concentrate on sentence-level and passage-level hallucination. In this study, we propose DiaHalu, the first dialogue-level hallucination evaluation benchmark to our knowledge. Initially, we integrate the collected topics into system prompts and facilitate a dialogue between two ChatGPT3.5. Subsequently, we manually modify the contents that do not adhere to human language conventions and then have LLMs re-generate, simulating authentic human-machine interaction scenarios. Finally, professional scholars annotate all the samples in the dataset. DiaHalu covers four common multi-turn dialogue domains and five hallucination subtypes, extended from factuality and faithfulness hallucination. Experiments through some well-known LLMs and detection methods on the dataset show that DiaHalu is a challenging benchmark, holding significant value for further research.
♻ ☆ Bileve: Securing Text Provenance in Large Language Models Against Spoofing with Bi-level Signature
Text watermarks for large language models (LLMs) have been commonly used to identify the origins of machine-generated content, which is promising for assessing liability when combating deepfake or harmful content. While existing watermarking techniques typically prioritize robustness against removal attacks, unfortunately, they are vulnerable to spoofing attacks: malicious actors can subtly alter the meanings of LLM-generated responses or even forge harmful content, potentially misattributing blame to the LLM developer. To overcome this, we introduce a bi-level signature scheme, Bileve, which embeds fine-grained signature bits for integrity checks (mitigating spoofing attacks) as well as a coarse-grained signal to trace text sources when the signature is invalid (enhancing detectability) via a novel rank-based sampling strategy. Compared to conventional watermark detectors that only output binary results, Bileve can differentiate 5 scenarios during detection, reliably tracing text provenance and regulating LLMs. The experiments conducted on OPT-1.3B and LLaMA-7B demonstrate the effectiveness of Bileve in defeating spoofing attacks with enhanced detectability.
♻ ☆ Do Language Models Exhibit the Same Cognitive Biases in Problem Solving as Human Learners? ICML 2024
There is increasing interest in employing large language models (LLMs) as cognitive models. For such purposes, it is central to understand which properties of human cognition are well-modeled by LLMs, and which are not. In this work, we study the biases of LLMs in relation to those known in children when solving arithmetic word problems. Surveying the learning science literature, we posit that the problem-solving process can be split into three distinct steps: text comprehension, solution planning and solution execution. We construct tests for each one in order to understand whether current LLMs display the same cognitive biases as children in these steps. We generate a novel set of word problems for each of these tests, using a neuro-symbolic approach that enables fine-grained control over the problem features. We find evidence that LLMs, with and without instruction-tuning, exhibit human-like biases in both the text-comprehension and the solution-planning steps of the solving process, but not in the final step, in which the arithmetic expressions are executed to obtain the answer.
comment: Accepted at ICML 2024
♻ ☆ Retaining Key Information under High Compression Ratios: Query-Guided Compressor for LLMs ACL 2024
The growing popularity of Large Language Models has sparked interest in context compression for Large Language Models (LLMs). However, the performance of previous methods degrades dramatically as compression ratios increase, sometimes even falling to the closed-book level. This decline can be attributed to the loss of key information during the compression process. Our preliminary study supports this hypothesis, emphasizing the significance of retaining key information to maintain model performance under high compression ratios. As a result, we introduce Query-Guided Compressor (QGC), which leverages queries to guide the context compression process, effectively preserving key information within the compressed context. Additionally, we employ a dynamic compression strategy. We validate the effectiveness of our proposed QGC on the Question Answering task, including NaturalQuestions, TriviaQA, and HotpotQA datasets. Experimental results show that QGC can consistently perform well even at high compression ratios, which also offers significant benefits in terms of inference cost and throughput.
comment: Accepted to ACL 2024
♻ ☆ A Thorough Examination of Decoding Methods in the Era of LLMs
Decoding methods play an indispensable role in converting language models from next-token predictors into practical task solvers. Prior research on decoding methods, primarily focusing on task-specific models, may not extend to the current era of general-purpose large language models (LLMs). Moreover, the recent influx of decoding strategies has further complicated this landscape. This paper provides a comprehensive and multifaceted analysis of various decoding methods within the context of LLMs, evaluating their performance, robustness to hyperparameter changes, and decoding speeds across a wide range of tasks, models, and deployment environments. Our findings reveal that decoding method performance is notably task-dependent and influenced by factors such as alignment, model size, and quantization. Intriguingly, sensitivity analysis exposes that certain methods achieve superior performance at the cost of extensive hyperparameter tuning, highlighting the trade-off between attaining optimal results and the practicality of implementation in varying contexts.
♻ ☆ MTEB-French: Resources for French Sentence Embedding Evaluation and Analysis
Recently, numerous embedding models have been made available and widely used for various NLP tasks. The Massive Text Embedding Benchmark (MTEB) has primarily simplified the process of choosing a model that performs well for several tasks in English, but extensions to other languages remain challenging. This is why we expand MTEB to propose the first massive benchmark of sentence embeddings for French. We gather 15 existing datasets in an easy-to-use interface and create three new French datasets for a global evaluation of 8 task categories. We compare 51 carefully selected embedding models on a large scale, conduct comprehensive statistical tests, and analyze the correlation between model performance and many of their characteristics. We find out that even if no model is the best on all tasks, large multilingual models pre-trained on sentence similarity perform exceptionally well. Our work comes with open-source code, new datasets and a public leaderboard.
♻ ☆ NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens
The rapid advancement of Large Language Models (LLMs) has introduced a new frontier in natural language processing, particularly in understanding and processing long-context information. However, the evaluation of these models' long-context abilities remains a challenge due to the limitations of current benchmarks. To address this gap, we introduce NovelQA, a benchmark specifically designed to test the capabilities of LLMs with extended texts. Constructed from English novels, NovelQA offers a unique blend of complexity, length, and narrative coherence, making it an ideal tool for assessing deep textual understanding in LLMs. This paper presents the design and construction of NovelQA, highlighting its manual annotation, and diverse question types. Our evaluation of Long-context LLMs on NovelQA reveals significant insights into the models' performance, particularly emphasizing the challenges they face with multi-hop reasoning, detail-oriented questions, and extremely long input with an average length more than 200,000 tokens. The results underscore the necessity for further advancements in LLMs to improve their long-context comprehension.
♻ ☆ KInIT at SemEval-2024 Task 8: Fine-tuned LLMs for Multilingual Machine-Generated Text Detection SemEval-2024
SemEval-2024 Task 8 is focused on multigenerator, multidomain, and multilingual black-box machine-generated text detection. Such a detection is important for preventing a potential misuse of large language models (LLMs), the newest of which are very capable in generating multilingual human-like texts. We have coped with this task in multiple ways, utilizing language identification and parameter-efficient fine-tuning of smaller LLMs for text classification. We have further used the per-language classification-threshold calibration to uniquely combine fine-tuned models predictions with statistical detection metrics to improve generalization of the system detection performance. Our submitted method achieved competitive results, ranking at the fourth place, just under 1 percentage point behind the winner.
comment: SemEval-2024 Task 8
♻ ☆ Attention Meets Post-hoc Interpretability: A Mathematical Perspective ICML 2024
Attention-based architectures, in particular transformers, are at the heart of a technological revolution. Interestingly, in addition to helping obtain state-of-the-art results on a wide range of applications, the attention mechanism intrinsically provides meaningful insights on the internal behavior of the model. Can these insights be used as explanations? Debate rages on. In this paper, we mathematically study a simple attention-based architecture and pinpoint the differences between post-hoc and attention-based explanations. We show that they provide quite different results, and that, despite their limitations, post-hoc methods are capable of capturing more useful insights than merely examining the attention weights.
comment: Accepted at ICML 2024
♻ ☆ Syntax-Aware Complex-Valued Neural Machine Translation
Syntax has been proven to be remarkably effective in neural machine translation (NMT). Previous models obtained syntax information from syntactic parsing tools and integrated it into NMT models to improve translation performance. In this work, we propose a method to incorporate syntax information into a complex-valued Encoder-Decoder architecture. The proposed model jointly learns word-level and syntax-level attention scores from the source side to the target side using an attention mechanism. Importantly, it is not dependent on specific network architectures and can be directly integrated into any existing sequence-to-sequence (Seq2Seq) framework. The experimental results demonstrate that the proposed method can bring significant improvements in BLEU scores on two datasets. In particular, the proposed method achieves a greater improvement in BLEU scores in translation tasks involving language pairs with significant syntactic differences.
comment: 12 pages, 5 figures
♻ ☆ Pretraining Vision-Language Model for Difference Visual Question Answering in Longitudinal Chest X-rays
Difference visual question answering (diff-VQA) is a challenging task that requires answering complex questions based on differences between a pair of images. This task is particularly important in reading chest X-ray images because radiologists often compare multiple images of the same patient taken at different times to track disease progression and changes in its severity in their clinical practice. However, previous works focused on designing specific network architectures for the diff-VQA task, missing opportunities to enhance the model's performance using a pretrained vision-language model (VLM). Here, we introduce a novel VLM called PLURAL, which is pretrained on natural and longitudinal chest X-ray data for the diff-VQA task. The model is developed using a step-by-step approach, starting with being pretrained on natural images and texts, followed by being trained using longitudinal chest X-ray data. The longitudinal data consist of pairs of X-ray images, along with question-answer sets and radiologist's reports that describe the changes in lung abnormalities and diseases over time. Our experimental results show that the PLURAL model outperforms state-of-the-art methods not only in diff-VQA for longitudinal X-rays but also in conventional VQA for a single X-ray image. Through extensive experiments, we demonstrate the effectiveness of the proposed VLM architecture and pretraining method in improving the model's performance.
♻ ☆ Adapters Mixup: Mixing Parameter-Efficient Adapters to Enhance the Adversarial Robustness of Fine-tuned Pre-trained Text Classifiers
Existing works show that augmenting the training data of pre-trained language models (PLMs) for classification tasks fine-tuned via parameter-efficient fine-tuning methods (PEFT) using both clean and adversarial examples can enhance their robustness under adversarial attacks. However, this adversarial training paradigm often leads to performance degradation on clean inputs and requires frequent re-training on the entire data to account for new, unknown attacks. To overcome these challenges while still harnessing the benefits of adversarial training and the efficiency of PEFT, this work proposes a novel approach, called AdpMixup, that combines two paradigms: (1) fine-tuning through adapters and (2) adversarial augmentation via mixup to dynamically leverage existing knowledge from a set of pre-known attacks for robust inference. Intuitively, AdpMixup fine-tunes PLMs with multiple adapters with both clean and pre-known adversarial examples and intelligently mixes them up in different ratios during prediction. Our experiments show AdpMixup achieves the best trade-off between training efficiency and robustness under both pre-known and unknown attacks, compared to existing baselines on five downstream tasks across six varied black-box attacks and 2 PLMs. All source code will be available.
♻ ☆ PLAYER*: Enhancing LLM-based Multi-Agent Communication and Interaction in Murder Mystery Games
We propose PLAYER*, a novel framework that addresses the limitations of existing agent-based approaches built on Large Language Models (LLMs) in handling complex questions and understanding interpersonal relationships in dynamic environments. PLAYER* enhances path planning in Murder Mystery Games (MMGs) using an anytime sampling-based planner and a questioning-driven search framework. By equipping agents with a set of sensors, PLAYER* eliminates the need for pre-defined questions and enables agents to navigate complex social interactions. We additionally make a contribution by introducing a quantifiable evaluation method using multiple-choice questions and present WellPlay, a dataset containing 1,482 question-answer pairs. Experimental results demonstrate PLAYER*'s superiority over existing multi-agent methods, enhancing the generalisability and adaptability of agents in MMGs and paving the way for more effective multi-agent interactions.
♻ ☆ SIMPLOT: Enhancing Chart Question Answering by Distilling Essentials
Recently, interpreting complex charts with logical reasoning has emerged as challenges due to the development of vision-language models. A prior state-of-the-art (SOTA) model has presented an end-to-end method that leverages the vision-language model to convert charts into table format utilizing Large Language Model (LLM) for reasoning. However, unlike natural images, charts contain a mix of essential and irrelevant information required for chart reasoning, and we discover that this characteristic can lower the performance of chart-to-table extraction. In this paper, we introduce SIMPLOT, a method designed to extract only the elements necessary for chart reasoning. The proposed method involves two steps: 1) training to mimic a simple plot that contains only the essential information from a complex chart for table extraction, followed by 2) performing reasoning based on the table. Our model enables accurate chart reasoning without the need for additional annotations or datasets, and its effectiveness is demonstrated through various experiments. Furthermore, we propose a novel prompt mimicking how human interpret charts for more accurate reasoning. Our source code is available at https://github.com/sangwu99/Simplot.
♻ ☆ On the Limitations of Fine-tuned Judge Models for LLM Evaluation
Recently, there has been a growing trend of utilizing Large Language Model (LLM) to evaluate the quality of other LLMs. Many studies have employed proprietary close-source models, especially GPT-4, as the evaluator. Alternatively, other works have fine-tuned judge models based on open-source LLMs as the evaluator. While the fine-tuned judge models are claimed to achieve comparable evaluation capability with GPT-4, in this study, we conduct an empirical study of judge models. Our findings indicate that although the fine-tuned judge models achieve high performance on in-domain test sets, even surpassing GPT-4, they underperform GPT-4 across several dimensions, including generalizability, fairness, aspect-specific evaluation, and scalability. We also reveal that the fine-tuned judge model inherently operates as a task-specific classifier, consequently imposing the limitations. Finally, we propose an effective indicator to measure the reliability of fine-tuned judges, with the aim of maximizing their utility in LLM evaluation.
♻ ☆ An Empirical Study on Cross-lingual Vocabulary Adaptation for Efficient Language Model Inference
The development of state-of-the-art generative large language models (LLMs) disproportionately relies on English-centric tokenizers, vocabulary and pre-training data. Despite the fact that some LLMs have multilingual capabilities, recent studies have shown that their inference efficiency deteriorates when generating text in languages other than English. This results in increased inference time and costs. Cross-lingual vocabulary adaptation (CVA) methods have been proposed for adapting models to a target language aiming to improve downstream performance. However, the effectiveness of these methods on increasing inference efficiency of generative LLMs has yet to be explored. In this paper, we perform an empirical study of five CVA methods on four generative LLMs (including monolingual and multilingual models) across four typologically-diverse languages and four natural language understanding tasks. We find that CVA substantially contributes to LLM inference speedups of up to 271.5\%. We also show that adapting LLMs that have been pre-trained on more balanced multilingual data results in downstream performance comparable to the original models.
♻ ☆ Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy
Human forecasting accuracy in practice relies on the 'wisdom of the crowd' effect, in which predictions about future events are significantly improved by aggregating across a crowd of individual forecasters. Past work on the forecasting ability of large language models (LLMs) suggests that frontier LLMs, as individual forecasters, underperform compared to the gold standard of a human-crowd forecasting-tournament aggregate. In Study 1, we expand this research by using an LLM ensemble approach consisting of a crowd of 12 LLMs. We compare the aggregated LLM predictions on 31 binary questions to those of a crowd of 925 human forecasters from a three-month forecasting tournament. Our preregistered main analysis shows that the LLM crowd outperforms a simple no-information benchmark, and is not statistically different from the human crowd. We also observe a set of human-like biases in machine responses, such as an acquiescence effect and a tendency to favour round numbers. In Study 2, we test whether LLM predictions (of GPT-4 and Claude 2) can be improved by drawing on human cognitive output. We find that both models' forecasting accuracy benefits from exposure to the median human prediction as information, improving accuracy by between 17% and 28%, though this leads to less accurate predictions than simply averaging human and machine forecasts. Our results suggest that LLMs can achieve forecasting accuracy rivaling that of the human crowd: via the simple, practically applicable method of forecast aggregation.
comment: 26 pages; 18 visualizations (nine figures, nine tables)
♻ ☆ Large Language Model Can Continue Evolving From Mistakes
As world knowledge evolves and new task paradigms emerge, Continual Learning (CL) is crucial for keeping Large Language Models (LLMs) up-to-date and addressing their shortcomings. In practical applications, LLMs often require both continual instruction tuning (CIT) and continual pre-training (CPT) to adapt to new task paradigms and acquire necessary knowledge for task-solving. However, it remains challenging to collect CPT data that addresses the knowledge deficiencies in models while maintaining adequate volume, and improving the efficiency of utilizing this data also presents significant difficulties. Inspired by the 'summarizing mistakes' learning skill, we propose the Continue Evolving from Mistakes (CEM) method, aiming to provide a data-efficient approach for collecting CPT data and continually improving LLMs' performance through iterative evaluation and supplementation with mistake-relevant knowledge. To efficiently utilize these CPT data and mitigate forgetting, we design a novel CL training set construction paradigm that integrates parallel CIT and CPT data. Extensive experiments demonstrate the efficacy of the CEM method, achieving up to a 17% improvement in accuracy in the best case. Furthermore, additional experiments confirm the potential of combining CEM with catastrophic forgetting mitigation methods, enabling iterative and continual model evolution.
♻ ☆ Query Performance Prediction using Relevance Judgments Generated by Large Language Models
Query performance prediction (QPP) aims to estimate the retrieval quality of a search system for a query without human relevance judgments. Previous QPP methods typically return a single scalar value and do not require the predicted values to approximate a specific information retrieval (IR) evaluation measure, leading to certain drawbacks: (i) a single scalar is insufficient to accurately represent different IR evaluation measures, especially when metrics do not highly correlate, and (ii) a single scalar limits the interpretability of QPP methods because solely using a scalar is insufficient to explain QPP results. To address these issues, we propose a QPP framework using automatically generated relevance judgments (QPP-GenRE), which decomposes QPP into independent subtasks of predicting the relevance of each item in a ranked list to a given query. This allows us to predict any IR evaluation measure using the generated relevance judgments as pseudo-labels. This also allows us to interpret predicted IR evaluation measures, and identify, track and rectify errors in generated relevance judgments to improve QPP quality. We predict an item's relevance by using open-source large language models (LLMs) to ensure scientific reproducibility. We face two main challenges: (i) excessive computational costs of judging an entire corpus for predicting a metric considering recall, and (ii) limited performance in prompting open-source LLMs in a zero-/few-shot manner. To solve the challenges, we devise an approximation strategy to predict an IR measure considering recall and propose to fine-tune open-source LLMs using human-labeled relevance judgments. Experiments on the TREC 2019-2022 deep learning tracks show that QPP-GenRE achieves state-of-the-art QPP quality for both lexical and neural rankers.
♻ ☆ Generative Language Models Exhibit Social Identity Biases
The surge in popularity of large language models has given rise to concerns about biases that these models could learn from humans. We investigate whether ingroup solidarity and outgroup hostility, fundamental social identity biases known from social psychology, are present in 56 large language models. We find that almost all foundational language models and some instruction fine-tuned models exhibit clear ingroup-positive and outgroup-negative associations when prompted to complete sentences (e.g., "We are..."). Our findings suggest that modern language models exhibit fundamental social identity biases to a similar degree as humans, both in the lab and in real-world conversations with LLMs, and that curating training data and instruction fine-tuning can mitigate such biases. Our results have practical implications for creating less biased large-language models and further underscore the need for more research into user interactions with LLMs to prevent potential bias reinforcement in humans.
comment: supplementary material, data, and code see https://osf.io/9ht32/?view_only=f0ab4b23325f4c31ad3e12a7353b55f5
♻ ☆ Quantifying the Persona Effect in LLM Simulations ACL 2024
Large language models (LLMs) have shown remarkable promise in simulating human language and behavior. This study investigates how integrating persona variables-demographic, social, and behavioral factors-impacts LLMs' ability to simulate diverse perspectives. We find that persona variables account for <10% variance in annotations in existing subjective NLP datasets. Nonetheless, incorporating persona variables via prompting in LLMs provides modest but statistically significant improvements. Persona prompting is most effective in samples where many annotators disagree, but their disagreements are relatively minor. Notably, we find a linear relationship in our setting: the stronger the correlation between persona variables and human annotations, the more accurate the LLM predictions are using persona prompting. In a zero-shot setting, a powerful 70b model with persona prompting captures 81% of the annotation variance achievable by linear regression trained on ground truth annotations. However, for most subjective NLP datasets, where persona variables have limited explanatory power, the benefits of persona prompting are limited.
comment: ACL 2024 Main
♻ ☆ HiFT: A Hierarchical Full Parameter Fine-Tuning Strategy
Full-parameter fine-tuning has become the go-to choice for adapting language models (LMs) to downstream tasks due to its excellent performance. As LMs grow in size, fine-tuning the full parameters of LMs requires a prohibitively large amount of GPU memory. Existing approaches utilize zeroth-order optimizer to conserve GPU memory, which can potentially compromise the performance of LMs as non-zero order optimizers tend to converge more readily on most downstream tasks. In this paper, we propose a novel optimizer-independent end-to-end hierarchical fine-tuning strategy, HiFT, which only updates a subset of parameters at each training step. HiFT can significantly reduce the amount of gradients and optimizer state parameters residing in GPU memory at the same time, thereby reducing GPU memory usage. Our results demonstrate that: (1) HiFT achieves comparable performance to parameter-efficient fine-tuning and standard full parameter fine-tuning. (2) HiFT supports various optimizers including AdamW, AdaGrad, SGD, etc. (3) HiFT can save more than 60\% GPU memory compared with standard full-parameter fine-tuning for 7B model. (4) HiFT enables full-parameter fine-tuning of a 7B model on single 48G A6000 with a precision of 32 using the AdamW optimizer, without using any memory saving techniques.
comment: under progress
♻ ☆ KnowledgeHub: An end-to-end Tool for Assisted Scientific Discovery
This paper describes the KnowledgeHub tool, a scientific literature Information Extraction (IE) and Question Answering (QA) pipeline. This is achieved by supporting the ingestion of PDF documents that are converted to text and structured representations. An ontology can then be constructed where a user defines the types of entities and relationships they want to capture. A browser-based annotation tool enables annotating the contents of the PDF documents according to the ontology. Named Entity Recognition (NER) and Relation Classification (RC) models can be trained on the resulting annotations and can be used to annotate the unannotated portion of the documents. A knowledge graph is constructed from these entity and relation triples which can be queried to obtain insights from the data. Furthermore, we integrate a suite of Large Language Models (LLMs) that can be used for QA and summarisation that is grounded in the included documents via a retrieval component. KnowledgeHub is a unique tool that supports annotation, IE and QA, which gives the user full insight into the knowledge discovery pipeline.
♻ ☆ Beyond Embeddings: The Promise of Visual Table in Visual Reasoning
Visual representation learning has been a cornerstone in computer vision, involving typical forms such as visual embeddings, structural symbols, and text-based representations. Despite the success of CLIP-type visual embeddings, they often lack access to world knowledge critical for visual reasoning. In this work, we propose Visual Table, a novel form of visual representation tailored for visual reasoning. Visual tables are constructed as hierarchical descriptions of visual scenes, featuring a scene description and multiple object-centric descriptions covering categories, attributes, and knowledge. Thanks to the structural and textual formats, visual tables offer unique advantages over mere visual embeddings, such as interpretability and controllable editing. Furthermore, they deliver instance-level world knowledge and detailed attributes that are essential for visual reasoning. To create visual tables, we develop a generator trained on the dataset with collected, small-scale annotations. Extensive results on 11 visual reasoning benchmarks demonstrate that the generated visual tables significantly outperform previous structural and text-based representations. Moreover, they consistently enhance state-of-the-art multimodal large language models across diverse benchmarks, showcasing their potential for advancing visual reasoning tasks. Our code is available at https://github.com/LaVi-Lab/Visual-Table.
comment: Project page: https://github.com/LaVi-Lab/Visual-Table
♻ ☆ P-ICL: Point In-Context Learning for Named Entity Recognition with Large Language Models
In recent years, the rise of large language models (LLMs) has made it possible to directly achieve named entity recognition (NER) without any demonstration samples or only using a few samples through in-context learning (ICL). However, standard ICL only helps LLMs understand task instructions, format and input-label mapping, but neglects the particularity of the NER task itself. In this paper, we propose a new prompting framework P-ICL to better achieve NER with LLMs, in which some point entities are leveraged as the auxiliary information to recognize each entity type. With such significant information, the LLM can achieve entity classification more precisely. To obtain optimal point entities for prompting LLMs, we also proposed a point entity selection method based on K-Means clustering. Our extensive experiments on some representative NER benchmarks verify the effectiveness of our proposed strategies in P-ICL and point entity selection.
♻ ☆ SUBLLM: A Novel Efficient Architecture with Token Sequence Subsampling for LLM ECAI 2024
While Large Language Models (LLMs) have achieved remarkable success in various fields, the efficiency of training and inference remains a major challenge. To address this issue, we propose SUBLLM, short for Subsampling-Upsampling-Bypass Large Language Model, an innovative architecture that extends the core decoder-only framework by incorporating subsampling, upsampling, and bypass modules. The subsampling modules are responsible for shortening the sequence, while the upsampling modules restore the sequence length, and the bypass modules enhance convergence. In comparison to LLaMA, the proposed SUBLLM exhibits significant enhancements in both training and inference speeds as well as memory usage, while maintaining competitive few-shot performance. During training, SUBLLM increases speeds by 26% and cuts memory by 10GB per GPU. In inference, it boosts speeds by up to 37% and reduces memory by 1GB per GPU. The training and inference speeds can be enhanced by 34% and 52% respectively when the context window is expanded to 8192. We shall release the source code of the proposed architecture in the published version.
comment: 9 pages, 3 figures, submitted to ECAI 2024
♻ ☆ How Much Context Does My Attention-Based ASR System Need?
For the task of speech recognition, the use of more than 30 seconds of acoustic context during training is uncommon and under-investigated in literature. In this work, we conduct an empirical study on the effect of scaling the sequence length used to train/evaluate (dense-attention-based) acoustic models on speech recognition performance. For these experiments, a dataset of roughly 100,000 pseudo-labelled Spotify podcasts is used, with context lengths of 5 seconds to 1 hour being explored. Zero-shot evaluations are presented on the long-format datasets: Earnings-22, Tedlium and Rev16. Results demonstrate a benefit from training with up to 21.8 minutes of acoustic context, showing up to a 14.5\% relative improvement from a baseline trained with 10 seconds of context. We find that the model's width/depth, positional encoding scheme and number of attention heads impact its ability to use longer contexts.
comment: Accepted at Interspeech 2024
♻ ☆ Mitigating Bias for Question Answering Models by Tracking Bias Influence NAACL 2024
Models of various NLP tasks have been shown to exhibit stereotypes, and the bias in the question answering (QA) models is especially harmful as the output answers might be directly consumed by the end users. There have been datasets to evaluate bias in QA models, while bias mitigation technique for the QA models is still under-explored. In this work, we propose BMBI, an approach to mitigate the bias of multiple-choice QA models. Based on the intuition that a model would lean to be more biased if it learns from a biased example, we measure the bias level of a query instance by observing its influence on another instance. If the influenced instance is more biased, we derive that the query instance is biased. We then use the bias level detected as an optimization objective to form a multi-task learning setting in addition to the original QA task. We further introduce a new bias evaluation metric to quantify bias in a comprehensive and sensitive way. We show that our method could be applied to multiple QA formulations across multiple bias categories. It can significantly reduce the bias level in all 9 bias categories in the BBQ dataset while maintaining comparable QA accuracy.
comment: To appear at NAACL 2024 main conference
♻ ☆ COBIAS: Contextual Reliability in Bias Assessment
Large Language Models (LLMs) are trained on extensive web corpora, which enable them to understand and generate human-like text. However, this training process also results in inherent biases within the models. These biases arise from web data's diverse and often uncurated nature, containing various stereotypes and prejudices. Previous works on debiasing models rely on benchmark datasets to measure their method's performance. However, these datasets suffer from several pitfalls due to the highly subjective understanding of bias, highlighting a critical need for contextual exploration. We propose understanding the context of inputs by considering the diverse situations in which they may arise. Our contribution is two-fold: (i) we augment 2,291 stereotyped statements from two existing bias-benchmark datasets with points for adding context; (ii) we develop the Context-Oriented Bias Indicator and Assessment Score (COBIAS) to assess a statement's contextual reliability in measuring bias. Our metric aligns with human judgment on contextual reliability of statements (Spearman's $\rho = 0.65, p = 3.4 * 10^{-60}$) and can be used to create reliable datasets, which would assist bias mitigation works.
♻ ☆ A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models KDD2024
As one of the most advanced techniques in AI, Retrieval-Augmented Generation (RAG) can offer reliable and up-to-date external knowledge, providing huge convenience for numerous tasks. Particularly in the era of AI-Generated Content (AIGC), the powerful capacity of retrieval in providing additional knowledge enables RAG to assist existing generative AI in producing high-quality outputs. Recently, Large Language Models (LLMs) have demonstrated revolutionary abilities in language understanding and generation, while still facing inherent limitations, such as hallucinations and out-of-date internal knowledge. Given the powerful abilities of RAG in providing the latest and helpful auxiliary information, Retrieval-Augmented Large Language Models (RA-LLMs) have emerged to harness external and authoritative knowledge bases, rather than solely relying on the model's internal knowledge, to augment the generation quality of LLMs. In this survey, we comprehensively review existing research studies in RA-LLMs, covering three primary technical perspectives: architectures, training strategies, and applications. As the preliminary knowledge, we briefly introduce the foundations and recent advances of LLMs. Then, to illustrate the practical significance of RAG for LLMs, we systematically review mainstream relevant work by their architectures, training strategies, and application areas, detailing specifically the challenges of each and the corresponding capabilities of RA-LLMs. Finally, to deliver deeper insights, we discuss current limitations and several promising directions for future research. Updated information about this survey can be found at https://advanced-recommender-systems.github.io/RAG-Meets-LLMs/
comment: This is the long version of the corresponding survey paper accepted by KDD2024
♻ ☆ Self-AMPLIFY: Improving Small Language Models with Self Post Hoc Explanations
Incorporating natural language rationales in the prompt and In-Context Learning (ICL) have led to a significant improvement of Large Language Models (LLMs) performance. However, generating high-quality rationales require human-annotation or the use of auxiliary proxy models. In this work, we propose Self-AMPLIFY to automatically generate rationales from post hoc explanation methods applied to Small Language Models (SLMs) to improve their own performance. Self-AMPLIFY is a 3-step method that targets samples, generates rationales and builds a final prompt to leverage ICL. Self-AMPLIFY performance is evaluated on four SLMs and five datasets requiring strong reasoning abilities. Self-AMPLIFY achieves good results against competitors, leading to strong accuracy improvement. Self-AMPLIFY is the first method to apply post hoc explanation methods to autoregressive language models to generate rationales to improve their own performance in a fully automated manner.
♻ ☆ A Unified Framework and Dataset for Assessing Societal Bias in Vision-Language Models
Vision-language models (VLMs) have gained widespread adoption in both industry and academia. In this study, we propose a unified framework for systematically evaluating gender, race, and age biases in VLMs with respect to professions. Our evaluation encompasses all supported inference modes of the recent VLMs, including image-to-text, text-to-text, text-to-image, and image-to-image. Additionally, we propose an automated pipeline to generate high-quality synthetic datasets that intentionally conceal gender, race, and age information across different professional domains, both in generated text and images. The dataset includes action-based descriptions of each profession and serves as a benchmark for evaluating societal biases in vision-language models (VLMs). In our comparative analysis of widely used VLMs, we have identified that varying input-output modalities lead to discernible differences in bias magnitudes and directions. Additionally, we find that VLM models exhibit distinct biases across different bias attributes we investigated. We hope our work will help guide future progress in improving VLMs to learn socially unbiased representations. We will release our data and code.
♻ ☆ ValueDCG: Measuring Comprehensive Human Value Understanding Ability of Language Models
Personal values are a crucial factor behind human decision-making. Considering that Large Language Models (LLMs) have been shown to impact human decisions significantly, it is essential to make sure they accurately understand human values to ensure their safety. However, evaluating their grasp of these values is complex due to the value's intricate and adaptable nature. We argue that truly understanding values in LLMs requires considering both "know what" and "know why". To this end, we present a comprehensive evaluation metric, ValueDCG (Value Discriminator-Critique Gap), to quantitatively assess the two aspects with an engineering implementation. We assess four representative LLMs and provide compelling evidence that the growth rates of LLM's "know what" and "know why" capabilities do not align with increases in parameter numbers, resulting in a decline in the models' capacity to understand human values as larger amounts of parameters. This may further suggest that LLMs might craft plausible explanations based on the provided context without truly understanding their inherent value, indicating potential risks.
♻ ☆ ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models
Large vision-language models (LVLMs) have shown premise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities. However, they require considerable computational resources for training and deployment. This study aims to bridge the performance gap between traditional-scale LVLMs and resource-friendly lite versions by adopting high-quality training data. To this end, we propose a comprehensive pipeline for generating a synthetic dataset. The key idea is to leverage strong proprietary models to generate (i) fine-grained image annotations for vision-language alignment and (ii) complex reasoning visual question-answering pairs for visual instruction fine-tuning, yielding 1.3M samples in total. We train a series of lite VLMs on the synthetic dataset and experimental results demonstrate the effectiveness of the proposed scheme, where they achieve competitive performance on 17 benchmarks among 4B LVLMs, and even perform on par with 7B/13B-scale models on various benchmarks. This work highlights the feasibility of adopting high-quality data in crafting more efficient LVLMs. We name our dataset \textit{ALLaVA}, and open-source it to research community for developing better resource-efficient LVLMs for wider usage.
comment: 22 pages
Machine Learning 150
☆ mDPO: Conditional Preference Optimization for Multimodal Large Language Models
Direct preference optimization (DPO) has shown to be an effective method for large language model (LLM) alignment. Recent works have attempted to apply DPO to multimodal scenarios but have found it challenging to achieve consistent improvement. Through a comparative experiment, we identify the unconditional preference problem in multimodal preference optimization, where the model overlooks the image condition. To address this problem, we propose mDPO, a multimodal DPO objective that prevents the over-prioritization of language-only preferences by also optimizing image preference. Moreover, we introduce a reward anchor that forces the reward to be positive for chosen responses, thereby avoiding the decrease in their likelihood -- an intrinsic problem of relative preference optimization. Experiments on two multimodal LLMs of different sizes and three widely used benchmarks demonstrate that mDPO effectively addresses the unconditional preference problem in multimodal preference optimization and significantly improves model performance, particularly in reducing hallucination.
☆ MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs
Generating natural and meaningful responses to communicate with multi-modal human inputs is a fundamental capability of Large Vision-Language Models(LVLMs). While current open-source LVLMs demonstrate promising performance in simplified scenarios such as single-turn single-image input, they fall short in real-world conversation scenarios such as following instructions in a long context history with multi-turn and multi-images. Existing LVLM benchmarks primarily focus on single-choice questions or short-form responses, which do not adequately assess the capabilities of LVLMs in real-world human-AI interaction applications. Therefore, we introduce MMDU, a comprehensive benchmark, and MMDU-45k, a large-scale instruction tuning dataset, designed to evaluate and improve LVLMs' abilities in multi-turn and multi-image conversations. We employ the clustering algorithm to ffnd the relevant images and textual descriptions from the open-source Wikipedia and construct the question-answer pairs by human annotators with the assistance of the GPT-4o model. MMDU has a maximum of 18k image+text tokens, 20 images, and 27 turns, which is at least 5x longer than previous benchmarks and poses challenges to current LVLMs. Our in-depth analysis of 15 representative LVLMs using MMDU reveals that open-source LVLMs lag behind closed-source counterparts due to limited conversational instruction tuning data. We demonstrate that ffne-tuning open-source LVLMs on MMDU-45k signiffcantly address this gap, generating longer and more accurate conversations, and improving scores on MMDU and existing benchmarks (MMStar: +1.1%, MathVista: +1.5%, ChartQA:+1.2%). Our contributions pave the way for bridging the gap between current LVLM models and real-world application demands. This project is available at https://github.com/Liuziyu77/MMDU.
comment: This project is available at https://github.com/Liuziyu77/MMDU
☆ Learning sum of diverse features: computational hardness and efficient gradient-based training for ridge combinations COLT 2024
We study the computational and sample complexity of learning a target function $f_*:\mathbb{R}^d\to\mathbb{R}$ with additive structure, that is, $f_*(x) = \frac{1}{\sqrt{M}}\sum_{m=1}^M f_m(\langle x, v_m\rangle)$, where $f_1,f_2,...,f_M:\mathbb{R}\to\mathbb{R}$ are nonlinear link functions of single-index models (ridge functions) with diverse and near-orthogonal index features $\{v_m\}_{m=1}^M$, and the number of additive tasks $M$ grows with the dimensionality $M\asymp d^\gamma$ for $\gamma\ge 0$. This problem setting is motivated by the classical additive model literature, the recent representation learning theory of two-layer neural network, and large-scale pretraining where the model simultaneously acquires a large number of "skills" that are often localized in distinct parts of the trained network. We prove that a large subset of polynomial $f_*$ can be efficiently learned by gradient descent training of a two-layer neural network, with a polynomial statistical and computational complexity that depends on the number of tasks $M$ and the information exponent of $f_m$, despite the unknown link function and $M$ growing with the dimensionality. We complement this learnability guarantee with computational hardness result by establishing statistical query (SQ) lower bounds for both the correlational SQ and full SQ algorithms.
comment: COLT 2024
☆ WPO: Enhancing RLHF with Weighted Preference Optimization
Reinforcement learning from human feedback (RLHF) is a promising solution to align large language models (LLMs) more closely with human values. Off-policy preference optimization, where the preference data is obtained from other models, is widely adopted due to its cost efficiency and scalability. However, off-policy preference optimization often suffers from a distributional gap between the policy used for data collection and the target policy, leading to suboptimal optimization. In this paper, we propose a novel strategy to mitigate this problem by simulating on-policy learning with off-policy preference data. Our Weighted Preference Optimization (WPO) method adapts off-policy data to resemble on-policy data more closely by reweighting preference pairs according to their probability under the current policy. This method not only addresses the distributional gap problem but also enhances the optimization process without incurring additional costs. We validate our method on instruction following benchmarks including Alpaca Eval 2 and MT-bench. WPO not only outperforms Direct Preference Optimization (DPO) by up to 5.6% on Alpaca Eval 2 but also establishes a remarkable length-controlled winning rate against GPT-4-turbo of 48.6% based on Llama-3-8B-Instruct, making it the strongest 8B model on the leaderboard. We will release the code and models at https://github.com/wzhouad/WPO.
☆ Spectral Introspection Identifies Group Training Dynamics in Deep Neural Networks for Neuroimaging
Neural networks, whice have had a profound effect on how researchers study complex phenomena, do so through a complex, nonlinear mathematical structure which can be difficult for human researchers to interpret. This obstacle can be especially salient when researchers want to better understand the emergence of particular model behaviors such as bias, overfitting, overparametrization, and more. In Neuroimaging, the understanding of how such phenomena emerge is fundamental to preventing and informing users of the potential risks involved in practice. In this work, we present a novel introspection framework for Deep Learning on Neuroimaging data, which exploits the natural structure of gradient computations via the singular value decomposition of gradient components during reverse-mode auto-differentiation. Unlike post-hoc introspection techniques, which require fully-trained models for evaluation, our method allows for the study of training dynamics on the fly, and even more interestingly, allow for the decomposition of gradients based on which samples belong to particular groups of interest. We demonstrate how the gradient spectra for several common deep learning models differ between schizophrenia and control participants from the COBRE study, and illustrate how these trajectories may reveal specific training dynamics helpful for further analysis.
☆ Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level
Direct Preference Optimization (DPO), a standard method for aligning language models with human preferences, is traditionally applied to offline preferences. Recent studies show that DPO benefits from iterative training with online preferences labeled by a trained reward model. In this work, we identify a pitfall of vanilla iterative DPO - improved response quality can lead to increased verbosity. To address this, we introduce iterative length-regularized DPO (iLR-DPO) to penalize response length. Our empirical results show that iLR-DPO can enhance a 7B model to perform on par with GPT-4 without increasing verbosity. Specifically, our 7B model achieves a $50.5\%$ length-controlled win rate against $\texttt{GPT-4 Preview}$ on AlpacaEval 2.0, and excels across standard benchmarks including MT-Bench, Arena-Hard and OpenLLM Leaderboard. These results demonstrate the effectiveness of iterative DPO in aligning language models with human feedback.
☆ LLARVA: Vision-Action Instruction Tuning Enhances Robot Learning
In recent years, instruction-tuned Large Multimodal Models (LMMs) have been successful at several tasks, including image captioning and visual question answering; yet leveraging these models remains an open question for robotics. Prior LMMs for robotics applications have been extensively trained on language and action data, but their ability to generalize in different settings has often been less than desired. To address this, we introduce LLARVA, a model trained with a novel instruction tuning method that leverages structured prompts to unify a range of robotic learning tasks, scenarios, and environments. Additionally, we show that predicting intermediate 2-D representations, which we refer to as "visual traces", can help further align vision and action spaces for robot learning. We generate 8.5M image-visual trace pairs from the Open X-Embodiment dataset in order to pre-train our model, and we evaluate on 12 different tasks in the RLBench simulator as well as a physical Franka Emika Panda 7-DoF robot. Our experiments yield strong performance, demonstrating that LLARVA - using 2-D and language representations - performs well compared to several contemporary baselines, and can generalize across various robot environments and configurations.
☆ Stochastic Neural Network Symmetrisation in Markov Categories
We consider the problem of symmetrising a neural network along a group homomorphism: given a homomorphism $\varphi : H \to G$, we would like a procedure that converts $H$-equivariant neural networks into $G$-equivariant ones. We formulate this in terms of Markov categories, which allows us to consider neural networks whose outputs may be stochastic, but with measure-theoretic details abstracted away. We obtain a flexible, compositional, and generic framework for symmetrisation that relies on minimal assumptions about the structure of the group and the underlying neural network architecture. Our approach recovers existing methods for deterministic symmetrisation as special cases, and extends directly to provide a novel methodology for stochastic symmetrisation also. Beyond this, we believe our findings also demonstrate the utility of Markov categories for addressing problems in machine learning in a conceptual yet mathematically rigorous way.
☆ Computationally Efficient RL under Linear Bellman Completeness for Deterministic Dynamics
We study computationally and statistically efficient Reinforcement Learning algorithms for the linear Bellman Complete setting, a setting that uses linear function approximation to capture value functions and unifies existing models like linear Markov Decision Processes (MDP) and Linear Quadratic Regulators (LQR). While it is known from the prior works that this setting is statistically tractable, it remained open whether a computationally efficient algorithm exists. Our work provides a computationally efficient algorithm for the linear Bellman complete setting that works for MDPs with large action spaces, random initial states, and random rewards but relies on the underlying dynamics to be deterministic. Our approach is based on randomization: we inject random noise into least square regression problems to perform optimistic value iteration. Our key technical contribution is to carefully design the noise to only act in the null space of the training data to ensure optimism while circumventing a subtle error amplification issue.
☆ Physics-Constrained Learning for PDE Systems with Uncertainty Quantified Port-Hamiltonian Models
Modeling the dynamics of flexible objects has become an emerging topic in the community as these objects become more present in many applications, e.g., soft robotics. Due to the properties of flexible materials, the movements of soft objects are often highly nonlinear and, thus, complex to predict. Data-driven approaches seem promising for modeling those complex dynamics but often neglect basic physical principles, which consequently makes them untrustworthy and limits generalization. To address this problem, we propose a physics-constrained learning method that combines powerful learning tools and reliable physical models. Our method leverages the data collected from observations by sending them into a Gaussian process that is physically constrained by a distributed Port-Hamiltonian model. Based on the Bayesian nature of the Gaussian process, we not only learn the dynamics of the system, but also enable uncertainty quantification. Furthermore, the proposed approach preserves the compositional nature of Port-Hamiltonian systems.
☆ Efficient Discovery of Significant Patterns with Few-Shot Resampling VLDB 2024
Significant pattern mining is a fundamental task in mining transactional data, requiring to identify patterns significantly associated with the value of a given feature, the target. In several applications, such as biomedicine, basket market analysis, and social networks, the goal is to discover patterns whose association with the target is defined with respect to an underlying population, or process, of which the dataset represents only a collection of observations, or samples. A natural way to capture the association of a pattern with the target is to consider its statistical significance, assessing its deviation from the (null) hypothesis of independence between the pattern and the target. While several algorithms have been proposed to find statistically significant patterns, it remains a computationally demanding task, and for complex patterns such as subgroups, no efficient solution exists. We present FSR, an efficient algorithm to identify statistically significant patterns with rigorous guarantees on the probability of false discoveries. FSR builds on a novel general framework for mining significant patterns that captures some of the most commonly considered patterns, including itemsets, sequential patterns, and subgroups. FSR uses a small number of resampled datasets, obtained by assigning i.i.d. labels to each transaction, to rigorously bound the supremum deviation of a quality statistic measuring the significance of patterns. FSR builds on novel tight bounds on the supremum deviation that require to mine a small number of resampled datasets, while providing a high effectiveness in discovering significant patterns. As a test case, we consider significant subgroup mining, and our evaluation on several real datasets shows that FSR is effective in discovering significant subgroups, while requiring a small number of resampled datasets.
comment: Accepted to VLDB 2024
☆ Mix-Domain Contrastive Learning for Unpaired H&E-to-IHC Stain Translation
H&E-to-IHC stain translation techniques offer a promising solution for precise cancer diagnosis, especially in low-resource regions where there is a shortage of health professionals and limited access to expensive equipment. Considering the pixel-level misalignment of H&E-IHC image pairs, current research explores the pathological consistency between patches from the same positions of the image pair. However, most of them overemphasize the correspondence between domains or patches, overlooking the side information provided by the non-corresponding objects. In this paper, we propose a Mix-Domain Contrastive Learning (MDCL) method to leverage the supervision information in unpaired H&E-to-IHC stain translation. Specifically, the proposed MDCL method aggregates the inter-domain and intra-domain pathology information by estimating the correlation between the anchor patch and all the patches from the matching images, encouraging the network to learn additional contrastive knowledge from mixed domains. With the mix-domain pathology information aggregation, MDCL enhances the pathological consistency between the corresponding patches and the component discrepancy of the patches from the different positions of the generated IHC image. Extensive experiments on two H&E-to-IHC stain translation datasets, namely MIST and BCI, demonstrate that the proposed method achieves state-of-the-art performance across multiple metrics.
☆ DataComp-LM: In search of the next generation of training sets for language models
We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set. The resulting dataset, DCLM-Baseline enables training a 7B parameter language model from scratch to 64% 5-shot accuracy on MMLU with 2.6T training tokens. Compared to MAP-Neo, the previous state-of-the-art in open-data language models, DCLM-Baseline represents a 6.6 percentage point improvement on MMLU while being trained with 40% less compute. Our baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% & 66%), and performs similarly on an average of 53 natural language understanding tasks while being trained with 6.6x less compute than Llama 3 8B. Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation.
comment: Project page: https://www.datacomp.ai/dclm/
☆ CELL your Model: Contrastive Explanation Methods for Large Language Models
The advent of black-box deep neural network classification models has sparked the need to explain their decisions. However, in the case of generative AI such as large language models (LLMs), there is no class prediction to explain. Rather, one can ask why an LLM output a particular response to a given prompt. In this paper, we answer this question by proposing, to the best of our knowledge, the first contrastive explanation methods requiring simply black-box/query access. Our explanations suggest that an LLM outputs a reply to a given prompt because if the prompt was slightly modified, the LLM would have given a different response that is either less preferable or contradicts the original response. The key insight is that contrastive explanations simply require a distance function that has meaning to the user and not necessarily a real valued representation of a specific response (viz. class label). We offer two algorithms for finding contrastive explanations: i) A myopic algorithm, which although effective in creating contrasts, requires many model calls and ii) A budgeted algorithm, our main algorithmic contribution, which intelligently creates contrasts adhering to a query budget, necessary for longer contexts. We show the efficacy of these methods on diverse natural language tasks such as open-text generation, automated red teaming, and explaining conversational degradation.
☆ Split, Unlearn, Merge: Leveraging Data Attributes for More Effective Unlearning in LLMs
Large language models (LLMs) have shown to pose social and ethical risks such as generating toxic language or facilitating malicious use of hazardous knowledge. Machine unlearning is a promising approach to improve LLM safety by directly removing harmful behaviors and knowledge. In this paper, we propose "SPlit, UNlearn, MerGE" (SPUNGE), a framework that can be used with any unlearning method to amplify its effectiveness. SPUNGE leverages data attributes during unlearning by splitting unlearning data into subsets based on specific attribute values, unlearning each subset separately, and merging the unlearned models. We empirically demonstrate that SPUNGE significantly improves the performance of two recent unlearning methods on state-of-the-art LLMs while maintaining their general capabilities on standard academic benchmarks.
☆ Provable Guarantees for Model Performance via Mechanistic Interpretability ICML 2024
In this work, we propose using mechanistic interpretability -- techniques for reverse engineering model weights into human-interpretable algorithms -- to derive and compactly prove formal guarantees on model performance. We prototype this approach by formally lower bounding the accuracy of 151 small transformers trained on a Max-of-$k$ task. We create 102 different computer-assisted proof strategies and assess their length and tightness of bound on each of our models. Using quantitative metrics, we show that shorter proofs seem to require and provide more mechanistic understanding, and that more faithful mechanistic understanding leads to tighter performance bounds. We confirm these connections by qualitatively examining a subset of our proofs. Finally, we identify compounding structureless noise as a key challenge for using mechanistic interpretability to generate compact proofs on model performance.
comment: Submitted to the ICML 2024 Workshop on Mechanistic Interpretability and The Thirty-eighth Annual Conference on Neural Information Processing Systems
☆ Optimal Transport-Assisted Risk-Sensitive Q-Learning
The primary goal of reinforcement learning is to develop decision-making policies that prioritize optimal performance without considering risk or safety. In contrast, safe reinforcement learning aims to mitigate or avoid unsafe states. This paper presents a risk-sensitive Q-learning algorithm that leverages optimal transport theory to enhance the agent safety. By integrating optimal transport into the Q-learning framework, our approach seeks to optimize the policy's expected return while minimizing the Wasserstein distance between the policy's stationary distribution and a predefined risk distribution, which encapsulates safety preferences from domain experts. We validate the proposed algorithm in a Gridworld environment. The results indicate that our method significantly reduces the frequency of visits to risky states and achieves faster convergence to a stable policy compared to the traditional Q-learning algorithm.
☆ Joint Linked Component Analysis for Multiview Data
In this work, we propose the joint linked component analysis (joint\_LCA) for multiview data. Unlike classic methods which extract the shared components in a sequential manner, the objective of joint\_LCA is to identify the view-specific loading matrices and the rank of the common latent subspace simultaneously. We formulate a matrix decomposition model where a joint structure and an individual structure are present in each data view, which enables us to arrive at a clean svd representation for the cross covariance between any pair of data views. An objective function with a novel penalty term is then proposed to achieve simultaneous estimation and rank selection. In addition, a refitting procedure is employed as a remedy to reduce the shrinkage bias caused by the penalization.
☆ A Semantic-based Layer Freezing Approach to Efficient Fine-Tuning of Language Models
Finetuning language models (LMs) is crucial for adapting the models to downstream data and tasks. However, full finetuning is usually costly. Existing work, such as parameter-efficient finetuning (PEFT), often focuses on \textit{how to finetune} but neglects the issue of \textit{where to finetune}. As a pioneering work on answering where to finetune (at the layer level), we conduct a semantic analysis of the LM inference process. We first propose a virtual transition of the latent representation and then trace its factual transition. Based on the deviation in transitions, we estimate the gain of finetuning each model layer, and further, narrow down the scope for finetuning. We perform extensive experiments across well-known LMs and datasets. The results show that our approach is effective and efficient, and outperforms the existing baselines. Our approach is orthogonal to existing efficient techniques, such as PEFT methods, offering practical values on LM finetuning.
comment: 13 pages, 5 figures, under peer-review
☆ Transcendence: Generative Models Can Outperform The Experts That Train Them
Generative models are trained with the simple objective of imitating the conditional probability distribution induced by the data they are trained on. Therefore, when trained on data generated by humans, we may not expect the artificial model to outperform the humans on their original objectives. In this work, we study the phenomenon of transcendence: when a generative model achieves capabilities that surpass the abilities of the experts generating its data. We demonstrate transcendence by training an autoregressive transformer to play chess from game transcripts, and show that the trained model can sometimes achieve better performance than all players in the dataset. We theoretically prove that transcendence is enabled by low-temperature sampling, and rigorously assess this experimentally. Finally, we discuss other sources of transcendence, laying the groundwork for future investigation of this phenomenon in a broader setting.
comment: Code, models, and data at https://transcendence.eddie.win
☆ Imagination Policy: Using Generative Point Cloud Models for Learning Manipulation Policies
Humans can imagine goal states during planning and perform actions to match those goals. In this work, we propose Imagination Policy, a novel multi-task key-frame policy network for solving high-precision pick and place tasks. Instead of learning actions directly, Imagination Policy generates point clouds to imagine desired states which are then translated to actions using rigid action estimation. This transforms action inference into a local generative task. We leverage pick and place symmetries underlying the tasks in the generation process and achieve extremely high sample efficiency and generalizability to unseen configurations. Finally, we demonstrate state-of-the-art performance across various tasks on the RLbench benchmark compared with several strong baselines.
☆ A Clipped Trip: the Dynamics of SGD with Gradient Clipping in High-Dimensions
The success of modern machine learning is due in part to the adaptive optimization methods that have been developed to deal with the difficulties of training large models over complex datasets. One such method is gradient clipping: a practical procedure with limited theoretical underpinnings. In this work, we study clipping in a least squares problem under streaming SGD. We develop a theoretical analysis of the learning dynamics in the limit of large intrinsic dimension-a model and dataset dependent notion of dimensionality. In this limit we find a deterministic equation that describes the evolution of the loss. We show that with Gaussian noise clipping cannot improve SGD performance. Yet, in other noisy settings, clipping can provide benefits with tuning of the clipping threshold. In these cases, clipping biases updates in a way beneficial to training which cannot be recovered by SGD under any schedule. We conclude with a discussion about the links between high-dimensional clipping and neural network training.
☆ CHG Shapley: Efficient Data Valuation and Selection towards Trustworthy Machine Learning
Understanding the decision-making process of machine learning models is crucial for ensuring trustworthy machine learning. Data Shapley, a landmark study on data valuation, has significantly advanced this understanding by assessing the contribution of each datum to model accuracy. However, the resource-intensive and time-consuming nature of multiple model retraining poses significant challenges for applying Data Shapley to large datasets. To address this, we propose the CHG (Conduct of Hardness and Gradient) score, which approximates the utility of each data subset on model accuracy during a single model training. By deriving the closed-form expression of the Shapley value for each data point under the CHG score utility function, we reduce the computational complexity to the equivalent of a single model retraining, an exponential improvement over existing methods. Additionally, we employ CHG Shapley for real-time data selection, demonstrating its effectiveness in identifying high-value and noisy data. CHG Shapley facilitates trustworthy model training through efficient data valuation, introducing a novel data-centric perspective on trustworthy machine learning.
☆ Zero-Shot Generalization during Instruction Tuning: Insights from Similarity and Granularity
Understanding alignment techniques begins with comprehending zero-shot generalization brought by instruction tuning, but little of the mechanism has been understood. Existing work has largely been confined to the task level, without considering that tasks are artificially defined and, to LLMs, merely consist of tokens and representations. This line of research has been limited to examining transfer between tasks from a task-pair perspective, with few studies focusing on understanding zero-shot generalization from the perspective of the data itself. To bridge this gap, we first demonstrate through multiple metrics that zero-shot generalization during instruction tuning happens very early. Next, we investigate the facilitation of zero-shot generalization from both data similarity and granularity perspectives, confirming that encountering highly similar and fine-grained training data earlier during instruction tuning, without the constraints of defined "tasks", enables better generalization. Finally, we propose a more grounded training data arrangement method, Test-centric Multi-turn Arrangement, and show its effectiveness in promoting continual learning and further loss reduction. For the first time, we show that zero-shot generalization during instruction tuning is a form of similarity-based generalization between training and test data at the instance level. We hope our analysis will advance the understanding of zero-shot generalization during instruction tuning and contribute to the development of more aligned LLMs. Our code is released at https://github.com/HBX-hbx/dynamics_of_zero-shot_generalization.
comment: 33 pages, 14 figures
☆ Refusal in Language Models Is Mediated by a Single Direction
Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones. While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood. In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size. Specifically, for each model, we find a single direction such that erasing this direction from the model's residual stream activations prevents it from refusing harmful instructions, while adding this direction elicits refusal on even harmless instructions. Leveraging this insight, we propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities. Finally, we mechanistically analyze how adversarial suffixes suppress propagation of the refusal-mediating direction. Our findings underscore the brittleness of current safety fine-tuning methods. More broadly, our work showcases how an understanding of model internals can be leveraged to develop practical methods for controlling model behavior.
☆ Measuring memorization in RLHF for code completion
Reinforcement learning with human feedback (RLHF) has become the dominant method to align large models to user preferences. Unlike fine-tuning, for which there are many studies regarding training data memorization, it is not clear how memorization is affected by or introduced in the RLHF alignment process. Understanding this relationship is important as real user data may be collected and used to align large models; if user data is memorized during RLHF and later regurgitated, this could raise privacy concerns. In this work, we analyze how training data memorization can surface and propagate through each phase of RLHF. We focus our study on code completion models, as code completion is one of the most popular use cases for large language models. We find that RLHF significantly decreases the chance that data used for reward modeling and reinforcement learning is memorized, in comparison to aligning via directly fine-tuning on this data, but that examples already memorized during the fine-tuning stage of RLHF, will, in the majority of cases, remain memorized after RLHF.
☆ Scalable Expressiveness through Preprocessed Graph Perturbations
Graph Neural Networks (GNNs) have emerged as the predominant method for analyzing graph-structured data. However, canonical GNNs have limited expressive power and generalization capability, thus triggering the development of more expressive yet computationally intensive methods. One such approach is to create a series of perturbed versions of input graphs and then repeatedly conduct multiple message-passing operations on all variations during training. Despite their expressive power, this approach does not scale well on larger graphs. To address this scalability issue, we introduce Scalable Expressiveness through Preprocessed Graph Perturbation (SE2P). This model offers a flexible, configurable balance between scalability and generalizability with four distinct configuration classes. At one extreme, the configuration prioritizes scalability through minimal learnable feature extraction and extensive preprocessing; at the other extreme, it enhances generalizability with more learnable feature extractions, though this increases scalability costs. We conduct extensive experiments on real-world datasets to evaluate the generalizability and scalability of SE2P variants compared to various state-of-the-art benchmarks. Our results indicate that, depending on the chosen SE2P configuration, the model can enhance generalizability compared to benchmarks while achieving significant speed improvements of up to 8-fold.
comment: 13 pages, 2 figures
☆ Tackling the Curse of Dimensionality in Fractional and Tempered Fractional PDEs with Physics-Informed Neural Networks
Fractional and tempered fractional partial differential equations (PDEs) are effective models of long-range interactions, anomalous diffusion, and non-local effects. Traditional numerical methods for these problems are mesh-based, thus struggling with the curse of dimensionality (CoD). Physics-informed neural networks (PINNs) offer a promising solution due to their universal approximation, generalization ability, and mesh-free training. In principle, Monte Carlo fractional PINN (MC-fPINN) estimates fractional derivatives using Monte Carlo methods and thus could lift CoD. However, this may cause significant variance and errors, hence affecting convergence; in addition, MC-fPINN is sensitive to hyperparameters. In general, numerical methods and specifically PINNs for tempered fractional PDEs are under-developed. Herein, we extend MC-fPINN to tempered fractional PDEs to address these issues, resulting in the Monte Carlo tempered fractional PINN (MC-tfPINN). To reduce possible high variance and errors from Monte Carlo sampling, we replace the one-dimensional (1D) Monte Carlo with 1D Gaussian quadrature, applicable to both MC-fPINN and MC-tfPINN. We validate our methods on various forward and inverse problems of fractional and tempered fractional PDEs, scaling up to 100,000 dimensions. Our improved MC-fPINN/MC-tfPINN using quadrature consistently outperforms the original versions in accuracy and convergence speed in very high dimensions.
comment: 15 pages
☆ A First Physical-World Trajectory Prediction Attack via LiDAR-induced Deceptions in Autonomous Driving USENIX Security
Trajectory prediction forecasts nearby agents' moves based on their historical trajectories. Accurate trajectory prediction is crucial for autonomous vehicles. Existing attacks compromise the prediction model of a victim AV by directly manipulating the historical trajectory of an attacker AV, which has limited real-world applicability. This paper, for the first time, explores an indirect attack approach that induces prediction errors via attacks against the perception module of a victim AV. Although it has been shown that physically realizable attacks against LiDAR-based perception are possible by placing a few objects at strategic locations, it is still an open challenge to find an object location from the vast search space in order to launch effective attacks against prediction under varying victim AV velocities. Through analysis, we observe that a prediction model is prone to an attack focusing on a single point in the scene. Consequently, we propose a novel two-stage attack framework to realize the single-point attack. The first stage of prediction-side attack efficiently identifies, guided by the distribution of detection results under object-based attacks against perception, the state perturbations for the prediction model that are effective and velocity-insensitive. In the second stage of location matching, we match the feasible object locations with the found state perturbations. Our evaluation using a public autonomous driving dataset shows that our attack causes a collision rate of up to 63% and various hazardous responses of the victim AV. The effectiveness of our attack is also demonstrated on a real testbed car. To the best of our knowledge, this study is the first security analysis spanning from LiDAR-based perception to prediction in autonomous driving, leading to a realistic attack on prediction. To counteract the proposed attack, potential defenses are discussed.
comment: In Proceedings of the 33rd USENIX Security Symposium 2024
Prompts as Auto-Optimized Training Hyperparameters: Training Best-in-Class IR Models from Scratch with 10 Gold Labels
We develop a method for training small-scale (under 100M parameter) neural information retrieval models with as few as 10 gold relevance labels. The method depends on generating synthetic queries for documents using a language model (LM), and the key step is that we automatically optimize the LM prompt that is used to generate these queries based on training quality. In experiments with the BIRCO benchmark, we find that models trained with our method outperform RankZephyr and are competitive with RankLLama, both of which are 7B parameter models trained on over 100K labels. These findings point to the power of automatic prompt optimization for synthetic dataset generation.
☆ Nemotron-4 340B Technical Report
We release the Nemotron-4 340B model family, including Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward. Our models are open access under the NVIDIA Open Model License Agreement, a permissive model license that allows distribution, modification, and use of the models and its outputs. These models perform competitively to open access models on a wide range of evaluation benchmarks, and were sized to fit on a single DGX H100 with 8 GPUs when deployed in FP8 precision. We believe that the community can benefit from these models in various research studies and commercial applications, especially for generating synthetic data to train smaller language models. Notably, over 98% of data used in our model alignment process is synthetically generated, showcasing the effectiveness of these models in generating synthetic data. To further support open research and facilitate model development, we are also open-sourcing the synthetic data generation pipeline used in our model alignment process.
☆ Multiple Descents in Unsupervised Learning: The Role of Noise, Domain Shift and Anomalies
The phenomenon of double descent has recently gained attention in supervised learning. It challenges the conventional wisdom of the bias-variance trade-off by showcasing a surprising behavior. As the complexity of the model increases, the test error initially decreases until reaching a certain point where the model starts to overfit the train set, causing the test error to rise. However, deviating from classical theory, the error exhibits another decline when exceeding a certain degree of over-parameterization. We study the presence of double descent in unsupervised learning, an area that has received little attention and is not yet fully understood. We conduct extensive experiments using under-complete auto-encoders (AEs) for various applications, such as dealing with noisy data, domain shifts, and anomalies. We use synthetic and real data and identify model-wise, epoch-wise, and sample-wise double descent for all the aforementioned applications. Finally, we assessed the usability of the AEs for detecting anomalies and mitigating the domain shift between datasets. Our findings indicate that over-parameterized models can improve performance not only in terms of reconstruction, but also in enhancing capabilities for the downstream task.
☆ Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs
Language Model Programs, i.e. sophisticated pipelines of modular language model (LM) calls, are increasingly advancing NLP tasks, but they require crafting prompts that are jointly effective for all modules. We study prompt optimization for LM programs, i.e. how to update these prompts to maximize a downstream metric without access to module-level labels or gradients. To make this tractable, we factorize our problem into optimizing the free-form instructions and few-shot demonstrations of every module and introduce several strategies to craft task-grounded instructions and navigate credit assignment across modules. Our strategies include (i) program- and data-aware techniques for proposing effective instructions, (ii) a stochastic mini-batch evaluation function for learning a surrogate model of our objective, and (iii) a meta-optimization procedure in which we refine how LMs construct proposals over time. Using these insights we develop MIPRO, a novel optimizer that outperforms baselines on five of six diverse LM programs using a best-in-class open-source model (Llama-3-8B), by as high as 12.9% accuracy. We will release our new optimizers and benchmark in DSPy at https://github.com/stanfordnlp/dspy
comment: Krista and Michael contributed equally to this work
☆ The Role of Inherent Bellman Error in Offline Reinforcement Learning with Linear Function Approximation
In this paper, we study the offline RL problem with linear function approximation. Our main structural assumption is that the MDP has low inherent Bellman error, which stipulates that linear value functions have linear Bellman backups with respect to the greedy policy. This assumption is natural in that it is essentially the minimal assumption required for value iteration to succeed. We give a computationally efficient algorithm which succeeds under a single-policy coverage condition on the dataset, namely which outputs a policy whose value is at least that of any policy which is well-covered by the dataset. Even in the setting when the inherent Bellman error is 0 (termed linear Bellman completeness), our algorithm yields the first known guarantee under single-policy coverage. In the setting of positive inherent Bellman error ${\varepsilon_{\mathrm{BE}}} > 0$, we show that the suboptimality error of our algorithm scales with $\sqrt{\varepsilon_{\mathrm{BE}}}$. Furthermore, we prove that the scaling of the suboptimality with $\sqrt{\varepsilon_{\mathrm{BE}}}$ cannot be improved for any algorithm. Our lower bound stands in contrast to many other settings in reinforcement learning with misspecification, where one can typically obtain performance that degrades linearly with the misspecification error.
☆ Edge Classification on Graphs: New Directions in Topological Imbalance
Recent years have witnessed the remarkable success of applying Graph machine learning (GML) to node/graph classification and link prediction. However, edge classification task that enjoys numerous real-world applications such as social network analysis and cybersecurity, has not seen significant advancement. To address this gap, our study pioneers a comprehensive approach to edge classification. We identify a novel `Topological Imbalance Issue', which arises from the skewed distribution of edges across different classes, affecting the local subgraph of each edge and harming the performance of edge classifications. Inspired by the recent studies in node classification that the performance discrepancy exists with varying local structural patterns, we aim to investigate if the performance discrepancy in topological imbalanced edge classification can also be mitigated by characterizing the local class distribution variance. To overcome this challenge, we introduce Topological Entropy (TE), a novel topological-based metric that measures the topological imbalance for each edge. Our empirical studies confirm that TE effectively measures local class distribution variance, and indicate that prioritizing edges with high TE values can help address the issue of topological imbalance. Based on this, we develop two strategies - Topological Reweighting and TE Wedge-based Mixup - to focus training on (synthetic) edges based on their TEs. While topological reweighting directly manipulates training edge weights according to TE, our wedge-based mixup interpolates synthetic edges between high TE wedges. Ultimately, we integrate these strategies into a novel topological imbalance strategy for edge classification: TopoEdge. Through extensive experiments, we demonstrate the efficacy of our proposed strategies on newly curated datasets and thus establish a new benchmark for (imbalanced) edge classification.
☆ Score-fPINN: Fractional Score-Based Physics-Informed Neural Networks for High-Dimensional Fokker-Planck-Levy Equations
We introduce an innovative approach for solving high-dimensional Fokker-Planck-L\'evy (FPL) equations in modeling non-Brownian processes across disciplines such as physics, finance, and ecology. We utilize a fractional score function and Physical-informed neural networks (PINN) to lift the curse of dimensionality (CoD) and alleviate numerical overflow from exponentially decaying solutions with dimensions. The introduction of a fractional score function allows us to transform the FPL equation into a second-order partial differential equation without fractional Laplacian and thus can be readily solved with standard physics-informed neural networks (PINNs). We propose two methods to obtain a fractional score function: fractional score matching (FSM) and score-fPINN for fitting the fractional score function. While FSM is more cost-effective, it relies on known conditional distributions. On the other hand, score-fPINN is independent of specific stochastic differential equations (SDEs) but requires evaluating the PINN model's derivatives, which may be more costly. We conduct our experiments on various SDEs and demonstrate numerical stability and effectiveness of our method in dealing with high-dimensional problems, marking a significant advancement in addressing the CoD in FPL equations.
comment: 16 pages, 1 figure
☆ BLoB: Bayesian Low-Rank Adaptation by Backpropagation for Large Language Models
Large Language Models (LLMs) often suffer from overconfidence during inference, particularly when adapted to downstream domain-specific tasks with limited data. Previous work addresses this issue by employing approximate Bayesian estimation after the LLMs are trained, enabling them to quantify uncertainty. However, such post-training approaches' performance is severely limited by the parameters learned during training. In this paper, we go beyond post-training Bayesianization and propose Bayesian Low-Rank Adaptation by Backpropagation (BLoB), an algorithm that continuously and jointly adjusts both the mean and covariance of LLM parameters throughout the whole fine-tuning process. Our empirical results verify the effectiveness of BLoB in terms of generalization and uncertainty estimation, when evaluated on both in-distribution and out-of-distribution data.
comment: 27 pages, 3 figures, 9 tables; preprint, work in progress
☆ Is Efficient PAC Learning Possible with an Oracle That Responds 'Yes' or 'No'?
The empirical risk minimization (ERM) principle has been highly impactful in machine learning, leading both to near-optimal theoretical guarantees for ERM-based learning algorithms as well as driving many of the recent empirical successes in deep learning. In this paper, we investigate the question of whether the ability to perform ERM, which computes a hypothesis minimizing empirical risk on a given dataset, is necessary for efficient learning: in particular, is there a weaker oracle than ERM which can nevertheless enable learnability? We answer this question affirmatively, showing that in the realizable setting of PAC learning for binary classification, a concept class can be learned using an oracle which only returns a single bit indicating whether a given dataset is realizable by some concept in the class. The sample complexity and oracle complexity of our algorithm depend polynomially on the VC dimension of the hypothesis class, thus showing that there is only a polynomial price to pay for use of our weaker oracle. Our results extend to the agnostic learning setting with a slight strengthening of the oracle, as well as to the partial concept, multiclass and real-valued learning settings. In the setting of partial concept classes, prior to our work no oracle-efficient algorithms were known, even with a standard ERM oracle. Thus, our results address a question of Alon et al. (2021) who asked whether there are algorithmic principles which enable efficient learnability in this setting.
☆ ROTI-GCV: Generalized Cross-Validation for right-ROTationally Invariant Data
Two key tasks in high-dimensional regularized regression are tuning the regularization strength for good predictions and estimating the out-of-sample risk. It is known that the standard approach -- $k$-fold cross-validation -- is inconsistent in modern high-dimensional settings. While leave-one-out and generalized cross-validation remain consistent in some high-dimensional cases, they become inconsistent when samples are dependent or contain heavy-tailed covariates. To model structured sample dependence and heavy tails, we use right-rotationally invariant covariate distributions - a crucial concept from compressed sensing. In the common modern proportional asymptotics regime where the number of features and samples grow comparably, we introduce a new framework, ROTI-GCV, for reliably performing cross-validation. Along the way, we propose new estimators for the signal-to-noise ratio and noise variance under these challenging conditions. We conduct extensive experiments that demonstrate the power of our approach and its superiority over existing methods.
comment: 25 pages, 3 figures
☆ Diffusion Generative Modelling for Divide-and-Conquer MCMC
Divide-and-conquer MCMC is a strategy for parallelising Markov Chain Monte Carlo sampling by running independent samplers on disjoint subsets of a dataset and merging their output. An ongoing challenge in the literature is to efficiently perform this merging without imposing distributional assumptions on the posteriors. We propose using diffusion generative modelling to fit density approximations to the subposterior distributions. This approach outperforms existing methods on challenging merging problems, while its computational cost scales more efficiently to high dimensional problems than existing density estimation approaches.
comment: 16 pages, 5 figures
☆ Multimodal Learning To Improve Segmentation With Intraoperative CBCT & Preoperative CT
Intraoperative medical imaging, particularly Cone-beam computed tomography (CBCT), is an important tool facilitating computer aided interventions, despite a lower visual quality. While this degraded image quality can affect downstream segmentation, the availability of high quality preoperative scans represents potential for improvements. Here we consider a setting where preoperative CT and intraoperative CBCT scans are available, however, the alignment (registration) between the scans is imperfect. We propose a multimodal learning method that fuses roughly aligned CBCT and CT scans and investigate the effect of CBCT quality and misalignment (affine and elastic transformations facilitating misalignment) on the final segmentation performance. As an application scenario, we focus on the segmentation of liver and liver tumor semantic segmentation and evaluate the effect of intraoperative image quality and misalignment on segmentation performance. To accomplish this, high quality, labelled CTs are defined as preoperative and used as a basis to simulate intraoperative CBCT. We show that the fusion of preoperative CT and simulated, intraoperative CBCT mostly improves segmentation performance and that even clearly misaligned preoperative data has the potential to improve segmentation performance.
comment: Submitted to SASHIMI2024
☆ Making Old Things New: A Unified Algorithm for Differentially Private Clustering ICML 2024
As a staple of data analysis and unsupervised learning, the problem of private clustering has been widely studied under various privacy models. Centralized differential privacy is the first of them, and the problem has also been studied for the local and the shuffle variation. In each case, the goal is to design an algorithm that computes privately a clustering, with the smallest possible error. The study of each variation gave rise to new algorithms: the landscape of private clustering algorithms is therefore quite intricate. In this paper, we show that a 20-year-old algorithm can be slightly modified to work for any of these models. This provides a unified picture: while matching almost all previously known results, it allows us to improve some of them and extend it to a new privacy model, the continual observation setting, where the input is changing over time and the algorithm must output a new solution at each time step.
comment: Oral presentation at ICML 2024
☆ Linear Bellman Completeness Suffices for Efficient Online Reinforcement Learning with Few Actions
One of the most natural approaches to reinforcement learning (RL) with function approximation is value iteration, which inductively generates approximations to the optimal value function by solving a sequence of regression problems. To ensure the success of value iteration, it is typically assumed that Bellman completeness holds, which ensures that these regression problems are well-specified. We study the problem of learning an optimal policy under Bellman completeness in the online model of RL with linear function approximation. In the linear setting, while statistically efficient algorithms are known under Bellman completeness (e.g., Jiang et al. (2017); Zanette et al. (2020)), these algorithms all rely on the principle of global optimism which requires solving a nonconvex optimization problem. In particular, it has remained open as to whether computationally efficient algorithms exist. In this paper we give the first polynomial-time algorithm for RL under linear Bellman completeness when the number of actions is any constant.
☆ Feasibility of Federated Learning from Client Databases with Different Brain Diseases and MRI Modalities
Segmentation models for brain lesions in MRI are commonly developed for a specific disease and trained on data with a predefined set of MRI modalities. Each such model cannot segment the disease using data with a different set of MRI modalities, nor can it segment any other type of disease. Moreover, this training paradigm does not allow a model to benefit from learning from heterogeneous databases that may contain scans and segmentation labels for different types of brain pathologies and diverse sets of MRI modalities. Is it feasible to use Federated Learning (FL) for training a single model on client databases that contain scans and labels of different brain pathologies and diverse sets of MRI modalities? We demonstrate promising results by combining appropriate, simple, and practical modifications to the model and training strategy: Designing a model with input channels that cover the whole set of modalities available across clients, training with random modality drop, and exploring the effects of feature normalization methods. Evaluation on 7 brain MRI databases with 5 different diseases shows that such FL framework can train a single model that is shown to be very promising in segmenting all disease types seen during training. Importantly, it is able to segment these diseases in new databases that contain sets of modalities different from those in training clients. These results demonstrate, for the first time, feasibility and effectiveness of using FL to train a single segmentation model on decentralised data with diverse brain diseases and MRI modalities, a necessary step towards leveraging heterogeneous real-world databases. Code will be made available at: https://github.com/FelixWag/FL-MultiDisease-MRI
☆ The Liouville Generator for Producing Integrable Expressions SC
There has been a growing need to devise processes that can create comprehensive datasets in the world of Computer Algebra, both for accurate benchmarking and for new intersections with machine learning technology. We present here a method to generate integrands that are guaranteed to be integrable, dubbed the LIOUVILLE method. It is based on Liouville's theorem and the Parallel Risch Algorithm for symbolic integration. We show that this data generation method retains the best qualities of previous data generation methods, while overcoming some of the issues built into that prior work. The LIOUVILLE generator is able to generate sufficiently complex and realistic integrands, and could be used for benchmarking or machine learning training tasks related to symbolic integration.
comment: To appear in proc. CASC (2024)
☆ Words in Motion: Representation Engineering for Motion Forecasting
Motion forecasting transforms sequences of past movements and environment context into future motion. Recent methods rely on learned representations, resulting in hidden states that are difficult to interpret. In this work, we use natural language to quantize motion features in a human-interpretable way, and measure the degree to which they are embedded in hidden states. Our experiments reveal that hidden states of motion sequences are arranged with respect to our discrete sets of motion features. Following these insights, we fit control vectors to motion features, which allow for controlling motion forecasts at inference. Consequently, our method enables controlling transformer-based motion forecasting models with textual inputs, providing a unique interface to interact with and understand these models. Our implementation is available at https://github.com/kit-mrt/future-motion
☆ AV-CrossNet: an Audiovisual Complex Spectral Mapping Network for Speech Separation By Leveraging Narrow- and Cross-Band Modeling
Adding visual cues to audio-based speech separation can improve separation performance. This paper introduces AV-CrossNet, an \gls{av} system for speech enhancement, target speaker extraction, and multi-talker speaker separation. AV-CrossNet is extended from the CrossNet architecture, which is a recently proposed network that performs complex spectral mapping for speech separation by leveraging global attention and positional encoding. To effectively utilize visual cues, the proposed system incorporates pre-extracted visual embeddings and employs a visual encoder comprising temporal convolutional layers. Audio and visual features are fused in an early fusion layer before feeding to AV-CrossNet blocks. We evaluate AV-CrossNet on multiple datasets, including LRS, VoxCeleb, and COG-MHEAR challenge. Evaluation results demonstrate that AV-CrossNet advances the state-of-the-art performance in all audiovisual tasks, even on untrained and mismatched datasets.
comment: 10 pages, 4 Figures, and 4 Tables
☆ Long Code Arena: a Set of Benchmarks for Long-Context Code Models
Nowadays, the fields of code and natural language processing are evolving rapidly. In particular, models become better at processing long context windows - supported context sizes have increased by orders of magnitude over the last few years. However, there is a shortage of benchmarks for code processing that go beyond a single file of context, while the most popular ones are limited to a single method. With this work, we aim to close this gap by introducing Long Code Arena, a suite of six benchmarks for code processing tasks that require project-wide context. These tasks cover different aspects of code processing: library-based code generation, CI builds repair, project-level code completion, commit message generation, bug localization, and module summarization. For each task, we provide a manually verified dataset for testing, an evaluation suite, and open-source baseline solutions based on popular LLMs to showcase the usage of the dataset and to simplify adoption by other researchers. We publish the benchmark page on HuggingFace Spaces with the leaderboard, links to HuggingFace Hub for all the datasets, and link to the GitHub repository with baselines: https://huggingface.co/spaces/JetBrains-Research/long-code-arena.
comment: 54 pages, 4 figures, 22 tables
☆ Standardizing Structural Causal Models
Synthetic datasets generated by structural causal models (SCMs) are commonly used for benchmarking causal structure learning algorithms. However, the variances and pairwise correlations in SCM data tend to increase along the causal ordering. Several popular algorithms exploit these artifacts, possibly leading to conclusions that do not generalize to real-world settings. Existing metrics like $\operatorname{Var}$-sortability and $\operatorname{R^2}$-sortability quantify these patterns, but they do not provide tools to remedy them. To address this, we propose internally-standardized structural causal models (iSCMs), a modification of SCMs that introduces a standardization operation at each variable during the generative process. By construction, iSCMs are not $\operatorname{Var}$-sortable, and as we show experimentally, not $\operatorname{R^2}$-sortable either for commonly-used graph families. Moreover, contrary to the post-hoc standardization of data generated by standard SCMs, we prove that linear iSCMs are less identifiable from prior knowledge on the weights and do not collapse to deterministic relationships in large systems, which may make iSCMs a useful model in causal inference beyond the benchmarking problem studied here.
☆ On GNN explanability with activation rules
GNNs are powerful models based on node representation learning that perform particularly well in many machine learning problems related to graphs. The major obstacle to the deployment of GNNs is mostly a problem of societal acceptability and trustworthiness, properties which require making explicit the internal functioning of such models. Here, we propose to mine activation rules in the hidden layers to understand how the GNNs perceive the world. The problem is not to discover activation rules that are individually highly discriminating for an output of the model. Instead, the challenge is to provide a small set of rules that cover all input graphs. To this end, we introduce the subjective activation pattern domain. We define an effective and principled algorithm to enumerate activations rules in each hidden layer. The proposed approach for quantifying the interest of these rules is rooted in information theory and is able to account for background knowledge on the input graph data. The activation rules can then be redescribed thanks to pattern languages involving interpretable features. We show that the activation rules provide insights on the characteristics used by the GNN to classify the graphs. Especially, this allows to identify the hidden features built by the GNN through its different layers. Also, these rules can subsequently be used for explaining GNN decisions. Experiments on both synthetic and real-life datasets show highly competitive performance, with up to 200% improvement in fidelity on explaining graph classification over the SOTA methods.
Pre-Training and Personalized Fine-Tuning via Over-the-Air Federated Meta-Learning: Convergence-Generalization Trade-Offs
For modern artificial intelligence (AI) applications such as large language models (LLMs), the training paradigm has recently shifted to pre-training followed by fine-tuning. Furthermore, owing to dwindling open repositories of data and thanks to efforts to democratize access to AI models, pre-training is expected to increasingly migrate from the current centralized deployments to federated learning (FL) implementations. Meta-learning provides a general framework in which pre-training and fine-tuning can be formalized. Meta-learning-based personalized FL (meta-pFL) moves beyond basic personalization by targeting generalization to new agents and tasks. This paper studies the generalization performance of meta-pFL for a wireless setting in which the agents participating in the pre-training phase, i.e., meta-learning, are connected via a shared wireless channel to the server. Adopting over-the-air computing, we study the trade-off between generalization to new agents and tasks, on the one hand, and convergence, on the other hand. The trade-off arises from the fact that channel impairments may enhance generalization, while degrading convergence. Extensive numerical results validate the theory.
comment: 37 pages, 7 figures, submitted for possible journal publication
☆ An Imitative Reinforcement Learning Framework for Autonomous Dogfight
Unmanned Combat Aerial Vehicle (UCAV) dogfight, which refers to a fight between two or more UCAVs usually at close quarters, plays a decisive role on the aerial battlefields. With the evolution of artificial intelligence, dogfight progressively transits towards intelligent and autonomous modes. However, the development of autonomous dogfight policy learning is hindered by challenges such as weak exploration capabilities, low learning efficiency, and unrealistic simulated environments. To overcome these challenges, this paper proposes a novel imitative reinforcement learning framework, which efficiently leverages expert data while enabling autonomous exploration. The proposed framework not only enhances learning efficiency through expert imitation, but also ensures adaptability to dynamic environments via autonomous exploration with reinforcement learning. Therefore, the proposed framework can learn a successful dogfight policy of 'pursuit-lock-launch' for UCAVs. To support data-driven learning, we establish a dogfight environment based on the Harfang3D sandbox, where we conduct extensive experiments. The results indicate that the proposed framework excels in multistage dogfight, significantly outperforms state-of-the-art reinforcement learning and imitation learning methods. Thanks to the ability of imitating experts and autonomous exploration, our framework can quickly learn the critical knowledge in complex aerial combat tasks, achieving up to a 100% success rate and demonstrating excellent robustness.
☆ GECOBench: A Gender-Controlled Text Dataset and Benchmark for Quantifying Biases in Explanations
Large pre-trained language models have become popular for many applications and form an important backbone of many downstream tasks in natural language processing (NLP). Applying 'explainable artificial intelligence' (XAI) techniques to enrich such models' outputs is considered crucial for assuring their quality and shedding light on their inner workings. However, large language models are trained on a plethora of data containing a variety of biases, such as gender biases, affecting model weights and, potentially, behavior. Currently, it is unclear to what extent such biases also impact model explanations in possibly unfavorable ways. We create a gender-controlled text dataset, GECO, in which otherwise identical sentences appear in male and female forms. This gives rise to ground-truth 'world explanations' for gender classification tasks, enabling the objective evaluation of the correctness of XAI methods. We also provide GECOBench, a rigorous quantitative evaluation framework benchmarking popular XAI methods, applying them to pre-trained language models fine-tuned to different degrees. This allows us to investigate how pre-training induces undesirable bias in model explanations and to what extent fine-tuning can mitigate such explanation bias. We show a clear dependency between explanation performance and the number of fine-tuned layers, where XAI methods are observed to particularly benefit from fine-tuning or complete retraining of embedding layers. Remarkably, this relationship holds for models achieving similar classification performance on the same task. With that, we highlight the utility of the proposed gender-controlled dataset and novel benchmarking approach for research and development of novel XAI methods. All code including dataset generation, model training, evaluation and visualization is available at: https://github.com/braindatalab/gecobench
comment: Under review
☆ Do Parameters Reveal More than Loss for Membership Inference? ICML 2024
Membership inference attacks aim to infer whether an individual record was used to train a model, serving as a key tool for disclosure auditing. While such evaluations are useful to demonstrate risk, they are computationally expensive and often make strong assumptions about potential adversaries' access to models and training environments, and thus do not provide very tight bounds on leakage from potential attacks. We show how prior claims around black-box access being sufficient for optimal membership inference do not hold for most useful settings such as stochastic gradient descent, and that optimal membership inference indeed requires white-box access. We validate our findings with a new white-box inference attack IHA (Inverse Hessian Attack) that explicitly uses model parameters by taking advantage of computing inverse-Hessian vector products. Our results show that both audits and adversaries may be able to benefit from access to model parameters, and we advocate for further research into white-box methods for membership privacy auditing.
comment: Accepted at High-dimensional Learning Dynamics (HiLD) Workshop, ICML 2024
☆ Explainable Artificial Intelligence and Multicollinearity : A Mini Review of Current Approaches
Explainable Artificial Intelligence (XAI) methods help to understand the internal mechanism of machine learning models and how they reach a specific decision or made a specific action. The list of informative features is one of the most common output of XAI methods. Multicollinearity is one of the big issue that should be considered when XAI generates the explanation in terms of the most informative features in an AI system. No review has been dedicated to investigate the current approaches to handle such significant issue. In this paper, we provide a review of the current state-of-the-art approaches in relation to the XAI in the context of recent advances in dealing with the multicollinearity issue. To do so, we searched in three repositories that are: Web of Science, Scopus and IEEE Xplore to find pertinent published papers. After excluding irrelevant papers, seven papers were considered in the review. In addition, we discuss the current XAI methods and their limitations in dealing with the multicollinearity and suggest future directions.
☆ FullCert: Deterministic End-to-End Certification for Training and Inference of Neural Networks
Modern machine learning models are sensitive to the manipulation of both the training data (poisoning attacks) and inference data (adversarial examples). Recognizing this issue, the community has developed many empirical defenses against both attacks and, more recently, provable certification methods against inference-time attacks. However, such guarantees are still largely lacking for training-time attacks. In this work, we present FullCert, the first end-to-end certifier with sound, deterministic bounds, which proves robustness against both training-time and inference-time attacks. We first bound all possible perturbations an adversary can make to the training data under the considered threat model. Using these constraints, we bound the perturbations' influence on the model's parameters. Finally, we bound the impact of these parameter changes on the model's prediction, resulting in joint robustness guarantees against poisoning and adversarial examples. To facilitate this novel certification paradigm, we combine our theoretical work with a new open-source library BoundFlow, which enables model training on bounded datasets. We experimentally demonstrate FullCert's feasibility on two different datasets.
☆ Revisiting Spurious Correlation in Domain Generalization
Without loss of generality, existing machine learning techniques may learn spurious correlation dependent on the domain, which exacerbates the generalization of models in out-of-distribution (OOD) scenarios. To address this issue, recent works build a structural causal model (SCM) to describe the causality within data generation process, thereby motivating methods to avoid the learning of spurious correlation by models. However, from the machine learning viewpoint, such a theoretical analysis omits the nuanced difference between the data generation process and representation learning process, resulting in that the causal analysis based on the former cannot well adapt to the latter. To this end, we explore to build a SCM for representation learning process and further conduct a thorough analysis of the mechanisms underlying spurious correlation. We underscore that adjusting erroneous covariates introduces bias, thus necessitating the correct selection of spurious correlation mechanisms based on practical application scenarios. In this regard, we substantiate the correctness of the proposed SCM and further propose to control confounding bias in OOD generalization by introducing a propensity score weighted estimator, which can be integrated into any existing OOD method as a plug-and-play module. The empirical results comprehensively demonstrate the effectiveness of our method on synthetic and large-scale real OOD datasets.
☆ On the Feasibility of Fidelity$^-$ for Graph Pruning IJCAI
As one of popular quantitative metrics to assess the quality of explanation of graph neural networks (GNNs), fidelity measures the output difference after removing unimportant parts of the input graph. Fidelity has been widely used due to its straightforward interpretation that the underlying model should produce similar predictions when features deemed unimportant from the explanation are removed. This raises a natural question: "Does fidelity induce a global (soft) mask for graph pruning?" To solve this, we aim to explore the potential of the fidelity measure to be used for graph pruning, eventually enhancing the GNN models for better efficiency. To this end, we propose Fidelity$^-$-inspired Pruning (FiP), an effective framework to construct global edge masks from local explanations. Our empirical observations using 7 edge attribution methods demonstrate that, surprisingly, general eXplainable AI methods outperform methods tailored to GNNs in terms of graph pruning performance.
comment: 6 pages, 3 figures, 2 tables; IJCAI Workshop on Explainable AI (XAI 2024) (to appear) (Please cite our workshop version.)
☆ Teleporter Theory: A General and Simple Approach for Modeling Cross-World Counterfactual Causality
Leveraging the development of structural causal model (SCM), researchers can establish graphical models for exploring the causal mechanisms behind machine learning techniques. As the complexity of machine learning applications rises, single-world interventionism causal analysis encounters theoretical adaptation limitations. Accordingly, cross-world counterfactual approach extends our understanding of causality beyond observed data, enabling hypothetical reasoning about alternative scenarios. However, the joint involvement of cross-world variables, encompassing counterfactual variables and real-world variables, challenges the construction of the graphical model. Twin network is a subtle attempt, establishing a symbiotic relationship, to bridge the gap between graphical modeling and the introduction of counterfactuals albeit with room for improvement in generalization. In this regard, we demonstrate the theoretical breakdowns of twin networks in certain cross-world counterfactual scenarios. To this end, we propose a novel teleporter theory to establish a general and simple graphical representation of counterfactuals, which provides criteria for determining teleporter variables to connect multiple worlds. In theoretical application, we determine that introducing the proposed teleporter theory can directly obtain the conditional independence between counterfactual variables and real-world variables from the cross-world SCM without requiring complex algebraic derivations. Accordingly, we can further identify counterfactual causal effects through cross-world symbolic derivation. We demonstrate the generality of the teleporter theory to the practical application. Adhering to the proposed theory, we build a plug-and-play module, and the effectiveness of which are substantiated by experiments on benchmarks.
☆ Interventional Imbalanced Multi-Modal Representation Learning via $β$-Generalization Front-Door Criterion
Multi-modal methods establish comprehensive superiority over uni-modal methods. However, the imbalanced contributions of different modalities to task-dependent predictions constantly degrade the discriminative performance of canonical multi-modal methods. Based on the contribution to task-dependent predictions, modalities can be identified as predominant and auxiliary modalities. Benchmark methods raise a tractable solution: augmenting the auxiliary modality with a minor contribution during training. However, our empirical explorations challenge the fundamental idea behind such behavior, and we further conclude that benchmark approaches suffer from certain defects: insufficient theoretical interpretability and limited exploration capability of discriminative knowledge. To this end, we revisit multi-modal representation learning from a causal perspective and build the Structural Causal Model. Following the empirical explorations, we determine to capture the true causality between the discriminative knowledge of predominant modality and predictive label while considering the auxiliary modality. Thus, we introduce the $\beta$-generalization front-door criterion. Furthermore, we propose a novel network for sufficiently exploring multi-modal discriminative knowledge. Rigorous theoretical analyses and various empirical evaluations are provided to support the effectiveness of the innate mechanism behind our proposed method.
☆ Analysing zero-shot temporal relation extraction on clinical notes using temporal consistency
This paper presents the first study for temporal relation extraction in a zero-shot setting focusing on biomedical text. We employ two types of prompts and five LLMs (GPT-3.5, Mixtral, Llama 2, Gemma, and PMC-LLaMA) to obtain responses about the temporal relations between two events. Our experiments demonstrate that LLMs struggle in the zero-shot setting performing worse than fine-tuned specialized models in terms of F1 score, showing that this is a challenging task for LLMs. We further contribute a novel comprehensive temporal analysis by calculating consistency scores for each LLM. Our findings reveal that LLMs face challenges in providing responses consistent to the temporal properties of uniqueness and transitivity. Moreover, we study the relation between the temporal consistency of an LLM and its accuracy and whether the latter can be improved by solving temporal inconsistencies. Our analysis shows that even when temporal consistency is achieved, the predictions can remain inaccurate.
☆ Active clustering with bandit feedback
We investigate the Active Clustering Problem (ACP). A learner interacts with an $N$-armed stochastic bandit with $d$-dimensional subGaussian feedback. There exists a hidden partition of the arms into $K$ groups, such that arms within the same group, share the same mean vector. The learner's task is to uncover this hidden partition with the smallest budget - i.e., the least number of observation - and with a probability of error smaller than a prescribed constant $\delta$. In this paper, (i) we derive a non-asymptotic lower bound for the budget, and (ii) we introduce the computationally efficient ACB algorithm, whose budget matches the lower bound in most regimes. We improve on the performance of a uniform sampling strategy. Importantly, contrary to the batch setting, we establish that there is no computation-information gap in the active setting.
comment: 50 pages
☆ Constrained Reinforcement Learning with Average Reward Objective: Model-Based and Model-Free Algorithms
Reinforcement Learning (RL) serves as a versatile framework for sequential decision-making, finding applications across diverse domains such as robotics, autonomous driving, recommendation systems, supply chain optimization, biology, mechanics, and finance. The primary objective in these applications is to maximize the average reward. Real-world scenarios often necessitate adherence to specific constraints during the learning process. This monograph focuses on the exploration of various model-based and model-free approaches for Constrained RL within the context of average reward Markov Decision Processes (MDPs). The investigation commences with an examination of model-based strategies, delving into two foundational methods - optimism in the face of uncertainty and posterior sampling. Subsequently, the discussion transitions to parametrized model-free approaches, where the primal-dual policy gradient-based algorithm is explored as a solution for constrained MDPs. The monograph provides regret guarantees and analyzes constraint violation for each of the discussed setups. For the above exploration, we assume the underlying MDP to be ergodic. Further, this monograph extends its discussion to encompass results tailored for weakly communicating MDPs, thereby broadening the scope of its findings and their relevance to a wider range of practical scenarios.
comment: arXiv admin note: text overlap with arXiv:2402.02042; text overlap with arXiv:2202.00150 by other authors
Just How Flexible are Neural Networks in Practice?
It is widely believed that a neural network can fit a training set containing at least as many samples as it has parameters, underpinning notions of overparameterized and underparameterized models. In practice, however, we only find solutions accessible via our training procedure, including the optimizer and regularizers, limiting flexibility. Moreover, the exact parameterization of the function class, built into an architecture, shapes its loss surface and impacts the minima we find. In this work, we examine the ability of neural networks to fit data in practice. Our findings indicate that: (1) standard optimizers find minima where the model can only fit training sets with significantly fewer samples than it has parameters; (2) convolutional networks are more parameter-efficient than MLPs and ViTs, even on randomly labeled data; (3) while stochastic training is thought to have a regularizing effect, SGD actually finds minima that fit more training data than full-batch gradient descent; (4) the difference in capacity to fit correctly labeled and incorrectly labeled samples can be predictive of generalization; (5) ReLU activation functions result in finding minima that fit more data despite being designed to avoid vanishing and exploding gradients in deep architectures.
☆ Adversaries With Incentives: A Strategic Alternative to Adversarial Robustness
Adversarial training aims to defend against *adversaries*: malicious opponents whose sole aim is to harm predictive performance in any way possible - a rather harsh perspective, which we assert results in unnecessarily conservative models. Instead, we propose to model opponents as simply pursuing their own goals, rather than working directly against the classifier. Employing tools from strategic modeling, our approach uses knowledge or beliefs regarding the opponent's possible incentives as inductive bias for learning. Our method of *strategic training* is designed to defend against opponents within an *incentive uncertainty set*: this resorts to adversarial learning when the set is maximal, but offers potential gains when it can be appropriately reduced. We conduct a series of experiments that show how even mild knowledge regarding the adversary's incentives can be useful, and that the degree of potential gains depends on how incentives relate to the structure of the learning task.
☆ Calibrating Where It Matters: Constrained Temperature Scaling NeurIPS 2023
We consider calibration of convolutional classifiers for diagnostic decision making. Clinical decision makers can use calibrated classifiers to minimise expected costs given their own cost function. Such functions are usually unknown at training time. If minimising expected costs is the primary aim, algorithms should focus on tuning calibration in regions of probability simplex likely to effect decisions. We give an example, modifying temperature scaling calibration, and demonstrate improved calibration where it matters using convnets trained to classify dermoscopy images.
comment: Presented at Medical Imaging Meets NeurIPS 2023
☆ PrAViC: Probabilistic Adaptation Framework for Real-Time Video Classification
Video processing is generally divided into two main categories: processing of the entire video, which typically yields optimal classification outcomes, and real-time processing, where the objective is to make a decision as promptly as possible. The latter is often driven by the need to identify rapidly potential critical or dangerous situations. These could include machine failure, traffic accidents, heart problems, or dangerous behavior. Although the models dedicated to the processing of entire videos are typically well-defined and clearly presented in the literature, this is not the case for online processing, where a plethora of hand-devised methods exist. To address this, we present \our{}, a novel, unified, and theoretically-based adaptation framework for dealing with the online classification problem for video data. The initial phase of our study is to establish a robust mathematical foundation for the theory of classification of sequential data, with the potential to make a decision at an early stage. This allows us to construct a natural function that encourages the model to return an outcome much faster. The subsequent phase is to demonstrate a straightforward and readily implementable method for adapting offline models to online and recurrent operations. Finally, by comparing the proposed approach to the non-online state-of-the-art baseline, it is demonstrated that the use of \our{} encourages the network to make earlier classification decisions without compromising accuracy.
☆ Analysing the Behaviour of Tree-Based Neural Networks in Regression Tasks
The landscape of deep learning has vastly expanded the frontiers of source code analysis, particularly through the utilization of structural representations such as Abstract Syntax Trees (ASTs). While these methodologies have demonstrated effectiveness in classification tasks, their efficacy in regression applications, such as execution time prediction from source code, remains underexplored. This paper endeavours to decode the behaviour of tree-based neural network models in the context of such regression challenges. We extend the application of established models--tree-based Convolutional Neural Networks (CNNs), Code2Vec, and Transformer-based methods--to predict the execution time of source code by parsing it to an AST. Our comparative analysis reveals that while these models are benchmarks in code representation, they exhibit limitations when tasked with regression. To address these deficiencies, we propose a novel dual-transformer approach that operates on both source code tokens and AST representations, employing cross-attention mechanisms to enhance interpretability between the two domains. Furthermore, we explore the adaptation of Graph Neural Networks (GNNs) to this tree-based problem, theorizing the inherent compatibility due to the graphical nature of ASTs. Empirical evaluations on real-world datasets showcase that our dual-transformer model outperforms all other tree-based neural networks and the GNN-based models. Moreover, our proposed dual transformer demonstrates remarkable adaptability and robust performance across diverse datasets.
comment: This Paper is submitted to IEEE Transactions on Neural Networks and Learning Systems
☆ DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer
Large-scale diffusion models have shown outstanding generative abilities across multiple modalities including images, videos, and audio. However, text-to-speech (TTS) systems typically involve domain-specific modeling factors (e.g., phonemes and phoneme-level durations) to ensure precise temporal alignments between text and speech, which hinders the efficiency and scalability of diffusion models for TTS. In this work, we present an efficient and scalable Diffusion Transformer (DiT) that utilizes off-the-shelf pre-trained text and speech encoders. Our approach addresses the challenge of text-speech alignment via cross-attention mechanisms with the prediction of the total length of speech representations. To achieve this, we enhance the DiT architecture to suit TTS and improve the alignment by incorporating semantic guidance into the latent space of speech. We scale the training dataset and the model size to 82K hours and 790M parameters, respectively. Our extensive experiments demonstrate that the large-scale diffusion model for TTS without domain-specific modeling not only simplifies the training pipeline but also yields superior or comparable zero-shot performance to state-of-the-art TTS models in terms of naturalness, intelligibility, and speaker similarity. Our speech samples are available at https://ditto-tts.github.io.
☆ Dredge Word, Social Media, and Webgraph Networks for Unreliable Website Classification and Identification
In an attempt to mimic the complex paths through which unreliable content spreads between search engines and social media, we explore the impact of incorporating both webgraph and large-scale social media contexts into website credibility classification and discovery systems. We further explore the usage of what we define as \textit{dredge words} on social media -- terms or phrases for which unreliable domains rank highly. Through comprehensive graph neural network ablations, we demonstrate that curriculum-based heterogeneous graph models that leverage context from both webgraphs and social media data outperform homogeneous and single-mode approaches. We further demonstrate that the incorporation of dredge words into our model strongly associates unreliable websites with social media and online commerce platforms. Finally, we show our heterogeneous model greatly outperforms competing systems in the top-k identification of unlabeled unreliable websites. We demonstrate the strong unreliability signals present in the diverse paths that users follow to uncover unreliable content, and we release a novel dataset of dredge words.
☆ Cross-domain Open-world Discovery
In many real-world applications, test data may commonly exhibit categorical shifts, characterized by the emergence of novel classes, as well as distribution shifts arising from feature distributions different from the ones the model was trained on. However, existing methods either discover novel classes in the open-world setting or assume domain shifts without the ability to discover novel classes. In this work, we consider a cross-domain open-world discovery setting, where the goal is to assign samples to seen classes and discover unseen classes under a domain shift. To address this challenging problem, we present CROW, a prototype-based approach that introduces a cluster-then-match strategy enabled by a well-structured representation space of foundation models. In this way, CROW discovers novel classes by robustly matching clusters with previously seen classes, followed by fine-tuning the representation space using an objective designed for cross-domain open-world discovery. Extensive experimental results on image classification benchmark datasets demonstrate that CROW outperforms alternative baselines, achieving an 8% average performance improvement across 75 experimental settings.
comment: 18 pages, 6 figures, 24 tables
☆ Evaluating Open Language Models Across Task Types, Application Domains, and Reasoning Types: An In-Depth Experimental Analysis
The rapid rise of Language Models (LMs) has expanded their use in several applications. Yet, due to constraints of model size, associated cost, or proprietary restrictions, utilizing state-of-the-art (SOTA) LLMs is not always feasible. With open, smaller LMs emerging, more applications can leverage their capabilities, but selecting the right LM can be challenging. This work conducts an in-depth experimental analysis of the semantic correctness of outputs of 10 smaller, open LMs across three aspects: task types, application domains and reasoning types, using diverse prompt styles. We demonstrate that most effective models and prompt styles vary depending on the specific requirements. Our analysis provides a comparative assessment of LMs and prompt styles using a proposed three-tier schema of aspects for their strategic selection based on use-case and other constraints. We also show that if utilized appropriately, these LMs can compete with, and sometimes outperform, SOTA LLMs like DeepSeek-v2, GPT-3.5-Turbo, and GPT-4o.
☆ DistPred: A Distribution-Free Probabilistic Inference Method for Regression and Forecasting
Traditional regression and prediction tasks often only provide deterministic point estimates. To estimate the uncertainty or distribution information of the response variable, methods such as Bayesian inference, model ensembling, or MC Dropout are typically used. These methods either assume that the posterior distribution of samples follows a Gaussian process or require thousands of forward passes for sample generation. We propose a novel approach called DistPred for regression and forecasting tasks, which overcomes the limitations of existing methods while remaining simple and powerful. Specifically, we transform proper scoring rules that measure the discrepancy between the predicted distribution and the target distribution into a differentiable discrete form and use it as a loss function to train the model end-to-end. This allows the model to sample numerous samples in a single forward pass to estimate the potential distribution of the response variable. We have compared our method with several existing approaches on multiple datasets and achieved state-of-the-art performance. Additionally, our method significantly improves computational efficiency. For example, compared to state-of-the-art models, DistPred has a 90x faster inference speed. Experimental results can be reproduced through https://github.com/Anoise/DistPred.
☆ P-TA: Using Proximal Policy Optimization to Enhance Tabular Data Augmentation via Large Language Models ACL 2024
A multitude of industries depend on accurate and reasonable tabular data augmentation for their business processes. Contemporary methodologies in generating tabular data revolve around utilizing Generative Adversarial Networks (GAN) or fine-tuning Large Language Models (LLM). However, GAN-based approaches are documented to produce samples with common-sense errors attributed to the absence of external knowledge. On the other hand, LLM-based methods exhibit a limited capacity to capture the disparities between synthesized and actual data distribution due to the absence of feedback from a discriminator during training. Furthermore, the decoding of LLM-based generation introduces gradient breakpoints, impeding the backpropagation of loss from a discriminator, thereby complicating the integration of these two approaches. To solve this challenge, we propose using proximal policy optimization (PPO) to apply GANs, guiding LLMs to enhance the probability distribution of tabular features. This approach enables the utilization of LLMs as generators for GANs in synthesizing tabular data. Our experiments demonstrate that PPO leads to an approximately 4\% improvement in the accuracy of models trained on synthetically generated data over state-of-the-art across three real-world datasets.
comment: The paper was accepted by findings of ACL 2024
☆ Unfolding Time: Generative Modeling for Turbulent Flows in 4D ICML 2024
A recent study in turbulent flow simulation demonstrated the potential of generative diffusion models for fast 3D surrogate modeling. This approach eliminates the need for specifying initial states or performing lengthy simulations, significantly accelerating the process. While adept at sampling individual frames from the learned manifold of turbulent flow states, the previous model lacks the capability to generate sequences, hindering analysis of dynamic phenomena. This work addresses this limitation by introducing a 4D generative diffusion model and a physics-informed guidance technique that enables the generation of realistic sequences of flow states. Our findings indicate that the proposed method can successfully sample entire subsequences from the turbulent manifold, even though generalizing from individual frames to sequences remains a challenging task. This advancement opens doors for the application of generative modeling in analyzing the temporal evolution of turbulent flows, providing valuable insights into their complex dynamics.
comment: AI4Science Workshop @ ICML 2024
☆ SEFraud: Graph-based Self-Explainable Fraud Detection via Interpretative Mask Learning KDD 2024
Graph-based fraud detection has widespread application in modern industry scenarios, such as spam review and malicious account detection. While considerable efforts have been devoted to designing adequate fraud detectors, the interpretability of their results has often been overlooked. Previous works have attempted to generate explanations for specific instances using post-hoc explaining methods such as a GNNExplainer. However, post-hoc explanations can not facilitate the model predictions and the computational cost of these methods cannot meet practical requirements, thus limiting their application in real-world scenarios. To address these issues, we propose SEFraud, a novel graph-based self-explainable fraud detection framework that simultaneously tackles fraud detection and result in interpretability. Concretely, SEFraud first leverages customized heterogeneous graph transformer networks with learnable feature masks and edge masks to learn expressive representations from the informative heterogeneously typed transactions. A new triplet loss is further designed to enhance the performance of mask learning. Empirical results on various datasets demonstrate the effectiveness of SEFraud as it shows considerable advantages in both the fraud detection performance and interpretability of prediction results. Moreover, SEFraud has been deployed and offers explainable fraud detection service for the largest bank in China, Industrial and Commercial Bank of China Limited (ICBC). Results collected from the production environment of ICBC show that SEFraud can provide accurate detection results and comprehensive explanations that align with the expert business understanding, confirming its efficiency and applicability in large-scale online services.
comment: Accepted by KDD 2024
☆ Fairer Preferences Elicit Improved Human-Aligned Large Language Model Judgments
Large language models (LLMs) have shown promising abilities as cost-effective and reference-free evaluators for assessing language generation quality. In particular, pairwise LLM evaluators, which compare two generated texts and determine the preferred one, have been employed in a wide range of applications. However, LLMs exhibit preference biases and worrying sensitivity to prompt designs. In this work, we first reveal that the predictive preference of LLMs can be highly brittle and skewed, even with semantically equivalent instructions. We find that fairer predictive preferences from LLMs consistently lead to judgments that are better aligned with humans. Motivated by this phenomenon, we propose an automatic Zero-shot Evaluation-oriented Prompt Optimization framework, ZEPO, which aims to produce fairer preference decisions and improve the alignment of LLM evaluators with human judgments. To this end, we propose a zero-shot learning objective based on the preference decision fairness. ZEPO demonstrates substantial performance improvements over state-of-the-art LLM evaluators, without requiring labeled data, on representative meta-evaluation benchmarks. Our findings underscore the critical correlation between preference fairness and human alignment, positioning ZEPO as an efficient prompt optimizer for bridging the gap between LLM evaluators and human judgments.
comment: 5 pages, 3 figures, 1 table (12 pages, 4 figures, 6 tables including references and appendices)
☆ $\texttt{MoE-RBench}$: Towards Building Reliable Language Models with Sparse Mixture-of-Experts ICML2024
Mixture-of-Experts (MoE) has gained increasing popularity as a promising framework for scaling up large language models (LLMs). However, the reliability assessment of MoE lags behind its surging applications. Moreover, when transferred to new domains such as in fine-tuning MoE models sometimes underperform their dense counterparts. Motivated by the research gap and counter-intuitive phenomenon, we propose $\texttt{MoE-RBench}$, the first comprehensive assessment of SMoE reliability from three aspects: $\textit{(i)}$ safety and hallucination, $\textit{(ii)}$ resilience to adversarial attacks, and $\textit{(iii)}$ out-of-distribution robustness. Extensive models and datasets are tested to compare the MoE to dense networks from these reliability dimensions. Our empirical observations suggest that with appropriate hyperparameters, training recipes, and inference techniques, we can build the MoE model more reliably than the dense LLM. In particular, we find that the robustness of SMoE is sensitive to the basic training settings. We hope that this study can provide deeper insights into how to adapt the pre-trained MoE model to other tasks with higher-generation security, quality, and stability. Codes are available at https://github.com/UNITES-Lab/MoE-RBench
comment: 9 pages, 8 figures, camera ready on ICML2024
☆ CM2-Net: Continual Cross-Modal Mapping Network for Driver Action Recognition
Driver action recognition has significantly advanced in enhancing driver-vehicle interactions and ensuring driving safety by integrating multiple modalities, such as infrared and depth. Nevertheless, compared to RGB modality only, it is always laborious and costly to collect extensive data for all types of non-RGB modalities in car cabin environments. Therefore, previous works have suggested independently learning each non-RGB modality by fine-tuning a model pre-trained on RGB videos, but these methods are less effective in extracting informative features when faced with newly-incoming modalities due to large domain gaps. In contrast, we propose a Continual Cross-Modal Mapping Network (CM2-Net) to continually learn each newly-incoming modality with instructive prompts from the previously-learned modalities. Specifically, we have developed Accumulative Cross-modal Mapping Prompting (ACMP), to map the discriminative and informative features learned from previous modalities into the feature space of newly-incoming modalities. Then, when faced with newly-incoming modalities, these mapped features are able to provide effective prompts for which features should be extracted and prioritized. These prompts are accumulating throughout the continual learning process, thereby boosting further recognition performances. Extensive experiments conducted on the Drive&Act dataset demonstrate the performance superiority of CM2-Net on both uni- and multi-modal driver action recognition.
☆ They're All Doctors: Synthesizing Diverse Counterfactuals to Mitigate Associative Bias
Vision Language Models (VLMs) such as CLIP are powerful models; however they can exhibit unwanted biases, making them less safe when deployed directly in applications such as text-to-image, text-to-video retrievals, reverse search, or classification tasks. In this work, we propose a novel framework to generate synthetic counterfactual images to create a diverse and balanced dataset that can be used to fine-tune CLIP. Given a set of diverse synthetic base images from text-to-image models, we leverage off-the-shelf segmentation and inpainting models to place humans with diverse visual appearances in context. We show that CLIP trained on such datasets learns to disentangle the human appearance from the context of an image, i.e., what makes a doctor is not correlated to the person's visual appearance, like skin color or body type, but to the context, such as background, the attire they are wearing, or the objects they are holding. We demonstrate that our fine-tuned CLIP model, $CF_\alpha$, improves key fairness metrics such as MaxSkew, MinSkew, and NDKL by 40-66\% for image retrieval tasks, while still achieving similar levels of performance in downstream tasks. We show that, by design, our model retains maximal compatibility with the original CLIP models, and can be easily controlled to support different accuracy versus fairness trade-offs in a plug-n-play fashion.
☆ Deep-Learning-Based Channel Estimation for Distributed MIMO with 1-bit Radio-Over-Fiber Fronthaul
We consider the problem of pilot-aided, uplink channel estimation in a distributed massive multiple-input multiple-output (MIMO) architecture, in which the access points are connected to a central processing unit via fiber-optical fronthaul links, carrying a two-level-quantized version of the received analog radio-frequency signal. We adapt to this architecture the deep-learning-based channel-estimation algorithm recently proposed by Nguyen et al. (2023), and explore its robustness to the additional signal distortions (beyond 1-bit quantization) introduced in the considered architecture by the automatic gain controllers (AGCs) and by the comparators. These components are used at the access points to generate the two-level analog waveform from the received signal. Via simulation results, we illustrate that the proposed channel-estimation method outperforms significantly the Bussgang linear minimum mean-square error channel estimator, and it is robust against the additional impairments introduced by the AGCs and the comparators.
☆ Reconfigurable Intelligent Surface Assisted VEC Based on Multi-Agent Reinforcement Learning
Vehicular edge computing (VEC) is an emerging technology that enables vehicles to perform high-intensity tasks by executing tasks locally or offloading them to nearby edge devices. However, obstacles such as buildings may degrade the communications and incur communication interruptions, and thus the vehicle may not meet the requirement for task offloading. Reconfigurable intelligent surfaces (RIS) is introduced to support vehicle communication and provide an alternative communication path. The system performance can be improved by flexibly adjusting the phase-shift of the RIS. For RIS-assisted VEC system where tasks arrive randomly, we design a control scheme that considers offloading power, local power allocation and phase-shift optimization. To solve this non-convex problem, we propose a new deep reinforcement learning (DRL) framework that employs modified multi-agent deep deterministic policy gradient (MADDPG) approach to optimize the power allocation for vehicle users (VUs) and block coordinate descent (BCD) algorithm to optimize the phase-shift of the RIS. Simulation results show that our proposed scheme outperforms the centralized deep deterministic policy gradient (DDPG) scheme and random scheme.
comment: This paper has been submitted to IEEE Journal. The source code has been released at: https://github.com/qiongwu86/RIS-VEC-MARL.git
☆ Improved Algorithms for Contextual Dynamic Pricing
In contextual dynamic pricing, a seller sequentially prices goods based on contextual information. Buyers will purchase products only if the prices are below their valuations. The goal of the seller is to design a pricing strategy that collects as much revenue as possible. We focus on two different valuation models. The first assumes that valuations linearly depend on the context and are further distorted by noise. Under minor regularity assumptions, our algorithm achieves an optimal regret bound of $\tilde{\mathcal{O}}(T^{2/3})$, improving the existing results. The second model removes the linearity assumption, requiring only that the expected buyer valuation is $\beta$-H\"older in the context. For this model, our algorithm obtains a regret $\tilde{\mathcal{O}}(T^{d+2\beta/d+3\beta})$, where $d$ is the dimension of the context space.
☆ Federated Active Learning Framework for Efficient Annotation Strategy in Skin-lesion Classification
Federated Learning (FL) enables multiple institutes to train models collaboratively without sharing private data. Current FL research focuses on communication efficiency, privacy protection, and personalization and assumes that the data of FL have already been ideally collected. In medical scenarios, however, data annotation demands both expertise and intensive labor, which is a critical problem in FL. Active learning (AL), has shown promising performance in reducing the number of data annotations in medical image analysis. We propose a federated AL (FedAL) framework in which AL is executed periodically and interactively under FL. We exploit a local model in each hospital and a global model acquired from FL to construct an ensemble. We use ensemble-entropy-based AL as an efficient data-annotation strategy in FL. Therefore, our FedAL framework can decrease the amount of annotated data and preserve patient privacy while maintaining the performance of FL. To our knowledge, this is the first FedAL framework applied to medical images. We validated our framework on real-world dermoscopic datasets. Using only 50% of samples, our framework was able to achieve state-of-the-art performance on a skin-lesion classification task. Our framework performed better than several state-of-the-art AL methods under FL and achieved comparable performance to full-data FL.
comment: 14 pages, 3 figures
☆ Management Decisions in Manufacturing using Causal Machine Learning -- To Rework, or not to Rework?
In this paper, we present a data-driven model for estimating optimal rework policies in manufacturing systems. We consider a single production stage within a multistage, lot-based system that allows for optional rework steps. While the rework decision depends on an intermediate state of the lot and system, the final product inspection, and thus the assessment of the actual yield, is delayed until production is complete. Repair steps are applied uniformly to the lot, potentially improving some of the individual items while degrading others. The challenge is thus to balance potential yield improvement with the rework costs incurred. Given the inherently causal nature of this decision problem, we propose a causal model to estimate yield improvement. We apply methods from causal machine learning, in particular double/debiased machine learning (DML) techniques, to estimate conditional treatment effects from data and derive policies for rework decisions. We validate our decision model using real-world data from opto-electronic semiconductor manufacturing, achieving a yield improvement of 2 - 3% during the color-conversion process of white light-emitting diodes (LEDs).
comment: 30 pages, 10 figures
☆ Optimizing and Testing Instruction-Following: Analyzing the Impact of Fine-Grained Instruction Variants on instruction-tuned LLMs
The effective alignment of Large Language Models (LLMs) with precise instructions is essential for their application in diverse real-world scenarios. Current methods focus on enhancing the diversity and complexity of training and evaluation samples, yet they fall short in accurately assessing LLMs' ability to follow similar instruction variants. We introduce an effective data augmentation technique that decomposes complex instructions into simpler sub-components, modifies these, and reconstructs them into new variants, thereby preserves the original instruction's context and complexity while introducing variability, which is critical for training and evaluating LLMs' instruction-following precision. We developed the DeMoRecon dataset using this method to both fine-tune and evaluate LLMs. Our findings show that LLMs fine-tuned with DeMoRecon will gain significant performance boost on both ours and commonly used instructions-following benchmarks.
☆ Iterative Utility Judgment Framework via LLMs Inspired by Relevance in Philosophy
Utility and topical relevance are critical measures in information retrieval (IR), reflecting system and user perspectives, respectively. While topical relevance has long been emphasized, utility is a higher standard of relevance and is more useful for facilitating downstream tasks, e.g., in Retrieval-Augmented Generation (RAG). When we incorporate utility judgments into RAG, we realize that the topical relevance, utility, and answering in RAG are closely related to the three types of relevance that Schutz discussed from a philosophical perspective. They are topical relevance, interpretational relevance, and motivational relevance, respectively. Inspired by the dynamic iterations of the three types of relevance, we propose an Iterative utiliTy judgmEnt fraMework (ITEM) to promote each step of the cycle of RAG. We conducted extensive experiments on multi-grade passage retrieval and factoid question-answering datasets (i.e., TREC DL, WebAP, and NQ). Experimental results demonstrate significant improvements in utility judgments, ranking of topical relevance, and answer generation upon representative baselines, including multiple single-shot utility judging approaches. Our code and benchmark can be found at https://anonymous.4open.science/r/ITEM-B486/.
comment: 22 pages
☆ Statistical Learning of Distributionally Robust Stochastic Control in Continuous State Spaces
We explore the control of stochastic systems with potentially continuous state and action spaces, characterized by the state dynamics $X_{t+1} = f(X_t, A_t, W_t)$. Here, $X$, $A$, and $W$ represent the state, action, and exogenous random noise processes, respectively, with $f$ denoting a known function that describes state transitions. Traditionally, the noise process $\{W_t, t \geq 0\}$ is assumed to be independent and identically distributed, with a distribution that is either fully known or can be consistently estimated. However, the occurrence of distributional shifts, typical in engineering settings, necessitates the consideration of the robustness of the policy. This paper introduces a distributionally robust stochastic control paradigm that accommodates possibly adaptive adversarial perturbation to the noise distribution within a prescribed ambiguity set. We examine two adversary models: current-action-aware and current-action-unaware, leading to different dynamic programming equations. Furthermore, we characterize the optimal finite sample minimax rates for achieving uniform learning of the robust value function across continuum states under both adversary types, considering ambiguity sets defined by $f_k$-divergence and Wasserstein distance. Finally, we demonstrate the applicability of our framework across various real-world settings.
♻ ☆ Ovis: Structural Embedding Alignment for Multimodal Large Language Model
Current Multimodal Large Language Models (MLLMs) typically integrate a pre-trained LLM with another pre-trained vision transformer through a connector, such as an MLP, endowing the LLM with visual capabilities. However, the misalignment between two embedding strategies in MLLMs -- the structural textual embeddings based on an embedding look-up table and the continuous embeddings generated directly by the vision encoder -- makes challenges for a more seamless fusion of visual and textual information. We propose Ovis, a novel MLLM architecture designed to structurally align visual and textual embeddings. Ovis integrates an additional learnable visual embedding table into the visual encoder's process. To capture rich visual semantics, each image patch indexes the visual embedding table multiple times, resulting in a final visual embedding that is a probabilistic combination of the indexed embeddings. This structural approach mirrors the method used for generating textual embeddings. Empirical evaluations on various multimodal benchmarks show that Ovis outperforms open-source MLLMs of similar parameter scales and even surpasses the proprietary model Qwen-VL-Plus overall. These results highlight the potential of Ovis' structured visual representation for advancing MLLM architectural design and promoting more effective multimodal learning. Code, datasets, and models are available at https://github.com/AIDC-AI/Ovis.
♻ ☆ Grokking Group Multiplication with Cosets
The complex and unpredictable nature of deep neural networks prevents their safe use in many high-stakes applications. There have been many techniques developed to interpret deep neural networks, but all have substantial limitations. Algorithmic tasks have proven to be a fruitful test ground for interpreting a neural network end-to-end. Building on previous work, we completely reverse engineer fully connected one-hidden layer networks that have ``grokked'' the arithmetic of the permutation groups $S_5$ and $S_6$. The models discover the true subgroup structure of the full group and converge on neural circuits that decompose the group arithmetic using the permutation group's subgroups. We relate how we reverse engineered the model's mechanisms and confirmed our theory was a faithful description of the circuit's functionality. We also draw attention to current challenges in conducting interpretability research by comparing our work to Chughtai et al. [4] which alleges to find a different algorithm for this same problem.
♻ ☆ Shaping Up SHAP: Enhancing Stability through Layer-Wise Neighbor Selection
Machine learning techniques, such as deep learning and ensemble methods, are widely used in various domains due to their ability to handle complex real-world tasks. However, their black-box nature has raised multiple concerns about the fairness, trustworthiness, and transparency of computer-assisted decision-making. This has led to the emergence of local post-hoc explainability methods, which offer explanations for individual decisions made by black-box algorithms. Among these methods, Kernel SHAP is widely used due to its model-agnostic nature and its well-founded theoretical framework. Despite these strengths, Kernel SHAP suffers from high instability: different executions of the method with the same inputs can lead to significantly different explanations, which diminishes the relevance of the explanations. The contribution of this paper is two-fold. On the one hand, we show that Kernel SHAP's instability is caused by its stochastic neighbor selection procedure, which we adapt to achieve full stability without compromising explanation fidelity. On the other hand, we show that by restricting the neighbors generation to perturbations of size 1 -- which we call the coalitions of Layer 1 -- we obtain a novel feature-attribution method that is fully stable, computationally efficient, and still meaningful.
♻ ☆ Ultrasound Imaging based on the Variance of a Diffusion Restoration Model
Despite today's prevalence of ultrasound imaging in medicine, ultrasound signal-to-noise ratio is still affected by several sources of noise and artefacts. Moreover, enhancing ultrasound image quality involves balancing concurrent factors like contrast, resolution, and speckle preservation. Recently, there has been progress in both model-based and learning-based approaches addressing the problem of ultrasound image reconstruction. Bringing the best from both worlds, we propose a hybrid reconstruction method combining an ultrasound linear direct model with a learning-based prior coming from a generative Denoising Diffusion model. More specifically, we rely on the unsupervised fine-tuning of a pre-trained Denoising Diffusion Restoration Model (DDRM). Given the nature of multiplicative noise inherent to ultrasound, this paper proposes an empirical model to characterize the stochasticity of diffusion reconstruction of ultrasound images, and shows the interest of its variance as an echogenicity map estimator. We conduct experiments on synthetic, in-vitro, and in-vivo data, demonstrating the efficacy of our variance imaging approach in achieving high-quality image reconstructions from single plane-wave acquisitions and in comparison to state-of-the-art methods. The code is available at: https://github.com/Yuxin-Zhang-Jasmine/DRUSvar
comment: 5 pages; accepted by EUSIPCO 2024. arXiv admin note: text overlap with arXiv:2310.20618
♻ ☆ Quantifying Local Model Validity using Active Learning
Real-world applications of machine learning models are often subject to legal or policy-based regulations. Some of these regulations require ensuring the validity of the model, i.e., the approximation error being smaller than a threshold. A global metric is generally too insensitive to determine the validity of a specific prediction, whereas evaluating local validity is costly since it requires gathering additional data.We propose learning the model error to acquire a local validity estimate while reducing the amount of required data through active learning. Using model validation benchmarks, we provide empirical evidence that the proposed method can lead to an error model with sufficient discriminative properties using a relatively small amount of data. Furthermore, an increased sensitivity to local changes of the validity bounds compared to alternative approaches is demonstrated.
comment: 40th Conference on Uncertainty in Artificial Intelligence
♻ ☆ Novel Fundus Image Preprocessing for Retcam Images to Improve Deep Learning Classification of Retinopathy of Prematurity
Retinopathy of Prematurity (ROP) is a potentially blinding eye disorder because of damage to the eye's retina which can affect babies born prematurely. Screening of ROP is essential for early detection and treatment. This is a laborious and manual process which requires trained physician performing dilated ophthalmological examination which can be subjective resulting in lower diagnosis success for clinically significant disease. Automated diagnostic methods can assist ophthalmologists increase diagnosis accuracy using deep learning. Several research groups have highlighted various approaches. Captured ROP Retcam images suffer from poor quality. This paper proposes the use of improved novel fundus preprocessing methods using pretrained transfer learning frameworks to create hybrid models to give higher diagnosis accuracy. Once trained and validated, the evaluations showed that these novel methods in comparison to traditional imaging processing contribute to better and in many aspects higher accuracy in classifying Plus disease, Stages of ROP and Zones in comparison to peer papers.
comment: 10 pages, 4 figures, 7 tables. arXiv admin note: text overlap with arXiv:1904.08796 by other authors
♻ ☆ Reward Machines for Deep RL in Noisy and Uncertain Environments
Reward Machines provide an automata-inspired structure for specifying instructions, safety constraints, and other temporally extended reward-worthy behaviour. By exposing complex reward function structure, they enable counterfactual learning updates that have resulted in impressive sample efficiency gains. While Reward Machines have been employed in both tabular and deep RL settings, they have typically relied on a ground-truth interpretation of the domain-specific vocabulary that form the building blocks of the reward function. Such ground-truth interpretations can be elusive in many real-world settings, due in part to partial observability or noisy sensing. In this paper, we explore the use of Reward Machines for Deep RL in noisy and uncertain environments. We characterize this problem as a POMDP and propose a suite of RL algorithms that leverage task structure under uncertain interpretation of domain-specific vocabulary. Theoretical analysis exposes pitfalls in naive approaches to this problem, while experimental results show that our algorithms successfully leverage task structure to improve performance under noisy interpretations of the vocabulary. Our results provide a general framework for exploiting Reward Machines in partially observable environments.
♻ ☆ CIMRL: Combining IMitation and Reinforcement Learning for Safe Autonomous Driving
Modern approaches to autonomous driving rely heavily on learned components trained with large amounts of human driving data via imitation learning. However, these methods require large amounts of expensive data collection and even then face challenges with safely handling long-tail scenarios and compounding errors over time. At the same time, pure Reinforcement Learning (RL) methods can fail to learn performant policies in sparse, constrained, and challenging-to-define reward settings like driving. Both of these challenges make deploying purely cloned policies in safety critical applications like autonomous vehicles challenging. In this paper we propose Combining IMitation and Reinforcement Learning (CIMRL) approach - a framework that enables training driving policies in simulation through leveraging imitative motion priors and safety constraints. CIMRL does not require extensive reward specification and improves on the closed loop behavior of pure cloning methods. By combining RL and imitation, we demonstrate that our method achieves state-of-the-art results in closed loop simulation driving benchmarks.
♻ ☆ Rethink Tree Traversal
We will show how to implement binary decision tree traversal in the language of matrix computation. Our main contribution is to propose some equivalent algorithms of binary tree traversal based on a novel matrix representation of the hierarchical structure of the decision tree. Our key idea is to travel the binary decision tree by maximum inner product search. We not only implement decision tree methods without the recursive traverse but also delve into the partitioning nature of tree-based methods.
comment: Fix some typoes
♻ ☆ Topology-aware Federated Learning in Edge Computing: A Comprehensive Survey
The ultra-low latency requirements of 5G/6G applications and privacy constraints call for distributed machine learning systems to be deployed at the edge. With its simple yet effective approach, federated learning (FL) is a natural solution for massive user-owned devices in edge computing with distributed and private training data. FL methods based on FedAvg typically follow a naive star topology, ignoring the heterogeneity and hierarchy of the volatile edge computing architectures and topologies in reality. Several other network topologies exist and can address the limitations and bottlenecks of the star topology. This motivates us to survey network topology-related FL solutions. In this paper, we conduct a comprehensive survey of the existing FL works focusing on network topologies. After a brief overview of FL and edge computing networks, we discuss various edge network topologies as well as their advantages and disadvantages. Lastly, we discuss the remaining challenges and future works for applying FL to topology-specific edge networks.
comment: Accepted by ACM Computing Surveys
♻ ☆ t-DGR: A Trajectory-Based Deep Generative Replay Method for Continual Learning in Decision Making
Deep generative replay has emerged as a promising approach for continual learning in decision-making tasks. This approach addresses the problem of catastrophic forgetting by leveraging the generation of trajectories from previously encountered tasks to augment the current dataset. However, existing deep generative replay methods for continual learning rely on autoregressive models, which suffer from compounding errors in the generated trajectories. In this paper, we propose a simple, scalable, and non-autoregressive method for continual learning in decision-making tasks using a generative model that generates task samples conditioned on the trajectory timestep. We evaluate our method on Continual World benchmarks and find that our approach achieves state-of-the-art performance on the average success rate metric among continual learning methods. Code is available at https://github.com/WilliamYue37/t-DGR.
comment: Published at 3rd Conference on Lifelong Learning Agents (CoLLAs), 2024
♻ ☆ On the Convergence of Zeroth-Order Federated Tuning for Large Language Models KDD'24
The confluence of Federated Learning (FL) and Large Language Models (LLMs) is ushering in a new era in privacy-preserving natural language processing. However, the intensive memory requirements for fine-tuning LLMs pose significant challenges, especially when deploying on clients with limited computational resources. To circumvent this, we explore the novel integration of Memory-efficient Zeroth-Order Optimization within a federated setting, a synergy we term as FedMeZO. Our study is the first to examine the theoretical underpinnings of FedMeZO in the context of LLMs, tackling key questions regarding the influence of large parameter spaces on optimization behavior, the establishment of convergence properties, and the identification of critical parameters for convergence to inform personalized federated strategies. Our extensive empirical evidence supports the theory, showing that FedMeZO not only converges faster than traditional first-order methods such as FedAvg but also significantly reduces GPU memory usage during training to levels comparable to those during inference. Moreover, the proposed personalized FL strategy that is built upon the theoretical insights to customize the client-wise learning rate can effectively accelerate loss reduction. We hope our work can help to bridge theoretical and practical aspects of federated fine-tuning for LLMs, thereby stimulating further advancements and research in this area.
comment: accepted by KDD'24 research track. 21 pages, 10 figures, 8 tables
♻ ☆ An Interactive Agent Foundation Model
The development of artificial intelligence systems is transitioning from creating static, task-specific models to dynamic, agent-based systems capable of performing well in a wide range of applications. We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents across a wide range of domains, datasets, and tasks. Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction, enabling a versatile and adaptable AI framework. We demonstrate the performance of our framework across three separate domains -- Robotics, Gaming AI, and Healthcare. Our model demonstrates its ability to generate meaningful and contextually relevant outputs in each area. The strength of our approach lies in its generality, leveraging a variety of data sources such as robotics sequences, gameplay data, large-scale video datasets, and textual information for effective multimodal and multi-task learning. Our approach provides a promising avenue for developing generalist, action-taking, multimodal systems.
♻ ☆ FedML-HE: An Efficient Homomorphic-Encryption-Based Privacy-Preserving Federated Learning System
Federated Learning trains machine learning models on distributed devices by aggregating local model updates instead of local data. However, privacy concerns arise as the aggregated local models on the server may reveal sensitive personal information by inversion attacks. Privacy-preserving methods, such as homomorphic encryption (HE), then become necessary for FL training. Despite HE's privacy advantages, its applications suffer from impractical overheads, especially for foundation models. In this paper, we present FedML-HE, the first practical federated learning system with efficient HE-based secure model aggregation. FedML-HE proposes to selectively encrypt sensitive parameters, significantly reducing both computation and communication overheads during training while providing customizable privacy preservation. Our optimized system demonstrates considerable overhead reduction, particularly for large foundation models (e.g., ~10x reduction for ResNet-50, and up to ~40x reduction for BERT), demonstrating the potential for scalable HE-based FL deployment.
♻ ☆ Reference-based Metrics Disprove Themselves in Question Generation
Reference-based metrics such as BLEU and BERTScore are widely used to evaluate question generation (QG). In this study, on QG benchmarks such as SQuAD and HotpotQA, we find that using human-written references cannot guarantee the effectiveness of the reference-based metrics. Most QG benchmarks have only one reference; we replicated the annotation process and collect another reference. A good metric was expected to grade a human-validated question no worse than generated questions. However, the results of reference-based metrics on our newly collected reference disproved the metrics themselves. We propose a reference-free metric consisted of multi-dimensional criteria such as naturalness, answerability, and complexity, utilizing large language models. These criteria are not constrained to the syntactic or semantic of a single reference question, and the metric does not require a diverse set of references. Experiments reveal that our metric accurately distinguishes between high-quality questions and flawed ones, and achieves state-of-the-art alignment with human judgment.
comment: Revised Jun 14 2024; Under Review
♻ ☆ A novel hybrid time-varying graph neural network for traffic flow forecasting
Real-time and precise traffic flow prediction is vital for the efficiency of intelligent transportation systems. Traditional methods often employ graph neural networks (GNNs) with predefined graphs to describe spatial correlations among traffic nodes in urban road networks. However, these pre-defined graphs are limited by existing knowledge and graph generation methodologies, offering an incomplete picture of spatial correlations. While time-varying graphs based on data-driven learning have attempted to address these limitations, they still struggle with adequately capturing the inherent spatial correlations in traffic data. Moreover, most current methods for capturing dynamic temporal correlations rely on a unified calculation scheme using a temporal multi-head self-attention mechanism, which at some level might leads to inaccuracies. In order to overcome these challenges, we have proposed a novel hybrid time-varying graph neural network (HTVGNN) for traffic flow prediction. Firstly, a novel enhanced temporal perception multi-head self-attention mechanism based on time-varying mask enhancement was reported to more accurately model the dynamic temporal dependencies among distinct traffic nodes in the traffic network. Secondly, we have proposed a novel graph learning strategy to concurrently learn both static and dynamic spatial associations between different traffic nodes in road networks. Meanwhile, in order to enhance the learning ability of time-varying graphs, a coupled graph learning mechanism was designed to couple the graphs learned at each time step. Finally, the effectiveness of the proposed method HTVGNN was demonstrated with four real data sets. Simulation results revealed that HTVGNN achieves superior prediction accuracy compared to the state of the art spatio-temporal graph neural network models. Additionally, the ablation experiment verifies that the coupled graph learning mechanism can effectively improve the long-term prediction performance of HTVGNN.
comment: 16 pages,7 figures
♻ ☆ A Perspective on Explainable Artificial Intelligence Methods: SHAP and LIME
eXplainable artificial intelligence (XAI) methods have emerged to convert the black box of machine learning (ML) models into a more digestible form. These methods help to communicate how the model works with the aim of making ML models more transparent and increasing the trust of end-users into their output. SHapley Additive exPlanations (SHAP) and Local Interpretable Model Agnostic Explanation (LIME) are two widely used XAI methods, particularly with tabular data. In this perspective piece, we discuss the way the explainability metrics of these two methods are generated and propose a framework for interpretation of their outputs, highlighting their weaknesses and strengths. Specifically, we discuss their outcomes in terms of model-dependency and in the presence of collinearity among the features, relying on a case study from the biomedical domain (classification of individuals with or without myocardial infarction). The results indicate that SHAP and LIME are highly affected by the adopted ML model and feature collinearity, raising a note of caution on their usage and interpretation.
♻ ☆ Do Language Models Exhibit the Same Cognitive Biases in Problem Solving as Human Learners? ICML 2024
There is increasing interest in employing large language models (LLMs) as cognitive models. For such purposes, it is central to understand which properties of human cognition are well-modeled by LLMs, and which are not. In this work, we study the biases of LLMs in relation to those known in children when solving arithmetic word problems. Surveying the learning science literature, we posit that the problem-solving process can be split into three distinct steps: text comprehension, solution planning and solution execution. We construct tests for each one in order to understand whether current LLMs display the same cognitive biases as children in these steps. We generate a novel set of word problems for each of these tests, using a neuro-symbolic approach that enables fine-grained control over the problem features. We find evidence that LLMs, with and without instruction-tuning, exhibit human-like biases in both the text-comprehension and the solution-planning steps of the solving process, but not in the final step, in which the arithmetic expressions are executed to obtain the answer.
comment: Accepted at ICML 2024
♻ ☆ Killer Apps: Low-Speed, Large-Scale AI Weapons
The accelerating advancements in Artificial Intelligence (AI) and Machine Learning (ML), highlighted by the development of cutting-edge Generative Pre-trained Transformer (GPT) models by organizations such as OpenAI, Meta, and Anthropic, present new challenges and opportunities in warfare and security. Much of the current focus is on AI's integration within weapons systems and its role in rapid decision-making in kinetic conflict. However, an equally important but often overlooked aspect is the potential of AI-based psychological manipulation at internet scales within the information domain. These capabilities could pose significant threats to individuals, organizations, and societies globally. This paper explores the concept of AI weapons, their deployment, detection, and potential countermeasures.
comment: 10 pages with 10 pages of appendices. 3 Figures, 2 code listings
♻ ☆ MLXP: A Framework for Conducting Replicable Experiments in Python
Replicability in machine learning (ML) research is increasingly concerning due to the utilization of complex non-deterministic algorithms and the dependence on numerous hyper-parameter choices, such as model architecture and training datasets. Ensuring reproducible and replicable results is crucial for advancing the field, yet often requires significant technical effort to conduct systematic and well-organized experiments that yield robust conclusions. Several tools have been developed to facilitate experiment management and enhance reproducibility; however, they often introduce complexity that hinders adoption within the research community, despite being well-handled in industrial settings. To address the challenge of low adoption, we propose MLXP, an open-source, simple, and lightweight experiment management tool based on Python, available at https://github.com/inria-thoth/mlxp . MLXP streamlines the experimental process with minimal practitioner overhead while ensuring a high level of reproducibility.
♻ ☆ Colored Noise in PPO: Improved Exploration and Performance through Correlated Action Sampling
Proximal Policy Optimization (PPO), a popular on-policy deep reinforcement learning method, employs a stochastic policy for exploration. In this paper, we propose a colored noise-based stochastic policy variant of PPO. Previous research highlighted the importance of temporal correlation in action noise for effective exploration in off-policy reinforcement learning. Building on this, we investigate whether correlated noise can also enhance exploration in on-policy methods like PPO. We discovered that correlated noise for action selection improves learning performance and outperforms the currently popular uncorrelated white noise approach in on-policy methods. Unlike off-policy learning, where pink noise was found to be highly effective, we found that a colored noise, intermediate between white and pink, performed best for on-policy learning in PPO. We examined the impact of varying the amount of data collected for each update by modifying the number of parallel simulation environments for data collection and observed that with a larger number of parallel environments, more strongly correlated noise is beneficial. Due to the significant impact and ease of implementation, we recommend switching to correlated noise as the default noise source in PPO.
♻ ☆ MTEB-French: Resources for French Sentence Embedding Evaluation and Analysis
Recently, numerous embedding models have been made available and widely used for various NLP tasks. The Massive Text Embedding Benchmark (MTEB) has primarily simplified the process of choosing a model that performs well for several tasks in English, but extensions to other languages remain challenging. This is why we expand MTEB to propose the first massive benchmark of sentence embeddings for French. We gather 15 existing datasets in an easy-to-use interface and create three new French datasets for a global evaluation of 8 task categories. We compare 51 carefully selected embedding models on a large scale, conduct comprehensive statistical tests, and analyze the correlation between model performance and many of their characteristics. We find out that even if no model is the best on all tasks, large multilingual models pre-trained on sentence similarity perform exceptionally well. Our work comes with open-source code, new datasets and a public leaderboard.
♻ ☆ Improving Generalization of Neural Vehicle Routing Problem Solvers Through the Lens of Model Architecture
Neural models produce promising results when solving Vehicle Routing Problems (VRPs), but often fall short in generalization. Recent attempts to enhance model generalization often incur unnecessarily large training cost or cannot be directly applied to other models solving different VRP variants. To address these issues, we take a novel perspective on model architecture in this study. Specifically, we propose a plug-and-play Entropy-based Scaling Factor (ESF) and a Distribution-Specific (DS) decoder to enhance the size and distribution generalization, respectively. ESF adjusts the attention weight pattern of the model towards familiar ones discovered during training when solving VRPs of varying sizes. The DS decoder explicitly models VRPs of multiple training distribution patterns through multiple auxiliary light decoders, expanding the model representation space to encompass a broader range of distributional scenarios. We conduct extensive experiments on both synthetic and widely recognized real-world benchmarking datasets and compare the performance with seven baseline models. The results demonstrate the effectiveness of using ESF and DS decoder to obtain a more generalizable model and showcase their applicability to solve different VRP variants, i.e., travelling salesman problem and capacitated VRP. Notably, our proposed generic components require minimal computational resources, and can be effortlessly integrated into conventional generalization strategies to further elevate model generalization.
comment: 13 pages, 6 figures, and 6 tables
♻ ☆ Deep learning probability flows and entropy production rates in active matter
Active matter systems, from self-propelled colloids to motile bacteria, are characterized by the conversion of free energy into useful work at the microscopic scale. They involve physics beyond the reach of equilibrium statistical mechanics, and a persistent challenge has been to understand the nature of their nonequilibrium states. The entropy production rate and the probability current provide quantitative ways to do so by measuring the breakdown of time-reversal symmetry. Yet, their efficient computation has remained elusive, as they depend on the system's unknown and high-dimensional probability density. Here, building upon recent advances in generative modeling, we develop a deep learning framework to estimate the score of this density. We show that the score, together with the microscopic equations of motion, gives access to the entropy production rate, the probability current, and their decomposition into local contributions from individual particles. To represent the score, we introduce a novel, spatially-local transformer network architecture that learns high-order interactions between particles while respecting their underlying permutation symmetry. We demonstrate the broad utility and scalability of the method by applying it to several high-dimensional systems of active particles undergoing motility-induced phase separation (MIPS). We show that a single network trained on a system of 4096 particles at one packing fraction can generalize to other regions of the phase diagram, including systems with as many as 32768 particles. We use this observation to quantify the spatial structure of the departure from equilibrium in MIPS as a function of the number of particles and the packing fraction.
♻ ☆ Physics-informed Neural Network Estimation of Material Properties in Soft Tissue Nonlinear Biomechanical Models
The development of biophysical models for clinical applications is rapidly advancing in the research community, thanks to their predictive nature and their ability to assist the interpretation of clinical data. However, high-resolution and accurate multi-physics computational models are computationally expensive and their personalisation involves fine calibration of a large number of parameters, which may be space-dependent, challenging their clinical translation. In this work, we propose a new approach which relies on the combination of physics-informed neural networks (PINNs) with three-dimensional soft tissue nonlinear biomechanical models, capable of reconstructing displacement fields and estimating heterogeneous patient-specific biophysical properties. The proposed learning algorithm encodes information from a limited amount of displacement and, in some cases, strain data, that can be routinely acquired in the clinical setting, and combines it with the physics of the problem, represented by a mathematical model based on partial differential equations, to regularise the problem and improve its convergence properties. Several benchmarks are presented to show the accuracy and robustness of the proposed method and its great potential to enable the robust and effective identification of patient-specific, heterogeneous physical properties, s.a. tissue stiffness properties. In particular, we demonstrate the capability of the PINN to detect the presence, location and severity of scar tissue, which is beneficial to develop personalised simulation models for disease diagnosis, especially for cardiac applications.
comment: Accepted for publication in Computational Mechanics
♻ ☆ Backdoor for Debias: Mitigating Model Bias with Backdoor Attack-based Artificial Bias
With the swift advancement of deep learning, state-of-the-art algorithms have been utilized in various social situations. Nonetheless, some algorithms have been discovered to exhibit biases and provide unequal results. The current debiasing methods face challenges such as poor utilization of data or intricate training requirements. In this work, we found that the backdoor attack can construct an artificial bias similar to the model bias derived in standard training. Considering the strong adjustability of backdoor triggers, we are motivated to mitigate the model bias by carefully designing reverse artificial bias created from backdoor attack. Based on this, we propose a backdoor debiasing framework based on knowledge distillation, which effectively reduces the model bias from original data and minimizes security risks from the backdoor attack. The proposed solution is validated on both image and structured datasets, showing promising results. This work advances the understanding of backdoor attacks and highlights its potential for beneficial applications. The code for the study can be found at \url{https://anonymous.4open.science/r/DwB-BC07/}.
♻ ☆ MindStar: Enhancing Math Reasoning in Pre-trained LLMs at Inference Time
Although Large Language Models (LLMs) achieve remarkable performance across various tasks, they often struggle with complex reasoning tasks, such as answering mathematical questions. Recent efforts to address this issue have primarily focused on leveraging mathematical datasets through supervised fine-tuning or self-improvement techniques. However, these methods often depend on high-quality datasets that are difficult to prepare, or they require substantial computational resources for fine-tuning. Inspired by findings that LLMs know how to produce the right answer but struggle to select the correct reasoning path, we propose a purely inference-based searching method -- MindStar (M*). This method formulates reasoning tasks as searching problems and proposes two search ideas to identify the optimal reasoning paths. We evaluate the M* framework on both the GSM8K and MATH datasets, comparing its performance with existing open and closed-source LLMs. Our results demonstrate that M* significantly enhances the reasoning abilities of open-source models, such as Llama-2-13B and Mistral-7B, and achieves comparable performance to GPT-3.5 and Grok-1, but with substantially reduced model size and computational costs.
♻ ☆ CAT: A Causally Graph Attention Network for Trimming Heterophilic Graph
Local Attention-guided Message Passing Mechanism (LAMP) adopted in Graph Attention Networks (GATs) is designed to adaptively learn the importance of neighboring nodes for better local aggregation on the graph, which can bring the representations of similar neighbors closer effectively, thus showing stronger discrimination ability. However, existing GATs suffer from a significant discrimination ability decline in heterophilic graphs because the high proportion of dissimilar neighbors can weaken the self-attention of the central node, jointly resulting in the deviation of the central node from similar nodes in the representation space. This kind of effect generated by neighboring nodes is called the Distraction Effect (DE) in this paper. To estimate and weaken the DE of neighboring nodes, we propose a Causally graph Attention network for Trimming heterophilic graph (CAT). To estimate the DE, since the DE are generated through two paths (grab the attention assigned to neighbors and reduce the self-attention of the central node), we use Total Effect to model DE, which is a kind of causal estimand and can be estimated from intervened data; To weaken the DE, we identify the neighbors with the highest DE (we call them Distraction Neighbors) and remove them. We adopt three representative GATs as the base model within the proposed CAT framework and conduct experiments on seven heterophilic datasets in three different sizes. Comparative experiments show that CAT can improve the node classification accuracy of all base GAT models. Ablation experiments and visualization further validate the enhancement of discrimination ability brought by CAT. The source code is available at https://github.com/GeoX-Lab/CAT.
comment: 25 pages, 18 figures, 5 tables
♻ ☆ Attention Meets Post-hoc Interpretability: A Mathematical Perspective ICML 2024
Attention-based architectures, in particular transformers, are at the heart of a technological revolution. Interestingly, in addition to helping obtain state-of-the-art results on a wide range of applications, the attention mechanism intrinsically provides meaningful insights on the internal behavior of the model. Can these insights be used as explanations? Debate rages on. In this paper, we mathematically study a simple attention-based architecture and pinpoint the differences between post-hoc and attention-based explanations. We show that they provide quite different results, and that, despite their limitations, post-hoc methods are capable of capturing more useful insights than merely examining the attention weights.
comment: Accepted at ICML 2024
♻ ☆ Flexible Parametric Inference for Space-Time Hawkes Processes
Many modern spatio-temporal data sets, in sociology, epidemiology or seismology, for example, exhibit self-exciting characteristics, triggering and clustering behaviors both at the same time, that a suitable Hawkes space-time process can accurately capture. This paper aims to develop a fast and flexible parametric inference technique to recover the parameters of the kernel functions involved in the intensity function of a space-time Hawkes process based on such data. Our statistical approach combines three key ingredients: 1) kernels with finite support are considered, 2) the space-time domain is appropriately discretized, and 3) (approximate) precomputations are used. The inference technique we propose then consists of a $\ell_2$ gradient-based solver that is fast and statistically accurate. In addition to describing the algorithmic aspects, numerical experiments have been carried out on synthetic and real spatio-temporal data, providing solid empirical evidence of the relevance of the proposed methodology.
♻ ☆ Uncovering Challenges of Solving the Continuous Gromov-Wasserstein Problem
Recently, the Gromov-Wasserstein Optimal Transport (GWOT) problem has attracted the special attention of the ML community. In this problem, given two distributions supported on two (possibly different) spaces, one has to find the most isometric map between them. In the discrete variant of GWOT, the task is to learn an assignment between given discrete sets of points. In the more advanced continuous formulation, one aims at recovering a parametric mapping between unknown continuous distributions based on i.i.d. samples derived from them. The clear geometrical intuition behind the GWOT makes it a natural choice for several practical use cases, giving rise to a number of proposed solvers. Some of them claim to solve the continuous version of the problem. At the same time, GWOT is notoriously hard, both theoretically and numerically. Moreover, all existing continuous GWOT solvers still heavily rely on discrete techniques. Natural questions arise: to what extent existing methods unravel GWOT problem, what difficulties they encounter, and under which conditions they are successful. Our benchmark paper is an attempt to answer these questions. We specifically focus on the continuous GWOT as the most interesting and debatable setup. We crash-test existing continuous GWOT approaches on different scenarios, carefully record and analyze the obtained results, and identify issues. Our findings experimentally testify that the scientific community is still missing a reliable continuous GWOT solver, which necessitates further research efforts. As the first step in this direction, we propose a new continuous GWOT method which does not rely on discrete techniques and partially solves some of the problems of the competitors. Our code is available at https://github.com/Ark-130994/GW-Solvers.
♻ ☆ Probabilistic Constrained Reinforcement Learning with Formal Interpretability
Reinforcement learning can provide effective reasoning for sequential decision-making problems with variable dynamics. Such reasoning in practical implementation, however, poses a persistent challenge in interpreting the reward function and the corresponding optimal policy. Consequently, representing sequential decision-making problems as probabilistic inference can have considerable value, as, in principle, the inference offers diverse and powerful mathematical tools to infer the stochastic dynamics whilst suggesting a probabilistic interpretation of policy optimization. In this study, we propose a novel Adaptive Wasserstein Variational Optimization, namely AWaVO, to tackle these interpretability challenges. Our approach uses formal methods to achieve the interpretability for convergence guarantee, training transparency, and intrinsic decision-interpretation. To demonstrate its practicality, we showcase guaranteed interpretability with an optimal global convergence rate in simulation and in practical quadrotor tasks. In comparison with state-of-the-art benchmarks including TRPO-IPO, PCPO and CRPO, we empirically verify that AWaVO offers a reasonable trade-off between high performance and sufficient interpretability.
comment: 25 pages, 9 figures, containing Appendix
♻ ☆ MADA: Meta-Adaptive Optimizers through hyper-gradient Descent
Following the introduction of Adam, several novel adaptive optimizers for deep learning have been proposed. These optimizers typically excel in some tasks but may not outperform Adam uniformly across all tasks. In this work, we introduce Meta-Adaptive Optimizers (MADA), a unified optimizer framework that can generalize several known optimizers and dynamically learn the most suitable one during training. The key idea in MADA is to parameterize the space of optimizers and dynamically search through it using hyper-gradient descent during training. We empirically compare MADA to other popular optimizers on vision and language tasks, and find that MADA consistently outperforms Adam and other popular optimizers, and is robust against sub-optimally tuned hyper-parameters. MADA achieves a greater validation performance improvement over Adam compared to other popular optimizers during GPT-2 training and fine-tuning. We also propose AVGrad, a modification of AMSGrad that replaces the maximum operator with averaging, which is more suitable for hyper-gradient optimization. Finally, we provide a convergence analysis to show that parameterized interpolations of optimizers can improve their error bounds (up to constants), hinting at an advantage for meta-optimizers.
♻ ☆ Quantitative CLTs in Deep Neural Networks
We study the distribution of a fully connected neural network with random Gaussian weights and biases in which the hidden layer widths are proportional to a large constant $n$. Under mild assumptions on the non-linearity, we obtain quantitative bounds on normal approximations valid at large but finite $n$ and any fixed network depth. Our theorems show both for the finite-dimensional distributions and the entire process, that the distance between a random fully connected network (and its derivatives) to the corresponding infinite width Gaussian process scales like $n^{-\gamma}$ for $\gamma>0$, with the exponent depending on the metric used to measure discrepancy. Our bounds are strictly stronger in terms of their dependence on network width than any previously available in the literature; in the one-dimensional case, we also prove that they are optimal, i.e., we establish matching lower bounds.
♻ ☆ MergeNet: Knowledge Migration across Heterogeneous Models, Tasks, and Modalities
In this study, we focus on heterogeneous knowledge transfer across entirely different model architectures, tasks, and modalities. Existing knowledge transfer methods (e.g., backbone sharing, knowledge distillation) often hinge on shared elements within model structures or task-specific features/labels, limiting transfers to complex model types or tasks. To overcome these challenges, we present MergeNet, which learns to bridge the gap of parameter spaces of heterogeneous models, facilitating the direct interaction, extraction, and application of knowledge within these parameter spaces. The core mechanism of MergeNet lies in the parameter adapter, which operates by querying the source model's low-rank parameters and adeptly learning to identify and map parameters into the target model. MergeNet is learned alongside both models, allowing our framework to dynamically transfer and adapt knowledge relevant to the current stage, including the training trajectory knowledge of the source model. Extensive experiments on heterogeneous knowledge transfer demonstrate significant improvements in challenging settings, where representative approaches may falter or prove less applicable.
♻ ☆ Position: Understanding LLMs Requires More Than Statistical Generalization ICML2024
The last decade has seen blossoming research in deep learning theory attempting to answer, "Why does deep learning generalize?" A powerful shift in perspective precipitated this progress: the study of overparametrized models in the interpolation regime. In this paper, we argue that another perspective shift is due, since some of the desirable qualities of LLMs are not a consequence of good statistical generalization and require a separate theoretical explanation. Our core argument relies on the observation that AR probabilistic models are inherently non-identifiable: models zero or near-zero KL divergence apart -- thus, equivalent test loss -- can exhibit markedly different behaviors. We support our position with mathematical examples and empirical observations, illustrating why non-identifiability has practical relevance through three case studies: (1) the non-identifiability of zero-shot rule extrapolation; (2) the approximate non-identifiability of in-context learning; and (3) the non-identifiability of fine-tunability. We review promising research directions focusing on LLM-relevant generalization measures, transferability, and inductive biases.
comment: Accepted as a position paper at ICML2024, Code: https://github.com/rpatrik96/llm-non-identifiability
♻ ☆ Towards Global Optimality for Practical Average Reward Reinforcement Learning without Mixing Time Oracles
In the context of average-reward reinforcement learning, the requirement for oracle knowledge of the mixing time, a measure of the duration a Markov chain under a fixed policy needs to achieve its stationary distribution, poses a significant challenge for the global convergence of policy gradient methods. This requirement is particularly problematic due to the difficulty and expense of estimating mixing time in environments with large state spaces, leading to the necessity of impractically long trajectories for effective gradient estimation in practical applications.To address this limitation, we consider the Multi-level Actor-Critic (MAC) framework, which incorporates a Multi-level Monte-Carlo (MLMC) gradient estimator. With our approach, we effectively alleviate the dependency on mixing time knowledge, a first for average-reward MDPs global convergence. Furthermore, our approach exhibits the tightest available dependence of $\mathcal{O}\left( \sqrt{\tau_{mix}} \right)$known from prior work. With a 2D grid world goal-reaching navigation experiment, we demonstrate that MAC outperforms the existing state-of-the-art policy gradient-based method for average reward settings.
comment: 26 Pages, 2 Figures
♻ ☆ The Scaling Law in Stellar Light Curves ICML 2024
Analyzing time series of fluxes from stars, known as stellar light curves, can reveal valuable information about stellar properties. However, most current methods rely on extracting summary statistics, and studies using deep learning have been limited to supervised approaches. In this research, we investigate the scaling law properties that emerge when learning from astronomical time series data using self-supervised techniques. By employing the GPT-2 architecture, we show the learned representation improves as the number of parameters increases from $10^4$ to $10^9$, with no signs of performance plateauing. We demonstrate that a self-supervised Transformer model achieves 3-10 times the sample efficiency compared to the state-of-the-art supervised learning model when inferring the surface gravity of stars as a downstream task. Our research lays the groundwork for analyzing stellar light curves by examining them through large-scale auto-regressive generative models.
comment: 11 pages, 5 figures, ICML 2024 AI4Science workshop
♻ ☆ Fixed points of nonnegative neural networks
We use fixed point theory to analyze nonnegative neural networks, which we define as neural networks that map nonnegative vectors to nonnegative vectors. We first show that nonnegative neural networks with nonnegative weights and biases can be recognized as monotonic and (weakly) scalable mappings within the framework of nonlinear Perron-Frobenius theory. This fact enables us to provide conditions for the existence of fixed points of nonnegative neural networks having inputs and outputs of the same dimension, and these conditions are weaker than those recently obtained using arguments in convex analysis. Furthermore, we prove that the shape of the fixed point set of nonnegative neural networks with nonnegative weights and biases is an interval, which under mild conditions degenerates to a point. These results are then used to obtain the existence of fixed points of more general nonnegative neural networks. From a practical perspective, our results contribute to the understanding of the behavior of autoencoders, and we also offer valuable mathematical machinery for future developments in deep equilibrium models.
comment: License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v25/23-0167.html
♻ ☆ An Online Gradient-Based Caching Policy with Logarithmic Complexity and Regret Guarantees
Commonly used caching policies, such as LRU (Least Recently Used) or LFU (Least Frequently Used), exhibit optimal performance only under specific traffic patterns. Even advanced machine learning-based methods, which detect patterns in historical request data, struggle when future requests deviate from past trends. Recently, a new class of policies has emerged that are robust to varying traffic patterns. These algorithms address an online optimization problem, enabling continuous adaptation to the context. They offer theoretical guarantees on the regret metric, which measures the performance gap between the online policy and the optimal static cache allocation in hindsight. However, the high computational complexity of these solutions hinders their practical adoption. In this study, we introduce a new variant of the gradient-based online caching policy that achieves groundbreaking logarithmic computational complexity relative to catalog size, while also providing regret guarantees. This advancement allows us to test the policy on large-scale, real-world traces featuring millions of requests and items - a significant achievement, as such scales have been beyond the reach of existing policies with regret guarantees. To the best of our knowledge, our experimental results demonstrate for the first time that the regret guarantees of gradient-based caching policies offer substantial benefits in practical scenarios.
♻ ☆ Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy
Human forecasting accuracy in practice relies on the 'wisdom of the crowd' effect, in which predictions about future events are significantly improved by aggregating across a crowd of individual forecasters. Past work on the forecasting ability of large language models (LLMs) suggests that frontier LLMs, as individual forecasters, underperform compared to the gold standard of a human-crowd forecasting-tournament aggregate. In Study 1, we expand this research by using an LLM ensemble approach consisting of a crowd of 12 LLMs. We compare the aggregated LLM predictions on 31 binary questions to those of a crowd of 925 human forecasters from a three-month forecasting tournament. Our preregistered main analysis shows that the LLM crowd outperforms a simple no-information benchmark, and is not statistically different from the human crowd. We also observe a set of human-like biases in machine responses, such as an acquiescence effect and a tendency to favour round numbers. In Study 2, we test whether LLM predictions (of GPT-4 and Claude 2) can be improved by drawing on human cognitive output. We find that both models' forecasting accuracy benefits from exposure to the median human prediction as information, improving accuracy by between 17% and 28%, though this leads to less accurate predictions than simply averaging human and machine forecasts. Our results suggest that LLMs can achieve forecasting accuracy rivaling that of the human crowd: via the simple, practically applicable method of forecast aggregation.
comment: 26 pages; 18 visualizations (nine figures, nine tables)
♻ ☆ Large Language Model Can Continue Evolving From Mistakes
As world knowledge evolves and new task paradigms emerge, Continual Learning (CL) is crucial for keeping Large Language Models (LLMs) up-to-date and addressing their shortcomings. In practical applications, LLMs often require both continual instruction tuning (CIT) and continual pre-training (CPT) to adapt to new task paradigms and acquire necessary knowledge for task-solving. However, it remains challenging to collect CPT data that addresses the knowledge deficiencies in models while maintaining adequate volume, and improving the efficiency of utilizing this data also presents significant difficulties. Inspired by the 'summarizing mistakes' learning skill, we propose the Continue Evolving from Mistakes (CEM) method, aiming to provide a data-efficient approach for collecting CPT data and continually improving LLMs' performance through iterative evaluation and supplementation with mistake-relevant knowledge. To efficiently utilize these CPT data and mitigate forgetting, we design a novel CL training set construction paradigm that integrates parallel CIT and CPT data. Extensive experiments demonstrate the efficacy of the CEM method, achieving up to a 17% improvement in accuracy in the best case. Furthermore, additional experiments confirm the potential of combining CEM with catastrophic forgetting mitigation methods, enabling iterative and continual model evolution.
♻ ☆ H-Fac: Memory-Efficient Optimization with Factorized Hamiltonian Descent
In this study, we introduce a novel adaptive optimizer, H-Fac, which incorporates a factorized approach to momentum and scaling parameters. Our algorithm demonstrates competitive performances on both ResNets and Vision Transformers, while achieving sublinear memory costs through the use of rank-1 parameterizations for moment estimators. We develop our algorithms based on principles derived from Hamiltonian dynamics, providing robust theoretical underpinnings. These optimization algorithms are designed to be both straightforward and adaptable, facilitating easy implementation in diverse settings.
comment: 21 pages, 4 figures
♻ ☆ Query Performance Prediction using Relevance Judgments Generated by Large Language Models
Query performance prediction (QPP) aims to estimate the retrieval quality of a search system for a query without human relevance judgments. Previous QPP methods typically return a single scalar value and do not require the predicted values to approximate a specific information retrieval (IR) evaluation measure, leading to certain drawbacks: (i) a single scalar is insufficient to accurately represent different IR evaluation measures, especially when metrics do not highly correlate, and (ii) a single scalar limits the interpretability of QPP methods because solely using a scalar is insufficient to explain QPP results. To address these issues, we propose a QPP framework using automatically generated relevance judgments (QPP-GenRE), which decomposes QPP into independent subtasks of predicting the relevance of each item in a ranked list to a given query. This allows us to predict any IR evaluation measure using the generated relevance judgments as pseudo-labels. This also allows us to interpret predicted IR evaluation measures, and identify, track and rectify errors in generated relevance judgments to improve QPP quality. We predict an item's relevance by using open-source large language models (LLMs) to ensure scientific reproducibility. We face two main challenges: (i) excessive computational costs of judging an entire corpus for predicting a metric considering recall, and (ii) limited performance in prompting open-source LLMs in a zero-/few-shot manner. To solve the challenges, we devise an approximation strategy to predict an IR measure considering recall and propose to fine-tune open-source LLMs using human-labeled relevance judgments. Experiments on the TREC 2019-2022 deep learning tracks show that QPP-GenRE achieves state-of-the-art QPP quality for both lexical and neural rankers.
♻ ☆ ArtWhisperer: A Dataset for Characterizing Human-AI Interactions in Artistic Creations ICML 2024
As generative AI becomes more prevalent, it is important to study how human users interact with such models. In this work, we investigate how people use text-to-image models to generate desired target images. To study this interaction, we created ArtWhisperer, an online game where users are given a target image and are tasked with iteratively finding a prompt that creates a similar-looking image as the target. Through this game, we recorded over 50,000 human-AI interactions; each interaction corresponds to one text prompt created by a user and the corresponding generated image. The majority of these are repeated interactions where a user iterates to find the best prompt for their target image, making this a unique sequential dataset for studying human-AI collaborations. In an initial analysis of this dataset, we identify several characteristics of prompt interactions and user strategies. People submit diverse prompts and are able to discover a variety of text descriptions that generate similar images. Interestingly, prompt diversity does not decrease as users find better prompts. We further propose a new metric to quantify the steerability of AI using our dataset. We define steerability as the expected number of interactions required to adequately complete a task. We estimate this value by fitting a Markov chain for each target task and calculating the expected time to reach an adequate score in the Markov chain. We quantify and compare AI steerability across different types of target images and two different models, finding that images of cities and natural world images are more steerable than artistic and fantasy images. These findings provide insights into human-AI interaction behavior, present a concrete method of assessing AI steerability, and demonstrate the general utility of the ArtWhisperer dataset.
comment: 31 pages, 27 figures, ICML 2024
♻ ☆ Distributed Maximum Consensus over Noisy Links
We introduce a distributed algorithm, termed noise-robust distributed maximum consensus (RD-MC), for estimating the maximum value within a multi-agent network in the presence of noisy communication links. Our approach entails redefining the maximum consensus problem as a distributed optimization problem, allowing a solution using the alternating direction method of multipliers. Unlike existing algorithms that rely on multiple sets of noise-corrupted estimates, RD-MC employs a single set, enhancing both robustness and efficiency. To further mitigate the effects of link noise and improve robustness, we apply moving averaging to the local estimates. Through extensive simulations, we demonstrate that RD-MC is significantly more robust to communication link noise compared to existing maximum-consensus algorithms.
comment: 5 pages, 7 figures, Presented to EUSIPCO 2024 conference
♻ ☆ HiFT: A Hierarchical Full Parameter Fine-Tuning Strategy
Full-parameter fine-tuning has become the go-to choice for adapting language models (LMs) to downstream tasks due to its excellent performance. As LMs grow in size, fine-tuning the full parameters of LMs requires a prohibitively large amount of GPU memory. Existing approaches utilize zeroth-order optimizer to conserve GPU memory, which can potentially compromise the performance of LMs as non-zero order optimizers tend to converge more readily on most downstream tasks. In this paper, we propose a novel optimizer-independent end-to-end hierarchical fine-tuning strategy, HiFT, which only updates a subset of parameters at each training step. HiFT can significantly reduce the amount of gradients and optimizer state parameters residing in GPU memory at the same time, thereby reducing GPU memory usage. Our results demonstrate that: (1) HiFT achieves comparable performance to parameter-efficient fine-tuning and standard full parameter fine-tuning. (2) HiFT supports various optimizers including AdamW, AdaGrad, SGD, etc. (3) HiFT can save more than 60\% GPU memory compared with standard full-parameter fine-tuning for 7B model. (4) HiFT enables full-parameter fine-tuning of a 7B model on single 48G A6000 with a precision of 32 using the AdamW optimizer, without using any memory saving techniques.
comment: under progress
♻ ☆ Pointer Networks with Q-Learning for Combinatorial Optimization
We introduce the Pointer Q-Network (PQN), a hybrid neural architecture that integrates model-free Q-value policy approximation with Pointer Networks (Ptr-Nets) to enhance the optimality of attention-based sequence generation, focusing on long-term outcomes. This integration proves particularly effective in solving combinatorial optimization (CO) tasks, especially the Travelling Salesman Problem (TSP), which is the focus of our study. We address this challenge by defining a Markov Decision Process (MDP) compatible with PQN, which involves iterative graph embedding, encoding and decoding by an LSTM-based recurrent neural network. This process generates a context vector and computes raw attention scores, which are dynamically adjusted by Q-values calculated for all available state-action pairs before applying softmax. The resulting attention vector is utilized as an action distribution, with actions selected hinged to exploration-exploitation dynamic adaptibility of PQN. Our empirical results demonstrate the efficacy of this approach, also testing the model in unstable environments.
♻ ☆ AFS-BM: Enhancing Model Performance through Adaptive Feature Selection with Binary Masking
We study the problem of feature selection in general machine learning (ML) context, which is one of the most critical subjects in the field. Although, there exist many feature selection methods, however, these methods face challenges such as scalability, managing high-dimensional data, dealing with correlated features, adapting to variable feature importance, and integrating domain knowledge. To this end, we introduce the "Adaptive Feature Selection with Binary Masking" (AFS-BM) which remedies these problems. AFS-BM achieves this by joint optimization for simultaneous feature selection and model training. In particular, we do the joint optimization and binary masking to continuously adapt the set of features and model parameters during the training process. This approach leads to significant improvements in model accuracy and a reduction in computational requirements. We provide an extensive set of experiments where we compare AFS-BM with the established feature selection methods using well-known datasets from real-life competitions. Our results show that AFS-BM makes significant improvement in terms of accuracy and requires significantly less computational complexity. This is due to AFS-BM's ability to dynamically adjust to the changing importance of features during the training process, which an important contribution to the field. We openly share our code for the replicability of our results and to facilitate further research.
♻ ☆ Benchmarking General Purpose In-Context Learning
In-context learning (ICL) is becoming increasingly appealing to the AI community due to its flexibility, generality, sample efficiency, and exemption from artificial optimization skills. It is desirable to further enhance the generality and capability of ICL, which gives rise to the concept of general-purpose in-context learning (GPICL). We aim to extend ICL to address a broader range of tasks with an extended learning horizon and higher improvement potential, albeit with relatively limited zero-shot generalization. To this end, we introduce two lightweight but insightful benchmarks specifically crafted to train and evaluate GPICL functionalities. Each benchmark includes a vast number of tasks characterized by significant task variance, featuring minimal inductive bias. These tasks are also designed to facilitate lifelong in-context learning through continuous generation and interaction. These features pose significant challenges for models that rely on context or interactions to improve their proficiency, including language models, decision models, and world models. Our experiments reveal that the scale of parameters alone may not be crucial for ICL or GPICL, suggesting alternative approaches such as increasing the scale of contexts and memory states.
♻ ☆ Minusformer: Improving Time Series Forecasting by Progressively Learning Residuals
In this paper, we find that ubiquitous time series (TS) forecasting models are prone to severe overfitting. To cope with this problem, we embrace a de-redundancy approach to progressively reinstate the intrinsic values of TS for future intervals. Specifically, we introduce a dual-stream and subtraction mechanism, which is a deep Boosting ensemble learning method. And the vanilla Transformer is renovated by reorienting the information aggregation mechanism from addition to subtraction. Then, we incorporate an auxiliary output branch into each block of the original model to construct a highway leading to the ultimate prediction. The output of subsequent modules in this branch will subtract the previously learned results, enabling the model to learn the residuals of the supervision signal, layer by layer. This designing facilitates the learning-driven implicit progressive decomposition of the input and output streams, empowering the model with heightened versatility, interpretability, and resilience against overfitting. Since all aggregations in the model are minus signs, which is called Minusformer. Extensive experiments demonstrate the proposed method outperform existing state-of-the-art methods, yielding an average performance improvement of 11.9% across various datasets.The code has been released at https://github.com/Anoise/Minusformer.
♻ ☆ Beyond Embeddings: The Promise of Visual Table in Visual Reasoning
Visual representation learning has been a cornerstone in computer vision, involving typical forms such as visual embeddings, structural symbols, and text-based representations. Despite the success of CLIP-type visual embeddings, they often lack access to world knowledge critical for visual reasoning. In this work, we propose Visual Table, a novel form of visual representation tailored for visual reasoning. Visual tables are constructed as hierarchical descriptions of visual scenes, featuring a scene description and multiple object-centric descriptions covering categories, attributes, and knowledge. Thanks to the structural and textual formats, visual tables offer unique advantages over mere visual embeddings, such as interpretability and controllable editing. Furthermore, they deliver instance-level world knowledge and detailed attributes that are essential for visual reasoning. To create visual tables, we develop a generator trained on the dataset with collected, small-scale annotations. Extensive results on 11 visual reasoning benchmarks demonstrate that the generated visual tables significantly outperform previous structural and text-based representations. Moreover, they consistently enhance state-of-the-art multimodal large language models across diverse benchmarks, showcasing their potential for advancing visual reasoning tasks. Our code is available at https://github.com/LaVi-Lab/Visual-Table.
comment: Project page: https://github.com/LaVi-Lab/Visual-Table
♻ ☆ Stockformer: A Price-Volume Factor Stock Selection Model Based on Wavelet Transform and Multi-Task Self-Attention Networks
As the Chinese stock market continues to evolve and its market structure grows increasingly complex, traditional quantitative trading methods are facing escalating challenges. Particularly, due to policy uncertainty and the frequent market fluctuations triggered by sudden economic events, existing models often struggle to accurately predict market dynamics. To address these challenges, this paper introduces Stockformer, a price-volume factor stock selection model that integrates wavelet transformation and a multitask self-attention network, aimed at enhancing responsiveness and predictive accuracy regarding market instabilities. Through discrete wavelet transform, Stockformer decomposes stock returns into high and low frequencies, meticulously capturing long-term market trends and short-term fluctuations, including abrupt events. Moreover, the model incorporates a Dual-Frequency Spatiotemporal Encoder and graph embedding techniques to effectively capture complex temporal and spatial relationships among stocks. Employing a multitask learning strategy, it simultaneously predicts stock returns and directional trends. Experimental results show that Stockformer outperforms existing advanced methods on multiple real stock market datasets. In strategy backtesting, Stockformer consistently demonstrates exceptional stability and reliability across market conditions-whether rising, falling, or fluctuating-particularly maintaining high performance during downturns or volatile periods, indicating a high adaptability to market fluctuations. To foster innovation and collaboration in the financial analysis sector, the Stockformer model's code has been open-sourced and is available on the GitHub repository: https://github.com/Eric991005/Multitask-Stockformer.
comment: Currently under consideration for publication in the Expert Systems With Applications
♻ ☆ Mitigating Bias for Question Answering Models by Tracking Bias Influence NAACL 2024
Models of various NLP tasks have been shown to exhibit stereotypes, and the bias in the question answering (QA) models is especially harmful as the output answers might be directly consumed by the end users. There have been datasets to evaluate bias in QA models, while bias mitigation technique for the QA models is still under-explored. In this work, we propose BMBI, an approach to mitigate the bias of multiple-choice QA models. Based on the intuition that a model would lean to be more biased if it learns from a biased example, we measure the bias level of a query instance by observing its influence on another instance. If the influenced instance is more biased, we derive that the query instance is biased. We then use the bias level detected as an optimization objective to form a multi-task learning setting in addition to the original QA task. We further introduce a new bias evaluation metric to quantify bias in a comprehensive and sensitive way. We show that our method could be applied to multiple QA formulations across multiple bias categories. It can significantly reduce the bias level in all 9 bias categories in the BBQ dataset while maintaining comparable QA accuracy.
comment: To appear at NAACL 2024 main conference
♻ ☆ Quantum Algorithms for the Pathwise Lasso
We present a novel quantum high-dimensional linear regression algorithm with an $\ell_1$-penalty based on the classical LARS (Least Angle Regression) pathwise algorithm. Similarly to available classical algorithms for Lasso, our quantum algorithm provides the full regularisation path as the penalty term varies, but quadratically faster per iteration under specific conditions. A quadratic speedup on the number of features $d$ is possible by using the quantum minimum-finding routine from D\"urr and Hoyer (arXiv'96) in order to obtain the joining time at each iteration. We then improve upon this simple quantum algorithm and obtain a quadratic speedup both in the number of features $d$ and the number of observations $n$ by using the approximate quantum minimum-finding routine from Chen and de Wolf (ICALP'23). As one of our main contributions, we construct a quantum unitary to approximately compute the joining times to be searched over by the approximate quantum minimum finding. Since the joining times are no longer exactly computed, it is no longer clear that the resulting approximate quantum algorithm obtains a good solution. As our second main contribution, we prove, via an approximate version of the KKT conditions and a duality gap, that the LARS algorithm (and thus our quantum algorithm) is robust to errors. This means that it still outputs a path that minimises the Lasso cost function up to a small error if the joining times are approximately computed. Moreover, we show that, when the observations are sampled from a Gaussian distribution, our quantum algorithm's complexity only depends polylogarithmically on $n$, exponentially better than the classical LARS algorithm, while keeping the quadratic improvement on $d$. Finally, we propose a dequantised algorithm that also retains the polylogarithmic dependence on $n$, albeit with the linear scaling on $d$ from the standard LARS algorithm.
comment: 48 pages. v2: several improvements, typos fixed, references added, fixed a bug in Theorem 28, exponentially improved the complexity dependence on the number of observations $n$ for a random Gaussian input matrix
♻ ☆ Self-Supervised Learning of Time Series Representation via Diffusion Process and Imputation-Interpolation-Forecasting Mask KDD 2024
Time Series Representation Learning (TSRL) focuses on generating informative representations for various Time Series (TS) modeling tasks. Traditional Self-Supervised Learning (SSL) methods in TSRL fall into four main categories: reconstructive, adversarial, contrastive, and predictive, each with a common challenge of sensitivity to noise and intricate data nuances. Recently, diffusion-based methods have shown advanced generative capabilities. However, they primarily target specific application scenarios like imputation and forecasting, leaving a gap in leveraging diffusion models for generic TSRL. Our work, Time Series Diffusion Embedding (TSDE), bridges this gap as the first diffusion-based SSL TSRL approach. TSDE segments TS data into observed and masked parts using an Imputation-Interpolation-Forecasting (IIF) mask. It applies a trainable embedding function, featuring dual-orthogonal Transformer encoders with a crossover mechanism, to the observed part. We train a reverse diffusion process conditioned on the embeddings, designed to predict noise added to the masked part. Extensive experiments demonstrate TSDE's superiority in imputation, interpolation, forecasting, anomaly detection, classification, and clustering. We also conduct an ablation study, present embedding visualizations, and compare inference speed, further substantiating TSDE's efficiency and validity in learning representations of TS data.
comment: Published as a full paper by KDD 2024 Research Track (12 pages as main paper and 11 pages as appendix). Source code available at https://github.com/llcresearch/TSDE
♻ ☆ Self-AMPLIFY: Improving Small Language Models with Self Post Hoc Explanations
Incorporating natural language rationales in the prompt and In-Context Learning (ICL) have led to a significant improvement of Large Language Models (LLMs) performance. However, generating high-quality rationales require human-annotation or the use of auxiliary proxy models. In this work, we propose Self-AMPLIFY to automatically generate rationales from post hoc explanation methods applied to Small Language Models (SLMs) to improve their own performance. Self-AMPLIFY is a 3-step method that targets samples, generates rationales and builds a final prompt to leverage ICL. Self-AMPLIFY performance is evaluated on four SLMs and five datasets requiring strong reasoning abilities. Self-AMPLIFY achieves good results against competitors, leading to strong accuracy improvement. Self-AMPLIFY is the first method to apply post hoc explanation methods to autoregressive language models to generate rationales to improve their own performance in a fully automated manner.
♻ ☆ Semantic-Aware Spectrum Sharing in Internet of Vehicles Based on Deep Reinforcement Learning
This work aims to investigate semantic communication in high-speed mobile Internet of vehicles (IoV) environments, with a focus on the spectrum sharing between vehicle-to-vehicle (V2V) and vehicle-to-infrastructure (V2I) communications. We specifically address spectrum scarcity and network traffic and then propose a semantic-aware spectrum sharing algorithm (SSS) based on the deep reinforcement learning (DRL) soft actor-critic (SAC) approach. Firstly, we delve into the extraction of semantic information. Secondly, we redefine metrics for semantic information in V2V and V2I spectrum sharing in IoV environments, introducing high-speed semantic spectrum efficiency (HSSE) and semantic transmission rate (HSR). Finally, we employ the SAC algorithm for decision optimization in V2V and V2I spectrum sharing based on semantic information. This optimization encompasses the optimal link of V2V and V2I sharing strategies, the transmission power for vehicles sending semantic information and the length of transmitted semantic symbols, aiming at maximizing HSSE of V2I and enhancing success rate of effective semantic information transmission (SRS) of V2V. Experimental results demonstrate that the SSS algorithm outperforms other baseline algorithms, including other traditional-communication-based spectrum sharing algorithms and spectrum sharing algorithm using other reinforcement learning approaches. The SSS algorithm exhibits a 15% increase in HSSE and approximately a 7% increase in SRS.
comment: This paper has been submitted to IEEE Journal. The source code has been released at: https://github.com/qiongwu86/Semantic-Aware-Spectrum-Sharing-in-Internet-of-Vehicles-Based-on-Deep-Reinforcement-Learning
♻ ☆ Optimal Attack and Defense for Reinforcement Learning
To ensure the usefulness of Reinforcement Learning (RL) in real systems, it is crucial to ensure they are robust to noise and adversarial attacks. In adversarial RL, an external attacker has the power to manipulate the victim agent's interaction with the environment. We study the full class of online manipulation attacks, which include (i) state attacks, (ii) observation attacks (which are a generalization of perceived-state attacks), (iii) action attacks, and (iv) reward attacks. We show the attacker's problem of designing a stealthy attack that maximizes its own expected reward, which often corresponds to minimizing the victim's value, is captured by a Markov Decision Process (MDP) that we call a meta-MDP since it is not the true environment but a higher level environment induced by the attacked interaction. We show that the attacker can derive optimal attacks by planning in polynomial time or learning with polynomial sample complexity using standard RL techniques. We argue that the optimal defense policy for the victim can be computed as the solution to a stochastic Stackelberg game, which can be further simplified into a partially-observable turn-based stochastic game (POTBSG). Neither the attacker nor the victim would benefit from deviating from their respective optimal policies, thus such solutions are truly robust. Although the defense problem is NP-hard, we show that optimal Markovian defenses can be computed (learned) in polynomial time (sample complexity) in many scenarios.
♻ ☆ An Open and Large-Scale Dataset for Multi-Modal Climate Change-aware Crop Yield Predictions
Precise crop yield predictions are of national importance for ensuring food security and sustainable agricultural practices. While AI-for-science approaches have exhibited promising achievements in solving many scientific problems such as drug discovery, precipitation nowcasting, etc., the development of deep learning models for predicting crop yields is constantly hindered by the lack of an open and large-scale deep learning-ready dataset with multiple modalities to accommodate sufficient information. To remedy this, we introduce the CropNet dataset, the first terabyte-sized, publicly available, and multi-modal dataset specifically targeting climate change-aware crop yield predictions for the contiguous United States (U.S.) continent at the county level. Our CropNet dataset is composed of three modalities of data, i.e., Sentinel-2 Imagery, WRF-HRRR Computed Dataset, and USDA Crop Dataset, for over 2200 U.S. counties spanning 6 years (2017-2022), expected to facilitate researchers in developing versatile deep learning models for timely and precisely predicting crop yields at the county-level, by accounting for the effects of both short-term growing season weather variations and long-term climate change on crop yields. Besides, we develop the CropNet package, offering three types of APIs, for facilitating researchers in downloading the CropNet data on the fly over the time and region of interest, and flexibly building their deep learning models for accurate crop yield predictions. Extensive experiments have been conducted on our CropNet dataset via employing various types of deep learning solutions, with the results validating the general applicability and the efficacy of the CropNet dataset in climate change-aware crop yield predictions.
comment: 13 pages
♻ ☆ Grounding Continuous Representations in Geometry: Equivariant Neural Fields
Recently, Neural Fields have emerged as a powerful modelling paradigm to represent continuous signals. In a conditional neural field, a field is represented by a latent variable that conditions the NeF, whose parametrisation is otherwise shared over an entire dataset. We propose Equivariant Neural Fields based on cross attention transformers, in which NeFs are conditioned on a geometric conditioning variable, a latent point cloud, that enables an equivariant decoding from latent to field. Our equivariant approach induces a steerability property by which both field and latent are grounded in geometry and amenable to transformation laws if the field transforms, the latent represents transforms accordingly and vice versa. Crucially, the equivariance relation ensures that the latent is capable of (1) representing geometric patterns faitfhully, allowing for geometric reasoning in latent space, (2) weightsharing over spatially similar patterns, allowing for efficient learning of datasets of fields. These main properties are validated using classification experiments and a verification of the capability of fitting entire datasets, in comparison to other non-equivariant NeF approaches. We further validate the potential of ENFs by demonstrate unique local field editing properties.
comment: Preprint for Neurips submission
♻ ☆ A New Index for Clustering Evaluation Based on Density Estimation
A new index for internal evaluation of clustering is introduced. The index is defined as a mixture of two sub-indices. The first sub-index $ I_a $ is called the Ambiguous Index; the second sub-index $ I_s $ is called the Similarity Index. Calculation of the two sub-indices is based on density estimation to each cluster of a partition of the data. An experiment is conducted to test the performance of the new index, and compared with six other internal clustering evaluation indices -- Calinski-Harabasz index, Silhouette coefficient, Davies-Bouldin index, CDbw, DBCV, and VIASCKDE, on a set of 145 datasets. The result shows the new index significantly improves other internal clustering evaluation indices.
Systems and Control 29
☆ Computationally Efficient RL under Linear Bellman Completeness for Deterministic Dynamics
We study computationally and statistically efficient Reinforcement Learning algorithms for the linear Bellman Complete setting, a setting that uses linear function approximation to capture value functions and unifies existing models like linear Markov Decision Processes (MDP) and Linear Quadratic Regulators (LQR). While it is known from the prior works that this setting is statistically tractable, it remained open whether a computationally efficient algorithm exists. Our work provides a computationally efficient algorithm for the linear Bellman complete setting that works for MDPs with large action spaces, random initial states, and random rewards but relies on the underlying dynamics to be deterministic. Our approach is based on randomization: we inject random noise into least square regression problems to perform optimistic value iteration. Our key technical contribution is to carefully design the noise to only act in the null space of the training data to ensure optimism while circumventing a subtle error amplification issue.
☆ Physics-Constrained Learning for PDE Systems with Uncertainty Quantified Port-Hamiltonian Models
Modeling the dynamics of flexible objects has become an emerging topic in the community as these objects become more present in many applications, e.g., soft robotics. Due to the properties of flexible materials, the movements of soft objects are often highly nonlinear and, thus, complex to predict. Data-driven approaches seem promising for modeling those complex dynamics but often neglect basic physical principles, which consequently makes them untrustworthy and limits generalization. To address this problem, we propose a physics-constrained learning method that combines powerful learning tools and reliable physical models. Our method leverages the data collected from observations by sending them into a Gaussian process that is physically constrained by a distributed Port-Hamiltonian model. Based on the Bayesian nature of the Gaussian process, we not only learn the dynamics of the system, but also enable uncertainty quantification. Furthermore, the proposed approach preserves the compositional nature of Port-Hamiltonian systems.
☆ Run Time Assured Reinforcement Learning for Six Degree-of-Freedom Spacecraft Inspection
The trial and error approach of reinforcement learning (RL) results in high performance across many complex tasks, but it can also lead to unsafe behavior. Run time assurance (RTA) approaches can be used to assure safety of the agent during training, allowing it to safely explore the environment. This paper investigates the application of RTA during RL training for a 6-Degree-of-Freedom spacecraft inspection task, where the agent must control its translational motion and attitude to inspect a passive chief spacecraft. Several safety constraints are developed based on position, velocity, attitude, temperature, and power of the spacecraft, and are all enforced simultaneously during training through the use of control barrier functions. This paper also explores simulating the RL agent and RTA at different frequencies to best balance training performance and safety assurance. The agent is trained with and without RTA, and the performance is compared across several metrics including inspection percentage and fuel usage.
☆ Optimal Transport-Assisted Risk-Sensitive Q-Learning
The primary goal of reinforcement learning is to develop decision-making policies that prioritize optimal performance without considering risk or safety. In contrast, safe reinforcement learning aims to mitigate or avoid unsafe states. This paper presents a risk-sensitive Q-learning algorithm that leverages optimal transport theory to enhance the agent safety. By integrating optimal transport into the Q-learning framework, our approach seeks to optimize the policy's expected return while minimizing the Wasserstein distance between the policy's stationary distribution and a predefined risk distribution, which encapsulates safety preferences from domain experts. We validate the proposed algorithm in a Gridworld environment. The results indicate that our method significantly reduces the frequency of visits to risky states and achieves faster convergence to a stable policy compared to the traditional Q-learning algorithm.
☆ Control of Unknown Quadrotors from a Single Throw IROS 2024
This paper presents a method to recover quadrotor UAV from a throw, when no control parameters are known before the throw. We leverage the availability of high-frequency rotor speed feedback available in racing drone hardware and software to find control effectiveness values and fit a motor model using recursive least squares (RLS) estimation. Furthermore, we propose an excitation sequence that provides large actuation commands while guaranteeing to stay within gyroscope sensing limits. After 450ms of excitation, an INDI attitude controller uses the 52 fitted parameters to arrest rotational motion and recover an upright attitude. Finally, a NDI position controller drives the craft to a position setpoint. The proposed algorithm runs efficiently on microcontrollers found in common UAV flight controllers, and was shown to recover an agile quadrotor every time in 57 live experiments with as low as 3.5m throw height, demonstrating robustness against initial rotations and noise. We also demonstrate control of randomized quadrotors in simulated throws, where the parameter fitting RMS error is typically within 10% of the true value.
comment: 7 pages, 5 figures, 2 tables. Submitted to the IROS 2024 conference
☆ GridSweep Simulation: Measuring Subsynchronous Impedance Spectra of Distribution Feeder
Peaks and troughs in the subsynchronous impedance spectrum of a distribution feeder may be a useful indication of oscillation risk, or more importantly lack of oscillation risk, if inverter-based resource (IBR) deployments are increased on that feeder. GridSweep is a new instrument for measuring the subsynchronous impedance spectra of distribution feeders. It combines an active probing device that modulates a 120-volt 1-kW load sinusoidally at a user-selected GPS-phase locked frequency from 1.0 to 40.0 Hz, and with a recorder that takes ultra-high-precision continuous point-on-wave (CPOW) 120-volt synchrowaveforms at 4 kHz. This paper presents a computer simulation of GridSweep's probing and measurement capability. We construct an electromagnetic transient (EMT) simulation of a single-phase distribution feeder equipped with multiple inverter-based resources (IBRs). We include a model of the GridSweep probing device, then demonstrate the model's capability to measure the subsynchronous apparent impedance spectrum of the feeder. Peaks in that spectrum align with the system's dominant oscillation modes caused by IBRs.
comment: 10 pages, 18 figures
☆ Robust positive model reduction via monotone matrices
This work focuses on the problem of exact model reduction of positive linear systems, by leveraging minimal realization theory. While even determining the existence of a positive reachable realization remains an open problem, we are able to fully characterize the cases in which the new model is obtained with non-negative reduction matrices, and hence positivity of the reduced model is robust with respect to small perturbations of original system. The characterization is obtained by specializing monotone matrix theory to positive matrices. In addition, we provide positive reduction also when minimal ones are not available, by exploiting algebraic techniques.
☆ Communication-Efficient MARL for Platoon Stability and Energy-efficiency Co-optimization in Cooperative Adaptive Cruise Control of CAVs
Cooperative adaptive cruise control (CACC) has been recognized as a fundamental function of autonomous driving, in which platoon stability and energy efficiency are outstanding challenges that are difficult to accommodate in real-world operations. This paper studied the CACC of connected and autonomous vehicles (CAVs) based on the multi-agent reinforcement learning algorithm (MARL) to optimize platoon stability and energy efficiency simultaneously. The optimal use of communication bandwidth is the key to guaranteeing learning performance in real-world driving, and thus this paper proposes a communication-efficient MARL by incorporating the quantified stochastic gradient descent (QSGD) and a binary differential consensus (BDC) method into a fully-decentralized MARL framework. We benchmarked the performance of our proposed BDC-MARL algorithm against several several non-communicative andcommunicative MARL algorithms, e.g., IA2C, FPrint, and DIAL, through the evaluation of platoon stability, fuel economy, and driving comfort. Our results show that BDC-MARL achieved the highest energy savings, improving by up to 5.8%, with an average velocity of 15.26 m/s and an inter-vehicle spacing of 20.76 m. In addition, we conducted different information-sharing analyses to assess communication efficacy, along with sensitivity analyses and scalability tests with varying platoon sizes. The practical effectiveness of our approach is further demonstrated using real-world scenarios sourced from open-sourced OpenACC.
☆ Unleashing OpenTitan's Potential: a Silicon-Ready Embedded Secure Element for Root of Trust and Cryptographic Offloading
The rapid advancement and exploration of open-hardware RISC-V platforms are driving significant changes in sectors like autonomous vehicles, smart-city infrastructure, and medical devices. OpenTitan stands out as a groundbreaking open-source RISC-V design with a comprehensive security toolkit as a standalone system-on-chip (SoC). OpenTitan includes Earl Grey, a fully implemented and silicon-proven SoC, and Darjeeling, announced but not yet fully implemented. Earl Grey targets standalone SoC implementations, while Darjeeling is for integrable implementations. The literature lacks a silicon-ready embedded implementation of an open-source Root of Trust, despite lowRISC's efforts on Darjeeling. We address the limitations of existing implementations by optimizing data transfer latency between memory and cryptographic accelerators to prevent under-utilization and ensure efficient task acceleration. Our contributions include a comprehensive methodology for integrating custom extensions and IPs into the Earl Grey architecture, architectural enhancements for system-level integration, support for varied boot modes, and improved data movement across the platform. These advancements facilitate deploying OpenTitan in broader SoCs, even without specific technology-dependent IPs, providing a deployment-ready research vehicle for the community. We integrated the extended Earl Grey architecture into a reference architecture in a 22nm FDX technology node, benchmarking the enhanced architecture's performance. The results show significant improvements in cryptographic processing speed, achieving up to 2.7x speedup for SHA-256/HMAC and 1.6x for AES accelerators compared to the baseline Earl Grey architecture.
☆ Leveraging Cooperative Connected Automated Vehicles for Mixed Traffic Safety
The introduction of connected and automated vehicles (CAV) is believed to reduce congestion, enhance safety, and improve traffic efficiency. Numerous research studies have focused on controlling pure CAV platoons in fully connected automated traffic, as well as single or multiple CAVs in mixed traffic with human-driven vehicles (HVs). CAV cruising control designs have been proposed to stabilize the car-following traffic dynamics, but few studies has considered their safety impact, particularly the trade-offs between stability and safety. In this paper, we study how cooperative control strategies for CAVs can be designed to enhance the safety and smoothness of mixed traffic under varying penetrations of connectivity and automation. Considering mixed traffic where a pair of CAVs travels amongst HVs, we design cooperative feedback controllers for the pair CAVs to stabilize traffic via cooperation and, possibly, by also leveraging connectivity with HVs. The real-time safety impact of the CAV controllers is investigated using control barrier functions (CBF). We construct CBF safety constraints, based on which we propose safety-critical control designs to guarantee CAV safety, HV safety and platoon safety. Both theoretical and numerical analyses have been conducted to explore the effect of CAV cooperation and HV connectivity on stability and safety. Our results show that the cooperation of CAVs helps to stabilize the mixed traffic while safety can be guaranteed with the safety filters. Moreover, connectivity between CAVs and HVs offers additional benefits: if an HV connects to an upstream CAV (i.e., the CAV looks ahead), it helps the CAV to stabilize the upstream traffic, while if an HV connects to a downstream CAV (i.e., the CAV looks behind), the safety of this connected HV can be enhanced.
☆ LFPLM: A General and Flexible Load Forecasting Framework based on Pre-trained Language Model
Accurate load forecasting is essential for maintaining the power balance between generators and consumers, especially with the increasing integration of renewable energy sources, which introduce significant intermittent volatility. With the development of data-driven methods, machine learning and deep learning-based models have become the predominant approach for load forecasting tasks. In recent years, pre-trained language models (PLMs) have made significant advancements, demonstrating superior performance in various fields. This paper proposes a load forecasting method based on PLMs, which offers not only accurate predictive ability but also general and flexible applicability. Additionally, a data modeling method is proposed to effectively transform load sequence data into natural language for PLM training. Furthermore, we introduce a data enhancement strategy that eliminate the impact of PLM hallucinations on forecasting results. The effectiveness of the proposed method has been validated on two real-world datasets. Compared with existing methods, our approach shows state-of-the-art performance across all validation metrics.
comment: 7 pages, 5 figures and 5 tables
☆ Joint Distributed Generation Maximization and Radial Distribution Network Reconfiguration
This paper studies an optimization problem for joint radial distribution system network reconfiguration and power dispatch for distributed generation (DG) maximization. We provide counterexamples to show that for DG maximization, standard techniques such as interior point method (as in Matpower), linear approximation and second order cone relaxation (e.g., by Jabr et.~al.~and by Farivar and Low) do not deliver the desired control. Instead, we propose a control decision model based on exact DistFlow equations bypassing relaxation and a solution approach based on spatial branch-and-bound algorithm. We justify our work with comparative studies and numerical demonstrations with benchmarks and a 533-bus real example, performing reconfiguration and power dispatch in a time scale relevant for control center applications.
☆ Energy efficiency analysis of ammonia-fueled power systems for vehicles considering residual heat recovery
Ammonia, known as a good hydrogen carrier, shows great potential for use as a zero-carbon fuel for vehicles. However, both the internal combustion engine (ICE) and the proton exchange membrane fuel cell (PEMFC), the currently available engines used by the vehicle, require hydrogen decomposed from ammonia. On-board hydrogen production is an energy-intensive process that significantly reduces system efficiency. Therefore, energy recovery from the system's residual heat is essential to promote system efficiency. ICEs and FCs require different amounts of hydrogen, and they produce residual heat of different quality and quantity, so the system efficiency is not only determined by the engine operating point, but also by the measures and ratios of residual heat recovery. To thoroughly understand the relationships between system energy efficiency and system configuration as well as system parameters, this paper takes three typical power systems with different configurations as our objects. Models of three systems are set up for system energy efficiency analysis, and carry out simulations under different conditions to conduct system output power and energy efficiency. By analyzing the simulation results, the factors that most significantly impact the system efficiency are identified, the guidelines for system design and parameter optimization are proposed.
☆ Newton and Secant Methods for Iterative Remnant Control of Preisach Hysteresis Operators
We study the properties of remnant function, which is a function of output remnant versus amplitude of the input signal, of Preisach hysteresis operators. The remnant behavior (or the leftover memory when the input reaches zero) enables an energy-optimal application of piezoactuator systems where the applied electrical field can be removed when the desired strain/displacement has been attained. We show that when the underlying weight of Preisach operators is positive, the resulting remnant curve is monotonically increasing and accordingly a Newton and secant update laws for the iterative remnant control are proposed that allows faster convergence to the desired remnant value than the existing iterative remnant control algorithm in literature as validated by numerical simulation.
comment: Accepted/inpress for LCSS
☆ Balancing Performance and Cost for Two-Hop Cooperative Communications: Stackelberg Game and Distributed Multi-Agent Reinforcement Learning
This paper aims to balance performance and cost in a two-hop wireless cooperative communication network where the source and relays have contradictory optimization goals and make decisions in a distributed manner. This differs from most existing works that have typically assumed that source and relay nodes follow a schedule created implicitly by a central controller. We propose that the relays form an alliance in an attempt to maximize the benefit of relaying while the source aims to increase the channel capacity cost-effectively. To this end, we establish the trade problem as a Stackelberg game, and prove the existence of its equilibrium. Another important aspect is that we use multi-agent reinforcement learning (MARL) to approach the equilibrium in a situation where the instantaneous channel state information (CSI) is unavailable, and the source and relays do not have knowledge of each other's goal. A multi-agent deep deterministic policy gradient-based framework is designed, where the relay alliance and the source act as agents. Experiments demonstrate that the proposed method can obtain an acceptable performance that is close to the game-theoretic equilibrium for all players under time-invariant environments, which considerably outperforms its potential alternatives and is only about 2.9% away from the optimal solution.
☆ An Internal Model Principle For Robots
When designing a robot's internal system, one often makes assumptions about the structure of the intended environment of the robot. One may even assign meaning to various internal components of the robot in terms of expected environmental correlates. In this paper we want to make the distinction between robot's internal and external worlds clear-cut. Can the robot learn about its environment, relying only on internally available information, including the sensor data? Are there mathematical conditions on the internal robot system which can be internally verified and make the robot's internal system mirror the structure of the environment? We prove that sufficiency is such a mathematical principle, and mathematically describe the emergence of the robot's internal structure isomorphic or bisimulation equivalent to that of the environment. A connection to the free-energy principle is established, when sufficiency is interpreted as a limit case of surprise minimization. As such, we show that surprise minimization leads to having an internal model isomorphic to the environment. This also parallels the Good Regulator Principle which states that controlling a system sufficiently well means having a model of it. Unlike the mentioned theories, ours is discrete, and non-probabilistic.
comment: 19 pages, 2 figures
☆ Low-probability of Intercept/Detect (LPI/LPD) Secure Communications Using Antenna Arrays Employing Rapid Sidelobe Time Modulation
We present an electronically-reconfigurable antenna array offering low probability of intercept/detect (LPI/LPD) and secure communications capabilities simultaneously at the physical layer. This antenna array is designed to provide rapidly time-varying sidelobes and a stationary main lobe. By performing rapid sidelobe time modulation (SLTM), the signal transmitted in the undesired directions (i.e., through sidelobes) undergoes spread-spectrum distortion making it more difficult to be detected, intercepted, and deciphered while the signal transmitted in the desired direction (i.e., through the main lobe) is unaffected. Therefore, the intended receiver would not need additional modifications (i.e. encryption keys) to detect and recover the signal. We describe the operating principles of this SLTM array and validate its spread-spectrum SLTM sequence generation in undesired directions through theory, simulations, and experiments. Using a fabricated SLTM prototype operating at X band, we conducted system-level measurements to demonstrate its LPI/LPD, secure communications, and jamming resilience capabilities. The presented method is a physical layer technique, which can bring LPI/LPD capabilities to existing communications systems by simply replacing their antennas with SLTM arrays. This technique can be used independently or in combination with additional coding and signal-processing techniques to achieve further enhancements in LPI/LPD and secure communications.
comment: This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
☆ A Swift and Omnidirectional Formation Approach based on Hierarchical Reorganization
Current formations commonly rely on invariant hierarchical structures, such as predetermined leaders or enumerated formation shapes. These structures could be unidirectional and sluggish, constraining their adaptability and agility when encountering cluttered environments. To surmount these constraints, this work proposes an omnidirectional affine formation approach with hierarchical reorganizations. We first delineate the critical conditions requisite for facilitating hierarchical reorganizations within formations, which informs the development of the omnidirectional affine criterion. Central to our approach is the fluid leadership and authority redistribution, for which we develop a minimum time-driven leadership evaluation algorithm and a power transition control algorithm. These algorithms facilitate autonomous leader selection and ensure smooth power transitions, enabling the swarm to adapt hierarchically in alignment with the external environment. Furthermore, we deploy a power-centric topology switching mechanism tailored for the dynamic reorganization of in-team connections. Finally, simulations and experiments demonstrate the performance of the proposed method. The formation successfully performs several hierarchical reorganizations, with the longest reorganization taking only 0.047s. This swift adaptability allows five aerial robots to carry out complex tasks, including executing swerving movements and navigating through hoops at velocities up to 1.9m/s.
☆ Dynamic Modeling and Control for an Offshore Semisubmersible Floating Wind Turbine
Floating wind turbines (FWTs) hold significant potential for the exploitation of offshore renewable energy resources. Nevertheless, prior to the construction of FWTs, it is imperative to tackle several critical challenges, especially the issue of performance degradation under combined wind and wave loads. This study initiates with the development of a simplified nonlinear dynamical model for a semi-submersible FWT. In particular, both the rotor dynamics and the finite rotations of the platform are considered in presented modeling approach, thereby effectively capturing the complex interplay between the platform, tower, nacelle, and rotor under combined wind and wave loads. Subsequently, based on the developed FWT model, a novel adaptive nonlinear pitch controller is formulated with the goal of striking a trade-off between regulating power generation and reducing platform motion. Notably, the proposed control strategy adopts a continuous control approach, strategically beneficial in circumventing the chattering phenomenon commonly associated with sliding mode control. Furthermore, the controller integrates an online approximator and a robust integral of the sign of the tracking error, facilitating real-time learning of system unknown dynamics while compensating for bounded disturbances. Finally, both the accuracy of the established nonlinear FWT model in predicting key dynamics and the superiority of the presented pitch controller are validated through comprehensive comparative studies.
☆ Parametric Resonance in Networked Oscillators
We investigate parametric resonance in oscillator networks subjected to periodically time-varying oscillations in the edge strengths. Such models are inspired by the well-known parametric resonance phenomena for single oscillators, as well as the potential rich phenomenology when such parametric excitations are present in a variety of applications like deep brain stimulation, AC power transmission networks, as well as vehicular flocking formations. We consider cases where a single edge, a subgraph, or the entire network is subjected to forcing, and in each case, we characterize an interesting interplay between the parametric resonance modes and the eigenvalues/vectors of the graph Laplacian. Our analysis is based on a novel treatment of multiple-scale perturbation analysis that we develop for the underlying high-dimensional dynamic equations.
comment: 16 pages, 11 figures
☆ Hacking Encrypted Wireless Power: Cyber-Security of Dynamic Charging
Recently, energy encryption for wireless power transfer has been developed for energy safety, which is important in public places to suppress unauthorized energy extraction. Most techniques vary the frequency so that unauthorized receivers cannot extract energy because of non-resonance. However, this strategy is unreliable. To stimulate the progress of energy encryption technology and point out security holes, this paper proposes a decryption method for the fundamental principle of encrypted frequency-varying wireless power transfer. The paper uses an auxiliary coil to detect the frequency and a switched-capacitor array to adaptively compensate the receiver for a wide frequency range. The switched-capacitor array contains two capacitors and one semi-conductor switch. One capacitor compensates the receiver all the time while the other's active time during one wireless power transfer cycle is regulated by the switch. Thus, the proposed hacking receiver controls the equivalent capacitance of the compensation and steals energy. Finally, a detailed simulation model and experimental results prove the effectiveness of the attack on frequency-hopping energy encryption. Although any nonnegligible energy extracted would be problematic, we achieved to steal 78% to 84% of the energy an authorized receiver could get. When the frequency changes, the interceptor is coarsely tuned very quickly, which can hack fast frequency-varying encrypted system.
comment: 10 pages, 17 figures
☆ QoE Maximization for Multiple-UAV-Assisted Multi-Access Edge Computing: An Online Joint Optimization Approach
In disaster scenarios, conventional terrestrial multi-access edge computing (MEC) paradigms, which rely on fixed infrastructure, may become unavailable due to infrastructure damage. With high-probability line-of-sight (LoS) communication, flexible mobility, and low cost, unmanned aerial vehicle (UAV)-assisted MEC is emerging as a new promising paradigm to provide edge computing services for ground user devices (UDs) in disaster-stricken areas. However, the limited battery capacity, computing resources, and spectrum resources also pose serious challenges for UAV-assisted MEC, which can potentially shorten the service time of UAVs and degrade the quality of experience (QoE) of UDs without an effective control approach. To this end, in this work, we first present a hierarchical architecture of multiple-UAV-assisted MEC networks that enables the coordinated provision of edge computing services by multiple UAVs. Then, we formulate a joint task offloading, resource allocation, and UAV trajectory planning optimization problem (JTRTOP) to maximize the QoE of UDs while considering the energy consumption constraints of UAVs. Since the problem is proven to be a future-dependent and NP-hard problem, we propose a novel online joint task offloading, resource allocation, and UAV trajectory planning approach (OJTRTA) to solve the problem. Specifically, the JTRTOP is first transformed into a per-slot real-time optimization problem (PROP) using the Lyapunov optimization framework. Then, a two-stage optimization method based on game theory and convex optimization is proposed to solve the PROP. Simulation results provide empirical evidence supporting the superior system performance of the proposed OJTRTA in comparison to alternative approaches.
♻ ☆ Structural Controllability of Bilinear Systems on $\mathbb{SE(n)}$
Structural controllability challenges arise from imprecise system modeling and system interconnections in large scale systems. In this paper, we study structural control of bilinear systems on the special Euclidean group. We employ graph theoretic methods to analyze the structural controllability problem for driftless bilinear systems and structural accessibility for bilinear systems with drift. This facilitates the identification of a sparsest pattern necessary for achieving structural controllability and discerning redundant connections. To obtain a graph theoretic characterization of structural controllability and accessibility on the special Euclidean group, we introduce a novel idea of solid and broken edges on graphs; subsequently, we use the notion of transitive closure of graphs.
♻ ☆ Enhancing Mobile Robot Navigation Safety and Efficiency through NMPC with Relaxed CBF in Dynamic Environments
In this paper, a safety-critical control strategy for a nonholonomic robot is developed to generate control signals that result in optimal, obstacle-free paths through dynamic environments. We formulate the control synthesis problem as an Optimal Control Problem (OCP) that enforces Control Lyapunov Function (CLF) constraints for system stability as well as safety-critical constraints using Control Barrier Function (CBF) with a relaxing decay rate of the barrier function. A Nonlinear Model Predictive Control (NMPC) integrates with CLF and CBF to ensure system safety and facilitate optimal performance within a short prediction horizon, reducing the computational burden in real-time implementation. Additionally, we incorporate an obstacle avoidance constraint based on the Euclidean norm into the NMPC framework, showcasing the CBF approach's superiority in addressing mobile robotic systems' point stabilisation and trajectory tracking challenges. Through extensive simulations, the proposed controller demonstrates proficiency in static and dynamic obstacle avoidance under various scenarios. Experimental validations conducted using the Husky A200 robot align with simulation results, reinforcing the applicability of our proposed approach in real-world scenarios, notably improving the computational efficiency and safety in practical mobile robot applications.
comment: 6 pages, 6 figures, 2024 IEEE 20th International Conference on Automation Science and Engineering
♻ ☆ Probabilistic Constrained Reinforcement Learning with Formal Interpretability
Reinforcement learning can provide effective reasoning for sequential decision-making problems with variable dynamics. Such reasoning in practical implementation, however, poses a persistent challenge in interpreting the reward function and the corresponding optimal policy. Consequently, representing sequential decision-making problems as probabilistic inference can have considerable value, as, in principle, the inference offers diverse and powerful mathematical tools to infer the stochastic dynamics whilst suggesting a probabilistic interpretation of policy optimization. In this study, we propose a novel Adaptive Wasserstein Variational Optimization, namely AWaVO, to tackle these interpretability challenges. Our approach uses formal methods to achieve the interpretability for convergence guarantee, training transparency, and intrinsic decision-interpretation. To demonstrate its practicality, we showcase guaranteed interpretability with an optimal global convergence rate in simulation and in practical quadrotor tasks. In comparison with state-of-the-art benchmarks including TRPO-IPO, PCPO and CRPO, we empirically verify that AWaVO offers a reasonable trade-off between high performance and sufficient interpretability.
comment: 25 pages, 9 figures, containing Appendix
♻ ☆ Adaptive Smooth Control via Nonsingular Fast Terminal Sliding Mode for Distributed Space Telescope Demonstration Mission by CubeSat Formation Flying
This paper investigates the efficiency of nonsingular fast terminal sliding mode and adaptive smooth control method for the distributed space telescope demonstration mission. The distributed space telescope has a flexible focal length that corresponds to the relative position of the formation flying concept. The precise formation flying technology by CubeSats enhances the utility of distributed space systems with low costs. The propulsion systems for CubeSats usually have restricted degrees of freedom. Since the scientific mission requires continuous orbit control, the attitude and orbit control system mutually affect the control performance. The nonsingular fast terminal sliding mode has the advantage of a fast convergence rate and is able to improve the control performance. The adaptive smooth controller designed for the SISO system is expanded and applied to the attitude and orbit control system. The simulation results verify the efficiency of the adaptive smooth controller based on the nonsingular fast terminal sliding mode.
comment: This manuscript is submitted to IEEE Transactions on Aerospace and Electronic Systems
♻ ☆ Towards Safe Multi-Task Bayesian Optimization
Bayesian optimization has emerged as a highly effective tool for the safe online optimization of systems, due to its high sample efficiency and noise robustness. To further enhance its efficiency, reduced physical models of the system can be incorporated into the optimization process, accelerating it. These models are able to offer an approximation of the actual system, and evaluating them is significantly cheaper. The similarity between the model and reality is represented by additional hyperparameters, which are learned within the optimization process. Safety is a crucial criterion for online optimization methods such as Bayesian optimization, which has been addressed by recent works that provide safety guarantees under the assumption of known hyperparameters. In practice, however, this does not apply. Therefore, we extend the robust Gaussian process uniform error bounds to meet the multi-task setting, which involves the calculation of a confidence region from the hyperparameter posterior distribution utilizing Markov chain Monte Carlo methods. Subsequently, the robust safety bounds are employed to facilitate the safe optimization of the system, while incorporating measurements of the models. Simulation results indicate that the optimization can be significantly accelerated for expensive to evaluate functions in comparison to other state-of-the-art safe Bayesian optimization methods, contingent on the fidelity of the models.
comment: Submitted to L4DC 2024
♻ ☆ Generalized pair-wise logit dynamic and its connection to a mean field game: theoretical and computational investigations focusing on resource management
Logit dynamics are evolution equations that describe transitions to equilibria of actions among many players. We formulate a pair-wise logit dynamic in a continuous action space with a generalized exponential function, which we call a generalized pair-wise logit dynamic, depicted by a new evolution equation nonlocal in space. We prove the well-posedness and approximability of the generalized pair-wise logit dynamic to show that it is computationally implementable. We also show that this dynamic has an explicit connection to a mean field game of a controlled pure-jump process, with which the two different mathematical models can be understood in a unified way. Particularly, we show that the generalized pair-wise logit dynamic is derived as a myopic version of the corresponding mean field game, and that the conditions to guarantee the existence of unique solutions are different from each other. The key in this procedure is to find the objective function to be optimized in the mean field game based on the logit function. The monotonicity of the utility is unnecessary for the generalized pair-wise logit dynamic but crucial for the mean field game. Finally, we present applications of the two approaches to fisheries management problems with collected data.
♻ ☆ Tackling the Crowdsourced Shared-Trip Delivery Problem at Scale with a Novel Decomposition Heuristic
This paper presents a set-partitioning formulation and a novel decomposition heuristic (D-H) solution algorithm to solve large-scale instances of the urban crowdsourced shared-trip delivery (CSD) problem. The CSD problem involves dedicated vehicles (DVs) and shared personal vehicles (SPVs) fulfilling delivery orders, wherein the SPVs have their own trip origins and destinations. The D-H begins by assigning as many package delivery orders (PDOs) to SPVs as possible, where the D-H enumerates the set of routes each SPV can feasibly traverse and then solves a PDO-SPV-route assignment problem. For PDO-DV assignment and DV routing, the D-H solves a multi-vehicle routing problem with time-window, tour duration, and capacity constraints using an insertion heuristic. Finally, the D-H seeks potential solution improvements by switching PDOs between SPV and DV routes through a simulated annealing (SA)-inspired procedure. The D-H outperforms a commercial solver in terms of computational efficiency while obtaining near-optimal solutions for small problem instances. The SA-inspired switching procedure outperforms a large neighborhood search algorithm regarding run time, and the two are comparable regarding solution quality. Finally, the paper uses the D-H to analyze the impact of several relevant factors on city-scale CSD system performance, namely the number of participating SPVs and the maximum willingness to detour of SPVs. Consistent with the existing literature, we find that CSD can substantially reduce delivery costs. However, we find that CSD can increase vehicle miles traveled. Our findings provide meaningful insights for logistics practitioners, while the algorithms illustrate promise for large real-world systems.
Multiagent Systems 9
☆ Instruct, Not Assist: LLM-based Multi-Turn Planning and Hierarchical Questioning for Socratic Code Debugging
Socratic questioning is an effective teaching strategy, encouraging critical thinking and problem-solving. The conversational capabilities of large language models (LLMs) show great potential for providing scalable, real-time student guidance. However, current LLMs often give away solutions directly, making them ineffective instructors. We tackle this issue in the code debugging domain with TreeInstruct, an Instructor agent guided by a novel state space-based planning algorithm. TreeInstruct asks probing questions to help students independently identify and resolve errors. It estimates a student's conceptual and syntactical knowledge to dynamically construct a question tree based on their responses and current knowledge state, effectively addressing both independent and dependent mistakes concurrently in a multi-turn interaction setting. In addition to using an existing single-bug debugging benchmark, we construct a more challenging multi-bug dataset of 150 coding problems, incorrect solutions, and bug fixes -- all carefully constructed and annotated by experts. Extensive evaluation shows TreeInstruct's state-of-the-art performance on both datasets, proving it to be a more effective instructor than baselines. Furthermore, a real-world case study with five students of varying skill levels further demonstrates TreeInstruct's ability to guide students to debug their code efficiently with minimal turns and highly Socratic questioning.
☆ Decentralized Collaborative Pricing and Shunting for Multiple EV Charging Stations Based on Multi-Agent Reinforcement Learning
The extraordinary electric vehicle (EV) popularization in the recent years has facilitated research studies in alleviating EV energy charging demand. Previous studies primarily focused on the optimizations over charging stations (CS) profit and EV users cost savings through charge/discharge scheduling events. In this work, the random behaviors of EVs are considered, with EV users preferences over multi-CS characteristics modelled to imitate the potential CS selection disequilibrium. A price scheduling strategy under decentralized collaborative framework is proposed to achieve EV shunting in a multi-CS environment, while minimizing the charging cost through multi agent reinforcement learning. The proposed problem is formulated as a Markov Decision Process (MDP) with uncertain transition probability.
comment: Accepted at PSGEC: PSGEC2024
☆ KAOS: Large Model Multi-Agent Operating System
The intelligent interaction model based on large models reduces the differences in user experience across various system platforms but faces challenges in multi-agent collaboration and resource sharing. To demonstrate a uniform user experience across different foundational software platforms and address resource coordination management challenges, this paper proposes a multi-agent operating system based on the open-source Kylin. The research method involves empowering agents with large models to serve applications. First, by introducing management role agents and vertical multi-agent collaboration to construct or replace typical application software. Second, by studying system-level shared resource scheduling strategies to enhance user experience and optimize resource utilization. And finally, by validating the efficiency and superiority of the large model multi-agent operating system through real applications and scoring intelligence. The feasibility of this system is demonstrated, providing a new perspective for the development of multi-agent operating systems. Experimental results show significant advantages of multi-agent collaboration in various application scenarios.
☆ Reconfigurable Intelligent Surface Assisted VEC Based on Multi-Agent Reinforcement Learning
Vehicular edge computing (VEC) is an emerging technology that enables vehicles to perform high-intensity tasks by executing tasks locally or offloading them to nearby edge devices. However, obstacles such as buildings may degrade the communications and incur communication interruptions, and thus the vehicle may not meet the requirement for task offloading. Reconfigurable intelligent surfaces (RIS) is introduced to support vehicle communication and provide an alternative communication path. The system performance can be improved by flexibly adjusting the phase-shift of the RIS. For RIS-assisted VEC system where tasks arrive randomly, we design a control scheme that considers offloading power, local power allocation and phase-shift optimization. To solve this non-convex problem, we propose a new deep reinforcement learning (DRL) framework that employs modified multi-agent deep deterministic policy gradient (MADDPG) approach to optimize the power allocation for vehicle users (VUs) and block coordinate descent (BCD) algorithm to optimize the phase-shift of the RIS. Simulation results show that our proposed scheme outperforms the centralized deep deterministic policy gradient (DDPG) scheme and random scheme.
comment: This paper has been submitted to IEEE Journal. The source code has been released at: https://github.com/qiongwu86/RIS-VEC-MARL.git
☆ The Benefits of Power Regularization in Cooperative Reinforcement Learning
Cooperative Multi-Agent Reinforcement Learning (MARL) algorithms, trained only to optimize task reward, can lead to a concentration of power where the failure or adversarial intent of a single agent could decimate the reward of every agent in the system. In the context of teams of people, it is often useful to explicitly consider how power is distributed to ensure no person becomes a single point of failure. Here, we argue that explicitly regularizing the concentration of power in cooperative RL systems can result in systems which are more robust to single agent failure, adversarial attacks, and incentive changes of co-players. To this end, we define a practical pairwise measure of power that captures the ability of any co-player to influence the ego agent's reward, and then propose a power-regularized objective which balances task reward and power concentration. Given this new objective, we show that there always exists an equilibrium where every agent is playing a power-regularized best-response balancing power and task reward. Moreover, we present two algorithms for training agents towards this power-regularized objective: Sample Based Power Regularization (SBPR), which injects adversarial data during training; and Power Regularization via Intrinsic Motivation (PRIM), which adds an intrinsic motivation to regulate power to the training objective. Our experiments demonstrate that both algorithms successfully balance task reward and power, leading to lower power behavior than the baseline of task-only reward and avoid catastrophic events in case an agent in the system goes off-policy.
☆ Tracking the perspectives of interacting language models
Large language models (LLMs) are capable of producing high quality information at unprecedented rates. As these models continue to entrench themselves in society, the content they produce will become increasingly pervasive in databases that are, in turn, incorporated into the pre-training data, fine-tuning data, retrieval data, etc. of other language models. In this paper we formalize the idea of a communication network of LLMs and introduce a method for representing the perspective of individual models within a collection of LLMs. Given these tools we systematically study information diffusion in the communication network of LLMs in various simulated settings.
♻ ☆ Finite-time convergence to an $ε$-efficient Nash equilibrium in potential games
This paper investigates the convergence time of log-linear learning to an $\epsilon$-efficient Nash equilibrium (NE) in potential games. In such games, an efficient NE is defined as the maximizer of the potential function. Existing results are limited to potential games with stringent structural assumptions and entail exponential convergence times in $1/\epsilon$. Unaddressed so far, we tackle general potential games and prove the first finite-time convergence to an $\epsilon$-efficient NE. In particular, by using a problem-dependent analysis, our bound depends polynomially on $1/\epsilon$. Furthermore, we provide two extensions of our convergence result: first, we show that a variant of log-linear learning that requires a factor $A$ less feedback on the utility per round enjoys a similar convergence time; second, we demonstrate the robustness of our convergence guarantee if log-linear learning is subject to small perturbations such as alterations in the learning rule or noise-corrupted utilities.
comment: 9 main pages, 25 pages, 1 Table
♻ ☆ Game-theoretic Occlusion-Aware Motion Planning: an Efficient Hybrid-Information Approach
We present a novel algorithm for game-theoretic trajectory planning, tailored for settings in which agents can only observe one another in specific regions of the state space. Such problems arise naturally in the context of multi-robot navigation, where occlusions due to environment geometry naturally mask agents' view of one another. In this paper, we formalize these settings as dynamic games with a hybrid information structure, which interleaves so-called "open-loop" periods (in which agents cannot observe one another) with "feedback" periods (with full state observability). We present two main contributions. First, we study a canonical variant of these hybrid information games in which agents' dynamics are linear, and objectives are convex and quadratic. Here, we build upon classical solution methods for the open-loop and feedback variants of these games to derive an algorithm for the hybrid information case that matches the cubic runtime of the classical settings. Second, we consider a far broader class of problems in which agents' dynamics are nonlinear, and objectives are nonquadratic; we reduce these problems to sequences of hybrid information linear-quadratic games and empirically demonstrate that iteratively solving these simpler problems with the proposed algorithm yields reliable convergence to approximate Nash equilibria through simulation studies of overtaking and intersection traffic scenarios.
comment: Key Words: Dynamic Game Theory, Multi-Agent Motion Planning
♻ ☆ Investigating the Impact of Direct Punishment on the Emergence of Cooperation in Multi-Agent Reinforcement Learning Systems
Solving the problem of cooperation is fundamentally important for the creation and maintenance of functional societies. Problems of cooperation are omnipresent within human society, with examples ranging from navigating busy road junctions to negotiating treaties. As the use of AI becomes more pervasive throughout society, the need for socially intelligent agents capable of navigating these complex cooperative dilemmas is becoming increasingly evident. Direct punishment is a ubiquitous social mechanism that has been shown to foster the emergence of cooperation in both humans and non-humans. In the natural world, direct punishment is often strongly coupled with partner selection and reputation and used in conjunction with third-party punishment. The interactions between these mechanisms could potentially enhance the emergence of cooperation within populations. However, no previous work has evaluated the learning dynamics and outcomes emerging from Multi-Agent Reinforcement Learning (MARL) populations that combine these mechanisms. This paper addresses this gap. It presents a comprehensive analysis and evaluation of the behaviors and learning dynamics associated with direct punishment, third-party punishment, partner selection, and reputation. Finally, we discuss the implications of using these mechanisms on the design of cooperative AI systems.
comment: 50 pages, 19 figures
Artificial Intelligence 150
☆ mDPO: Conditional Preference Optimization for Multimodal Large Language Models
Direct preference optimization (DPO) has shown to be an effective method for large language model (LLM) alignment. Recent works have attempted to apply DPO to multimodal scenarios but have found it challenging to achieve consistent improvement. Through a comparative experiment, we identify the unconditional preference problem in multimodal preference optimization, where the model overlooks the image condition. To address this problem, we propose mDPO, a multimodal DPO objective that prevents the over-prioritization of language-only preferences by also optimizing image preference. Moreover, we introduce a reward anchor that forces the reward to be positive for chosen responses, thereby avoiding the decrease in their likelihood -- an intrinsic problem of relative preference optimization. Experiments on two multimodal LLMs of different sizes and three widely used benchmarks demonstrate that mDPO effectively addresses the unconditional preference problem in multimodal preference optimization and significantly improves model performance, particularly in reducing hallucination.
☆ MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs
Generating natural and meaningful responses to communicate with multi-modal human inputs is a fundamental capability of Large Vision-Language Models(LVLMs). While current open-source LVLMs demonstrate promising performance in simplified scenarios such as single-turn single-image input, they fall short in real-world conversation scenarios such as following instructions in a long context history with multi-turn and multi-images. Existing LVLM benchmarks primarily focus on single-choice questions or short-form responses, which do not adequately assess the capabilities of LVLMs in real-world human-AI interaction applications. Therefore, we introduce MMDU, a comprehensive benchmark, and MMDU-45k, a large-scale instruction tuning dataset, designed to evaluate and improve LVLMs' abilities in multi-turn and multi-image conversations. We employ the clustering algorithm to ffnd the relevant images and textual descriptions from the open-source Wikipedia and construct the question-answer pairs by human annotators with the assistance of the GPT-4o model. MMDU has a maximum of 18k image+text tokens, 20 images, and 27 turns, which is at least 5x longer than previous benchmarks and poses challenges to current LVLMs. Our in-depth analysis of 15 representative LVLMs using MMDU reveals that open-source LVLMs lag behind closed-source counterparts due to limited conversational instruction tuning data. We demonstrate that ffne-tuning open-source LVLMs on MMDU-45k signiffcantly address this gap, generating longer and more accurate conversations, and improving scores on MMDU and existing benchmarks (MMStar: +1.1%, MathVista: +1.5%, ChartQA:+1.2%). Our contributions pave the way for bridging the gap between current LVLM models and real-world application demands. This project is available at https://github.com/Liuziyu77/MMDU.
comment: This project is available at https://github.com/Liuziyu77/MMDU
☆ Language Modeling with Editable External Knowledge
When the world changes, so does the text that humans write about it. How do we build language models that can be easily updated to reflect these changes? One popular approach is retrieval-augmented generation, in which new documents are inserted into a knowledge base and retrieved during prediction for downstream tasks. Most prior work on these systems have focused on improving behavior during prediction through better retrieval or reasoning. This paper introduces ERASE, which instead improves model behavior when new documents are acquired, by incrementally deleting or rewriting other entries in the knowledge base each time a document is added. In two new benchmark datasets evaluating models' ability to answer questions about a stream of news articles or conversations, ERASE improves accuracy relative to conventional retrieval-augmented generation by 7-13% (Mixtral-8x7B) and 6-10% (Llama-3-8B) absolute. Code and data are available at https://github.com/belindal/ERASE
☆ WPO: Enhancing RLHF with Weighted Preference Optimization
Reinforcement learning from human feedback (RLHF) is a promising solution to align large language models (LLMs) more closely with human values. Off-policy preference optimization, where the preference data is obtained from other models, is widely adopted due to its cost efficiency and scalability. However, off-policy preference optimization often suffers from a distributional gap between the policy used for data collection and the target policy, leading to suboptimal optimization. In this paper, we propose a novel strategy to mitigate this problem by simulating on-policy learning with off-policy preference data. Our Weighted Preference Optimization (WPO) method adapts off-policy data to resemble on-policy data more closely by reweighting preference pairs according to their probability under the current policy. This method not only addresses the distributional gap problem but also enhances the optimization process without incurring additional costs. We validate our method on instruction following benchmarks including Alpaca Eval 2 and MT-bench. WPO not only outperforms Direct Preference Optimization (DPO) by up to 5.6% on Alpaca Eval 2 but also establishes a remarkable length-controlled winning rate against GPT-4-turbo of 48.6% based on Llama-3-8B-Instruct, making it the strongest 8B model on the leaderboard. We will release the code and models at https://github.com/wzhouad/WPO.
☆ Embodied Instruction Following in Unknown Environments
Enabling embodied agents to complete complex human instructions from natural language is crucial to autonomous systems in household services. Conventional methods can only accomplish human instructions in the known environment where all interactive objects are provided to the embodied agent, and directly deploying the existing approaches for the unknown environment usually generates infeasible plans that manipulate non-existing objects. On the contrary, we propose an embodied instruction following (EIF) method for complex tasks in the unknown environment, where the agent efficiently explores the unknown environment to generate feasible plans with existing objects to accomplish abstract instructions. Specifically, we build a hierarchical embodied instruction following framework including the high-level task planner and the low-level exploration controller with multimodal large language models. We then construct a semantic representation map of the scene with dynamic region attention to demonstrate the known visual clues, where the goal of task planning and scene exploration is aligned for human instruction. For the task planner, we generate the feasible step-by-step plans for human goal accomplishment according to the task completion process and the known visual clues. For the exploration controller, the optimal navigation or object interaction policy is predicted based on the generated step-wise plans and the known visual clues. The experimental results demonstrate that our method can achieve 45.09% success rate in 204 complex human instructions such as making breakfast and tidying rooms in large house-level scenes.
comment: Project Page: https://gary3410.github.io/eif_unknown/
☆ Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level
Direct Preference Optimization (DPO), a standard method for aligning language models with human preferences, is traditionally applied to offline preferences. Recent studies show that DPO benefits from iterative training with online preferences labeled by a trained reward model. In this work, we identify a pitfall of vanilla iterative DPO - improved response quality can lead to increased verbosity. To address this, we introduce iterative length-regularized DPO (iLR-DPO) to penalize response length. Our empirical results show that iLR-DPO can enhance a 7B model to perform on par with GPT-4 without increasing verbosity. Specifically, our 7B model achieves a $50.5\%$ length-controlled win rate against $\texttt{GPT-4 Preview}$ on AlpacaEval 2.0, and excels across standard benchmarks including MT-Bench, Arena-Hard and OpenLLM Leaderboard. These results demonstrate the effectiveness of iterative DPO in aligning language models with human feedback.
☆ RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content
Large Language Models (LLMs) are trained on vast amounts of data, most of which is automatically scraped from the internet. This data includes encyclopedic documents that harbor a vast amount of general knowledge (e.g., Wikipedia) but also potentially overlap with benchmark datasets used for evaluating LLMs. Consequently, evaluating models on test splits that might have leaked into the training set is prone to misleading conclusions. To foster sound evaluation of language models, we introduce a new test dataset named RepLiQA, suited for question-answering and topic retrieval tasks. RepLiQA is a collection of five splits of test sets, four of which have not been released to the internet or exposed to LLM APIs prior to this publication. Each sample in RepLiQA comprises (1) a reference document crafted by a human annotator and depicting an imaginary scenario (e.g., a news article) absent from the internet; (2) a question about the document's topic; (3) a ground-truth answer derived directly from the information in the document; and (4) the paragraph extracted from the reference document containing the answer. As such, accurate answers can only be generated if a model can find relevant content within the provided document. We run a large-scale benchmark comprising several state-of-the-art LLMs to uncover differences in performance across models of various types and sizes in a context-conditional language modeling setting. Released splits of RepLiQA can be found here: https://huggingface.co/datasets/ServiceNow/repliqa.
☆ A Brief Survey on Leveraging Large Scale Vision Models for Enhanced Robot Grasping
Robotic grasping presents a difficult motor task in real-world scenarios, constituting a major hurdle to the deployment of capable robots across various industries. Notably, the scarcity of data makes grasping particularly challenging for learned models. Recent advancements in computer vision have witnessed a growth of successful unsupervised training mechanisms predicated on massive amounts of data sourced from the Internet, and now nearly all prominent models leverage pretrained backbone networks. Against this backdrop, we begin to investigate the potential benefits of large-scale visual pretraining in enhancing robot grasping performance. This preliminary literature review sheds light on critical challenges and delineates prospective directions for future research in visual pretraining for robotic manipulation.
comment: This report was written in February 2023, thus does not account for any works since then
☆ CELL your Model: Contrastive Explanation Methods for Large Language Models
The advent of black-box deep neural network classification models has sparked the need to explain their decisions. However, in the case of generative AI such as large language models (LLMs), there is no class prediction to explain. Rather, one can ask why an LLM output a particular response to a given prompt. In this paper, we answer this question by proposing, to the best of our knowledge, the first contrastive explanation methods requiring simply black-box/query access. Our explanations suggest that an LLM outputs a reply to a given prompt because if the prompt was slightly modified, the LLM would have given a different response that is either less preferable or contradicts the original response. The key insight is that contrastive explanations simply require a distance function that has meaning to the user and not necessarily a real valued representation of a specific response (viz. class label). We offer two algorithms for finding contrastive explanations: i) A myopic algorithm, which although effective in creating contrasts, requires many model calls and ii) A budgeted algorithm, our main algorithmic contribution, which intelligently creates contrasts adhering to a query budget, necessary for longer contexts. We show the efficacy of these methods on diverse natural language tasks such as open-text generation, automated red teaming, and explaining conversational degradation.
☆ MDCR: A Dataset for Multi-Document Conditional Reasoning
The same real-life questions posed to different individuals may lead to different answers based on their unique situations. For instance, whether a student is eligible for a scholarship depends on eligibility conditions, such as major or degree required. ConditionalQA was proposed to evaluate models' capability of reading a document and answering eligibility questions, considering unmentioned conditions. However, it is limited to questions on single documents, neglecting harder cases that may require cross-document reasoning and optimization, for example, "What is the maximum number of scholarships attainable?" Such questions over multiple documents are not only more challenging due to more context having to understand, but also because the model has to (1) explore all possible combinations of unmentioned conditions and (2) understand the relationship between conditions across documents, to reason about the optimal outcome. To evaluate models' capability of answering such questions, we propose a new dataset MDCR, which can reflect real-world challenges and serve as a new test bed for complex conditional reasoning that requires optimization. We evaluate this dataset using the most recent LLMs and demonstrate their limitations in solving this task. We believe this dataset will facilitate future research in answering optimization questions with unknown conditions.
☆ Split, Unlearn, Merge: Leveraging Data Attributes for More Effective Unlearning in LLMs
Large language models (LLMs) have shown to pose social and ethical risks such as generating toxic language or facilitating malicious use of hazardous knowledge. Machine unlearning is a promising approach to improve LLM safety by directly removing harmful behaviors and knowledge. In this paper, we propose "SPlit, UNlearn, MerGE" (SPUNGE), a framework that can be used with any unlearning method to amplify its effectiveness. SPUNGE leverages data attributes during unlearning by splitting unlearning data into subsets based on specific attribute values, unlearning each subset separately, and merging the unlearned models. We empirically demonstrate that SPUNGE significantly improves the performance of two recent unlearning methods on state-of-the-art LLMs while maintaining their general capabilities on standard academic benchmarks.
☆ Task Me Anything
Benchmarks for large multimodal language models (MLMs) now serve to simultaneously assess the general capabilities of models instead of evaluating for a specific capability. As a result, when a developer wants to identify which models to use for their application, they are overwhelmed by the number of benchmarks and remain uncertain about which benchmark's results are most reflective of their specific use case. This paper introduces Task-Me-Anything, a benchmark generation engine which produces a benchmark tailored to a user's needs. Task-Me-Anything maintains an extendable taxonomy of visual assets and can programmatically generate a vast number of task instances. Additionally, it algorithmically addresses user queries regarding MLM performance efficiently within a computational budget. It contains 113K images, 10K videos, 2K 3D object assets, over 365 object categories, 655 attributes, and 335 relationships. It can generate 750M image/video question-answering pairs, which focus on evaluating MLM perceptual capabilities. Task-Me-Anything reveals critical insights: open-source MLMs excel in object and attribute recognition but lack spatial and temporal understanding; each model exhibits unique strengths and weaknesses; larger models generally perform better, though exceptions exist; and GPT4o demonstrates challenges in recognizing rotating/moving objects and distinguishing colors.
comment: website: https://www.task-me-anything.org
☆ Deep Learning methodology for the identification of wood species using high-resolution macroscopic images
Significant advancements in the field of wood species identification are needed worldwide to support sustainable timber trade. In this work we contribute to automate the identification of wood species via high-resolution macroscopic images of timber. The main challenge of this problem is that fine-grained patterns in timber are crucial in order to accurately identify wood species, and these patterns are not properly learned by traditional convolutional neural networks (CNNs) trained on low/medium resolution images. We propose a Timber Deep Learning Identification with Patch-based Inference Voting methodology, abbreviated TDLI-PIV methodology. Our proposal exploits the concept of patching and the availability of high-resolution macroscopic images of timber in order to overcome the inherent challenges that CNNs face in timber identification. The TDLI-PIV methodology is able to capture fine-grained patterns in timber and, moreover, boosts robustness and prediction accuracy via a collaborative voting inference process. In this work we also introduce a new data set of marcroscopic images of timber, called GOIMAI-Phase-I, which has been obtained using optical magnification in order to capture fine-grained details, which contrasts to the other datasets that are publicly available. More concretely, images in GOIMAI-Phase-I are taken with a smartphone with a 24x magnifying lens attached to the camera. Our data set contains 2120 images of timber and covers 37 legally protected wood species. Our experiments have assessed the performance of the TDLI-PIV methodology, involving the comparison with other methodologies available in the literature, exploration of data augmentation methods and the effect that the dataset size has on the accuracy of TDLI-PIV.
comment: 17 pages and 6 figures
☆ GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
Perceiving and understanding non-speech sounds and non-verbal speech is essential to making decisions that help us interact with our surroundings. In this paper, we propose GAMA, a novel General-purpose Large Audio-Language Model (LALM) with Advanced Audio Understanding and Complex Reasoning Abilities. We build GAMA by integrating an LLM with multiple types of audio representations, including features from a custom Audio Q-Former, a multi-layer aggregator that aggregates features from multiple layers of an audio encoder. We fine-tune GAMA on a large-scale audio-language dataset, which augments it with audio understanding capabilities. Next, we propose CompA-R (Instruction-Tuning for Complex Audio Reasoning), a synthetically generated instruction-tuning (IT) dataset with instructions that require the model to perform complex reasoning on the input audio. We instruction-tune GAMA with CompA-R to endow it with complex reasoning abilities, where we further add a soft prompt as input with high-level semantic evidence by leveraging event tags of the input audio. Finally, we also propose CompA-R-test, a human-labeled evaluation dataset for evaluating the capabilities of LALMs on open-ended audio question-answering that requires complex reasoning. Through automated and expert human evaluations, we show that GAMA outperforms all other LALMs in literature on diverse audio understanding tasks by margins of 1%-84%. Further, GAMA IT-ed on CompA-R proves to be superior in its complex reasoning and instruction following capabilities.
comment: Project Website: https://sreyan88.github.io/gamaaudio/
☆ STAR: SocioTechnical Approach to Red Teaming Language Models
This research introduces STAR, a sociotechnical framework that improves on current best practices for red teaming safety of large language models. STAR makes two key contributions: it enhances steerability by generating parameterised instructions for human red teamers, leading to improved coverage of the risk surface. Parameterised instructions also provide more detailed insights into model failures at no increased cost. Second, STAR improves signal quality by matching demographics to assess harms for specific groups, resulting in more sensitive annotations. STAR further employs a novel step of arbitration to leverage diverse viewpoints and improve label reliability, treating disagreement not as noise but as a valuable contribution to signal quality.
comment: 8 pages, 5 figures, 5 pages appendix. * denotes equal contribution
☆ DustNet: skillful neural network predictions of Saharan dust
Suspended in the atmosphere are millions of tonnes of mineral dust which interacts with weather and climate. Accurate representation of mineral dust in weather models is vital, yet remains challenging. Large scale weather models use high power supercomputers and take hours to complete the forecast. Such computational burden allows them to only include monthly climatological means of mineral dust as input states inhibiting their forecasting accuracy. Here, we introduce DustNet a simple, accurate and super fast forecasting model for 24-hours ahead predictions of aerosol optical depth AOD. DustNet trains in less than 8 minutes and creates predictions in 2 seconds on a desktop computer. Created by DustNet predictions outperform the state-of-the-art physics-based model on coarse 1 x 1 degree resolution at 95% of grid locations when compared to ground truth satellite data. Our results show DustNet has a potential for fast and accurate AOD forecasting which could transform our understanding of dust impacts on weather patterns.
comment: 34 pages, 9 figures, uses 2D CNN
☆ Transcendence: Generative Models Can Outperform The Experts That Train Them
Generative models are trained with the simple objective of imitating the conditional probability distribution induced by the data they are trained on. Therefore, when trained on data generated by humans, we may not expect the artificial model to outperform the humans on their original objectives. In this work, we study the phenomenon of transcendence: when a generative model achieves capabilities that surpass the abilities of the experts generating its data. We demonstrate transcendence by training an autoregressive transformer to play chess from game transcripts, and show that the trained model can sometimes achieve better performance than all players in the dataset. We theoretically prove that transcendence is enabled by low-temperature sampling, and rigorously assess this experimentally. Finally, we discuss other sources of transcendence, laying the groundwork for future investigation of this phenomenon in a broader setting.
comment: Code, models, and data at https://transcendence.eddie.win
☆ Imagination Policy: Using Generative Point Cloud Models for Learning Manipulation Policies
Humans can imagine goal states during planning and perform actions to match those goals. In this work, we propose Imagination Policy, a novel multi-task key-frame policy network for solving high-precision pick and place tasks. Instead of learning actions directly, Imagination Policy generates point clouds to imagine desired states which are then translated to actions using rigid action estimation. This transforms action inference into a local generative task. We leverage pick and place symmetries underlying the tasks in the generation process and achieve extremely high sample efficiency and generalizability to unseen configurations. Finally, we demonstrate state-of-the-art performance across various tasks on the RLbench benchmark compared with several strong baselines.
☆ Interactive Evolution: A Neural-Symbolic Self-Training Framework For Large Language Models
One of the primary driving forces contributing to the superior performance of Large Language Models (LLMs) is the extensive availability of human-annotated natural language data, which is used for alignment fine-tuning. This inspired researchers to investigate self-training methods to mitigate the extensive reliance on human annotations. However, the current success of self-training has been primarily observed in natural language scenarios, rather than in the increasingly important neural-symbolic scenarios. To this end, we propose an environment-guided neural-symbolic self-training framework named ENVISIONS. It aims to overcome two main challenges: (1) the scarcity of symbolic data, and (2) the limited proficiency of LLMs in processing symbolic language. Extensive evaluations conducted on three distinct domains demonstrate the effectiveness of our approach. Additionally, we have conducted a comprehensive analysis to uncover the factors contributing to ENVISIONS's success, thereby offering valuable insights for future research in this area. Code will be available at \url{https://github.com/xufangzhi/ENVISIONS}.
comment: 18 pages, 6 figures
☆ Zero-Shot Generalization during Instruction Tuning: Insights from Similarity and Granularity
Understanding alignment techniques begins with comprehending zero-shot generalization brought by instruction tuning, but little of the mechanism has been understood. Existing work has largely been confined to the task level, without considering that tasks are artificially defined and, to LLMs, merely consist of tokens and representations. This line of research has been limited to examining transfer between tasks from a task-pair perspective, with few studies focusing on understanding zero-shot generalization from the perspective of the data itself. To bridge this gap, we first demonstrate through multiple metrics that zero-shot generalization during instruction tuning happens very early. Next, we investigate the facilitation of zero-shot generalization from both data similarity and granularity perspectives, confirming that encountering highly similar and fine-grained training data earlier during instruction tuning, without the constraints of defined "tasks", enables better generalization. Finally, we propose a more grounded training data arrangement method, Test-centric Multi-turn Arrangement, and show its effectiveness in promoting continual learning and further loss reduction. For the first time, we show that zero-shot generalization during instruction tuning is a form of similarity-based generalization between training and test data at the instance level. We hope our analysis will advance the understanding of zero-shot generalization during instruction tuning and contribute to the development of more aligned LLMs. Our code is released at https://github.com/HBX-hbx/dynamics_of_zero-shot_generalization.
comment: 33 pages, 14 figures
☆ Refusal in Language Models Is Mediated by a Single Direction
Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones. While this refusal behavior is widespread across chat models, its underlying mechanisms remain poorly understood. In this work, we show that refusal is mediated by a one-dimensional subspace, across 13 popular open-source chat models up to 72B parameters in size. Specifically, for each model, we find a single direction such that erasing this direction from the model's residual stream activations prevents it from refusing harmful instructions, while adding this direction elicits refusal on even harmless instructions. Leveraging this insight, we propose a novel white-box jailbreak method that surgically disables refusal with minimal effect on other capabilities. Finally, we mechanistically analyze how adversarial suffixes suppress propagation of the refusal-mediating direction. Our findings underscore the brittleness of current safety fine-tuning methods. More broadly, our work showcases how an understanding of model internals can be leveraged to develop practical methods for controlling model behavior.
☆ Nemotron-4 340B Technical Report
We release the Nemotron-4 340B model family, including Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward. Our models are open access under the NVIDIA Open Model License Agreement, a permissive model license that allows distribution, modification, and use of the models and its outputs. These models perform competitively to open access models on a wide range of evaluation benchmarks, and were sized to fit on a single DGX H100 with 8 GPUs when deployed in FP8 precision. We believe that the community can benefit from these models in various research studies and commercial applications, especially for generating synthetic data to train smaller language models. Notably, over 98% of data used in our model alignment process is synthetically generated, showcasing the effectiveness of these models in generating synthetic data. To further support open research and facilitate model development, we are also open-sourcing the synthetic data generation pipeline used in our model alignment process.
☆ Optimizing Instructions and Demonstrations for Multi-Stage Language Model Programs
Language Model Programs, i.e. sophisticated pipelines of modular language model (LM) calls, are increasingly advancing NLP tasks, but they require crafting prompts that are jointly effective for all modules. We study prompt optimization for LM programs, i.e. how to update these prompts to maximize a downstream metric without access to module-level labels or gradients. To make this tractable, we factorize our problem into optimizing the free-form instructions and few-shot demonstrations of every module and introduce several strategies to craft task-grounded instructions and navigate credit assignment across modules. Our strategies include (i) program- and data-aware techniques for proposing effective instructions, (ii) a stochastic mini-batch evaluation function for learning a surrogate model of our objective, and (iii) a meta-optimization procedure in which we refine how LMs construct proposals over time. Using these insights we develop MIPRO, a novel optimizer that outperforms baselines on five of six diverse LM programs using a best-in-class open-source model (Llama-3-8B), by as high as 12.9% accuracy. We will release our new optimizers and benchmark in DSPy at https://github.com/stanfordnlp/dspy
comment: Krista and Michael contributed equally to this work
☆ The Role of Inherent Bellman Error in Offline Reinforcement Learning with Linear Function Approximation
In this paper, we study the offline RL problem with linear function approximation. Our main structural assumption is that the MDP has low inherent Bellman error, which stipulates that linear value functions have linear Bellman backups with respect to the greedy policy. This assumption is natural in that it is essentially the minimal assumption required for value iteration to succeed. We give a computationally efficient algorithm which succeeds under a single-policy coverage condition on the dataset, namely which outputs a policy whose value is at least that of any policy which is well-covered by the dataset. Even in the setting when the inherent Bellman error is 0 (termed linear Bellman completeness), our algorithm yields the first known guarantee under single-policy coverage. In the setting of positive inherent Bellman error ${\varepsilon_{\mathrm{BE}}} > 0$, we show that the suboptimality error of our algorithm scales with $\sqrt{\varepsilon_{\mathrm{BE}}}$. Furthermore, we prove that the scaling of the suboptimality with $\sqrt{\varepsilon_{\mathrm{BE}}}$ cannot be improved for any algorithm. Our lower bound stands in contrast to many other settings in reinforcement learning with misspecification, where one can typically obtain performance that degrades linearly with the misspecification error.
☆ Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack
Large language models (LLMs) have been increasingly applied to various domains, which triggers increasing concerns about LLMs' safety on specialized domains, e.g. medicine. However, testing the domain-specific safety of LLMs is challenging due to the lack of domain knowledge-driven attacks in existing benchmarks. To bridge this gap, we propose a new task, knowledge-to-jailbreak, which aims to generate jailbreaks from domain knowledge to evaluate the safety of LLMs when applied to those domains. We collect a large-scale dataset with 12,974 knowledge-jailbreak pairs and fine-tune a large language model as jailbreak-generator, to produce domain knowledge-specific jailbreaks. Experiments on 13 domains and 8 target LLMs demonstrate the effectiveness of jailbreak-generator in generating jailbreaks that are both relevant to the given knowledge and harmful to the target LLMs. We also apply our method to an out-of-domain knowledge base, showing that jailbreak-generator can generate jailbreaks that are comparable in harmfulness to those crafted by human experts. Data and code: https://github.com/THU-KEG/Knowledge-to-Jailbreak/.
comment: 18 pages, 14 figures, 11 tables
☆ R-Eval: A Unified Toolkit for Evaluating Domain Knowledge of Retrieval Augmented Large Language Models KDD2024
Large language models have achieved remarkable success on general NLP tasks, but they may fall short for domain-specific problems. Recently, various Retrieval-Augmented Large Language Models (RALLMs) are proposed to address this shortcoming. However, existing evaluation tools only provide a few baselines and evaluate them on various domains without mining the depth of domain knowledge. In this paper, we address the challenges of evaluating RALLMs by introducing the R-Eval toolkit, a Python toolkit designed to streamline the evaluation of different RAG workflows in conjunction with LLMs. Our toolkit, which supports popular built-in RAG workflows and allows for the incorporation of customized testing data on the specific domain, is designed to be user-friendly, modular, and extensible. We conduct an evaluation of 21 RALLMs across three task levels and two representative domains, revealing significant variations in the effectiveness of RALLMs across different tasks and domains. Our analysis emphasizes the importance of considering both task and domain requirements when choosing a RAG workflow and LLM combination. We are committed to continuously maintaining our platform at https://github.com/THU-KEG/R-Eval to facilitate both the industry and the researchers.
comment: 12 pages, 9 figures, Accepted by KDD2024
☆ BLoB: Bayesian Low-Rank Adaptation by Backpropagation for Large Language Models
Large Language Models (LLMs) often suffer from overconfidence during inference, particularly when adapted to downstream domain-specific tasks with limited data. Previous work addresses this issue by employing approximate Bayesian estimation after the LLMs are trained, enabling them to quantify uncertainty. However, such post-training approaches' performance is severely limited by the parameters learned during training. In this paper, we go beyond post-training Bayesianization and propose Bayesian Low-Rank Adaptation by Backpropagation (BLoB), an algorithm that continuously and jointly adjusts both the mean and covariance of LLM parameters throughout the whole fine-tuning process. Our empirical results verify the effectiveness of BLoB in terms of generalization and uncertainty estimation, when evaluated on both in-distribution and out-of-distribution data.
comment: 27 pages, 3 figures, 9 tables; preprint, work in progress
☆ Benchmarking of LLM Detection: Comparing Two Competing Approaches
This article gives an overview of the field of LLM text recognition. Different approaches and implemented detectors for the recognition of LLM-generated text are presented. In addition to discussing the implementations, the article focuses on benchmarking the detectors. Although there are numerous software products for the recognition of LLM-generated text, with a focus on ChatGPT-like LLMs, the quality of the recognition (recognition rate) is not clear. Furthermore, while it can be seen that scientific contributions presenting their novel approaches strive for some kind of comparison with other approaches, the construction and independence of the evaluation dataset is often not comprehensible. As a result, discrepancies in the performance evaluation of LLM detectors are often visible due to the different benchmarking datasets. This article describes the creation of an evaluation dataset and uses this dataset to investigate the different detectors. The selected detectors are benchmarked against each other.
☆ Is Efficient PAC Learning Possible with an Oracle That Responds 'Yes' or 'No'?
The empirical risk minimization (ERM) principle has been highly impactful in machine learning, leading both to near-optimal theoretical guarantees for ERM-based learning algorithms as well as driving many of the recent empirical successes in deep learning. In this paper, we investigate the question of whether the ability to perform ERM, which computes a hypothesis minimizing empirical risk on a given dataset, is necessary for efficient learning: in particular, is there a weaker oracle than ERM which can nevertheless enable learnability? We answer this question affirmatively, showing that in the realizable setting of PAC learning for binary classification, a concept class can be learned using an oracle which only returns a single bit indicating whether a given dataset is realizable by some concept in the class. The sample complexity and oracle complexity of our algorithm depend polynomially on the VC dimension of the hypothesis class, thus showing that there is only a polynomial price to pay for use of our weaker oracle. Our results extend to the agnostic learning setting with a slight strengthening of the oracle, as well as to the partial concept, multiclass and real-valued learning settings. In the setting of partial concept classes, prior to our work no oracle-efficient algorithms were known, even with a standard ERM oracle. Thus, our results address a question of Alon et al. (2021) who asked whether there are algorithmic principles which enable efficient learnability in this setting.
☆ See It from My Perspective: Diagnosing the Western Cultural Bias of Large Vision-Language Models in Image Understanding
Vision-language models (VLMs) can respond to queries about images in many languages. However, beyond language, culture affects how we see things. For example, individuals from Western cultures focus more on the central figure in an image while individuals from Eastern cultures attend more to scene context. In this work, we present a novel investigation that demonstrates and localizes VLMs' Western bias in image understanding. We evaluate large VLMs across subjective and objective visual tasks with culturally diverse images and annotations. We find that VLMs perform better on the Western subset than the Eastern subset of each task. Controlled experimentation tracing the source of this bias highlights the importance of a diverse language mix in text-only pre-training for building equitable VLMs, even when inference is performed in English. Moreover, while prompting in the language of a target culture can lead to reductions in bias, it is not a substitute for building AI more representative of the world's languages.
comment: 17 pages, 7 figures. Code/models: https://github.com/amith-ananthram/see-it-from-my-perspective
☆ YOLO-FEDER FusionNet: A Novel Deep Learning Architecture for Drone Detection ICIP
Predominant methods for image-based drone detection frequently rely on employing generic object detection algorithms like YOLOv5. While proficient in identifying drones against homogeneous backgrounds, these algorithms often struggle in complex, highly textured environments. In such scenarios, drones seamlessly integrate into the background, creating camouflage effects that adversely affect the detection quality. To address this issue, we introduce a novel deep learning architecture called YOLO-FEDER FusionNet. Unlike conventional approaches, YOLO-FEDER FusionNet combines generic object detection methods with the specialized strength of camouflage object detection techniques to enhance drone detection capabilities. Comprehensive evaluations of YOLO-FEDER FusionNet show the efficiency of the proposed model and demonstrate substantial improvements in both reducing missed detections and false alarms.
comment: 7 pages, 4 figures, 6 tables, to be published in the conference proceedings of the 2024 IEEE International Conference on Image Processing (ICIP)
☆ Linear Bellman Completeness Suffices for Efficient Online Reinforcement Learning with Few Actions
One of the most natural approaches to reinforcement learning (RL) with function approximation is value iteration, which inductively generates approximations to the optimal value function by solving a sequence of regression problems. To ensure the success of value iteration, it is typically assumed that Bellman completeness holds, which ensures that these regression problems are well-specified. We study the problem of learning an optimal policy under Bellman completeness in the online model of RL with linear function approximation. In the linear setting, while statistically efficient algorithms are known under Bellman completeness (e.g., Jiang et al. (2017); Zanette et al. (2020)), these algorithms all rely on the principle of global optimism which requires solving a nonconvex optimization problem. In particular, it has remained open as to whether computationally efficient algorithms exist. In this paper we give the first polynomial-time algorithm for RL under linear Bellman completeness when the number of actions is any constant.
☆ MASAI: Modular Architecture for Software-engineering AI Agents
A common method to solve complex problems in software engineering, is to divide the problem into multiple sub-problems. Inspired by this, we propose a Modular Architecture for Software-engineering AI (MASAI) agents, where different LLM-powered sub-agents are instantiated with well-defined objectives and strategies tuned to achieve those objectives. Our modular architecture offers several advantages: (1) employing and tuning different problem-solving strategies across sub-agents, (2) enabling sub-agents to gather information from different sources scattered throughout a repository, and (3) avoiding unnecessarily long trajectories which inflate costs and add extraneous context. MASAI enabled us to achieve the highest performance (28.33% resolution rate) on the popular and highly challenging SWE-bench Lite dataset consisting of 300 GitHub issues from 11 Python repositories. We conduct a comprehensive evaluation of MASAI relative to other agentic methods and analyze the effects of our design decisions and their contribution to the success of MASAI.
☆ The Base-Rate Effect on LLM Benchmark Performance: Disambiguating Test-Taking Strategies from Benchmark Performance
Cloze testing is a common method for measuring the behavior of large language models on a number of benchmark tasks. Using the MMLU dataset, we show that the base-rate probability (BRP) differences across answer tokens are significant and affect task performance ie. guess A if uncertain. We find that counterfactual prompting does sufficiently mitigate the BRP effect. The BRP effect is found to have a similar effect to test taking strategies employed by humans leading to the conflation of task performance and test-taking ability. We propose the Nvr-X-MMLU task, a variation of MMLU, which helps to disambiguate test-taking ability from task performance and reports the latter.
☆ Unveiling the Power of Source: Source-based Minimum Bayes Risk Decoding for Neural Machine Translation
Maximum a posteriori decoding, a commonly used method for neural machine translation (NMT), aims to maximize the estimated posterior probability. However, high estimated probability does not always lead to high translation quality. Minimum Bayes Risk (MBR) decoding offers an alternative by seeking hypotheses with the highest expected utility. In this work, we show that Quality Estimation (QE) reranking, which uses a QE model as a reranker, can be viewed as a variant of MBR. Inspired by this, we propose source-based MBR (sMBR) decoding, a novel approach that utilizes synthetic sources generated by backward translation as ``support hypotheses'' and a reference-free quality estimation metric as the utility function, marking the first work to solely use sources in MBR decoding. Experiments show that sMBR significantly outperforms QE reranking and is competitive with standard MBR decoding. Furthermore, sMBR calls the utility function fewer times compared to MBR. Our findings suggest that sMBR is a promising approach for high-quality NMT decoding.
☆ Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces
The task of "unlearning" certain concepts in large language models (LLMs) has attracted immense attention recently, due to its importance for mitigating undesirable model behaviours, such as the generation of harmful, private, or incorrect information. Current protocols to evaluate unlearning methods largely rely on behavioral tests, without monitoring the presence of unlearned knowledge within the model's parameters. This residual knowledge can be adversarially exploited to recover the erased information post-unlearning. We argue that unlearning should also be evaluated internally, by considering changes in the parametric knowledge traces of the unlearned concepts. To this end, we propose a general methodology for eliciting directions in the parameter space (termed "concept vectors") that encode concrete concepts, and construct ConceptVectors, a benchmark dataset containing hundreds of common concepts and their parametric knowledge traces within two open-source LLMs. Evaluation on ConceptVectors shows that existing unlearning methods minimally impact concept vectors, while directly ablating these vectors demonstrably removes the associated knowledge from the LLMs and significantly reduces their susceptibility to adversarial manipulation. Our results highlight limitations in behavioral-based unlearning evaluations and call for future work to include parametric-based evaluations. To support this, we release our code and benchmark at https://github.com/yihuaihong/ConceptVectors.
☆ Long Code Arena: a Set of Benchmarks for Long-Context Code Models
Nowadays, the fields of code and natural language processing are evolving rapidly. In particular, models become better at processing long context windows - supported context sizes have increased by orders of magnitude over the last few years. However, there is a shortage of benchmarks for code processing that go beyond a single file of context, while the most popular ones are limited to a single method. With this work, we aim to close this gap by introducing Long Code Arena, a suite of six benchmarks for code processing tasks that require project-wide context. These tasks cover different aspects of code processing: library-based code generation, CI builds repair, project-level code completion, commit message generation, bug localization, and module summarization. For each task, we provide a manually verified dataset for testing, an evaluation suite, and open-source baseline solutions based on popular LLMs to showcase the usage of the dataset and to simplify adoption by other researchers. We publish the benchmark page on HuggingFace Spaces with the leaderboard, links to HuggingFace Hub for all the datasets, and link to the GitHub repository with baselines: https://huggingface.co/spaces/JetBrains-Research/long-code-arena.
comment: 54 pages, 4 figures, 22 tables
☆ CoSQA+: Enhancing Code Search Dataset with Matching Code
Semantic code search, retrieving code that matches a given natural language query, is an important task to improve productivity in software engineering. Existing code search datasets are problematic: either using unrealistic queries, or with mismatched codes, and typically using one-to-one query-code pairing, which fails to reflect the reality that a query might have multiple valid code matches. This paper introduces CoSQA+, pairing high-quality queries (reused from CoSQA) with multiple suitable codes. We collect code candidates from diverse sources and form candidate pairs by pairing queries with these codes. Utilizing the power of large language models (LLMs), we automate pair annotation, filtering, and code generation for queries without suitable matches. Through extensive experiments, CoSQA+ has demonstrated superior quality over CoSQA. Models trained on CoSQA+ exhibit improved performance. Furthermore, we propose a new metric Mean Multi-choice Reciprocal Rank (MMRR), to assess one-to-N code search performance. We provide the code and data at https://github.com/DeepSoftwareAnalytics/CoSQA_Plus.
comment: 11 pages, 4 figures, conference
☆ Quaternion Generative Adversarial Neural Networks and Applications to Color Image Inpainting
Color image inpainting is a challenging task in imaging science. The existing method is based on real operation, and the red, green and blue channels of the color image are processed separately, ignoring the correlation between each channel. In order to make full use of the correlation between each channel, this paper proposes a Quaternion Generative Adversarial Neural Network (QGAN) model and related theory, and applies it to solve the problem of color image inpainting with large area missing. Firstly, the definition of quaternion deconvolution is given and the quaternion batch normalization is proposed. Secondly, the above two innovative modules are applied to generate adversarial networks to improve stability. Finally, QGAN is applied to color image inpainting and compared with other state-of-the-art algorithms. The experimental results show that QGAN has superiority in color image inpainting with large area missing.
comment: 19 pages, 6 figures
☆ Intersymbolic AI: Interlinking Symbolic AI and Subsymbolic AI
This perspective piece calls for the study of the new field of Intersymbolic AI, by which we mean the combination of symbolic AI, whose building blocks have inherent significance/meaning, with subsymbolic AI, whose entirety creates significance/effect despite the fact that individual building blocks escape meaning. Canonical kinds of symbolic AI are logic, games and planning. Canonical kinds of subsymbolic AI are (un)supervised machine and reinforcement learning. Intersymbolic AI interlinks the worlds of symbolic AI with its compositional symbolic significance and meaning and of subsymbolic AI with its summative significance or effect to enable culminations of insights from both worlds by going between and across symbolic AI insights with subsymbolic AI techniques that are being helped by symbolic AI principles. For example, Intersymbolic AI may start with symbolic AI to understand a dynamic system, continue with subsymbolic AI to learn its control, and end with symbolic AI to safely use the outcome of the learned subsymbolic AI controller in the dynamic system. Intersymbolic AI combines both symbolic and subsymbolic AI to increase the effectiveness of AI compared to either kind of AI alone, in much the same way that the combination of both conscious and subconscious thought increases the effectiveness of human thought compared to either kind of thought alone. Some successful contributions to the Intersymbolic AI paradigm are surveyed here but many more are considered possible by advancing Intersymbolic AI.
☆ Input Conditioned Graph Generation for Language Agents
Recent progress in Large Language Models (LLMs) and language agents has demonstrated significant promise for various future applications across multiple disciplines. While traditional approaches to language agents often rely on fixed, handcrafted designs, our research aims to develop both learnable and dynamic agents. Our method uses an existing framework that abstracts language agents as graphs. Within this graph framework, we aim to learn a model that can generate edges for every given input to the language agent. This allows us to generate edges that represent the flow of communication within the graph based on the given input, thereby adjusting the internal communication of a language agent. We learn to generate these edges using a pretrained LLM that is fine-tuned with reinforcement learning. This LLM can be fine-tuned on several datasets simultaneously, and we hypothesize that the model learns to adapt to these different domains during training, achieving good overall performance when encountering data from different domains during deployment. We demonstrate that our approach surpasses the previous static approach by nearly 6% accuracy on a combined dataset of MMLU and CMMLU, and by more than 10% when trained with a sparsity-inducing loss. It also performs superior in additional experiments conducted with the MMLU and Mini Crossword Puzzles datasets. The code is available at https://github.com/lukasVierling/DynamicGPTSwarm.
☆ AIC MLLM: Autonomous Interactive Correction MLLM for Robust Robotic Manipulation
The ability to reflect on and correct failures is crucial for robotic systems to interact stably with real-life objects.Observing the generalization and reasoning capabilities of Multimodal Large Language Models (MLLMs), previous approaches have aimed to utilize these models to enhance robotic systems accordingly.However, these methods typically focus on high-level planning corrections using an additional MLLM, with limited utilization of failed samples to correct low-level contact poses. To address this gap, we propose an Autonomous Interactive Correction (AIC) MLLM, which makes use of previous low-level interaction experiences to correct SE(3) pose predictions. Specifically, AIC MLLM is initially fine-tuned to acquire both pose prediction and feedback prompt comprehension abilities.We carefully design two types of prompt instructions through interactions with objects: 1) visual masks to highlight unmovable parts for position correction, and 2)textual descriptions to indicate potential directions for rotation correction.During inference, a Feedback Information Extraction module is introduced to recognize the failure cause, allowing AIC MLLM to adaptively correct the pose prediction using the corresponding prompts.To further enhance manipulation stability, we devise a Test Time Adaptation strategy that enables AIC MLLM to better adapt to the current scene configuration.Finally, extensive experiments are conducted in both simulated and real-world environments to evaluate the proposed method. The results demonstrate that our AIC MLLM can efficiently correct failure samples by leveraging interaction experience prompts.Real-world demonstration can be found at https://sites.google.com/view/aic-mllm
☆ GECOBench: A Gender-Controlled Text Dataset and Benchmark for Quantifying Biases in Explanations
Large pre-trained language models have become popular for many applications and form an important backbone of many downstream tasks in natural language processing (NLP). Applying 'explainable artificial intelligence' (XAI) techniques to enrich such models' outputs is considered crucial for assuring their quality and shedding light on their inner workings. However, large language models are trained on a plethora of data containing a variety of biases, such as gender biases, affecting model weights and, potentially, behavior. Currently, it is unclear to what extent such biases also impact model explanations in possibly unfavorable ways. We create a gender-controlled text dataset, GECO, in which otherwise identical sentences appear in male and female forms. This gives rise to ground-truth 'world explanations' for gender classification tasks, enabling the objective evaluation of the correctness of XAI methods. We also provide GECOBench, a rigorous quantitative evaluation framework benchmarking popular XAI methods, applying them to pre-trained language models fine-tuned to different degrees. This allows us to investigate how pre-training induces undesirable bias in model explanations and to what extent fine-tuning can mitigate such explanation bias. We show a clear dependency between explanation performance and the number of fine-tuned layers, where XAI methods are observed to particularly benefit from fine-tuning or complete retraining of embedding layers. Remarkably, this relationship holds for models achieving similar classification performance on the same task. With that, we highlight the utility of the proposed gender-controlled dataset and novel benchmarking approach for research and development of novel XAI methods. All code including dataset generation, model training, evaluation and visualization is available at: https://github.com/braindatalab/gecobench
comment: Under review
☆ Do Parameters Reveal More than Loss for Membership Inference? ICML 2024
Membership inference attacks aim to infer whether an individual record was used to train a model, serving as a key tool for disclosure auditing. While such evaluations are useful to demonstrate risk, they are computationally expensive and often make strong assumptions about potential adversaries' access to models and training environments, and thus do not provide very tight bounds on leakage from potential attacks. We show how prior claims around black-box access being sufficient for optimal membership inference do not hold for most useful settings such as stochastic gradient descent, and that optimal membership inference indeed requires white-box access. We validate our findings with a new white-box inference attack IHA (Inverse Hessian Attack) that explicitly uses model parameters by taking advantage of computing inverse-Hessian vector products. Our results show that both audits and adversaries may be able to benefit from access to model parameters, and we advocate for further research into white-box methods for membership privacy auditing.
comment: Accepted at High-dimensional Learning Dynamics (HiLD) Workshop, ICML 2024
☆ Improving Quality Control of Whole Slide Images by Explicit Artifact Augmentation
The problem of artifacts in whole slide image acquisition, prevalent in both clinical workflows and research-oriented settings, necessitates human intervention and re-scanning. Overcoming this challenge requires developing quality control algorithms, that are hindered by the limited availability of relevant annotated data in histopathology. The manual annotation of ground-truth for artifact detection methods is expensive and time-consuming. This work addresses the issue by proposing a method dedicated to augmenting whole slide images with artifacts. The tool seamlessly generates and blends artifacts from an external library to a given histopathology dataset. The augmented datasets are then utilized to train artifact classification methods. The evaluation shows their usefulness in classification of the artifacts, where they show an improvement from 0.10 to 0.01 AUROC depending on the artifact type. The framework, model, weights, and ground-truth annotations are freely released to facilitate open science and reproducible research.
☆ Explainable Artificial Intelligence and Multicollinearity : A Mini Review of Current Approaches
Explainable Artificial Intelligence (XAI) methods help to understand the internal mechanism of machine learning models and how they reach a specific decision or made a specific action. The list of informative features is one of the most common output of XAI methods. Multicollinearity is one of the big issue that should be considered when XAI generates the explanation in terms of the most informative features in an AI system. No review has been dedicated to investigate the current approaches to handle such significant issue. In this paper, we provide a review of the current state-of-the-art approaches in relation to the XAI in the context of recent advances in dealing with the multicollinearity issue. To do so, we searched in three repositories that are: Web of Science, Scopus and IEEE Xplore to find pertinent published papers. After excluding irrelevant papers, seven papers were considered in the review. In addition, we discuss the current XAI methods and their limitations in dealing with the multicollinearity and suggest future directions.
☆ FullCert: Deterministic End-to-End Certification for Training and Inference of Neural Networks
Modern machine learning models are sensitive to the manipulation of both the training data (poisoning attacks) and inference data (adversarial examples). Recognizing this issue, the community has developed many empirical defenses against both attacks and, more recently, provable certification methods against inference-time attacks. However, such guarantees are still largely lacking for training-time attacks. In this work, we present FullCert, the first end-to-end certifier with sound, deterministic bounds, which proves robustness against both training-time and inference-time attacks. We first bound all possible perturbations an adversary can make to the training data under the considered threat model. Using these constraints, we bound the perturbations' influence on the model's parameters. Finally, we bound the impact of these parameter changes on the model's prediction, resulting in joint robustness guarantees against poisoning and adversarial examples. To facilitate this novel certification paradigm, we combine our theoretical work with a new open-source library BoundFlow, which enables model training on bounded datasets. We experimentally demonstrate FullCert's feasibility on two different datasets.
☆ Revisiting Spurious Correlation in Domain Generalization
Without loss of generality, existing machine learning techniques may learn spurious correlation dependent on the domain, which exacerbates the generalization of models in out-of-distribution (OOD) scenarios. To address this issue, recent works build a structural causal model (SCM) to describe the causality within data generation process, thereby motivating methods to avoid the learning of spurious correlation by models. However, from the machine learning viewpoint, such a theoretical analysis omits the nuanced difference between the data generation process and representation learning process, resulting in that the causal analysis based on the former cannot well adapt to the latter. To this end, we explore to build a SCM for representation learning process and further conduct a thorough analysis of the mechanisms underlying spurious correlation. We underscore that adjusting erroneous covariates introduces bias, thus necessitating the correct selection of spurious correlation mechanisms based on practical application scenarios. In this regard, we substantiate the correctness of the proposed SCM and further propose to control confounding bias in OOD generalization by introducing a propensity score weighted estimator, which can be integrated into any existing OOD method as a plug-and-play module. The empirical results comprehensively demonstrate the effectiveness of our method on synthetic and large-scale real OOD datasets.
☆ On the Feasibility of Fidelity$^-$ for Graph Pruning IJCAI
As one of popular quantitative metrics to assess the quality of explanation of graph neural networks (GNNs), fidelity measures the output difference after removing unimportant parts of the input graph. Fidelity has been widely used due to its straightforward interpretation that the underlying model should produce similar predictions when features deemed unimportant from the explanation are removed. This raises a natural question: "Does fidelity induce a global (soft) mask for graph pruning?" To solve this, we aim to explore the potential of the fidelity measure to be used for graph pruning, eventually enhancing the GNN models for better efficiency. To this end, we propose Fidelity$^-$-inspired Pruning (FiP), an effective framework to construct global edge masks from local explanations. Our empirical observations using 7 edge attribution methods demonstrate that, surprisingly, general eXplainable AI methods outperform methods tailored to GNNs in terms of graph pruning performance.
comment: 6 pages, 3 figures, 2 tables; IJCAI Workshop on Explainable AI (XAI 2024) (to appear) (Please cite our workshop version.)
☆ Teleporter Theory: A General and Simple Approach for Modeling Cross-World Counterfactual Causality
Leveraging the development of structural causal model (SCM), researchers can establish graphical models for exploring the causal mechanisms behind machine learning techniques. As the complexity of machine learning applications rises, single-world interventionism causal analysis encounters theoretical adaptation limitations. Accordingly, cross-world counterfactual approach extends our understanding of causality beyond observed data, enabling hypothetical reasoning about alternative scenarios. However, the joint involvement of cross-world variables, encompassing counterfactual variables and real-world variables, challenges the construction of the graphical model. Twin network is a subtle attempt, establishing a symbiotic relationship, to bridge the gap between graphical modeling and the introduction of counterfactuals albeit with room for improvement in generalization. In this regard, we demonstrate the theoretical breakdowns of twin networks in certain cross-world counterfactual scenarios. To this end, we propose a novel teleporter theory to establish a general and simple graphical representation of counterfactuals, which provides criteria for determining teleporter variables to connect multiple worlds. In theoretical application, we determine that introducing the proposed teleporter theory can directly obtain the conditional independence between counterfactual variables and real-world variables from the cross-world SCM without requiring complex algebraic derivations. Accordingly, we can further identify counterfactual causal effects through cross-world symbolic derivation. We demonstrate the generality of the teleporter theory to the practical application. Adhering to the proposed theory, we build a plug-and-play module, and the effectiveness of which are substantiated by experiments on benchmarks.
☆ Online Context Learning for Socially-compliant Navigation
Robot social navigation needs to adapt to different human factors and environmental contexts. However, since these factors and contexts are difficult to predict and cannot be exhaustively enumerated, traditional learning-based methods have difficulty in ensuring the social attributes of robots in long-term and cross-environment deployments. This letter introduces an online context learning method that aims to empower robots to adapt to new social environments online. The proposed method adopts a two-layer structure. The bottom layer is built using a deep reinforcement learning-based method to ensure the output of basic robot navigation commands. The upper layer is implemented using an online robot learning-based method to socialize the control commands suggested by the bottom layer. Experiments using a community-wide simulator show that our method outperforms the state-of-the-art ones. Experimental results in the most challenging scenarios show that our method improves the performance of the state-of-the-art by 8%. The source code of the proposed method, the data used, and the tools for the per-training step will be publicly available at https://github.com/Nedzhaken/SOCSARL-OL.
comment: 8 pages, 4 figures, 1 table, 1 algorithm
☆ Constrained Reinforcement Learning with Average Reward Objective: Model-Based and Model-Free Algorithms
Reinforcement Learning (RL) serves as a versatile framework for sequential decision-making, finding applications across diverse domains such as robotics, autonomous driving, recommendation systems, supply chain optimization, biology, mechanics, and finance. The primary objective in these applications is to maximize the average reward. Real-world scenarios often necessitate adherence to specific constraints during the learning process. This monograph focuses on the exploration of various model-based and model-free approaches for Constrained RL within the context of average reward Markov Decision Processes (MDPs). The investigation commences with an examination of model-based strategies, delving into two foundational methods - optimism in the face of uncertainty and posterior sampling. Subsequently, the discussion transitions to parametrized model-free approaches, where the primal-dual policy gradient-based algorithm is explored as a solution for constrained MDPs. The monograph provides regret guarantees and analyzes constraint violation for each of the discussed setups. For the above exploration, we assume the underlying MDP to be ergodic. Further, this monograph extends its discussion to encompass results tailored for weakly communicating MDPs, thereby broadening the scope of its findings and their relevance to a wider range of practical scenarios.
comment: arXiv admin note: text overlap with arXiv:2402.02042; text overlap with arXiv:2202.00150 by other authors
☆ Vocabulary Expansion for Low-resource Cross-lingual Transfer
Large language models (LLMs) have shown remarkable capabilities in many languages beyond English. Yet, LLMs require more inference steps when generating non-English text due to their reliance on English-centric tokenizers, vocabulary, and pre-training data, resulting in higher usage costs to non-English speakers. Vocabulary expansion with target language tokens is a widely used cross-lingual vocabulary adaptation approach to remedy this issue. Despite its effectiveness in inference speedup, the majority of previous work has focused on high-resource settings assuming access to a substantial amount of target language data to effectively initialize the embeddings of the new tokens and adapt the LLM to the target language. However, vocabulary expansion for LLMs in low-resource settings (i.e. languages and compute) has yet to be explored. In this paper, we investigate sample-efficient adaptation strategies from different angles, including target vocabulary size and initialization methods, and the amount of target data available for adaptation. Extensive experiments across typologically diverse languages, tasks and models show that simpler heuristic-based embedding initialization is more efficient and robust to changes in target vocabulary size and adaptation data in low-resource settings, outperforming a popular random initialization and a more sophisticated state-of-the-art approach that relies on external data and model.
☆ How Far Can In-Context Alignment Go? Exploring the State of In-Context Alignment
Recent studies have demonstrated that In-Context Learning (ICL), through the use of specific demonstrations, can align Large Language Models (LLMs) with human preferences known as In-Context Alignment (ICA), indicating that models can comprehend human instructions without requiring parameter adjustments. However, the exploration of the mechanism and applicability of ICA remains limited. In this paper, we begin by dividing the context text used in ICA into three categories: format, system prompt, and example. Through ablation experiments, we investigate the effectiveness of each part in enabling ICA to function effectively. We then examine how variants in these parts impact the model's alignment performance. Our findings indicate that the example part is crucial for enhancing the model's alignment capabilities, with changes in examples significantly affecting alignment performance. We also conduct a comprehensive evaluation of ICA's zero-shot capabilities in various alignment tasks. The results indicate that compared to parameter fine-tuning methods, ICA demonstrates superior performance in knowledge-based tasks and tool-use tasks. However, it still exhibits certain limitations in areas such as multi-turn dialogues and instruction following.
comment: 22 pages, 6 figures, work in progress
☆ Promises, Outlooks and Challenges of Diffusion Language Modeling
The modern autoregressive Large Language Models (LLMs) have achieved outstanding performance on NLP benchmarks, and they are deployed in the real world. However, they still suffer from limitations of the autoregressive training paradigm. For example, autoregressive token generation is notably slow and can be prone to \textit{exposure bias}. The diffusion-based language models were proposed as an alternative to autoregressive generation to address some of these limitations. We evaluate the recently proposed Score Entropy Discrete Diffusion (SEDD) approach and show it is a promising alternative to autoregressive generation but it has some short-comings too. We empirically demonstrate the advantages and challenges of SEDD, and observe that SEDD generally matches autoregressive models in perplexity and on benchmarks such as HellaSwag, Arc or WinoGrande. Additionally, we show that in terms of inference latency, SEDD can be up to 4.5$\times$ more efficient than GPT-2. While SEDD allows conditioning on tokens at abitrary positions, SEDD appears slightly weaker than GPT-2 for conditional generation given short prompts. Finally, we reproduced the main results from the original SEDD paper.
☆ TRACE the Evidence: Constructing Knowledge-Grounded Reasoning Chains for Retrieval-Augmented Generation
Retrieval-augmented generation (RAG) offers an effective approach for addressing question answering (QA) tasks. However, the imperfections of the retrievers in RAG models often result in the retrieval of irrelevant information, which could introduce noises and degrade the performance, especially when handling multi-hop questions that require multiple steps of reasoning. To enhance the multi-hop reasoning ability of RAG models, we propose TRACE. TRACE constructs knowledge-grounded reasoning chains, which are a series of logically connected knowledge triples, to identify and integrate supporting evidence from the retrieved documents for answering questions. Specifically, TRACE employs a KG Generator to create a knowledge graph (KG) from the retrieved documents, and then uses an Autoregressive Reasoning Chain Constructor to build reasoning chains. Experimental results on three multi-hop QA datasets show that TRACE achieves an average performance improvement of up to 14.03% compared to using all the retrieved documents. Moreover, the results indicate that using reasoning chains as context, rather than the entire documents, is often sufficient to correctly answer questions.
☆ Adaptive Reinforcement Learning Planning: Harnessing Large Language Models for Complex Information Extraction
Existing research on large language models (LLMs) shows that they can solve information extraction tasks through multi-step planning. However, their extraction behavior on complex sentences and tasks is unstable, emerging issues such as false positives and missing elements. We observe that decomposing complex extraction tasks and extracting them step by step can effectively improve LLMs' performance, and the extraction orders of entities significantly affect the final results of LLMs. This paper proposes a two-stage multi-step method for LLM-based information extraction and adopts the RL framework to execute the multi-step planning. We regard sequential extraction as a Markov decision process, build an LLM-based extraction environment, design a decision module to adaptively provide the optimal order for sequential entity extraction on different sentences, and utilize the DDQN algorithm to train the decision model. We also design the rewards and evaluation metrics suitable for the extraction results of LLMs. We conduct extensive experiments on multiple public datasets to demonstrate the effectiveness of our method in improving the information extraction capabilities of LLMs.
☆ Attention-Based Deep Reinforcement Learning for Qubit Allocation in Modular Quantum Architectures
Modular, distributed and multi-core architectures are currently considered a promising approach for scalability of quantum computing systems. The integration of multiple Quantum Processing Units necessitates classical and quantum-coherent communication, introducing challenges related to noise and quantum decoherence in quantum state transfers between cores. Optimizing communication becomes imperative, and the compilation and mapping of quantum circuits onto physical qubits must minimize state transfers while adhering to architectural constraints. The compilation process, inherently an NP-hard problem, demands extensive search times even with a small number of qubits to be solved to optimality. To address this challenge efficiently, we advocate for the utilization of heuristic mappers that can rapidly generate solutions. In this work, we propose a novel approach employing Deep Reinforcement Learning (DRL) methods to learn these heuristics for a specific multi-core architecture. Our DRL agent incorporates a Transformer encoder and Graph Neural Networks. It encodes quantum circuits using self-attention mechanisms and produce outputs through an attention-based pointer mechanism that directly signifies the probability of matching logical qubits with physical cores. This enables the selection of optimal cores for logical qubits efficiently. Experimental evaluations show that the proposed method can outperform baseline approaches in terms of reducing inter-core communications and minimizing online time-to-solution. This research contributes to the advancement of scalable quantum computing systems by introducing a novel learning-based heuristic approach for efficient quantum circuit compilation and mapping.
GPT-Powered Elicitation Interview Script Generator for Requirements Engineering Training
Elicitation interviews are the most common requirements elicitation technique, and proficiency in conducting these interviews is crucial for requirements elicitation. Traditional training methods, typically limited to textbook learning, may not sufficiently address the practical complexities of interviewing techniques. Practical training with various interview scenarios is important for understanding how to apply theoretical knowledge in real-world contexts. However, there is a shortage of educational interview material, as creating interview scripts requires both technical expertise and creativity. To address this issue, we develop a specialized GPT agent for auto-generating interview scripts. The GPT agent is equipped with a dedicated knowledge base tailored to the guidelines and best practices of requirements elicitation interview procedures. We employ a prompt chaining approach to mitigate the output length constraint of GPT to be able to generate thorough and detailed interview scripts. This involves dividing the interview into sections and crafting distinct prompts for each, allowing for the generation of complete content for each section. The generated scripts are assessed through standard natural language generation evaluation metrics and an expert judgment study, confirming their applicability in requirements engineering training.
comment: Accepted to RE@Next! at the IEEE International Requirements Engineering Conference 2024 at Reykjavik, Iceland
☆ Analysing the Behaviour of Tree-Based Neural Networks in Regression Tasks
The landscape of deep learning has vastly expanded the frontiers of source code analysis, particularly through the utilization of structural representations such as Abstract Syntax Trees (ASTs). While these methodologies have demonstrated effectiveness in classification tasks, their efficacy in regression applications, such as execution time prediction from source code, remains underexplored. This paper endeavours to decode the behaviour of tree-based neural network models in the context of such regression challenges. We extend the application of established models--tree-based Convolutional Neural Networks (CNNs), Code2Vec, and Transformer-based methods--to predict the execution time of source code by parsing it to an AST. Our comparative analysis reveals that while these models are benchmarks in code representation, they exhibit limitations when tasked with regression. To address these deficiencies, we propose a novel dual-transformer approach that operates on both source code tokens and AST representations, employing cross-attention mechanisms to enhance interpretability between the two domains. Furthermore, we explore the adaptation of Graph Neural Networks (GNNs) to this tree-based problem, theorizing the inherent compatibility due to the graphical nature of ASTs. Empirical evaluations on real-world datasets showcase that our dual-transformer model outperforms all other tree-based neural networks and the GNN-based models. Moreover, our proposed dual transformer demonstrates remarkable adaptability and robust performance across diverse datasets.
comment: This Paper is submitted to IEEE Transactions on Neural Networks and Learning Systems
☆ AnyTrans: Translate AnyText in the Image with Large Scale Models
This paper introduces AnyTrans, an all-encompassing framework for the task-Translate AnyText in the Image (TATI), which includes multilingual text translation and text fusion within images. Our framework leverages the strengths of large-scale models, such as Large Language Models (LLMs) and text-guided diffusion models, to incorporate contextual cues from both textual and visual elements during translation. The few-shot learning capability of LLMs allows for the translation of fragmented texts by considering the overall context. Meanwhile, the advanced inpainting and editing abilities of diffusion models make it possible to fuse translated text seamlessly into the original image while preserving its style and realism. Additionally, our framework can be constructed entirely using open-source models and requires no training, making it highly accessible and easily expandable. To encourage advancement in the TATI task, we have meticulously compiled a test dataset called MTIT6, which consists of multilingual text image translation data from six language pairs.
☆ Super(ficial)-alignment: Strong Models May Deceive Weak Models in Weak-to-Strong Generalization
Superalignment, where humans are weak supervisors of superhuman models, has become an important and widely discussed issue in the current era of rapid development of Large Language Models (LLMs). The recent work preliminarily studies this problem by using weak models to supervise strong models. It discovers that weakly supervised strong students can consistently outperform weak teachers towards the alignment target, leading to a weak-to-strong generalization phenomenon. However, we are concerned that behind such a promising phenomenon, whether there exists an issue of weak-to-strong deception, where strong models may deceive weak models by exhibiting well-aligned in areas known to weak models but producing misaligned behaviors in cases weak models do not know. We then take an initial step towards exploring this security issue in a specific but realistic multi-objective alignment case, where there may be some alignment targets conflicting with each other (e.g., helpfulness v.s. harmlessness). Such a conflict is likely to cause strong models to deceive weak models in one alignment dimension to gain high reward in other alignment dimension. Our experiments on both the reward modeling task and the preference optimization scenario indicate: (1) the weak-to-strong deception exists; (2) the deception phenomenon may intensify as the capability gap between weak and strong models increases. We also discuss potential solutions and find bootstrapping with an intermediate model can mitigate the deception to some extent. Our work highlights the urgent need to pay more attention to the true reliability of superalignment.
comment: Code is available at https://github.com/keven980716/weak-to-strong-deception
☆ A Simple and Effective $L_2$ Norm-Based Strategy for KV Cache Compression
The deployment of large language models (LLMs) is often hindered by the extensive memory requirements of the Key-Value (KV) cache, especially as context lengths increase. Existing approaches to reduce the KV cache size involve either fine-tuning the model to learn a compression strategy or leveraging attention scores to reduce the sequence length. We analyse the attention distributions in decoder-only Transformers-based models and observe that attention allocation patterns stay consistent across most layers. Surprisingly, we find a clear correlation between the $L_2$ and the attention scores over cached KV pairs, where a low $L_2$ of a key embedding usually leads to a high attention score during decoding. This finding indicates that the influence of a KV pair is potentially determined by the key embedding itself before being queried. Based on this observation, we compress the KV cache based on the $L_2$ of key embeddings. Our experimental results show that this simple strategy can reduce the KV cache size by 50% on language modelling and needle-in-a-haystack tasks and 90% on passkey retrieval tasks without losing accuracy.
☆ Fusion Makes Perfection: An Efficient Multi-Grained Matching Approach for Zero-Shot Relation Extraction NAACL2024
Predicting unseen relations that cannot be observed during the training phase is a challenging task in relation extraction. Previous works have made progress by matching the semantics between input instances and label descriptions. However, fine-grained matching often requires laborious manual annotation, and rich interactions between instances and label descriptions come with significant computational overhead. In this work, we propose an efficient multi-grained matching approach that uses virtual entity matching to reduce manual annotation cost, and fuses coarse-grained recall and fine-grained classification for rich interactions with guaranteed inference speed. Experimental results show that our approach outperforms the previous State Of The Art (SOTA) methods, and achieves a balance between inference efficiency and prediction accuracy in zero-shot relation extraction tasks. Our code is available at https://github.com/longls777/EMMA.
comment: Accepted to the main conference of NAACL2024
☆ DiTTo-TTS: Efficient and Scalable Zero-Shot Text-to-Speech with Diffusion Transformer
Large-scale diffusion models have shown outstanding generative abilities across multiple modalities including images, videos, and audio. However, text-to-speech (TTS) systems typically involve domain-specific modeling factors (e.g., phonemes and phoneme-level durations) to ensure precise temporal alignments between text and speech, which hinders the efficiency and scalability of diffusion models for TTS. In this work, we present an efficient and scalable Diffusion Transformer (DiT) that utilizes off-the-shelf pre-trained text and speech encoders. Our approach addresses the challenge of text-speech alignment via cross-attention mechanisms with the prediction of the total length of speech representations. To achieve this, we enhance the DiT architecture to suit TTS and improve the alignment by incorporating semantic guidance into the latent space of speech. We scale the training dataset and the model size to 82K hours and 790M parameters, respectively. Our extensive experiments demonstrate that the large-scale diffusion model for TTS without domain-specific modeling not only simplifies the training pipeline but also yields superior or comparable zero-shot performance to state-of-the-art TTS models in terms of naturalness, intelligibility, and speaker similarity. Our speech samples are available at https://ditto-tts.github.io.
☆ Dredge Word, Social Media, and Webgraph Networks for Unreliable Website Classification and Identification
In an attempt to mimic the complex paths through which unreliable content spreads between search engines and social media, we explore the impact of incorporating both webgraph and large-scale social media contexts into website credibility classification and discovery systems. We further explore the usage of what we define as \textit{dredge words} on social media -- terms or phrases for which unreliable domains rank highly. Through comprehensive graph neural network ablations, we demonstrate that curriculum-based heterogeneous graph models that leverage context from both webgraphs and social media data outperform homogeneous and single-mode approaches. We further demonstrate that the incorporation of dredge words into our model strongly associates unreliable websites with social media and online commerce platforms. Finally, we show our heterogeneous model greatly outperforms competing systems in the top-k identification of unlabeled unreliable websites. We demonstrate the strong unreliability signals present in the diverse paths that users follow to uncover unreliable content, and we release a novel dataset of dredge words.
☆ Formally Certified Approximate Model Counting
Approximate model counting is the task of approximating the number of solutions to an input Boolean formula. The state-of-the-art approximate model counter for formulas in conjunctive normal form (CNF), ApproxMC, provides a scalable means of obtaining model counts with probably approximately correct (PAC)-style guarantees. Nevertheless, the validity of ApproxMC's approximation relies on a careful theoretical analysis of its randomized algorithm and the correctness of its highly optimized implementation, especially the latter's stateful interactions with an incremental CNF satisfiability solver capable of natively handling parity (XOR) constraints. We present the first certification framework for approximate model counting with formally verified guarantees on the quality of its output approximation. Our approach combines: (i) a static, once-off, formal proof of the algorithm's PAC guarantee in the Isabelle/HOL proof assistant; and (ii) dynamic, per-run, verification of ApproxMC's calls to an external CNF-XOR solver using proof certificates. We detail our general approach to establish a rigorous connection between these two parts of the verification, including our blueprint for turning the formalized, randomized algorithm into a verified proof checker, and our design of proof certificates for both ApproxMC and its internal CNF-XOR solving steps. Experimentally, we show that certificate generation adds little overhead to an approximate counter implementation, and that our certificate checker is able to fully certify $84.7\%$ of instances with generated certificates when given the same time and memory limits as the counter.
comment: The extended version, including the appendix, of the paper to be published in CAV24
☆ HARE: HumAn pRiors, a key to small language model Efficiency
Human priors play a crucial role in efficiently utilizing data in deep learning. However, with the development of large language models (LLMs), there is an increasing emphasis on scaling both model size and data volume, which often diminishes the importance of human priors in data construction. Influenced by these trends, existing Small Language Models (SLMs) mainly rely on web-scraped large-scale training data, neglecting the proper incorporation of human priors. This oversight limits the training efficiency of language models in resource-constrained settings. In this paper, we propose a principle to leverage human priors for data construction. This principle emphasizes achieving high-performance SLMs by training on a concise dataset that accommodates both semantic diversity and data quality consistency, while avoiding benchmark data leakage. Following this principle, we train an SLM named HARE-1.1B. Extensive experiments on large-scale benchmark datasets demonstrate that HARE-1.1B performs favorably against state-of-the-art SLMs, validating the effectiveness of the proposed principle. Additionally, this provides new insights into efficient language model training in resource-constrained environments from the view of human priors.
☆ CodeGemma: Open Code Models Based on Gemma
This paper introduces CodeGemma, a collection of specialized open code models built on top of Gemma, capable of a variety of code and natural language generation tasks. We release three model variants. CodeGemma 7B pretrained (PT) and instruction-tuned (IT) variants have remarkably resilient natural language understanding, excel in mathematical reasoning, and match code capabilities of other open models. CodeGemma 2B is a state-of-the-art code completion model designed for fast code infilling and open-ended generation in latency-sensitive settings.
☆ Evaluating Open Language Models Across Task Types, Application Domains, and Reasoning Types: An In-Depth Experimental Analysis
The rapid rise of Language Models (LMs) has expanded their use in several applications. Yet, due to constraints of model size, associated cost, or proprietary restrictions, utilizing state-of-the-art (SOTA) LLMs is not always feasible. With open, smaller LMs emerging, more applications can leverage their capabilities, but selecting the right LM can be challenging. This work conducts an in-depth experimental analysis of the semantic correctness of outputs of 10 smaller, open LMs across three aspects: task types, application domains and reasoning types, using diverse prompt styles. We demonstrate that most effective models and prompt styles vary depending on the specific requirements. Our analysis provides a comparative assessment of LMs and prompt styles using a proposed three-tier schema of aspects for their strategic selection based on use-case and other constraints. We also show that if utilized appropriately, these LMs can compete with, and sometimes outperform, SOTA LLMs like DeepSeek-v2, GPT-3.5-Turbo, and GPT-4o.
☆ DistPred: A Distribution-Free Probabilistic Inference Method for Regression and Forecasting
Traditional regression and prediction tasks often only provide deterministic point estimates. To estimate the uncertainty or distribution information of the response variable, methods such as Bayesian inference, model ensembling, or MC Dropout are typically used. These methods either assume that the posterior distribution of samples follows a Gaussian process or require thousands of forward passes for sample generation. We propose a novel approach called DistPred for regression and forecasting tasks, which overcomes the limitations of existing methods while remaining simple and powerful. Specifically, we transform proper scoring rules that measure the discrepancy between the predicted distribution and the target distribution into a differentiable discrete form and use it as a loss function to train the model end-to-end. This allows the model to sample numerous samples in a single forward pass to estimate the potential distribution of the response variable. We have compared our method with several existing approaches on multiple datasets and achieved state-of-the-art performance. Additionally, our method significantly improves computational efficiency. For example, compared to state-of-the-art models, DistPred has a 90x faster inference speed. Experimental results can be reproduced through https://github.com/Anoise/DistPred.
☆ Boosting Scientific Concepts Understanding: Can Analogy from Teacher Models Empower Student Models?
Analogical reasoning plays a critical role in human cognition, enabling us to understand new concepts by associating them with familiar ones. Previous research in the AI community has mainly focused on identifying and generating analogies and then examining their quality under human evaluation, which overlooks the practical application of these analogies in real-world settings. Inspired by the human education process, in this paper, we propose to investigate how analogies created by teacher language models (LMs) can assist student LMs in understanding scientific concepts, thereby aligning more closely with practical scenarios. Our results suggest that free-form analogies can indeed aid LMs in understanding concepts. Additionally, analogies generated by student LMs can improve their own performance on scientific question answering, demonstrating their capability to use analogies for self-learning new knowledge. Resources are available at https://github.com/siyuyuan/SCUA.
☆ Fairer Preferences Elicit Improved Human-Aligned Large Language Model Judgments
Large language models (LLMs) have shown promising abilities as cost-effective and reference-free evaluators for assessing language generation quality. In particular, pairwise LLM evaluators, which compare two generated texts and determine the preferred one, have been employed in a wide range of applications. However, LLMs exhibit preference biases and worrying sensitivity to prompt designs. In this work, we first reveal that the predictive preference of LLMs can be highly brittle and skewed, even with semantically equivalent instructions. We find that fairer predictive preferences from LLMs consistently lead to judgments that are better aligned with humans. Motivated by this phenomenon, we propose an automatic Zero-shot Evaluation-oriented Prompt Optimization framework, ZEPO, which aims to produce fairer preference decisions and improve the alignment of LLM evaluators with human judgments. To this end, we propose a zero-shot learning objective based on the preference decision fairness. ZEPO demonstrates substantial performance improvements over state-of-the-art LLM evaluators, without requiring labeled data, on representative meta-evaluation benchmarks. Our findings underscore the critical correlation between preference fairness and human alignment, positioning ZEPO as an efficient prompt optimizer for bridging the gap between LLM evaluators and human judgments.
comment: 5 pages, 3 figures, 1 table (12 pages, 4 figures, 6 tables including references and appendices)
☆ $\textit{Refiner}$: Restructure Retrieval Content Efficiently to Advance Question-Answering Capabilities
Large Language Models (LLMs) are limited by their parametric knowledge, leading to hallucinations in knowledge-extensive tasks. To address this, Retrieval-Augmented Generation (RAG) incorporates external document chunks to expand LLM knowledge. Furthermore, compressing information from document chunks through extraction or summarization can improve LLM performance. Nonetheless, LLMs still struggle to notice and utilize scattered key information, a problem known as the "lost-in-the-middle" syndrome. Therefore, we typically need to restructure the content for LLM to recognize the key information. We propose $\textit{Refiner}$, an end-to-end extract-and-restructure paradigm that operates in the post-retrieval process of RAG. $\textit{Refiner}$ leverages a single decoder-only LLM to adaptively extract query-relevant contents verbatim along with the necessary context, and section them based on their interconnectedness, thereby highlights information distinction, and aligns downstream LLMs with the original context effectively. Experiments show that a trained $\textit{Refiner}$ (with 7B parameters) exhibits significant gain to downstream LLM in improving answer accuracy, and outperforms other state-of-the-art advanced RAG and concurrent compressing approaches in various single-hop and multi-hop QA tasks. Notably, $\textit{Refiner}$ achieves a 80.5% tokens reduction and a 1.6-7.0% improvement margin in multi-hop tasks compared to the next best solution. $\textit{Refiner}$ is a plug-and-play solution that can be seamlessly integrated with RAG systems, facilitating its application across diverse open-source frameworks.
comment: 8 pages
☆ Preserving Knowledge in Large Language Model: A Model-Agnostic Self-Decompression Approach
Humans can retain old knowledge while learning new information, but Large Language Models (LLMs) often suffer from catastrophic forgetting when post-pretrained or supervised fine-tuned (SFT) on domain-specific data. Moreover, for Multimodal Large Language Models (MLLMs) which are composed of the LLM base and visual projector (e.g. LLaVA), a significant decline in performance on language benchmarks was observed compared to their single-modality counterparts. To address these challenges, we introduce a novel model-agnostic self-decompression method, Tree Generation (TG), that decompresses knowledge within LLMs into the training corpus. This paper focuses on TG-SFT, which can synthetically generate SFT data for the instruction tuning steps. By incorporating the dumped corpus during SFT for MLLMs, we significantly reduce the forgetting problem.
☆ Full-ECE: A Metric For Token-level Calibration on Large Language Models
Deep Neural Networks (DNNs) excel in various domains but face challenges in providing accurate uncertainty estimates, which are crucial for high-stakes applications. Large Language Models (LLMs) have recently emerged as powerful tools, demonstrating exceptional performance in language tasks. However, traditional calibration metrics such as Expected Calibration Error (ECE) and classwise-ECE (cw-ECE) are inadequate for LLMs due to their vast vocabularies, data complexity, and distributional focus. To address this, we propose a novel calibration concept called full calibration and introduce its corresponding metric, Full-ECE. Full-ECE evaluates the entire predicted probability distribution, offering a more accurate and robust measure of calibration for LLMs.
☆ Program Synthesis Benchmark for Visual Programming in XLogoOnline Environment
Large language and multimodal models have shown remarkable successes on various benchmarks focused on specific skills such as general-purpose programming, natural language understanding, math word problem-solving, and visual question answering. However, it is unclear how well these models perform on tasks that require a combination of these skills. In this paper, we curate a novel program synthesis benchmark based on the XLogoOnline visual programming environment. The benchmark comprises 85 real-world tasks from the Mini-level of the XLogoOnline environment, each requiring a combination of different skills such as spatial planning, basic programming, and logical reasoning. Our evaluation shows that current state-of-the-art models like GPT-4V and Llama3-70B struggle to solve these tasks, achieving only 20% and 2.35% success rates. Next, we develop a fine-tuning pipeline to boost the performance of models by leveraging a large-scale synthetic training dataset with over 80000 tasks. Moreover, we showcase how emulator-driven feedback can be used to design a curriculum over training data distribution. We showcase that a fine-tuned Llama3-8B drastically outperforms GPT-4V and Llama3-70B models, and provide an in-depth analysis of the models' expertise across different skill dimensions. We will publicly release the benchmark for future research on program synthesis in visual programming.
☆ GitHub Copilot: the perfect Code compLeeter?
This paper aims to evaluate GitHub Copilot's generated code quality based on the LeetCode problem set using a custom automated framework. We evaluate the results of Copilot for 4 programming languages: Java, C++, Python3 and Rust. We aim to evaluate Copilot's reliability in the code generation stage, the correctness of the generated code and its dependency on the programming language, problem's difficulty level and problem's topic. In addition to that, we evaluate code's time and memory efficiency and compare it to the average human results. In total, we generate solutions for 1760 problems for each programming language and evaluate all the Copilot's suggestions for each problem, resulting in over 50000 submissions to LeetCode spread over a 2-month period. We found that Copilot successfully solved most of the problems. However, Copilot was rather more successful in generating code in Java and C++ than in Python3 and Rust. Moreover, in case of Python3 Copilot proved to be rather unreliable in the code generation phase. We also discovered that Copilot's top-ranked suggestions are not always the best. In addition, we analysed how the topic of the problem impacts the correctness rate. Finally, based on statistics information from LeetCode, we can conclude that Copilot generates more efficient code than an average human.
comment: 10 pages, 6 figures. Code available: https://github.com/IljaSir/CopilotSolverForLeetCode
☆ GUICourse: From General Vision Language Models to Versatile GUI Agents
Utilizing Graphic User Interface (GUI) for human-computer interaction is essential for accessing a wide range of digital tools. Recent advancements in Vision Language Models (VLMs) highlight the compelling potential to develop versatile agents to help humans finish GUI navigation tasks. However, current VLMs are challenged in terms of fundamental abilities (OCR and grounding) and GUI knowledge (the functions and control methods of GUI elements), preventing them from becoming practical GUI agents. To solve these challenges, we contribute GUICourse, a suite of datasets to train visual-based GUI agents from general VLMs. First, we introduce the GUIEnv dataset to strengthen the OCR and grounding capabilities of VLMs. Then, we introduce the GUIAct and GUIChat datasets to enrich their knowledge of GUI components and interactions. Experiments demonstrate that our GUI agents have better performance on common GUI tasks than their baseline VLMs. Even the small-size GUI agent (with 3.1B parameters) can still work well on single-step and multi-step GUI tasks. Finally, we analyze the different varieties in the training stage of this agent by ablation study. Our source codes and datasets are released at https://github.com/yiye3/GUICourse.
☆ Temporal Lidar Depth Completion
Given the lidar measurements from an autonomous vehicle, we can project the points and generate a sparse depth image. Depth completion aims at increasing the resolution of such a depth image by infilling and interpolating the sparse depth values. Like most existing approaches, we make use of camera images as guidance in very sparse or occluded regions. In addition, we propose a temporal algorithm that utilizes information from previous timesteps using recurrence. In this work, we show how a state-of-the-art method PENet can be modified to benefit from recurrency. Our algorithm achieves state-of-the-art results on the KITTI depth completion dataset while adding only less than one percent of additional overhead in terms of both neural network parameters and floating point operations. The accuracy is especially improved for faraway objects and regions containing a low amount of lidar depth samples. Even in regions without any ground truth (like sky and rooftops) we observe large improvements which are not captured by the existing evaluation metrics.
☆ Management Decisions in Manufacturing using Causal Machine Learning -- To Rework, or not to Rework?
In this paper, we present a data-driven model for estimating optimal rework policies in manufacturing systems. We consider a single production stage within a multistage, lot-based system that allows for optional rework steps. While the rework decision depends on an intermediate state of the lot and system, the final product inspection, and thus the assessment of the actual yield, is delayed until production is complete. Repair steps are applied uniformly to the lot, potentially improving some of the individual items while degrading others. The challenge is thus to balance potential yield improvement with the rework costs incurred. Given the inherently causal nature of this decision problem, we propose a causal model to estimate yield improvement. We apply methods from causal machine learning, in particular double/debiased machine learning (DML) techniques, to estimate conditional treatment effects from data and derive policies for rework decisions. We validate our decision model using real-world data from opto-electronic semiconductor manufacturing, achieving a yield improvement of 2 - 3% during the color-conversion process of white light-emitting diodes (LEDs).
comment: 30 pages, 10 figures
☆ VideoVista: A Versatile Benchmark for Video Understanding and Reasoning
Despite significant breakthroughs in video analysis driven by the rapid development of large multimodal models (LMMs), there remains a lack of a versatile evaluation benchmark to comprehensively assess these models' performance in video understanding and reasoning. To address this, we present VideoVista, a video QA benchmark that integrates challenges across diverse content categories, durations, and abilities. Specifically, VideoVista comprises 25,000 questions derived from 3,400 videos spanning 14 categories (e.g., Howto, Film, and Entertainment) with durations ranging from a few seconds to over 10 minutes. Besides, it encompasses 19 types of understanding tasks (e.g., anomaly detection, interaction understanding) and 8 reasoning tasks (e.g., logical reasoning, causal reasoning). To achieve this, we present an automatic data construction framework, leveraging powerful GPT-4o alongside advanced analysis tools (e.g., video splitting, object segmenting, and tracking). We also utilize this framework to construct training data to enhance the capabilities of video-related LMMs (Video-LMMs). Through a comprehensive and quantitative evaluation of cutting-edge models, we reveal that: 1) Video-LMMs face difficulties in fine-grained video tasks involving temporal location, object tracking, and anomaly detection; 2) Video-LMMs present inferior logical and relation reasoning abilities; 3) Open-source Video-LMMs' performance is significantly lower than GPT-4o and Gemini-1.5, lagging by 20 points. This highlights the crucial role VideoVista will play in advancing LMMs that can accurately understand videos and perform precise reasoning.
comment: 38 pages, 44 figures
☆ Optimizing and Testing Instruction-Following: Analyzing the Impact of Fine-Grained Instruction Variants on instruction-tuned LLMs
The effective alignment of Large Language Models (LLMs) with precise instructions is essential for their application in diverse real-world scenarios. Current methods focus on enhancing the diversity and complexity of training and evaluation samples, yet they fall short in accurately assessing LLMs' ability to follow similar instruction variants. We introduce an effective data augmentation technique that decomposes complex instructions into simpler sub-components, modifies these, and reconstructs them into new variants, thereby preserves the original instruction's context and complexity while introducing variability, which is critical for training and evaluating LLMs' instruction-following precision. We developed the DeMoRecon dataset using this method to both fine-tune and evaluate LLMs. Our findings show that LLMs fine-tuned with DeMoRecon will gain significant performance boost on both ours and commonly used instructions-following benchmarks.
☆ Iterative Utility Judgment Framework via LLMs Inspired by Relevance in Philosophy
Utility and topical relevance are critical measures in information retrieval (IR), reflecting system and user perspectives, respectively. While topical relevance has long been emphasized, utility is a higher standard of relevance and is more useful for facilitating downstream tasks, e.g., in Retrieval-Augmented Generation (RAG). When we incorporate utility judgments into RAG, we realize that the topical relevance, utility, and answering in RAG are closely related to the three types of relevance that Schutz discussed from a philosophical perspective. They are topical relevance, interpretational relevance, and motivational relevance, respectively. Inspired by the dynamic iterations of the three types of relevance, we propose an Iterative utiliTy judgmEnt fraMework (ITEM) to promote each step of the cycle of RAG. We conducted extensive experiments on multi-grade passage retrieval and factoid question-answering datasets (i.e., TREC DL, WebAP, and NQ). Experimental results demonstrate significant improvements in utility judgments, ranking of topical relevance, and answer generation upon representative baselines, including multiple single-shot utility judging approaches. Our code and benchmark can be found at https://anonymous.4open.science/r/ITEM-B486/.
comment: 22 pages
☆ From Pixels to Progress: Generating Road Network from Satellite Imagery for Socioeconomic Insights in Impoverished Areas IJCAI2024
The Sustainable Development Goals (SDGs) aim to resolve societal challenges, such as eradicating poverty and improving the lives of vulnerable populations in impoverished areas. Those areas rely on road infrastructure construction to promote accessibility and economic development. Although publicly available data like OpenStreetMap is available to monitor road status, data completeness in impoverished areas is limited. Meanwhile, the development of deep learning techniques and satellite imagery shows excellent potential for earth monitoring. To tackle the challenge of road network assessment in impoverished areas, we develop a systematic road extraction framework combining an encoder-decoder architecture and morphological operations on satellite imagery, offering an integrated workflow for interdisciplinary researchers. Extensive experiments of road network extraction on real-world data in impoverished regions achieve a 42.7% enhancement in the F1-score over the baseline methods and reconstruct about 80% of the actual roads. We also propose a comprehensive road network dataset covering approximately 794,178 km2 area and 17.048 million people in 382 impoverished counties in China. The generated dataset is further utilized to conduct socioeconomic analysis in impoverished counties, showing that road network construction positively impacts regional economic development. The technical appendix, code, and generated dataset can be found at https://github.com/tsinghua-fib-lab/Road_network_extraction_impoverished_counties.
comment: 12 pages, 13 figures, IJCAI2024 (AI and Social Good)
☆ Development of an Adaptive Multi-Domain Artificial Intelligence System Built using Machine Learning and Expert Systems Technologies
Producing an artificial general intelligence (AGI) has been an elusive goal in artificial intelligence (AI) research for some time. An AGI would have the capability, like a human, to be exposed to a new problem domain, learn about it and then use reasoning processes to make decisions. While AI techniques have been used across a wide variety of problem domains, an AGI would require an AI that could reason beyond its programming and training. This paper presents a small step towards producing an AGI. It describes a mechanism for an AI to learn about and develop reasoning pathways to make decisions in an a priori unknown domain. It combines a classical AI technique, the expert system, with a its modern adaptation - the gradient descent trained expert system (GDTES) - and utilizes generative artificial intelligence (GAI) to create a network and training data set for this system. These can be created from available sources or may draw upon knowledge incorporated in a GAI's own pre-trained model. The learning process in GDTES is used to optimize the AI's decision-making. While this approach does not meet the standards that many have defined for an AGI, it provides a somewhat similar capability, albeit one which requires a learning process before use.
☆ The Fall of ROME: Understanding the Collapse of LLMs in Model Editing
Despite significant progress in model editing methods, their application in real-world scenarios remains challenging as they often cause large language models (LLMs) to collapse. Among them, ROME is particularly concerning, as it could disrupt LLMs with only a single edit. In this paper, we study the root causes of such collapse. Through extensive analysis, we identify two primary factors that contribute to the collapse: i) inconsistent handling of prefixed and unprefixed keys in the parameter update equation may result in very small denominators, causing excessively large parameter updates; ii) the subject of collapse cases is usually the first token, whose unprefixed key distribution significantly differs from the prefixed key distribution in autoregressive transformers, causing the aforementioned issue to materialize. To validate our analysis, we propose a simple yet effective approach: uniformly using prefixed keys during editing phase and adding prefixes during the testing phase. The experimental results show that the proposed solution can prevent model collapse while maintaining the effectiveness of the edits.
☆ Adversarial Style Augmentation via Large Language Model for Robust Fake News Detection
The spread of fake news negatively impacts individuals and is regarded as a significant social challenge that needs to be addressed. A number of algorithmic and insightful features have been identified for detecting fake news. However, with the recent LLMs and their advanced generation capabilities, many of the detectable features (e.g., style-conversion attacks) can be altered, making it more challenging to distinguish from real news. This study proposes adversarial style augmentation, AdStyle, to train a fake news detector that remains robust against various style-conversion attacks. Our model's key mechanism is the careful use of LLMs to automatically generate a diverse yet coherent range of style-conversion attack prompts. This improves the generation of prompts that are particularly difficult for the detector to handle. Experiments show that our augmentation strategy improves robustness and detection performance when tested on fake news benchmark datasets.
comment: 8 pages
☆ NLDF: Neural Light Dynamic Fields for Efficient 3D Talking Head Generation
Talking head generation based on the neural radiation fields model has shown promising visual effects. However, the slow rendering speed of NeRF seriously limits its application, due to the burdensome calculation process over hundreds of sampled points to synthesize one pixel. In this work, a novel Neural Light Dynamic Fields model is proposed aiming to achieve generating high quality 3D talking face with significant speedup. The NLDF represents light fields based on light segments, and a deep network is used to learn the entire light beam's information at once. In learning the knowledge distillation is applied and the NeRF based synthesized result is used to guide the correct coloration of light segments in NLDF. Furthermore, a novel active pool training strategy is proposed to focus on high frequency movements, particularly on the speaker mouth and eyebrows. The propose method effectively represents the facial light dynamics in 3D talking video generation, and it achieves approximately 30 times faster speed compared to state of the art NeRF based method, with comparable generation visual quality.
☆ Liberal Entity Matching as a Compound AI Toolchain
Entity matching (EM), the task of identifying whether two descriptions refer to the same entity, is essential in data management. Traditional methods have evolved from rule-based to AI-driven approaches, yet current techniques using large language models (LLMs) often fall short due to their reliance on static knowledge and rigid, predefined prompts. In this paper, we introduce Libem, a compound AI system designed to address these limitations by incorporating a flexible, tool-oriented approach. Libem supports entity matching through dynamic tool use, self-refinement, and optimization, allowing it to adapt and refine its process based on the dataset and performance metrics. Unlike traditional solo-AI EM systems, which often suffer from a lack of modularity that hinders iterative design improvements and system optimization, Libem offers a composable and reusable toolchain. This approach aims to contribute to ongoing discussions and developments in AI-driven data management.
comment: 2 pages, compound ai systems 2024
☆ Relational Learning in Pre-Trained Models: A Theory from Hypergraph Recovery Perspective
Foundation Models (FMs) have demonstrated remarkable insights into the relational dynamics of the world, leading to the crucial question: how do these models acquire an understanding of world hybrid relations? Traditional statistical learning, particularly for prediction problems, may overlook the rich and inherently structured information from the data, especially regarding the relationships between objects. We introduce a mathematical model that formalizes relational learning as hypergraph recovery to study pre-training of FMs. In our framework, the world is represented as a hypergraph, with data abstracted as random samples from hyperedges. We theoretically examine the feasibility of a Pre-Trained Model (PTM) to recover this hypergraph and analyze the data efficiency in a minimax near-optimal style. By integrating rich graph theories into the realm of PTMs, our mathematical framework offers powerful tools for an in-depth understanding of pre-training from a unique perspective and can be used under various scenarios. As an example, we extend the framework to entity alignment in multimodal learning.
☆ Performance Improvement of Language-Queried Audio Source Separation Based on Caption Augmentation From Large Language Models for DCASE Challenge 2024 Task 9
We present a prompt-engineering-based text-augmentation approach applied to a language-queried audio source separation (LASS) task. To enhance the performance of LASS, the proposed approach utilizes large language models (LLMs) to generate multiple captions corresponding to each sentence of the training dataset. To this end, we first perform experiments to identify the most effective prompts for caption augmentation with a smaller number of captions. A LASS model trained with these augmented captions demonstrates improved performance on the DCASE 2024 Task 9 validation set compared to that trained without augmentation. This study highlights the effectiveness of LLM-based caption augmentation in advancing language-queried audio source separation.
comment: DCASE 2024 Challenge Task 9, 4 pages
☆ SpoT-Mamba: Learning Long-Range Dependency on Spatio-Temporal Graphs with Selective State Spaces IJCAI 2024
Spatio-temporal graph (STG) forecasting is a critical task with extensive applications in the real world, including traffic and weather forecasting. Although several recent methods have been proposed to model complex dynamics in STGs, addressing long-range spatio-temporal dependencies remains a significant challenge, leading to limited performance gains. Inspired by a recently proposed state space model named Mamba, which has shown remarkable capability of capturing long-range dependency, we propose a new STG forecasting framework named SpoT-Mamba. SpoT-Mamba generates node embeddings by scanning various node-specific walk sequences. Based on the node embeddings, it conducts temporal scans to capture long-range spatio-temporal dependencies. Experimental results on the real-world traffic forecasting dataset demonstrate the effectiveness of SpoT-Mamba.
comment: 6 pages, 2 figures, 3 tables. Spatio-Temporal Reasoning and Learning (STRL) Workshop at the 33rd International Joint Conference on Artificial Intelligence (IJCAI 2024)
☆ FamiCom: Further Demystifying Prompts for Language Models with Task-Agnostic Performance Estimation
Language models have shown impressive in-context-learning capabilities, which allow them to benefit from input prompts and perform better on downstream end tasks. Existing works investigate the mechanisms behind this observation, and propose label-agnostic prompt metrics that can better estimate end-task performances. One popular approach is using perplexity as a way to measure models' familiarity with the prompt. While showing consistent improvements on in-domain tasks, we found that familiarity metrics such as perplexity cannot accurately estimate performance in complicated situations such as task or domain transferring scenarios. In this work, we propose a revised measure called FamiCom, providing a more comprehensive measure for task-agnostic performance estimation. Specifically, FamiCom combines familiarity with \textit{complexity} -- the inherent difficulty of end tasks, which is an important factor missing from current metrics. Experiments show that FamiCom strongly correlates with end-task performances, producing a 0.85 Spearman's correlation, versus 0.43 of familiarity-only ones'. We further apply FamiCom to automatic prompt and demonstration selection, and outperform existing methods and baselines by more than 7.0% in accuracy.
☆ The Benefits of Power Regularization in Cooperative Reinforcement Learning
Cooperative Multi-Agent Reinforcement Learning (MARL) algorithms, trained only to optimize task reward, can lead to a concentration of power where the failure or adversarial intent of a single agent could decimate the reward of every agent in the system. In the context of teams of people, it is often useful to explicitly consider how power is distributed to ensure no person becomes a single point of failure. Here, we argue that explicitly regularizing the concentration of power in cooperative RL systems can result in systems which are more robust to single agent failure, adversarial attacks, and incentive changes of co-players. To this end, we define a practical pairwise measure of power that captures the ability of any co-player to influence the ego agent's reward, and then propose a power-regularized objective which balances task reward and power concentration. Given this new objective, we show that there always exists an equilibrium where every agent is playing a power-regularized best-response balancing power and task reward. Moreover, we present two algorithms for training agents towards this power-regularized objective: Sample Based Power Regularization (SBPR), which injects adversarial data during training; and Power Regularization via Intrinsic Motivation (PRIM), which adds an intrinsic motivation to regulate power to the training objective. Our experiments demonstrate that both algorithms successfully balance task reward and power, leading to lower power behavior than the baseline of task-only reward and avoid catastrophic events in case an agent in the system goes off-policy.
☆ Evading AI-Generated Content Detectors using Homoglyphs
The generation of text that is increasingly human-like has been enabled by the advent of large language models (LLMs). As the detection of AI-generated content holds significant importance in the fight against issues such as misinformation and academic cheating, numerous studies have been conducted to develop reliable LLM detectors. While promising results have been demonstrated by such detectors on test data, recent research has revealed that they can be circumvented by employing different techniques. In this article, homoglyph-based ($a \rightarrow {\alpha}$) attacks that can be used to circumvent existing LLM detectors are presented. The efficacy of the attacks is illustrated by analizing how homoglyphs shift the tokenization of the text, and thus its token loglikelihoods. A comprehensive evaluation is conducted to assess the effectiveness of homoglyphs on state-of-the-art LLM detectors, including Binoculars, DetectGPT, OpenAI's detector, and watermarking techniques, on five different datasets. A significant reduction in the efficiency of all the studied configurations of detectors and datasets, down to an accuracy of 0.5 (random guessing), is demonstrated by the proposed approach. The results show that homoglyph-based attacks can effectively evade existing LLM detectors, and the implications of these findings are discussed along with possible defenses against such attacks.
☆ MiniConGTS: A Near Ultimate Minimalist Contrastive Grid Tagging Scheme for Aspect Sentiment Triplet Extraction
Aspect Sentiment Triplet Extraction (ASTE) aims to co-extract the sentiment triplets in a given corpus. Existing approaches within the pretraining-finetuning paradigm tend to either meticulously craft complex tagging schemes and classification heads, or incorporate external semantic augmentation to enhance performance. In this study, we, for the first time, re-evaluate the redundancy in tagging schemes and the internal enhancement in pretrained representations. We propose a method to improve and utilize pretrained representations by integrating a minimalist tagging scheme and a novel token-level contrastive learning strategy. The proposed approach demonstrates comparable or superior performance compared to state-of-the-art techniques while featuring a more compact design and reduced computational overhead. Additionally, we are the first to formally evaluate GPT-4's performance in few-shot learning and Chain-of-Thought scenarios for this task. The results demonstrate that the pretraining-finetuning paradigm remains highly effective even in the era of large language models.
comment: arXiv admin note: text overlap with arXiv:2403.07342
♻ ☆ Ovis: Structural Embedding Alignment for Multimodal Large Language Model
Current Multimodal Large Language Models (MLLMs) typically integrate a pre-trained LLM with another pre-trained vision transformer through a connector, such as an MLP, endowing the LLM with visual capabilities. However, the misalignment between two embedding strategies in MLLMs -- the structural textual embeddings based on an embedding look-up table and the continuous embeddings generated directly by the vision encoder -- makes challenges for a more seamless fusion of visual and textual information. We propose Ovis, a novel MLLM architecture designed to structurally align visual and textual embeddings. Ovis integrates an additional learnable visual embedding table into the visual encoder's process. To capture rich visual semantics, each image patch indexes the visual embedding table multiple times, resulting in a final visual embedding that is a probabilistic combination of the indexed embeddings. This structural approach mirrors the method used for generating textual embeddings. Empirical evaluations on various multimodal benchmarks show that Ovis outperforms open-source MLLMs of similar parameter scales and even surpasses the proprietary model Qwen-VL-Plus overall. These results highlight the potential of Ovis' structured visual representation for advancing MLLM architectural design and promoting more effective multimodal learning. Code, datasets, and models are available at https://github.com/AIDC-AI/Ovis.
♻ ☆ Measurement Simplification in ρ-POMDP with Performance Guarantees
Decision making under uncertainty is at the heart of any autonomous system acting with imperfect information. The cost of solving the decision making problem is exponential in the action and observation spaces, thus rendering it unfeasible for many online systems. This paper introduces a novel approach to efficient decision-making, by partitioning the high-dimensional observation space. Using the partitioned observation space, we formulate analytical bounds on the expected information-theoretic reward, for general belief distributions. These bounds are then used to plan efficiently while keeping performance guarantees. We show that the bounds are adaptive, computationally efficient, and that they converge to the original solution. We extend the partitioning paradigm and present a hierarchy of partitioned spaces that allows greater efficiency in planning. We then propose a specific variant of these bounds for Gaussian beliefs and show a theoretical performance improvement of at least a factor of 4. Finally, we compare our novel method to other state of the art algorithms in active SLAM scenarios, in simulation and in real experiments. In both cases we show a significant speed-up in planning with performance guarantees.
♻ ☆ InSaAF: Incorporating Safety through Accuracy and Fairness | Are LLMs ready for the Indian Legal Domain?
Recent advancements in language technology and Artificial Intelligence have resulted in numerous Language Models being proposed to perform various tasks in the legal domain ranging from predicting judgments to generating summaries. Despite their immense potential, these models have been proven to learn and exhibit societal biases and make unfair predictions. In this study, we explore the ability of Large Language Models (LLMs) to perform legal tasks in the Indian landscape when social factors are involved. We present a novel metric, $\beta$-weighted $\textit{Legal Safety Score ($LSS_{\beta}$)}$, which encapsulates both the fairness and accuracy aspects of the LLM. We assess LLMs' safety by considering its performance in the $\textit{Binary Statutory Reasoning}$ task and its fairness exhibition with respect to various axes of disparities in the Indian society. Task performance and fairness scores of LLaMA and LLaMA--2 models indicate that the proposed $LSS_{\beta}$ metric can effectively determine the readiness of a model for safe usage in the legal sector. We also propose finetuning pipelines, utilising specialised legal datasets, as a potential method to mitigate bias and improve model safety. The finetuning procedures on LLaMA and LLaMA--2 models increase the $LSS_{\beta}$, improving their usability in the Indian legal domain. Our code is publicly released.
♻ ☆ Grokking Group Multiplication with Cosets
The complex and unpredictable nature of deep neural networks prevents their safe use in many high-stakes applications. There have been many techniques developed to interpret deep neural networks, but all have substantial limitations. Algorithmic tasks have proven to be a fruitful test ground for interpreting a neural network end-to-end. Building on previous work, we completely reverse engineer fully connected one-hidden layer networks that have ``grokked'' the arithmetic of the permutation groups $S_5$ and $S_6$. The models discover the true subgroup structure of the full group and converge on neural circuits that decompose the group arithmetic using the permutation group's subgroups. We relate how we reverse engineered the model's mechanisms and confirmed our theory was a faithful description of the circuit's functionality. We also draw attention to current challenges in conducting interpretability research by comparing our work to Chughtai et al. [4] which alleges to find a different algorithm for this same problem.
♻ ☆ Shaping Up SHAP: Enhancing Stability through Layer-Wise Neighbor Selection
Machine learning techniques, such as deep learning and ensemble methods, are widely used in various domains due to their ability to handle complex real-world tasks. However, their black-box nature has raised multiple concerns about the fairness, trustworthiness, and transparency of computer-assisted decision-making. This has led to the emergence of local post-hoc explainability methods, which offer explanations for individual decisions made by black-box algorithms. Among these methods, Kernel SHAP is widely used due to its model-agnostic nature and its well-founded theoretical framework. Despite these strengths, Kernel SHAP suffers from high instability: different executions of the method with the same inputs can lead to significantly different explanations, which diminishes the relevance of the explanations. The contribution of this paper is two-fold. On the one hand, we show that Kernel SHAP's instability is caused by its stochastic neighbor selection procedure, which we adapt to achieve full stability without compromising explanation fidelity. On the other hand, we show that by restricting the neighbors generation to perturbations of size 1 -- which we call the coalitions of Layer 1 -- we obtain a novel feature-attribution method that is fully stable, computationally efficient, and still meaningful.
♻ ☆ Towards Bidirectional Human-AI Alignment: A Systematic Review for Clarifications, Framework, and Future Directions
Recent advancements in general-purpose AI have highlighted the importance of guiding AI systems towards the intended goals, ethical principles, and values of individuals and groups, a concept broadly recognized as alignment. However, the lack of clarified definitions and scopes of human-AI alignment poses a significant obstacle, hampering collaborative efforts across research domains to achieve this alignment. In particular, ML- and philosophy-oriented alignment research often views AI alignment as a static, unidirectional process (i.e., aiming to ensure that AI systems' objectives match humans) rather than an ongoing, mutual alignment problem [429]. This perspective largely neglects the long-term interaction and dynamic changes of alignment. To understand these gaps, we introduce a systematic review of over 400 papers published between 2019 and January 2024, spanning multiple domains such as Human-Computer Interaction (HCI), Natural Language Processing (NLP), Machine Learning (ML), and others. We characterize, define and scope human-AI alignment. From this, we present a conceptual framework of "Bidirectional Human-AI Alignment" to organize the literature from a human-centered perspective. This framework encompasses both 1) conventional studies of aligning AI to humans that ensures AI produces the intended outcomes determined by humans, and 2) a proposed concept of aligning humans to AI, which aims to help individuals and society adjust to AI advancements both cognitively and behaviorally. Additionally, we articulate the key findings derived from literature analysis, including discussions about human values, interaction techniques, and evaluations. To pave the way for future studies, we envision three key challenges for future directions and propose examples of potential future solutions.
comment: 56 pages
♻ ☆ Social Environment Design ICML 2024
Artificial Intelligence (AI) holds promise as a technology that can be used to improve government and economic policy-making. This paper proposes a new research agenda towards this end by introducing Social Environment Design, a general framework for the use of AI for automated policy-making that connects with the Reinforcement Learning, EconCS, and Computational Social Choice communities. The framework seeks to capture general economic environments, includes voting on policy objectives, and gives a direction for the systematic analysis of government and economic policy through AI simulation. We highlight key open problems for future research in AI-based policy-making. By solving these challenges, we hope to achieve various social welfare objectives, thereby promoting more ethical and responsible decision making.
comment: ICML 2024 Position Paper. Website at https://sed.eddie.win
♻ ☆ Reward Machines for Deep RL in Noisy and Uncertain Environments
Reward Machines provide an automata-inspired structure for specifying instructions, safety constraints, and other temporally extended reward-worthy behaviour. By exposing complex reward function structure, they enable counterfactual learning updates that have resulted in impressive sample efficiency gains. While Reward Machines have been employed in both tabular and deep RL settings, they have typically relied on a ground-truth interpretation of the domain-specific vocabulary that form the building blocks of the reward function. Such ground-truth interpretations can be elusive in many real-world settings, due in part to partial observability or noisy sensing. In this paper, we explore the use of Reward Machines for Deep RL in noisy and uncertain environments. We characterize this problem as a POMDP and propose a suite of RL algorithms that leverage task structure under uncertain interpretation of domain-specific vocabulary. Theoretical analysis exposes pitfalls in naive approaches to this problem, while experimental results show that our algorithms successfully leverage task structure to improve performance under noisy interpretations of the vocabulary. Our results provide a general framework for exploiting Reward Machines in partially observable environments.
♻ ☆ Generative Pretrained Structured Transformers: Unsupervised Syntactic Language Models at Scale ACL 2024
A syntactic language model (SLM) incrementally generates a sentence with its syntactic tree in a left-to-right manner. We present Generative Pretrained Structured Transformers (GPST), an unsupervised SLM at scale capable of being pre-trained from scratch on raw texts with high parallelism. GPST circumvents the limitations of previous SLMs such as relying on gold trees and sequential training. It consists of two components, a usual SLM supervised by a uni-directional language modeling loss, and an additional composition model, which induces syntactic parse trees and computes constituent representations, supervised by a bi-directional language modeling loss. We propose a representation surrogate to enable joint parallel training of the two models in a hard-EM fashion. We pre-train GPST on OpenWebText, a corpus with $9$ billion tokens, and demonstrate the superiority of GPST over GPT-2 with a comparable size in numerous tasks covering both language understanding and language generation. Meanwhile, GPST also significantly outperforms existing unsupervised SLMs on left-to-right grammar induction, while holding a substantial acceleration on training.
comment: accepted by ACL 2024
♻ ☆ Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophisticated and pernicious behaviors like reward-tampering, where a model directly modifies its own reward mechanism. However, these more pernicious behaviors may be too complex to be discovered via exploration. In this paper, we study whether Large Language Model (LLM) assistants which find easily discovered forms of specification gaming will generalize to perform rarer and more blatant forms, up to and including reward-tampering. We construct a curriculum of increasingly sophisticated gameable environments and find that training on early-curriculum environments leads to more specification gaming on remaining environments. Strikingly, a small but non-negligible proportion of the time, LLM assistants trained on the full curriculum generalize zero-shot to directly rewriting their own reward function. Retraining an LLM not to game early-curriculum environments mitigates, but does not eliminate, reward-tampering in later environments. Moreover, adding harmlessness training to our gameable environments does not prevent reward-tampering. These results demonstrate that LLMs can generalize from common forms of specification gaming to more pernicious reward tampering and that such behavior may be nontrivial to remove.
comment: Fix title typo, update main figure to render properly on non-chrome browsers
♻ ☆ t-DGR: A Trajectory-Based Deep Generative Replay Method for Continual Learning in Decision Making
Deep generative replay has emerged as a promising approach for continual learning in decision-making tasks. This approach addresses the problem of catastrophic forgetting by leveraging the generation of trajectories from previously encountered tasks to augment the current dataset. However, existing deep generative replay methods for continual learning rely on autoregressive models, which suffer from compounding errors in the generated trajectories. In this paper, we propose a simple, scalable, and non-autoregressive method for continual learning in decision-making tasks using a generative model that generates task samples conditioned on the trajectory timestep. We evaluate our method on Continual World benchmarks and find that our approach achieves state-of-the-art performance on the average success rate metric among continual learning methods. Code is available at https://github.com/WilliamYue37/t-DGR.
comment: Published at 3rd Conference on Lifelong Learning Agents (CoLLAs), 2024
♻ ☆ An Interactive Agent Foundation Model
The development of artificial intelligence systems is transitioning from creating static, task-specific models to dynamic, agent-based systems capable of performing well in a wide range of applications. We propose an Interactive Agent Foundation Model that uses a novel multi-task agent training paradigm for training AI agents across a wide range of domains, datasets, and tasks. Our training paradigm unifies diverse pre-training strategies, including visual masked auto-encoders, language modeling, and next-action prediction, enabling a versatile and adaptable AI framework. We demonstrate the performance of our framework across three separate domains -- Robotics, Gaming AI, and Healthcare. Our model demonstrates its ability to generate meaningful and contextually relevant outputs in each area. The strength of our approach lies in its generality, leveraging a variety of data sources such as robotics sequences, gameplay data, large-scale video datasets, and textual information for effective multimodal and multi-task learning. Our approach provides a promising avenue for developing generalist, action-taking, multimodal systems.
♻ ☆ Evaluating Task-based Effectiveness of MLLMs on Charts
In this paper, we explore a forward-thinking question: Is GPT-4V effective at low-level data analysis tasks on charts? To this end, we first curate a large-scale dataset, named ChartInsights, consisting of 89,388 quartets (chart, task, question, answer) and covering 10 widely-used low-level data analysis tasks on 7 chart types. Firstly, we conduct systematic evaluations to understand the capabilities and limitations of 18 advanced MLLMs, which include 12 open-source models and 6 closed-source models. Starting with a standard textual prompt approach, the average accuracy rate across the 18 MLLMs is 36.17%. Among all the models, GPT-4V achieves the highest accuracy, reaching 56.13%. To understand the limitations of multimodal large models in low-level data analysis tasks, we have designed various experiments to conduct an in-depth test of capabilities of GPT-4V. We further investigate how visual modifications to charts, such as altering visual elements (e.g. changing color schemes) and introducing perturbations (e.g. adding image noise), affect performance of GPT-4V. Secondly, we present 12 experimental findings. These findings suggest potential of GPT-4V to revolutionize interaction with charts and uncover the gap between human analytic needs and capabilities of GPT-4V. Thirdly, we propose a novel textual prompt strategy, named Chain-of-Charts, tailored for low-level analysis tasks, which boosts model performance by 24.36%, resulting in an accuracy of 80.49%. Furthermore, by incorporating a visual prompt strategy that directs attention of GPT-4V to question-relevant visual elements, we further improve accuracy to 83.83%. Our study not only sheds light on the capabilities and limitations of GPT-4V in low-level data analysis tasks but also offers valuable insights for future research.
comment: The experimental part needs to be revised. Withdraw this version
♻ ☆ Reference-based Metrics Disprove Themselves in Question Generation
Reference-based metrics such as BLEU and BERTScore are widely used to evaluate question generation (QG). In this study, on QG benchmarks such as SQuAD and HotpotQA, we find that using human-written references cannot guarantee the effectiveness of the reference-based metrics. Most QG benchmarks have only one reference; we replicated the annotation process and collect another reference. A good metric was expected to grade a human-validated question no worse than generated questions. However, the results of reference-based metrics on our newly collected reference disproved the metrics themselves. We propose a reference-free metric consisted of multi-dimensional criteria such as naturalness, answerability, and complexity, utilizing large language models. These criteria are not constrained to the syntactic or semantic of a single reference question, and the metric does not require a diverse set of references. Experiments reveal that our metric accurately distinguishes between high-quality questions and flawed ones, and achieves state-of-the-art alignment with human judgment.
comment: Revised Jun 14 2024; Under Review
♻ ☆ BirdSet: A Dataset and Benchmark for Classification in Avian Bioacoustics NeurIPS2024
Deep learning (DL) models have emerged as a powerful tool in avian bioacoustics to assess environmental health. To maximize the potential of cost-effective and minimal-invasive passive acoustic monitoring (PAM), DL models must analyze bird vocalizations across a wide range of species and environmental conditions. However, data fragmentation challenges a comprehensive evaluation of generalization performance. Therefore, we introduce the BirdSet dataset, comprising approximately 520,000 global bird recordings for training and over 400 hours of PAM recordings for testing. Our benchmark offers baselines for several DL models to enhance comparability and consolidate research across studies, along with code implementations that include comprehensive training and evaluation protocols.
comment: Under review @NeurIPS2024 Datasets & Benchmarks
♻ ☆ Learning to Check: Unleashing Potentials for Self-Correction in Large Language Models
Self-correction has achieved impressive results in enhancing the style and security of the generated output from large language models (LLMs). However, recent studies suggest that self-correction might be limited or even counterproductive in reasoning tasks due to LLMs' difficulties in identifying logical mistakes. In this paper, we aim to enhance the self-checking capabilities of LLMs by constructing training data for checking tasks. Specifically, we apply the Chain of Thought (CoT) methodology to self-checking tasks, utilizing fine-grained step-level analyses and explanations to assess the correctness of reasoning paths. We propose a specialized checking format called "Step CoT Check". Following this format, we construct a checking-correction dataset that includes detailed step-by-step analysis and checking. Then we fine-tune LLMs to enhance their error detection and correction abilities. Our experiments demonstrate that fine-tuning with the "Step CoT Check" format significantly improves the self-checking and self-correction abilities of LLMs across multiple benchmarks. This approach outperforms other formats, especially in locating the incorrect position, with greater benefits observed in larger models. For reproducibility, all the datasets and code are provided in https://github.com/bammt/Learn-to-check.
♻ ☆ A Perspective on Explainable Artificial Intelligence Methods: SHAP and LIME
eXplainable artificial intelligence (XAI) methods have emerged to convert the black box of machine learning (ML) models into a more digestible form. These methods help to communicate how the model works with the aim of making ML models more transparent and increasing the trust of end-users into their output. SHapley Additive exPlanations (SHAP) and Local Interpretable Model Agnostic Explanation (LIME) are two widely used XAI methods, particularly with tabular data. In this perspective piece, we discuss the way the explainability metrics of these two methods are generated and propose a framework for interpretation of their outputs, highlighting their weaknesses and strengths. Specifically, we discuss their outcomes in terms of model-dependency and in the presence of collinearity among the features, relying on a case study from the biomedical domain (classification of individuals with or without myocardial infarction). The results indicate that SHAP and LIME are highly affected by the adopted ML model and feature collinearity, raising a note of caution on their usage and interpretation.
♻ ☆ DiaHalu: A Dialogue-level Hallucination Evaluation Benchmark for Large Language Models
Since large language models (LLMs) achieve significant success in recent years, the hallucination issue remains a challenge, numerous benchmarks are proposed to detect the hallucination. Nevertheless, some of these benchmarks are not naturally generated by LLMs but are intentionally induced. Also, many merely focus on the factuality hallucination while ignoring the faithfulness hallucination. Additionally, although dialogue pattern is more widely utilized in the era of LLMs, current benchmarks only concentrate on sentence-level and passage-level hallucination. In this study, we propose DiaHalu, the first dialogue-level hallucination evaluation benchmark to our knowledge. Initially, we integrate the collected topics into system prompts and facilitate a dialogue between two ChatGPT3.5. Subsequently, we manually modify the contents that do not adhere to human language conventions and then have LLMs re-generate, simulating authentic human-machine interaction scenarios. Finally, professional scholars annotate all the samples in the dataset. DiaHalu covers four common multi-turn dialogue domains and five hallucination subtypes, extended from factuality and faithfulness hallucination. Experiments through some well-known LLMs and detection methods on the dataset show that DiaHalu is a challenging benchmark, holding significant value for further research.
♻ ☆ Do Language Models Exhibit the Same Cognitive Biases in Problem Solving as Human Learners? ICML 2024
There is increasing interest in employing large language models (LLMs) as cognitive models. For such purposes, it is central to understand which properties of human cognition are well-modeled by LLMs, and which are not. In this work, we study the biases of LLMs in relation to those known in children when solving arithmetic word problems. Surveying the learning science literature, we posit that the problem-solving process can be split into three distinct steps: text comprehension, solution planning and solution execution. We construct tests for each one in order to understand whether current LLMs display the same cognitive biases as children in these steps. We generate a novel set of word problems for each of these tests, using a neuro-symbolic approach that enables fine-grained control over the problem features. We find evidence that LLMs, with and without instruction-tuning, exhibit human-like biases in both the text-comprehension and the solution-planning steps of the solving process, but not in the final step, in which the arithmetic expressions are executed to obtain the answer.
comment: Accepted at ICML 2024
♻ ☆ Improving Generalization of Neural Vehicle Routing Problem Solvers Through the Lens of Model Architecture
Neural models produce promising results when solving Vehicle Routing Problems (VRPs), but often fall short in generalization. Recent attempts to enhance model generalization often incur unnecessarily large training cost or cannot be directly applied to other models solving different VRP variants. To address these issues, we take a novel perspective on model architecture in this study. Specifically, we propose a plug-and-play Entropy-based Scaling Factor (ESF) and a Distribution-Specific (DS) decoder to enhance the size and distribution generalization, respectively. ESF adjusts the attention weight pattern of the model towards familiar ones discovered during training when solving VRPs of varying sizes. The DS decoder explicitly models VRPs of multiple training distribution patterns through multiple auxiliary light decoders, expanding the model representation space to encompass a broader range of distributional scenarios. We conduct extensive experiments on both synthetic and widely recognized real-world benchmarking datasets and compare the performance with seven baseline models. The results demonstrate the effectiveness of using ESF and DS decoder to obtain a more generalizable model and showcase their applicability to solve different VRP variants, i.e., travelling salesman problem and capacitated VRP. Notably, our proposed generic components require minimal computational resources, and can be effortlessly integrated into conventional generalization strategies to further elevate model generalization.
comment: 13 pages, 6 figures, and 6 tables
♻ ☆ Beyond Bare Queries: Open-Vocabulary Object Retrieval with 3D Scene Graph
Locating objects referred to in natural language poses a significant challenge for autonomous agents. Existing CLIP-based open-vocabulary methods successfully perform 3D object retrieval with simple (bare) queries but cannot cope with ambiguous descriptions that demand an understanding of object relations. To tackle this problem, we propose a modular approach called BBQ (Beyond Bare Queries), which constructs 3D scene spatial graph representation with metric edges and utilizes a large language model as a human-to-agent interface through our deductive scene reasoning algorithm. BBQ employs robust DINO-powered associations to form 3D objects, an advanced raycasting algorithm to project them to 2D, and a vision-language model to describe them as graph nodes. On Replica and ScanNet datasets, we show that the designed method accurately constructs 3D object-centric maps. We have demonstrated that their quality takes a leading place for open-vocabulary 3D semantic segmentation against other zero-shot methods. Also, we show that leveraging spatial relations is especially effective for scenes containing multiple entities of the same semantic class. On Sr3D and Nr3D benchmarks, our deductive approach demonstrates a significant improvement, enabling retrieving objects by complex queries compared to other state-of-the-art methods. Considering our design solutions, we achieved a processing speed approximately x3 times faster than the closest analog. This promising performance enables our approach for usage in applied intelligent robotics projects. We make the code publicly available at linukc.github.io/bbq/.
comment: 9 pages, 4 figures, 4 tables
♻ ☆ Backdoor for Debias: Mitigating Model Bias with Backdoor Attack-based Artificial Bias
With the swift advancement of deep learning, state-of-the-art algorithms have been utilized in various social situations. Nonetheless, some algorithms have been discovered to exhibit biases and provide unequal results. The current debiasing methods face challenges such as poor utilization of data or intricate training requirements. In this work, we found that the backdoor attack can construct an artificial bias similar to the model bias derived in standard training. Considering the strong adjustability of backdoor triggers, we are motivated to mitigate the model bias by carefully designing reverse artificial bias created from backdoor attack. Based on this, we propose a backdoor debiasing framework based on knowledge distillation, which effectively reduces the model bias from original data and minimizes security risks from the backdoor attack. The proposed solution is validated on both image and structured datasets, showing promising results. This work advances the understanding of backdoor attacks and highlights its potential for beneficial applications. The code for the study can be found at \url{https://anonymous.4open.science/r/DwB-BC07/}.
♻ ☆ Fine-Grained Domain Generalization with Feature Structuralization
Fine-grained domain generalization (FGDG) is a more challenging task than traditional DG tasks due to its small inter-class variations and relatively large intra-class disparities. When domain distribution changes, the vulnerability of subtle features leads to a severe deterioration in model performance. Nevertheless, humans inherently demonstrate the capacity for generalizing to out-of-distribution data, leveraging structured multi-granularity knowledge that emerges from discerning the commonality and specificity within categories. Likewise, we propose a Feature Structuralized Domain Generalization (FSDG) model, wherein features experience structuralization into common, specific, and confounding segments, harmoniously aligned with their relevant semantic concepts, to elevate performance in FGDG. Specifically, feature structuralization (FS) is accomplished through joint optimization of five constraints: a decorrelation function applied to disentangled segments, three constraints ensuring common feature consistency and specific feature distinctiveness, and a prediction calibration term. By imposing these stipulations, FSDG is prompted to disentangle and align features based on multi-granularity knowledge, facilitating robust subtle distinctions among categories. Extensive experimentation on three benchmarks consistently validates the superiority of FSDG over state-of-the-art counterparts, with an average improvement of 6.2% in FGDG performance. Beyond that, the explainability analysis on explicit concept matching intensity between the shared concepts among categories and the model channels, along with experiments on various mainstream model architectures, substantiates the validity of FS.
♻ ☆ KInIT at SemEval-2024 Task 8: Fine-tuned LLMs for Multilingual Machine-Generated Text Detection SemEval-2024
SemEval-2024 Task 8 is focused on multigenerator, multidomain, and multilingual black-box machine-generated text detection. Such a detection is important for preventing a potential misuse of large language models (LLMs), the newest of which are very capable in generating multilingual human-like texts. We have coped with this task in multiple ways, utilizing language identification and parameter-efficient fine-tuning of smaller LLMs for text classification. We have further used the per-language classification-threshold calibration to uniquely combine fine-tuned models predictions with statistical detection metrics to improve generalization of the system detection performance. Our submitted method achieved competitive results, ranking at the fourth place, just under 1 percentage point behind the winner.
comment: SemEval-2024 Task 8
♻ ☆ CAT: A Causally Graph Attention Network for Trimming Heterophilic Graph
Local Attention-guided Message Passing Mechanism (LAMP) adopted in Graph Attention Networks (GATs) is designed to adaptively learn the importance of neighboring nodes for better local aggregation on the graph, which can bring the representations of similar neighbors closer effectively, thus showing stronger discrimination ability. However, existing GATs suffer from a significant discrimination ability decline in heterophilic graphs because the high proportion of dissimilar neighbors can weaken the self-attention of the central node, jointly resulting in the deviation of the central node from similar nodes in the representation space. This kind of effect generated by neighboring nodes is called the Distraction Effect (DE) in this paper. To estimate and weaken the DE of neighboring nodes, we propose a Causally graph Attention network for Trimming heterophilic graph (CAT). To estimate the DE, since the DE are generated through two paths (grab the attention assigned to neighbors and reduce the self-attention of the central node), we use Total Effect to model DE, which is a kind of causal estimand and can be estimated from intervened data; To weaken the DE, we identify the neighbors with the highest DE (we call them Distraction Neighbors) and remove them. We adopt three representative GATs as the base model within the proposed CAT framework and conduct experiments on seven heterophilic datasets in three different sizes. Comparative experiments show that CAT can improve the node classification accuracy of all base GAT models. Ablation experiments and visualization further validate the enhancement of discrimination ability brought by CAT. The source code is available at https://github.com/GeoX-Lab/CAT.
comment: 25 pages, 18 figures, 5 tables
♻ ☆ Probabilistic Constrained Reinforcement Learning with Formal Interpretability
Reinforcement learning can provide effective reasoning for sequential decision-making problems with variable dynamics. Such reasoning in practical implementation, however, poses a persistent challenge in interpreting the reward function and the corresponding optimal policy. Consequently, representing sequential decision-making problems as probabilistic inference can have considerable value, as, in principle, the inference offers diverse and powerful mathematical tools to infer the stochastic dynamics whilst suggesting a probabilistic interpretation of policy optimization. In this study, we propose a novel Adaptive Wasserstein Variational Optimization, namely AWaVO, to tackle these interpretability challenges. Our approach uses formal methods to achieve the interpretability for convergence guarantee, training transparency, and intrinsic decision-interpretation. To demonstrate its practicality, we showcase guaranteed interpretability with an optimal global convergence rate in simulation and in practical quadrotor tasks. In comparison with state-of-the-art benchmarks including TRPO-IPO, PCPO and CRPO, we empirically verify that AWaVO offers a reasonable trade-off between high performance and sufficient interpretability.
comment: 25 pages, 9 figures, containing Appendix
♻ ☆ Quantitative CLTs in Deep Neural Networks
We study the distribution of a fully connected neural network with random Gaussian weights and biases in which the hidden layer widths are proportional to a large constant $n$. Under mild assumptions on the non-linearity, we obtain quantitative bounds on normal approximations valid at large but finite $n$ and any fixed network depth. Our theorems show both for the finite-dimensional distributions and the entire process, that the distance between a random fully connected network (and its derivatives) to the corresponding infinite width Gaussian process scales like $n^{-\gamma}$ for $\gamma>0$, with the exponent depending on the metric used to measure discrepancy. Our bounds are strictly stronger in terms of their dependence on network width than any previously available in the literature; in the one-dimensional case, we also prove that they are optimal, i.e., we establish matching lower bounds.
♻ ☆ MergeNet: Knowledge Migration across Heterogeneous Models, Tasks, and Modalities
In this study, we focus on heterogeneous knowledge transfer across entirely different model architectures, tasks, and modalities. Existing knowledge transfer methods (e.g., backbone sharing, knowledge distillation) often hinge on shared elements within model structures or task-specific features/labels, limiting transfers to complex model types or tasks. To overcome these challenges, we present MergeNet, which learns to bridge the gap of parameter spaces of heterogeneous models, facilitating the direct interaction, extraction, and application of knowledge within these parameter spaces. The core mechanism of MergeNet lies in the parameter adapter, which operates by querying the source model's low-rank parameters and adeptly learning to identify and map parameters into the target model. MergeNet is learned alongside both models, allowing our framework to dynamically transfer and adapt knowledge relevant to the current stage, including the training trajectory knowledge of the source model. Extensive experiments on heterogeneous knowledge transfer demonstrate significant improvements in challenging settings, where representative approaches may falter or prove less applicable.
♻ ☆ SIMPLOT: Enhancing Chart Question Answering by Distilling Essentials
Recently, interpreting complex charts with logical reasoning has emerged as challenges due to the development of vision-language models. A prior state-of-the-art (SOTA) model has presented an end-to-end method that leverages the vision-language model to convert charts into table format utilizing Large Language Model (LLM) for reasoning. However, unlike natural images, charts contain a mix of essential and irrelevant information required for chart reasoning, and we discover that this characteristic can lower the performance of chart-to-table extraction. In this paper, we introduce SIMPLOT, a method designed to extract only the elements necessary for chart reasoning. The proposed method involves two steps: 1) training to mimic a simple plot that contains only the essential information from a complex chart for table extraction, followed by 2) performing reasoning based on the table. Our model enables accurate chart reasoning without the need for additional annotations or datasets, and its effectiveness is demonstrated through various experiments. Furthermore, we propose a novel prompt mimicking how human interpret charts for more accurate reasoning. Our source code is available at https://github.com/sangwu99/Simplot.
♻ ☆ IDVT: Interest-aware Denoising and View-guided Tuning for Social Recommendation
In the information age, recommendation systems are vital for efficiently filtering information and identifying user preferences. Online social platforms have enriched these systems by providing valuable auxiliary information. Socially connected users are assumed to share similar preferences, enhancing recommendation accuracy and addressing cold start issues. However, empirical findings challenge the assumption, revealing that certain social connections can actually harm system performance. Our statistical analysis indicates a significant amount of noise in the social network, where many socially connected users do not share common interests. To address this issue, we propose an innovative \underline{I}nterest-aware \underline{D}enoising and \underline{V}iew-guided \underline{T}uning (IDVT) method for the social recommendation. The first ID part effectively denoises social connections. Specifically, the denoising process considers both social network structure and user interaction interests in a global view. Moreover, in this global view, we also integrate denoised social information (social domain) into the propagation of the user-item interactions (collaborative domain) and aggregate user representations from two domains using a gating mechanism. To tackle potential user interest loss and enhance model robustness within the global view, our second VT part introduces two additional views (local view and dropout-enhanced view) for fine-tuning user representations in the global view through contrastive learning. Extensive evaluations on real-world datasets with varying noise ratios demonstrate the superiority of IDVT over state-of-the-art social recommendation methods.
♻ ☆ An Empirical Study on Cross-lingual Vocabulary Adaptation for Efficient Language Model Inference
The development of state-of-the-art generative large language models (LLMs) disproportionately relies on English-centric tokenizers, vocabulary and pre-training data. Despite the fact that some LLMs have multilingual capabilities, recent studies have shown that their inference efficiency deteriorates when generating text in languages other than English. This results in increased inference time and costs. Cross-lingual vocabulary adaptation (CVA) methods have been proposed for adapting models to a target language aiming to improve downstream performance. However, the effectiveness of these methods on increasing inference efficiency of generative LLMs has yet to be explored. In this paper, we perform an empirical study of five CVA methods on four generative LLMs (including monolingual and multilingual models) across four typologically-diverse languages and four natural language understanding tasks. We find that CVA substantially contributes to LLM inference speedups of up to 271.5\%. We also show that adapting LLMs that have been pre-trained on more balanced multilingual data results in downstream performance comparable to the original models.
♻ ☆ Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy
Human forecasting accuracy in practice relies on the 'wisdom of the crowd' effect, in which predictions about future events are significantly improved by aggregating across a crowd of individual forecasters. Past work on the forecasting ability of large language models (LLMs) suggests that frontier LLMs, as individual forecasters, underperform compared to the gold standard of a human-crowd forecasting-tournament aggregate. In Study 1, we expand this research by using an LLM ensemble approach consisting of a crowd of 12 LLMs. We compare the aggregated LLM predictions on 31 binary questions to those of a crowd of 925 human forecasters from a three-month forecasting tournament. Our preregistered main analysis shows that the LLM crowd outperforms a simple no-information benchmark, and is not statistically different from the human crowd. We also observe a set of human-like biases in machine responses, such as an acquiescence effect and a tendency to favour round numbers. In Study 2, we test whether LLM predictions (of GPT-4 and Claude 2) can be improved by drawing on human cognitive output. We find that both models' forecasting accuracy benefits from exposure to the median human prediction as information, improving accuracy by between 17% and 28%, though this leads to less accurate predictions than simply averaging human and machine forecasts. Our results suggest that LLMs can achieve forecasting accuracy rivaling that of the human crowd: via the simple, practically applicable method of forecast aggregation.
comment: 26 pages; 18 visualizations (nine figures, nine tables)
♻ ☆ Large Language Model Can Continue Evolving From Mistakes
As world knowledge evolves and new task paradigms emerge, Continual Learning (CL) is crucial for keeping Large Language Models (LLMs) up-to-date and addressing their shortcomings. In practical applications, LLMs often require both continual instruction tuning (CIT) and continual pre-training (CPT) to adapt to new task paradigms and acquire necessary knowledge for task-solving. However, it remains challenging to collect CPT data that addresses the knowledge deficiencies in models while maintaining adequate volume, and improving the efficiency of utilizing this data also presents significant difficulties. Inspired by the 'summarizing mistakes' learning skill, we propose the Continue Evolving from Mistakes (CEM) method, aiming to provide a data-efficient approach for collecting CPT data and continually improving LLMs' performance through iterative evaluation and supplementation with mistake-relevant knowledge. To efficiently utilize these CPT data and mitigate forgetting, we design a novel CL training set construction paradigm that integrates parallel CIT and CPT data. Extensive experiments demonstrate the efficacy of the CEM method, achieving up to a 17% improvement in accuracy in the best case. Furthermore, additional experiments confirm the potential of combining CEM with catastrophic forgetting mitigation methods, enabling iterative and continual model evolution.
♻ ☆ Query Performance Prediction using Relevance Judgments Generated by Large Language Models
Query performance prediction (QPP) aims to estimate the retrieval quality of a search system for a query without human relevance judgments. Previous QPP methods typically return a single scalar value and do not require the predicted values to approximate a specific information retrieval (IR) evaluation measure, leading to certain drawbacks: (i) a single scalar is insufficient to accurately represent different IR evaluation measures, especially when metrics do not highly correlate, and (ii) a single scalar limits the interpretability of QPP methods because solely using a scalar is insufficient to explain QPP results. To address these issues, we propose a QPP framework using automatically generated relevance judgments (QPP-GenRE), which decomposes QPP into independent subtasks of predicting the relevance of each item in a ranked list to a given query. This allows us to predict any IR evaluation measure using the generated relevance judgments as pseudo-labels. This also allows us to interpret predicted IR evaluation measures, and identify, track and rectify errors in generated relevance judgments to improve QPP quality. We predict an item's relevance by using open-source large language models (LLMs) to ensure scientific reproducibility. We face two main challenges: (i) excessive computational costs of judging an entire corpus for predicting a metric considering recall, and (ii) limited performance in prompting open-source LLMs in a zero-/few-shot manner. To solve the challenges, we devise an approximation strategy to predict an IR measure considering recall and propose to fine-tune open-source LLMs using human-labeled relevance judgments. Experiments on the TREC 2019-2022 deep learning tracks show that QPP-GenRE achieves state-of-the-art QPP quality for both lexical and neural rankers.
♻ ☆ ArtWhisperer: A Dataset for Characterizing Human-AI Interactions in Artistic Creations ICML 2024
As generative AI becomes more prevalent, it is important to study how human users interact with such models. In this work, we investigate how people use text-to-image models to generate desired target images. To study this interaction, we created ArtWhisperer, an online game where users are given a target image and are tasked with iteratively finding a prompt that creates a similar-looking image as the target. Through this game, we recorded over 50,000 human-AI interactions; each interaction corresponds to one text prompt created by a user and the corresponding generated image. The majority of these are repeated interactions where a user iterates to find the best prompt for their target image, making this a unique sequential dataset for studying human-AI collaborations. In an initial analysis of this dataset, we identify several characteristics of prompt interactions and user strategies. People submit diverse prompts and are able to discover a variety of text descriptions that generate similar images. Interestingly, prompt diversity does not decrease as users find better prompts. We further propose a new metric to quantify the steerability of AI using our dataset. We define steerability as the expected number of interactions required to adequately complete a task. We estimate this value by fitting a Markov chain for each target task and calculating the expected time to reach an adequate score in the Markov chain. We quantify and compare AI steerability across different types of target images and two different models, finding that images of cities and natural world images are more steerable than artistic and fantasy images. These findings provide insights into human-AI interaction behavior, present a concrete method of assessing AI steerability, and demonstrate the general utility of the ArtWhisperer dataset.
comment: 31 pages, 27 figures, ICML 2024
♻ ☆ KnowledgeHub: An end-to-end Tool for Assisted Scientific Discovery
This paper describes the KnowledgeHub tool, a scientific literature Information Extraction (IE) and Question Answering (QA) pipeline. This is achieved by supporting the ingestion of PDF documents that are converted to text and structured representations. An ontology can then be constructed where a user defines the types of entities and relationships they want to capture. A browser-based annotation tool enables annotating the contents of the PDF documents according to the ontology. Named Entity Recognition (NER) and Relation Classification (RC) models can be trained on the resulting annotations and can be used to annotate the unannotated portion of the documents. A knowledge graph is constructed from these entity and relation triples which can be queried to obtain insights from the data. Furthermore, we integrate a suite of Large Language Models (LLMs) that can be used for QA and summarisation that is grounded in the included documents via a retrieval component. KnowledgeHub is a unique tool that supports annotation, IE and QA, which gives the user full insight into the knowledge discovery pipeline.
♻ ☆ Benchmarking General Purpose In-Context Learning
In-context learning (ICL) is becoming increasingly appealing to the AI community due to its flexibility, generality, sample efficiency, and exemption from artificial optimization skills. It is desirable to further enhance the generality and capability of ICL, which gives rise to the concept of general-purpose in-context learning (GPICL). We aim to extend ICL to address a broader range of tasks with an extended learning horizon and higher improvement potential, albeit with relatively limited zero-shot generalization. To this end, we introduce two lightweight but insightful benchmarks specifically crafted to train and evaluate GPICL functionalities. Each benchmark includes a vast number of tasks characterized by significant task variance, featuring minimal inductive bias. These tasks are also designed to facilitate lifelong in-context learning through continuous generation and interaction. These features pose significant challenges for models that rely on context or interactions to improve their proficiency, including language models, decision models, and world models. Our experiments reveal that the scale of parameters alone may not be crucial for ICL or GPICL, suggesting alternative approaches such as increasing the scale of contexts and memory states.
♻ ☆ ART: Automatic Red-teaming for Text-to-Image Models to Protect Benign Users
Large-scale pre-trained generative models are taking the world by storm, due to their abilities in generating creative content. Meanwhile, safeguards for these generative models are developed, to protect users' rights and safety, most of which are designed for large language models. Existing methods primarily focus on jailbreak and adversarial attacks, which mainly evaluate the model's safety under malicious prompts. Recent work found that manually crafted safe prompts can unintentionally trigger unsafe generations. To further systematically evaluate the safety risks of text-to-image models, we propose a novel Automatic Red-Teaming framework, ART. Our method leverages both vision language model and large language model to establish a connection between unsafe generations and their prompts, thereby more efficiently identifying the model's vulnerabilities. With our comprehensive experiments, we reveal the toxicity of the popular open-source text-to-image models. The experiments also validate the effectiveness, adaptability, and great diversity of ART. Additionally, we introduce three large-scale red-teaming datasets for studying the safety risks associated with text-to-image models. Datasets and models can be found in https://github.com/GuanlinLee/ART.
♻ ☆ Beyond Embeddings: The Promise of Visual Table in Visual Reasoning
Visual representation learning has been a cornerstone in computer vision, involving typical forms such as visual embeddings, structural symbols, and text-based representations. Despite the success of CLIP-type visual embeddings, they often lack access to world knowledge critical for visual reasoning. In this work, we propose Visual Table, a novel form of visual representation tailored for visual reasoning. Visual tables are constructed as hierarchical descriptions of visual scenes, featuring a scene description and multiple object-centric descriptions covering categories, attributes, and knowledge. Thanks to the structural and textual formats, visual tables offer unique advantages over mere visual embeddings, such as interpretability and controllable editing. Furthermore, they deliver instance-level world knowledge and detailed attributes that are essential for visual reasoning. To create visual tables, we develop a generator trained on the dataset with collected, small-scale annotations. Extensive results on 11 visual reasoning benchmarks demonstrate that the generated visual tables significantly outperform previous structural and text-based representations. Moreover, they consistently enhance state-of-the-art multimodal large language models across diverse benchmarks, showcasing their potential for advancing visual reasoning tasks. Our code is available at https://github.com/LaVi-Lab/Visual-Table.
comment: Project page: https://github.com/LaVi-Lab/Visual-Table
♻ ☆ SUBLLM: A Novel Efficient Architecture with Token Sequence Subsampling for LLM ECAI 2024
While Large Language Models (LLMs) have achieved remarkable success in various fields, the efficiency of training and inference remains a major challenge. To address this issue, we propose SUBLLM, short for Subsampling-Upsampling-Bypass Large Language Model, an innovative architecture that extends the core decoder-only framework by incorporating subsampling, upsampling, and bypass modules. The subsampling modules are responsible for shortening the sequence, while the upsampling modules restore the sequence length, and the bypass modules enhance convergence. In comparison to LLaMA, the proposed SUBLLM exhibits significant enhancements in both training and inference speeds as well as memory usage, while maintaining competitive few-shot performance. During training, SUBLLM increases speeds by 26% and cuts memory by 10GB per GPU. In inference, it boosts speeds by up to 37% and reduces memory by 1GB per GPU. The training and inference speeds can be enhanced by 34% and 52% respectively when the context window is expanded to 8192. We shall release the source code of the proposed architecture in the published version.
comment: 9 pages, 3 figures, submitted to ECAI 2024
♻ ☆ Mitigating Bias for Question Answering Models by Tracking Bias Influence NAACL 2024
Models of various NLP tasks have been shown to exhibit stereotypes, and the bias in the question answering (QA) models is especially harmful as the output answers might be directly consumed by the end users. There have been datasets to evaluate bias in QA models, while bias mitigation technique for the QA models is still under-explored. In this work, we propose BMBI, an approach to mitigate the bias of multiple-choice QA models. Based on the intuition that a model would lean to be more biased if it learns from a biased example, we measure the bias level of a query instance by observing its influence on another instance. If the influenced instance is more biased, we derive that the query instance is biased. We then use the bias level detected as an optimization objective to form a multi-task learning setting in addition to the original QA task. We further introduce a new bias evaluation metric to quantify bias in a comprehensive and sensitive way. We show that our method could be applied to multiple QA formulations across multiple bias categories. It can significantly reduce the bias level in all 9 bias categories in the BBQ dataset while maintaining comparable QA accuracy.
comment: To appear at NAACL 2024 main conference
♻ ☆ Large Language Models Play StarCraft II: Benchmarks and A Chain of Summarization Approach
StarCraft II is a challenging benchmark for AI agents due to the necessity of both precise micro level operations and strategic macro awareness. Previous works, such as Alphastar and SCC, achieve impressive performance on tackling StarCraft II , however, still exhibit deficiencies in long term strategic planning and strategy interpretability. Emerging large language model (LLM) agents, such as Voyage and MetaGPT, presents the immense potential in solving intricate tasks. Motivated by this, we aim to validate the capabilities of LLMs on StarCraft II, a highly complex RTS game.To conveniently take full advantage of LLMs` reasoning abilities, we first develop textual StratCraft II environment, called TextStarCraft II, which LLM agent can interact. Secondly, we propose a Chain of Summarization method, including single frame summarization for processing raw observations and multi frame summarization for analyzing game information, providing command recommendations, and generating strategic decisions. Our experiment consists of two parts: first, an evaluation by human experts, which includes assessing the LLMs`s mastery of StarCraft II knowledge and the performance of LLM agents in the game; second, the in game performance of LLM agents, encompassing aspects like win rate and the impact of Chain of Summarization.Experiment results demonstrate that: 1. LLMs possess the relevant knowledge and complex planning abilities needed to address StarCraft II scenarios; 2. Human experts consider the performance of LLM agents to be close to that of an average player who has played StarCraft II for eight years; 3. LLM agents are capable of defeating the built in AI at the Harder(Lv5) difficulty level. We have open sourced the code and released demo videos of LLM agent playing StarCraft II.
♻ ☆ X-Light: Cross-City Traffic Signal Control Using Transformer on Transformer as Meta Multi-Agent Reinforcement Learner IJCAI 2024
The effectiveness of traffic light control has been significantly improved by current reinforcement learning-based approaches via better cooperation among multiple traffic lights. However, a persisting issue remains: how to obtain a multi-agent traffic signal control algorithm with remarkable transferability across diverse cities? In this paper, we propose a Transformer on Transformer (TonT) model for cross-city meta multi-agent traffic signal control, named as X-Light: We input the full Markov Decision Process trajectories, and the Lower Transformer aggregates the states, actions, rewards among the target intersection and its neighbors within a city, and the Upper Transformer learns the general decision trajectories across different cities. This dual-level approach bolsters the model's robust generalization and transferability. Notably, when directly transferring to unseen scenarios, ours surpasses all baseline methods with +7.91% on average, and even +16.3% in some cases, yielding the best results.
comment: Accepted by IJCAI 2024
♻ ☆ COBIAS: Contextual Reliability in Bias Assessment
Large Language Models (LLMs) are trained on extensive web corpora, which enable them to understand and generate human-like text. However, this training process also results in inherent biases within the models. These biases arise from web data's diverse and often uncurated nature, containing various stereotypes and prejudices. Previous works on debiasing models rely on benchmark datasets to measure their method's performance. However, these datasets suffer from several pitfalls due to the highly subjective understanding of bias, highlighting a critical need for contextual exploration. We propose understanding the context of inputs by considering the diverse situations in which they may arise. Our contribution is two-fold: (i) we augment 2,291 stereotyped statements from two existing bias-benchmark datasets with points for adding context; (ii) we develop the Context-Oriented Bias Indicator and Assessment Score (COBIAS) to assess a statement's contextual reliability in measuring bias. Our metric aligns with human judgment on contextual reliability of statements (Spearman's $\rho = 0.65, p = 3.4 * 10^{-60}$) and can be used to create reliable datasets, which would assist bias mitigation works.
♻ ☆ A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models KDD2024
As one of the most advanced techniques in AI, Retrieval-Augmented Generation (RAG) can offer reliable and up-to-date external knowledge, providing huge convenience for numerous tasks. Particularly in the era of AI-Generated Content (AIGC), the powerful capacity of retrieval in providing additional knowledge enables RAG to assist existing generative AI in producing high-quality outputs. Recently, Large Language Models (LLMs) have demonstrated revolutionary abilities in language understanding and generation, while still facing inherent limitations, such as hallucinations and out-of-date internal knowledge. Given the powerful abilities of RAG in providing the latest and helpful auxiliary information, Retrieval-Augmented Large Language Models (RA-LLMs) have emerged to harness external and authoritative knowledge bases, rather than solely relying on the model's internal knowledge, to augment the generation quality of LLMs. In this survey, we comprehensively review existing research studies in RA-LLMs, covering three primary technical perspectives: architectures, training strategies, and applications. As the preliminary knowledge, we briefly introduce the foundations and recent advances of LLMs. Then, to illustrate the practical significance of RAG for LLMs, we systematically review mainstream relevant work by their architectures, training strategies, and application areas, detailing specifically the challenges of each and the corresponding capabilities of RA-LLMs. Finally, to deliver deeper insights, we discuss current limitations and several promising directions for future research. Updated information about this survey can be found at https://advanced-recommender-systems.github.io/RAG-Meets-LLMs/
comment: This is the long version of the corresponding survey paper accepted by KDD2024
♻ ☆ Self-Supervised Learning of Time Series Representation via Diffusion Process and Imputation-Interpolation-Forecasting Mask KDD 2024
Time Series Representation Learning (TSRL) focuses on generating informative representations for various Time Series (TS) modeling tasks. Traditional Self-Supervised Learning (SSL) methods in TSRL fall into four main categories: reconstructive, adversarial, contrastive, and predictive, each with a common challenge of sensitivity to noise and intricate data nuances. Recently, diffusion-based methods have shown advanced generative capabilities. However, they primarily target specific application scenarios like imputation and forecasting, leaving a gap in leveraging diffusion models for generic TSRL. Our work, Time Series Diffusion Embedding (TSDE), bridges this gap as the first diffusion-based SSL TSRL approach. TSDE segments TS data into observed and masked parts using an Imputation-Interpolation-Forecasting (IIF) mask. It applies a trainable embedding function, featuring dual-orthogonal Transformer encoders with a crossover mechanism, to the observed part. We train a reverse diffusion process conditioned on the embeddings, designed to predict noise added to the masked part. Extensive experiments demonstrate TSDE's superiority in imputation, interpolation, forecasting, anomaly detection, classification, and clustering. We also conduct an ablation study, present embedding visualizations, and compare inference speed, further substantiating TSDE's efficiency and validity in learning representations of TS data.
comment: Published as a full paper by KDD 2024 Research Track (12 pages as main paper and 11 pages as appendix). Source code available at https://github.com/llcresearch/TSDE
♻ ☆ ValueDCG: Measuring Comprehensive Human Value Understanding Ability of Language Models
Personal values are a crucial factor behind human decision-making. Considering that Large Language Models (LLMs) have been shown to impact human decisions significantly, it is essential to make sure they accurately understand human values to ensure their safety. However, evaluating their grasp of these values is complex due to the value's intricate and adaptable nature. We argue that truly understanding values in LLMs requires considering both "know what" and "know why". To this end, we present a comprehensive evaluation metric, ValueDCG (Value Discriminator-Critique Gap), to quantitatively assess the two aspects with an engineering implementation. We assess four representative LLMs and provide compelling evidence that the growth rates of LLM's "know what" and "know why" capabilities do not align with increases in parameter numbers, resulting in a decline in the models' capacity to understand human values as larger amounts of parameters. This may further suggest that LLMs might craft plausible explanations based on the provided context without truly understanding their inherent value, indicating potential risks.
♻ ☆ ALLaVA: Harnessing GPT4V-Synthesized Data for Lite Vision-Language Models
Large vision-language models (LVLMs) have shown premise in a broad range of vision-language tasks with their strong reasoning and generalization capabilities. However, they require considerable computational resources for training and deployment. This study aims to bridge the performance gap between traditional-scale LVLMs and resource-friendly lite versions by adopting high-quality training data. To this end, we propose a comprehensive pipeline for generating a synthetic dataset. The key idea is to leverage strong proprietary models to generate (i) fine-grained image annotations for vision-language alignment and (ii) complex reasoning visual question-answering pairs for visual instruction fine-tuning, yielding 1.3M samples in total. We train a series of lite VLMs on the synthetic dataset and experimental results demonstrate the effectiveness of the proposed scheme, where they achieve competitive performance on 17 benchmarks among 4B LVLMs, and even perform on par with 7B/13B-scale models on various benchmarks. This work highlights the feasibility of adopting high-quality data in crafting more efficient LVLMs. We name our dataset \textit{ALLaVA}, and open-source it to research community for developing better resource-efficient LVLMs for wider usage.
comment: 22 pages
♻ ☆ Instruction Fusion: Advancing Prompt Evolution through Hybridization
The fine-tuning of Large Language Models (LLMs) specialized in code generation has seen notable advancements through the use of open-domain coding queries. Despite the successes, existing methodologies like Evol-Instruct encounter performance limitations, impeding further enhancements in code generation tasks. This paper examines the constraints of existing prompt evolution techniques and introduces a novel approach, Instruction Fusion (IF). IF innovatively combines two distinct prompts through a hybridization process, thereby enhancing the evolution of training prompts for code LLMs. Our experimental results reveal that the proposed novel method effectively addresses the shortcomings of prior methods, significantly improving the performance of Code LLMs across five code generation benchmarks, namely HumanEval, HumanEval+, MBPP, MBPP+ and MultiPL-E, which underscore the effectiveness of Instruction Fusion in advancing the capabilities of LLMs in code generation.
♻ ☆ Grounding Continuous Representations in Geometry: Equivariant Neural Fields
Recently, Neural Fields have emerged as a powerful modelling paradigm to represent continuous signals. In a conditional neural field, a field is represented by a latent variable that conditions the NeF, whose parametrisation is otherwise shared over an entire dataset. We propose Equivariant Neural Fields based on cross attention transformers, in which NeFs are conditioned on a geometric conditioning variable, a latent point cloud, that enables an equivariant decoding from latent to field. Our equivariant approach induces a steerability property by which both field and latent are grounded in geometry and amenable to transformation laws if the field transforms, the latent represents transforms accordingly and vice versa. Crucially, the equivariance relation ensures that the latent is capable of (1) representing geometric patterns faitfhully, allowing for geometric reasoning in latent space, (2) weightsharing over spatially similar patterns, allowing for efficient learning of datasets of fields. These main properties are validated using classification experiments and a verification of the capability of fitting entire datasets, in comparison to other non-equivariant NeF approaches. We further validate the potential of ENFs by demonstrate unique local field editing properties.
comment: Preprint for Neurips submission
♻ ☆ A Closed-form Solution for Weight Optimization in Fully-connected Feed-forward Neural Networks
This work addresses weight optimization problem for fully-connected feed-forward neural networks. Unlike existing approaches that are based on back-propagation (BP) and chain rule gradient-based optimization (which implies iterative execution, potentially burdensome and time-consuming in some cases), the proposed approach offers the solution for weight optimization in closed-form by means of least squares (LS) methodology. In the case where the input-to-output mapping is injective, the new approach optimizes the weights in a back-propagating fashion in a single iteration by jointly optimizing a set of weights in each layer for each neuron. In the case where the input-to-output mapping is not injective (e.g., in classification problems), the proposed solution is easily adapted to obtain its final solution in a few iterations. An important advantage over the existing solutions is that these computations (for all neurons in a layer) are independent from each other; thus, they can be carried out in parallel to optimize all weights in a given layer simultaneously. Furthermore, its running time is deterministic in the sense that one can obtain the exact number of computations necessary to optimize the weights in all network layers (per iteration, in the case of non-injective mapping). Our simulation and empirical results show that the proposed scheme, BPLS, works well and is competitive with existing ones in terms of accuracy, but significantly surpasses them in terms of running time. To summarize, the new method is straightforward to implement, is competitive and computationally more efficient than the existing ones, and is well-tailored for parallel implementation.
♻ ☆ Auto-ICL: In-Context Learning without Human Supervision
With in-context learning ability, the performance of large language models can be significantly boosted when provided with appropriate context. However, existing in-context learning methods mainly rely on human-provided contexts, such as labeled examples and explicit instructions. Writing context by humans is labor-intensive on various tasks and limits the model to tasks manageable by humans. To overcome these limitations, we propose Automatic In-Context Learning framework that enables the model to autonomously generate examples and instructions for problem-solving. With experiments across various models and datasets, results show that model-generated contexts outperform human-annotated contexts, including Few-Shot and Few-Shot-CoT methods, and surpass existing self-generated context methods like Zero-CoT and Auto-CoT.
comment: Under review
♻ ☆ Two Stones Hit One Bird: Bilevel Positional Encoding for Better Length Extrapolation ICML 2024
In this work, we leverage the intrinsic segmentation of language sequences and design a new positional encoding method called Bilevel Positional Encoding (BiPE). For each position, our BiPE blends an intra-segment encoding and an inter-segment encoding. The intra-segment encoding identifies the locations within a segment and helps the model capture the semantic information therein via absolute positional encoding. The inter-segment encoding specifies the segment index, models the relationships between segments, and aims to improve extrapolation capabilities via relative positional encoding. Theoretical analysis shows this disentanglement of positional information makes learning more effective. The empirical results also show that our BiPE has superior length extrapolation capabilities across a wide range of tasks in diverse text modalities.
comment: 17 pages, 7 figures, 8 tables; ICML 2024 Camera Ready version; Code: https://github.com/zhenyuhe00/BiPE
Information Retrieval 31
☆ DiffMM: Multi-Modal Diffusion Model for Recommendation
The rise of online multi-modal sharing platforms like TikTok and YouTube has enabled personalized recommender systems to incorporate multiple modalities (such as visual, textual, and acoustic) into user representations. However, addressing the challenge of data sparsity in these systems remains a key issue. To address this limitation, recent research has introduced self-supervised learning techniques to enhance recommender systems. However, these methods often rely on simplistic random augmentation or intuitive cross-view information, which can introduce irrelevant noise and fail to accurately align the multi-modal context with user-item interaction modeling. To fill this research gap, we propose a novel multi-modal graph diffusion model for recommendation called DiffMM. Our framework integrates a modality-aware graph diffusion model with a cross-modal contrastive learning paradigm to improve modality-aware user representation learning. This integration facilitates better alignment between multi-modal feature information and collaborative relation modeling. Our approach leverages diffusion models' generative capabilities to automatically generate a user-item graph that is aware of different modalities, facilitating the incorporation of useful multi-modal knowledge in modeling user-item interactions. We conduct extensive experiments on three public datasets, consistently demonstrating the superiority of our DiffMM over various competitive baselines. For open-sourced model implementation details, you can access the source codes of our proposed framework at: https://github.com/HKUDS/DiffMM .
☆ Multi-Layer Ranking with Large Language Models for News Source Recommendation SIGIR 2024
To seek reliable information sources for news events, we introduce a novel task of expert recommendation, which aims to identify trustworthy sources based on their previously quoted statements. To achieve this, we built a novel dataset, called NewsQuote, consisting of 23,571 quote-speaker pairs sourced from a collection of news articles. We formulate the recommendation task as the retrieval of experts based on their likelihood of being associated with a given query. We also propose a multi-layer ranking framework employing Large Language Models to improve the recommendation performance. Our results show that employing an in-context learning based LLM ranker and a multi-layer ranking-based filter significantly improve both the predictive quality and behavioural quality of the recommender system.
comment: Accepted by the SIGIR 2024. arXiv admin note: text overlap with arXiv:2305.04825
☆ Graph Neural Re-Ranking via Corpus Graph
Re-ranking systems aim to reorder an initial list of documents to satisfy better the information needs associated with a user-provided query. Modern re-rankers predominantly rely on neural network models, which have proven highly effective in representing samples from various modalities. However, these models typically evaluate query-document pairs in isolation, neglecting the underlying document distribution that could enhance the quality of the re-ranked list. To address this limitation, we propose Graph Neural Re-Ranking (GNRR), a pipeline based on Graph Neural Networks (GNNs), that enables each query to consider documents distribution during inference. Our approach models document relationships through corpus subgraphs and encodes their representations using GNNs. Through extensive experiments, we demonstrate that GNNs effectively capture cross-document interactions, improving performance on popular ranking metrics. In TREC-DL19, we observe a relative improvement of 5.8% in Average Precision compared to our baseline. These findings suggest that integrating the GNN segment offers significant advantages, especially in scenarios where understanding the broader context of documents is crucial.
comment: This preprint is the result of work in progress, therefore it should still be considered a draft
Prompts as Auto-Optimized Training Hyperparameters: Training Best-in-Class IR Models from Scratch with 10 Gold Labels
We develop a method for training small-scale (under 100M parameter) neural information retrieval models with as few as 10 gold relevance labels. The method depends on generating synthetic queries for documents using a language model (LM), and the key step is that we automatically optimize the LM prompt that is used to generate these queries based on training quality. In experiments with the BIRCO benchmark, we find that models trained with our method outperform RankZephyr and are competitive with RankLLama, both of which are 7B parameter models trained on over 100K labels. These findings point to the power of automatic prompt optimization for synthetic dataset generation.
☆ TourRank: Utilizing Large Language Models for Documents Ranking with a Tournament-Inspired Strategy
Large Language Models (LLMs) are increasingly employed in zero-shot documents ranking, yielding commendable results. However, several significant challenges still persist in LLMs for ranking: (1) LLMs are constrained by limited input length, precluding them from processing a large number of documents simultaneously; (2) The output document sequence is influenced by the input order of documents, resulting in inconsistent ranking outcomes; (3) Achieving a balance between cost and ranking performance is quite challenging. To tackle these issues, we introduce a novel documents ranking method called TourRank, which is inspired by the tournament mechanism. This approach alleviates the impact of LLM's limited input length through intelligent grouping, while the tournament-like points system ensures robust ranking, mitigating the influence of the document input sequence. We test TourRank with different LLMs on the TREC DL datasets and the BEIR benchmark. Experimental results show that TourRank achieves state-of-the-art performance at a reasonable cost.
☆ Long Code Arena: a Set of Benchmarks for Long-Context Code Models
Nowadays, the fields of code and natural language processing are evolving rapidly. In particular, models become better at processing long context windows - supported context sizes have increased by orders of magnitude over the last few years. However, there is a shortage of benchmarks for code processing that go beyond a single file of context, while the most popular ones are limited to a single method. With this work, we aim to close this gap by introducing Long Code Arena, a suite of six benchmarks for code processing tasks that require project-wide context. These tasks cover different aspects of code processing: library-based code generation, CI builds repair, project-level code completion, commit message generation, bug localization, and module summarization. For each task, we provide a manually verified dataset for testing, an evaluation suite, and open-source baseline solutions based on popular LLMs to showcase the usage of the dataset and to simplify adoption by other researchers. We publish the benchmark page on HuggingFace Spaces with the leaderboard, links to HuggingFace Hub for all the datasets, and link to the GitHub repository with baselines: https://huggingface.co/spaces/JetBrains-Research/long-code-arena.
comment: 54 pages, 4 figures, 22 tables
☆ CoSQA+: Enhancing Code Search Dataset with Matching Code
Semantic code search, retrieving code that matches a given natural language query, is an important task to improve productivity in software engineering. Existing code search datasets are problematic: either using unrealistic queries, or with mismatched codes, and typically using one-to-one query-code pairing, which fails to reflect the reality that a query might have multiple valid code matches. This paper introduces CoSQA+, pairing high-quality queries (reused from CoSQA) with multiple suitable codes. We collect code candidates from diverse sources and form candidate pairs by pairing queries with these codes. Utilizing the power of large language models (LLMs), we automate pair annotation, filtering, and code generation for queries without suitable matches. Through extensive experiments, CoSQA+ has demonstrated superior quality over CoSQA. Models trained on CoSQA+ exhibit improved performance. Furthermore, we propose a new metric Mean Multi-choice Reciprocal Rank (MMRR), to assess one-to-N code search performance. We provide the code and data at https://github.com/DeepSoftwareAnalytics/CoSQA_Plus.
comment: 11 pages, 4 figures, conference
☆ Making Alice Appear Like Bob: A Probabilistic Preference Obfuscation Method For Implicit Feedback Recommendation Models
Users' interaction or preference data used in recommender systems carry the risk of unintentionally revealing users' private attributes (e.g., gender or race). This risk becomes particularly concerning when the training data contains user preferences that can be used to infer these attributes, especially if they align with common stereotypes. This major privacy issue allows malicious attackers or other third parties to infer users' protected attributes. Previous efforts to address this issue have added or removed parts of users' preferences prior to or during model training to improve privacy, which often leads to decreases in recommendation accuracy. In this work, we introduce SBO, a novel probabilistic obfuscation method for user preference data designed to improve the accuracy--privacy trade-off for such recommendation scenarios. We apply SBO to three state-of-the-art recommendation models (i.e., BPR, MultVAE, and LightGCN) and two popular datasets (i.e., MovieLens-1M and LFM-2B). Our experiments reveal that SBO outperforms comparable approaches with respect to the accuracy--privacy trade-off. Specifically, we can reduce the leakage of users' protected attributes while maintaining on-par recommendation accuracy.
☆ Evaluating the Efficacy of Open-Source LLMs in Enterprise-Specific RAG Systems: A Comparative Study of Performance and Scalability
This paper presents an analysis of open-source large language models (LLMs) and their application in Retrieval-Augmented Generation (RAG) tasks, specific for enterprise-specific data sets scraped from their websites. With the increasing reliance on LLMs in natural language processing, it is crucial to evaluate their performance, accessibility, and integration within specific organizational contexts. This study examines various open-source LLMs, explores their integration into RAG frameworks using enterprise-specific data, and assesses the performance of different open-source embeddings in enhancing the retrieval and generation process. Our findings indicate that open-source LLMs, combined with effective embedding techniques, can significantly improve the accuracy and efficiency of RAG systems, offering a viable alternative to proprietary solutions for enterprises.
☆ They're All Doctors: Synthesizing Diverse Counterfactuals to Mitigate Associative Bias
Vision Language Models (VLMs) such as CLIP are powerful models; however they can exhibit unwanted biases, making them less safe when deployed directly in applications such as text-to-image, text-to-video retrievals, reverse search, or classification tasks. In this work, we propose a novel framework to generate synthetic counterfactual images to create a diverse and balanced dataset that can be used to fine-tune CLIP. Given a set of diverse synthetic base images from text-to-image models, we leverage off-the-shelf segmentation and inpainting models to place humans with diverse visual appearances in context. We show that CLIP trained on such datasets learns to disentangle the human appearance from the context of an image, i.e., what makes a doctor is not correlated to the person's visual appearance, like skin color or body type, but to the context, such as background, the attire they are wearing, or the objects they are holding. We demonstrate that our fine-tuned CLIP model, $CF_\alpha$, improves key fairness metrics such as MaxSkew, MinSkew, and NDKL by 40-66\% for image retrieval tasks, while still achieving similar levels of performance in downstream tasks. We show that, by design, our model retains maximal compatibility with the original CLIP models, and can be easily controlled to support different accuracy versus fairness trade-offs in a plug-n-play fashion.
☆ Transparency, Privacy, and Fairness in Recommender Systems
Recommender systems have become a pervasive part of our daily online experience, and are one of the most widely used applications of artificial intelligence and machine learning. Therefore, regulations and requirements for trustworthy artificial intelligence, for example, the European AI Act, which includes notions such as transparency, privacy, and fairness are also highly relevant for the design of recommender systems in practice. This habilitation elaborates on aspects related to these three notions in the light of recommender systems, namely: (i) transparency and cognitive models, (ii) privacy and limited preference information, and (iii) fairness and popularity bias in recommender systems. Specifically, with respect to aspect (i), we highlight the usefulness of incorporating psychological theories for a transparent design process of recommender systems. We term this type of systems psychology-informed recommender systems. In aspect (ii), we study and address the trade-off between accuracy and privacy in differentially-private recommendations. We design a novel recommendation approach for collaborative filtering based on an efficient neighborhood reuse concept, which reduces the number of users that need to be protected with differential privacy. Furthermore, we address the related issue of limited availability of user preference information, e.g., click data, in the settings of session-based and cold-start recommendations. With respect to aspect (iii), we analyze popularity bias in recommender systems. We find that the recommendation frequency of an item is positively correlated with this item's popularity. This also leads to the unfair treatment of users with little interest in popular content. Finally, we study long-term fairness dynamics in algorithmic decision support in the labor market using agent-based modeling techniques.
comment: Habilitation (post-doctoral thesis) at Graz University of Technology for the scientific subject Applied Computer Science
☆ Iterative Utility Judgment Framework via LLMs Inspired by Relevance in Philosophy
Utility and topical relevance are critical measures in information retrieval (IR), reflecting system and user perspectives, respectively. While topical relevance has long been emphasized, utility is a higher standard of relevance and is more useful for facilitating downstream tasks, e.g., in Retrieval-Augmented Generation (RAG). When we incorporate utility judgments into RAG, we realize that the topical relevance, utility, and answering in RAG are closely related to the three types of relevance that Schutz discussed from a philosophical perspective. They are topical relevance, interpretational relevance, and motivational relevance, respectively. Inspired by the dynamic iterations of the three types of relevance, we propose an Iterative utiliTy judgmEnt fraMework (ITEM) to promote each step of the cycle of RAG. We conducted extensive experiments on multi-grade passage retrieval and factoid question-answering datasets (i.e., TREC DL, WebAP, and NQ). Experimental results demonstrate significant improvements in utility judgments, ranking of topical relevance, and answer generation upon representative baselines, including multiple single-shot utility judging approaches. Our code and benchmark can be found at https://anonymous.4open.science/r/ITEM-B486/.
comment: 22 pages
☆ Unifying Multimodal Retrieval via Document Screenshot Embedding
In the real world, documents are organized in different formats and varied modalities. Traditional retrieval pipelines require tailored document parsing techniques and content extraction modules to prepare input for indexing. This process is tedious, prone to errors, and has information loss. To this end, we propose Document Screenshot Embedding} (DSE), a novel retrieval paradigm that regards document screenshots as a unified input format, which does not require any content extraction preprocess and preserves all the information in a document (e.g., text, image and layout). DSE leverages a large vision-language model to directly encode document screenshots into dense representations for retrieval. To evaluate our method, we first craft the dataset of Wiki-SS, a 1.3M Wikipedia web page screenshots as the corpus to answer the questions from the Natural Questions dataset. In such a text-intensive document retrieval setting, DSE shows competitive effectiveness compared to other text retrieval methods relying on parsing. For example, DSE outperforms BM25 by 17 points in top-1 retrieval accuracy. Additionally, in a mixed-modality task of slide retrieval, DSE significantly outperforms OCR text retrieval methods by over 15 points in nDCG@10. These experiments show that DSE is an effective document retrieval paradigm for diverse types of documents. Model checkpoints, code, and Wiki-SS collection will be released.
☆ DELRec: Distilling Sequential Pattern to Enhance LLM-based Recommendation
Sequential recommendation (SR) tasks enhance recommendation accuracy by capturing the connection between users' past interactions and their changing preferences. Conventional models often focus solely on capturing sequential patterns within the training data, neglecting the broader context and semantic information embedded in item titles from external sources. This limits their predictive power and adaptability. Recently, large language models (LLMs) have shown promise in SR tasks due to their advanced understanding capabilities and strong generalization abilities. Researchers have attempted to enhance LLMs' recommendation performance by incorporating information from SR models. However, previous approaches have encountered problems such as 1) only influencing LLMs at the result level;2) increased complexity of LLMs recommendation methods leading to reduced interpretability; 3) incomplete understanding and utilization of SR models information by LLMs. To address these problems, we proposes a novel framework, DELRec, which aims to extract knowledge from SR models and enable LLMs to easily comprehend and utilize this supplementary information for more effective sequential recommendations. DELRec consists of two main stages: 1) SR Models Pattern Distilling, focusing on extracting behavioral patterns exhibited by SR models using soft prompts through two well-designed strategies; 2) LLMs-based Sequential Recommendation, aiming to fine-tune LLMs to effectively use the distilled auxiliary information to perform SR tasks. Extensive experimental results conducted on three real datasets validate the effectiveness of the DELRec framework.
☆ Mutual Learning for Finetuning Click-Through Rate Prediction Models
Click-Through Rate (CTR) prediction has become an essential task in digital industries, such as digital advertising or online shopping. Many deep learning-based methods have been implemented and have become state-of-the-art models in the domain. To further improve the performance of CTR models, Knowledge Distillation based approaches have been widely used. However, most of the current CTR prediction models do not have much complex architectures, so it's hard to call one of them 'cumbersome' and the other one 'tiny'. On the other hand, the performance gap is also not very large between complex and simple models. So, distilling knowledge from one model to the other could not be worth the effort. Under these considerations, Mutual Learning could be a better approach, since all the models could be improved mutually. In this paper, we showed how useful the mutual learning algorithm could be when it is between equals. In our experiments on the Criteo and Avazu datasets, the mutual learning algorithm improved the performance of the model by up to 0.66% relative improvement.
comment: 7 pages, 2 figures, 4 tables
☆ UniGLM: Training One Unified Language Model for Text-Attributed Graphs
Representation learning on text-attributed graphs (TAGs), where nodes are represented by textual descriptions, is crucial for textual and relational knowledge systems and recommendation systems. Currently, state-of-the-art embedding methods for TAGs primarily focus on fine-tuning language models (e.g., BERT) using structure-aware training signals. While effective, these methods are tailored for individual TAG and cannot generalize across various graph scenarios. Given the shared textual space, leveraging multiple TAGs for joint fine-tuning, aligning text and graph structure from different aspects, would be more beneficial. Motivated by this, we introduce a novel Unified Graph Language Model (UniGLM) framework, the first graph embedding model that generalizes well to both in-domain and cross-domain TAGs. Specifically, UniGLM is trained over multiple TAGs with different domains and scales using self-supervised contrastive learning. UniGLM includes an adaptive positive sample selection technique for identifying structurally similar nodes and a lazy contrastive module that is devised to accelerate training by minimizing repetitive encoding calculations. Extensive empirical results across 9 benchmark TAGs demonstrate UniGLM's efficacy against leading embedding baselines in terms of generalization (various downstream tasks and backbones) and transfer learning (in and out of domain scenarios). The code is available at https://github.com/NYUSHCS/UniGLM.
☆ Balancing Embedding Spectrum for Recommendation
Modern recommender systems heavily rely on high-quality representations learned from high-dimensional sparse data. While significant efforts have been invested in designing powerful algorithms for extracting user preferences, the factors contributing to good representations have remained relatively unexplored. In this work, we shed light on an issue in the existing pair-wise learning paradigm (i.e., the embedding collapse problem), that the representations tend to span a subspace of the whole embedding space, leading to a suboptimal solution and reducing the model capacity. Specifically, optimization on observed interactions is equivalent to a low pass filter causing users/items to have the same representations and resulting in a complete collapse. While negative sampling acts as a high pass filter to alleviate the collapse by balancing the embedding spectrum, its effectiveness is only limited to certain losses, which still leads to an incomplete collapse. To tackle this issue, we propose a novel method called DirectSpec, acting as a reliable all pass filter to balance the spectrum distribution of the embeddings during training, ensuring that users/items effectively span the entire embedding space. Additionally, we provide a thorough analysis of DirectSpec from a decorrelation perspective and propose an enhanced variant, DirectSpec+, which employs self-paced gradients to optimize irrelevant samples more effectively. Moreover, we establish a close connection between DirectSpec+ and uniformity, demonstrating that contrastive learning (CL) can alleviate the collapse issue by indirectly balancing the spectrum. Finally, we implement DirectSpec and DirectSpec+ on two popular recommender models: MF and LightGCN. Our experimental results demonstrate its effectiveness and efficiency over competitive baselines.
☆ When Box Meets Graph Neural Network in Tag-aware Recommendation
Last year has witnessed the re-flourishment of tag-aware recommender systems supported by the LLM-enriched tags. Unfortunately, though large efforts have been made, current solutions may fail to describe the diversity and uncertainty inherent in user preferences with only tag-driven profiles. Recently, with the development of geometry-based techniques, e.g., box embedding, diversity of user preferences now could be fully modeled as the range within a box in high dimension space. However, defect still exists as these approaches are incapable of capturing high-order neighbor signals, i.e., semantic-rich multi-hop relations within the user-tag-item tripartite graph, which severely limits the effectiveness of user modeling. To deal with this challenge, in this paper, we propose a novel algorithm, called BoxGNN, to perform the message aggregation via combination of logical operations, thereby incorporating high-order signals. Specifically, we first embed users, items, and tags as hyper-boxes rather than simple points in the representation space, and define two logical operations to facilitate the subsequent process. Next, we perform the message aggregation mechanism via the combination of logical operations, to obtain the corresponding high-order box representations. Finally, we adopt a volume-based learning objective with Gumbel smoothing techniques to refine the representation of boxes. Extensive experiments on two publicly available datasets and one LLM-enhanced e-commerce dataset have validated the superiority of BoxGNN compared with various state-of-the-art baselines. The code is released online
☆ GAugLLM: Improving Graph Contrastive Learning for Text-Attributed Graphs with Large Language Models
This work studies self-supervised graph learning for text-attributed graphs (TAGs) where nodes are represented by textual attributes. Unlike traditional graph contrastive methods that perturb the numerical feature space and alter the graph's topological structure, we aim to improve view generation through language supervision. This is driven by the prevalence of textual attributes in real applications, which complement graph structures with rich semantic information. However, this presents challenges because of two major reasons. First, text attributes often vary in length and quality, making it difficulty to perturb raw text descriptions without altering their original semantic meanings. Second, although text attributes complement graph structures, they are not inherently well-aligned. To bridge the gap, we introduce GAugLLM, a novel framework for augmenting TAGs. It leverages advanced large language models like Mistral to enhance self-supervised graph learning. Specifically, we introduce a mixture-of-prompt-expert technique to generate augmented node features. This approach adaptively maps multiple prompt experts, each of which modifies raw text attributes using prompt engineering, into numerical feature space. Additionally, we devise a collaborative edge modifier to leverage structural and textual commonalities, enhancing edge augmentation by examining or building connections between nodes. Empirical results across five benchmark datasets spanning various domains underscore our framework's ability to enhance the performance of leading contrastive methods as a plug-in tool. Notably, we observe that the augmented features and graph structure can also enhance the performance of standard generative methods, as well as popular graph neural networks. The open-sourced implementation of our GAugLLM is available at Github.
☆ Personalized Federated Knowledge Graph Embedding with Client-Wise Relation Graph
Federated Knowledge Graph Embedding (FKGE) has recently garnered considerable interest due to its capacity to extract expressive representations from distributed knowledge graphs, while concurrently safeguarding the privacy of individual clients. Existing FKGE methods typically harness the arithmetic mean of entity embeddings from all clients as the global supplementary knowledge, and learn a replica of global consensus entities embeddings for each client. However, these methods usually neglect the inherent semantic disparities among distinct clients. This oversight not only results in the globally shared complementary knowledge being inundated with too much noise when tailored to a specific client, but also instigates a discrepancy between local and global optimization objectives. Consequently, the quality of the learned embeddings is compromised. To address this, we propose Personalized Federated knowledge graph Embedding with client-wise relation Graph (PFedEG), a novel approach that employs a client-wise relation graph to learn personalized embeddings by discerning the semantic relevance of embeddings from other clients. Specifically, PFedEG learns personalized supplementary knowledge for each client by amalgamating entity embedding from its neighboring clients based on their "affinity" on the client-wise relation graph. Each client then conducts personalized embedding learning based on its local triples and personalized supplementary knowledge. We conduct extensive experiments on four benchmark datasets to evaluate our method against state-of-the-art models and results demonstrate the superiority of our method.
♻ ☆ MTEB-French: Resources for French Sentence Embedding Evaluation and Analysis
Recently, numerous embedding models have been made available and widely used for various NLP tasks. The Massive Text Embedding Benchmark (MTEB) has primarily simplified the process of choosing a model that performs well for several tasks in English, but extensions to other languages remain challenging. This is why we expand MTEB to propose the first massive benchmark of sentence embeddings for French. We gather 15 existing datasets in an easy-to-use interface and create three new French datasets for a global evaluation of 8 task categories. We compare 51 carefully selected embedding models on a large scale, conduct comprehensive statistical tests, and analyze the correlation between model performance and many of their characteristics. We find out that even if no model is the best on all tasks, large multilingual models pre-trained on sentence similarity perform exceptionally well. Our work comes with open-source code, new datasets and a public leaderboard.
♻ ☆ Query Performance Prediction using Relevance Judgments Generated by Large Language Models
Query performance prediction (QPP) aims to estimate the retrieval quality of a search system for a query without human relevance judgments. Previous QPP methods typically return a single scalar value and do not require the predicted values to approximate a specific information retrieval (IR) evaluation measure, leading to certain drawbacks: (i) a single scalar is insufficient to accurately represent different IR evaluation measures, especially when metrics do not highly correlate, and (ii) a single scalar limits the interpretability of QPP methods because solely using a scalar is insufficient to explain QPP results. To address these issues, we propose a QPP framework using automatically generated relevance judgments (QPP-GenRE), which decomposes QPP into independent subtasks of predicting the relevance of each item in a ranked list to a given query. This allows us to predict any IR evaluation measure using the generated relevance judgments as pseudo-labels. This also allows us to interpret predicted IR evaluation measures, and identify, track and rectify errors in generated relevance judgments to improve QPP quality. We predict an item's relevance by using open-source large language models (LLMs) to ensure scientific reproducibility. We face two main challenges: (i) excessive computational costs of judging an entire corpus for predicting a metric considering recall, and (ii) limited performance in prompting open-source LLMs in a zero-/few-shot manner. To solve the challenges, we devise an approximation strategy to predict an IR measure considering recall and propose to fine-tune open-source LLMs using human-labeled relevance judgments. Experiments on the TREC 2019-2022 deep learning tracks show that QPP-GenRE achieves state-of-the-art QPP quality for both lexical and neural rankers.
♻ ☆ KnowledgeHub: An end-to-end Tool for Assisted Scientific Discovery
This paper describes the KnowledgeHub tool, a scientific literature Information Extraction (IE) and Question Answering (QA) pipeline. This is achieved by supporting the ingestion of PDF documents that are converted to text and structured representations. An ontology can then be constructed where a user defines the types of entities and relationships they want to capture. A browser-based annotation tool enables annotating the contents of the PDF documents according to the ontology. Named Entity Recognition (NER) and Relation Classification (RC) models can be trained on the resulting annotations and can be used to annotate the unannotated portion of the documents. A knowledge graph is constructed from these entity and relation triples which can be queried to obtain insights from the data. Furthermore, we integrate a suite of Large Language Models (LLMs) that can be used for QA and summarisation that is grounded in the included documents via a retrieval component. KnowledgeHub is a unique tool that supports annotation, IE and QA, which gives the user full insight into the knowledge discovery pipeline.
♻ ☆ A Survey on RAG Meeting LLMs: Towards Retrieval-Augmented Large Language Models KDD2024
As one of the most advanced techniques in AI, Retrieval-Augmented Generation (RAG) can offer reliable and up-to-date external knowledge, providing huge convenience for numerous tasks. Particularly in the era of AI-Generated Content (AIGC), the powerful capacity of retrieval in providing additional knowledge enables RAG to assist existing generative AI in producing high-quality outputs. Recently, Large Language Models (LLMs) have demonstrated revolutionary abilities in language understanding and generation, while still facing inherent limitations, such as hallucinations and out-of-date internal knowledge. Given the powerful abilities of RAG in providing the latest and helpful auxiliary information, Retrieval-Augmented Large Language Models (RA-LLMs) have emerged to harness external and authoritative knowledge bases, rather than solely relying on the model's internal knowledge, to augment the generation quality of LLMs. In this survey, we comprehensively review existing research studies in RA-LLMs, covering three primary technical perspectives: architectures, training strategies, and applications. As the preliminary knowledge, we briefly introduce the foundations and recent advances of LLMs. Then, to illustrate the practical significance of RAG for LLMs, we systematically review mainstream relevant work by their architectures, training strategies, and application areas, detailing specifically the challenges of each and the corresponding capabilities of RA-LLMs. Finally, to deliver deeper insights, we discuss current limitations and several promising directions for future research. Updated information about this survey can be found at https://advanced-recommender-systems.github.io/RAG-Meets-LLMs/
comment: This is the long version of the corresponding survey paper accepted by KDD2024
♻ ☆ The Topology of a Family Tree Graph and Its Members' Satisfaction with One Another: A Machine Learning Approach
Family members' satisfaction with one another is central to creating healthy and supportive family environments. In this work, we propose and implement a novel computational technique aimed at exploring the possible relationship between the topology of a given family tree graph and its members' satisfaction with one another. Through an extensive empirical evaluation ($N=486$ families), we show that the proposed technique brings about highly accurate results in predicting family members' satisfaction with one another based solely on the family graph's topology. Furthermore, the results indicate that our technique favorably compares to baseline regression models which rely on established features associated with family members' satisfaction with one another in prior literature.
♻ ☆ A Short Review for Ontology Learning: Stride to Large Language Models Trend
Ontologies provide formal representation of knowledge shared within Semantic Web applications. Ontology learning involves the construction of ontologies from a given corpus. In the past years, ontology learning has traversed through shallow learning and deep learning methodologies, each offering distinct advantages and limitations in the quest for knowledge extraction and representation. A new trend of these approaches is relying on large language models (LLMs) to enhance ontology learning. This paper gives a review in approaches and challenges of ontology learning. It analyzes the methodologies and limitations of shallow-learning-based and deep-learning-based techniques for ontology learning, and provides comprehensive knowledge for the frontier work of using LLMs to enhance ontology learning. In addition, it proposes several noteworthy future directions for further exploration into the integration of LLMs with ontology learning tasks.
♻ ☆ Soft Prompt Tuning for Augmenting Dense Retrieval with Large Language Models
Dense retrieval (DR) converts queries and documents into dense embeddings and measures the similarity between queries and documents in vector space. One of the challenges in DR is the lack of domain-specific training data. While DR models can learn from large-scale public datasets like MS MARCO through transfer learning, evidence shows that not all DR models and domains can benefit from transfer learning equally. Recently, some researchers have resorted to large language models (LLMs) to improve the zero-shot and few-shot DR models. However, the hard prompts or human-written prompts utilized in these works cannot guarantee the good quality of generated weak queries. To tackle this, we propose soft prompt tuning for augmenting DR (SPTAR): For each task, we leverage soft prompt-tuning to optimize a task-specific soft prompt on limited ground truth data and then prompt the LLMs to tag unlabeled documents with weak queries, yielding enough weak document-query pairs to train task-specific dense retrievers. We design a filter to select high-quality example document-query pairs in the prompt to further improve the quality of weak tagged queries. To the best of our knowledge, there is no prior work utilizing soft prompt tuning to augment DR models. The experiments demonstrate that SPTAR outperforms the unsupervised baselines BM25 and the recently proposed LLMs-based augmentation method for DR.
comment: Under Reivew
♻ ☆ DomainRAG: A Chinese Benchmark for Evaluating Domain-specific Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) offers a promising solution to address various limitations of Large Language Models (LLMs), such as hallucination and difficulties in keeping up with real-time updates. This approach is particularly critical in expert and domain-specific applications where LLMs struggle to cover expert knowledge. Therefore, evaluating RAG models in such scenarios is crucial, yet current studies often rely on general knowledge sources like Wikipedia to assess the models' abilities in solving common-sense problems. In this paper, we evaluated LLMs by RAG settings in a domain-specific context, college enrollment. We identified six required abilities for RAG models, including the ability in conversational RAG, analyzing structural information, faithfulness to external knowledge, denoising, solving time-sensitive problems, and understanding multi-document interactions. Each ability has an associated dataset with shared corpora to evaluate the RAG models' performance. We evaluated popular LLMs such as Llama, Baichuan, ChatGLM, and GPT models. Experimental results indicate that existing closed-book LLMs struggle with domain-specific questions, highlighting the need for RAG models to solve expert problems. Moreover, there is room for RAG models to improve their abilities in comprehending conversational history, analyzing structural information, denoising, processing multi-document interactions, and faithfulness in expert knowledge. We expect future studies could solve these problems better.
♻ ☆ Denoising Diffusion Recommender Model SIGIR 2024
Recommender systems often grapple with noisy implicit feedback. Most studies alleviate the noise issues from data cleaning perspective such as data resampling and reweighting, but they are constrained by heuristic assumptions. Another denoising avenue is from model perspective, which proactively injects noises into user-item interactions and enhances the intrinsic denoising ability of models. However, this kind of denoising process poses significant challenges to the recommender model's representation capacity to capture noise patterns. To address this issue, we propose Denoising Diffusion Recommender Model (DDRM), which leverages multi-step denoising process of diffusion models to robustify user and item embeddings from any recommender models. DDRM injects controlled Gaussian noises in the forward process and iteratively removes noises in the reverse denoising process, thereby improving embedding robustness against noisy feedback. To achieve this target, the key lies in offering appropriate guidance to steer the reverse denoising process and providing a proper starting point to start the forward-reverse process during inference. In particular, we propose a dedicated denoising module that encodes collaborative information as denoising guidance. Besides, in the inference stage, DDRM utilizes the average embeddings of users' historically liked items as the starting point rather than using pure noise since pure noise lacks personalization, which increases the difficulty of the denoising process. Extensive experiments on three datasets with three representative backend recommender models demonstrate the effectiveness of DDRM.
comment: Accepted by SIGIR 2024
♻ ☆ PromptReps: Prompting Large Language Models to Generate Dense and Sparse Representations for Zero-Shot Document Retrieval
Utilizing large language models (LLMs) for zero-shot document ranking is done in one of two ways: 1) prompt-based re-ranking methods, which require no further training but are only feasible for re-ranking a handful of candidate documents due to computational costs; and 2) unsupervised contrastive trained dense retrieval methods, which can retrieve relevant documents from the entire corpus but require a large amount of paired text data for contrastive training. In this paper, we propose PromptReps, which combines the advantages of both categories: no need for training and the ability to retrieve from the whole corpus. Our method only requires prompts to guide an LLM to generate query and document representations for effective document retrieval. Specifically, we prompt the LLMs to represent a given text using a single word, and then use the last token's hidden states and the corresponding logits associated with the prediction of the next token to construct a hybrid document retrieval system. The retrieval system harnesses both dense text embedding and sparse bag-of-words representations given by the LLM. We further explore variations of this core idea that consider the generation of multiple words, and representations that rely on multiple embeddings and sparse distributions. Our experimental evaluation on the MSMARCO, TREC deep learning and BEIR zero-shot document retrieval datasets illustrates that this simple prompt-based LLM retrieval method can achieve a similar or higher retrieval effectiveness than state-of-the-art LLM embedding methods that are trained with large amounts of unsupervised data, especially when using a larger LLM.
♻ ☆ Enhancing Recommendation Diversity by Re-ranking with Large Language Models
It has long been recognized that it is not enough for a Recommender System (RS) to provide recommendations based only on their relevance to users. Among many other criteria, the set of recommendations may need to be diverse. Diversity is one way of handling recommendation uncertainty and ensuring that recommendations offer users a meaningful choice. The literature reports many ways of measuring diversity and improving the diversity of a set of recommendations, most notably by re-ranking and selecting from a larger set of candidate recommendations. Driven by promising insights from the literature on how to incorporate versatile Large Language Models (LLMs) into the RS pipeline, in this paper we show how LLMs can be used for diversity re-ranking. We begin with an informal study that verifies that LLMs can be used for re-ranking tasks and do have some understanding of the concept of item diversity. Then, we design a more rigorous methodology where LLMs are prompted to generate a diverse ranking from a candidate ranking using various prompt templates with different re-ranking instructions in a zero-shot fashion. We conduct comprehensive experiments testing state-of-the-art LLMs from the GPT and Llama families. We compare their re-ranking capabilities with random re-ranking and various traditional re-ranking methods from the literature. We open-source the code of our experiments for reproducibility. Our findings suggest that the trade-offs (in terms of performance and costs, among others) of LLM-based re-rankers are superior to those of random re-rankers but, as yet, inferior to the ones of traditional re-rankers. However, the LLM approach is promising. LLMs exhibit improved performance on many natural language processing and recommendation tasks and lower inference costs. Given these trends, we can expect LLM-based re-ranking to become more competitive soon.
comment: 32 pages, 2 figures
Multimedia 5
☆ Unveiling Encoder-Free Vision-Language Models
Existing vision-language models (VLMs) mostly rely on vision encoders to extract visual features followed by large language models (LLMs) for visual-language tasks. However, the vision encoders set a strong inductive bias in abstracting visual representation, e.g., resolution, aspect ratio, and semantic priors, which could impede the flexibility and efficiency of the VLMs. Training pure VLMs that accept the seamless vision and language inputs, i.e., without vision encoders, remains challenging and rarely explored. Empirical observations reveal that direct training without encoders results in slow convergence and large performance gaps. In this work, we bridge the gap between encoder-based and encoder-free models, and present a simple yet effective training recipe towards pure VLMs. Specifically, we unveil the key aspects of training encoder-free VLMs efficiently via thorough experiments: (1) Bridging vision-language representation inside one unified decoder; (2) Enhancing visual recognition capability via extra supervision. With these strategies, we launch EVE, an encoder-free vision-language model that can be trained and forwarded efficiently. Notably, solely utilizing 35M publicly accessible data, EVE can impressively rival the encoder-based VLMs of similar capacities across multiple vision-language benchmarks. It significantly outperforms the counterpart Fuyu-8B with mysterious training procedures and undisclosed training data. We believe that EVE provides a transparent and efficient route for developing a pure decoder-only architecture across modalities. Our code and models are publicly available at: https://github.com/baaivision/EVE.
comment: 16 pages, 7 figures
☆ MusicScore: A Dataset for Music Score Modeling and Generation
Music scores are written representations of music and contain rich information about musical components. The visual information on music scores includes notes, rests, staff lines, clefs, dynamics, and articulations. This visual information in music scores contains more semantic information than audio and symbolic representations of music. Previous music score datasets have limited sizes and are mainly designed for optical music recognition (OMR). There is a lack of research on creating a large-scale benchmark dataset for music modeling and generation. In this work, we propose MusicScore, a large-scale music score dataset collected and processed from the International Music Score Library Project (IMSLP). MusicScore consists of image-text pairs, where the image is a page of a music score and the text is the metadata of the music. The metadata of MusicScore is extracted from the general information section of the IMSLP pages. The metadata includes rich information about the composer, instrument, piece style, and genre of the music pieces. MusicScore is curated into small, medium, and large scales of 400, 14k, and 200k image-text pairs with varying diversity, respectively. We build a score generation system based on a UNet diffusion model to generate visually readable music scores conditioned on text descriptions to benchmark the MusicScore dataset for music score generation. MusicScore is released to the public at https://huggingface.co/datasets/ZheqiDAI/MusicScore.
comment: Dataset paper, dataset link: https://huggingface.co/datasets/ZheqiDAI/MusicScore
☆ Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning
Accurate emotion perception is crucial for various applications, including human-computer interaction, education, and counseling. However, traditional single-modality approaches often fail to capture the complexity of real-world emotional expressions, which are inherently multimodal. Moreover, existing Multimodal Large Language Models (MLLMs) face challenges in integrating audio and recognizing subtle facial micro-expressions. To address this, we introduce the MERR dataset, containing 28,618 coarse-grained and 4,487 fine-grained annotated samples across diverse emotional categories. This dataset enables models to learn from varied scenarios and generalize to real-world applications. Furthermore, we propose Emotion-LLaMA, a model that seamlessly integrates audio, visual, and textual inputs through emotion-specific encoders. By aligning features into a shared space and employing a modified LLaMA model with instruction tuning, Emotion-LLaMA significantly enhances both emotional recognition and reasoning capabilities. Extensive evaluations show Emotion-LLaMA outperforms other MLLMs, achieving top scores in Clue Overlap (7.83) and Label Overlap (6.25) on EMER, an F1 score of 0.9036 on MER2023 challenge, and the highest UAR (45.59) and WAR (59.37) in zero-shot evaluations on DFEW dataset.
comment: 37 pages, 12 figures, Project: https://github.com/ZebangCheng/Emotion-LLaMA, Demo: https://huggingface.co/spaces/ZebangCheng/Emotion-LLaMA
♻ ☆ Beyond Embeddings: The Promise of Visual Table in Visual Reasoning
Visual representation learning has been a cornerstone in computer vision, involving typical forms such as visual embeddings, structural symbols, and text-based representations. Despite the success of CLIP-type visual embeddings, they often lack access to world knowledge critical for visual reasoning. In this work, we propose Visual Table, a novel form of visual representation tailored for visual reasoning. Visual tables are constructed as hierarchical descriptions of visual scenes, featuring a scene description and multiple object-centric descriptions covering categories, attributes, and knowledge. Thanks to the structural and textual formats, visual tables offer unique advantages over mere visual embeddings, such as interpretability and controllable editing. Furthermore, they deliver instance-level world knowledge and detailed attributes that are essential for visual reasoning. To create visual tables, we develop a generator trained on the dataset with collected, small-scale annotations. Extensive results on 11 visual reasoning benchmarks demonstrate that the generated visual tables significantly outperform previous structural and text-based representations. Moreover, they consistently enhance state-of-the-art multimodal large language models across diverse benchmarks, showcasing their potential for advancing visual reasoning tasks. Our code is available at https://github.com/LaVi-Lab/Visual-Table.
comment: Project page: https://github.com/LaVi-Lab/Visual-Table
♻ ☆ Scalable Image Coding for Humans and Machines Using Feature Fusion Network
As image recognition models become more prevalent, scalable coding methods for machines and humans gain more importance. Applications of image recognition models include traffic monitoring and farm management. In these use cases, the scalable coding method proves effective because the tasks require occasional image checking by humans. Existing image compression methods for humans and machines meet these requirements to some extent. However, these compression methods are effective solely for specific image recognition models. We propose a learning-based scalable image coding method for humans and machines that is compatible with numerous image recognition models. We combine an image compression model for machines with a compression model, providing additional information to facilitate image decoding for humans. The features in these compression models are fused using a feature fusion network to achieve efficient image compression. Our method's additional information compression model is adjusted to reduce the number of parameters by enabling combinations of features of different sizes in the feature fusion network. Our approach confirms that the feature fusion network efficiently combines image compression models while reducing the number of parameters. Furthermore, we demonstrate the effectiveness of the proposed scalable coding method by evaluating the image compression performance in terms of decoded image quality and bitrate.
Systems and Control 14
☆ Design of Interacting Particle Systems for Fast and Efficient Reinforcement Learning
This paper is concerned with the design of algorithms based on systems of interacting particles to represent, approximate, and learn the optimal control law for reinforcement learning (RL). The primary contribution of the present paper is to show that convergence rates can be accelerated dramatically through careful design of interactions between particles. Theory focuses on the linear quadratic stochastic optimal control problem for which a complete and novel theory is presented. Apart from the new algorithm, sample complexity bounds are obtained, and it is shown that the mean square error scales as $1/N$ where $N$ is the number of particles. The theoretical results and algorithms are illustrated with numerical experiments and comparisons with other recent approaches, where the faster convergence of the proposed algorithm is numerically demonstrated.
☆ Reachability Analysis for Linear Systems with Uncertain Parameters using Polynomial Zonotopes
In real world applications, uncertain parameters are the rule rather than the exception. We present a reachability algorithm for linear systems with uncertain parameters and inputs using set propagation of polynomial zonotopes. In contrast to previous methods, our approach is able to tightly capture the non-convexity of the reachable set. Building up on our main result, we show how our reachability algorithm can be extended to handle linear time-varying systems as well as linear systems with time-varying parameters. Moreover, our approach opens up new possibilities for reachability analysis of linear time-invariant systems, nonlinear systems, and hybrid systems. We compare our approach to other state of the art methods, with superior tightness on two benchmarks including a 9-dimensional vehicle platooning system. Moreover, as part of the journal extension, we investigate through a polynomial zonotope with special structure named multi-affine zonotopes and its optimization problem. We provide the corresponding optimization algorithm and experiment over the examples obatined from two benchmark systems, showing the efficiency and scalability comparing to the state of the art method for handling such type of set representation.
☆ Physics-Informed Deep Learning and Partial Transfer Learning for Bearing Fault Diagnosis in the Presence of Highly Missing Data
One of the most significant obstacles in bearing fault diagnosis is a lack of labeled data for various fault types. Also, sensor-acquired data frequently lack labels and have a large amount of missing data. This paper tackles these issues by presenting the PTPAI method, which uses a physics-informed deep learning-based technique to generate synthetic labeled data. Labeled synthetic data makes up the source domain, whereas unlabeled data with missing data is present in the target domain. Consequently, imbalanced class problems and partial-set fault diagnosis hurdles emerge. To address these challenges, the RF-Mixup approach is used to handle imbalanced classes. As domain adaptation strategies, the MK-MMSD and CDAN are employed to mitigate the disparity in distribution between synthetic and actual data. Furthermore, the partial-set challenge is tackled by applying weighting methods at the class and instance levels. Experimental outcomes on the CWRU and JNU datasets indicate that the proposed approach effectively addresses these problems.
☆ LEO Satellite Networks Assisted Geo-distributed Data Processing
Nowadays, the increasing deployment of edge clouds globally provides users with low-latency services. However, connecting an edge cloud to a core cloud via optic cables in terrestrial networks poses significant barriers due to the prohibitively expensive building cost of optic cables. Fortunately, emerging Low Earth Orbit (LEO) satellite networks (e.g., Starlink) offer a more cost-effective solution for increasing edge clouds, and hence large volumes of data in edge clouds can be transferred to a core cloud via those networks for time-sensitive big data tasks processing, such as attack detection. However, the state-of-the-art satellite selection algorithms bring poor performance for those processing via our measurements. Therefore, we propose a novel data volume aware satellite selection algorithm, named DVA, to support such big data processing tasks. DVA first takes into account both the data size in edge clouds and satellite capacity to finalize the selection, thereby preventing congestion in the access network and reducing transmitting duration. Extensive simulations validate that DVA has a significantly lower average access network duration than the state-of-the-art satellite selection algorithms in a LEO satellite emulation platform.
comment: 6 pages, 5 figures
♻ ☆ Global Geolocated Realtime Data of Interfleet Urban Transit Bus Idling
Urban transit bus idling is a contributor to ecological stress, economic inefficiency, and medically hazardous health outcomes due to emissions. The global accumulation of this frequent pattern of undesirable driving behavior is enormous. In order to measure its scale, we propose GRD-TRT- BUF-4I (Ground Truth Buffer for Idling) an extensible, realtime detection system that records the geolocation and idling duration of urban transit bus fleets internationally. Using live vehicle locations from General Transit Feed Specification (GTFS) Realtime, the system detects approximately 200,000 idling events per day from over 50 cities across North America, Europe, Oceania, and Asia. This realtime data was created to dynamically serve operational decision-making and fleet management to reduce the frequency and duration of idling events as they occur, as well as to capture its accumulative effects. Civil and Transportation Engineers, Urban Planners, Epidemiologists, Policymakers, and other stakeholders might find this useful for emissions modeling, traffic management, route planning, and other urban sustainability efforts at a variety of geographic and temporal scales.
comment: 34 pages, 12 figures, 36 tables, 100 data sources (including links). Under Review at Nature Scientific Data
♻ ☆ A Tutorial on the Non-Asymptotic Theory of System Identification
This tutorial serves as an introduction to recently developed non-asymptotic methods in the theory of -- mainly linear -- system identification. We emphasize tools we deem particularly useful for a range of problems in this domain, such as the covering technique, the Hanson-Wright Inequality and the method of self-normalized martingales. We then employ these tools to give streamlined proofs of the performance of various least-squares based estimators for identifying the parameters in autoregressive models. We conclude by sketching out how the ideas presented herein can be extended to certain nonlinear identification problems.
♻ ☆ Intelligent Energy Management with IoT Framework in Smart Cities Using Intelligent Analysis: An Application of Machine Learning Methods for Complex Networks and Systems
This study confronts the growing challenges of energy consumption and the depletion of energy resources, particularly in the context of smart buildings. As the demand for energy increases alongside the necessity for efficient building maintenance, it becomes imperative to explore innovative energy management solutions. We present a comprehensive review of Internet of Things (IoT)-based frameworks aimed at smart city energy management, highlighting the pivotal role of IoT devices in addressing these issues due to their compactness, sensing, measurement, and computing capabilities. Our review methodology encompasses a thorough analysis of existing literature on IoT architectures and frameworks for intelligent energy management applications. We focus on systems that not only collect and store data but also support intelligent analysis for monitoring, controlling, and enhancing system efficiency. Additionally, we examine the potential for these frameworks to serve as platforms for the development of third-party applications, thereby extending their utility and adaptability. The findings from our review indicate that IoT-based frameworks offer significant potential to reduce energy consumption and environmental impact in smart buildings. Through the adoption of intelligent mechanisms and solutions, these frameworks facilitate effective energy management, leading to improved system efficiency and sustainability. Considering these findings, we recommend further exploration and adoption of IoT-based wireless sensing systems in smart buildings as a strategic approach to energy management. Our review underscores the importance of incorporating intelligent analysis and enabling the development of third-party applications within the IoT framework to efficiently meet the evolving energy demands and maintenance challenges
♻ ☆ Event-Triggered Islanding in Inverter-Based Grids
The decentralization of modern power systems challenges the hierarchical structure of the electric grid and necessitates automated schemes to manage adverse conditions. This work proposes an adaptive isolation methodology that can divide a grid into autonomous islands, ensuring stable and economical operation amid deliberate (e.g., cyberattacks) or unintentional abnormal events. The adaptive isolation logic is event-triggered to prevent false positives, enhance detection accuracy, and reduce computational overhead. A measurement-based stable kernel representation (SKR) triggering mechanism initially inspects distributed generation controllers for abnormal behavior. The SKR then alerts a machine learning (ML) ensemble classifier to assess whether the system behavior remains within acceptable operational limits. The event-triggered adaptive isolation framework is evaluated using the IEEE RTS-24 and 118-bus systems. Simulation results demonstrate that the proposed framework detects anomalous behavior with 100% accuracy in real-time, i.e., within 22 msec. Supply-adequate partitions are identified outperforming traditional islanding detection and formation techniques while minimizing operating costs.
♻ ☆ Schrödinger Bridge with Quadratic State Cost is Exactly Solvable
Schr\"odinger bridge is a diffusion process that steers a given distribution to another in a prescribed time while minimizing the effort to do so. It can be seen as the stochastic dynamical version of the optimal mass transport, and has growing applications in generative diffusion models and stochastic optimal control. In this work, we propose a regularized variant of the Schr\"odinger bridge with a quadratic state cost-to-go that incentivizes the optimal sample paths to stay close to a nominal level. Unlike the conventional Schr\"odinger bridge, the regularization induces a state-dependent rate of killing and creation of probability mass, and its solution requires determining the Markov kernel of a reaction-diffusion partial differential equation. We derive this Markov kernel in closed form. Our solution recovers the heat kernel in the vanishing regularization (i.e., diffusion without reaction) limit, thereby recovering the solution of the conventional Schr\"odinger bridge. Our results enable the use of dynamic Sinkhorn recursion for computing the Schr\"odinger bridge with a quadratic state cost-to-go, which would otherwise be challenging to use in this setting. We deduce properties of the new kernel and explain its connections with certain exactly solvable models in quantum mechanics.
♻ ☆ $H_{\infty}$-Optimal Estimator Synthesis for Coupled Linear 2D PDEs using Convex Optimization
Any suitably well-posed PDE in two spatial dimensions can be represented as a Partial Integral Equation (PIE) -- with system dynamics parameterized using Partial Integral (PI) operators. Furthermore, $L_2$-gain analysis of PDEs with a PIE representation can be posed as a linear operator inequality, which can be solved using convex optimization. In this paper, these results are used to derive a convex-optimization-based test for constructing an $H_{\infty}$-optimal estimator for 2D PDEs. In particular, we first use PIEs to represent an arbitrary well-posed 2D PDE where sensor measurements occur along some boundary of the domain. An associated Luenberger-type estimator is then parameterized using a PI operator $\mathcal{L}$ as the observer gain. Examining the error dynamics of this estimator, it is proven that an upper bound on the $H_{\infty}$-norm of these error dynamics can be minimized by solving a linear operator inequality on PI operator variables. Finally, an analytical formula is proposed for inversion of a class of 2D PI operators, which is then used to reconstruct the Luenberger gain $\mathcal{L}$. Results are implemented in the PIETOOLS software suite -- applying the methodology and simulating the resulting observer for an unstable 2D heat equation with boundary observations.
♻ ☆ Rollover Prevention for Mobile Robots with Control Barrier Functions: Differentiator-Based Adaptation and Projection-to-State Safety
This paper develops rollover prevention guarantees for mobile robots using control barrier function (CBF) theory, and demonstrates the method experimentally. We consider a safety measure based on a zero moment point condition through the lens of CBFs. However, these conditions depend on time-varying and noisy parameters. To address this issue, we present a differentiator-based safety-critical controller that estimates these parameters and pairs Input-to-State Stable (ISS) differentiator dynamics with CBFs to achieve rigorous safety guarantees. Additionally, to ensure safety in the presence of disturbances, we utilize a time-varying extension of Projection-to-State Safety (PSSf). The effectiveness of the proposed method is demonstrated via experiments on a tracked robot with a rollover potential on steep slopes.
♻ ☆ Toward Enhanced Reinforcement Learning-Based Resource Management via Digital Twin: Opportunities, Applications, and Challenges
This article presents a digital twin (DT)-enhanced reinforcement learning (RL) framework aimed at optimizing performance and reliability in network resource management, since the traditional RL methods face several unified challenges when applied to physical networks, including limited exploration efficiency, slow convergence, poor long-term performance, and safety concerns during the exploration phase. To deal with the above challenges, a comprehensive DT-based framework is proposed to enhance the convergence speed and performance for unified RL-based resource management. The proposed framework provides safe action exploration, more accurate estimates of long-term returns, faster training convergence, higher convergence performance, and real-time adaptation to varying network conditions. Then, two case studies on ultra-reliable and low-latency communication (URLLC) services and multiple unmanned aerial vehicles (UAV) network are presented, demonstrating improvements of the proposed framework in performance, convergence speed, and training cost reduction both on traditional RL and neural network based Deep RL (DRL). Finally, the article identifies and explores some of the research challenges and open issues in this rapidly evolving field.
comment: 7pages, 6figures
♻ ☆ A cutting-surface consensus approach for distributed robust optimization of multi-agent systems
A novel and fully distributed optimization method is proposed for the distributed robust convex program (DRCP) over a time-varying unbalanced directed network under the uniformly jointly strongly connected (UJSC) assumption. Firstly, a tractable approximated DRCP (ADRCP) is introduced by discretizing the semi-infinite constraints into a finite number of inequality constraints and restricting the right-hand side of the constraints with a positive parameter. This problem is iteratively solved by a distributed projected gradient algorithm proposed in this paper, which is based on epigraphic reformulation and subgradient projected algorithms. Secondly, a cutting-surface consensus approach is proposed for locating an approximately optimal consensus solution of the DRCP with guaranteed feasibility. This approach is based on iteratively approximating the DRCP by successively reducing the restriction parameter of the right-hand constraints and populating the cutting-surfaces into the existing finite set of constraints. Thirdly, to ensure finite-time termination of the distributed optimization, a distributed termination algorithm is developed based on consensus and zeroth-order stopping conditions under UJSC graphs. Fourthly, it is proved that the cutting-surface consensus approach terminates finitely and yields a feasible and approximate optimal solution for each agent. Finally, the effectiveness of the approach is illustrated through a numerical example.
comment: Submitted to IEEE Transactions on Automatic Control
♻ ☆ A New Computational Approach for Solving Linear Bilevel Programs Based on Parameter-Free Disjunctive Decomposition
Linear bilevel programs (linear BLPs) have been widely used in computational mathematics and optimization in several applications. Single-level reformulation for linear BLPs replaces the lower-level linear program with its Karush-Kuhn-Tucker optimality conditions and linearizes the complementary slackness conditions using the big-M technique. Although the approach is straightforward, it requires finding the big-M whose computation is recently shown to be NP-hard. This paper presents a disjunctive-based decomposition algorithm which does not need finding the big-Ms whereas guaranteeing that obtained solution is optimal. Our experience shows promising performance of our algorithm.
comment: 16 pages, 3 figures
Multiagent Systems 5
☆ City-LEO: Toward Transparent City Management Using LLM with End-to-End Optimization
Existing operations research (OR) models and tools play indispensable roles in smart-city operations, yet their practical implementation is limited by the complexity of modeling and deficiencies in optimization proficiency. To generate more relevant and accurate solutions to users' requirements, we propose a large language model (LLM)-based agent ("City-LEO") that enhances the efficiency and transparency of city management through conversational interactions. Specifically, to accommodate diverse users' requirements and enhance computational tractability, City-LEO leverages LLM's logical reasoning capabilities on prior knowledge to scope down large-scale optimization problems efficiently. In the human-like decision process, City-LEO also incorporates End-to-end (E2E) model to synergize the prediction and optimization. The E2E framework be conducive to coping with environmental uncertainties and involving more query-relevant features, and then facilitates transparent and interpretable decision-making process. In case study, we employ City-LEO in the operations management of e-bike sharing (EBS) system. The numerical results demonstrate that City-LEO has superior performance when benchmarks against the full-scale optimization problem. With less computational time, City-LEO generates more satisfactory and relevant solutions to the users' requirements, and achieves lower global suboptimality without significantly compromising accuracy. In a broader sense, our proposed agent offers promise to develop LLM-embedded OR tools for smart-city operations management.
comment: 26 pages, 8 figures, 5 tables
☆ TorchOpera: A Compound AI System for LLM Safety
We introduce TorchOpera, a compound AI system for enhancing the safety and quality of prompts and responses for Large Language Models. TorchOpera ensures that all user prompts are safe, contextually grounded, and effectively processed, while enhancing LLM responses to be relevant and high quality. TorchOpera utilizes the vector database for contextual grounding, rule-based wrappers for flexible modifications, and specialized mechanisms for detecting and adjusting unsafe or incorrect content. We also provide a view of the compound AI system to reduce the computational cost. Extensive experiments show that TorchOpera ensures the safety, reliability, and applicability of LLMs in real-world settings while maintaining the efficiency of LLM responses.
♻ ☆ Mini Honor of Kings: A Lightweight Environment for Multi-Agent Reinforcement Learning
Games are widely used as research environments for multi-agent reinforcement learning (MARL), but they pose three significant challenges: limited customization, high computational demands, and oversimplification. To address these issues, we introduce the first publicly available map editor for the popular mobile game Honor of Kings and design a lightweight environment, Mini Honor of Kings (Mini HoK), for researchers to conduct experiments. Mini HoK is highly efficient, allowing experiments to be run on personal PCs or laptops while still presenting sufficient challenges for existing MARL algorithms. We have tested our environment on common MARL algorithms and demonstrated that these algorithms have yet to find optimal solutions within this environment. This facilitates the dissemination and advancement of MARL methods within the research community. Additionally, we hope that more researchers will leverage the Honor of Kings map editor to develop innovative and scientifically valuable new maps. Our code and user manual are available at: https://github.com/tencent-ailab/mini-hok.
♻ ☆ Evaluating Supply Chain Resilience During Pandemic Using Agent-based Simulation
Recent pandemics have highlighted vulnerabilities in our global economic systems, especially supply chains. Possible future pandemic raises a dilemma for businesses owners between short-term profitability and long-term supply chain resilience planning. In this study, we propose a novel agent-based simulation model integrating extended Susceptible-Infected-Recovered (SIR) epidemiological model and supply and demand economic model to evaluate supply chain resilience strategies during pandemics. Using this model, we explore a range of supply chain resilience strategies under pandemic scenarios using in silico experiments. We find that a balanced approach to supply chain resilience performs better in both pandemic and non-pandemic times compared to extreme strategies, highlighting the importance of preparedness in the form of a better supply chain resilience. However, our analysis shows that the exact supply chain resilience strategy is hard to obtain for each firm and is relatively sensitive to the exact profile of the pandemic and economic state at the beginning of the pandemic. As such, we used a machine learning model that uses the agent-based simulation to estimate a near-optimal supply chain resilience strategy for a firm. The proposed model offers insights for policymakers and businesses to enhance supply chain resilience in the face of future pandemics, contributing to understanding the trade-offs between short-term gains and long-term sustainability in supply chain management before and during pandemics.
♻ ☆ A cutting-surface consensus approach for distributed robust optimization of multi-agent systems
A novel and fully distributed optimization method is proposed for the distributed robust convex program (DRCP) over a time-varying unbalanced directed network under the uniformly jointly strongly connected (UJSC) assumption. Firstly, a tractable approximated DRCP (ADRCP) is introduced by discretizing the semi-infinite constraints into a finite number of inequality constraints and restricting the right-hand side of the constraints with a positive parameter. This problem is iteratively solved by a distributed projected gradient algorithm proposed in this paper, which is based on epigraphic reformulation and subgradient projected algorithms. Secondly, a cutting-surface consensus approach is proposed for locating an approximately optimal consensus solution of the DRCP with guaranteed feasibility. This approach is based on iteratively approximating the DRCP by successively reducing the restriction parameter of the right-hand constraints and populating the cutting-surfaces into the existing finite set of constraints. Thirdly, to ensure finite-time termination of the distributed optimization, a distributed termination algorithm is developed based on consensus and zeroth-order stopping conditions under UJSC graphs. Fourthly, it is proved that the cutting-surface consensus approach terminates finitely and yields a feasible and approximate optimal solution for each agent. Finally, the effectiveness of the approach is illustrated through a numerical example.
comment: Submitted to IEEE Transactions on Automatic Control
Information Retrieval 12
☆ ptt5-v2: A Closer Look at Continued Pretraining of T5 Models for the Portuguese Language
Despite advancements in Natural Language Processing (NLP) and the growing availability of pretrained models, the English language remains the primary focus of model development. Continued pretraining on language-specific corpora provides a practical solution for adapting models to other languages. However, the impact of different pretraining settings on downstream tasks remains underexplored. This work introduces $\texttt{ptt5-v2}$, investigating the continued pretraining of T5 models for Portuguese. We first develop a baseline set of settings and pretrain models with sizes up to 3B parameters. Finetuning on three Portuguese downstream tasks (assin2 STS, assin2 RTE, and TweetSentBR) yields SOTA results on the latter two. We then explore the effects of different pretraining configurations, including quality filters, optimization strategies, and multi-epoch pretraining. Perhaps surprisingly, their impact remains subtle compared to our baseline. We release $\texttt{ptt5-v2}$ pretrained checkpoints and the finetuned MonoT5 rerankers on HuggingFace at https://huggingface.co/collections/unicamp-dl/ptt5-v2-666538a650188ba00aa8d2d0 and https://huggingface.co/collections/unicamp-dl/monoptt5-66653981877df3ea727f720d.
☆ HiddenTables & PyQTax: A Cooperative Game and Dataset For TableQA to Ensure Scale and Data Privacy Across a Myriad of Taxonomies EMNLP 2023
A myriad of different Large Language Models (LLMs) face a common challenge in contextually analyzing table question-answering tasks. These challenges are engendered from (1) finite context windows for large tables, (2) multi-faceted discrepancies amongst tokenization patterns against cell boundaries, and (3) various limitations stemming from data confidentiality in the process of using external models such as gpt-3.5-turbo. We propose a cooperative game dubbed "HiddenTables" as a potential resolution to this challenge. In essence, "HiddenTables" is played between the code-generating LLM "Solver" and the "Oracle" which evaluates the ability of the LLM agents to solve Table QA tasks. This game is based on natural language schemas and importantly, ensures the security of the underlying data. We provide evidential experiments on a diverse set of tables that demonstrate an LLM's collective inability to generalize and perform on complex queries, handle compositional dependencies, and align natural language to programmatic commands when concrete table schemas are provided. Unlike encoder-based models, we have pushed the boundaries of "HiddenTables" to not be limited by the number of rows - therefore we exhibit improved efficiency in prompt and completion tokens. Our infrastructure has spawned a new dataset "PyQTax" that spans across 116,671 question-table-answer triplets and provides additional fine-grained breakdowns & labels for varying question taxonomies. Therefore, in tandem with our academic contributions regarding LLMs' deficiency in TableQA tasks, "HiddenTables" is a tactile manifestation of how LLMs can interact with massive datasets while ensuring data security and minimizing generation costs.
comment: In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023)
♻ ☆ Khmer Semantic Search Engine (KSE): Digital Information Access and Document Retrieval
The search engine process is crucial for document content retrieval. For Khmer documents, an effective tool is needed to extract essential keywords and facilitate accurate searches. Despite the daily generation of significant Khmer content, Cambodians struggle to find necessary documents due to the lack of an effective semantic searching tool. Even Google does not deliver high accuracy for Khmer content. Semantic search engines improve search results by employing advanced algorithms to understand various content types. With the rise in Khmer digital content such as reports, articles, and social media feedback enhanced search capabilities are essential. This research proposes the first Khmer Semantic Search Engine (KSE), designed to enhance traditional Khmer search methods. Utilizing semantic matching techniques and formally annotated semantic content, our tool extracts meaningful keywords from user queries, performs precise matching, and provides the best matching offline documents and online URLs. We propose three semantic search frameworks: semantic search based on a keyword dictionary, semantic search based on ontology, and semantic search based on ranking. Additionally, we developed tools for data preparation, including document addition and manual keyword extraction. To evaluate performance, we created a ground truth dataset and addressed issues related to searching and semantic search. Our findings demonstrate that understanding search term semantics can lead to significantly more accurate results.
♻ ☆ Understanding Performance of Long-Document Ranking Models through Comprehensive Evaluation and Leaderboarding
We evaluated 20+ Transformer models for ranking of long documents (including recent LongP models trained with FlashAttention) and compared them with a simple FirstP baseline, which applies the same model to the truncated input (at most 512 tokens). We used MS MARCO Documents v1 as a primary training set and evaluated both the zero-shot transferred and fine-tuned models. On MS MARCO, TREC DLs, and Robust04 no long-document model outperformed FirstP by more than 5% in NDCG and MRR (when averaged over all test sets). We conjectured this was not due to models' inability to process long context, but due to a positional bias of relevant passages, whose distribution was skewed towards the beginning of documents. We found direct evidence of this bias in some test sets, which motivated us to create MS MARCO FarRelevant (based on MS MARCO Passages) where the relevant passages were not present among the first 512 tokens. Unlike standard collections where we saw both little benefit from incorporating longer contexts and limited variability in model performance (within a few %), experiments on MS MARCO FarRelevant uncovered dramatic differences among models. The FirstP models performed roughly at the random-baseline level in both zero-shot and fine-tuning scenarios. Simple aggregation models including MaxP and PARADE Attention had good zero-shot accuracy, but benefited little from fine-tuning. Most other models had poor zero-shot performance (sometimes at a random baseline level), but outstripped MaxP by as much as 13-28% after fine-tuning. Thus, the positional bias not only diminishes benefits of processing longer document contexts, but also leads to model overfitting to positional bias and performing poorly in a zero-shot setting when the distribution of relevant passages changes substantially. We make our software and data available.
♻ ☆ Fetch-A-Set: A Large-Scale OCR-Free Benchmark for Historical Document Retrieval
This paper introduces Fetch-A-Set (FAS), a comprehensive benchmark tailored for legislative historical document analysis systems, addressing the challenges of large-scale document retrieval in historical contexts. The benchmark comprises a vast repository of documents dating back to the XVII century, serving both as a training resource and an evaluation benchmark for retrieval systems. It fills a critical gap in the literature by focusing on complex extractive tasks within the domain of cultural heritage. The proposed benchmark tackles the multifaceted problem of historical document analysis, including text-to-image retrieval for queries and image-to-text topic extraction from document fragments, all while accommodating varying levels of document legibility. This benchmark aims to spur advancements in the field by providing baselines and data for the development and evaluation of robust historical document retrieval systems, particularly in scenarios characterized by wide historical spectrum.
comment: Preprint for the manuscript accepted for publication in the DAS2024 LNCS proceedings
♻ ☆ RATSF: Empowering Customer Service Volume Management through Retrieval-Augmented Time-Series Forecasting
An efficient customer service management system hinges on precise forecasting of service volume. In this scenario, where data non-stationarity is pronounced, successful forecasting heavily relies on identifying and leveraging similar historical data rather than merely summarizing periodic patterns. Existing models based on RNN or Transformer architectures may struggle with this flexible and effective utilization. To tackle this challenge, we initially developed the Time Series Knowledge Base (TSKB) with an advanced indexing system for efficient historical data retrieval. We also developed the Retrieval Augmented Cross-Attention (RACA) module, a variant of the cross-attention mechanism within Transformer's decoder layers, designed to be seamlessly integrated into the vanilla Transformer architecture to assimilate key historical data segments. The synergy between TSKB and RACA forms the backbone of our Retrieval-Augmented Time Series Forecasting (RATSF) framework. Based on the above two components, RATSF not only significantly enhances performance in the context of Fliggy hotel service volume forecasting but also adapts flexibly to various scenarios and integrates with a multitude of Transformer variants for time-series forecasting. Extensive experimentation has validated the effectiveness and generalizability of this system design across multiple diverse contexts.
♻ ☆ A Systematic Investigation of Distilling Large Language Models into Cross-Encoders for Passage Re-ranking
Cross-encoders distilled from large language models (LLMs) are often more effective re-rankers than cross-encoders fine-tuned on manually labeled data. However, the distilled models usually do not reach their teacher LLM's effectiveness. To investigate whether best practices for fine-tuning cross-encoders on manually labeled data (e.g., hard-negative sampling, deep sampling, and listwise loss functions) can help to improve LLM ranker distillation, we construct and release a new distillation dataset: Rank-DistiLLM. In our experiments, cross-encoders trained on Rank-DistiLLM reach the effectiveness of LLMs while being orders of magnitude more efficient. Our code and data is available at https://github.com/webis-de/msmarco-llm-distillation.
♻ ☆ Set-Encoder: Permutation-Invariant Inter-Passage Attention for Listwise Passage Re-Ranking with Cross-Encoders
Existing cross-encoder re-rankers can be categorized as pointwise, pairwise, or listwise models. Pair- and listwise models allow passage interactions, which usually makes them more effective than pointwise models but also less efficient and less robust to input order permutations. To enable efficient permutation-invariant passage interactions during re-ranking, we propose a new cross-encoder architecture with inter-passage attention: the Set-Encoder. In Cranfield-style experiments on TREC Deep Learning and TIREx, the Set-Encoder is as effective as state-of-the-art listwise models while improving efficiency and robustness to input permutations. Interestingly, a pointwise model is similarly effective, but when additionally requiring the models to consider novelty, the Set-Encoder is more effective than its pointwise counterpart and retains its advantageous properties compared to other listwise models. Our code and models are publicly available at https://github.com/webis-de/set-encoder.
♻ ☆ Pre-training Cross-lingual Open Domain Question Answering with Large-scale Synthetic Supervision
Cross-lingual open domain question answering (CLQA) is a complex problem, comprising cross-lingual retrieval from a multilingual knowledge base, followed by answer generation in the query language. Both steps are usually tackled by separate models, requiring substantial annotated datasets, and typically auxiliary resources, like machine translation systems to bridge between languages. In this paper, we show that CLQA can be addressed using a single encoder-decoder model. To effectively train this model, we propose a self-supervised method based on exploiting the cross-lingual link structure within Wikipedia. We demonstrate how linked Wikipedia pages can be used to synthesise supervisory signals for cross-lingual retrieval, through a form of cloze query, and generate more natural questions to supervise answer generation. Together, we show our approach, \texttt{CLASS}, outperforms comparable methods on both supervised and zero-shot language adaptation settings, including those using machine translation.
♻ ☆ RLTP: Reinforcement Learning to Pace for Delayed Impression Modeling in Preloaded Ads KDD 2023
To increase brand awareness, many advertisers conclude contracts with advertising platforms to purchase traffic and then deliver advertisements to target audiences. In a whole delivery period, advertisers usually desire a certain impression count for the ads, and they also expect that the delivery performance is as good as possible (e.g., obtaining high click-through rate). Advertising platforms employ pacing algorithms to satisfy the demands via adjusting the selection probabilities to traffic requests in real-time. However, the delivery procedure is also affected by the strategies from publishers, which cannot be controlled by advertising platforms. Preloading is a widely used strategy for many types of ads (e.g., video ads) to make sure that the response time for displaying after a traffic request is legitimate, which results in delayed impression phenomenon. Traditional pacing algorithms cannot handle the preloading nature well because they rely on immediate feedback signals, and may fail to guarantee the demands from advertisers. In this paper, we focus on a new research problem of impression pacing for preloaded ads, and propose a Reinforcement Learning To Pace framework RLTP. It learns a pacing agent that sequentially produces selection probabilities in the whole delivery period. To jointly optimize the two objectives of impression count and delivery performance, RLTP employs tailored reward estimator to satisfy the guaranteed impression count, penalize the over-delivery and maximize the traffic value. Experiments on large-scale industrial datasets verify that RLTP outperforms baseline pacing algorithms by a large margin. We have deployed the RLTP framework online to our advertising platform, and results show that it achieves significant uplift to core metrics including delivery completion rate and click-through rate.
comment: KDD 2023 (Applied Data Science Track). The first two authors contributed equally
♻ ☆ Evaluating Supply Chain Resilience During Pandemic Using Agent-based Simulation
Recent pandemics have highlighted vulnerabilities in our global economic systems, especially supply chains. Possible future pandemic raises a dilemma for businesses owners between short-term profitability and long-term supply chain resilience planning. In this study, we propose a novel agent-based simulation model integrating extended Susceptible-Infected-Recovered (SIR) epidemiological model and supply and demand economic model to evaluate supply chain resilience strategies during pandemics. Using this model, we explore a range of supply chain resilience strategies under pandemic scenarios using in silico experiments. We find that a balanced approach to supply chain resilience performs better in both pandemic and non-pandemic times compared to extreme strategies, highlighting the importance of preparedness in the form of a better supply chain resilience. However, our analysis shows that the exact supply chain resilience strategy is hard to obtain for each firm and is relatively sensitive to the exact profile of the pandemic and economic state at the beginning of the pandemic. As such, we used a machine learning model that uses the agent-based simulation to estimate a near-optimal supply chain resilience strategy for a firm. The proposed model offers insights for policymakers and businesses to enhance supply chain resilience in the face of future pandemics, contributing to understanding the trade-offs between short-term gains and long-term sustainability in supply chain management before and during pandemics.
♻ ☆ InteraRec: Screenshot Based Recommendations Using Multimodal Large Language Models
Weblogs, comprised of records detailing user activities on any website, offer valuable insights into user preferences, behavior, and interests. Numerous recommendation algorithms, employing strategies such as collaborative filtering, content-based filtering, and hybrid methods, leverage the data mined through these weblogs to provide personalized recommendations to users. Despite the abundance of information available in these weblogs, identifying and extracting pertinent information and key features from them necessitate extensive engineering endeavors. The intricate nature of the data also poses a challenge for interpretation, especially for non-experts. In this study, we introduce a sophisticated and interactive recommendation framework denoted as InteraRec, which diverges from conventional approaches that exclusively depend on weblogs for recommendation generation. InteraRec framework captures high-frequency screenshots of web pages as users navigate through a website. Leveraging state-of-the-art multimodal large language models (MLLMs), it extracts valuable insights into user preferences from these screenshots by generating a textual summary based on predefined keywords. Subsequently, an LLM-integrated optimization setup utilizes this summary to generate tailored recommendations. Through our experiments, we demonstrate the effectiveness of InteraRec in providing users with valuable and personalized offerings. Furthermore, we explore the integration of session-based recommendation systems into the InteraRec framework, aiming to enhance its overall performance. Finally, we curate a new dataset comprising of screenshots from product web pages on the Amazon website for the validation of the InteraRec framework. Detailed experiments demonstrate the efficacy of the InteraRec framework in delivering valuable and personalized recommendations tailored to individual user preferences.
Multimedia 2
☆ High-level Codes and Fine-grained Weights for Online Multi-modal Hashing Retrieval
In the real world, multi-modal data often appears in a streaming fashion, and there is a growing demand for similarity retrieval from such non-stationary data, especially at a large scale. In response to this need, online multi-modal hashing has gained significant attention. However, existing online multi-modal hashing methods face challenges related to the inconsistency of hash codes during long-term learning and inefficient fusion of different modalities. In this paper, we present a novel approach to supervised online multi-modal hashing, called High-level Codes, Fine-grained Weights (HCFW). To address these problems, HCFW is designed by its non-trivial contributions from two primary dimensions: 1) Online Hashing Perspective. To ensure the long-term consistency of hash codes, especially in incremental learning scenarios, HCFW learns high-level codes derived from category-level semantics. Besides, these codes are adept at handling the category-incremental challenge. 2) Multi-modal Hashing Aspect. HCFW introduces the concept of fine-grained weights designed to facilitate the seamless fusion of complementary multi-modal data, thereby generating multi-modal weights at the instance level and enhancing the overall hashing performance. A comprehensive battery of experiments conducted on two benchmark datasets convincingly underscores the effectiveness and efficiency of HCFW.
comment: 32 pages, 4 figures
♻ ☆ Towards Real-World Stickers Use: A New Dataset for Multi-Tag Sticker Recognition
In real-world conversations, the diversity and ambiguity of stickers often lead to varied interpretations based on the context, necessitating the requirement for comprehensively understanding stickers and supporting multi-tagging. To address this challenge, we introduce StickerTAG, the first multi-tag sticker dataset comprising a collected tag set with 461 tags and 13,571 sticker-tag pairs, designed to provide a deeper understanding of stickers. Recognizing multiple tags for stickers becomes particularly challenging due to sticker tags usually are fine-grained attribute aware. Hence, we propose an Attentive Attribute-oriented Prompt Learning method, ie, Att$^2$PL, to capture informative features of stickers in a fine-grained manner to better differentiate tags. Specifically, we first apply an Attribute-oriented Description Generation (ADG) module to obtain the description for stickers from four attributes. Then, a Local Re-attention (LoR) module is designed to perceive the importance of local information. Finally, we use prompt learning to guide the recognition process and adopt confidence penalty optimization to penalize the confident output distribution. Extensive experiments show that our method achieves encouraging results for all commonly used metrics.
Systems and Control 7
☆ Polynomial Chaos-based Stochastic Model Predictive Control: An Overview and Future Research Directions
This article is devoted to providing a review of mathematical formulations in which Polynomial Chaos Theory (PCT) has been incorporated into stochastic model predictive control (SMPC). In the past decade, PCT has been shown to provide a computationally tractable way to perform complete and accurate uncertainty propagation through (smooth) nonlinear dynamic systems. As such, it represents a very useful computational tool for accelerating the computations needed in SMPC with time invariant uncertainties. It turns out that it can also be used to reduce complexity of chance constraints, which are an important component of SMPC. In this paper, we provide an overview of PCT and discuss how it can be applied in such time invariant settings.
☆ Inverse Kinematics with Vision-Based Constraints
This paper introduces the Visual Inverse Kinematics problem (VIK) to fill the gap between robot Inverse Kinematics (IK) and visual servo control. Different from the IK problem, the VIK problem seeks to find robot configurations subject to vision-based constraints, in addition to kinematic constraints. In this work, we develop a formulation of the VIK problem with a Field of View (FoV) constraint, enforcing the visibility of an object from a camera on the robot. Our proposed solution is based on the idea of adding a virtual kinematic chain connecting the physical robot and the object; the FoV constraint is then equivalent to a joint angle kinematic constraint. Along the way, we introduce multiple vision-based cost functions to fulfill different objectives. We solve this formulation of the VIK problem using a method that involves a semidefinite program (SDP) constraint followed by a rank minimization algorithm. The performance of this method for solving the VIK problem is validated through simulations.
☆ Intermittent Encryption Strategies for Anti-Eavesdropping Estimation
In this paper, an anti-eavesdropping estimation problem is investigated. A linear encryption scheme is utilized, which first linearly transforms innovation via an encryption matrix and then encrypts some components of the transformed innovation. To reduce the computation and energy resources consumed by the linear encryption scheme, both stochastic and deterministic intermittent strategies which perform the linear encryption scheme only at partial moments are developed. When the system is stable, it is shown that the mean squared error (MSE) of the eavesdropper converges under any stochastic or deterministic intermittent strategy. Also, an analytical encryption matrix that maximizes the steady-state of the MSE is designed. When the system is unstable, the eavesdropper's MSE can be unbounded with arbitrary positive encryption probabilities and decision functions if encryption matrices are chosen appropriately. Then, the relationship between the aforementioned encryption parameters and the eavesdropper's MSE is analyzed. Moreover, a single intermittent strategy which only encrypts one message is discussed. This strategy can be unavailable for stable systems, but can make the eavesdropper's MSE unbounded in unstable systems for the encrypted message satisfies a linear matrix inequality (LMI) condition. The effectiveness of the proposed methods is verified in the simulation.
comment: 12 pages, 5 figures
☆ HumanPlus: Humanoid Shadowing and Imitation from Humans
One of the key arguments for building robots that have similar form factors to human beings is that we can leverage the massive human data for training. Yet, doing so has remained challenging in practice due to the complexities in humanoid perception and control, lingering physical gaps between humanoids and humans in morphologies and actuation, and lack of a data pipeline for humanoids to learn autonomous skills from egocentric vision. In this paper, we introduce a full-stack system for humanoids to learn motion and autonomous skills from human data. We first train a low-level policy in simulation via reinforcement learning using existing 40-hour human motion datasets. This policy transfers to the real world and allows humanoid robots to follow human body and hand motion in real time using only a RGB camera, i.e. shadowing. Through shadowing, human operators can teleoperate humanoids to collect whole-body data for learning different tasks in the real world. Using the data collected, we then perform supervised behavior cloning to train skill policies using egocentric vision, allowing humanoids to complete different tasks autonomously by imitating human skills. We demonstrate the system on our customized 33-DoF 180cm humanoid, autonomously completing tasks such as wearing a shoe to stand up and walk, unloading objects from warehouse racks, folding a sweatshirt, rearranging objects, typing, and greeting another robot with 60-100% success rates using up to 40 demonstrations. Project website: https://humanoid-ai.github.io/
comment: project website: https://humanoid-ai.github.io/
☆ Fast Geometric Learning of MIMO Signal Detection over Grassmannian Manifolds
Domain or statistical distribution shifts are a key staple of the wireless communication channel, because of the dynamics of the environment. Deep learning (DL) models for detecting multiple-input multiple-output (MIMO) signals in dynamic communication require large training samples (in the order of hundreds of thousands to millions) and online retraining to adapt to domain shift. Some dynamic networks, such as vehicular networks, cannot tolerate the waiting time associated with gathering a large number of training samples or online fine-tuning which incurs significant end-to-end delay. In this paper, a novel classification technique based on the concept of geodesic flow kernel (GFK) is proposed for MIMO signal detection. In particular, received MIMO signals are first represented as points on Grassmannian manifolds by formulating basis of subspaces spanned by the rows vectors of the received signal. Then, the domain shift is modeled using a geodesic flow kernel integrating the subspaces that lie on the geodesic to characterize changes in geometric and statistical properties of the received signals. The kernel derives low-dimensional representations of the received signals over the Grassman manifolds that are invariant to domain shift and is used in a geometric support vector machine (G-SVM) algorithm for MIMO signal detection in an unsupervised manner. Simulation results reveal that the proposed method achieves promising performance against the existing baselines like OAMPnet and MMNet with only 1,200 training samples and without online retraining.
♻ ☆ Auto-Encoding Bayesian Inverse Games
When multiple agents interact in a common environment, each agent's actions impact others' future decisions, and noncooperative dynamic games naturally capture this coupling. In interactive motion planning, however, agents typically do not have access to a complete model of the game, e.g., due to unknown objectives of other players. Therefore, we consider the inverse game problem, in which some properties of the game are unknown a priori and must be inferred from observations. Existing maximum likelihood estimation (MLE) approaches to solve inverse games provide only point estimates of unknown parameters without quantifying uncertainty, and perform poorly when many parameter values explain the observed behavior. To address these limitations, we take a Bayesian perspective and construct posterior distributions of game parameters. To render inference tractable, we employ a variational autoencoder (VAE) with an embedded differentiable game solver. This structured VAE can be trained from an unlabeled dataset of observed interactions, naturally handles continuous, multi-modal distributions, and supports efficient sampling from the inferred posteriors without computing game solutions at runtime. Extensive evaluations in simulated driving scenarios demonstrate that the proposed approach successfully learns the prior and posterior game parameter distributions, provides more accurate objective estimates than MLE baselines, and facilitates safer and more efficient game-theoretic motion planning.
♻ ☆ The role of local bounds on neighborhoods in the network for scale-free state synchronization of multi-agent systems
This paper provides necessary and sufficient conditions for the existence of solutions to the state synchronization problem of homogeneous multi-agent systems (MAS) via scale-free linear dynamic non-collaborative protocol for both continuous- and discrete-time. These conditions guarantee for which class of MAS, one can achieve scale-free state synchronization. We investigate protocol design with and without utilizing local bounds on neighborhood. The results show that the availability of local bounds on neighborhoods plays a key role.
comment: This paper was submitted to IJRNC on Aug. 3, 2023 and resubmitted on June 15, 2024. Now, it is under review at the fifth round
Multiagent Systems 2
☆ Emerging Safety Attack and Defense in Federated Instruction Tuning of Large Language Models
Federated learning (FL) enables multiple parties to collaboratively fine-tune an large language model (LLM) without the need of direct data sharing. Ideally, by training on decentralized data that is aligned with human preferences and safety principles, federated instruction tuning can result in an LLM that could behave in a helpful and safe manner. In this paper, we for the first time reveal the vulnerability of safety alignment in FedIT by proposing a simple, stealthy, yet effective safety attack method. Specifically, the malicious clients could automatically generate attack data without involving manual efforts and attack the FedIT system by training their local LLMs on such attack data. Unfortunately, this proposed safety attack not only can compromise the safety alignment of LLM trained via FedIT, but also can not be effectively defended against by many existing FL defense methods. Targeting this, we further propose a post-hoc defense method, which could rely on a fully automated pipeline: generation of defense data and further fine-tuning of the LLM. Extensive experiments show that our safety attack method can significantly compromise the LLM's safety alignment (e.g., reduce safety rate by 70\%), which can not be effectively defended by existing defense methods (at most 4\% absolute improvement), while our safety defense method can significantly enhance the attacked LLM's safety alignment (at most 69\% absolute improvement).
comment: 18 pages
♻ ☆ Auto-Encoding Bayesian Inverse Games
When multiple agents interact in a common environment, each agent's actions impact others' future decisions, and noncooperative dynamic games naturally capture this coupling. In interactive motion planning, however, agents typically do not have access to a complete model of the game, e.g., due to unknown objectives of other players. Therefore, we consider the inverse game problem, in which some properties of the game are unknown a priori and must be inferred from observations. Existing maximum likelihood estimation (MLE) approaches to solve inverse games provide only point estimates of unknown parameters without quantifying uncertainty, and perform poorly when many parameter values explain the observed behavior. To address these limitations, we take a Bayesian perspective and construct posterior distributions of game parameters. To render inference tractable, we employ a variational autoencoder (VAE) with an embedded differentiable game solver. This structured VAE can be trained from an unlabeled dataset of observed interactions, naturally handles continuous, multi-modal distributions, and supports efficient sampling from the inferred posteriors without computing game solutions at runtime. Extensive evaluations in simulated driving scenarios demonstrate that the proposed approach successfully learns the prior and posterior game parameter distributions, provides more accurate objective estimates than MLE baselines, and facilitates safer and more efficient game-theoretic motion planning.
Information Retrieval 7
☆ SparseCL: Sparse Contrastive Learning for Contradiction Retrieval
Contradiction retrieval refers to identifying and extracting documents that explicitly disagree with or refute the content of a query, which is important to many downstream applications like fact checking and data cleaning. To retrieve contradiction argument to the query from large document corpora, existing methods such as similarity search and crossencoder models exhibit significant limitations. The former struggles to capture the essence of contradiction due to its inherent nature of favoring similarity, while the latter suffers from computational inefficiency, especially when the size of corpora is large. To address these challenges, we introduce a novel approach: SparseCL that leverages specially trained sentence embeddings designed to preserve subtle, contradictory nuances between sentences. Our method utilizes a combined metric of cosine similarity and a sparsity function to efficiently identify and retrieve documents that contradict a given query. This approach dramatically enhances the speed of contradiction detection by reducing the need for exhaustive document comparisons to simple vector calculations. We validate our model using the Arguana dataset, a benchmark dataset specifically geared towards contradiction retrieval, as well as synthetic contradictions generated from the MSMARCO and HotpotQA datasets using GPT-4. Our experiments demonstrate the efficacy of our approach not only in contradiction retrieval with more than 30% accuracy improvements on MSMARCO and HotpotQA across different model architectures but also in applications such as cleaning corrupted corpora to restore high-quality QA retrieval. This paper outlines a promising direction for improving the accuracy and efficiency of contradiction retrieval in large-scale text corpora.
☆ QDA-SQL: Questions Enhanced Dialogue Augmentation for Multi-Turn Text-to-SQL
Fine-tuning large language models (LLMs) for specific domain tasks has achieved great success in Text-to-SQL tasks. However, these fine-tuned models often face challenges with multi-turn Text-to-SQL tasks caused by ambiguous or unanswerable questions. It is desired to enhance LLMs to handle multiple types of questions in multi-turn Text-to-SQL tasks. To address this, we propose a novel data augmentation method, called QDA-SQL, which generates multiple types of multi-turn Q\&A pairs by using LLMs. In QDA-SQL, we introduce a novel data augmentation method incorporating validation and correction mechanisms to handle complex multi-turn Text-to-SQL tasks. Experimental results demonstrate that QDA-SQL enables fine-tuned models to exhibit higher performance on SQL statement accuracy and enhances their ability to handle complex, unanswerable questions in multi-turn Text-to-SQL tasks. The generation script and test set are released at https://github.com/mcxiaoxiao/QDA-SQL.
comment: 13 pages, 7 figures
☆ ADSNet: Cross-Domain LTV Prediction with an Adaptive Siamese Network in Advertising KDD 2024
Advertising platforms have evolved in estimating Lifetime Value (LTV) to better align with advertisers' true performance metric. However, the sparsity of real-world LTV data presents a significant challenge to LTV predictive model(i.e., pLTV), severely limiting the their capabilities. Therefore, we propose to utilize external data, in addition to the internal data of advertising platform, to expand the size of purchase samples and enhance the LTV prediction model of the advertising platform. To tackle the issue of data distribution shift between internal and external platforms, we introduce an Adaptive Difference Siamese Network (ADSNet), which employs cross-domain transfer learning to prevent negative transfer. Specifically, ADSNet is designed to learn information that is beneficial to the target domain. We introduce a gain evaluation strategy to calculate information gain, aiding the model in learning helpful information for the target domain and providing the ability to reject noisy samples, thus avoiding negative transfer. Additionally, we also design a Domain Adaptation Module as a bridge to connect different domains, reduce the distribution distance between them, and enhance the consistency of representation space distribution. We conduct extensive offline experiments and online A/B tests on a real advertising platform. Our proposed ADSNet method outperforms other methods, improving GINI by 2$\%$. The ablation study highlights the importance of the gain evaluation strategy in negative gain sample rejection and improving model performance. Additionally, ADSNet significantly improves long-tail prediction. The online A/B tests confirm ADSNet's efficacy, increasing online LTV by 3.47$\%$ and GMV by 3.89$\%$.
comment: Accepted to KDD 2024
☆ TokenRec: Learning to Tokenize ID for LLM-based Generative Recommendation
There is a growing interest in utilizing large-scale language models (LLMs) to advance next-generation Recommender Systems (RecSys), driven by their outstanding language understanding and in-context learning capabilities. In this scenario, tokenizing (i.e., indexing) users and items becomes essential for ensuring a seamless alignment of LLMs with recommendations. While several studies have made progress in representing users and items through textual contents or latent representations, challenges remain in efficiently capturing high-order collaborative knowledge into discrete tokens that are compatible with LLMs. Additionally, the majority of existing tokenization approaches often face difficulties in generalizing effectively to new/unseen users or items that were not in the training corpus. To address these challenges, we propose a novel framework called TokenRec, which introduces not only an effective ID tokenization strategy but also an efficient retrieval paradigm for LLM-based recommendations. Specifically, our tokenization strategy, Masked Vector-Quantized (MQ) Tokenizer, involves quantizing the masked user/item representations learned from collaborative filtering into discrete tokens, thus achieving a smooth incorporation of high-order collaborative knowledge and a generalizable tokenization of users and items for LLM-based RecSys. Meanwhile, our generative retrieval paradigm is designed to efficiently recommend top-$K$ items for users to eliminate the need for the time-consuming auto-regressive decoding and beam search processes used by LLMs, thus significantly reducing inference time. Comprehensive experiments validate the effectiveness of the proposed methods, demonstrating that TokenRec outperforms competitive benchmarks, including both traditional recommender systems and emerging LLM-based recommender systems.
♻ ☆ DEGAP: Dual Event-Guided Adaptive Prefixes for Templated-Based Event Argument Extraction with Slot Querying
Recent advancements in event argument extraction (EAE) involve incorporating useful auxiliary information into models during training and inference, such as retrieved instances and event templates. These methods face two challenges: (1) the retrieval results may be irrelevant and (2) templates are developed independently for each event without considering their possible relationship. In this work, we propose DEGAP to address these challenges through a simple yet effective components: dual prefixes, i.e. learnable prompt vectors, where the instance-oriented prefix and template-oriented prefix are trained to learn information from different event instances and templates. Additionally, we propose an event-guided adaptive gating mechanism, which can adaptively leverage possible connections between different events and thus capture relevant information from the prefix. Finally, these event-guided prefixes provide relevant information as cues to EAE model without retrieval. Extensive experiments demonstrate that our method achieves new state-of-the-art performance on four datasets (ACE05, RAMS, WIKIEVENTS, and MLEE). Further analysis shows the impact of different components.
♻ ☆ DIET: Customized Slimming for Incompatible Networks in Sequential Recommendation KDD 2024
Due to the continuously improving capabilities of mobile edges, recommender systems start to deploy models on edges to alleviate network congestion caused by frequent mobile requests. Several studies have leveraged the proximity of edge-side to real-time data, fine-tuning them to create edge-specific models. Despite their significant progress, these methods require substantial on-edge computational resources and frequent network transfers to keep the model up to date. The former may disrupt other processes on the edge to acquire computational resources, while the latter consumes network bandwidth, leading to a decrease in user satisfaction. In response to these challenges, we propose a customizeD slImming framework for incompatiblE neTworks(DIET). DIET deploys the same generic backbone (potentially incompatible for a specific edge) to all devices. To minimize frequent bandwidth usage and storage consumption in personalization, DIET tailors specific subnets for each edge based on its past interactions, learning to generate slimming subnets(diets) within incompatible networks for efficient transfer. It also takes the inter-layer relationships into account, empirically reducing inference time while obtaining more suitable diets. We further explore the repeated modules within networks and propose a more storage-efficient framework, DIETING, which utilizes a single layer of parameters to represent the entire network, achieving comparably excellent performance. The experiments across four state-of-the-art datasets and two widely used models demonstrate the superior accuracy in recommendation and efficiency in transmission and storage of our framework.
comment: Accepted by KDD 2024
♻ ☆ MARVEL: Unlocking the Multi-Modal Capability of Dense Retrieval via Visual Module Plugin
This paper proposes Multi-modAl Retrieval model via Visual modulE pLugin (MARVEL), which learns an embedding space for queries and multi-modal documents to conduct retrieval. MARVEL encodes queries and multi-modal documents with a unified encoder model, which helps to alleviate the modality gap between images and texts. Specifically, we enable the image understanding ability of the well-trained dense retriever, T5-ANCE, by incorporating the visual module's encoded image features as its inputs. To facilitate the multi-modal retrieval tasks, we build the ClueWeb22-MM dataset based on the ClueWeb22 dataset, which regards anchor texts as queries, and extracts the related text and image documents from anchor-linked web pages. Our experiments show that MARVEL significantly outperforms the state-of-the-art methods on the multi-modal retrieval dataset WebQA and ClueWeb22-MM. MARVEL provides an opportunity to broaden the advantages of text retrieval to the multi-modal scenario. Besides, we also illustrate that the language model has the ability to extract image semantics and partly map the image features to the input word embedding space. All codes are available at https://github.com/OpenMatch/MARVEL.
Multimedia 3
☆ MINT: a Multi-modal Image and Narrative Text Dubbing Dataset for Foley Audio Content Planning and Generation
Foley audio, critical for enhancing the immersive experience in multimedia content, faces significant challenges in the AI-generated content (AIGC) landscape. Despite advancements in AIGC technologies for text and image generation, the foley audio dubbing remains rudimentary due to difficulties in cross-modal scene matching and content correlation. Current text-to-audio technology, which relies on detailed and acoustically relevant textual descriptions, falls short in practical video dubbing applications. Existing datasets like AudioSet, AudioCaps, Clotho, Sound-of-Story, and WavCaps do not fully meet the requirements for real-world foley audio dubbing task. To address this, we introduce the Multi-modal Image and Narrative Text Dubbing Dataset (MINT), designed to enhance mainstream dubbing tasks such as literary story audiobooks dubbing, image/silent video dubbing. Besides, to address the limitations of existing TTA technology in understanding and planning complex prompts, a Foley Audio Content Planning, Generation, and Alignment (CPGA) framework is proposed, which includes a content planning module leveraging large language models for complex multi-modal prompts comprehension. Additionally, the training process is optimized using Proximal Policy Optimization based reinforcement learning, significantly improving the alignment and auditory realism of generated foley audio. Experimental results demonstrate that our approach significantly advances the field of foley audio dubbing, providing robust solutions for the challenges of multi-modal dubbing. Even when utilizing the relatively lightweight GPT-2 model, our framework outperforms open-source multimodal large models such as LLaVA, DeepSeek-VL, and Moondream2. The dataset is available at https://github.com/borisfrb/MINT .
☆ Object-Attribute-Relation Representation based Video Semantic Communication
With the rapid growth of multimedia data volume, there is an increasing need for efficient video transmission in applications such as virtual reality and future video streaming services. Semantic communication is emerging as a vital technique for ensuring efficient and reliable transmission in low-bandwidth, high-noise settings. However, most current approaches focus on joint source-channel coding (JSCC) that depends on end-to-end training. These methods often lack an interpretable semantic representation and struggle with adaptability to various downstream tasks. In this paper, we introduce the use of object-attribute-relation (OAR) as a semantic framework for videos to facilitate low bit-rate coding and enhance the JSCC process for more effective video transmission. We utilize OAR sequences for both low bit-rate representation and generative video reconstruction. Additionally, we incorporate OAR into the image JSCC model to prioritize communication resources for areas more critical to downstream tasks. Our experiments on traffic surveillance video datasets assess the effectiveness of our approach in terms of video transmission performance. The empirical findings demonstrate that our OAR-based video coding method not only outperforms H.265 coding at lower bit-rates but also synergizes with JSCC to deliver robust and efficient video transmission.
♻ ☆ Multimodal Infusion Tuning for Large Models
Recent advancements in large-scale models have showcased remarkable generalization capabilities in various tasks. However, integrating multimodal processing into these models presents a significant challenge, as it often comes with a high computational burden. To address this challenge, we introduce a new parameter-efficient multimodal tuning strategy for large models in this paper, referred to as Multimodal Infusion Tuning (MiT). MiT leverages decoupled self-attention mechanisms within large language models to effectively integrate information from diverse modalities such as images and acoustics. In MiT, we also design a novel adaptive rescaling strategy at the attention head level, which optimizes the representation of infused multimodal features. Notably, all foundation models are kept frozen during the tuning process to reduce the computational burden and only 2.5\% parameters are tunable. We conduct experiments across a range of multimodal tasks, including image-related tasks like referring segmentation and non-image tasks such as sentiment analysis. Our results showcase that MiT achieves state-of-the-art performance in multimodal understanding while significantly reducing computational overhead(10\% of previous methods). Moreover, our tuned model exhibits robust reasoning abilities even in complex scenarios.